Personal assistants like Apple’s Siri accomplish tasks through natural language commands. However, their underlying components often rely on supervised machine learning algorithms requiring large amounts of hand-annotated training data. In an attempt to reduce the time and effort taken to collect this data, researchers at Apple developed a framework that leverages user engagement signals to automatically create data-augmenting labels. They report that when incorporated using strategies like multi-task learning and validation with an external knowledge base, the annotated data significantly improves accuracy in a production deep learning system.
“We believe this is the first use of user engagement signals to help generate training data for a sequence labeling task on a large scale, and can be applied in practical settings to speed up new feature deployment when little human-annotated data is available,” wrote the researchers in a preprint paper. “Moreover … user engagement signals can help us to identify where the digital assistant needs improvement by learning from its own mistakes.”
The researchers used a range of heuristics to identify behaviors indicating either positive or negative engagement. A few included tapping on content to engage with it further (a positive response), listening to a song for a long duration (another positive response), or interrupting content provided by an intelligent assistant and manually selecting different content (a negative response). Those signals were selectively harvested in a “privacy-preserving manner” to automatically produce ground truth annotations, and they were subsequently combined with coarse-grained labels provided by human annotators.
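The paper does not detail the exact heuristics, but the weak-labeling step can be pictured as mapping logged engagement events onto training labels. The sketch below is a rough illustration only; the signal names, threshold, and label scheme are hypothetical, not Apple's actual pipeline:

```python
# Hypothetical sketch of heuristic weak labeling from engagement signals.
# Signal names, the 30-second threshold, and the label format are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EngagementEvent:
    query: str                 # e.g. "Play something by the Beatles"
    predicted_entities: dict   # e.g. {"something": "MusicTitle", "the Beatles": "MusicArtist"}
    listen_seconds: float      # how long the user listened to the returned content
    tapped_content: bool       # user tapped to engage further (positive signal)
    switched_content: bool     # user interrupted and picked other content (negative signal)

def weak_label(event: EngagementEvent) -> Optional[dict]:
    """Turn an engagement event into a weakly supervised training example.

    Positive engagement (a long listen or a tap) is treated as implicit
    confirmation of the predicted fine-grained entity labels; negative
    engagement (switching content) discards the prediction instead.
    """
    positive = event.tapped_content or event.listen_seconds > 30.0
    negative = event.switched_content
    if positive and not negative:
        # Reuse the system's own predictions as fine-grained ground truth.
        return {"query": event.query, "labels": event.predicted_entities}
    # Negative or ambiguous engagement: do not emit a training example.
    return None
```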
In order to incorporate the coarse-grained labels and the inferred fine-grained labels into an AI model, the paper’s coauthors devised a multi-task learning framework that treats coarse-grained and fine-grained entity labeling as two tasks. Additionally, they incorporated an external knowledge base validator consisting of entities and their relations. Given the prediction “something” as a music title and “the Beatles” as a music artist for the query “Play something by the Beatles,” the validator would perform a lookup for the top label alternatives and send them to a component that would re-rank the predictions and return the best alternative.
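The paper's description of the validator is high level; a minimal sketch of the lookup-and-re-rank idea, using an invented in-memory knowledge base and scoring, might look like this:

```python
# Minimal sketch of knowledge-base validation and re-ranking.
# The toy knowledge base and the hypothesis format are assumptions for illustration.
from typing import List, Tuple

# Toy knowledge base of known (artist, song_title) pairs.
KNOWLEDGE_BASE = {
    ("the beatles", "something"),
    ("the beatles", "yesterday"),
    ("metallica", "one"),
}

def validate(artist: str, title: str) -> bool:
    """Check whether a predicted (artist, title) pair exists in the knowledge base."""
    return (artist.lower(), title.lower()) in KNOWLEDGE_BASE

def rerank(hypotheses: List[Tuple[float, str, str]]) -> Tuple[float, str, str]:
    """Given model hypotheses as (score, artist, title), return the highest-scoring
    hypothesis that passes validation; fall back to the top hypothesis otherwise."""
    for hyp in sorted(hypotheses, key=lambda h: h[0], reverse=True):
        _, artist, title = hyp
        if validate(artist, title):
            return hyp
    return max(hypotheses, key=lambda h: h[0])

# Example for the query "Play something by the Beatles":
hyps = [
    (0.62, "the Beatles", "some"),       # rejected: no such title in the KB
    (0.58, "the Beatles", "Something"),  # valid pair, returned after re-ranking
]
print(rerank(hyps))
```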
The researchers compiled two separate test sets, one for each task performed by the multi-task model, by randomly sampling from the production system and hand-annotating the samples with ground truth labels. They say that across 21 model runs, adding 260,000 training examples “consistently” reduced the coarse-grained entity error rate on a prediction task compared with the baseline for all amounts of human-annotated data. Moreover, they report that adding weakly supervised fine-grained data had a larger impact when there was a relatively small amount of human-annotated data (5,000 examples). Lastly, they report that for examples where any of the top model hypotheses passed the knowledge base validator, the fine-grained entity error rate dropped by around 50%.
In another experiment, the team sought to determine whether more granular representations of the user’s intent would increase the likelihood of the system selecting the correct action. They sampled roughly 5,000 “play music” commands containing references to multiple bands, artists, and songs and sent them through a system incorporating their framework, after which they asked annotators to grade the response returned by the system as “satisfactory” or “unsatisfactory.” The results produced by the enhanced system achieved a relative task error rate reduction of 24.64%, the researchers report.
They leave to future work the use of individual users’ engagement behaviors to improve personalization.
“We observe that our model improves user-facing results especially for requests that contain difficult or unusual language patterns,” wrote the coauthors. “For example, the enhanced system correctly handles queries such as ‘Can you play Malibu from Miley Cyrus new album’ and ‘Play Humble through my music Kendrick Lamar.’ Also, the enhanced model identifies entities that users are more likely to refer to in cases of genuine linguistic ambiguity. For example, in ‘Play one by Metallica,’ ‘one’ could either be a non-entity token (meaning play any song by Metallica), or it could refer specifically to the song called ‘One’ by ‘Metallica.’ Since most users listen to the song ‘One’ by ‘Metallica’ whenever they say ‘Play one by Metallica,’ our model trained on engagement-annotated data will learn to predict ‘one’ as [the music title], thus better capturing trends and preferences in our user population.”
The work comes on the heels of a paper describing Apple’s Overton, an AI development tool whose models have processed “billions” of queries. Separately, the Cupertino company recently studied whether users preferred conversing with “chattier” AI assistants.