What do the world’s most popular virtual assistants — Google Assistant, Amazon’s Alexa, Microsoft’s Cortana, and Apple’s Siri — have in common? They perform much of their speech recognition in the cloud, where their natural language models take advantage of powerful servers with nearly limitless processing power. That works well for the most part — processing typically happens in milliseconds — but it poses an obvious problem for users who find themselves without an internet connection.
Luckily, the Alexa Machine Learning team at Amazon recently made headway in bringing voice recognition models offline. They’ve developed navigation, temperature control, and music playback algorithms that can run locally, on-device.
The results of their research (“Statistical Model Compression for Small-Footprint Natural Language Understanding”) will be presented at this year’s Interspeech speech technology conference in Hyderabad, India.
It wasn’t easy. As the researchers explained, natural language processing models tend to have significant memory footprints. And the third-party apps that extend Alexa’s functionality — skills — are loaded on demand, only when needed, so bulky models add significant latency to voice recognition.
“Alexa’s natural-language-understanding systems … use several different types of machine-learning (ML) models, but they all share some common traits,” wrote Grant Strimel, a lead author, in a blog post. “One is that they learn to extract ‘features’ — or strings of text with particular predictive value — from input utterances … Another common trait is that each feature has a set of associated ‘weights,’ which determine how large a role it should play in different types of computation. The need to store multiple weights for millions of features is what makes ML models so memory intensive.”
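To make that concrete, here is a minimal sketch of the feature-and-weight layout the quote describes. The feature strings, intent names, and weight values below are invented for illustration and are not taken from Alexa’s models:

```python
# Hypothetical feature -> per-intent weights table. Each extracted text feature
# carries one weight per downstream prediction (here, two made-up intents), so
# millions of features times several weights quickly adds up in memory.
feature_weights = {
    "play":          {"PlayMusicIntent": 2.1,  "SetTemperatureIntent": -0.4},
    "hank williams": {"PlayMusicIntent": 1.7,  "SetTemperatureIntent": -1.2},
    "degrees":       {"PlayMusicIntent": -0.9, "SetTemperatureIntent": 2.4},
}

def score(intent: str, features: list) -> float:
    """Sum the weights of the extracted features for a given intent."""
    return sum(feature_weights.get(f, {}).get(intent, 0.0) for f in features)

print(score("PlayMusicIntent", ["play", "hank williams"]))  # 3.8
```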
Eventually, they settled on a two-part solution: parameter quantization and perfect feature hashing.
Quantization — the process of converting a continuous range of values into a finite range of discrete values — is a conventional technique in algorithmic model compression. Here, the researchers divvied up the weights into 256 intervals, which allowed them to represent every weight in the model with a single byte of data. They rounded low weights to zero so that they could be discarded.
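The paper spells out the details of Amazon’s scheme; as a rough illustration of the idea, the sketch below bins weights into 256 equal-width intervals and stores each one as a single byte, with a small-magnitude cutoff standing in for the zero-rounding step. The interval boundaries and threshold are assumptions for the example, not the team’s actual values:

```python
import numpy as np

def quantize_weights(weights, num_levels=256, zero_threshold=1e-3):
    """Map float weights to one-byte codes plus a shared 256-entry codebook.

    Weights whose magnitude falls below `zero_threshold` are snapped to zero
    (so a real system could prune them outright); the rest are binned into
    `num_levels` equal-width intervals, and each weight is stored as a uint8
    index into a codebook of interval midpoints.
    """
    w = np.asarray(weights, dtype=np.float64)
    w = np.where(np.abs(w) < zero_threshold, 0.0, w)   # round tiny weights to zero

    lo, hi = w.min(), w.max()
    edges = np.linspace(lo, hi, num_levels + 1)        # 256 intervals
    centers = (edges[:-1] + edges[1:]) / 2.0           # codebook values

    codes = np.clip(np.digitize(w, edges) - 1, 0, num_levels - 1).astype(np.uint8)
    return codes, centers

def dequantize(codes, centers):
    """Recover approximate float weights from the byte codes."""
    return centers[codes]

# Example: 4-byte floats become 1-byte codes plus one shared 256-entry codebook.
weights = np.random.randn(1_000_000).astype(np.float32)
codes, centers = quantize_weights(weights)
approx = dequantize(codes, centers)
print(codes.nbytes, weights.nbytes)   # ~1 MB vs ~4 MB
```

The savings come from replacing each multi-byte floating-point weight with a one-byte index into a single shared codebook.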
The researchers’ second technique leveraged hash functions. A hash function, as Strimel wrote, “takes arbitrary inputs and scrambles them up … in such a way that the outputs (1) are of fixed size and (2) bear no predictable relationship to the inputs.” For example, if the output size were 16 bits, with 65,536 possible hash values, a value of 1 might map to “Weezer,” while a value of 50 might correspond to “Elton John.”
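As a sketch of that idea, the snippet below hashes arbitrary feature strings into a fixed 16-bit space of 65,536 buckets, matching the example above. The MD5-based hash is an illustrative stand-in, not the function Alexa actually uses:

```python
import hashlib

NUM_BUCKETS = 2 ** 16  # 65,536 possible hash values, as in the 16-bit example

def feature_hash(feature: str) -> int:
    """Map an arbitrary feature string to a fixed-size bucket index."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % NUM_BUCKETS

for name in ["Weezer", "Elton John", "Hank Williams", "Hank Williams, Jr."]:
    print(f"{name!r} -> bucket {feature_hash(name)}")
```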
The problem with hash functions, though, is that they tend to produce collisions: distinct values (e.g., “Hank Williams, Jr.” and “Hank Williams”) that map to the same location in the list of hashes. The metadata required to distinguish between the colliding values’ weights often takes up more space in memory than the data it’s tagging.
To account for collisions, the team used a technique called perfect hashing, which maps a specific number of data items to the same number of memory slots.
“[T]he system can simply hash a string of characters and pull up the corresponding weights — no metadata required,” Strimel wrote.
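A toy version of that lookup is sketched below using a simplified “hash and displace” style construction — one common way to build a perfect hash, though not necessarily the scheme the Amazon team used. It maps each of n known feature strings to its own slot in an n-entry table, so a weight can be fetched by hashing alone, with no stored strings or other metadata:

```python
import hashlib

def _h(key: str, seed: int, n: int) -> int:
    """Seeded hash of a string into the range [0, n)."""
    data = f"{seed}:{key}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "little") % n

def build_perfect_hash(keys):
    """Build a minimal perfect hash: len(keys) keys -> len(keys) slots, no collisions.

    A first-level hash groups keys into buckets; each bucket then searches for a
    seed that drops all of its keys into still-empty slots of the final table.
    """
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[_h(key, 0, n)].append(key)

    seeds = [0] * n               # per-bucket displacement seeds
    occupied = [False] * n        # which final slots are already taken

    # Place large buckets first; small ones are easier to fit later.
    for b, bucket in sorted(enumerate(buckets), key=lambda x: -len(x[1])):
        if not bucket:
            continue
        seed = 1
        while True:
            slots = [_h(key, seed, n) for key in bucket]
            if len(set(slots)) == len(slots) and not any(occupied[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            occupied[s] = True
    return seeds

def perfect_lookup(key, seeds, n):
    """Hash a key straight to its unique slot -- no stored strings, no metadata."""
    return _h(key, seeds[_h(key, 0, n)], n)

# Example: each known feature gets its own slot in a table of exactly n entries.
features = ["Weezer", "Elton John", "Hank Williams", "Hank Williams, Jr."]
seeds = build_perfect_hash(features)
table = [0.0] * len(features)
for i, f in enumerate(features):
    table[perfect_lookup(f, seeds, len(features))] = 0.5 * i  # placeholder weights
```

Because every known key lands in a distinct slot, the table needs exactly one entry per feature and nothing else.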
In the end, the team said, quantization and perfect feature hashing resulted in a 14-fold reduction in memory usage compared to the online voice recognition models. And impressively, accuracy barely suffered — the offline algorithms performed “almost as well” as the baseline models, with error increases of less than 1 percent.
“We observed the methods sacrifice minimally in terms of model evaluation time and predictive performance for the substantial compression gains observed,” they wrote. “We aim to reduce … memory footprint to enable local voice-assistants and decrease latency of [natural language processing] models in the cloud.”