Alexa researchers’ AI training technique improves speech recognition up to 15%

Thanks to a newly developed AI training method, Amazon’s Alexa assistant could become noticeably better at recognizing speech in newly introduced domains. In a blog post published on the Alexa Blog this morning, Ankur Gandhe, a speech scientist in the Alexa Speech group, describes an AI system that reduces recognition errors by up to 15 percent, partly by calculating the probability that specific grammar rules will produce a given string of words.

He and his colleagues will present their work in a paper (“Scalable Language Model Adaptation for Spoken Dialogue Systems”) at the IEEE Spoken Language Technology Workshop (SLT) in Athens, Greece, later this month.

As Gandhe explains, natural language processing (NLP) models that adapt to conversational context — i.e., systems that can distinguish between “Red Sox” and “red sauce” and, by extension, recognize that the former refers to a baseball team while the latter calls for a recipe — tend to work better than their generalizable counterparts. But they have to be retrained every time a new feature is introduced, which requires a lot of data, not to mention training time.

AI researchers often perform said training on random samples of sentences generated from templates, but Gandhe and team propose an algorithm that instead analyzes mathematical representations of a grammar’s rules. (In this context, “grammar” refers to the rules governing word or phrase substitutions.) They also lay out a technique for integrating newly generated and trained NLP models with existing systems in a way that doesn’t degrade the performance of either.
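
To make the baseline concrete, here is a minimal Python sketch of template-based sampling; the grammar, symbols, and weights below are invented for illustration and don’t come from the paper.

```python
import random

# A toy template grammar: each nonterminal maps to weighted expansions.
# All symbols, phrases, and weights here are illustrative, not from the paper.
GRAMMAR = {
    "<request>": [(["<verb>", "a", "<dish>", "recipe"], 1.0)],
    "<verb>": [(["find"], 0.6), (["look", "up"], 0.4)],
    "<dish>": [(["red", "sauce"], 0.5), (["chicken", "curry"], 0.5)],
}

def sample(symbol="<request>"):
    """Recursively expand a symbol into a list of words."""
    if symbol not in GRAMMAR:   # terminal: emit the word itself
        return [symbol]
    expansions, weights = zip(*GRAMMAR[symbol])
    chosen = random.choices(expansions, weights=weights, k=1)[0]
    words = []
    for sym in chosen:
        words.extend(sample(sym))
    return words

# Build a small synthetic training corpus by repeated sampling.
corpus = [" ".join(sample()) for _ in range(5)]
print(corpus)  # e.g. ['find a red sauce recipe', 'look up a chicken curry recipe', ...]
```

The weakness of this approach is visible even at toy scale: a finite random sample can easily miss rare expansions, whereas analyzing the grammar’s rules directly accounts for every string the grammar can produce.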


There’s more that goes into generating a language model than you might suspect. It starts with finite-state transducers, or FSTs, which can be represented as graphs: nodes connected by line segments. Each of the graph’s line segments, or edges, carries a probability indicating the likelihood of a particular linguistic substitution, while the nodes mark progress in the production of a text string. To generate a sample sentence, the FST works its way through the graph, building up a string of text one word or phrase at a time.

Gandhe gives an example: “For instance, if a given node of the graph represents the text string ‘I want,’ it might have two edges, one representing the ‘need’/’want’ substitution and the other representing the ‘would like’/’want’ substitution.”
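
A toy version of that walk might look like the following Python sketch; the node names, phrases, and edge probabilities are all invented for illustration.

```python
import random

# A toy weighted graph standing in for an FST: node -> [(next_node, phrase, prob)].
# The structure mirrors Gandhe's "I want" example; the probabilities are made up.
EDGES = {
    "start": [("subject", "I", 1.0)],
    "subject": [("verb", "want", 0.5),
                ("verb", "need", 0.3),         # the "need"/"want" substitution
                ("verb", "would like", 0.2)],  # the "would like"/"want" substitution
    "verb": [("end", "a recipe", 1.0)],
    "end": [],                                 # no outgoing edges: string complete
}

def generate(node="start"):
    """Walk the graph node to node, emitting one word or phrase per edge."""
    words = []
    while EDGES[node]:
        next_nodes, phrases, probs = zip(*EDGES[node])
        i = random.choices(range(len(probs)), weights=probs, k=1)[0]
        words.append(phrases[i])
        node = next_nodes[i]
    return " ".join(words)

print(generate())  # e.g. "I would like a recipe"
```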

The system first identifies every string of text encoded by the FST and every path through the graph that might lead to it, using the probabilities associated with the edges to compute the frequency with which the FST will produce a particular string. To integrate new language models with existing ones, it leverages an AI model that “infer[s] the optimal balance of the probabilities encoded in both.”
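
Both steps can be sketched in Python under simplifying assumptions: the brute-force path enumeration below only works for small acyclic graphs, and the fixed interpolation weight `lam` is a placeholder for the balance the team’s model actually infers.

```python
from collections import defaultdict

# Same toy shape as above: node -> [(next_node, phrase, probability)].
EDGES = {
    "start": [("mid", "I want", 0.7), ("mid", "I need", 0.3)],
    "mid": [("end", "a recipe", 1.0)],
    "end": [],
}

def string_probabilities(edges, start="start"):
    """Sum path probabilities for every string the graph can produce.
    (A string may be reachable along several paths, so we accumulate.)"""
    totals = defaultdict(float)

    def walk(node, words, prob):
        if not edges[node]:            # terminal node: record the finished string
            totals[" ".join(words)] += prob
            return
        for nxt, phrase, p in edges[node]:
            walk(nxt, words + [phrase], prob * p)

    walk(start, [], 1.0)
    return dict(totals)

def interpolate(p_new, p_old, lam=0.5):
    """Blend a grammar-derived model with an existing one. The paper's system
    learns the optimal balance; the fixed lam here is just a stand-in."""
    strings = set(p_new) | set(p_old)
    return {s: lam * p_new.get(s, 0.0) + (1 - lam) * p_old.get(s, 0.0)
            for s in strings}

p_new = string_probabilities(EDGES)                      # {'I want a recipe': 0.7, ...}
p_old = {"I want a recipe": 0.4, "book a flight": 0.6}   # hypothetical existing model
print(interpolate(p_new, p_old))
```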

The team evaluated the system on three NLP scenarios: looking up a stock price, finding and dictating a recipe, and booking airline tickets. The flight-booking test saw the largest improvement: the 15 percent error reduction mentioned above.

The team believes its methods could improve speech recognition in newly introduced Alexa capabilities on day one, before larger datasets become available.

“It makes intuitive sense that, the more complex the grammar, the more training data would be required to produce an accurate language model, and the less reliable the data-sampling approach would be,” Gandhe wrote. “Consequently, we suspect that our method will provide greater gains on more challenging tasks.”