Amazon AI techniques improve speech recognition and dialog state tracking

In a pair of preprint papers published this week on arXiv.org, Amazon researchers describe two novel systems, one that handles automatic speech recognition and another responsible for dialogue state tracking, that they say achieve state-of-the-art performance compared with several baseline models. Assuming the claims have merit, their work could enhance the accuracy of AI agents in enterprise and consumer domains while reducing the amount of data required to train them.

Machine reading comprehension

Building an AI assistant like Alexa that can understand requests and complete tasks necessitates a system that tracks the state of dialogues in back-and-forth conversations — a dialogue state tracking system. The state is typically defined as a pair of variables — a “slot” and a slot value — that define how speech and entity data is recognized and handled. For example, a third-party Alexa app might use an “actor” slot type to query filmographies with the names of actors and actresses supplied by a user.
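
To make the slot-and-value idea concrete, here is a minimal sketch (not Amazon's code) of a dialogue state represented as slot-value pairs, using the hypothetical "actor" slot from the example above:

```python
# Minimal illustration: a dialogue state is just a mapping from slots to values.
from typing import Dict, Optional

DialogueState = Dict[str, Optional[str]]  # slot name -> slot value (None if unfilled)

state: DialogueState = {"actor": None, "genre": None}

def update_state(state: DialogueState, slot: str, value: str) -> DialogueState:
    """Record the value a tracker extracted from the latest user turn."""
    new_state = dict(state)  # copy so each turn's state can be inspected later
    new_state[slot] = value
    return new_state

# User: "Show me movies with Tom Hanks."
state = update_state(state, "actor", "Tom Hanks")
print(state)  # {'actor': 'Tom Hanks', 'genre': None}
```

A dialogue state tracker's job is to keep this mapping accurate as the conversation unfolds, turn after turn.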

Few large training data sets are available for dialogue state tracking systems. To remedy this problem, a team of Alexa researchers turned to machine reading comprehension, a field of study that aims to evaluate how well AI can understand human language. While dialogue state trackers focus on contextual understanding of requests and of the state of the conversation, reading comprehension is concerned with general understanding of a text regardless of its format. Formulating dialogue state tracking tasks as reading comprehension tasks, then, benefits the trackers by letting them draw on the abundant reading comprehension data that is available.

The team designed a question for each slot in the dialogue state and then divided the slots into two types, categorical and extractive, based on the number of slot values in the ontology. (Categorical slots took one of several values, while extractive slots accepted an unlimited number of possible values.) Next, they devised two machine reading comprehension models for dialogue state tracking: one that used multiple-choice reading comprehension, where the answer has to be chosen from a limited set of options (for categorical slots), and a second that used span-based reading comprehension, where the answer is found as a span of text in the conversation (for extractive slots).
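
Concretely, the two formulations might look something like the sketch below; the trivial keyword heuristics stand in for trained multiple-choice and span-based reading comprehension models, the slot questions are omitted for brevity, and the example values are invented:

```python
from typing import List, Tuple

def score_option(dialogue: str, option: str) -> int:
    # Placeholder for a multiple-choice reading comprehension model:
    # here we simply count how often the candidate value is mentioned.
    return dialogue.lower().count(option.lower())

def find_span(dialogue: str, keyword: str) -> Tuple[int, int]:
    # Placeholder for a span-based reading comprehension model: a real model
    # predicts start/end positions from the slot question, with no keyword hint.
    start = dialogue.lower().find(keyword.lower())
    return start, start + len(keyword)

def track_categorical_slot(dialogue: str, options: List[str]) -> str:
    """Categorical slot: choose one value from the ontology's fixed list."""
    scores = [score_option(dialogue, o) for o in options]
    return options[scores.index(max(scores))]

def track_extractive_slot(dialogue: str, keyword: str) -> str:
    """Extractive slot: the value is a span copied out of the conversation."""
    start, end = find_span(dialogue, keyword)
    return dialogue[start:end]

dialogue = "User: I need a cheap hotel near the Orchard Museum."
print(track_categorical_slot(dialogue, ["cheap", "moderate", "expensive"]))  # cheap
print(track_extractive_slot(dialogue, "Orchard Museum"))  # Orchard Museum
```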


To evaluate their approach, the team first turned to MultiWoz, an open source dialogue corpus containing 10,000 dialogues with annotated states across 7 distinct domains. They used 5 domains in total ("attraction," "restaurant," "taxi," "train," and "hotel"), with a total of 30 domain-slot pairs, and supplemented MultiWoz with the multiple-choice question-answering data sets RACE and DREAM and with data from the Machine Reading for Question Answering (MRQA) challenge. The categorical dialogue state tracking model was pretrained on DREAM and RACE, the extractive model on the broader MRQA data, and both were fine-tuned on MultiWoz.

The researchers report that in a few-shot setting where they pretrained the models and selected a limited amount of fine-tuning data, they achieved 45.91% joint goal accuracy with around 1% (20-30 dialogues) of “hotel” domain data compared with the previous best result of 19.73%. Perhaps more impressively, even without any state tracking data (i.e., a zero-shot scenario), the models managed greater than 90% average slot accuracy in 12 out of 30 slots in MultiWoz.

Automatic speech recognition

Speech recognition models typically need vast amounts of transcribed audio data to attain good performance. Fortunately, the emergence of semi-supervised learning methods has alleviated this: a smaller labeled set is used to train an initial seed model, which is then applied to a larger amount of unlabeled data to generate hypotheses (machine-generated transcriptions). The unlabeled examples with the most reliable hypotheses are then added to the training data for retraining.
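
That recipe, often called self-training or pseudo-labeling, is straightforward to sketch. The loop below is a generic illustration, not Amazon's pipeline; `train` and `predict_with_confidence` are toy stand-ins for a real acoustic model and decoder:

```python
from typing import Dict, List, Tuple

def train(labeled: List[Tuple[str, str]]) -> Dict[str, str]:
    # Toy "model": memorize utterance -> transcript. A real system would fit
    # an acoustic model on the audio features instead.
    return {audio: transcript for audio, transcript in labeled}

def predict_with_confidence(model: Dict[str, str], audio: str) -> Tuple[str, float]:
    # Toy decoder: return a hypothesis and a confidence score.
    return (model[audio], 1.0) if audio in model else ("", 0.0)

def self_train(labeled, unlabeled, threshold=0.9, rounds=3):
    """Grow the training set with reliable machine-generated transcriptions."""
    for _ in range(rounds):
        model = train(labeled)
        still_unlabeled = []
        for audio in unlabeled:
            hypothesis, confidence = predict_with_confidence(model, audio)
            if confidence >= threshold:
                labeled.append((audio, hypothesis))  # keep only reliable hypotheses
            else:
                still_unlabeled.append(audio)
        unlabeled = still_unlabeled
    return train(labeled)

model = self_train([("utt1.wav", "turn on the lights")], ["utt2.wav", "utt3.wav"])
```

The key design choice is the confidence threshold: set it too low and transcription errors pollute the training data, set it too high and little of the unlabeled audio ever gets used.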

An Amazon team took this one step further with a framework they call deep contextualized acoustic representations (numerical sequences), which learns efficient, context-aware acoustic representations using a large amount of unlabeled data and applies those representations to speech recognition tasks with a limited amount of labeled data. A family of AI models learns the representations using both past and future information, predicting slices of acoustic feature representations during active speech processing.
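
The paper's exact architecture and training objective are more involved, but the underlying idea, learning frame-level representations that must predict unseen neighboring frames from both past and future context, can be sketched roughly as follows (a toy PyTorch example with assumed dimensions and a simple mean-squared-error loss, not the authors' released code):

```python
import torch
import torch.nn as nn

class ContextualAcousticEncoder(nn.Module):
    """Toy encoder: learns context-aware frame representations from unlabeled audio."""

    def __init__(self, feat_dim: int = 40, hidden: int = 256, slice_len: int = 4):
        super().__init__()
        self.slice_len = slice_len
        # A bidirectional LSTM gives each frame a summary of its past (forward
        # direction) and its future (backward direction).
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # From past + future context, predict a slice of `slice_len` unseen frames.
        self.predictor = nn.Linear(2 * hidden, slice_len * feat_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) acoustic features, e.g. filterbanks
        context, _ = self.lstm(feats)  # (batch, time, 2 * hidden)
        return context

    def reconstruction_loss(self, feats: torch.Tensor) -> torch.Tensor:
        batch, time, dim = feats.shape
        k = self.slice_len
        context = self.forward(feats)
        half = context.shape[-1] // 2
        fwd, bwd = context[..., :half], context[..., half:]
        # Pair the forward state at frame t with the backward state at frame
        # t + k + 1, so the slice of frames t+1 ... t+k in between is unseen
        # by either direction and must be inferred from context.
        combined = torch.cat([fwd[:, : time - k - 1], bwd[:, k + 1 :]], dim=-1)
        pred = self.predictor(combined).view(batch, time - k - 1, k, dim)
        target = torch.stack(
            [feats[:, t + 1 : t + 1 + k] for t in range(time - k - 1)], dim=1
        )
        return nn.functional.mse_loss(pred, target)

# Random tensors stand in for unlabeled speech features; after pretraining on large
# amounts of unlabeled audio, the frozen `forward` outputs would feed a recognizer
# trained on a much smaller labeled set.
model = ContextualAcousticEncoder()
loss = model.reconstruction_loss(torch.randn(2, 50, 40))
print(loss.item())
```

In the downstream recognition task, those contextual vectors stand in for (or augment) the raw acoustic features, which is how the approach stretches a limited amount of transcribed audio.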

In a series of experiments, the researchers conducted tests on the open source LibriSpeech data sets and a popular Wall Street Journal corpus. They used 81 hours of labeled speech data from the Wall Street Journal set to train their models, and between 100 and 960 hours from LibriSpeech.

In a semi-supervised setting on a test set sourced from the Wall Street Journal corpus, the representations achieved a relative improvement of up to 42% compared with a baseline approach, according to the researchers. And on a LibriSpeech test set, the models outperformed all baselines, even with only 100 hours of transcribed audio.

“End-to-end [automatic speech recognition] models are more demanding in the amount of training data required when compared to traditional hybrid models,” wrote the coauthors, who plan to release the pretrained models and code online. “Our approach can drastically reduce the amount of labeled data required.”