Amazon’s voice-synthesizing AI mimics shifts in tempo, pitch, and volume

Amazon Alexa. Image Credit: Shutterstock

Voice assistants like Alexa convert written words into speech using text-to-speech systems, the most capable of which tap AI to generate speech from scratch rather than stringing together prerecorded snippets of sound. Neural text-to-speech systems, or NTTS, tend to produce more natural-sounding speech than conventional models, but arguably their real value lies in their adaptability: they're able to mimic the prosody of a recording, or its shifts in tempo, pitch, and volume.

In a paper (“Fine-Grained Robust Prosody Transfer for Single-Speaker Neural Text-to-Speech”) presented at this year’s Interspeech conference in Graz, Austria, Amazon scientists investigated prosody transfer with a system that let them change the voice in a recording while preserving its original inflections. They say it significantly improved on past attempts, which generally haven’t adapted well to input voices they haven’t encountered before.

To this end, the team’s system leveraged prosodic features that are easier to normalize than the raw spectrograms (representations of changes in signal frequency over time) typically ingested by neural text-to-speech networks. It aligned speech signals with text at the level of phonemes, the smallest units of speech, and extracted features such as changes in pitch or volume for each phoneme from the spectrograms.
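To make the idea concrete, here is a minimal sketch (not Amazon’s code) of computing per-phoneme prosodic features such as average pitch and energy, assuming phoneme boundaries supplied by a hypothetical forced aligner and using the open-source librosa library; the file path, frequency range, and feature choices are illustrative.

```python
# Illustrative sketch: per-phoneme prosodic features from an audio file.
# `alignments` is assumed to come from a forced aligner as a list of
# (phoneme, start_seconds, end_seconds) tuples.
import librosa
import numpy as np

def phoneme_prosody(wav_path, alignments, sr=22050):
    y, sr = librosa.load(wav_path, sr=sr)

    # Frame-level pitch (F0) via probabilistic YIN, and energy via RMS.
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    f0_times = librosa.times_like(f0, sr=sr)
    rms_times = librosa.times_like(rms, sr=sr)

    features = []
    for phoneme, start, end in alignments:
        f0_seg = f0[(f0_times >= start) & (f0_times < end)]
        f0_voiced = f0_seg[~np.isnan(f0_seg)]  # pyin marks unvoiced frames as NaN
        rms_seg = rms[(rms_times >= start) & (rms_times < end)]
        features.append({
            "phoneme": phoneme,
            "mean_f0_hz": float(f0_voiced.mean()) if f0_voiced.size else None,
            "mean_energy": float(rms_seg.mean()) if rms_seg.size else None,
        })
    return features
```

Summary features like these are easier to normalize across recordings than raw spectrogram frames, which is the property the paper exploits.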

Here’s one sample:


Here’s the sample, transferred:

And here’s the sample synthesized:

The technique worked as well with unreliable text as it did with clean transcripts, the team claims, because it incorporated an automatic speech recognizer that attempted to guess the phoneme sequences corresponding to a given input signal. The recognizer represented these guesses as probability distributions and winnowed the candidates using word-sequence frequency information.
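As a rough illustration of that winnowing step, here is a toy beam search in Python that combines per-frame phoneme probabilities from a recognizer with a sequence prior; the posteriors, the bigram prior, and the penalty value are made up for the example and are not the paper’s recognizer.

```python
# Toy illustration: keep phoneme guesses as probability distributions and
# prune candidate sequences using a (here, bigram) sequence prior.
import numpy as np

def beam_search(log_posteriors, bigram_logprob, beam_width=3):
    """log_posteriors: (frames, phonemes) array of per-frame log-probabilities.
    bigram_logprob: dict mapping (prev_phoneme, phoneme) -> log prior."""
    beams = [([], 0.0)]  # (phoneme sequence, cumulative log score)
    for frame in log_posteriors:
        candidates = []
        for seq, score in beams:
            for ph, acoustic_lp in enumerate(frame):
                prev = seq[-1] if seq else None
                prior = bigram_logprob.get((prev, ph), -5.0)  # unseen pairs get a flat penalty
                candidates.append((seq + [ph], score + acoustic_lp + prior))
        # Keep only the highest-scoring hypotheses: the winnowing step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Toy usage: 4 frames over 3 phoneme classes.
rng = np.random.default_rng(0)
log_posteriors = np.log(rng.dirichlet(np.ones(3), size=4))
print(beam_search(log_posteriors, {(None, 0): -0.5, (0, 1): -0.5}))
```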

The system took the speech recognizer’s low-level phoneme-sequence probabilities as inputs, allowing it to learn general correlations between phonemes and prosodic features instead of forcing the acoustic data to align with potentially inaccurate transcriptions. The result? In experiments, the team says, the difference between its output and that of a system trained on reliable transcripts was “statistically insignificant.”
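One way to picture feeding soft phoneme probabilities, rather than a hard transcript, into a model is the hypothetical PyTorch sketch below, which projects per-frame posterior distributions and predicts frame-level prosodic features; the dimensions and architecture are invented for illustration and are not the network described in the paper.

```python
import torch
import torch.nn as nn

class SoftPhonemeProsodyEncoder(nn.Module):
    def __init__(self, n_phonemes=50, d_model=128, n_prosody_feats=3):
        super().__init__()
        # Project the full per-frame posterior distribution instead of a
        # one-hot phoneme ID, so the recognizer's uncertainty is preserved.
        self.phoneme_proj = nn.Linear(n_phonemes, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        # Predict a few prosodic features per frame (e.g. pitch, energy, duration).
        self.prosody_head = nn.Linear(2 * d_model, n_prosody_feats)

    def forward(self, posteriors):  # posteriors: (batch, frames, n_phonemes)
        x = self.phoneme_proj(posteriors)
        x, _ = self.rnn(x)
        return self.prosody_head(x)  # (batch, frames, n_prosody_feats)

# Toy usage: a batch of 2 utterances, 100 frames, 50 phoneme classes.
posteriors = torch.softmax(torch.randn(2, 100, 50), dim=-1)
print(SoftPhonemeProsodyEncoder()(posteriors).shape)  # torch.Size([2, 100, 3])
```

Because the input is a full distribution, recognizer uncertainty is carried through instead of being collapsed into a single, possibly wrong phoneme label.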

In a separate but related study (“Towards Achieving Robust Universal Neural Vocoding”), the same research team sought to train a vocoder (a synthesizer that produces sounds from an analysis of speech input) to attain state-of-the-art quality on voices it hadn’t previously encountered. They say that, trained on a data set containing 2,000 utterances from 74 speakers in 17 languages, it outperformed speaker-specific vocoders in a range of conditions (e.g., whispered or sung speech, or speech with heavy background noise), even when it hadn’t seen data from a particular speaker, topic, or language before.
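For readers unfamiliar with vocoders, the sketch below shows the basic shape of the problem: a network maps mel-spectrogram frames to waveform samples, and a “universal” vocoder is one trained on enough speakers and languages to generalize to unseen voices. The simple recurrent stand-in here is purely illustrative and is not the architecture from the paper.

```python
import torch
import torch.nn as nn

class UniversalVocoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, samples_per_frame=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        # Predict a block of waveform samples for every spectrogram frame.
        self.to_samples = nn.Linear(hidden, samples_per_frame)

    def forward(self, mels):  # mels: (batch, frames, n_mels)
        h, _ = self.rnn(mels)
        samples = torch.tanh(self.to_samples(h))  # (batch, frames, samples_per_frame)
        return samples.reshape(mels.size(0), -1)  # (batch, frames * samples_per_frame)

# Toy usage: one utterance of 40 mel frames -> 40 * 256 waveform samples.
waveform = UniversalVocoder()(torch.randn(1, 40, 80))
print(waveform.shape)  # torch.Size([1, 10240])
```

Training such a model on many speakers and languages at once, as the article describes, is what lets it cope with voices and conditions it has never seen.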