Google’s Translatotron is an end-to-end model that mimics human voices

Google AI logo on screen at Google Event Center in Sunnyvale, California
Image Credit: Khari Johnson / VentureBeat

Google AI today shared details about Translatotron, an experimental AI system capable of directly translating a person’s speech into another language, an approach that lets the synthesized translation retain the sound of the original speaker’s voice.

Traditionally, speech translation chains three systems: automatic speech recognition converts speech to text, machine translation translates that text, and text-to-speech synthesizes the translated audio. Translatotron, by contrast, is a single end-to-end translation model. Researchers said Translatotron can complete translations faster and with fewer complications than traditional cascaded models.
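The difference between the two approaches can be sketched as follows. This is a toy illustration with hypothetical stub functions (`asr`, `mt`, `tts` are stand-ins, not Google's actual APIs): a cascaded pipeline chains three separate models, so errors at any stage propagate downstream, while an end-to-end model maps source speech directly to translated speech.

```python
def asr(audio: str) -> str:
    """Stub automatic speech recognition: audio -> source-language text."""
    return {"[es-audio]": "hola mundo"}[audio]

def mt(text: str) -> str:
    """Stub machine translation: source text -> target-language text."""
    return {"hola mundo": "hello world"}[text]

def tts(text: str) -> str:
    """Stub text-to-speech: target text -> synthesized audio."""
    return f"[en-audio:{text}]"

def cascaded_translate(audio: str) -> str:
    # Three stages in sequence; a mistake in any stage reaches the output.
    return tts(mt(asr(audio)))

def end_to_end_translate(audio: str) -> str:
    # One model, speech in -> speech out (a stand-in mapping here).
    return "[en-audio:hello world]"

print(cascaded_translate("[es-audio]"))  # [en-audio:hello world]
```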

“To the best of our knowledge, Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language. It is also able to retain the source speaker’s voice in the translated speech,” a blog post on the subject reads.

Measured by BLEU score, a standard metric of machine translation quality, the experimental Translatotron performed below conventional cascade systems, but it achieved more accurate translations than a baseline cascade model.
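For readers unfamiliar with the metric, BLEU scores a candidate translation against a reference by combining clipped n-gram precision with a brevity penalty. The following is a minimal sketch of the idea (single reference, small n-gram order for brevity; production systems use standardized implementations such as sacreBLEU):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
good = "the cat sat on the mat".split()
bad = "cat the mat on".split()
print(round(bleu(good, ref), 3))        # 1.0
print(bleu(bad, ref) < bleu(good, ref)) # True
```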


End-to-end models for machine translation first emerged with a paper by French researchers accepted at NeurIPS in 2016.

To make Translatotron capable of carrying out end-to-end translations, researchers used a sequence-to-sequence model and spectrograms as input training data. A speaker encoder network is used to capture the character of the speaker’s voice, and multitask learning is used to predict words used by source and target speakers.
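The shape of that architecture can be sketched at a high level. The dimensions and the mean-pooling stand-ins below are illustrative assumptions, not the paper's actual network sizes or layers; the point is the data flow: a spectrogram goes in, a speaker embedding conditions the decoder, and a target-language spectrogram comes out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only).
T_IN, N_MELS = 120, 80   # input spectrogram: 120 frames x 80 mel bins
D_ENC = 64               # encoder state size
D_SPK = 16               # speaker embedding size
T_OUT = 100              # output spectrogram frames

def encode(spectrogram):
    """Stand-in encoder: project each input frame to a hidden state."""
    W = rng.normal(size=(N_MELS, D_ENC))
    return np.tanh(spectrogram @ W)                       # (T_IN, D_ENC)

def speaker_encoder(reference_spec):
    """Stand-in speaker encoder: mean-pool frames, project to an embedding."""
    W = rng.normal(size=(N_MELS, D_SPK))
    return np.tanh(reference_spec.mean(axis=0) @ W)       # (D_SPK,)

def decode(enc_states, spk_embedding):
    """Stand-in decoder: condition on speaker embedding, emit a spectrogram."""
    context = enc_states.mean(axis=0)                     # (D_ENC,)
    cond = np.concatenate([context, spk_embedding])       # (D_ENC + D_SPK,)
    W = rng.normal(size=(D_ENC + D_SPK, T_OUT * N_MELS))
    return (cond @ W).reshape(T_OUT, N_MELS)              # output spectrogram

src = rng.normal(size=(T_IN, N_MELS))
out = decode(encode(src), speaker_encoder(src))
print(out.shape)  # (100, 80)
```

Conditioning the decoder on a speaker embedding, rather than discarding the voice in a text bottleneck, is what lets the synthesized translation keep the character of the source speaker's voice.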

Translatotron is spelled out in more detail in a paper published today titled “Direct speech-to-speech translation with a sequence-to-sequence model.”

Translatotron’s release comes a month after Google introduced SpecAugment, an AI model that applies computer vision and a variety of other techniques to understand words from spectrogram imagery.

Translatotron could be applied to features like Google Assistant’s Interpreter Mode, which made its debut for Home speakers in January. Interpreter Mode is capable of listening and providing speech-to-speech translation in 27 languages. Companies like Google and Microsoft are also using their language translation chops as a way to win over iOS users.

Translatotron is the latest advance in machine translation and language processing from Google.

Last week at Google’s I/O developer conference, Google shared that it had shrunk the recurrent neural networks and language understanding models behind on-device machine learning on smartphones, making Google Assistant up to 10 times faster. Google also introduced translation with Lens, so your camera can translate more than 100 languages.