
Google boosts Cloud Speech API with word-level timestamps and support for 30 new languages


Google has announced a number of notable updates to its Cloud Speech API, a product first unveiled as part of the company’s Cloud Machine Learning platform last year.

The Cloud Speech API, in a nutshell, allows third-party developers and companies to integrate Google’s speech recognition smarts into their own products. For example, contact centers may wish to use the API to automatically route calls to specific departments by “listening” to a caller’s commands. Earlier this year, Twilio tapped the API for its voice platform, enabling its own developer customers to transform speech into text within their products.

Now Google has announced three new updates to the Cloud Speech API. Top of the list, arguably, are word-level time offsets, or timestamps. These are particularly useful for longer audio files, where the user may need to find a specific word in the recording. They essentially map the audio directly onto the text, letting anyone from researchers to reporters pinpoint exactly where a word or phrase was used in, say, an interview. They also enable text to be displayed in real time as the audio plays.
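In Google's client libraries, this is exposed as a flag on the recognition config. The snippet below is a minimal sketch, assuming the google-cloud-speech Python package (exact class paths vary by client library version, and the audio file name is a placeholder), showing per-word start and end offsets coming back alongside the transcript.

```python
# Minimal sketch: request word-level timestamps from the Cloud Speech API.
# Assumes the google-cloud-speech Python package; "interview.wav" is a placeholder.
from google.cloud import speech

client = speech.SpeechClient()

with open("interview.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_word_time_offsets=True,  # request per-word start/end offsets
)

response = client.recognize(config=config, audio=audio)

for result in response.results:
    best = result.alternatives[0]
    print("Transcript:", best.transcript)
    for word in best.words:
        # start_time and end_time are offsets from the beginning of the audio
        print(word.word, word.start_time, word.end_time)
```

Those per-word offsets are what make it practical to jump straight to the point in a recording where a phrase was spoken, or to highlight words as playback progresses.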

“Our number one most requested feature has been providing timestamp information for each word in the transcript,” explained Google product manager Dan Aharon, in a blog post.


Somewhat related to this, Google has also extended long-form audio support from 80 minutes to 180 minutes, and it may support longer files on a "case-by-case" basis upon request, according to Aharon.
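For recordings of that length, the API's asynchronous path is the natural fit. Below is a hedged sketch using the long-running recognize call with audio referenced from Cloud Storage, again assuming the google-cloud-speech Python package; the bucket URI is a placeholder.

```python
# Minimal sketch: transcribe long-form audio asynchronously.
# Assumes the google-cloud-speech Python package; the Cloud Storage URI is a placeholder.
from google.cloud import speech

client = speech.SpeechClient()

# Long recordings are typically referenced from Cloud Storage
# rather than uploaded inline with the request.
audio = speech.RecognitionAudio(uri="gs://example-bucket/long-interview.flac")

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=44100,
    language_code="en-US",
)

# Starts an asynchronous job, then blocks until the transcript is ready.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)

for result in response.results:
    print(result.alternatives[0].transcript)
```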

The final piece of the Cloud Speech API update news today is that Google has expanded support beyond the original 89 languages, adding 30 new tongues for a total of 119. The additions include Swahili and Amharic, which are spoken by millions in Africa, as well as Bengali, which claims more than 200 million native speakers (Bangladesh and India), Urdu (Pakistan and India), Gujarati (India), and Javanese (Indonesia). Combined, the new language support opens Google's speech recognition technology to around one billion people globally.

It’s worth noting here that the language update also impacts Google’s own consumer products, such as the Gboard Android app and Voice Search smarts.

“Our new expanded language support helps Cloud Speech API customers reach more users in more countries for an almost global reach,” continued Aharon. “In addition, it enables users in more countries to use speech to access products and services that up until now have never been available to them.”

Your voice is your password

Global speech and voice recognition is estimated to be a $6.19 billion market in 2017 and is expected to rise to $18.3 billion by 2023, according to a Research and Markets report issued today.

At Google's annual I/O developer conference back in May, CEO Sundar Pichai revealed that the company's speech recognition technology now has a 4.9 percent word error rate, meaning it transcribes roughly one word in 20 incorrectly. That represented a major improvement on the 23 percent error rate the company reported in 2013 and the 8 percent error rate it shared at I/O in 2015.

Much of this improvement is a direct result of Google adding deep learning neural networks to its speech recognition platform back in 2012. This entails training its system using bucketloads of data, such as snippets of existing audio files, and then pushing the system to make inferences when it receives new data.

Google isn't the only major tech company doubling down on its speech recognition efforts. Last year, Microsoft announced that its speech recognition technology was on a par with that of humans. In fact, researchers reported that Microsoft's automated system recorded a lower error rate than professional transcriptionists on the NIST 2000 benchmark.

Earlier this year, Facebook unveiled one of its first speech recognition offerings via its virtual reality (VR) subsidiary Oculus, thus allowing Oculus Rift and Samsung Gear VR users to perform voice searches for games, apps, and more.