Google's speech recognition technology now has a 4.9% word error rate

Google CEO Sundar Pichai today announced that the company’s speech recognition technology has now achieved a 4.9 percent word error rate. Put another way, Google transcribes roughly one in every 20 words incorrectly. That’s a big improvement from the 23 percent the company saw in 2013 and the 8 percent it shared two years ago at I/O 2015.
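
For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the system's transcript into a reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that textbook definition. It is not Google's evaluation code: the function name `word_error_rate` is hypothetical, and the whitespace tokenization is a simplifying assumption, since the article does not describe how Google normalizes or scores its transcripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: (substitutions + deletions + insertions) / reference length.

    Assumption: words are split on whitespace with no further normalization.
    Real evaluations typically also normalize case, punctuation, and numerals.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one row at a time.
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a transcript that gets one word wrong in a 20-word reference scores 0.05, or 5 percent, which is where the "one in 20" reading of a 4.9 percent rate comes from. Because insertions count as errors too, the rate can exceed 100 percent on very poor transcripts.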

The tidbit was revealed at Google’s I/O 2017 developer conference, where artificial intelligence is a major focus. Deep learning, a type of AI, is used to achieve accurate image recognition and speech recognition. The method involves training systems called neural networks on large amounts of data, and then feeding new data to those systems so they can make predictions.

“We’ve been using voice as an input across many of our products,” Pichai said onstage. “That’s because computers are getting much better at understanding speech. We have had significant breakthroughs, but the pace even since last year has been pretty amazing to see. Our word error rate continues to improve even in very noisy environments. This is why if you speak to Google on your phone or Google Home, we can pick up your voice accurately.”

For the sake of comparison, Microsoft declared in October 2016 that it had reached speech recognition parity with humans. Its word error rate at the time was 5.9 percent, though it’s not clear if the two companies are following the same standards of evaluation.

Google has been touting its speech recognition improvements for a while now. Earlier this year, the company said it had slashed its speech recognition word error rate by more than 30 percent since 2012. The main reason for the drastic improvement? Google confirmed that it’s the use of neural networks.

Pichai also shared an interesting tidbit about Home’s development: “When we were shipping Google Home, we were originally planning to include eight microphones… But thanks to neural networks, using a technique called ‘neural beam forming’, we were able to ship it with just two microphones and achieve the same quality.”

So if you’re surprised at how well (or poorly) Google understands what you’re saying, this is why. Recognition is getting better and better, but there’s still room to get that word error rate closer to 0 percent.
