Diarization, the process of partitioning a speech sample into distinct, homogeneous segments according to who spoke when, doesn't come as easily to machines as it does to humans, and training a machine learning algorithm to perform it is tougher than it sounds. A robust diarization system must be able to associate speech segments with speakers it hasn't previously encountered.
But Google's AI research division has made promising progress toward a performant model. In a new paper ("Fully Supervised Speaker Diarization") and accompanying blog post, researchers describe a new artificially intelligent (AI) system that "makes use of supervised speaker labels in a more effective manner."
The core algorithms, which the paper's authors claim achieve an online diarization error rate (DER) low enough for real-time applications (7.6 percent on the NIST SRE 2000 CALLHOME benchmark, versus 8.8 percent DER from Google's previous method), are available as open source on GitHub.
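The open-sourced code lives in Google's uis-rnn repository on GitHub, and the sketch below shows roughly how it is meant to be driven, based on the interface documented in that repository's README. The file names, array shapes, and label strings here are illustrative stand-ins, not values from the paper.

```python
# Hypothetical usage sketch of the open-sourced uis-rnn library (github.com/google/uis-rnn),
# following its documented fit/predict interface; file names and shapes are illustrative.
import numpy as np
import uisrnn

# parse_arguments() returns namespaces of model, training, and inference arguments.
model_args, training_args, inference_args = uisrnn.parse_arguments()
model = uisrnn.UISRNN(model_args)

# train_sequence: concatenated speaker embeddings, shape (num_segments, embedding_dim).
# train_cluster_id: per-segment speaker labels aligned with train_sequence.
train_sequence = np.load('train_sequence.npy')
train_cluster_id = np.load('train_cluster_id.npy', allow_pickle=True)
model.fit(train_sequence, train_cluster_id, training_args)

# At test time the model assigns a speaker index to every embedding in the stream.
test_sequence = np.load('test_sequence.npy')
predicted_labels = model.predict(test_sequence, inference_args)
print(predicted_labels)  # e.g. [0, 0, 1, 1, 0, 2, ...]
```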

[Figure: Speaker diarization on streaming audio, with different colors in the bottom axis indicating different speakers.]
The Google researchers' new approach models speaker embeddings (i.e., vector representations of a speaker's voice characteristics) with a recurrent neural network (RNN), a type of machine learning model that uses its internal state to process sequences of inputs. Each speaker gets its own RNN instance, whose state is updated as new embeddings arrive; because all instances share the same network parameters, the system can learn high-level knowledge that is shared across speakers and utterances.
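To make the per-speaker RNN idea concrete, here is a minimal conceptual sketch, not Google's implementation: a single shared GRU cell supplies the parameters, each speaker detected so far keeps its own hidden state, and each incoming embedding is either assigned to the existing speaker whose predicted next embedding it matches best or used to start a new speaker. The cosine-similarity scoring and fixed threshold below are illustrative stand-ins for the paper's probabilistic assignment (which uses a distance-dependent Chinese restaurant process prior).

```python
# Conceptual sketch of per-speaker RNN state tracking (not Google's code).
# One GRUCell's parameters are shared by all speakers; each speaker keeps its own hidden state.
import torch

EMB_DIM, HID_DIM = 256, 512                  # illustrative sizes
shared_rnn = torch.nn.GRUCell(EMB_DIM, HID_DIM)
to_emb = torch.nn.Linear(HID_DIM, EMB_DIM)   # predicts a speaker's next embedding from its state
NEW_SPEAKER_THRESHOLD = 0.5                  # illustrative; the paper uses a ddCRP prior instead

speaker_states = []                          # one hidden state per speaker discovered so far

def assign(embedding):
    """Greedily assign an embedding to an existing speaker or start a new one."""
    best_id, best_score = None, -1.0
    for spk_id, state in enumerate(speaker_states):
        predicted = to_emb(state)            # what this speaker's RNN expects to see next
        score = torch.cosine_similarity(predicted, embedding, dim=0).item()
        if score > best_score:
            best_id, best_score = spk_id, score
    if best_id is None or best_score < NEW_SPEAKER_THRESHOLD:
        # Start a new speaker: a fresh hidden state driven by the same shared parameters.
        speaker_states.append(torch.zeros(HID_DIM))
        best_id = len(speaker_states) - 1
    # Update only the chosen speaker's state with the new embedding.
    speaker_states[best_id] = shared_rnn(embedding.unsqueeze(0),
                                         speaker_states[best_id].unsqueeze(0)).squeeze(0)
    return best_id

# Streaming usage: each embedding receives a speaker index as it arrives.
stream = [torch.randn(EMB_DIM) for _ in range(5)]   # stand-in for real speaker embeddings
labels = [assign(e) for e in stream]
print(labels)
```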
“Since all components of this system can be learned in a supervised manner, it is preferred over unsupervised systems in scenarios where training data with high quality time-stamped speaker labels are available,” the researchers wrote in the paper. “Our system is fully supervised and is able to learn from examples where time-stamped speaker labels are annotated.”
In future work, the team plans to refine the model so that it can integrate contextual information to perform offline decoding, which they expect will further reduce DER. They also hope to model acoustic features directly, so that the entire speaker diarization system can be trained end-to-end.