Despite decades of research, an artificial intelligence (AI) platform capable of generating highly realistic speech remains elusive. There has been progress, however. In September 2016, London-based Google subsidiary DeepMind produced WaveNet, a deep neural network (layers of mathematical functions that loosely mimic the physiology of the human brain) that could sample human speech and directly model waveforms. Tests with U.S. English and Mandarin showed that it could outperform then state-of-the-art text-to-speech (TTS) systems, including Google’s own. Better yet, it took just two seconds to generate a sample.
Since then, Google and startups like Lyrebird have deployed WaveNet models in production (it’s been used to generate voices for the Google Assistant), but all implementations so far, including those from Facebook and Chinese search giant Baidu, have relied on powerful cloud platforms and custom-designed application-specific integrated circuits (ASICs) for processing. (Apple said in a blog post last year that it wasn’t feasible to use WaveNets in services like Siri yet because of their “extremely high computational cost.”) But Voysis, a Dublin startup, today announced that it has developed WaveNet-based tech that can run not only offline, but also on smartphones and other devices with mobile processors.
Voysis calls its solution ViEW, or Voysis Embedded WaveNet. ViEW, like other WaveNets, uses a convolutional neural network, a type of algorithm that takes a raw signal as input and synthesizes an output one sample at a time, to process raw audio signals directly. It needs only 50MB to run, which the company claims is 10 times smaller than Apple’s Siri model. It also takes advantage of graphics chips and other hardware acceleration where present, and it is available to Voysis clients starting today.
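To make the "one sample at a time" idea concrete, here is a minimal sketch of autoregressive generation with a stack of dilated causal convolutions, the general building block of WaveNet-style models. It is not Voysis' or DeepMind's implementation: the function names, the fixed weights, and the plain tanh activation are all assumptions chosen for illustration, and a real model would learn its weights and use a more elaborate architecture.

```python
import math


def causal_dilated_layer(x, weights, dilation):
    """Apply one causal convolution layer to the sequence x.

    Output position t reads only x[t], x[t - dilation], ... so it never
    depends on future samples. Positions before the start count as zero.
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                total += w * x[idx]
        out.append(math.tanh(total))  # squash into (-1, 1)
    return out


def predict_next(samples, dilations=(1, 2, 4), weights=(0.5, 0.3)):
    """Predict the next sample from the history (toy, untrained weights)."""
    hidden = list(samples)
    for dilation in dilations:
        hidden = causal_dilated_layer(hidden, weights, dilation)
    return hidden[-1]


def generate(seed, n_samples):
    """Extend the seed one sample at a time, feeding each output back in."""
    samples = list(seed)
    for _ in range(n_samples):
        samples.append(predict_next(samples))
    return samples


def receptive_field(dilations, kernel_size=2):
    """How many past samples the stacked layers can see."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    print(generate([0.1, -0.2], 5))
    print(receptive_field((1, 2, 4)))
```

Doubling the dilation at each layer lets a short stack cover a long stretch of audio history. The sample-by-sample loop in `generate` also hints at why such models are computationally expensive: every output sample requires a fresh pass through the network.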
“ViEW, Voysis Embedded WaveNet, is the beginning of the next evolution of voice and conversational capability. This technology opens the door to having intelligent conversations with any and all devices. As consumer data is processed locally on-device, consumer privacy concerns are addressed; and business concerns around datacenter costs, uptime, and maintenance are also addressed,” said Voysis cofounder Dr. Peter Cahill.
Traditional offline, at-the-edge text-to-speech systems employ a method called concatenation for synthesis. In essence, they divide databases of recorded speech into small units (individual phones, diphones, half-phones, syllables, words, phrases, and sentences) that software intelligently stitches together. Because of natural variations in speech and shortcomings in automated waveform segmentation techniques, the results often sound unnatural.
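WaveNets avoid the problem by generating novel speech: they produce the waveform itself, sample by sample, rather than stitching together prerecorded units.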
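The stitching step in concatenative synthesis can be sketched in a few lines. The example below is a simplified illustration, not any vendor's system: the unit names, the tiny hand-written waveforms, and the helper functions are all hypothetical, and a production system would select units from a large recorded database rather than a small dictionary.

```python
def crossfade_concat(units, overlap):
    """Join waveform snippets, linearly blending `overlap` samples at each seam."""
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        for i in range(n):
            fade_in = (i + 1) / (n + 1)  # weight of the incoming unit
            j = len(out) - n + i
            out[j] = out[j] * (1 - fade_in) + unit[i] * fade_in
        out.extend(unit[n:])
    return out


def synthesize(unit_names, database, overlap=2):
    """Look up each named unit and stitch the recordings together."""
    return crossfade_concat([database[name] for name in unit_names], overlap)


def max_jump(wave):
    """Largest step between neighboring samples, a crude seam-artifact measure."""
    return max(abs(b - a) for a, b in zip(wave, wave[1:]))


if __name__ == "__main__":
    # Hypothetical two-unit database; real units would be recorded audio.
    database = {"HH": [0.0, 0.2, 0.4, 0.4], "AY": [0.4, 0.6, 0.2, 0.0]}
    print(synthesize(["HH", "AY"], database))

    # Two units that do not line up at the seam.
    mismatched = [[1.0, 1.0], [0.0, 0.0]]
    print(max_jump(crossfade_concat(mismatched, 0)))  # hard cut
    print(max_jump(crossfade_concat(mismatched, 1)))  # blended seam
```

Blending softens a seam but cannot remove the underlying mismatch between two recordings, which is one reason stitched speech can sound unnatural.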
Voysis claimed a breakthrough in November, about a year ago, when it released eerily convincing speech samples produced entirely by an algorithm. “The new generation of speech technologies are going to emerge on the back of this,” Cahill told Forbes at the time.
Cahill, who’s spent the better part of 15 years working on voice recognition in academia, founded Voysis with the goal of tackling specific domains in natural language processing, like ecommerce and entertainment. Its Voysis Commerce platform allows retail clients to, for example, feed in a database of existing material, including copy written for advertisements and product pages, that informs a uniquely tailored voice model capable of tracking context. The algorithms improve over time and can be retrained with a single button press in Voysis’ cloud dashboard.
“Everything is reproducible by default, tasks are de-duplicated automatically, code can scale over thousands of machines without our scientists needing to write a single line of code,” Voysis writes on its website. Its proprietary speech recognition and deep learning tech is available in the form of a software development kit (SDK) for Android and iOS, in addition to APIs and JavaScript libraries that can be integrated into websites.
Voysis’ team, which is spread among offices in Edinburgh, Scotland, and Boston, more than doubled from roughly 15 people to 40 in 2017, thanks to $8 million in Series A capital from Polaris Partners. The startup counts Ian Hodson, the former head of Google’s text-to-speech program who led efforts on Google Maps, Google Assistant, and Android, among the core team.
It’s in a lucrative segment. The market for text-to-speech applications is expected to grow to $3 billion by 2022, according to Research and Markets, and sales of digital assistants could hit $4 billion by the same year. Sales of smart speakers like Google Home and Amazon Echo are on the rise, too — a September study by Adobe projects that about half of all consumers in the U.S. will own an in-home device with voice recognition capabilities by the end of this year.