Nvidia today announced that it has trained the world’s largest language model, just the latest in a series of updates the GPU maker has aimed at advancing conversational AI.
To achieve this feat, Nvidia used model parallelism, a technique that splits a neural network into pieces so that models too big to fit within the memory of a single GPU can still be trained. The model uses 8.3 billion parameters and is 24 times larger than BERT and 5 times larger than OpenAI’s GPT-2.
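The article does not say how Nvidia divided the network, so the following is only a toy sketch of the general idea and not Nvidia's implementation. It assumes one common form of model parallelism, in which the weight matrix of a single layer is split by columns across devices; the "devices" here are plain Python lists, and the names `split_columns` and `parallel_forward` are hypothetical.

```python
# Toy illustration of model parallelism on one linear layer.
# Assumption: the layer's weight matrix is split by columns across devices.
# "Devices" are simulated with plain Python lists; no GPU is involved.

def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, num_devices):
    """Split the columns of w into num_devices contiguous shards.

    Assumes the column count divides evenly by num_devices.
    """
    shard = len(w[0]) // num_devices
    return [[row[d * shard:(d + 1) * shard] for row in w] for d in range(num_devices)]

def parallel_forward(x, shards):
    """Each simulated device computes its slice of the output; slices are concatenated."""
    out = []
    for w_shard in shards:
        out.extend(matmul(x, w_shard))
    return out

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, num_devices=2)  # each shard holds half the weights
full = matmul(x, w)                       # single-device reference result
sharded = parallel_forward(x, shards)     # result assembled from the shards
```

Each shard holds only half of the layer's weights, yet the concatenated output matches the single-device result. That is the property that lets a model exceed the memory of any one GPU.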
Nvidia also announced the fastest training and inference times for Bidirectional Encoder Representations from Transformers (BERT), a popular model that was state of the art when it was open-sourced by Google in 2018.
Using optimized PyTorch software and a DGX-SuperPOD of more than 1,000 GPUs, Nvidia was able to train BERT-Large in 53 minutes.
“Without this kind of technology, it can take weeks to train one of these large language models,” Nvidia applied deep learning VP Bryan Catanzaro said in a conversation with reporters and analysts.
Nvidia also claims it has achieved the fastest BERT inference time, 2.2 milliseconds, by running on a Tesla T4 GPU with TensorRT 5.1 optimized for datacenter inference. BERT inference takes up to 40 milliseconds when served by CPUs, while many conversational AI operations shoot for 10 milliseconds today, Catanzaro said.
GPUs have also enabled gains for Microsoft’s Bing, which has used Nvidia hardware to cut latency in half.
Each of the advances introduced today is meant to underline the performance gains Nvidia’s GPUs can provide for language understanding. Code for each of the above feats was open-sourced today to help AI practitioners and researchers explore the creation of large language models or speed up training or inference with GPUs.
Alongside a sharp decline in word error rates, reduced latency has been a major driver of adoption for popular AI assistants like Amazon’s Alexa, Google Assistant, and Baidu’s Duer.
Exchanges with little to no delay lead to machine-to-human conversations that feel more like human-to-human conversations, which generally happen at the speed of thought.
Like multi-turn dialogue features introduced for Microsoft’s Cortana, Alexa, and Google Assistant this year, real-time exchanges with an assistant make back-and-forth interactions feel more natural.
Progress in the state of the art for conversational AI systems has largely revolved around Google’s Transformer-based language model, introduced in 2017, and BERT, introduced in 2018.
Since then, BERT has been surpassed by Microsoft’s MT-DNN, Google’s XLNet, and Baidu’s ERNIE, each of which builds on BERT. Facebook introduced RoBERTa, also derived from BERT, in July. RoBERTa is currently ranked atop the GLUE benchmark leaderboard, with the best score in four of nine language tasks. Each of the models outperforms the human baseline on GLUE tasks.