Facebook AI’s RoBERTa improves Google’s BERT pretraining methods


Facebook AI and University of Washington researchers have devised ways to enhance Google’s BERT language model, achieving performance on par with or exceeding state-of-the-art results on the GLUE, SQuAD, and RACE benchmark data sets. The researchers detailed how RoBERTa works in a paper published last week on arXiv.

Named RoBERTa for “Robustly Optimized BERT approach,” the model adopts many of the techniques used by Bidirectional Encoder Representations from Transformers (BERT), a novel natural language model open-sourced by Google last fall.

Part of what sets RoBERTa apart is that it pretrains with larger batches of data and changes the masking pattern applied to the training data. During pretraining, the original BERT uses both masked language modeling and next-sentence prediction; RoBERTa drops the next-sentence prediction objective.
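To make the masking idea concrete, here is a minimal sketch of masked language modeling data preparation, contrasting a mask pattern that is fixed once with one that is redrawn on every pass over the data. It is an illustration, not the researchers' code: the `mask_tokens` helper, the 15% masking rate used as the default, and the whitespace tokenization are assumptions made for the example, and the full BERT recipe (which sometimes substitutes random or unchanged tokens instead of the mask symbol) is omitted.

```python
import random

MASK = "[MASK]"  # placeholder symbol; chosen here for illustration


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hypothetical helper: hide each token with probability mask_prob.

    Returns (masked_tokens, labels). labels holds the original token at
    masked positions and None elsewhere, so a model would only be scored
    on the positions it has to reconstruct.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog".split()

# Fixed pattern: mask once, then reuse the same masked copy every epoch.
static_masked, _ = mask_tokens(sentence, rng=random.Random(0))
static_epochs = [static_masked for _ in range(3)]

# Redrawn pattern: sample a new mask each time the sentence is seen.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(sentence, rng=rng)[0] for _ in range(3)]

for epoch, toks in enumerate(dynamic_epochs):
    print(epoch, " ".join(toks))
```

With a fixed pattern the model sees the same hidden words on every pass; redrawing the mask exposes different words over time, which matters more as training runs get longer.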

Overall, RoBERTa achieves state-of-the-art results in 4 of 9 GLUE benchmark tasks and boasts an overall GLUE task performance on par with XLNet.



“We find that BERT was significantly undertrained and can match or exceed the performance of every model published after it,” the report reads. “Our training improvements show that masked language model pretraining, under the right design choices, is competitive with all other recently published methods.”

To make RoBERTa, researchers used 1,024 Nvidia V100 GPUs for roughly one day.

The original BERT was trained on 16GB of text from the BookCorpus data set and English Wikipedia. RoBERTa’s training data adds CommonCrawl News (CC-News), a 76GB data set of 63 million English news articles collected between September 2016 and February 2019.

“Finally, we pretrain RoBERTa for significantly longer, increasing the number of pretraining steps from 100K to 300K, and then further to 500K. We again observe significant gains in downstream task performance, and the 300K and 500K step models outperform XLNet across most tasks,” the report reads.

The introduction of RoBERTa continues what’s been an active year for massive language understanding AI systems such as OpenAI’s GPT-2, Google Brain’s XLNet, and Microsoft’s MT-DNN, each of which surpassed BERT in benchmark performance results.

Training such models can be extremely expensive and carry a sizable carbon footprint.

Earlier this month at Transform 2019, Facebook AI VP Jérôme Pesenti said that compute demands for cutting-edge or robust systems are a challenge even for companies like Google and Facebook.