Microsoft announced today that its use of specialized hardware for AI computation delivered more than 10 times faster performance for a machine learning model that powers features of its Bing search engine.
The system, which the company calls Brainwave, is designed to take a trained neural network and run it as quickly as possible, with minimal latency. The goal is to provide roughly real-time artificial intelligence predictions for applications like the new Bing features. The announcement is another step toward Microsoft making that hardware acceleration available to its cloud customers, who could then run their own AI models on dedicated hardware.
Bing also received a handful of feature updates today, including support for defining less-used words when users hover their mouse pointers over them and offering multiple answers to how-to questions. Those features are enabled by the additional power Brainwave provides.
Microsoft is using field-programmable gate arrays (FPGAs) from Intel to power its AI computation. FPGAs are essentially blank canvases that allow developers to deploy a wide variety of circuits by loading new configurations in software. That provides an interesting combination of programmability and performance, since the resulting circuits are optimized for particular applications (like AI computation) but can be changed without building a new chip.
That hardware allowed Microsoft not only to create faster models but also to build more complex AI systems that would have required prohibitive amounts of compute capacity without specialized hardware. For example, Bing’s Turing Prototype 1 model became 10 times more complex than a version built for a CPU as a result of the added computation capacity that came from using Brainwave. And while the Brainwave version is more complicated, Microsoft can also get results back from that model more than 10 times faster.
Microsoft’s approach to AI computation differs from that of some of its peers, like Google, which created its own tensor processing unit (TPU) chips to provide similar functionality. Unlike FPGAs, Google’s TPUs can’t be reconfigured once they’re built. (Google deals with this by making its chips’ architecture as general as possible to deal with a wide variety of potential situations.)
The FPGAs Microsoft has deployed have dedicated digital signal processors on board that are optimized for performing some of the particular types of math needed for AI. This way the company is able to get some of the same benefits that come from building an application-specific integrated circuit (ASIC) such as a TPU.
Amazon has infrastructure-as-a-service instances that include attached FPGAs available through its cloud, though the company hasn’t discussed its own use of the hardware at length.
The Brainwave system is made up of a few components, starting with hundreds of thousands of FPGAs that Microsoft deployed in its datacenters around the world. Most of the company’s servers have an FPGA board attached that is connected to the top-of-rack network switch. That allows the servers to handle software-defined networking operations (as found in Azure Accelerated Networking) but also provides Microsoft with a pool of hardware-accelerated compute that isn’t necessarily tied to one FPGA per server.
For example, Brainwave can distribute a model across several FPGAs while tasking a smaller number of CPUs to support them. If a machine learning model requires the use of multiple FPGAs, Microsoft’s system bundles them up into what the company calls a “hardware microservice,” which then gets handed off to the Brainwave compiler for distributing a workload across the available silicon.
That compiler will take in a finished model, created using an AI programming framework like TensorFlow (which originated at Google) or Microsoft’s Cognitive Toolkit (also known as CNTK), and transform it into an intermediate representation that can then be split across multiple FPGAs for the best performance possible.
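In spirit, that flow looks something like the toy sketch below: a trained model is flattened into a simple intermediate representation, then greedily packed onto a pool of FPGAs, with any model that spans more than one device becoming a “hardware microservice.” All of the names, numbers, and the packing heuristic here are hypothetical; Microsoft hasn’t published its compiler internals.

```python
# Toy illustration of the compile-and-partition idea (not Microsoft's compiler).
# A trained model is lowered to a flat list of ops (a stand-in "intermediate
# representation"), then greedily packed onto FPGAs; a model that lands on more
# than one device corresponds to a "hardware microservice".
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    kind: str          # e.g. "matmul", "relu"
    weight_bytes: int  # rough on-chip memory footprint

def lower_to_ir(model_layers):
    """Flatten framework-level layers into a simple op list (stand-in IR)."""
    return [Op(name=f"op{i}", kind=kind, weight_bytes=size)
            for i, (kind, size) in enumerate(model_layers)]

def partition(ir, fpga_mem_bytes):
    """Greedily assign consecutive ops to FPGAs until each device's memory fills."""
    placements, current, used = [], [], 0
    for op in ir:
        if used + op.weight_bytes > fpga_mem_bytes and current:
            placements.append(current)
            current, used = [], 0
        current.append(op)
        used += op.weight_bytes
    if current:
        placements.append(current)
    return placements  # one sub-list of ops per FPGA

# Hypothetical model: three large matrix multiplies with activations in between.
layers = [("matmul", 40_000_000), ("relu", 0),
          ("matmul", 40_000_000), ("relu", 0),
          ("matmul", 40_000_000)]
plan = partition(lower_to_ir(layers), fpga_mem_bytes=64_000_000)
print(f"model spans {len(plan)} FPGAs")  # > 1 device -> a 'hardware microservice'
```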
Rather than optimize every model for execution on an FPGA, Microsoft has created a soft processor it implements on its chips that is designed to provide a general-purpose execution environment for machine learning inference. That way, developers don’t need to spend time optimizing individual gates but still get the benefits of hardware-accelerated compute.
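The division of labor resembles a small interpreter: the soft processor exposes a handful of coarse instructions (matrix multiply, activation, and so on), and a compiled model is just a program made of them. The Python below is a purely conceptual analogue; the actual instruction set isn’t public, and the instruction names here are made up.

```python
# Conceptual analogue of a "soft processor" for inference: the hardware exposes
# a few coarse instructions, and a model is a program built from them, so model
# authors never touch individual gates. Illustrative only.
import numpy as np

def run_program(program, x, weights):
    """Execute a list of (instruction, operand) pairs on an input vector."""
    for instr, arg in program:
        if instr == "MATMUL":      # dense layer: multiply by a weight matrix
            x = x @ weights[arg]
        elif instr == "RELU":      # elementwise non-linearity
            x = np.maximum(x, 0.0)
        else:
            raise ValueError(f"unknown instruction {instr!r}")
    return x

# A two-layer network expressed as a "program" for the soft processor.
weights = {"w0": np.random.randn(8, 16), "w1": np.random.randn(16, 4)}
program = [("MATMUL", "w0"), ("RELU", None), ("MATMUL", "w1")]
print(run_program(program, np.random.randn(8), weights).shape)  # (4,)
```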
It’s hard to compare Microsoft’s results to those other companies have reported, since its paper provides concrete performance metrics only for proprietary neural networks. That’s not to say Microsoft is alone in this: Google’s paper laying out how its TPU performed offers performance numbers only for the company’s homegrown algorithms.
And Microsoft didn’t compare the results of its Brainwave tests against GPUs, which have become a popular choice for AI computation. The benefit FPGAs provide over those chips is that they don’t require extensive use of batched calculations. (For its part, Nvidia is working on improving its chips for use with AI in order to help overcome that limitation.)
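A back-of-the-envelope calculation shows why batching hurts latency-sensitive serving. The numbers below are hypothetical and not taken from Microsoft’s or Nvidia’s measurements; they only illustrate the queuing cost of waiting for a batch to fill.

```python
# Back-of-the-envelope queuing cost of batched inference (hypothetical numbers).
arrival_rate = 1000   # requests per second reaching the service
batch_size = 32       # batch needed to keep a batch-oriented accelerator busy
compute_ms = 2.0      # assumed time to run one batch (or one request), in ms

# A request arrives, on average, in the middle of the batch-filling window,
# so it waits roughly half of that window before compute even starts.
fill_window_ms = (batch_size - 1) / arrival_rate * 1000
avg_wait_ms = fill_window_ms / 2

print(f"batch-of-{batch_size}: ~{avg_wait_ms + compute_ms:.1f} ms per request "
      f"({avg_wait_ms:.1f} ms of that is just waiting for the batch)")
print(f"batch-of-1 (FPGA-style): ~{compute_ms:.1f} ms per request")
```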
One of the key innovations allowing Microsoft to post such high performance with the FPGAs is the use of new 8- and 9-bit floating point (i.e. numbers with a decimal point) data types. Microsoft found that those data types provided increased performance over fixed point data types like 8- or 16-bit integers, with minimal retraining necessary to take advantage of them.
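The appeal of a narrow float over a fixed-point integer is dynamic range. As an illustration only (the bit split below is an assumption, not Microsoft’s published format), an 8-bit float with a 5-bit exponent spans values from roughly 10^-5 to 10^4, while a signed 8-bit integer under a single fixed scale offers just 256 evenly spaced steps.

```python
# Rough dynamic range of a hypothetical 8-bit float (1 sign, 5 exponent,
# 2 mantissa bits, IEEE-style bias) versus a signed 8-bit integer.
# The bit split is an assumption for illustration, not Microsoft's format.
exp_bits, man_bits = 5, 2
bias = 2 ** (exp_bits - 1) - 1                      # 15
max_exp = (2 ** exp_bits - 2) - bias                # 15 (all-ones exponent reserved)
largest = 2.0 ** max_exp * (2 - 2.0 ** -man_bits)   # ~57344
smallest_normal = 2.0 ** (1 - bias)                 # ~6.1e-05

print(f"8-bit float: ~{smallest_normal:.1e} to ~{largest:.1e}")
print("int8       : -128 to 127 (a single fixed scale, 256 steps)")
```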
Microsoft is also starting to deploy a special appliance in its datacenters that includes several FPGAs without all of the other server components. That way, the company will be better able to handle the load from Brainwave, since it can scale up the amount of soft-programmable hardware available without racking more servers. While more complex models will often require more than one FPGA, they don’t require one CPU for every FPGA.
Microsoft is no stranger to using FPGAs to accelerate its AI computation. The company’s Bing team started working with the hardware in 2012 and has been increasing its use of those chips since then.
This news is also good for Intel, which acquired FPGA maker Altera in 2015. That deal, worth $16.7 billion, provided the chipmaker with the fuel to power Microsoft’s needs.
Now Microsoft is working on making Brainwave accessible to the outside. The FPGAs already power parts of its Cognitive Services portfolio of intelligent APIs that allow people to embed intelligent capabilities into their applications without AI expertise. The company is also planning to make the Brainwave-powered text comprehension capabilities it discussed today available to enterprise customers through its Bing for Business service.
Further down the road, it’s likely we’ll see a Brainwave service available through Microsoft Azure so customers can deploy their own models on top of Microsoft’s FPGAs.