
How Arm wants to bring machine learning to ordinary computing devices

Arm custom instructions
Image Credit: Arm

Above: VSI Labs’ full stack of AV technology in the trunk of their research vehicle.

Image Credit: Dean Takahashi

VentureBeat: Intel theoretically too, but —

Roddy: Intel actually buys a lot from Arm. Architecturally, at the level of — Nvidia is a great example. They have their own NPU. They call it the NVDLA. They recognize that for training in the cloud, yeah, it’s GPUs. That’s their bulwark. But when they talk about edge devices, they’ve even said that not everyone can have a 50-watt GPU in their pocket. They have their own version of what we talked about here, NPUs that are fixed-point implementations of integer arithmetic in different sizes. Things from 4 square millimeters down to a square millimeter of silicon. This thing runs at less than a watt. That’s way better than a high-powered GPU.

If you have a relatively modern phone in your pocket, you have an NPU. If you have bought an $800 phone in the last couple of years, it has an NPU. Apple has one. Samsung has one. Huawei has a couple of generations. They’ve all done their own. We would predict, over time, that the majority of those companies will not continue to develop their own hardware. A neural net is basically just a giant DSP filter. You have a giant set of coefficients and a big image, for example. I might have 16 million coefficients in my image classifier, and I have a 4-megapixel image. That’s just a giant multiplication. It’s multiply-accumulate. That’s why we talk about the multiply-accumulate performance of our CPUs. That’s why we build these NPUs that do nothing but multiply-accumulate. It’s a giant filter.
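
To make the "giant filter" point concrete, here is a minimal sketch in plain Python/NumPy (illustrative, made-up sizes; not Arm's libraries or any particular NPU) of the integer multiply-accumulate that dominates a quantized convolution layer:

```python
import numpy as np

# Hypothetical sizes, for illustration only: one 3x3 convolution layer
# quantized to 8-bit integers.
out_channels, in_channels, k = 128, 64, 3
weights = np.random.randint(-128, 128, size=(out_channels, in_channels, k, k), dtype=np.int8)
patch = np.random.randint(-128, 128, size=(in_channels, k, k), dtype=np.int8)

# Each output value is one long multiply-accumulate over the whole input patch.
# Accumulating in 32 bits keeps the 8x8-bit products from overflowing; this is
# the fixed-point MAC loop that dedicated NPUs implement in hardware.
acc = np.zeros(out_channels, dtype=np.int32)
for c in range(out_channels):
    acc[c] = np.sum(weights[c].astype(np.int32) * patch.astype(np.int32))
```

Scale that loop across every pixel position and every layer of the network and you get the billions of multiply-accumulates per image being described here.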

The reality is, there’s only so much you can do to innovate around an 8-bit-by-8-bit multiply. The basic building block is what it is. It’s system design. We have a lot of stuff in our design around minimizing data movement. It’s being smart about data movement at the block level, at the system level. I would not expect that, 10 years from now, every cell phone vendor and every automotive vendor has their own dedicated NPU. It doesn’t make sense. Software and algorithms, absolutely. Architecture, yeah. But the building block engines will probably become licensed, just like CPUs and GPUs have.
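
As a rough, generic illustration of what "being smart about data movement at the block level" can mean, here is a textbook tiling sketch (not a description of Arm's actual NPU design):

```python
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, tile: int = 32) -> np.ndarray:
    """Multiply a @ b one (tile x tile) block at a time.

    Each block is reused for many partial products once it has been fetched,
    instead of being re-read from distant memory for every output element --
    the same reuse idea an accelerator applies to weights and activations
    held in small local SRAM.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                out[i:i + tile, j:j + tile] += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
    return out

a = np.random.rand(256, 256)
b = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(a, b), a @ b)
```

The arithmetic is identical either way; what changes is how many times each byte has to cross the memory hierarchy, which is where much of the energy goes.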


No guarantee that we’re the ones that win it. We’d like to think so. Someone will be. There will probably be a couple of really good vendors that license NPUs, and most of the proprietary things will go away. We hope we’re one of the winners. We like to think we have the wherewithal to invest to win, even if our first ones aren’t market winners. Indications are it’s actually pretty good. We’d expect that to happen over the course of five or 10 years. At the system level, there are so many system design choices and system software choices. That’s going to be the key differentiator.


Above: Dipti Vachani of Arm announces Autonomous Vehicle Computing Consortium.

Image Credit: Dean Takahashi

VentureBeat: On the level where you would compete, then, does it feel like Arm is playing some catch-up, or would you dispute that?

Roddy: It depends on what you’re looking at and what your impression is. If you literally sit down and say, “Snapshot, right now: How much AI is running in the world at this very moment and where is it running?” Arm is the hands-down winner. The vast majority of AI algorithms don’t actually require a dedicated NPU. Machine learning goes all the way down to things like — you have predictive text on your phone. Your phone is probably enabled with “OK Google” or “Hey Siri.” That’s a machine learning thing. It’s probably not running on a GPU or an NPU. It’s probably just running on an M-class core.

If you look at cell phones in the market, how many smartphones are out there? In service, maybe 4 billion to 5 billion? About 15-20% of them have NPUs. That’s the last three generations of Apple phones, the last two to three generations of Samsung phones. Call it half a billion. Generously, maybe a billion. But everyone has Facebook. Everyone has Google predictive text. Everyone has a voice assistant. That’s a neural net, and it runs on the CPU with all those other systems. There’s no other choice.

If you take a quick snapshot and look at where most of the inference is running, it’s on a CPU, and most of it is Arm. Even in the cloud, when you talk about where inference is running — not training, but the deployment — the vast majority of those are running on CPUs. Mostly Intel CPUs, obviously, but if you’re using Amazon, there are Arm servers.

What’s the classic one in finance? I want to have satellite photos of shopping centers analyzed so I can see traffic patterns at Home Depot and know whether I should short or go long on Home Depot stock. People actually do that. You need a bunch of satellite imagery to train on. You also need financial reports. You have the pictures of all the traffic around all the Home Depots or all the JCPenney stores, you correlate that with the quarterly results over the last 15 years, and you build a neural net. Now we think we have a model correlating traffic patterns to financial results. Let’s take a look at live shots from the satellite over the last three days at all the Home Depots in North America and come up with a prediction for what their earnings will be.

That actual prediction, that inference, runs on a CPU. It might take weeks of training on a GPU to build the model, but I have 1,000 photos. Each inference takes half a second. You don’t fire up a bunch of GPUs for that. You run it and 20 minutes later you’re done. You’ve done your prediction. The reality is we’re the leading implementer of neural net stuff. But when it comes to perception about the glamour NPU, it’s true. We don’t have one in the marketplace today. By that token, we’re behind.
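
A quick back-of-the-envelope check, using only the numbers quoted above (the half-second latency is the speaker's rough figure, not a benchmark):

```python
photos = 1_000
seconds_per_inference = 0.5          # rough per-image CPU latency from the quote
compute_minutes = photos * seconds_per_inference / 60
print(f"~{compute_minutes:.1f} minutes of pure compute")   # ~8.3 minutes
# Add image fetching and model-loading overhead and "20 minutes later you're
# done" is in the right ballpark -- no GPU farm needed for the inference side.
```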

But admittedly, we’re just introducing our family of NPUs for design now. We have three NPUs. We’ve licensed them all. They’re in the hands of our customers. They’re doing design. You’re not going to see silicon this year. Maybe late next year. I don’t have anyone lined up to make an announcement. It’s going to be another decade before the whole field settles down. Huawei has their own. Apple has their own. Samsung has their own. Qualcomm has their own. Nvidia has their own. Everyone has their own. Do they all really need to spend 100 people a year investing in hardware to do 8-bit multipliers? The answer is, probably not.

VentureBeat: I remember that Apple had — they described their latest chip at their event. They said the machine learning bit was 6 times more powerful than the previous one. The investment into that part of the chip made sense. That’s the part that gives you a lot of bang for your buck. When you’re looking at these larger system-on-chip things in the phones and other beefy devices, are you expecting that part to get magnified, blown up, doubled and tripled down within that larger silicon pie?

Roddy: Some things yes, some things no. We’re seeing a proliferation of machine learning functionality in a couple of different ways. One of the unanticipated ways is how it’s pushed more rapidly into lower-cost devices than history might have predicted. Screen sizes and camera sizes used to trickle down at a regular rate, generation after generation, from high-end to mid-range to low-end phones. We’ve seen a much more rapid proliferation, because you can do interesting things with an NPU that in some ways allow you to cut costs elsewhere in the system, or enable functionality that’s different than the rest of the system.

The great example in low-cost phones — take the notion of face unlock. Face unlock is typically a low-power, low-resolution camera that has to discern my face from your face. That’s about all it needs to do. If you’re a teenager, your buddy can’t open your phone and start sending funny texts. That typically runs just in software on a CPU, typically an Arm CPU. That’s good enough to unlock a phone, whether it’s a $1,000 phone or a $100 phone.

But now you want to turn that $100 phone into a proxy banking facility for the unbanked in the developing world. You don’t want just a bad camera doing a quick selfie to determine who’s making a financial transaction. You need much more accurate 3D mapping of the face. You probably want a little iris scan along with it. If you can do that with a 20-, 30-, 40-cent addition to the applications processor by having a small dedicated NPU that only gets turned on to do the actual detailed facial analysis, that’s about the size of what people want from the smallest in our family of NPUs.

Suddenly, for the $100 cell phone it makes sense to put in a dedicated NPU, because it enables that phone to become a secure banking device. It wasn’t about making the selfies look better. The guy buying a $100 phone isn’t willing to pay to make selfies look better. But the banking companies are willing to subsidize that phone to get that stream of transactions, for sure, if they pick up a penny for every 80-cent microtransaction that occurs in Bangladesh or wherever. We’re seeing functionality that originally started as a vanity thing — make the Snapchat filter pretty, make my selfie look 20 years younger — now you can use it for different things.

VentureBeat: The machine learning percentage of the silicon budget, what would you say that is?

Roddy: It depends on the application. There are some product classes where they’re willing to put — what’s the state of the art these days? People putting 10 or 12 teraops into a cell phone. One thing we do is look at various types of functionality, and what’s the compute workload? How much of it is the neural net piece and how much of it is other forms of computation?

Something like speech processing, nah, forget it. It can run on M-class CPUs. You don’t need an NPU to be able to do “OK Google” and “Hey Siri.” Go to the other end of the spectrum and look at something like video green screen, which would be me with my selfie saying, “Look at me! I’m on the beach!” It cuts me out and puts me in a beach scene, even though I’m really in a dull conference room. “Hi, honey, I’m still at the office,” even though I’m at the ball game. That takes a tremendous amount of horsepower.

But if you spend $1,200 on the latest phone and you want the coolest videos because you’re an Instagram influencer, sure. If it costs 5 bucks extra to pack a 20 teraop NPU in the phone, why not? Because it’s cool. It’s being driven by both ends. There’s some neat stuff you can do.