Transcript for:
Understanding NPUs and Their Role in AI

This video is sponsored by Skillshare. If you've seen what feels like any tech announcement lately, you're probably already sick and tired of hearing about on-device AI. Apple Intelligence. Personal intelligence right where you need it. On-device AI. AI advancements from Google. Meta AI built in. Apple Intelligence, Microsoft Recall, and even Meta's AI supposedly all run machine learning models right on your device, and suddenly every tech company is talking about NPUs, or neural processing units, the chips in the machines that suddenly run all of this hot new AI. So we have to ask: are NPUs the real deal, or are they the next Silicon Valley snake oil after NFTs and blockchains and metaverses have run out, I guess? Well, I've done the research, and it turns out it's a bit of both. This is a close-up picture of a modern high-end laptop chip from Apple called the M3 Max. It is an SoC, or a system on a chip, which means that it includes many different kinds of processors all packaged together. Here's the CPU, the central processing unit, along with its various cores and caches. Here's the GPU, the graphics processing unit. Here are the specialized parts that handle your displays and your I/O, which is basically USB ports and Thunderbolt ports, etc. And perhaps most interesting to us, here is the NPU, the neural processing unit, or as Apple likes to call it, the Neural Engine. Now straight away, for the amount of hype that all the tech companies have given these NPUs, I kind of expected them to be bigger, like, physically. On the M3 Max, even boring parts of the chip like the display engines and the I/O take up roughly five times as much space as the much-hyped Neural Engine. Interestingly, the M3 and the M3 Pro use the same size NPU as the Max does, so as the chips get smaller, the NPU takes up proportionally more space, and it gets proportionally bigger still on the really small chips like the A17 Pro used in the iPhone. Obviously, size, or what the industry calls die area, is at an extreme premium in all of these devices, and in the next clip you'll be able to hear how chip makers definitely think that allocating a lot of it to the NPU is a real trade-off. "You guys are always pushing us to give you more. You're always saying, more TOPS, Lisa, more TOPS." That was AMD's CEO Lisa Su, implying that Microsoft really pushed them to prioritize the NPU over adding something else, like more CPU cores, for example. So to sum things up, we can see two pretty clear decisions made by the industry so far. First, while NPUs are growing in importance, they are still relatively minor parts of our chips. And second, they are clearly a higher priority in smaller devices and a lower priority in the bigger ones. There is a clear correlation with device size, and this holds true beyond just the four Apple chips that we've looked at, too. NPUs have been common in smartphones since 2017, so for seven years now, and they can be found in loads of modern smartwatches and even in chips powering tiny devices like the Meta Ray-Bans. Meanwhile in PCs, they are just now starting to roll out. That's a pretty clear correlation, and now, to understand why NPUs actually matter, let's take a look at the concept of accelerators. Your computer is a math machine. All it does is calculate things. Every pixel, every sound wave, every character of text at the end of the day is represented by your computer as a series of numbers, which it then does math on. In the early days, all of this math was done on the CPU, a single general-purpose math machine.
CPUs are built to be super precise and also flexible, so they can do any type of calculation that your computer could need, including really complex logic using huge numbers, etc. But computer scientists pretty quickly realized that many calculations simply don't need this amount of flexibility or precision. Take graphics, for example. On a 4K display, you have roughly 8 million pixels, and you'd want to refresh these at least 60 times a second. That is around 500 million updates per second. Now, the math for calculating each and every pixel is super simple, but doing it 500 million times every second is quite a challenge. And this math is exactly what GPUs, or graphics processing units, are good at, with the best ones currently having over 10,000 compute units. These are much simpler than the big cores of your CPU, but they're designed to run in parallel, so they can do a ton of small calculations very quickly. Besides GPUs, there are also tons of other accelerators, such as video encoders and decoders, or the image signal processors built into every phone to handle the math needed to make your cameras work, etc. The idea here is pretty simple: if a calculation has to be performed again and again and again, it might be more efficient to just build a brand new chip dedicated to this calculation than to have your CPU do it. And that brings us to the newest accelerator on the block, the NPU, an accelerator built to run the math behind neural networks. Neural networks have for years now been used, especially on mobile devices, to power everything from autocomplete on your keyboard, to detecting which parts of an image count as background to be blurred in portrait mode, to detecting faces in Snapchat so filters can be drawn on them, to powering more accurate and faster voice dictation, to turning sensor readings into useful information like, I don't know, step counts or heart rates or crash detection, etc. Tim Cook, for example, has claimed that around 200 different models are running on the iPhone already today. To be clear, the vast majority of that actually predates the current generative AI hype, and while many marketing departments would now like you to believe that neural networks are somehow magic, the good news is that they're actually just math. Or depending on how you see things, that might actually be the bad news, because we're now gonna have to do a little bit of math together. And I promise you, I kept everything as simple as possible. I even made a little neural network just for us. So welcome to the dumbest neural network possible. This is Martin's image recognition machine. Let's just call it MIM. Okay, so the point of this network is that you can feed it an image that consists of exactly four black and white pixels, and the network can then tell you if this image has a diagonal line either this way or that way. That's it, it's the tiniest, most primitive cousin of an image recognition model. The model consists of so-called neurons, which are really just memory cells that hold a number, and these are arranged into layers which are then connected. And these connections all have specific numbers called weights associated with them, which I'll explain in a little bit. So in our little four-pixel image, let's assign a value to each pixel: one if it is white, and zero if it is black. This way we can express any of our images with just four numbers.
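As a quick aside, here is a minimal sketch of the two small calculations mentioned above: the rough arithmetic behind the "500 million updates per second" claim, and what the four-number encoding of an image could look like. The pixel ordering (left to right, top to bottom) is my own assumption for illustration; the video doesn't specify one.

```python
# Rough arithmetic behind the 4K example: ~8 million pixels, refreshed 60 times a second.
pixels_per_frame = 3840 * 2160                 # 8,294,400 pixels on a 4K display
updates_per_second = pixels_per_frame * 60     # ~498 million, i.e. "around 500 million"
print(f"{updates_per_second:,} pixel updates per second")

# Encoding a four-pixel black-and-white image as four numbers for MIM.
# Assumed pixel order: left to right, top to bottom. 1 = white, 0 = black.
diagonal_image = [1, 0,
                  0, 1]                        # white pixels form a top-left to bottom-right diagonal
```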
We can take these four numbers and then input them into the first layer of our model, which we'll call the input layer for very obvious reasons. That's right, we've now given the model our image as an input. Meanwhile, the output layer has two neurons. I made it so that if the first one is at 1, it means that the image gets recognized as having a diagonal line like this. If the second one is at 1, then the diagonal line should go like this. And if anything else is shown, there is no diagonal line detected. So our two output neurons are configured to tell us if the diagonal line goes like this or like this. Hence, image recognition. And the calculation goes like this: take the first value from layer 1, multiply it by its weight, and add it to the first neuron on layer 2. Then you do the same multiplication and addition for every single neuron and connection in the network, and you'll get a result. Our simple network accurately shows a 1 for this diagonal line here. If we take the other diagonal line as input into our system, it correctly shows a 1 in the second neuron. And if we pick any other combination of pixels, we get something other than a 1. So this network correctly identifies our images. Yay! Why does it do that? Because the weights are set correctly. And this being a very simple network, I simply chose those weights manually to incentivize the connections that I wanted and to disincentivize the ones that I didn't. And that's it. On the fundamental level, that is the basic math behind neural networks. Now of course, real neural networks are many, many, many orders of magnitude more complicated than this. You might have billions of neurons, and all of these would be arranged into multiple so-called hidden layers for more complex logic, and so on. All of these adjustable numbers in a model are called parameters, and complex models can have over a trillion of them. That's right, over a trillion parameters, and figuring out all of the intricacies of these networks is done with a process called training, which typically happens on some gigantic remote server farm. Once we have a model trained though, we can start using it, potentially even on our own machines, and that's how we get to NPUs. And because you now know the basic math that this requires, you also know what the requirements for the NPU actually are. The calculation that you've already seen is called a multiply-accumulate calculation, because you first multiply two numbers and then add the result to a node. Then you multiply two more numbers, and you add that to the same node, and so on. You multiply, and you accumulate, and that's basically it. In the real world you might also add a few other things like biases and activation functions, which I won't explain in this video. And you'd also structure these calculations as matrix multiplications. But a multiply-accumulate calculation is the basis here, and this task has some very obvious characteristics. First, the calculations themselves are actually extremely simple and repetitive, but we might have to do billions of them a second. So we really want highly specialized hardware with many simple compute units working in parallel. Second, we need a ton of RAM, because the model, with its potentially billions of parameters, has to be loaded into memory basically all at once. This is why Copilot+ PCs require at least 16 GB of RAM, and why iPhones with less than 8 GB of RAM won't be eligible for Apple Intelligence at all.
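To make that multiply-accumulate loop concrete, here is a minimal Python sketch of a MIM-style network: four input neurons, two output neurons, fully connected, with the forward pass written as plain multiply-and-accumulate steps. The specific weight values are hand-picked for this sketch (the transcript doesn't show the actual numbers used in the video), chosen so that each output neuron rewards one diagonal pattern and penalizes the other.

```python
# weights[i][j] connects input neuron i to output neuron j.
# Hand-picked values for this sketch: each output rewards one diagonal, penalizes the other.
weights = [
    [ 0.5, -0.5],   # top-left pixel
    [-0.5,  0.5],   # top-right pixel
    [-0.5,  0.5],   # bottom-left pixel
    [ 0.5, -0.5],   # bottom-right pixel
]

def forward(image):
    """Run the forward pass: multiply each input by its weight and
    accumulate the result into the corresponding output neuron."""
    outputs = [0.0, 0.0]
    for i, pixel in enumerate(image):            # every input neuron...
        for j in range(len(outputs)):            # ...connects to every output neuron
            outputs[j] += pixel * weights[i][j]  # multiply, then accumulate
    return outputs

print(forward([1, 0, 0, 1]))  # [1.0, -1.0] -> first output at 1: "\" diagonal detected
print(forward([0, 1, 1, 0]))  # [-1.0, 1.0] -> second output at 1: "/" diagonal detected
print(forward([1, 1, 0, 0]))  # [0.0, 0.0]  -> neither output at 1: no diagonal
```

With these weights, only the exact pattern [1, 0, 0, 1] drives the first output to exactly 1, and only [0, 1, 1, 0] drives the second one to 1, which mirrors the behavior described above; every other pixel combination lands somewhere else.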
The amount of RAM directly impacts how smart of a model you can use on any given machine. Third, the NPU also needs its own ultra-fast cache to store the results of its billions of small calculations without having to talk to the main system RAM each time. And fourth, we can speed up these calculations dramatically by accepting lower precision. Precision in this case simply means how much memory a computer allocates to any given number. For example, the more decimals that you want the computer to store for a number, the more memory it will need. Precision costs memory. And because we typically want neural networks to just output some probabilistic guess instead of a highly precise answer anyway, we can kind of get away with lower precision, and that in turn speeds up our calculations massively. And there you have it, the four things that NPUs need to be optimized for. They need to be able to do many calculations in parallel, so many little cores; they need a ton of RAM; they need a ton of cache; and they also need low precision. And if you think that this sounds an awful lot like what we've recently said about GPUs, then you're correct. This is why GPUs are very often used for running AI workloads. Since 2017, many GPUs have even had dedicated AI hardware built in, which NVIDIA, for example, calls Tensor Cores. An RTX 4090 card has a whopping 512 of these Tensor Cores dedicated to AI, next to its other graphics-focused cores, and by today even integrated GPUs like the one found on Intel's Lunar Lake have plenty of AI hardware too. In fact, on Lunar Lake the GPU actually has higher peak AI performance than the much-hyped NPU, and an NVIDIA 4090 card absolutely blows any current NPU out of the water in terms of raw AI performance. It is not even close. So in short, your GPU already is your primary AI chip when peak performance is desired. It has dedicated hardware and software for running neural networks. And developers already primarily target GPUs for any workload that needs peak performance from a neural network, like, say, tasks for video editing. NPUs, on the other hand, are much more about power efficiency than they are about peak performance. The NPU on the Snapdragon X Elite chip, according to Microsoft, consumes less than 5 watts of power even at full load. It is a low-power chip designed to run permanently in the background without having to spin up the main CPU or the GPU. This, by the way, explains why NPUs first became popular in mobile devices. On PCs we've generally optimized for peak performance in the past, while mobile users have always had to prioritize power efficiency, especially given that they run tasks like crash detection, heart rate monitoring, and so on permanently in the background. So really, the question for computer makers now is what kind of permanent, efficient neural network workloads they can come up with that run in the background and would really benefit from an NPU. Stuff like image generation or a chatbot running locally is often demoed, but I think this is actually kind of silly. Your GPU could probably run these tasks just as well as your NPU, and I bet that they're only used because they are the only features that regular consumers can actually understand. More compelling is stuff like real-time captioning and translation of audio playing on your computer, or, for example, Windows Studio Effects that can blur the background, isolate your voice, cancel echo, etc. on the fly.
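Circling back to the fourth requirement above, here is a minimal sketch of what "accepting lower precision" can mean in practice. It assumes a simple symmetric 8-bit quantization scheme purely for illustration; the video doesn't specify a scheme, and the ones used on real NPUs are more sophisticated.

```python
import numpy as np

# One million hypothetical model weights stored at 32-bit float precision.
weights_fp32 = np.random.randn(1_000_000).astype(np.float32)

# Simple symmetric quantization to 8-bit integers:
# map the largest-magnitude weight to 127 and round everything else.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)

print(weights_fp32.nbytes)  # 4,000,000 bytes (~4 MB)
print(weights_int8.nbytes)  # 1,000,000 bytes (~1 MB): a quarter of the memory

# Dequantize and measure the rounding error that lower precision accepts.
# For a network that only needs to output a probabilistic guess, an error
# of this size is usually tolerable.
restored = weights_int8.astype(np.float32) * scale
print(np.abs(weights_fp32 - restored).max())  # at most about half of `scale`
```

The same trade-off shows up in compute: hardware can pack far more 8-bit multiply-accumulate units into the same die area and power budget than 32-bit ones, which is exactly where the NPU's efficiency comes from.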
You could easily use features like these for hours, and do so on a battery, so using an NPU for them totally makes sense. So that's nice, but I don't think we've seen any real game changers in this category just yet. Instead, the most compelling candidate for an NPU by far was, of course, Recall: a system on Windows that takes screenshots of what you do every few seconds, runs image recognition and optical character recognition software on each frame, and creates a searchable database of everything on the fly, all running quietly in the background without absolutely nuking your battery. This is exactly the kind of task that an NPU would be exceptionally good for. Computation-wise, this is the absolute sweet spot for the NPU. It is a pretty complicated math problem, and yet it should happen quietly in the background basically all the time without the user even having to notice it. Now the question, of course, is whether people will, at least eventually, accept this kind of an intrusive system, or whether developers will at least come up with other tools that are just as good a use of an NPU as this. For now, I think the answer to both of those questions is not very clear, and so I also think that the value of the NPU is not overwhelmingly obvious either. Now I don't know about you, but for me, seeing this wave of generative AI taking over everything has ironically only gotten me to care about creativity and actually creating things myself a lot more than before. Not prompt engineering, but like, actually creating stuff. And for that I've been spending more and more time on Skillshare. I'm particularly obsessed with Affinity Photo, which is the program that I've been making my thumbnails in recently, and I was watching this fantastic six-hour Skillshare masterclass to get a lot better at it. You might prefer illustration or graphic design or, god forbid, drawing things by hand. Or here's an idea: maybe you want to learn how to make really good video essays, and would you look at that? Researching and Writing for YouTube: How to Make a Great Video Essay. This is a Skillshare class with a complete breakdown of my personal thinking process and workflow, and just to toot my own horn a little, it's really damn well received. I genuinely really liked making this class. Whatever your creative field of interest is, Skillshare is the largest online learning community, so they have the right class for you, and the platform is custom designed to help you learn. There are no distractions, you get to submit class projects and discuss with fellow students, and recently they've also introduced Learning Paths, which kind of chain together classes in a curated way to get you really deep into a topic. The summer is my favorite time to step back a little and to start working on either some hobby projects or my work-related skills that I've been neglecting over the years. And for both of those, Skillshare is excellent. The first 500 people to use my link in the description will receive a one-month free trial of Skillshare. So get started today, I hope you check out my class as well, and I'll see you in the next video.