About 20 years ago, NVIDIA had these GPUs that were just for rendering lots of pixels. They still have them. And a guy called Ian Buck said, for his PhD, "What if we used this for fluid mechanics?" His whole PhD was built on turning graphics processing into general-purpose computing, and then he came to NVIDIA and built CUDA. I was one of the first dozen or so people on CUDA, I think. When I got there it was barely programmable, and we built bits and pieces as we went.

What CUDA really tries to do is this: in any program, you've got parts that are ordinary serial work (open files, fetch something from an API on the internet) and you've also got parts like image processing. The parallel stuff, like the image processing, you can farm off to the GPU and run really efficiently; the internet stuff and the file stuff you leave to the CPU. We call it heterogeneous computing: this mix of parallel computing and serial computing, where CUDA's job is to unify the two.

Since it started out as a graphics card, it's obviously still great at graphics: there's ray tracing, there's rendering, there's all sorts of stuff. Fun fact: we started out with about 90% of the GPU being fixed-function hardware, texture mappers and pixel shaders, and about 10% of it programmable. Now it's the opposite: about 90% programmable and 10% fixed-function. Even for the graphics pipelines, it turns out that everyone wants procedural textures and all sorts of things.

But what's really interesting to me is that the way graphics works is very similar to the way fluid mechanics works, which is very similar to the way AI works. There are obviously different emphases, but the same set of problems is encountered by the same set of groups, which I think says something about the way these numerical algorithms all work in general. The AI world is probably the newest to the party. CUDA started out working with supercomputing, and supercomputing is a very varied space: they do everything from weather simulation to quantum mechanics to electrodynamics and everything in between. But again, the fundamental algorithms map very similarly to the AI world. The AI world is much more linear-algebra heavy, and you see Fourier transforms in there among other things, but a lot of the algorithms match. I think the AI people put a bigger emphasis on performance tuning and optimization. That's not to say the supercomputing folks don't want to go fast, but they've got such a mix of things that it's really hard to tune for the last flop of your petaflops, or whatever it is now. Whereas the AI folks run these things at such a big scale that it's really worth doing. So in some ways the job has got harder, because we're covering more bases, but it's really interesting to see the similarity between them all.

What was CUDA written in?

CUDA is written in C. The underlying drivers and software stack are C, and it's a huge software stack now. It started out as just a language and a compiler, and then you build more and more things on top of it. At this point, CUDA is not just a language for programming the GPU; it's a whole suite of things where any way you come into contact with the GPU, you're going through CUDA in some way or another.
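To make that heterogeneous split concrete, here is a minimal CUDA C++ sketch of the pattern described above: serial work stays on the CPU while a data-parallel loop is farmed out to the GPU as a kernel. The brighten-the-pixels task is an illustrative stand-in for "some image processing", not something from the interview.

```cuda
#include <cstdio>

// Parallel part: one GPU thread per pixel (illustrative image-processing task).
__global__ void brighten(unsigned char *pixels, int n, int amount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = pixels[i] + amount;
        pixels[i] = v > 255 ? 255 : (unsigned char)v;
    }
}

int main() {
    // Serial part stays on the CPU: open files, call web APIs, and so on.
    const int n = 1 << 20;            // pretend these pixels came from a file
    unsigned char *img;
    cudaMallocManaged(&img, n);       // managed memory, visible to CPU and GPU
    for (int i = 0; i < n; i++) img[i] = (unsigned char)(i % 200);

    // Parallel part is steered to the GPU: one thread per pixel.
    brighten<<<(n + 255) / 256, 256>>>(img, n, 40);
    cudaDeviceSynchronize();          // wait for the GPU to finish

    printf("first pixel after brighten: %d\n", img[0]);
    cudaFree(img);
    return 0;
}
```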
So we've got image processing libraries and artificial intelligence libraries and compilers and all these other things going on, because you want to use the best tool for the job. I really see it as our job at NVIDIA to write a million lines of code so that the user just writes one.

So it's like an abstraction, and you can call it from something else? Somebody could be using Python, and they go, "Okay, I need to do something with the GPU," so they invoke CUDA in some way. Would that be fair?

Yeah, that's about right. Take the image processing libraries, for example: you've got a Python program that you want to do some parallel image processing in, so you just call one of these libraries. It's a Python call; it looks like your Python program.

Where does it tie into the hardware? Where does the software end and the hardware begin? Or is that too complicated a question?

No, I think we can do something. Let's try drawing a picture. In the old days you used to have your CPU, and that's all you had. Then what NVIDIA did was add this GPU at the side, which lots of people have, because we all like playing games. What CUDA does is see these two things as one. There's a connection between the two, but it means that when you're writing a program to target these things (I gave an example earlier), imagine that first you're going to load some kind of config file, then you're going to fetch something from an API on the internet, and then you're going to do some image processing. CUDA lets you say: I'm going to have the CPU load the file; I'm going to do the API fetch from the CPU, because it's connected to the internet; and I'm going to send the image processing to the GPU. You can literally just tell it that this instruction goes here and that instruction goes there. It doesn't do it for you automatically; CUDA doesn't know what you're trying to do, you know best. We just give you the tools to make it all look like one program and to send each piece in the direction that you need. Under the covers there's all this complicated software stack, drivers and other things like that, but from the programming perspective it's giving you the ability to take your normal data and steer it wherever you want.

What we have is all these libraries. The libraries do AI, they do supercomputing and scientific computing, we've got graphics APIs, we've got data analysis APIs. There's something like 900 different libraries and AI models and other things for doing this, and you just pick whichever one fits, depending on what your data is. In a way, CUDA is all of these things together, plus all of the software stack that lives underneath: down here is the CUDA driver stack, and up here are the CUDA libraries and their APIs and SDKs and frameworks. They combine into a system where, whatever your program is, you've probably got the right thing for the job.
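A minimal sketch of that "million lines of code so the user writes one" idea, using cuBLAS, one of those CUDA math libraries: the single cublasSgemm call below dispatches a whole matrix multiplication to NVIDIA's tuned GPU kernels. The helper function and sizes are illustrative, and error handling is omitted.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Multiply two n x n matrices on the GPU: C = A * B.
// Note: cuBLAS uses column-major storage, like classic BLAS.
void gpu_matmul(const float *A, const float *B, float *C, int n) {
    float *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(float);
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // The "one line" the user writes; the library does the rest.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```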
And is there, like, a hello world in CUDA? Could you write hello world in CUDA? Is there an equivalent?

There's actually a hello world in CUDA. The CUDA C++ language is a completely regular C++ language with a few extended bits, so you can just use printf like you would from normal C, and in Python you can just use print. In fact, funny story: the first week I got to NVIDIA, Ian Buck said, "Go and do something interesting by Friday." I asked, "How do I debug this thing?" and they said, "You don't." So I wrote printf for CUDA in my first week here. It's the most useful thing I've ever done in 15 years.

Fantastic.
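For reference, a CUDA C++ hello world along those lines looks something like this: an ordinary main, plus one kernel that calls printf from GPU threads. Compile it with nvcc.

```cuda
#include <cstdio>

// __global__ marks a function that runs on the GPU (a "kernel").
__global__ void hello() {
    // Device-side printf: the feature described in the story above.
    printf("Hello, world from GPU thread %d!\n", threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();        // launch one block of four GPU threads
    cudaDeviceSynchronize();  // wait for the GPU and flush its printf output
    printf("Hello from the CPU side, too.\n");
    return 0;
}
```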
So where have we come from, there to now? Because you've said it's been around a long time. Have you just been adding and adding and adding, or have there been different versions? Is it different from, say, the GPUs we had back then to the GPUs we have now?

You can really draw a straight line from the very first version of CUDA to today. We are very adamant that no matter how the hardware changes, CUDA 1.0 still runs today, and CUDA 13 is coming out later this year. So, 19 or 20 years later, it all still works. That's both a commitment from the hardware teams who build the GPU and from the software teams who make sure that all of the API structure and everything stays the same. Jensen Huang, the CEO of NVIDIA, made this decision: we are going to invest so that CUDA is everywhere, in every chip, all the time. That means that as we build new chips we're always asking: does CUDA still work? Does all the old stuff still work? So it's evolved, of course, and it's grown, but you can literally run the old stuff all the way through to today and on into the future. That's non-negotiable.

Does that make things difficult for security purposes? You've got work to do to make sure that old stuff still works.

Security is always difficult. I used to work for the military many years ago, and I learned that security is painful, so you should do it right or not do it at all. We've actually spent a whole ton of time and effort over the last few years creating this thing called confidential computing, where there's a fully secured, encrypted channel. I drew that CPU-and-GPU picture earlier: the connection between them goes over PCIe buses, and there are things you can snoop. We can fully encrypt between the two. So you can be inside a trusted compute network, fully encrypted end to end, zero trust, assuming bad actors can get at the hardware. Because people spend millions of dollars training these AI models, and if someone can just go and rip off the weights, that's not going to be okay. So they call it confidential computing, and you're actually seeing a lot of encryption hardware go into CPUs as well as GPUs to make this kind of thing work.

Just before, you said about the backwards compatibility of CUDA: that's what the user, or the developer, sees. But how much is going on under the surface? How fast is this swan paddling to keep CUDA looking good?

Oh my god. Well, I can draw you another picture; I'll take another piece of paper. Here's the universe as I see it (of course I'm in the middle of the universe, for myself), and in the middle is CUDA. Up out of here is this enormous amount of software: frameworks, applications, libraries, all those types of things I've listed before. This is the software universe, and anyone who wants to contact the GPU, as I said at the beginning, comes through CUDA in some way. At the bottom it fans out to all this hardware, which we pretend really hard is all the same, but of course it's not, because hardware never is. You've got GeForce cards, you've got data center cards, you've got mobile things running self-driving cars, all sorts of stuff down here. So you've got a dozen things at the bottom and a thousand things at the top, and everything funnels in and out of CUDA. We're at the central point, pretending to the high-level stuff that the low level is all one thing, and pretending to the low-level stuff that the high level is all one thing. It's a big illusion, really: you've got operating system differences, you've got hardware differences.

Is CUDA a bit like a kernel in its own right?

It is kind of like a kernel. We call it a runtime, because of what it does; it's almost like an interpreter. It takes the commands that you give it and turns them into the command stream that the hardware needs to control it. Deep down inside, the hardware in many respects still looks like graphics. It was originally built to push pixels to your screen, so there are these deep pipelines for graphics primitives and pixels. It turns out that pipelines for pixels are pipelines for matrix operations in AI, or pipelines for fluid mechanics, or any of these things, and those same pipelines, much beefed up now, are what CUDA drives. So when you say "do a job", down here there's a series of compilers which create the binary files, there's a set of runtimes that control the hardware and dispatch the work to it, and then there's the assembly language at the bottom which actually runs the program. In many ways it's like a CPU, in that your program sees a regular set of C libraries and things like that, but under the covers, long story short: yes, it is kind of like a kernel.
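As a rough sketch of those layers, here's what touching the GPU through CUDA's low-level driver API looks like: load a compiled module (PTX, the portable intermediate GPU assembly), look up a kernel inside it, and hand the runtime a launch to turn into the hardware's command stream. The file name "kernel.ptx" and kernel name "hello" are placeholders, and error checking is omitted.

```c
#include <cuda.h>   // the CUDA driver API, the layer beneath the libraries

int main() {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load compiler output and find a kernel inside it
    // ("kernel.ptx" and "hello" are placeholder names for this sketch).
    cuModuleLoad(&mod, "kernel.ptx");
    cuModuleGetFunction(&fn, mod, "hello");

    // One block of 32 threads, no shared memory, default stream, no arguments:
    // the runtime turns this call into commands the hardware consumes.
    cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, NULL, NULL);
    cuCtxSynchronize();

    cuCtxDestroy(ctx);
    return 0;
}
```

Most programs never write this layer by hand; the runtime API, the libraries, and the frameworks above it all sit on top of this.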