About 20 years ago, NVIDIA had these GPUs just for rendering lots of pixels (they still have them), and a guy called Ian Buck said, for his PhD, "What if we use this for fluid mechanics?" His whole PhD was built on turning graphics processing into normal computing, and he came to NVIDIA and built CUDA. I was one of the first dozen people on CUDA, I think, so when I got there it was barely programmable, and we sort of built bits and pieces as we went.

What it tries to do is this: in any program you've got bits that are normal, you know, open files, fetch an API from the internet, and you've also got bits like "now do some image processing". The parallel stuff, like the image processing, you can farm off to the GPU and do really efficiently, and the internet stuff and the file stuff you leave to the CPU. We call it heterogeneous computing: it's this mix between parallel computing and serial computing, and CUDA's job is to unify those.

The fact that it started out as a graphics card, well, it's obviously still great at graphics: there's ray tracing, there's rendering, there's all sorts of stuff. Fun fact: we started out with about 90% of the GPU's hardware being fixed-function hardware, texture mappers and pixel shaders, and about 10% of it was programmable. Now it's the opposite: about 90% programmable and 10% fixed-function hardware, even for the graphics pipes. It turns out that everyone wants procedural textures and all sorts of things.

But what's really interesting to me is that the way graphics works is very, very similar to the way fluid mechanics works, which is very similar to the way AI works. There are obviously different emphases, but the same set of problems is encountered by the same set of groups, which I think says something about the way these numerical algorithms all work in general. The AI world is probably the newest to the party. We started out CUDA working with supercomputing, and supercomputing is a very varied space: they do everything from weather simulation to quantum mechanics to all sorts of things in between, electrodynamics and everything else. But again, the fundamental algorithms map very similarly to the AI world. The AI world is much more linear-algebra heavy; you see Fourier transforms in there and other things, and a lot of the algorithms seem to match. I think the AI people have a bigger emphasis on performance tuning and optimization. That's not to say the supercomputing guys don't want to go fast, but they've got such a mix of things that it's really hard to tune to the last flop of your petaflops, or whatever it is now. Whereas the AI guys run these things at such a big scale that it's really worth doing. So in some ways the job has got harder, because we're covering more bases, but it's really interesting to see the similarity between them all.

What was CUDA written in?

CUDA's written in C. The underlying drivers and software stack are C. It's a huge software stack now; it started out really as just a language and a compiler, and then you build more and more things on top of it. At this point CUDA is not just a language for programming the GPU, it's this whole suite of things, where any way you get contact with the GPU, you're going through CUDA in some way or another. So we've got image processing libraries and artificial intelligence libraries and compilers and all these other things going on, because you want to use the best tool for the job. I really see it as our job at NVIDIA to write a million lines of code so that the user just writes one.
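To make that CPU/GPU split concrete, here's a minimal sketch of a heterogeneous program in CUDA C++: the CPU does the serial part (standing in for loading a file), and the GPU does the data-parallel part. The `brighten` kernel and the synthetic "image" are illustrative assumptions, not a real CUDA library API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: one GPU thread per pixel, each brightening its pixel.
__global__ void brighten(unsigned char* pixels, int n, int amount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = pixels[i] + amount;
        pixels[i] = v > 255 ? 255 : (unsigned char)v;
    }
}

int main()
{
    const int n = 1 << 20;              // a synthetic 1M-pixel grayscale image
    unsigned char* pixels;
    cudaMallocManaged(&pixels, n);      // memory visible to both CPU and GPU

    // Serial part, on the CPU: stand-in for "load a config file / an image".
    for (int i = 0; i < n; ++i) pixels[i] = i % 200;

    // Parallel part, farmed off to the GPU: 256 threads per block.
    brighten<<<(n + 255) / 256, 256>>>(pixels, n, 40);
    cudaDeviceSynchronize();            // wait for the GPU before the CPU reads

    printf("first pixel after brighten: %d\n", pixels[0]);
    cudaFree(pixels);
    return 0;
}
```

The same source file holds both halves; nvcc compiles the kernel for the GPU and the rest for the CPU, which is the "one program" view described above.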
So it's like an abstraction you can call from something else? Somebody could be using Python, and then they go, okay, I need to do something with the GPU, so they invoke CUDA in some way. Would that be fair?

Yeah, that's about right. Take image processing libraries, for example: you've got a Python program you want to do some parallel image processing in, and you just call one of these libraries. It's a Python call; it looks like your Python program.

Where does it tie into the hardware? Where does the software end and the hardware begin, or is that too complicated a question?

No, I think we can do something. Let's try drawing a picture. In the old days you used to have your CPU, and that's all you had. Then what NVIDIA did was add this GPU at the side, which lots of people have, because we all like playing games. What CUDA does is see these two things as one. There's a connection between the two, but it means that when you're writing a program to target these things... I gave an example earlier: imagine that first you're going to load some kind of config file, then you're going to fetch an API from the internet, and then you're going to do some image processing. CUDA lets you say: I'm going to have the CPU load the file; I'm going to do the API fetch from the CPU, because it's connected to the internet; and I'm going to send the image processing to the GPU. You can literally just tell it "this instruction goes here, that instruction goes there". It doesn't do it for you automatically; CUDA doesn't know what you're trying to do, you know best. We just give you the tools to make it all look like one program and send each thing in the direction that you need. Under the covers there's all this complicated software stack and drivers and other things like that, but really, from the programming perspective, it's giving you the ability to take your normal data and steer it wherever you want.

What we have is all these libraries: the libraries do AI, they do supercomputing stuff, scientific computing; we've got graphics APIs; we've got data analysis APIs. There's something like 900 different libraries and AI models and all these other things, and you just pick whichever one you need depending on what your data is. In a way, CUDA is all of these things together, plus all of the software stack that lives under here. So in here is the CUDA driver stack, and this is the CUDA libraries and their APIs and SDKs and frameworks and all sorts of things, and they combine together to make a system where, whatever your program is, you've probably got the right thing for the job.

And is there a hello world in CUDA? Could you write hello world in CUDA, is there an equivalent?

There's actually a hello world in CUDA. The CUDA C++ language is a completely regular C++ language with a few extended bits, so you can just use printf like you would from normal C; in Python you can just use print. In fact, funny story: the first week I got to NVIDIA, Ian Buck said, "Go and do something interesting by Friday," and I was like, "How do I debug this thing?" And they said, "You don't." So I wrote printf for CUDA in my first week here. It's the most useful thing I've ever done in 15 years.
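And that hello world really is just regular C++ plus a kernel launch. A minimal sketch, using the standard device-side printf discussed above (error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread prints its own identity.
__global__ void hello()
{
    printf("Hello, world from GPU thread %d of block %d\n",
           threadIdx.x, blockIdx.x);
}

int main()
{
    hello<<<2, 4>>>();           // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();     // wait for the GPU and flush its printf buffer
    printf("Hello, world from the CPU\n");
    return 0;
}
```

Compiled with something like `nvcc hello.cu`, this prints eight lines from the GPU and one from the CPU; the cudaDeviceSynchronize call is what flushes the device-side printf buffer back to the host.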
Fantastic! So where have we come from, there to now? Because you've said it's been around a long time. Have you just been adding and adding and adding, or have there been different versions? Is it different from, say, the GPUs we had back then to the GPUs we have now?

You can really draw a straight line from the very first version of CUDA to today. We are very adamant that no matter how the hardware changes, CUDA version 1.0 still runs today, and CUDA 13 is coming out later this year. So 19 or 20 years or something later, it all still works. That's both a commitment from the hardware teams who build the GPU and from the software teams who make sure that all of the API structure and everything stays the same. Jensen Huang, the CEO of NVIDIA, made this decision: we are going to invest so that CUDA is everywhere, in every chip, all the time. That means that as we build new chips we're always factoring in: does CUDA still work? Does all the old stuff still work? So yes, it's evolved, of course; it's grown. But you can literally run the old stuff all the way through to today and all the way into the future. That's non-negotiable.

Does that make it difficult for security purposes, doing that? I mean, you've got work to do to make sure that old stuff still works.

Security is always difficult. I used to work for the military many years ago, and I learned that security is painful, so you should do it right or not do it at all. We actually spent a whole ton of time and effort over the last few years creating this thing called confidential computing, where there's a fully secured, encrypted channel. Remember the GPU and CPU picture I drew? The connection between them goes over PCIe buses, and there are things you can snoop there, so we can fully encrypt between the two. You can be inside a trusted compute network, you can have it fully encrypted, zero trust, assuming the bad actors can get at the hardware, and it can be fully encrypted end to end. Because, you know, people spend millions of dollars training these AI models, and if someone can just go and rip off the weights, that's not going to be okay. So they call it confidential computing, and you're actually seeing a lot of encryption hardware go into CPUs as well as GPUs to make this kind of thing work.

Just before, you said about the backwards compatibility of CUDA and all that side of it. That's what the user, or the developer, sees. But how much stuff is going on under the surface? How fast is this swan paddling to keep CUDA looking good?

Oh my god. Well, I'll take another piece of paper. Here's the universe as I see it. In the middle of the universe (because of course I'm in the middle of the universe, for myself) is CUDA, and up out of here is this enormous amount of software: frameworks, applications, libraries, all those types of things I've listed before. This is the software universe, and anyone who wants to contact the GPU, as I said at the beginning, comes through CUDA in some way. Now, at the bottom it fans out to all this hardware, which we pretend really hard is all the same, but of course it's not, because hardware never is. So what you've got is GeForce cards, and data center cards, and mobile things doing self-driving cars, and all sorts of stuff down here: a dozen things down here, a thousand things at the top, and everything funnels in and out of CUDA. We're sort of at the central point, pretending to the high-level stuff that the low level is all one, and pretending to the low-level stuff that the high level is all one. It's a big illusion, really, in some ways: you've got operating system differences, you've got hardware differences.
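That "pretend it's all the same" trick is visible even from a tiny program: the same runtime calls work on any of those chips, and you can ask which one you actually got. A minimal sketch using the standard device-query API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One binary, many possible GPUs underneath: ask the runtime what's there.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

The same code reports a GeForce card on a gaming PC and a data center GPU on a cluster; the fan-out to the different hardware happens below the API.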
But is it also a bit like a kernel in its own right, CUDA?

It is kind of like a kernel. We call it a runtime, because what it does is almost like an interpreter: it's taking the commands that you give it and turning them into the command stream that the hardware needs to control it. Deep down inside, the hardware in many respects still looks like graphics. The hardware was originally built to push pixels to your screen, so there are these deep pipelines for graphics primitives and pixels. It turns out that pipelines for pixels are pipelines for matrix operations in AI, or pipelines for fluid mechanics, or any of these things, and those same pipelines, much beefed up now, are what CUDA drives. So when you say "do a job", down here there's a series of compilers which create the binary files, there's a set of runtimes that control the hardware and dispatch the work to it, and then of course there's the assembly language at the bottom which actually runs the program. In many ways it's like a CPU, in that your program layer sees a sort of regular set of C libraries and things like that, but under the covers... I guess, long story short: yes, it is kind of like a kernel.
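As a rough illustration of that lower layer, here's a sketch using the CUDA driver API, where a program explicitly loads compiled GPU code and dispatches it itself. The file name `kernel.ptx` and the kernel name `my_kernel` are hypothetical, standing in for something compiled separately (for example with `nvcc -ptx`); error checking is omitted.

```cuda
#include <cstdio>
#include <cuda.h>   // the driver API; link with -lcuda

int main()
{
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Load a separately compiled module and look the kernel up by name.
    CUmodule mod;
    cuModuleLoad(&mod, "kernel.ptx");            // hypothetical file name
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "my_kernel");  // hypothetical kernel name

    // Dispatch: roughly the point where API calls become the command
    // stream that drives the hardware.
    cuLaunchKernel(fn,
                   1, 1, 1,     // grid dimensions
                   32, 1, 1,    // block dimensions
                   0, nullptr,  // shared memory bytes, stream
                   nullptr,     // kernel arguments (none in this sketch)
                   nullptr);    // extra options
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

The `<<<...>>>` launch syntax in CUDA C++ compiles down to essentially this kind of call; most programs never see this layer, which is part of the illusion.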