Okay, let's get started. Good afternoon everyone, and welcome to another lecture in Fundamentals of Computer Architecture, lecture 8. I'm Muhammad, an official co-instructor of this course, and today we're going to talk about GPU architectures. We're going to build a lot on the topics we covered last week on SIMD execution; we saw that vector processors and array processors are quite powerful in this domain. I have worked a lot on GPU architectures, and the topic is actually quite close to my heart: my PhD thesis is all about GPU architecture and how to optimize its energy efficiency and register file, so I'm quite excited to be teaching my first FoCA lecture on GPU architectures.

So far we have covered several different execution paradigms in this course. I think we haven't covered systolic arrays yet; we will probably get to those in later lectures. Last week we started talking about SIMD processing with vector and array processors, and today we're going to talk about graphics processing units, GPUs. There are also some readings for this week's lecture. But first, to jog your memory, I'm going to quickly go over some of the important concepts we covered last week.

Remember that SIMD processing is all about exploiting the data-level parallelism that exists in many important applications, which is also called regular parallelism. Think about operating on vectors and arrays, for example adding two vectors element-wise: that is the kind of application that benefits from SIMD. We covered Flynn's taxonomy of computers, and we saw that SIMD, single instruction multiple data, is a great way to exploit data-level parallelism, with array processors and vector processors as very good examples of how to implement it in hardware. SIMD is very efficient because you amortize the instruction fetch over many operands: you fetch one instruction, decode it once, and then execute it on many, many operands, which provides a lot of efficiency and parallelism.

We talked a lot about array processors and vector processors. In an array processor we have multiple processing elements, each of them fairly heavyweight in the sense that it can perform many kinds of operations, and you assign your instructions across these processing elements. At a given time you can perform load operations in different places, on different processing elements, and in the next cycle you move on to the next instruction, and so on. So you have the same operation across space at the same time, and different operations in the same space over time, because your processing elements are powerful enough to do many kinds of operations. Vector processors, on the other hand, are one good way to reduce that cost.
In a vector processor you have different functional units, each specialized for certain operations such as load, addition, multiplication, and store, and you pipeline through them. On the load unit, for example, you pipeline load operations across the different array elements in consecutive cycles, and once your pipeline is full you are effectively executing different operations at the same time, which gives you parallelism and performance. That's the difference between array and vector processors.

To make them effective, you really need to feed the pipeline with data. That's why array and vector processors rely on highly parallel memory and memory banking: you need a lot of bandwidth to access memory, and not only for off-chip memory accesses; you also need high bandwidth from the register file when you want to read registers and bring operands into the pipeline. We talked about memory banking, and we're going to learn a lot more about the memory system later; you will see that it is one of the most important parts of a computer system nowadays.

And this is a very good example of how we can combine these paradigms, array processor and vector processor. Here we have a vector processor: you have a processing element, and you feed the different elements of your arrays through its pipeline. But then you can replicate it: you have different instances of these processing elements laid out in space, you partition your data and map the corresponding elements to each processing element, and with that you have something like an array processor in which each element is itself a vector processor. That's how you combine the two paradigms nicely together.

I'm going through these slides relatively quickly because they are reminders, and you're going to see all of them again today. This one in particular we're going to see a lot when we look at how GPUs run these operations, so I'm not going to repeat it now.

We also talked about the advantages of these architectures, SIMD array processors and vector processors, but it's good to know the issues as well. The main problem is that these architectures are only effective for certain workloads. They are still general purpose, so you can run any application on them if you want, but to get good speedup and good efficiency you really need an application with good data-level parallelism, like matrix-matrix multiplication, for which they provide very good performance. If not, your performance will be limited: for example, instead of having a vector with many active lanes and elements, most of the time you end up operating on only one element, which is not really useful for such an architecture.
So basically these architectures are very inefficient when the parallelism is irregular. A very good example is pointer chasing, for instance searching for a key in a linked list; you can also read this note from Fisher in the paper on the slide. We also said that if the application is regular, then we can automate exploiting this data-level parallelism, meaning SIMD instruction generation can be automated. We worked through this example where the code is so simple that you can easily generate SIMD instructions: you have a for loop that adds two vectors element-wise, which is perfectly parallelizable.

We also said that vector and SIMD machines are good at exploiting regular data-level parallelism, but the performance improvement is limited by the vectorizability of the code, and he brought up Amdahl's law again: in the end, your performance may be limited by the serial portion of your application. Today we're going to learn about the many existing ISAs that include SIMD operations; that's one of the things I'm going to cover in more detail. This slide on Amdahl's law is just a reminder; we should not forget Amdahl's law when we design architectures and when we program.

Okay, so let's see how these SIMD operations appear in modern ISAs. The ISAs we had in the past handled scalar operations, and during the 1980s or so people realized that they needed these kinds of SIMD instructions in order to accelerate applications like graphics. The idea was to build on the existing scalar operations: you already have an instruction that adds two 32-bit registers and stores the result in a third register. Then came the nice idea of chunking the register. For example, here they chunk this register into four elements; now we can call it a vector register, because you can consider it as holding four elements, and you just do the addition. The only thing you really need to do is not propagate the carry from one element into the next. If you don't propagate those carries, you get an element-wise addition of the two small arrays, so you can easily extend an existing ISA to support vector operations.

People had done similar things before for other operations, for example BCD, binary-coded decimal: to encode the number 89 you don't encode it as one binary number, you encode the 8 in four bits and the 9 in another four bits, and when you operate on that encoding you also need special support, because carries must not propagate in the usual way, or must propagate differently. That kind of experience helped people build on this idea quickly. This is the example I mentioned with four 8-bit numbers. And yes, as I said, you do need to modify the ALU a bit, but it's certainly not a big modification.
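To make the carry-suppression idea concrete, here is a minimal sketch in plain C++ (it also builds under nvcc), not an actual MMX instruction: it emulates an element-wise add of four packed 8-bit lanes inside one 32-bit word by masking off the lane boundaries so no carry crosses from one element into the next. The function name and test values are mine.

```cuda
#include <cstdint>
#include <cstdio>

// Add four 8-bit lanes packed in a 32-bit word, without letting a carry
// cross from one lane into the next (each lane wraps around modulo 256).
uint32_t packed_add_u8x4(uint32_t x, uint32_t y) {
    const uint32_t LOW7  = 0x7F7F7F7Fu;  // low 7 bits of every lane
    const uint32_t HIGH1 = 0x80808080u;  // top bit of every lane
    // Add the low 7 bits of each lane; no carry can escape a lane here.
    uint32_t low = (x & LOW7) + (y & LOW7);
    // The top bit of each lane is x ^ y ^ carry_in, restored with an XOR.
    return low ^ ((x ^ y) & HIGH1);
}

int main() {
    // Lanes: 0x01+0x02, 0xFF+0x01 (wraps to 0x00), 0x7F+0x01, 0x10+0x20
    uint32_t r = packed_add_u8x4(0x01FF7F10u, 0x02010120u);
    printf("%08X\n", (unsigned)r);  // expected: 03008030
    return 0;
}
```

The hardware change is of exactly this flavor: the wide adder just gets the ability to cut its carry chain at the element boundaries.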
Intel was, I guess, one of the first to do this. They described this ISA extension and called it MMX, multimedia extensions, if I'm not mistaken; later they changed the name for various reasons, including marketing. It is actually a very interesting extension. In their normal scalar operation they have 64-bit registers, so you can do addition, multiplication, and similar operations on a 64-bit scalar. But you also have the option of chunking that 64-bit register into two 32-bit words, or in other words treating it as a vector register with two elements. You can go further and chunk it into four or eight elements. Having eight 8-bit chunks is quite useful for image processing, because pixels are usually eight bits, so you can use it nicely for graphics operations.

Interestingly, they didn't need a vector length register. The vector length, that is, how the register is chunked, is specified by the opcode of the instruction: the opcode determines the data type, so you can have eight 8-bit elements, four 16-bit elements, and so on. And when you access memory, the stride is always equal to one. Remember from last lecture that for each load you need to compute an effective address, and in a vector machine you also have a stride; here the stride is fixed to one and you cannot change it.

Student: Is the stride because of the eight bits within one chunk, or because of the next chunk next to it?

No, it's not related to that. Imagine you have a load operation: you calculate the effective address for this element, and to go to the next element you just increment the address by one element. You can learn more about this ISA from this paper.

Here is an example instruction from this ISA, the packed compare-greater-than word; that's how the instruction is written, packed compare equal or greater than. In this example you have four 16-bit elements and the comparison is done element-wise. Another interesting feature is the support for multiply-add, which is quite useful when you want to do a dot product. In a dot product you multiply and then do a reduction sum, and this ISA provides that: you multiply the two vector registers element-wise, and then adjacent products are summed into two positions here. You can see how to use that, for example, to implement a dot product. If you're interested you can check it in more detail; it's not magic once you know the instruction and how the elements are mapped.
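As a hedged illustration of what a packed multiply-add of this flavor computes (my own scalar reference in C++, not Intel's intrinsic nor its exact corner-case behavior): four signed 16-bit products are formed and adjacent pairs are summed into two 32-bit results, which is the building block of a dot product.

```cuda
#include <cstdint>
#include <cstdio>

// Packed multiply-add over four signed 16-bit lanes:
// out[0] = a[0]*b[0] + a[1]*b[1],   out[1] = a[2]*b[2] + a[3]*b[3]
void packed_madd_i16x4(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

int main() {
    int16_t a[4] = {1, 2, 3, 4};
    int16_t b[4] = {5, 6, 7, 8};
    int32_t out[2];
    packed_madd_i16x4(a, b, out);
    // 1*5 + 2*6 = 17 and 3*7 + 4*8 = 53; add them for the full dot product (70)
    printf("%d %d -> dot = %d\n", out[0], out[1], out[0] + out[1]);
    return 0;
}
```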
Here is also an interesting example they provided. Unfortunately the original figure isn't in color, so we added some color here. There is a lady photographed against a blue background, and we have another image, a blossom background, and essentially you want to composite the lady onto that background so she gets a nicer backdrop. If you wrote it in C in a scalar way, you would have a loop that checks every pixel of the image: if the pixel is blue, you replace it with the corresponding pixel of the new background image, and if not, you leave it alone; in the end you have a picture with a different background.

You can implement this easily with MMX. For example, you have one register initialized with the blue value, the pixel value of the blue color, and another register, say mm3, loaded with the data of this image X. You compare it against blue and generate a mask from that comparison, which you store in mm1. With that mask you can do the compositing. In mm4 you have the background image, the blossom background, and you do a packed AND between it and the mask in mm1; the result has zeros in the positions where you don't want the background and the blossom pixels in the positions where you do. You do the same for your original image, but there you AND with the negation of the mask, which is why there is an AND-NOT operation; the result keeps your original pixels in the places you don't want to replace and zeros in the places where the background goes. Then you just OR these two together, and you get your new picture in mm4.

This is the code; you can check it if you are interested in this ISA extension. Of course, your picture may have thousands of pixels and this architecture only processes eight pixels at a time, so you also need a loop to go over all the pixels and finish the picture. Does it make sense? Good.
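Here is a small scalar sketch of the compare-and-mask trick used in that example, written for a single 8-bit channel per pixel; the real MMX code does the same thing on eight packed pixels at once with the packed compare, AND, AND-NOT, and OR instructions. The key value and pixel values below are made up for illustration.

```cuda
#include <cstdint>
#include <cstdio>

// Branch-free chroma key on one 8-bit pixel: keep the foreground pixel unless
// it equals the key color, in which case take the background pixel.
// mask is all-ones where fg == key, else all-zeros (what a packed compare produces).
uint8_t chroma_key(uint8_t fg, uint8_t bg, uint8_t key) {
    uint8_t mask = (fg == key) ? 0xFF : 0x00;      // packed compare-equal
    return (uint8_t)((bg & mask) | (fg & ~mask));  // AND / AND-NOT / OR
}

int main() {
    const uint8_t BLUE = 0x1D;                   // the "blue screen" key value
    uint8_t fg[4] = {0x1D, 0x42, 0x1D, 0x99};    // lady in front of the blue screen
    uint8_t bg[4] = {0xA0, 0xA1, 0xA2, 0xA3};    // blossom background
    for (int i = 0; i < 4; ++i)
        printf("%02X ", chroma_key(fg[i], bg[i], BLUE));  // A0 42 A2 99
    printf("\n");
    return 0;
}
```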
You can also read this note about the Intel Pentium MMX: Intel said it planned to implement MMX technology on future Pentium and Intel architecture processors, and they still support such extensions today, of course under different names, which I'm showing here. MMX used 64-bit MMX registers for integers. Later they introduced SSE, streaming SIMD extensions, in several generations: the first generation had 128-bit XMM registers for integers and single-precision floating point, later generations added double-precision floating point, and so on.

After that they provided AVX, advanced vector extensions, which added, for example, 256-bit floating-point operations, and also FMA, fused multiply-add, which is quite important for dot products: you do the multiply and the addition as one fused operation, so there is no intermediate result that you need to store. Fusing is actually a nice piece of terminology: when you fuse two functions or two steps together, it means the output of one step is consumed immediately by the next step. You fuse two operations, or you fuse two kernels together. So when you see terminology like kernel fusion, for example in GPUs or in machine learning, remember this idea: the intermediate result that one step produces is consumed immediately by the next step.

Student: Is this for saving storage?

It's not only about storage space but also about time, because otherwise you need to store the intermediate result and then, for example, invoke another kernel. When you fuse, you can also pipeline the whole thing nicely.

After that they also provided AMX, advanced matrix extensions, because they wanted a tiled matrix-multiply unit; you can check that too if you're interested. I just wanted to give you an overview. SIMD operations are also used a lot in modern machine learning accelerators. I'm not going into the details; there are many different accelerators for machine learning that use SIMD, but I'll highlight some of them.

You may have seen this already in a prior lecture: Cerebras came up with this wafer-scale engine design. Normally a wafer contains many dies, which are cut apart and packaged individually; here they keep them on the wafer, interconnect the dies through switches, and use the whole wafer for computation. It's actually quite exciting. This is the Cerebras chip: you certainly need a lot of power to drive it, but it is also quite efficient in the sense that you normally lose a lot of power and efficiency when you go off-chip because of packaging overhead, whereas here the dies are connected with thin on-wafer wires, so you consume less power and your data movement is much faster.

Student: But is it really?

Some people question that, but it's not hard to imagine. I'm going to show you a slide about a SAFARI Live Seminar where we invited people from Cerebras to give a talk; you can watch that if you're interested. Of course, powering it up is the cheap part; cooling it down is actually very hard. And the thing is that this chip is very powerful.
If you want to do a comparable amount of computation with GPUs, you need a lot of GPUs, and for all those GPUs you also spend a lot of power; the main difference is that the GPUs can be more physically distributed, so cooling may be easier, whereas here you really need a very sophisticated cooling solution.

Student: Here we have everything on one wafer, but if we have manufacturing defects we lose a lot of our surface area, because we have to deactivate a lot of stuff. Wouldn't it be better to do something similar to what Apple does, connecting chiplets and then packaging the whole thing?

What Apple is doing is a well-known technique that other companies use as well, but with that approach you are much more limited in the number of dies you can package and interconnect together. The whole idea of the wafer-scale approach is that you can connect many dies together, so the two are not really comparable. But as you rightly mention, that is one of the issues: some of these dies may not work at all because of yield and other problems. One exciting thing about this architecture is that you can reconfigure the switches and simply bypass faulty dies, though of course the compute power of the wafer is reduced when fewer of them are active.

Why am I showing this? Because this architecture is considered MIMD, multiple instruction multiple data: each tile, each die, is doing computation on different layers of a neural network. But underneath, each of these elements is a SIMD engine. So you have a MIMD machine with distributed memory, connected by a 2D mesh interconnection fabric, but essentially it is a MIMD machine built out of SIMD processors; each of these small dies is a SIMD machine, which is quite exciting. If you want to learn more, I suggest you watch this SAFARI Live Seminar from 2022. Any questions?

So now I'm going to jump to graphics processing units, and we're going to see how GPUs combine array processing and vector processing, and the nice trick they use to make programming much easier. But I want to start with some motivational slides. This is the evolution of recent GPUs, from Volta, the architecture released in 2017 or so if I'm not mistaken, through Blackwell, NVIDIA's most recent architecture, which I believe was introduced around this time last year, in March or so. As you can see, these GPUs integrate more and more transistors; Blackwell has more than 200 billion transistors. They are getting beefier, bigger, and a lot more powerful at computation. But the problem is that for today's applications, one GPU is not enough, for two main reasons. One is that you need much more memory capacity than one GPU can provide, so people combine several GPUs to get the memory capacity they need; and when you provide that memory capacity, you also need to provide the computation power to go with it.
As a result, NVIDIA and others in industry combine GPUs and interconnect them to provide that performance, and they have improved a lot on how GPUs are connected. This is the architecture they provided with Ampere, using NVLink generation 3, then Hopper, then Blackwell. You don't need to know the details, but what they are showing is that the interconnect provides better and better bandwidth from generation to generation. As I already said, we basically need multi-GPU systems for inference at scale, and different interconnection networks to go with them. But fundamentally all GPUs are the same, and now I want to go into the fundamentals.

GPUs are SIMD engines underneath. The instruction pipeline operates like a SIMD pipeline, like an array processor or a vector processor. The main difference is that the programming is done using threads, not SIMD instructions, and that makes a big difference for us. To understand this we will look again at our parallelizable code example, which we have seen before. But first I'd like to distinguish between the programming model, which is at the software level, and the execution model, which is what happens in hardware.

The programming model refers to how the programmer expresses the code: sequential (von Neumann), data parallel using SIMD, dataflow, multithreaded (MIMD, SPMD), and so on. How many of you have already written parallel code? So that's the kind of programming model you use when you want to parallelize your code. The execution model refers to how the hardware executes the code underneath: it can be out-of-order execution, which we saw in previous lectures, a vector processor, an array processor, a dataflow processor, a multiprocessor, a multithreaded processor, and so on. The execution model can be very different from the programming model. For example, you can write your program in the von Neumann model, completely sequentially, and then run it on an out-of-order processor, which underneath executes the code out of order; it still has to commit and retire instructions in program order, but it gives you a lot of parallelism. Or take the SPMD model, single program multiple data, which means you run one single program on many different data elements: your program is, say, the addition of two arrays, and you run that program over millions or billions of data elements. SPMD is essentially a programming model, and it can be implemented by a SIMD processor, which is exactly what a GPU does: a GPU is hardware that implements the SPMD model on SIMD hardware.

Let's elaborate with this example. We have this vectorizable loop that we also used in the previous slides, and we want to examine three programming options to exploit the parallelism we have here. Looking at this code, you can see that we have very good parallelism: there are many iterations of this loop, and all of them are completely independent. So we will examine three ways. One is sequential code, SISD.
The other options are data parallel, for example using SIMD, and multithreaded, like MIMD or SPMD. If you start with sequential code, you can execute it on a pipelined, out-of-order processor, where independent instructions execute when ready: different iterations are present in the instruction window and can execute in parallel on multiple functional units, because all the iterations are completely independent. It's a very good example of how an out-of-order processor can easily extract parallelism for you. Basically, the loop is dynamically unrolled by the hardware: as you fetch and decode, the instruction window fills up with the load and add instructions from many iterations, so in a sense the hardware is unrolling the loop for you. You could also use a superscalar or VLIW processor to run this code. I'm going through all these examples to emphasize the difference between the programming model and the execution model.

Another option: we realize that each iteration is independent, so the programmer or the compiler can generate SIMD instructions for us. This code is easy to vectorize automatically, and then you execute the same instruction from all iterations across different data using SIMD. I'm skipping the details because you are already aware of them; this can be executed by a SIMD processor, a vector or array processor.

The third option: I realize that each iteration is independent, so I assign each iteration to a thread. The programmer or compiler generates a thread to execute each iteration, and each thread does the same thing, but on different data. This can be executed on a MIMD machine. This particular model is also called SPMD, because each thread is the whole program. In a general multithreaded paradigm you might have a program that spawns threads that do different things, then merge, and then you continue; but here all of your threads do exactly the same thing, a load and an addition, and that's why we call it single program multiple data.

And this can also be executed on a SIMD machine, which is what NVIDIA calls a SIMT machine, single instruction multiple thread. SIMT is quite nice terminology in my opinion: when you run SPMD on a SIMD machine, you can call it SIMT. It is a term that specifies the programming model and the execution model together, and that's why it's nice to have it.

So a GPU is a SIMD, or SIMT, machine, except that it is not programmed using SIMD instructions: it is programmed using threads, the SPMD programming model, and each thread executes the same code but operates on a different piece of data.
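Here is a minimal CUDA sketch of the SPMD option (names and sizes are mine; unified managed memory is used just to keep the host side short, and the explicit copy-based flow appears in a later sketch): the programmer writes scalar code for one thread, and each thread uses its ID to pick the loop iteration it is responsible for.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SPMD version of "for (i = 0; i < N; i++) C[i] = A[i] + B[i];"
// Every thread runs the same scalar code; its ID selects the data element.
__global__ void vec_add(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's iteration
    if (i < N)                                      // guard: N may not divide evenly
        C[i] = A[i] + B[i];
}

int main() {
    const int N = 1 << 20;
    float *A, *B, *C;
    cudaMallocManaged(&A, N * sizeof(float));
    cudaMallocManaged(&B, N * sizeof(float));
    cudaMallocManaged(&C, N * sizeof(float));
    for (int i = 0; i < N; ++i) { A[i] = (float)i; B[i] = 2.0f * i; }

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;  // one thread per loop iteration
    vec_add<<<blocks, threads>>>(A, B, C, N);
    cudaDeviceSynchronize();

    printf("C[1000] = %.1f\n", C[1000]);        // expect 3000.0
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```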
Each thread has its own context and can be treated, restarted, and executed independently. When you write GPU code, you think of yourself as writing scalar code, because you are writing the code for one thread, and you don't need to worry about the many different threads, because all of them do the same thing. In some parts of your code you might break that rule a bit; for example, if you want to do a reduction sum, you have a lot of threads, but the number of active threads shrinks as the reduction proceeds, so some threads do an operation while others do nothing (there is a small sketch of this after this passage). But overall, the threads execute the same code.

And a set of threads executing the same instruction is grouped together. All threads run the same code: you have one kernel that you invoke, or launch, for all threads, but that kernel contains many instructions, and at any given time there are threads that are executing exactly the same instruction because they are at the same program counter. Those threads are dynamically grouped by the hardware into a warp (warp is NVIDIA's terminology; AMD calls it a wavefront). That is exactly how the hardware dynamically builds a SIMD operation for you: you write code for threads, and the hardware groups those threads into a warp, and that warp is essentially what performs the SIMD operation underneath. Make sense?

Student: Is this hardware grouping complicated?

It can be. If you want to be extremely flexible and efficient, it can be very complicated, and people have looked into it a lot in research, but NVIDIA and AMD try not to make it too complicated; they make some trade-offs to keep it simple and still effective. I'll show some slides related to that. Very good question. Any other question?

Student: So multiple warps at different instructions are what is considered one block?

Our GPU kernel, as I'll show later, has many threads, and at the software level we group them into blocks of threads. The programmer just assigns blocks of threads to the GPU, and then the hardware builds the warps out of those blocks, so the programmer does not need to deal with warps at all. However, to program well and efficiently, the programmer does need to be aware of warps, because that awareness can give much better performance. That's why GPU programming is easy if you just want an implementation with some speedup; but if you want a very good speedup (I won't say the best speedup, that's a strong word), then the more you want to improve your code, the more of an expert you need to be and the more deeply you need to think, and at some point you may need a better algorithm to parallelize your code.
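Going back to the reduction case mentioned a moment ago, here is a small CUDA sketch (my own example, with made-up names and sizes) of a block-level sum in which the set of active threads halves at every step, so some threads of a warp do work while their neighbors sit idle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Block-level sum in shared memory: after each step, half as many threads
// stay active, so parts of each warp are idle while the rest keep adding.
__global__ void block_sum(const float* in, float* out) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)                 // only the first 'stride' threads work
            buf[tid] += buf[tid + stride];
        __syncthreads();                  // every thread reaches the barrier
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}

int main() {
    const int N = 256;
    float *in, *out;
    cudaMallocManaged(&in, N * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < N; ++i) in[i] = 1.0f;

    block_sum<<<1, N>>>(in, out);
    cudaDeviceSynchronize();
    printf("sum = %.1f\n", out[0]);       // expect 256.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```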
Okay, going back to our example: we have SPMD on a SIMT machine, and a warp is a set of threads that execute the same instruction, that is, that are at the same program counter. In this example, warp 0 can be at PC X, then that warp moves to the next PC and the next, and you can have many of these warps doing your operations for you. Any questions? Good.

So let's compare the SIMD and SIMT execution models a bit. SIMD is a single sequential instruction stream of SIMD instructions: you have one instruction and you stream it over many data elements, and each instruction specifies multiple data inputs; for example you have vector load, vector add, vector store, and you need to specify the vector length, either explicitly or through the opcode. The programmer or the compiler needs to deal with all of that, and not many applications are easily vectorizable, so sometimes the programmer has to put in a lot of effort to vectorize the code. SIMT, by contrast, is multiple instruction streams of scalar instructions: you have scalar operations, load, add, store, and you just specify the number of threads you need. When you program, you program one thread, you write scalar code, and that makes the whole thing easy.

There are two major SIMT advantages, which we are going to cover today. The first is that we can treat each thread separately: each thread can execute independently on any type of scalar pipeline, so you get MIMD-style processing. Each thread really can be separate, even though for efficiency and performance you want as many threads as possible doing exactly the same thing, so that your warp is fully active and you get a lot of performance. Still, for some applications it is essential at some point to treat threads separately: not all threads need to run together at the same time, and not all threads need to run exactly the same code at the same time. That gives you a lot of flexibility and, I would say, ease of programming. The second important SIMT advantage is that we can group threads into warps flexibly; we'll see a lot about that later. Let's start with the first one.

The way GPUs treat warps is an idea you have already seen: fine-grained multithreading, which I think we covered in lecture 6c; not fine-grained multithreading of warps, but fine-grained multithreading in general. The main idea of fine-grained multithreading is that whenever I have a stall in my pipeline, I don't try to eliminate the stall; I just do something else. You have a lot of threads, and you can quickly context switch among them, and to make context switching fast you need the context of all threads available on chip. That's why fine-grained multithreading usually needs a big register file: the registers of all threads must be available so you can switch between them quickly. GPUs essentially do the same thing.
GPUs have many warps available, and whenever a warp encounters a stall, for example a load waiting on a memory access, the hardware just switches to another warp, and with that you can keep the pipeline full without stalling. That doesn't always work, because we might not have enough eligible warps and memory access latencies can be quite long. That's why NVIDIA does not rely only on fine-grained multithreading of warps; they also added caches, a deeper cache hierarchy, to reduce memory access latency. But the main mechanism for hiding stalls is fine-grained multithreading. For example, in the Fermi architecture, announced in 2009 I believe, they added an L2 cache; they didn't have one before, and in 2009 they realized they needed it. Since then they have kept the L2 cache and made it bigger and bigger; I think in their recent architectures it's something like 60 MB.

So let's look at fine-grained multithreading of warps with a better example. Assume a warp consists of 32 threads. If you have 32K iterations, you need about a thousand warps to run your application. Warps can be interleaved on the same pipeline, which is fine-grained multithreading of warps. As an example, warp 0 can be at PC X, warp 1 can be at some other PC a bit later, and warp 20 can be at PC X+2 at some point; we are interleaving them. All threads within a warp are independent of each other; at least, that's what we would like. Sometimes that's not the case, which is why NVIDIA has also added instructions, like shuffle and exchange, so that threads within a warp can communicate. But the main idea is that we want all threads in a warp to be independent, and the warps themselves to be independent, so that we can run them in any order.

Student: So warp zero's stall just continues in the background?

Yes, and we have a warp scheduler to pick which warp runs; in this example, exactly, it happens very quickly that warps drift apart, mostly because of memory.

Student: What about greediness? A greedy warp: warp 2 context switches to warp 1, and warp 1 then hogs the GPU as long as it wants, so other warps starve.

I see, you're thinking about fairness of the scheduling. That's true. People doing research on warp scheduling try to make it both fair and high performance, because when you act too greedily, running instructions from one warp as much as possible, you can get into very bad situations; like in an OS scheduler, you want some notion of fairness, and there is a lot of research on warp scheduling.

This is just a reminder about fine-grained multithreading: this is the lecture we had in FoCA, and if you are interested you can also check this one from our DDCA course in 2022. Okay, let's have a break, but not for the whole time; let's be back at 3:10, in nine minutes. Great, thank you.
Okay, it's almost 3:10, so let's get started again. We were discussing warps. A warp is a set of threads that execute the same instruction; we can have many warps, and we schedule across them, and each warp runs on a SIMD pipeline in which each lane performs a scalar thread execution. So the back end of execution in a GPU is scalar within each lane: you have a scalar functional unit, which is itself pipelined, which makes it like a vector processor, and then you have several of them side by side, which is like an array processor. That is how the two are combined.

Here is a high-level view of a GPU. This version is actually quite old, but many of the fundamentals are still the same. You have several cores, called shader cores; the name comes from graphics terminology, since these were the cores that did shading for graphics applications. That is part of a story I didn't cover much today: before around 2000 we didn't have the ability to use GPUs for running general high-performance computing applications. Then some clever people realized that if they drew a nice analogy between their application and a graphics computation, they could use the GPU for it; around 2001 or 2002 there are papers where people actually used graphics libraries to do, for example, matrix operations. Some researchers at NVIDIA were apparently following the same path, and NVIDIA realized it was a very good idea, extended their ISA, and provided CUDA support to make it more automated. Nowadays, when we work with GPUs, we don't need to deal with the graphics pipeline at all.

So you have these shader cores, an interconnection network, and memory controllers with access to off-chip memory. Inside a shader core you have the front end, the program counter, the mask, the instruction caches, the decode logic to decode instructions, and then the SIMD execution back end; we'll see more about that. I already discussed latency hiding, which happens through warp-level fine-grained multithreading, and GPUs rely on that a lot.

To make this possible, you really need the register values of all threads available on chip, in the register file, and that's why GPU register files are very big: for example, 256 KB of register file per GPU core, which is 64K 32-bit registers per core. During my PhD I realized that even that is not enough, and I worked on designing a register file that is eight times larger, 2 MB per core; you can check that paper, LTRF, in ASPLOS 2018.

We already showed this picture, and now we want to see what it looks like when you have warps. Essentially you have your warp and its different threads; let me show it here, it's better. Yes, so this is a recap for you.
You can have warp 0 doing a load operation, and at some point warp 1 is in the multiply unit of the vector processor, warp 2 is in the add unit, warp 3 is in the load unit, then warp 4, warp 5, and so on. This is how you interleave warps doing different operations. And here you can see that if your SIMD engine is, say, eight lanes wide and your warp is 32 threads (not 32 bits), you issue eight threads at a time and run them in parallel on the eight processing elements, and you pipeline the rest: you need four issue cycles of one 32-thread warp to finish executing warp 0. That's where you are combining the array processor and the vector processor. Does it make sense?

Student: Each of these dots and triangles, are these all single threads?

Yes, each is a single thread. You don't have as many processors as you have threads in your GPU, so you need to interleave them and run them concurrently.

Student: And these are all pipelined?

Yes, but not all at once: you can see that eight threads run fully in parallel, because this axis is time and you have eight parallel lanes; after that you pipeline the rest, next cycle, next cycle. Make sense? Good.

For memory access, the same instruction in different threads uses the thread ID to index and access different data elements. For example, consider two arrays of 16 elements each, run with four warps, each warp having four threads. You have warps 0 through 3, and within each warp the threads have thread IDs 0, 1, 2, 3. If you combine the thread ID with the warp ID, you get a unique ID, and with that ID each thread accesses its element of memory. That makes address calculation in hardware quite easy; it's the same property as SIMD, where you don't pay much to calculate the next memory address because addresses follow a simple stride. Here, too, it is just based on the thread ID.
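Here is a small sketch of that addressing, assuming the usual 32-thread warps; the decomposition of the thread index into a warp ID and a lane ID below is the conventional one for a 1D block, and the names are mine. Because consecutive lanes touch consecutive elements, one warp's 32 loads form a single unit-stride (coalesced) access.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread derives a unique index from its block and thread IDs; the same
// index splits into (warp ID, lane ID) inside the block. Consecutive lanes
// access consecutive elements, so a warp's loads coalesce into one wide access.
__global__ void show_ids(const float* A, float* B, int N) {
    int global = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    int warp   = threadIdx.x / 32;                       // warp within the block
    int lane   = threadIdx.x % 32;                       // position within the warp
    if (global < N) B[global] = A[global] + 1.0f;        // address = base + 4*global
    if (global < 4)
        printf("thread %d -> warp %d, lane %d\n", global, warp, lane);
}

int main() {
    const int N = 64;
    float *A, *B;
    cudaMallocManaged(&A, N * sizeof(float));
    cudaMallocManaged(&B, N * sizeof(float));
    for (int i = 0; i < N; ++i) A[i] = (float)i;
    show_ids<<<1, 64>>>(A, B, N);     // one block of 64 threads = two warps
    cudaDeviceSynchronize();
    printf("B[63] = %.1f\n", B[63]);  // expect 64.0
    cudaFree(A); cudaFree(B);
    return 0;
}
```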
For maximum performance, memory has to provide enough bandwidth, as we also know from SIMD, and that's why GPUs integrate high-bandwidth memory nowadays, HBM2 and now HBM3; for example, I think the Blackwell architecture has about 8 TB/s of memory bandwidth per GPU. And that's not only about off-chip GPU memory: the register file must also provide high bandwidth, which is why GPU register files are heavily banked; there are several register banks, and multiple operand collectors fetch register values, stage them, and feed them into the pipeline. We need high bandwidth everywhere in the GPU, because without it, fine-grained multithreading is not going to work well for you.

Okay. As I said, warps are not exposed to the GPU programmer. A GPU program is actually a mixture of CPU code and GPU kernels: the sequential or modestly parallel sections you write for the CPU, and the massively parallel sections run on the GPU as blocks of threads. So, as shown here, you have serial code that runs on the CPU host, and then a parallel kernel that runs on the device. In NVIDIA's CUDA notation, to invoke a kernel you give the kernel name and then specify the number of blocks and the number of threads per block, if I'm not mistaken, plus the arguments you want to pass. It ends up looking like this: each of these is a block, we have many blocks, and the whole collection is called a grid. When the kernel is done, you run some more serial code, then maybe invoke another kernel, and so on.

Student: How do you figure out which threads require what, and how many threads?

That is calculated by the programmer: the programmer decides how many threads and how many blocks are needed. And note these are GPU threads, not CPU threads. Your GPU program contains CPU code, starting with int main as you would normally write it, and in that code you invoke a GPU kernel. For that you first allocate some memory on the GPU device, then move data from CPU memory to GPU memory, and then invoke the kernel. The invocation can be blocking or non-blocking: sometimes you invoke a kernel and the CPU keeps going, because it doesn't depend on the result yet; when it does depend on the result, you add a barrier and wait.

Here is a sample GPU code. This is your CPU code, and this is your CUDA code, quite simplified: you compute your unique thread ID using the block dimension, the block ID, and the thread ID, and with that unique ID you load from arrays A and B and write the element-wise sum into the output array C. That's your GPU code. And here is a less simplified version: this is your GPU program, with the main part that runs on the CPU, and this part is your kernel. You define your block and grid of threads, then you call the GPU program, add_matrix, specifying the grid and the dimensions of your block, plus the arguments you pass to the kernel.
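Here is a self-contained sketch of the flow just described: allocate device memory, copy the inputs from CPU to GPU, launch the kernel over a grid of blocks, and copy the result back. The kernel body and the sizes are my own stand-in for the add_matrix example on the slide, not the slide's exact code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_matrix(const float* A, const float* B, float* C, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H) {
        int i = y * W + x;
        C[i] = A[i] + B[i];
    }
}

int main() {
    const int W = 1024, H = 1024;
    const size_t bytes = W * H * sizeof(float);

    // Host-side data: this is the "serial code" part that runs on the CPU.
    float *hA = new float[W * H], *hB = new float[W * H], *hC = new float[W * H];
    for (int i = 0; i < W * H; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // 1. Allocate GPU memory, 2. move data CPU -> GPU.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel: a grid of blocks, each block 16x16 threads.
    dim3 block(16, 16);
    dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
    add_matrix<<<grid, block>>>(dA, dB, dC, W, H);

    // 4. Copy the result back; this copy also waits for the kernel to finish.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %.1f\n", hC[0]);   // expect 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}
```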
Of course, I'm not covering the details of programming, so if you want to get your hands dirty with GPU programming, I recommend watching this lecture. We also had a P&S course in the past on heterogeneous systems that covers GPU programming in depth. One of my former colleagues, Juan, who is now at NVIDIA, delivered that course very nicely, and I know many people around the world watch those lectures to get started with GPU programming and then become advanced, expert programmers. That course is quite exciting; I watched it myself, to be honest, and learned a lot.

Okay, from blocks to warps. This is the SM architecture, shown here for the Fermi architecture from 2009. The programmer sees blocks and assigns blocks for execution, but for execution the GPU needs to divide these blocks into warps. The warp size is typically 32 threads; there are GPUs with a different number of threads per warp, but 32 is the common number. Different warps belong to different blocks, and that's important: when the hardware groups threads to form a warp, those threads must all belong to the same block. The hardware should not, for example, take a thread from block 1 and group it with threads from block 0 to form a warp, because that could violate the programming model (there is a small sketch of how a block is carved into warps after this passage).

Student: Are those two different programs?

No, they can be the same program, but there are rules at the programming-model level that we need to obey, and we also want to keep the hardware simple. For example, GPUs have a shared memory: threads in the same block can use that shared memory to share data with each other. If you grouped threads from different blocks into one warp, the GPU would have to understand that those threads must not share block-local state, because they belong to different blocks, and that adds complexity to the hardware.

Student: So it's the same program, but the blocks are independent parts that should not interact?

Yes, exactly. For a long time GPUs did not really support synchronization across blocks; when programmers wanted to communicate across blocks, they usually had to do coarse-grained synchronization: finish the kernel, synchronize there, and then start another kernel. In recent GPUs there is also some support for blocks to share information and communicate.

Student: But why would you... isn't there a limit?

There is a limit, and the GPU scheduler also assigns several blocks to one SM, that is, to one streaming multiprocessor or core. If you have only one block, the GPU assigns that block to one SM and the rest of the SMs, which can be hundreds, sit idle. But you're right: all of this comes from decisions we make at the hardware and programming-model level, and none of them, in my opinion, is really fundamental.
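As promised above, a tiny device-side sketch of how a block is carved into warps, assuming 32-thread warps and the usual row-major linearization of a 2D block (thread x varies fastest); the kernel and block shape are mine and exist only to print the mapping. Threads are only ever grouped with threads of the same block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A 2D block is linearized (x fastest) and cut into 32-thread warps.
__global__ void warp_of_thread() {
    int linear = threadIdx.y * blockDim.x + threadIdx.x;  // linear ID within the block
    int warp   = linear / 32;                             // warp within the block
    if (threadIdx.x == 0)  // print once per row to keep the output short
        printf("block %d, row y=%d -> warp %d\n", blockIdx.x, threadIdx.y, warp);
}

int main() {
    dim3 block(16, 4);              // 64 threads per block -> 2 warps per block
    warp_of_thread<<<2, block>>>(); // two independent blocks
    cudaDeviceSynchronize();
    return 0;
}
```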
Okay, I think we already covered this comparison between warp-based SIMD and traditional SIMD. Traditional SIMD contains a single thread, and the ISA contains vector SIMD instructions. Warp-based SIMD consists of multiple scalar threads, and it does not have to be lockstep, meaning that threads in the same warp could in principle execute independently; they don't have to execute instructions in a lockstep manner. For example, there can be cases where the threads of a warp access memory and some of them hit in the cache, so they get their data quickly, while other threads of the same warp miss and have to go to off-chip memory, experiencing a much longer latency. In a lockstep model, the warp stalls until all threads have their data, which is usually what NVIDIA GPUs do; I know that in recent versions they have relaxed this a little, but to keep things simple they treat warps as lockstep. The point is that in this paradigm it does not have to be that way; we could be more flexible. Also, the software does not need to know the vector length, which enables multithreading and flexible, dynamic grouping of threads, and the ISA is scalar.

SPMD, just as a summary: single program multiple data. It is a programming model rather than a computer organization; each processing element executes the same procedure, except on different data elements; essentially, multiple instruction streams execute the same program. And I want to emphasize that each instance of the procedure works on different data and can take different control-flow paths at runtime. Threads in a warp hit branch instructions from time to time, and the outcome of a branch, taken or not taken, can depend on the thread's ID or on the thread's data. So some threads in a warp need to take the taken path and others the not-taken path, and then they diverge; we're going to see how that can be handled.

So we have finished the first part; now I want to talk about how we can group threads into warps flexibly. This is what I was describing: here is the control-flow graph of an example application, and you have several threads, and even though they run the same kernel code, different threads may take different paths because they work on different data, and this reduces your warp efficiency. A GPU uses a SIMD pipeline to save area, as you know, and this data-dependent branching causes what we call branch divergence.
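Here is a small kernel in which threads of the same warp take different paths depending on their data, which is exactly the divergence case just described; the threshold, the data pattern, and the names are mine. With the alternating input below, every warp splits half-and-half, so the hardware runs the two paths back to back with complementary lanes masked off before reconverging.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Data-dependent branch: within one 32-thread warp, some threads take the
// "then" path and others the "else" path. The hardware executes the two paths
// one after the other with part of the warp masked off, then reconverges, so
// a fully divergent warp does roughly twice the work.
__global__ void threshold(const float* in, float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    if (in[i] > 0.5f)          // path A: taken by some lanes of the warp
        out[i] = in[i] * 2.0f;
    else                       // path B: taken by the remaining lanes
        out[i] = 0.0f;
    // reconvergence point: all lanes active again from here on
}

int main() {
    const int N = 64;
    float *in, *out;
    cudaMallocManaged(&in, N * sizeof(float));
    cudaMallocManaged(&out, N * sizeof(float));
    for (int i = 0; i < N; ++i) in[i] = (i % 2) ? 0.9f : 0.1f;  // alternate per lane

    threshold<<<1, 64>>>(in, out, N);
    cudaDeviceSynchronize();
    printf("out[0]=%.1f out[1]=%.1f\n", out[0], out[1]);  // 0.0 and 1.8
    cudaFree(in); cudaFree(out);
    return 0;
}
```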
So you have a warp in which all threads are active, you reach a branch instruction, some threads want to take path A and some want to take path B, and the fully active warp diverges into two partially active warps. At some point these need to reconverge, and the GPU hardware has to deal with that, which is not easy. The fact that the hardware handles it is what makes the programming very easy; one reason SIMD programming is hard is exactly dealing with branches, because as a programmer you have to manage all these masks yourself, but the GPU hardware does that for you. Of course, if you write your code without thinking about the hardware, your program is going to be very slow and inefficient, because divergence is not a good thing: you want your warp fully active in order to get good parallelism, and now you are not utilizing the machine as much as you could. So let's see how we can group threads into warps flexibly. If you have many threads, we can find individual threads that are at the same PC, even though those threads might belong to different warps. As long as they are at the same PC, meaning they are executing the same instruction, the paradigm lets me group them into another warp, so I can make a new warp that is fully active, and that can improve SIMD utilization. Let's see this with an example. This technique is called dynamic warp formation, or warp merging. Essentially you have two warps, warp X and warp Y, you see that they are executing the same instruction but both are only partially active, so you combine them into a new warp Z that is more active, and you run Z instead of X and Y; that increases your SIMD utilization. But in another example you can see that at some point you could not combine them nicely. Why is that? Because some of the active threads were in the same lane, and in that case you cannot easily combine them. The reason is register file access: to keep things simple, the lane a thread sits in determines which part of the register file holds its registers. When two active threads sit in the same lane, you would need to move at least one of them, which means changing its lane, and changing the lane requires a lot of complexity in the hardware, which hardware usually does not support. People have worked a lot on this in the past, and I am not sure how active it still is; for example, some suggested doing a rotation, and with restricted tricks like a one-position rotation you can make the hardware aware of the trick and reduce conflicts, but if you want to move a thread flexibly to any lane, the hardware becomes far too complex. Question: you have warps X and Y doing two different things, but they happen to be at the same PC? No, they are actually running the same application, and yes, they are in the same block.
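If you want to see divergence from inside a kernel, CUDA's warp-vote intrinsics let each warp count how many of its lanes take a given path; the kernel below is a made-up probe, and it assumes the block size is a multiple of 32 so every lane participates in the ballot.

```cuda
// A made-up probe kernel: __ballot_sync returns a 32-bit mask of the lanes for
// which the predicate is true, and __popc counts them, i.e. how many lanes of
// the warp will take path A. Assumes blockDim.x is a multiple of 32 so every
// lane participates in the ballot.
#include <cstdio>

__global__ void divergenceProbe(const int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int takeA = (data[i] % 2 == 0);                      // data-dependent branch
    unsigned maskA = __ballot_sync(0xffffffffu, takeA);  // which lanes take path A?
    if (threadIdx.x % 32 == 0)
        printf("warp %d: %d of 32 lanes take path A\n", i / 32, __popc(maskA));
    if (takeA) {
        // path A: executed with only the maskA lanes active
    } else {
        // path B: executed with the remaining lanes active
    }
}
```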
And then instead of having two underutilized warps, you merge them into one. Yes, not processes, warps: you make a new warp by combining the two, which is sometimes called warp compaction; it is this dynamic formation of a new warp, which is why we call it dynamic warp formation. But again, it is assumed that these warps are from the same thread block, because otherwise your program might be violated. Question: is the only real benefit because of lock-step, because otherwise X would have to wait for Y? No, not at all; the GPU hardware does not do that. When divergence happens, the GPU considers that you now have two warps and tries to execute them separately and merge them again at some reconvergence point; but your control flow graph can contain more nested ifs and elses, so that reconvergence point can actually be quite late, and there are many works that try to predict it or bring it earlier. The loss here is not about lock-step across warps; it comes from the fact that within one warp all the threads issue the same instruction, so a partially active warp wastes lanes. But if you are lucky and you have another warp at the same point, you can combine them and make the execution more active again. This idea was proposed in this work from 2007, and I already gave you the essence: you try to merge warps as much as possible, but you should not forget the constraints, because there are cases where you cannot do it. There are, I would say, three important constraints when you want to do this. First, the two warps that want to compact their threads together should be at the same PC, which makes sense. Second, the active lanes should not overlap, that is, the thread positions should not collide. Third, the threads should belong to the same thread block, because otherwise this can cause correctness issues (a small sketch of these three checks follows after this paragraph). Later, in 2011, the same team published a paper on a technique they call thread block compaction, which performs the compaction at the thread block level to make sure this important constraint is not violated. And here is an example that you can play at home if you are interested; it is a bit slow but a nicely done animation. Here is how the baseline code would run, and with dynamic warp formation you can combine warps in some cases but not in others: these two can be combined, and these two, and for E and F the two E warps can combine, but then you still have an E warp with only one active thread, which saves some cycles for E. Question: so there was no win for that one? Unfortunately no, because one of the threads conflicts.
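Here is the sketch of those three merge checks; the Warp struct and canMerge function are a hypothetical model of the hardware state for illustration, not any real API.

```cuda
// Hypothetical model of the three compaction checks; the Warp struct and
// canMerge() describe hardware state for illustration, they are not a real API.
#include <cstdint>

struct Warp {
    uint64_t pc;          // program counter the warp is currently at
    uint32_t activeMask;  // bit i set => lane i holds an active thread
    int      blockId;     // thread block these threads belong to
};

// Two partially active warps can be compacted into one only if:
//  1) they are at the same PC (same instruction),
//  2) their active lanes do not overlap (register files are indexed by lane),
//  3) they come from the same thread block (shared memory / correctness).
bool canMerge(const Warp &x, const Warp &y) {
    return x.pc == y.pc
        && (x.activeMask & y.activeMask) == 0u
        && x.blockId == y.blockId;
}
```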
That is exactly what I was talking about: can you move any thread flexibly to any lane to fix the cases where you cannot combine warps? Well, you can, but it would cause a lot of hardware complexity, so people do not do it. Okay, so analyzing GPUs is very fun; we actually use this concept a lot for exam questions. This is an example, and I am not going to solve it for you; you can check the solution, and you will probably see something similar around August. It is a very nice exam question in my view, and I used to design these a lot; I am not sure about this time, hopefully the TAs will decide. I do not have much time, but I will quickly cover a few more concepts. We already talked about branch divergence. There is another issue, which is long-latency operations: with fine-grained multithreading in general you want fast context switching to hide the latency of long-latency operations, and that is why our warp scheduler needs to be really intelligent, because here you have warp-based fine-grained multithreading. This work is by Professor Mutlu, who has also worked a lot on GPUs. Actually, when I started contacting Professor Mutlu around 2016 to start collaborating with him, I was working a lot on GPUs at the time, I was excited about his GPU work and had read a lot of it, and that is how our collaboration started. So I have very good memories of this paper; it is one of the papers I read and enjoyed a lot. I will go over it quickly, not in a lot of detail, because we do not have time. Essentially, if you are greedy in your scheduling, you keep scheduling the same warps, they all make progress through the compute instructions, and then they all reach a long-latency instruction at the same time; there is nothing left to run and you get a long stall. That is why this paper proposes two-level round-robin scheduling; the work was done in collaboration with Nvidia, and I believe something like it is used in many GPUs. You schedule one group of warps and let them compute; when they reach their long-latency operations, you still have another group that is in its compute phase, so you can schedule that group instead, and with that you reduce the stall time a lot (a toy model of this follows after this paragraph). This two-level warp scheduling is very interesting and very effective. If you are interested, you can check this paper further: they also discuss how to reduce branch divergence by introducing a large warp microarchitecture. You define warps much larger than 32 threads, and then you can flexibly form sub-warps out of a large warp, keep your SIMD lanes active, and do the computation. I do not have time to go into its details, but if you are interested, you can check the paper. Okay, any questions?
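To make the two-level round-robin idea concrete, here is a toy host-side model; the group size, stall rule, and all names are made up, and it only illustrates that one fetch group keeps issuing compute while the other group waits on memory.

```cuda
// Toy host-side model of two-level round-robin scheduling (made-up parameters):
// warps are split into fetch groups; the scheduler round-robins inside the
// current group and switches groups only when the whole group is stalled on
// long-latency operations, so the groups reach memory at different times.
#include <vector>
#include <cstdio>

int main() {
    const int numWarps = 8, groupSize = 4, numGroups = numWarps / groupSize;
    const int issuesBeforeStall = 3;            // pretend each warp stalls after 3 issues
    std::vector<int> issued(numWarps, 0);

    int group = 0;
    for (int cycle = 0, rr = 0; cycle < 20; ++cycle) {
        int pick = -1;
        for (int k = 0; k < groupSize; ++k) {   // round-robin inside the current group
            int w = group * groupSize + (rr + k) % groupSize;
            if (issued[w] < issuesBeforeStall) { pick = w; rr = (rr + k + 1) % groupSize; break; }
        }
        if (pick < 0) {                          // whole group stalled on memory:
            group = (group + 1) % numGroups;     // switch to the other group
            printf("cycle %2d: group stalled, switch to group %d\n", cycle, group);
            continue;
        }
        printf("cycle %2d: issue warp %d\n", cycle, pick);
        ++issued[pick];
    }
    return 0;
}
```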
I am going to quickly show some case studies of their GPUs, with a lot of marketing numbers, though I am not from Nvidia. This is one of the GPUs they provided a very long time ago, the GeForce GTX 285, and the numbers we see here look almost like a joke compared to what we have now, but even then it was super powerful. It had 240 stream processors; Nvidia also calls them CUDA cores, and you sometimes see the acronym SP for stream processor. Another term is core, also called streaming multiprocessor or SM, so you should be careful with the terminology. In this GPU we have 30 cores, and each core has some number of stream processors; to get the SPs per core you divide 240 by 30, which gives eight SPs per core. And one block, or several blocks, run on one streaming multiprocessor, that is, on one of the cores. This is its architecture, which is a bit old in that sense, but today's architectures are not that different fundamentally: per core we have 64 kilobytes of storage for registers, a decode step, and functional units for the SIMD operations such as multiply, add, multiply-add, and so on. The warp size is 32, and up to 32 warps are interleaved in a fine-grained manner in each SM, meaning the maximum number of threads you can have scheduled per SM is 1,024 in this architecture, because you can have up to 32 warps and each warp is 32 threads. More recent architectures allow a maximum of around 2,048 threads per SM, two to the power of eleven. For the whole chip, multiplying by the number of cores, in total you can have around 31,000 threads resident on this GPU (there is a quick back-of-the-envelope check of these numbers right after this paragraph). We are way more advanced now compared to that 2009 architecture, and you can see GPUs getting beefier and beefier in the number of functional units and in gigaflops. For example, in the Volta architecture from 2017 we have around 5,000 stream processors in total, 80 SMs, and 64 SIMD functional units, or 64 SPs, per core. An interesting thing Nvidia added in Volta is tensor cores, specifically to support machine learning applications. At some point Nvidia focused a lot on machine learning, and that is why they are now so successful and making a lot of money.
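Before moving on, here is the quick check of those GTX 285-era numbers mentioned above; the values are the ones quoted on the slide, not measurements, and the little program only does the arithmetic.

```cuda
// A back-of-the-envelope check of the GTX 285-era numbers quoted above
// (values come from the slide, not from measurement).
#include <cstdio>

int main() {
    int totalSPs      = 240;   // stream processors ("CUDA cores") on the chip
    int numSMs        = 30;    // cores / streaming multiprocessors
    int warpSize      = 32;    // threads per warp
    int maxWarpsPerSM = 32;    // warps interleaved per SM in this generation

    printf("SPs per SM:      %d\n", totalSPs / numSMs);                  // 8
    printf("threads per SM:  %d\n", maxWarpsPerSM * warpSize);           // 1024
    printf("threads on chip: %d\n", numSMs * maxWarpsPerSM * warpSize);  // 30720
    return 0;
}
```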
So they started adding some accelerators. GPUs in general are already a good fit for machine learning, because machine learning does a lot of matrix-matrix and matrix-vector multiplication, but if you want to do even better, a plain GPU is not enough; you need to design accelerators, and Nvidia designed tensor cores for that specific operation, matrix-matrix multiplication, to accelerate exactly the kind of operations we have a lot of in machine learning. Tensor cores are still SIMD-like and are shared across the threads of a warp. This is the architecture of the V100, with 80 SMs, and you can see this L2 cache; I think we did not have an L2 cache in that first architecture I showed, but you can check later. And these are some numbers about the performance you can get: with single precision you can get 15.7 teraflops, which is quite a lot, and if you use the tensor cores for operations like deep learning, you can get 125 teraflops specifically for deep learning operations. Tensor cores are essentially optimized for this operation: you have two matrices, which in this architecture must be 16-bit floating point, you multiply them and then accumulate, and the result is in 32-bit floating point. Some people have actually tried to demystify the exact tensor core microarchitecture in Nvidia GPUs; this paper is from the Toronto group, and it shows the tensor core microarchitecture in Volta. I am not going into its details, and you do not need to know them for this course, but if you are interested in working on this topic you can of course check the paper. Each warp utilizes two tensor cores, meaning each tensor core works on 16 threads; since a warp is 32 threads, each warp maps onto two tensor cores. Each tensor core contains two so-called octets and 16 dot-product units, meaning eight per octet, and it can do a 4x4 matrix multiply-and-accumulate each cycle per tensor core. If you look into it, the main functional unit is these rectangles: you do the multiplications and then you have a reduction tree for the sum. Why? Because you want to do a dot product; each of these processing elements is essentially doing a dot product for you. It is also quite interesting that, unlike conventional SIMD, register contents here are not private to each thread but shared inside the warp, because that is needed for the dot product; otherwise you would have to replicate a lot of data and keep many repeated data elements. Here you can load register values and share them through these buffers, so the threads can work together to execute the operation.
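For reference, this is roughly how tensor cores are exposed to CUDA programmers through the warp-level WMMA API; the sketch assumes the 16x16x16 tile shape with FP16 inputs and FP32 accumulation (supported since Volta), a launch with one full warp, and a made-up kernel name.

```cuda
// A sketch of the warp-level WMMA API that exposes tensor cores to CUDA code.
// Assumes a 16x16x16 tile, FP16 inputs, FP32 accumulation; kernel name is made
// up. Launch with one full warp, e.g. mma16x16<<<1, 32>>>(dA, dB, dC);
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void mma16x16(const half *a, const half *b, float *c) {
    // Fragments are distributed across the 32 threads of the warp:
    // register contents are shared inside the warp, not private per thread.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::fill_fragment(accFrag, 0.0f);       // start the accumulator at zero
    wmma::load_matrix_sync(aFrag, a, 16);     // load a 16x16 FP16 tile of A
    wmma::load_matrix_sync(bFrag, b, 16);     // load a 16x16 FP16 tile of B
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);               // D = A*B + C
    wmma::store_matrix_sync(c, accFrag, 16, wmma::mem_row_major); // FP32 result
}
```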
Other companies have also designed tensor cores; this one is from Google, the Google Edge Tensor Processing Unit, but it comes from a somewhat different paradigm, systolic arrays, which we will hopefully also learn about later. This is actually from research we did with Google on understanding the Google Edge TPU: what its shortcomings are and how we can improve it. If you are interested, you can check this work, and if you want to learn about systolic arrays, you can watch this lecture; we will probably also cover systolic arrays in later lectures. Okay, so after Volta we have Ampere; the Nvidia A100 was one of the flagship GPU devices back then, with around 6,900 stream processors, 108 SMs, and 64 SIMD functional units per core. The interesting thing about the A100 that I want to emphasize is the support for sparsity. The tensor cores I showed are designed for dense matrix-matrix multiplication, but for many machine learning workloads people prune the network to be more efficient, so we have a lot of sparsity in our matrices, and if you run those on Volta the performance is not good. In this generation, Ampere, they tried to provide better support for sparsity: once your matrix is somewhat sparse, you can get better performance, but the number I am going to show is not that great. For deep learning with a dense matrix you get 312 teraflops, and with sparsity you only get double that (there is a small sketch of the sparsity format after this paragraph). But many works show 90, 95, or 99 percent sparse values; if you have 90 percent sparsity you would ideally want something like 10 times the speedup, but because of irregularity and many other reasons they cannot get that much improvement. Still, they are trying to do better when the matrices are sparse, and here is the architecture they have for the sparse tensor core. Later, in 2022, we have the Hopper architecture, which has 144 SMs on the full chip and 60 megabytes of L2 cache. An interesting point about this architecture is that they added a lot of different precisions for data. Why? Because they realized that for much of machine learning, training for example, we do not need 16-bit, 32-bit, or 64-bit data; we can operate on 8-bit data, and that provides a lot of efficiency and also reduces how much memory bandwidth you consume. Are we on time? Yes. So this is what I want to underscore here: they added different precisions to their design. And this is the most recent one, Nvidia Blackwell, which they announced last year. I am not sure if it is already released; I think maybe some of them are, in GeForce, because Nvidia has different product categories. There are GPUs that come from the graphics side, which you can still use for computation as well; you may know them as the 3080 Ti or the 4090 or similar. You can use them for games, of course, but you can also use them for high-performance computing.
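Coming back to the sparsity support mentioned above: Ampere's sparse tensor cores expect a 2:4 structured pattern, that is, at most two non-zeros in every group of four values. The host-side sketch below only illustrates that format; the pruning rule (keep the two largest magnitudes) is a common choice for illustration, not NVIDIA's exact tooling.

```cuda
// Host-side sketch of the 2:4 structured-sparsity format: in every group of 4
// values, at most 2 are non-zero. The pruning rule here (zero the smallest
// magnitudes) is only illustrative.
#include <cmath>
#include <cstdio>

void pruneTwoOfFour(float *row, int n) {     // n assumed to be a multiple of 4
    for (int g = 0; g < n; g += 4) {
        int nonZeros = 0;
        for (int k = 0; k < 4; ++k) if (row[g + k] != 0.0f) ++nonZeros;
        while (nonZeros > 2) {               // zero smallest magnitudes until 2 remain
            int victim = -1;
            float best = 1e30f;
            for (int k = 0; k < 4; ++k) {
                float m = std::fabs(row[g + k]);
                if (m != 0.0f && m < best) { best = m; victim = g + k; }
            }
            row[victim] = 0.0f;
            --nonZeros;
        }
    }
}

int main() {
    float v[8] = {0.9f, -0.1f, 0.4f, 0.05f, -2.0f, 1.5f, 0.2f, -0.3f};
    pruneTwoOfFour(v, 8);
    for (float x : v) printf("%5.2f ", x);   // two zeros in each group of four
    printf("\n");
    return 0;
}
```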
I think some of them have already been released, but I am not fully sure. This is, for example, one of the configurations they announced; they call it a superchip, I believe, with two Blackwell chips, which I believe are the B200 version, plus a CPU chip on the same board, and it is quite powerful; they can then combine these superchips in a cluster and make a supercomputer out of them. This is a picture of the Nvidia Blackwell architecture; I do not think this one is of the B200. It is actually very hard to find a picture right now because it is not widely released yet, and there is a lot of marketing information, so it is not easy to filter it and get information you can rely on; but this is one die of the Blackwell architecture. In the B200, which I think is the most powerful chip of this generation, they have 160 SMs, or 160 cores, 8 terabytes per second of memory bandwidth with HBM3, and up to 192 gigabytes of memory capacity. You can see the performance they quote: 180 for 32-bit floating point and 90 for 64-bit floating point, in teraflops. Is it exciting? It is even more exciting if you know the history: around the year 2000 people were trying to work out how to build a one-teraflops supercomputer, and not long after, Nvidia comes out with architectures like this, and now we have teraflops in our pockets. Question: between the last couple of generations, it just looks like the die image grew? Yes, this is only the die of the GPU chip, and the fundamentals are the same. But they are also adding accelerators such as tensor cores, as we discussed, and support for lower precision; I believe in Blackwell they also provide support for 4-bit data, 4-bit precision. So these are the kinds of changes they are making.
These changes are not fundamental in my opinion, but I think they are doing what they can to keep up with the performance demand. There might, of course, be fundamentally different ways to do computation, which we will also see in later lectures, for example memory-centric and storage-centric computing; you may not need that many GPUs to do the computation. We have actually already released two papers related to running large language models using memory-centric computing and getting better performance, performance that would otherwise require combining many, many GPUs. That suggests we may need to rethink whether what we are doing is the right approach. Okay, so I am showing you these pictures again because I want to share a takeaway with you: each GPU is quite powerful and getting more and more powerful, but today's applications need a lot of memory capacity and at the same time a lot of computation power, so people combine several GPUs in a cluster to collectively provide higher memory capacity and higher computation power. That is why the interconnection network is going to be a fundamental problem going forward as well; it was a fundamental and hard problem to solve in the past, and it is getting harder and harder. Question: is the demand outpacing what we can design, that is, is the demand for AI processing outpacing the hardware? Well, it depends; it also depends on the models. As you make your models more intelligent, they may not need to be that huge, which we have actually seen recently: by making a model more optimized, you do not need such sophisticated hardware to run it. The general trend, yes, is that people just train more and make the models bigger and bigger; exactly, but I think one of the important research problems going forward is how we can make our models much smarter and less compute-hungry, because maybe you do not need all of that; our brain, I do not think, is that power-hungry. And these are some interconnection networks that Nvidia proposed in different generations to combine GPUs; this architecture provides all-to-all connectivity within an NVLink domain of 72 GPUs, which is for Blackwell, so they combine 72 Blackwell GPUs in a cluster using this interconnection network. Okay, so here are some food-for-thought questions. We have not covered systolic arrays yet, but we hopefully will, and there is also a lecture you can check if you are interested. Essentially: which one is better for machine learning? There are companies using systolic arrays, like Google with the TPU, and companies sticking with GPUs. Which one is better for image and vision processing? Which type of parallelism does each exploit, and what are the trade-offs? If you are interested in such questions and more, I would recommend that you take this course; I know some of you are already taking it.
But yes, if you are interested, you can take the seminar in computer architecture course and our master-level computer architecture course, where we go into cutting-edge research more deeply and you can learn more about GPUs; I would recommend these courses again. And with that I will conclude today's lecture. No questions, I guess. Right.