This is the agenda for today. We will talk about the GPU as an accelerator. You know that GPUs were originally devised for processing graphics; we don't really care about that side of GPUs in this lecture, we are only going to talk about their capability as general-purpose processors. The way we are going to organize this lecture is by first presenting the typical program structure when we write CUDA or OpenCL programs, which are the ways of programming GPUs for general-purpose processing, and we will talk about the bulk synchronous parallel programming model: what it means and what its limitations are. Then we will talk about the memory hierarchy and memory management, and finally some performance considerations, that is, things we have to take into account when we want to optimize programs for GPUs. These optimizations are usually not trivial, which is why it is interesting to talk about them, and even though we are going to be talking in particular about GPUs, they are fairly generic optimization techniques that you may also use on other platforms as well. Finally, only if we have time, we will start talking about collaborative computing. These are, let's say, new programming features that allow us to use different devices in a collaborative manner, working on the same workload, the same application.

Recommended readings: the first one is clearly the CUDA Programming Guide. You will see there is a lot of documentation there, from the very basics of CUDA programming to very complex APIs, functions, and programming tricks. And if you want to read a book, there are several good ones, but the one I would recommend is the one by Professor Wen-mei Hwu and David Kirk.

The other day, in the last lecture, we had a few more slides to cover about example GPUs, and this is where we are going to start right now. This trend of GPU programming, of general-purpose programming for GPUs, started around 12 or 13 years ago, and the very first architecture was called the Tesla architecture. We will talk a little bit more about the different architectures later in the lecture, but this is one of the first GPUs that were capable of doing general-purpose processing. In NVIDIA speak, this has 240 streaming processors, and this is SIMT execution; recall that we talked the other day about SIMT execution, and we will come back to that today. In generic speak, these 240 streaming processors, which NVIDIA also calls CUDA cores, are actually 30 GPU cores, or 30 SIMD pipelines, each of them including eight lanes. You can see the detail here: this is one of these cores, one of these streaming multiprocessors in NVIDIA terminology, and inside each of them we have the different lanes; you can see the eight of them. Inside each lane we might have several processing elements, several ALUs: for example, we have something that is able to do multiply-accumulate, or multiply-add, operations, which are very useful for linear algebra and, for example, for neural networks, and we may also have other units, such as special units for integer computation and for floating-point computation.
This is also something that has changed a little bit over time in the different GPU architecture designs, but essentially what we have inside one of these GPU cores is what you can see on this slide. As you might remember as well, what we do on each of these GPU cores is map thread blocks — you might remember that term from the previous lecture — and these thread blocks are decomposed into warps, and each warp is a set of 32 threads that are going to run in parallel; they run concurrently as a basic SIMD unit. We will talk a little bit more about this today.

And this is the full picture of the entire GPU, this GeForce GTX 285, and here you can find the 30 GPU cores that the first slide mentioned. Even though the basic architecture is essentially the same, it has evolved a lot in the last 10 or 12 years. This one was released in 2009, and the first CUDA-capable GPUs were released two or three years before. The thing is that in the last 10 years you can see how much the number of these streaming processors, and, what is more important, how much the peak throughput, has increased; it is something like 30 times, from this GTX 285 to the NVIDIA Volta V100, which was released in 2017 and is especially devised for training deep neural networks, among many other tasks that you can of course run on it.

This is the Volta, the NVIDIA V100. Now observe how much the number of streaming processors has changed, from 240 to 5,120 — a much, much more powerful thing. The way they are organized is a little bit different: in the other GPU we had 30 cores with eight SIMD functional units per core, and here what we have is 80 cores with 64 of these CUDA cores, or streaming processors, each. Something interesting in this Volta architecture, something that actually started in the Volta architecture, is specialized functional units for machine learning; in NVIDIA terminology these are called tensor cores. We will see something more in the next slide, but for now what you can see here is the entire block diagram. Each of these rectangles is one SM, or GPU core, and it is actually divided, as we will see better in the next slide, into four different parts. If you count the number of SMs, the number of GPU cores, there are 80 in total. Observe that they all have access to this shared L2 cache, and from there you can access the memory controllers to the HBM2 memory, which provides very high bandwidth, around 800 or 900 gigabytes per second. On the right-hand side there are some numbers about the peak throughput you can get in single precision, double precision, and also for deep learning with these tensor cores that you can find here.

Observe that this is the picture of an entire SM, a GPU core, and it is actually divided, as you can see, into four different parts, but essentially what we have in each of these parts is the same. We have an L0 instruction cache, from which the warp scheduler reads instructions and dispatches them, for the individual warps, onto the different units that we have here. Here we have eight FP64 units for double-precision computation, here we have integer units, here we have FP32 units, and these are the tensor cores.
What exactly do these tensor cores do? It is something like this: with each of these tensor cores, what we are going to be able to do is this kind of computation. Observe that we have two input values, we multiply them, and then we accumulate the partial result with more products coming from the other multipliers. As you can see, this unit is able to do a sort of mixed precision: we are not working with single-precision or double-precision floating-point operands, but with half precision, which uses 16 bits to represent the floating-point number. This is very common practice in deep neural networks, because this precision is more than enough to do training and inference of neural networks. And what you see here is actually mixed precision, because the input operands are 16 bits but the output is 32 bits; you can also have the output in 16 bits. This is more or less the computation that each of these tensor cores is able to do: essentially multiplying two 4 × 4 matrices and accumulating with the original values of matrix C — so D = A × B + C. By decomposing an entire matrix multiplication into many of these tiles and mapping these tiles onto the tensor cores, you can achieve very high throughput, as you can see, which is essential these days to train networks fast and in an efficient way. That is why these GPUs are, let's say, the accelerator of choice for training neural networks, even though there are other architectures as well that are being designed and explored for the same purpose.

Actually, this is more or less what this food-for-thought slide is about: what is the main bottleneck in GPU programs? This was something that I wanted to ask you the other day at the end of the lecture, but we didn't have time; maybe you started figuring out what the bottlenecks in GPU programs are. We are going to talk about this today for sure. About these tensor cores: we are saying they are very useful, very efficient, for training deep neural networks, for essentially performing matrix multiplications, because, as you may know, when we are training — and not only in training, also in inference, when we are running neural networks — what we do is map the convolutional layers, among other layers, as matrix multiplications, and that is why we can easily map them onto these GPUs and these GPU cores and achieve very, very high throughput. The question here is: is it possible to use these tensor cores, which are super fast and very nice, for other applications? This is something that some people are exploring these days, trying to see how to modify these tensor cores slightly in order to be able to do more kinds of computation. For example, there is a very recent proposal at MICRO 2019 about utilizing these tensor cores to execute sparse matrix computations as well, which is going to be more and more important for neural network training.
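From the programming side, tensor cores are exposed in CUDA through the warp-level matrix-multiply-accumulate (WMMA) API. The following is only a minimal sketch of how one 16 × 16 × 16 mixed-precision tile (half-precision inputs, single-precision accumulator) could be computed; the pointer arguments and leading dimensions are illustrative assumptions, and it needs a tensor-core-capable architecture (sm_70 or later) and at least one full warp of 32 threads at launch.

```
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes D = A x B + C for a single 16x16 tile.
// A and B are half precision; the accumulator/output is single precision.
__global__ void tensor_core_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                        // load the A tile (leading dim 16)
    wmma::load_matrix_sync(b_frag, b, 16);                        // load the B tile
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);   // load the C tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);               // D = A*B + C on the tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);  // write the result tile back
}
```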
Another question that I wanted to ask you: can you compare and contrast GPUs with other accelerators? In particular, it is very nice to compare them with systolic arrays. Why is that? Because systolic arrays — and, as an example of them, TPUs — are one of the other good alternatives for training and inference of neural networks. I think we might not cover systolic arrays here in this course, but if you are interested you can watch Professor Mutlu lecturing in Design of Digital Circuits, and you will start figuring out what the differences are; maybe at some point it would be nice to combine both paradigms. Which one is better for machine learning? Try to answer that. Which one is better for image and video processing? In the end there are a lot of similarities between machine learning — in particular deep neural networks — and image and video processing. Why is that? Because in the end we are working on usually two-dimensional or three-dimensional inputs, and the way we operate on them is by applying filters on top of these inputs; input feature maps is the name we use in neural networks. We will see how we operate on these different tiles when we are doing GPU computing; it is something we are going to cover in this lecture today. And what type of parallelism does each one exploit? You already know what type of parallelism GPUs exploit: recall that we have SIMD, and inside the SIMD paradigm they have aspects of both vector processing and array processing. But we also talked about fine-grained multithreading; recall that the other day I had a few slides to briefly introduce fine-grained multithreading. These three processing paradigms are all inside GPUs. We are going to talk more about them, especially the fine-grained multithreading part, and about its trade-offs in GPUs — yes, everything has trade-offs in life, and we are going to talk about these trade-offs for sure.

Recall this slide: how is each of these GPU cores, or SIMD pipelines, composed? This is actually what you have seen in the previous slides. In each of them we have one stage where we fetch the instruction; this is what the warp scheduler does when it goes to the L0 or L1 instruction cache, reads one instruction, and starts decoding it to issue it onto the lanes. So we decode it, and then we have these kinds of scalar pipelines in parallel, each of them representing one of the SIMD lanes inside the GPU core. We need to read from the register file the operands of the arithmetic operation, for example, that we are going to execute; then we have our ALUs, which are the multiply-add and multiply units you have seen; and then we have a data cache, the L1 cache — if we miss there, we have to go to the L2, and if we also miss there, we have to go to global memory, the DRAM, which is essentially what is called global memory on GPUs. And then, finally, the writeback: we write the output to the register file.

Also recall how one warp is executed on these units. We talked about different possibilities of mapping the computation onto the multiple lanes that we have, and it will depend on the number of lanes. Recall vector processors: we might be writing SIMD programs where we assume that we have a certain length of the SIMD unit,
and then, depending on how many functional units we have, we will have to perform the computations in a different manner. In case we have one single functional unit, we can have, let's say, concurrency in time: for example, here we first launch the computation of a0 + b0 and obtain c0; after that we input the operands to compute c1, c2, c3, and so on. If we have more of these units, we can have more of these operations in parallel, so we can have concurrency in time and in space.

How do these SIMD lanes look? Each of these is going to be one SIMD lane; we may have several of these functional units, and the operands are going to be provided by the register file. The register file has a certain total size — the number of registers that we have inside each of the GPU cores — and what we do when we map the threads onto the GPU core is partition this register file and assign to each thread as many registers as it needs, obviously as long as we have enough registers for that. We are also going to talk about that availability of registers and how much it impacts the overall execution of the warps.

You may also recall this one: this was an example where the warp size is 32, so we have 32 threads that are going to execute in lockstep; they execute concurrently, but the way they are actually executed on the lanes depends on how many lanes we have. In this example we have eight lanes, so if we want to execute 32 threads, what we have to do is launch 8, 8, 8, 8, up to 32: in every cycle we start executing eight threads. And this would be, for example, the first instruction: in the load unit we execute a load instruction for warp 0, maybe here in the multiply unit we execute a multiply instruction for warp 1, an addition for warp 2, and again, in the load unit, warp 3, warp 4, warp 5.

This slide, very quickly, is just for you to have as a reference, because as you can see I am actually mixing a lot of terminology here: I am talking about the NVIDIA terminology, I could also talk about the AMD terminology, and then we have the generic terminology, which is what we would really need to use — but it is also good that you are familiar with the terminology of the different vendors. So, for example, when I talk about a GPU core, that is what NVIDIA calls a streaming multiprocessor, and inside a streaming multiprocessor what we have is what NVIDIA calls streaming processors, which for us is a SIMD pipeline or SIMD functional unit.

Okay, let's start talking about the basics of GPU programming. The very first thing is to recall this slide that we also presented the other day: what was one of the main drawbacks of SIMD processing? One of the main drawbacks of SIMD processing was that it is difficult to program, and this is what, somehow, bulk synchronous parallel programming models like CUDA or OpenCL are trying to solve, trying to tackle. This trend of general-purpose processing on GPUs, as I said, started 12 or 13 years ago, and the most interesting contribution here is that it is easier to program SIMD
processors with the SPMD programming model; we already discussed this SPMD programming model the other day. Many people say that GPUs have democratized high performance computing, because the throughput, the performance, that you can today achieve very easily — even on your laptop with one of these GPUs — was prohibitive 15 years ago; it was something that we could probably only achieve in big clusters and supercomputers, and these days everyone can do it, even on a commodity PC. GPUs are typically very good for workloads where we have inherent and regular parallelism. The good thing about the SPMD programming model is that it is also possible to implement irregular workloads; it is also possible to have divergence across different threads running concurrently. Obviously the performance is going to be degraded, but it is still possible to program them this way. But essentially they are very, very good whenever we have inherent and regular parallelism, and examples of that are linear algebra — that is why matrices — image and video processing, and deep neural networks.

However, this is not for free. Why? Because there is a new programming model, and we have to learn it. Algorithms need to be rethought and reimplemented; that is the first thing to keep in mind, because if you are used to writing C programs, for example, or Fortran or C++, which are essentially sequential computation, you obviously need to find a way of taking advantage of all the threads that you can have running in parallel, and that is not so trivial. The good thing is that there are already several higher-level programming languages that allow you to express this parallelism in an easier way — something like OpenACC, which follows a similar paradigm to OpenMP. You might be more familiar with OpenMP: you have a for loop, and before this for loop you just write a pragma, parallel for, and then the compiler does the work for you. Something like that also exists for GPUs; it is called OpenACC, and for sure you can achieve good performance with it for very regular workloads. If your program, your computation, your algorithm, is a little bit more irregular, then obviously it is going to be more difficult, and it would be better — it would be necessary — to go and directly program in CUDA or OpenCL.

And even though GPUs are super powerful, super fast, there are still some bottlenecks. The first bottleneck is the fact that the GPU is an accelerator, and it is typically connected to the system through PCI Express — or now, for example, NVLink, which is a fast kind of PCI Express that can achieve somewhat higher bandwidth — but this is still around one order of magnitude slower than accessing memory. And even accessing memory, accessing DRAM, from within the GPU cores is also a bottleneck. Why is that? Because even though these memories, GDDR5, GDDR6, and HBM2, can provide you something between 300 gigabytes per second and 900 gigabytes per second, this is still far from the peak throughput that you can achieve. And actually, something interesting: recall how I started with this GTX 285 and this Volta V100.
If you compare both — if you check the peak throughput in teraflops or gigaflops — the first one gives you, I don't remember the exact number, but it might be something around 100 gigaflops, compared to the 15 teraflops that the V100 provides. If you compare these two numbers with the memory bandwidth of these two GPUs, you will see that the ratio between them has increased, meaning that the cores — the number of cores and how powerful these cores are — have grown more, with respect to the available bandwidth, than 13 years ago. What does that mean? It means that, yes, GPUs are more powerful, they have much more bandwidth, but this memory bottleneck is even worse 13 years later.

This is something that you surely know, but it is always good to recall: what is the main difference between a CPU and a GPU? The main difference is how you are using the area that you have on your chip. In a CPU, you know that we typically have a few out-of-order cores — very powerful cores, but only a few of them — and a lot of this area, a lot of this die, is used for control, for example for speculation, branch prediction, and these sorts of things, and then we also have very big caches, deep cache hierarchies, in CPUs. On the GPU, most of the area is devoted to compute cores, typically in-order, fine-grained multithreaded cores; what we need for control and for caches is much smaller. Actually, this picture corresponds more to something old like this GTX 285, where we didn't even have an L2, or we couldn't use it for the general-purpose pipeline. Now we actually have an L2, but still, the size of these caches compared to typical sizes on CPUs is significantly smaller.

Okay, and this is the first slide on how we compute on the GPU, how we offload the computation to the GPU, in three steps. We usually start the computation on the CPU side, where we have the original data structure, or data structures, that we want to operate on. So the first thing to do is to transfer the input data from the CPU memory to the GPU memory. The second step is performing the computation: kernel execution, or multiple kernel executions. And finally we transfer the results back from the GPU memory to the CPU memory. These are the basic steps, especially for discrete GPUs, which are the most common ones: in any of these laptops, and especially in desktop computers, the GPU is discrete; it is connected to the PCI Express or, in the best case, to NVLink, so we will always have to do these transfers. These days it is more and more frequent to find systems-on-chip where, inside the same chip, we have integrated CPU, GPU, and maybe more accelerators, and they have access to the same memory space; by using that we can enable many more features, and if we have time we will talk about that at the end of the presentation. But still, what is the problem with those integrated GPUs? They are usually much less powerful than discrete ones.

And what is the typical program structure? I talk about the traditional program structure because that is what is sometimes called the original accelerator model, which follows exactly what I am explaining here: a discrete GPU where you
originally have your data in the CPU memory, you need to transfer it over the PCI Express bus to the GPU memory, and then you start the computation. That is the traditional accelerator model, and the program structure for this is going to be something like this — I think you have already seen this slide before. We start the computation doing some serial code, or modestly parallel code, on the CPU side, and then we launch a parallel kernel on the GPU. For this kernel we launch a certain number of thread blocks, and each of these thread blocks contains a certain number of threads; you can see them all here. When they all finish their execution, the kernel terminates, and then control is returned to the CPU, and then we might launch a second kernel, Kernel B in this case.

One thing you can notice in this slide is that two terms that we are going to use interchangeably with CPU and GPU are host and device. When we talk about the host, we are talking about the CPU; when we talk about the device, we are talking about the GPU. When talking about CUDA programming, or about OpenCL programming, it is better to talk about the device instead of the GPU, because in the end the device could be something different as well: as you may know, with OpenCL you can also program FPGAs. I don't think we will have time, but I have a few slides at the very end of the presentation about this, because the same role that the GPU has as the device in CUDA programming can be played by an FPGA in OpenCL programming.

Okay, so recall this: single program, multiple data, or single procedure, multiple data. This is a programming model rather than a computer organization; we were talking about this the other day. We are going to have multiple instruction streams executing the same program, and each program, or procedure, is working on different data and can execute a different control flow path. We will see how this is done, how we need to write these programs, what the implications are if we actually follow different control flow paths, and we will also discuss how, whenever possible, we can try to avoid these different control flow paths, or program them in a way such that we can still be more or less efficient.

CUDA and OpenCL follow, as I said in the beginning, this bulk synchronous parallel programming model. This essentially means that there is no global synchronization inside the kernel. When I talk about a kernel — something that I already said — we are talking about a function that is executed on the device, and the only way to synchronize all the threads that we have inside a kernel, all the threads that we have running a kernel, is by terminating the kernel; it is the only way of doing global synchronization. As we will see today, it is possible to do local synchronization inside a GPU core, inside a thread block, but the only way of doing global synchronization is by terminating the kernel. And why is that? There are several reasons, indeed, but one of the main reasons is that we want to write programs that can be used on any GPU. If you go to the NVIDIA website, or the AMD website, or Wikipedia, or wherever, and you look for the different models of GPUs, you will see that there are many different models, from very low-end to
high-end devices — from, let's say, two GPU cores to 80 GPU cores — and we want to write our programs once. So the way we write our programs is by saying: okay, I need to perform computation on this huge matrix and I am going to use 1,000 thread blocks. But now, the way these 1,000 thread blocks are actually going to be executed on the hardware will depend on the number of cores that we have. We might have a kernel where we use eight thread blocks, as we have here, and depending on how powerful our device is, we will be able to do something like this, or something like this: here, for example, this is a device with two cores, and that one is a device with four cores. The way these blocks are scheduled onto the cores depends on the number of cores that we have. And actually, even though you can see in the slide that first we schedule blocks 0 and 1, then 2 and 3, then 4 and 5 — that they are scheduled in order — we don't even have that certainty; we don't really know how they are scheduled. We cannot assume that block 0 is going to be scheduled before block 7. It is true that the typical internal scheduling of thread blocks is round-robin — that is actually something that I also mentioned the other day at the end of the presentation — but there is no guarantee of that. So if we want to make sure that these eight blocks have finished, the only way of doing that is terminating the kernel and then launching a new kernel. So if, eventually, we need threads in this block to use intermediate data produced by threads in that block, we will have to terminate the kernel and launch a new kernel, maybe again with eight new blocks; there is a small sketch of this pattern below.
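As a minimal sketch of that pattern — with hypothetical produce/consume kernels and an intermediate buffer d_tmp, names of my own choosing rather than from the slides — the end of the first kernel acts as the only global barrier, so every block of the second kernel can safely read whatever any block of the first one wrote:

```
// Hypothetical kernels: any block of 'consume' may read data written by
// any block of 'produce', so a kernel boundary is required between them.
produce<<<num_blocks, threads_per_block>>>(d_in, d_tmp);
// Kernel launches on the same (default) stream run in order: all blocks of
// 'produce' are guaranteed to have finished before 'consume' starts.
// There is no intra-kernel global barrier that could replace this.
consume<<<num_blocks, threads_per_block>>>(d_tmp, d_out);
cudaDeviceSynchronize();   // the host waits until both kernels have completed
```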
Okay, going back very quickly to the previous slide: these are very common terms. You have the CUDA term in blue, and in parentheses you have the OpenCL term, and they mean exactly the same thing. When we talk about the grid, we are talking about the whole set of blocks that run a kernel, and when we talk about a block, it is a certain number of threads that are going to run inside the same GPU core. Within a block, as you can see in the slide, the threads belonging to the same block can exchange or share data using an on-chip memory that is called shared memory, and they can also synchronize — this is the local synchronization that I was talking about. And then we have the threads, which are called work-items in the OpenCL model.

Okay, about the memory hierarchy. This is kind of an abstract view of the memory hierarchy, and if you observe it, it is a nice picture, but it is kind of a mix, because on the one hand we have the memories, which are, let's say, hardware — for example the registers, the shared memory, or the global memory, which is the DRAM — but then we also have the software concepts: we have the entire grid, we have the block, and inside the block we have threads. Still, the slide is nice because it shows very clearly what the different memory spaces are that threads can access. Assuming, for example, blocks of only two threads, each of these two threads can access its own registers — recall the pipeline register file, where we have the different SIMD lanes — and the whole register file is partitioned, and part of this register file, a certain number of registers, as many as they need, is assigned to the individual threads. From one thread I can access my own registers, but in principle I cannot access other threads' registers. The threads can, however, exchange data by using the shared memory, and all the threads running on the GPU can access data in DRAM, in the global memory. The device memory is all this orange part here, and from the programming perspective it can be divided into several parts: for example, constant memory is a memory area that you allocate for read-only data; global memory is the general-purpose one for reads and writes; and if you want to use textures, or surfaces, which are typically used in graphics programming, you can also allocate space for that, but it will be in the same device memory. Observe that this device memory can be accessed by the threads running inside the GPU, obviously, but it can also be accessed by the host. Why is that necessary? Because we need to transfer input data from the host to the GPU and results from the GPU to the host.

More about the traditional program structure: now we start with the typical declarations, function prototypes, and APIs. This main function is going to run on the host, on the CPU. The first thing to do, if we want to do computation on the GPU side, is to allocate memory on the device using cudaMalloc. You are all familiar with malloc; in CUDA we have cudaMalloc. The syntax in OpenCL is a little bit more complex, but in the end we have the same kind of thing. Observe that with this cudaMalloc what we are doing is allocating a certain number of bytes, and we are going to use a pointer to access those bytes. Then we need to transfer data from the host to the device using the cudaMemcpy API: this is the pointer to the CPU memory, this is the pointer to the GPU memory, and — you will see it in the next slide — we also pass the total size that we are transferring, of course, and the direction of the transfer, because the API is the same for transferring from the CPU to the GPU and from the GPU to the CPU. Then we need to set up the execution configuration, which essentially is: how many threads do I need, and how many blocks do I need? These are two numbers that we usually determine based on the characteristics of the data structure that we need to operate on. For example, the way we typically map matrix multiplication onto the GPU is by assigning the output elements to threads. If we need to compute a 1,000 × 1,000 output matrix, what we typically do is assign one thread per element of the output matrix, so in this case we would be talking about 1 million threads, and each of these threads is going to read one row of matrix A and one column of matrix B, perform the dot product, and store the result in the corresponding element of C. So the total number of threads that we need for this operation is 1 million, and now these 1 million threads need to be divided, need to be grouped, into thread blocks.
You might recall that the other day I mentioned that typical thread block sizes are multiples of 32; typical numbers are something between 128 and 512. So, for example, if you are using 512 threads per block and you have 1 million elements to compute on, what would you do — how many thread blocks do you need to launch? It is an easy calculation. But as you can see, the way we write the program is as a function of the output size, in this case. The next thing to do is to call the kernel; this is the kernel launch. Inside this part we have the execution configuration, so here we will have the number of thread blocks that we launch, here the number of threads per block, and then the arguments. What are the arguments going to be? For example, in our matrix computation, the arguments would be at least A, B, and C. When doing GPU programming we usually use a d_ prefix — d_A, d_B, d_C — in order to distinguish them from the copies that we have on the CPU side, which will typically be h_. The next thing, when we finish the computation and the kernel terminates, is to transfer the result from the device memory to the host memory. Observe that the destination is this h_out on the CPU side, the source is this d_out in the device memory, and here we will also indicate that the transfer is from the GPU memory to the CPU memory. And we can repeat this as needed — it will depend. There are some algorithms that are iterative, for example graph processing: in graph processing we go over the graph in multiple iterations, and for each of these iterations we might be launching exactly the same kernel. Or there might also be programs using multiple different kernels, and for each of the kernels we will obviously have a different kernel definition, but in the end what we are going to do is something similar.

Okay, and as you can see here on top, but also here as well, when we write a kernel, a GPU function, we need to qualify it with this __global__ qualifier, to identify it and to distinguish it from a regular CPU function, like that float serial function that you have on top. We are not going to enter into the details of compilation and these things, but for you to know: when you write a program, you will see that CUDA programming is very similar to C programming. Whenever you use automatic variables inside your kernel, these are typically going to be mapped onto registers, but we can also define shared memory, and essentially for this we need to use a special keyword inside the kernel function, inside the kernel code: we declare __shared__ and then an array of a certain size, in the same way as we do in C, for example. And inside one block, if we want to synchronize, we use __syncthreads(), which is a local barrier. Here you have the exact definitions of all these APIs: cudaMalloc, cudaMemcpy — observe this special keyword here, cudaMemcpyHostToDevice, which indicates a data transfer from the CPU to the GPU; if it is the other way around, it will be cudaMemcpyDeviceToHost — then for deallocation we have cudaFree, and for explicit synchronization we use cudaDeviceSynchronize.
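Putting these pieces together, here is a minimal sketch of the traditional program structure for the matrix multiplication example discussed above, with one thread per output element. The matrix size, the block shape, and the variable names (h_A, d_A, and so on) are just illustrative choices, and error checking is omitted:

```
#include <cuda_runtime.h>
#include <stdlib.h>

// Kernel: one thread computes one element of the N x N output matrix C.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; k++)
            acc += A[row * N + k] * B[k * N + col];   // row of A dot column of B
        C[row * N + col] = acc;
    }
}

int main() {
    const int N = 1000;                                   // 1000 x 1000 -> ~1M threads
    size_t bytes = N * N * sizeof(float);
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    // ... initialize h_A and h_B on the host ...

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);                              // 1. allocate device (global) memory
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // 2. copy inputs CPU -> GPU
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                                   // 256 threads per block
    dim3 grid((N + block.x - 1) / block.x,                // enough blocks to cover the output
              (N + block.y - 1) / block.y);
    matmul<<<grid, block>>>(d_A, d_B, d_C, N);            // 3. launch the kernel (asynchronous)
    cudaDeviceSynchronize();                              //    wait until the kernel has finished

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);  // 4. copy results GPU -> CPU
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```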
Why do we need this cudaDeviceSynchronize, and where do we use it? This is not so important, but it is interesting to mention. We use this API in the host code, typically right after the kernel call, in order to make sure that the kernel has finished its execution. And why do we need to do that? Because it is the only way we can make sure that the kernel terminated. Why is that? Because the kernel call is asynchronous, meaning that when the host code reaches this point, what the host does is call the GPU driver and tell it: hey, please start running this kernel on the GPU. But right after that, the CPU thread is free, so it can continue computing, doing whatever — maybe arithmetic operations — right after that. So if you eventually need to make sure that the kernel terminated, the specific API that you need to use is this cudaDeviceSynchronize. It is not always necessary, but it is good to know that it exists. As you can see, it is not like a global barrier inside the GPU code — those actually don't exist in the bulk synchronous parallel programming model.

Okay, let's start thinking about how we need to map the computation inside our programs, by having as a running example image processing: simple processing on an image, which is a two-dimensional data structure. As you know, images have a certain height and a certain width; in this case our beautiful image is 8 × 8 in size. And as you know, when we want to refer to any of these pixels, since it is a two-dimensional data organization, what we typically write is image[j][i], j being the row and i being the column. So, for example, this one here is image[2][1], this one is image[0][1], and this one is image[1][2]. But how is this image mapped onto memory? The way this image is mapped in memory, in CUDA and in OpenCL, the same as in C, is a row-major layout, meaning that the entire image is stored in memory as a linear array, and we store it row by row. The distance between this element and this element is a stride that actually has the same value as the width of the image. So whenever we are talking about image element (j, i), in memory it is actually image[j * width + i]. You see, for example, this is image[0][1], so it is actually element 0 * 8 + 1, and this one is image[1][2]. So if we want to access this image inside the GPU code, inside the kernel, what we need to keep in mind is how it is actually stored in memory.
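As a tiny illustration of the row-major layout for the 8 × 8 example image (the variable names here are just for the example):

```
/* Row-major layout: the 2-D image is stored row by row in a 1-D array,   */
/* so element (j, i) lives at offset j*width + i.                         */
int width = 8;                    /* width of the example 8x8 image       */
int j = 1, i = 2;                 /* row 1, column 2                      */
int flat = j * width + i;         /* image[1][2] -> 1*8 + 2 = offset 10   */
```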
The first way of doing it, in this slide, is by using a one-dimensional grid. When we talk about the grid, it is the total number of thread blocks — recall, the total number of work-groups — and this grid can be one-dimensional. When it is one-dimensional, it means that our thread blocks have only one dimension, and we are going to have a certain number of threads inside each thread block. There are some built-in variables that we can use inside the GPU code — let me show it; okay, where is my cursor, here — so inside the GPU code we can use these built-in variables: gridDim.x is the number of thread blocks that the grid contains in the x dimension; blockDim.x is the number of threads inside each thread block in the x dimension; and then, for both blocks and threads, we have an index, so blockIdx.x represents the index of the block and threadIdx.x is the index of the thread. So, for example, one way of distributing this whole image among blocks and threads is by assigning consecutive chunks of pixels to consecutive thread blocks. For example, we can assign these four pixels to block 0 — of course this is a toy example; we will typically have much bigger images and much bigger blocks — but in this case we consider that each of the blocks is composed of four threads, so these four pixels are assigned to block 0, and this is how we are going to distribute these four pixels across the four threads: whatever computation is related to this element will be done by thread 0, anything related to this element will be done by thread 1, and so on. Now, if we want to know which thread is, for example, accessing or computing on this element, what we need to use is the block index, the block dimension, and the thread index. This guy here is pixel number 25, and thread number 25 in the whole GPU is using it for computation: in this case this would be thread block number 6, the size of the thread block is 4, and inside the thread block it is thread 1 that is going to work on it. So that is an example for one-dimensional blocks, and this one is for two-dimensional blocks, which is, let's say, more suitable for image processing, matrix processing, and so on, because those are two-dimensional organizations. It is also possible to use 3D grids and 3D blocks as well, but it is exactly the same kind of thing. In this case we have two dimensions of the grid, the x dimension and the y dimension; observe that in this particular case gridDim.x is 4 and gridDim.y is 4 as well, and inside each block, blockDim.x is 2 and blockDim.y is 2. The block that is going to operate on this tile of the image is, in this case, block (2, 1), and we can assign threads to each of these individual pixels; that guy, for example, would be threadIdx.x = 1, threadIdx.y = 0. If we want to access a specific element in a specific row and column of the image, the expressions that we need to use are these ones: you see, blockIdx.y * blockDim.y + threadIdx.y to determine the row. I hope this is clear; if it is not, it will take you 30 more seconds to look at it yourself.
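As a sketch of how these built-in variables are typically combined inside a kernel — the per-pixel operation here is just a placeholder of my own, not from the slides:

```
// 1-D grid of 1-D blocks: one global index per pixel of the flattened image.
__global__ void brighten_1d(unsigned char *image, int num_pixels) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // e.g. block 6, thread 1 -> 6*4 + 1 = 25
    if (idx < num_pixels)
        image[idx] += 10;                              // placeholder per-pixel work
}

// 2-D grid of 2-D blocks: recover the row and column, then flatten row-major.
__global__ void brighten_2d(unsigned char *image, int height, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height && col < width)
        image[row * width + col] += 10;
}
```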
Okay, let's continue with a very brief review of the GPU architecture. Recall that we were comparing this GTX 285 with the Volta V100: the number of GPU cores is different, and the number of SIMD lanes inside each of the GPU cores is also different. This is an example of the internals of each of the cores of the GTX 285: you can see that inside this one what we have is eight of these SPs, eight of these SIMD lanes. So what do we need to do to run, to execute, one warp of 32 threads on top of these eight lanes? We have 32 threads in each warp; when this warp scheduler that we have here issues a specific instruction for a specific warp, we first start the computation for eight threads, then another eight threads, then eight, then eight: every cycle we can start computing for eight threads. This corresponds to the old Tesla architecture that started around 2006. And this is a picture that you already saw the other day: this is the NVIDIA Fermi architecture from 2010. In this case, observe that in total we have 32 of these SPs, 32 SIMD lanes, but the way they are actually used is 16 on one side and 16 on the other side. Notice that we have two warp schedulers there, fetching instructions from the instruction cache and issuing each instruction either onto these 16 lanes or onto those 16 lanes, so in this case we need two cycles to start the computation for the whole warp — you see the difference. In more modern GPUs we directly have 32 lanes, so in one single cycle we can start the computation for the entire warp.

Recall that thread blocks are mapped onto the GPU cores, the streaming multiprocessors, and there they are divided into these warps of 32 threads, something like this. Whenever the warp scheduler fetches a new instruction, it is going to fetch an instruction that can actually be executed — one with no pending dependencies. If, for whatever reason, we need to wait until the previous instruction finishes, in order to have a partial result as an input operand of the next instruction to fetch, that instruction won't be eligible. The warp scheduler internally uses scoreboarding, which is essentially a table of the instructions that are ready for execution, and the instructions that are ready for execution are those that have their operands ready. This is the way of controlling the internal dependencies inside the GPU program.

Okay, more terminology — I think that we have already gone through all this terminology — and here you have the comparison across the different NVIDIA generations. The latest one in the slide is Volta; the last one, Turing, is not in the slide, and I don't think there is a Turing GPU with this many cores yet. Anyway, that is not so important; in the end the architecture is still very similar to this Volta one, but you can see the whole evolution here, and the nice choice of names for the different architectures. I think we are done with the basics. Do you have any questions so far? No? Everything is clear? That's great. I think we can start with the performance considerations and maybe have a single break afterwards — is that okay? Okay, good.

Okay, so, performance considerations. As I said in the beginning, the main bottlenecks of the GPUs are, first of all, the data transfers between the CPU and the GPU, and second, the global memory accesses: in the end, accessing DRAM is where we are going to spend most of the execution cycles when we are doing GPU computing. For the CPU-GPU data transfers, we will see how we can deal with them, how we can try to optimize them; the memory accesses are the first thing we are going to talk about. We will first talk about the concept of occupancy, which is very important for the latency hiding mechanism that the GPU has.
This latency hiding mechanism is essentially based on the fine-grained multithreaded architecture that the GPU has. Then, another important consideration is how we access memory: we need to have coalesced memory accesses, and we will see what that means. And then we will also see how to use the shared memory that we have — it is an internal memory; actually, let me go back very quickly: it is right here, this shared memory, which (maybe not in all GPUs) is typically the same memory structure as the L1 cache, and it is internal to the streaming multiprocessor, the GPU core; that is why all these units that we have here can access this shared memory. It is a scratchpad, on-chip memory; it is a kind of software-programmable cache: its role is to be used as a cache, but it is something that the programmer can explicitly manage. It is not like a regular hardware cache, where you just let the hardware work and don't have to worry about it; here you can control it explicitly. So we will see how to use it. We will also go back to something that we actually started to discuss at the end of the lecture the other day, which is divergence inside a warp, and how to deal with it. We will also talk about atomic operations, which are one of the sources of inefficiency in the thread execution inside the GPU, because we need to have serialized execution, especially in the cases where two threads that are running concurrently need to update the same element in memory using an atomic operation. And finally, we will see what we can do about the data transfers.

Okay, so let's start with the memory accesses. As I said, GPUs are fine-grained multithreaded architectures; here the multiple threads are actually multiple warps, and what we do is have warps executing, running on the cores, on the SIMD lanes, and these warps can somehow hide the latency of other long-latency operations, like, for example, accesses to global memory. A very naive toy example: if you have, let's say, four active warps inside a GPU core, you can be executing instructions for the different warps — for example instruction 3 for warp 0, instruction 2 for warp 2, and so on — and suddenly warp 0 needs to execute this instruction 4, which is an access to global memory, to read an entire cache line, for example. So, if we have enough warps running inside the GPU core, we can continue executing instructions for other warps. Of course, for warp 0 we will likely have to stall until we get the values from DRAM, from the global memory, and then we can continue with the execution of instruction 5. This instruction 5 will only be ready for issuing — eligible, as I said before — after instruction 4 completes, assuming the data read from memory by instruction 4 is used as an operand by instruction 5. But in the meantime, as you can see, we can hide this latency, and for that, in this particular toy example, we need four active warps. What might happen if we only have two active warps in this toy architecture is that at some point we have this instruction 4, and in the beginning we can continue with the execution of instruction 3 for warp 1, but then instruction 1 also needs to access memory, so we will have all this gap that we cannot hide.
That's why it is important to have a large enough number of instructions ready for execution, because this way we can hide the long latencies. The problem here is that the typical way of having this number of instructions ready for execution is having enough active warps inside the GPU core, and the thing is that we cannot always have the maximum number of active warps inside the GPU core, because there is obviously a limitation on the maximum number of warps inside the GPU core. Any idea what this limitation is? Don't think about any program, just think about the architecture; think about a fine-grained multithreaded architecture: we can have up to a maximum number of warps... No, actually not: the instruction cache is not a problem, because all these warps are running the same program. If you review the slides that I showed you the other day, one of the things that determines the maximum number of threads in a generic fine-grained multithreaded architecture, or the maximum number of warps in the GPU, is the number of PCs, the number of program counters, that you have. As you can see, all these four warps are running the same program, but at a particular point in time each one is on a different instruction: here it is instruction 4 for warp 0, here it is instruction 1 for warp 3. So each of them needs its own program counter, and in the end a program counter is a register, and we have a maximum number of program counter registers inside the GPU core. That is essentially what limits the maximum number of warps that we can have inside each of the GPU cores.

But then there are more limitations, because depending on the program, the amount of the different resources that these warps need is different. We might have programs where we need, let's say, two registers per thread (that will depend on how good the compiler is as well), so with warps of 32 threads this means 64 registers per warp. But there are other programs where we might need four registers per thread, and in that case it is twice the number of registers that we need — and obviously the total number of registers that we have inside the GPU core, the size of the register file, is also limited. The same happens with the shared memory: this scratchpad memory inside the GPU core has a certain size, typically 64 kilobytes, and these 64 kilobytes need to be partitioned and assigned to the different thread blocks that are going to be mapped onto the GPU core. So these are typical values for the SM, the GPU core, resources: the maximum number of warps is 64, which is related to the maximum number of program counters; the maximum number of blocks per SM has its own limitation, which is 32; for register usage, the size of the register file is typically 256 kilobytes; and for shared memory usage it is 64 kilobytes. Depending on the usage that each individual thread and each individual thread block makes of these resources, we can have more or fewer thread blocks and warps running at the same time inside the GPU core. So now imagine that, for your program, each thread block needs to use 30 kilobytes of shared memory: how many blocks can you run inside the same SM? Only two.
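As a back-of-the-envelope sketch of this kind of occupancy reasoning, assume a hypothetical kernel that uses 256 threads per block (8 warps), 32 registers per thread, and 30 KB of shared memory per block; these per-kernel numbers are mine, purely for illustration, while the per-SM limits are the typical values just mentioned:

```
/* Typical per-SM limits from the slide:                                    */
/*   64 warps, 32 blocks, 256 KB register file, 64 KB shared memory.        */
int max_blocks_by_smem  = (64 * 1024) / (30 * 1024);       /* = 2 blocks               */
int max_blocks_by_regs  = (256 * 1024 / 4) / (256 * 32);   /* 65536 / 8192 = 8 blocks  */
int max_blocks_by_warps = 64 / 8;                          /* = 8 blocks               */
int resident_blocks     = 2;      /* the minimum of the three: shared memory is the limiter */
int occupancy_percent   = (2 * 8) * 100 / 64;              /* 16 of 64 warps = 25%     */
```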
So if I have four SMs in my GPU, I could actually be running eight blocks at the same time. If in the execution configuration my program launches 200 thread blocks, I can only have eight of them resident, and then another eight, and another eight, and so on. For sure, the GPU has a very smart thread block scheduler, and as soon as one of these thread blocks finishes, completes its execution, there is a free slot, and the scheduler is going to put a new block there and start running it; that is how it is going to work. Also observe that this occupancy — you actually had the definition in the previous slide; the occupancy is essentially the percentage of active warps that we have — this occupancy, for the same program, might also change depending on the actual platform. Why is that? Because if you compare different generations of GPUs, you will see that these resource numbers have changed a little bit over time, typically increasing. So these are the things that you have to take into account when you want to do the occupancy calculation, and in principle one of the basic rules for the optimization of GPU programs is to try to increase the occupancy. In the end it is not always the best practice: for example, if your workload is very, very memory bound, with many, many accesses to global memory, it might be good to actually reduce the occupancy a little bit, because this way we won't put so much pressure on the interconnect and on the memory subsystem. But, let's say, as an initial good practice for GPU optimization, increasing the occupancy tends to be a good thing. In the end it is a lot of exploration as well: writing your program, trying different numbers of threads per block, trying different execution configurations, and playing with these numbers quite a bit.

Okay, the second important consideration for the memory accesses is memory coalescing. What does it exactly mean? It means that when accessing global memory, we should try to make sure that concurrent threads access nearby memory locations; that is, let's say, the most basic definition of memory coalescing, and it is actually the way of achieving peak bandwidth: when all threads inside a warp access one single cache line. Regarding the size of the cache line, the typical value of the cache line size in the L1 is 128 bytes, and in the L2 it is 32 bytes, so every time the L1 requests one cache line, it is actually four cache lines in the L2. And if you think about 128 bytes, what does that mean for a single warp? It means four bytes per thread, and four bytes is one single-precision floating-point value, or an integer — so those are reasonable values.

So what is a coalesced memory access? It will be something like this. For example, I could write a program where I assign this entire row to thread 1 and this entire row to thread 2, or I can write my program so that I assign this column to thread 1 and this column to thread 2. What is the difference between them? The difference is that when these two threads start the execution of the program, the first thing to do is to read this element and this element, and if you look at these two elements, they are far away from each other — how far depends on the stride, which is the width of the matrix. If this matrix is very small, that might be okay; these two elements might even be in the same cache line.
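In code, the two assignments look something like this — a sketch with a hypothetical row-major matrix and per-thread sums; only the access pattern matters here:

```
// Row-major matrix of size height x width. Each thread processes one "line" of it.

// Uncoalesced: thread t walks along row t, so in every iteration the threads
// of a warp touch elements that are 'width' floats apart (different cache lines).
__global__ void sum_rows(const float *m, float *out, int height, int width) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (t < height) {
        float acc = 0.0f;
        for (int i = 0; i < width; i++)
            acc += m[t * width + i];                 // stride of 'width' between neighboring threads
        out[t] = acc;
    }
}

// Coalesced: thread t walks down column t, so in every iteration the threads of
// a warp touch consecutive elements of the same row, i.e. the same cache line(s).
__global__ void sum_cols(const float *m, float *out, int height, int width) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per column
    if (t < width) {
        float acc = 0.0f;
        for (int j = 0; j < height; j++)
            acc += m[j * width + t];                 // neighboring threads -> consecutive addresses
        out[t] = acc;
    }
}
```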
Here is a little bit more detail about what an uncoalesced memory access is. Imagine that I assign the rows to individual threads; in this case the matrix is only 4 x 4, but now imagine it is 4,000 x 4,000. When we talk here about a time period, in reality we mean an iteration: in the first iteration thread 1 reads this, thread 2 reads this, thread 3 reads this, and these elements are very far away from each other, in different cache lines, so I need three — actually four — memory transactions to satisfy this first load instruction. In the second iteration I access the next element, so that is again four memory transactions. The good thing is that, at least these days, GPUs have caches of reasonable size, so it is likely that in this second iteration thread 1 can directly fetch the data from the L1; in the past this was worse, but it is still a very important performance consideration. In a coalesced memory access what I do instead is assign threads to the columns, so in the first iteration I have consecutive threads accessing consecutive elements, and in the same iteration I may only need to fetch one cache line — if this is a warp and each of these elements is four bytes, it is one single cache line; if it is double precision it will be two cache lines, but they are one after the other. Then the second iteration, and so on.
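As a sketch of these two assignments — the kernel names and the row-major layout are my own assumptions for the example — the only thing that really changes is the index expression used by neighboring threads:

```cuda
// Each thread walks one ROW of a row-major matrix: in iteration i, thread t
// reads in[t * width + i], so neighboring threads are 'width' elements apart
// -> different cache lines, uncoalesced.
__global__ void sumRows(const float *in, float *out, int width, int height) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= height) return;
    float acc = 0.0f;
    for (int i = 0; i < width; ++i)
        acc += in[t * width + i];
    out[t] = acc;
}

// Each thread walks one COLUMN: in iteration i, thread t reads
// in[i * width + t], so neighboring threads touch consecutive addresses
// -> same cache line, coalesced.
__global__ void sumCols(const float *in, float *out, int width, int height) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= width) return;
    float acc = 0.0f;
    for (int i = 0; i < height; ++i)
        acc += in[i * width + t];
    out[t] = acc;
}
```

In `sumRows` a warp's 32 loads in one iteration land on 32 different cache lines when `width` is large; in `sumCols` they can be served from a single 128-byte line.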
And here you can actually see the difference between two ways of organizing the data. This is important — recall the slide about the disadvantages of SIMD processing: one of the things it says is that you need to be smart when you arrange data in memory, to make it SIMD-friendly, and in this case GPU-friendly. These are two different ways of organizing, in this case, an array of structures versus a structure of arrays; the data is exactly the same. We have two possibilities here: this struct contains four arrays, a, b, c and d, while here what we have is an array of structures, and in each struct we have one element of a, b, c and d. If you think about this and about a typical program, which one is more favorable, more desirable, for GPUs? This one, the structure of arrays, because if I have multiple threads operating on it, at the beginning of my program I want my threads to access elements of a, so I will have thread 0 here, then 1, 2, 3, 4 and so on, accessing consecutive memory locations that belong to the same cache line; while here there is already a stride between these accesses, so more likely I will need more memory transactions. In the next slide you can see a comparison of accessing these two kinds of organizations, where what we have changed is the structure size. For the array of structures, the larger the structure, the longer the stride across threads, and you see how the achieved, sustained bandwidth goes down very quickly. If you compare that to a CPU it is rather different — the CPU curve looks a bit more irregular, but you can see that in general CPUs are fine with larger structures. Why? Because on a CPU you typically have one single thread, or maybe two, and each of them has its own individual cache, so if my thread reads this element, I bring all this data into the cache, and next I will have hits in the cache when accessing the rest of the elements of the structure. You can find more details about this kind of analysis in this paper.
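Here is a hedged sketch of the two layouts, using the a/b/c/d fields from the slide; the struct names and the kernels are made up for illustration:

```cuda
// Array of Structures: one thread reading field 'a' of "its" element is
// separated from its neighbor's read by sizeof(Particle) bytes -> strided
// accesses, more memory transactions as the struct grows.
struct Particle { float a, b, c, d; };

__global__ void scaleAoS(Particle *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].a *= 2.0f;        // 16-byte stride between neighboring threads
}

// Structure of Arrays: same data, but all the a's are contiguous, so a warp
// reading a[i] touches consecutive 4-byte locations -> coalesced.
struct Particles { float *a, *b, *c, *d; };

__global__ void scaleSoA(Particles p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.a[i] *= 2.0f;        // consecutive addresses across the warp
}
```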
Okay, next thing — actually, what time is it? I think we can have a break; let's have a 10-minute break, is that okay? ... Any questions so far, or any more questions? Okay, I think we can continue. Are you ready?

Okay, next thing: data reuse. We are going to discuss now how to make use of this scratchpad memory, this shared memory that we have inside each of the GPU cores. Observe, for example, this image here and that green tile we have there. Think about the typical filters applied in image processing — and not only in image processing; in deep neural networks the convolutional layers usually do something like this too. We typically have a filter — this Gaussian thing is a small two-dimensional data structure with certain values — and we need to calculate the convolution for the image pixels that are covered by this green window that represents the filter. The computation that we need to do for each individual pixel is going to be something like this; this is the sequential code for this particular case, which, as I said, is Gaussian filtering. What we are doing here is accessing all these nine elements: we take the nine elements from the image, we take the nine elements from the filter, and then we do the computation. But now think about the computation that needs to be applied for this pixel here: it is actually the same thing, just with this window of nine elements here instead of these nine. So the point is that there is a lot of reuse across the threads that are computing on these two pixels — in particular, out of the nine elements, six are shared — and that is the data reuse we have to take advantage of. We don't want to go again to the global memory, which takes several hundred cycles to fetch a cache line, in order to do the computation: if we read this element for thread 1, we want to reuse it quickly for thread 0. So what we typically do with this kind of computation is tiling, which essentially means that we divide the entire image into pieces, chunks, tiles of a certain size, and each of these tiles, in the GPU programming model, is typically going to be assigned to one entire thread block.

So now think about a two-dimensional thread block of four threads: this is thread (0,0), this is (0,1), (1,0) and (1,1), computing the filter output for these four pixels. These four threads will need to read all these values from global memory and store them somewhere where they can be accessed much faster, and this somewhere is the shared memory, the scratchpad memory — something the programmer can control. For that we need to declare an array in the shared memory — in this case l_data, local data, of this size — and then we have to synchronize. Why do we need to synchronize? To make sure that we have finished loading the entire tile into the shared memory; then we can run the computation, directly reading the tile of the image from the shared memory. Again, the reason we need to synchronize is that this thread block is typically composed of multiple warps, and as you know, warps are scheduled independently of each other. So, in this example, let's assume that my warps have two threads: then here I have two warps, and these two warps need to go and read the entire tile — warp 0 will be reading, for example, this part and warp 1 this part — and the only way to know that both have finished loading from global memory into the shared memory is by using this local barrier, this __syncthreads. That's why we need to use it.
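Here is a minimal sketch of this tiled pattern for a 3x3 filter. The tile size, the clamp-at-the-border policy and the filter coefficients in constant memory (which you would fill with cudaMemcpyToSymbol) are assumptions for the example, not the exact code on the slide; the block is launched with dim3(TILE, TILE) threads:

```cuda
#define TILE 16      // output tile computed by one 16x16 thread block
#define R    1       // filter radius (3x3 filter)

__constant__ float filt[3][3];   // filter coefficients, filled from the host

__global__ void gaussian3x3(const float *in, float *out, int width, int height) {
    // Shared tile = output tile plus a 1-pixel halo on each side.
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];

    int outX = blockIdx.x * TILE + threadIdx.x;
    int outY = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load of the (TILE+2)x(TILE+2) input window, halo included.
    for (int y = threadIdx.y; y < TILE + 2 * R; y += blockDim.y)
        for (int x = threadIdx.x; x < TILE + 2 * R; x += blockDim.x) {
            int gx = blockIdx.x * TILE + x - R;
            int gy = blockIdx.y * TILE + y - R;
            gx = min(max(gx, 0), width - 1);   // clamp at the image border
            gy = min(max(gy, 0), height - 1);
            tile[y][x] = in[gy * width + gx];
        }
    __syncthreads();   // every warp must finish loading before anyone computes

    if (outX < width && outY < height) {
        float acc = 0.0f;
        for (int fy = 0; fy < 3; ++fy)
            for (int fx = 0; fx < 3; ++fx)
                acc += filt[fy][fx] * tile[threadIdx.y + fy][threadIdx.x + fx];
        out[outY * width + outX] = acc;
    }
}
```

Each input pixel inside the tile is loaded from global memory once and then reused by up to nine neighboring threads out of the shared memory.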
Okay, so this is an example of using the shared memory, a typical case of shared memory usage, but the shared memory also has its own performance considerations. Why? Because the shared memory is a banked memory: it has multiple banks, typically 16 or 32, a number that is related to the warp size. So what is the perfect access to the shared memory by a warp? How is it determined? You already know how multi-bank, or interleaved, memories work: we want to avoid bank conflicts; we don't want two accesses to the same bank, because then the accesses are serialized. If we have a warp of 32 threads, what's the ideal situation? Each thread accessing a different bank, so that each bank can service one address per cycle. There are typically 32 banks, in line with the number of threads in the warp, and the way addresses are mapped onto the banks is something like this, where 32 is the number of banks — we already saw this expression in the previous lecture. And this is interesting as well: bank conflicts are only possible within a warp. That makes sense, because accesses to the shared memory from different warps are scheduled at different times, so warp 0 and warp 1 cannot conflict; the actual conflicts happen among the threads that belong to the same warp. Some example access patterns: no bank conflicts — in this particular case we have 16 threads and 16 banks — if the stride is one, we have linear addressing and no conflicts at all. What's the access here? Say your array a holds a0, a1, a2, ..., up to a15, and I have 16 threads: thread 0 accesses a0, which is located in bank 0, thread 1 accesses a1, which is located in bank 1, and so on. The width of these banks is four bytes, which also matches the cache line size nicely (32 banks times 4 bytes is 128 bytes). Here is another example: we don't always need linear addressing, we can also have random or other addressing; the only important thing is that all 16 threads belonging to the same warp access different banks, because that is how we exploit the maximum bandwidth of the shared memory.

And this is an example of n-way bank conflicts — for example, a two-way bank conflict. When does it happen? When we have a stride of two: thread 0 accesses a0, thread 1 accesses a2, thread 2 accesses a4, and so on. This is what happens when we write in our code a[2 * threadIdx.x]: if we do that, the stride is two, and we are going to have a two-way bank conflict. Think about it: two threads going to the same bank means we are reducing the effective bandwidth by two, and it can of course be even worse — this is an eight-way bank conflict. So how can we reduce shared memory bank conflicts? Of course the layout is very important, but in some cases we might really need thread 0 to access element 0 and thread 1 to access element 2, because that's the computation I need, or want, to carry out. There are optimization techniques we can apply, the simplest one being padding. Let me give you an example of padding here using the blackboard, if I can find what I need. Okay, imagine that this is our shared memory: this is bank 0, this is bank 1, bank 2, and this is bank 31. My threads access something like a[2 * threadIdx.x], which means thread 0 accesses here, thread 1 accesses here, and so on. What's going to be the problem? The problem is that I have a two-way bank conflict. Why? Because I'm also going to have one thread accessing here — which thread? Thread 16, right? It accesses element 32, and recall that the specific bank where an address is allocated is the address modulo 32, so element 32 falls again in bank 0. Is that unavoidable? Of course it is, if we write the program that way, because we really need thread 16 to access this element 32: these are the characteristics of the computation, this is the way the addresses are mapped onto the banks, and this is the way the threads access the shared memory. So what can I do here? What does padding mean? Padding essentially means that I reserve extra memory locations for nothing, simply to avoid the bank conflict. What I could do is simply not use this slot and store a32 here, a33 here, and so on; in the next row the padded position is going to be this one. The good thing is that thread 16, instead of accessing here, is going to access here, in bank 1, where no one else is accessing, so I avoid the bank conflict and now I can get the maximum bandwidth from the shared memory. That's essentially the padding idea — of course there might be other ways of doing it, but that's the main idea. Is it clear? Okay.
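A small sketch of that blackboard example, assuming one warp of 32 threads and a stride-2 access; the PAD macro just skips one slot after every 32 elements, which is the same trick as the tile[32][33]-style padding you often see for 2-D shared-memory tiles:

```cuda
#define PAD(i) ((i) + (i) / 32)   // reserve one unused slot after every 32 elements

__global__ void strided(float *out) {
    // 64 elements accessed with stride 2: thread t and thread t+16 hit the
    // same bank (e.g. a[0] and a[32] are both in bank 0) -> 2-way conflict.
    __shared__ float a[64];
    int t = threadIdx.x;              // launch with <<<1, 32>>> to see the pattern
    a[2 * t] = (float)t;              // conflicting accesses
    __syncthreads();
    out[t] = a[2 * t];
}

__global__ void stridedPadded(float *out) {
    // Same access pattern, but with padding a[32] now falls in bank 1, so
    // the 32 threads of the warp all hit different banks.
    __shared__ float a[64 + 2];       // room for the padding slots
    int t = threadIdx.x;
    a[PAD(2 * t)] = (float)t;
    __syncthreads();
    out[t] = a[PAD(2 * t)];
}
```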
There have also been different proposals to deal with bank conflicts. I think I already mentioned this randomized mapping, from the paper by Rau at ISCA 1991 on pseudo-randomly interleaved memory. Essentially the idea is to use hash functions that do the mapping in a different way: instead of a linear mapping like address modulo 32, we can have more complex addressing, and if this addressing is handled by the hardware, its overhead is negligible and you can get a lot of performance improvement. This other work is something similar but applied specifically to GPUs; if you take a look at it, you will see that the different hash functions used there, configured for the specific GPU programs, have a really good impact on overall performance. The problem of using these hash functions, these fancy addressings, is that you as a programmer cannot reason as easily about what is happening: you may still hit corner cases where, when you profile your program, you detect bank conflicts, but now you don't really know how the data is mapped onto the banks, so it's much more difficult to reason about it and solve it yourself — although usually that won't happen very frequently. So in the end, things like padding can be pretty useful in these cases.

Okay, we are now done with the things related to memory access: we have talked about efficient access to global memory using coalesced memory accesses, about efficient data reuse taking advantage of the shared memory, and about what to keep in mind for an efficient access to the shared memory itself. Now let's talk about the pipeline itself, about SIMD utilization. The other day we already introduced the problem that we might have with the SPMD model. The good thing about the SPMD model is that I can write whatever I want: I can use control-flow statements, and I can have thread 0 going one way and thread 1 going another way. You have a clear example here: a branch and two different paths, A and B — depending, for example, on the value of data read from memory, or on the index of the specific thread, we go through path A or path B. The problem is that this produces branch divergence, and branch divergence translates into warp divergence, intra-warp divergence. Why is that a problem? Because the 32 threads of the warp are issued at the same time: the warp scheduler fetches one instruction from the instruction cache and then launches the 32 threads onto the eight SIMD lanes, or 32 SIMD lanes, or whatever number we have. They execute in lockstep; ideally they execute the entire instruction together over the whole pipeline, and that can be very efficient, as you can see for the first instruction in this picture. But if we have a branch, then part of them will go through one path and the rest through the other path, and essentially we need twice the number of cycles to do the computation here.
So, can we avoid that? The other day we saw a couple of proposals to deal with this from the hardware point of view, but unfortunately current GPUs still don't implement those fancy proposals, so what we can try to do is figure out how to write our programs in a nicer way to avoid these kinds of patterns. Let me give you a better example, in code, of what intra-warp divergence is. Something like this: we do some computation here — in principle there is no divergence in this initial part — but then at some point we have this if-else statement, and now we want even threads to do this and odd threads to do that. In the execution this would look like the following: if I have warps of 32 threads and these 32 threads are executing these compute instructions together, perfect, 100% utilization, that's fine; but then we want the even threads to do the if part and the odd threads the else part, and the problem is that half of the warp is unused here and the other half of the warp is unused there. Can we do something about it? Here is a possibility, a divergence-free execution: I change the mapping of threads to the data. Instead of having even threads doing this and odd threads doing that, consider that we have more than one warp — say 64 threads: the computation that was assigned to the even threads is now done by the threads in the first warp, and what was done by the odd threads is now done by the threads in the next warp. Something like this: when we reach the if-else statement, the if part is computed by these guys here while the else part is computed by these guys here, and if these two parts belong to different warps, then I no longer have divergence and I can have 100% SIMD utilization.
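Here is a sketch of what this remapping can look like in code — doThis and doThat are placeholder device functions, and the block size is assumed to be a multiple of 64 so that warps pair up cleanly:

```cuda
__device__ float doThis(float x) { return x + 1.0f; }  // placeholder work
__device__ float doThat(float x) { return x - 1.0f; }  // placeholder work

// Divergent version: even and odd threads of the SAME warp take different paths.
__global__ void evenOddDivergent(float *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (tid % 2 == 0) data[tid] = doThis(data[tid]);
    else              data[tid] = doThat(data[tid]);
}

// Divergence-free version: the same work, remapped so that one warp of each
// pair handles all the even elements and the other warp all the odd ones;
// the branch condition is now uniform inside a warp.
__global__ void evenOddUniform(float *data, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int pair = tid / 64;        // which pair of warps
    int r    = tid % 64;        // position inside the pair
    int idx  = (r < 32) ? pair * 64 + 2 * r             // even elements
                        : pair * 64 + 2 * (r - 32) + 1; // odd elements
    if (idx >= n) return;
    if (r < 32) data[idx] = doThis(data[idx]);
    else        data[idx] = doThat(data[idx]);
}
```

Note that the remapped version trades a bit of access regularity (stride-2 indexing) for full SIMD utilization; whether that trade is worth it depends on how heavy doThis and doThat are.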
Of course, this is abstract — you don't know what "do this" and "do that" are — so now I'm going to give you an actual example, a very nice one and a very useful primitive in GPU computing and in parallel computing in general: a vector reduction. A vector reduction means I have a vector — 12 elements in this case — and I want to know the total sum of those 12 elements. How can I do this? You might know that for a vector reduction on a parallel machine what we typically do is a tree-based approach: essentially log n steps, n being the size of the array, and in each step one thread adds up two elements. In my naive mapping, thread 0 adds elements 0 and 1, thread 2 adds elements 2 and 3, thread 4 adds elements 4 and 5, and so on. This is naive, but it's the most straightforward way of doing it — of course, after this lecture, after these slides, you will always think of the right way of doing it if you ever have to program this or something similar on a GPU, but this is the most straightforward way and the easiest for a programmer to write. The problem is that, as we make progress over the iterations, the number of active threads gets reduced — which makes sense, obviously, because I don't have so many elements to add in each iteration — but observe that these threads get, iteration after iteration, farther away from each other. Actually, from the very beginning I only have thread 0 and thread 2 computing, while thread 1 and thread 3 are doing nothing, so even in the first iteration I already have only 50% SIMD utilization.

And this is the naive mapping — observe that it is a really simple program: I have a variable t that represents the thread index, and over the iterations I keep increasing the stride. For thread 0, the stride between the two elements it needs to add is one; in the next iteration the stride is two, then four, then eight. That's what I have in this code: I keep doubling the stride, from one up to the block dimension — don't worry too much about the exact end condition of the for loop for this example — and if you are the thread that has to work, you perform this partial reduction. The problem: very low SIMD utilization. So how can I come up with a divergence-free mapping? In the case of the reduction it is really possible, because a reduction is typically an associative operation. Think about the sum: because of the associative and commutative properties I don't really care about the order of the additions — I don't have to add first 0 and 1 and then 2 and 3 and so on. So what I can do is assign the input data to the threads in a different manner, such that I have consecutive threads working over the whole execution of the program. For example, in the very first iteration I assign to thread 0 elements 0 and 16, to thread 1 elements 1 and 17, and so on. The first benefit, if you look at the memory accesses, is that for sure I'm going to have coalesced accesses; the second benefit is that I have consecutive threads working, so if instead of 16 this were 32 threads, those 32 threads would belong to the same warp. In this example you can think of a thread block of 16 threads and warps of four threads: I have the four warps working at the very beginning; in the next iteration half of the warps retire, but the other half continue working, and inside each of these warps — take this single warp — I have 100% utilization, because all the threads of the warp are active; and I continue like that. At the very end I obviously don't have 100% utilization, but that's simply because the rest of the reduction involves fewer than 32 elements. If you look at the code — I'll let you think about it yourself — the addressing might be a little bit more complex, but in the end it's not that difficult, and with this I can achieve higher SIMD utilization. So it's a nice example of a divergence-free mapping. Any questions? Okay.
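Here are the two mappings, sketched for one block's tile in shared memory; these are the classic reduction variants, not the exact code on the slide, and they assume a power-of-two block size, an input length that is a multiple of the block size, and a launch with blockDim.x * sizeof(float) bytes of dynamic shared memory:

```cuda
// Launch for both:  reduce<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(in, out);

// Naive mapping: in step s only the threads whose index is a multiple of 2*s
// work, so the active threads get sparse and every warp diverges.
__global__ void reduceNaive(const float *in, float *out) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Divergence-free mapping: in every step the active threads are the ones with
// the lowest indices, so they fill whole warps until fewer than a warp's
// worth of elements remain, and the loads stay coalesced.
__global__ void reduceContiguous(const float *in, float *out) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```

Each block produces one partial sum in out[blockIdx.x]; those partial sums are then reduced again, either by a second kernel launch or on the CPU.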
Next thing: atomic operations. Recall that I mentioned atomic operations in the beginning. The problem with atomic operations is that I have threads running concurrently, and if threads inside the same warp need to atomically update the same memory location, the accesses will have to be serialized. That is going to happen for sure, but there may also be ways of optimizing it. This is the syntax for an atomic operation in the high-level CUDA programming language: atomicAdd, where I have one element in memory — a pointer — and at this memory position I want to add this integer. This is the syntax for integers, but atomics also work for floating point, double precision, and so on. In CUDA, as you know, there is a compiler, and this compiler takes your CUDA program and compiles it into the ones and zeros that the machine understands; but there is also a very useful intermediate representation called PTX. The reason PTX exists is that the architecture has changed over time, and you want the same program to keep working — a program you wrote ten years ago should also work in 2019 — so there is a kind of intermediate representation, an intermediate layer in the compilation flow, called PTX, which is common to all GPUs; from there you can go to the specific assembly code, called SASS, for the different architectures. Actually, if you read the PTX you will see that it is fairly simple to understand: this is an atomic operation on shared memory, an atomic add on an unsigned 32-bit value, 4 bytes. And this is the output — what atomicAdd typically returns is the old value of the memory location — so this is the output register, register r25; this is the address in shared memory that we are updating, in rd14; and this is the value that we are adding. So how does this translate to the SASS code, the assembly that is actually executed on the GPU hardware? There has been a change here, and that's actually something nice that shows how much improvements in the hardware can impact the overall performance of programs. In the initial architectures — Tesla, Fermi, Kepler — this shared-memory atomic add was compiled into a small loop. If you look at the loop, what you have is a load instruction, then an addition, then a store instruction, and then a branch. You can also see these P0 and P1, which are predicate registers — remember we talked about predicated execution: a predicate register is one single bit per thread, which is enough, and what we have here is a zero or a one depending on whether the thread succeeded or not in trying to lock the access to the particular location in the shared memory. So if two threads belonging to the same warp both want to update the same location, they have to contend for a lock — I don't need to say where these locks are; you can guess they are inside the shared memory. They contend, and only one of them can actually gain and lock the access to the specific memory location, and that gives us this predicate value: if your predicate is one, you actually execute this and this; if not, you jump and start again from the beginning.
The reason they changed this is that it was very inefficient. Imagine, for example, that you are counting the elements that are equal to 16 in a whole array. The easiest and fastest way of writing your code is to say: okay, let's use an atomic operation — I read from memory, check whether the value is equal to 16 or not, and if it is, I update, I add one to this shared-memory location. In the end you have one single address that is being accessed by all the threads using atomic operations, so imagine that the 32 threads in the warp have all read 16 from memory: the 32 of them will have to go over this loop 32 times, so the number of instructions that you're executing here is very high — four times 32. NVIDIA improved the implementation of this: since the Maxwell architecture we have native atomic instructions in shared memory, and they are way more efficient, way faster. This is a little more elaboration on the example I just gave you: in this case we have thread 0 and thread 1, which are fine if they update different elements located in different banks — because, of course, since this is shared memory, we can also have bank conflicts when executing atomic operations; if they update two different positions inside the same bank, we will still have bank conflicts. And if the access is something like this, where they update the same position, then we clearly need extra latency to serialize the updates. I first gave you a very naive example — counting how many elements are equal to 16 — but this is actually a very typical thing that happens in some useful primitives, for example histogram calculation. You might be familiar with histograms; they are used in many different applications, especially in image and video processing. The typical way of assigning the computation of a histogram across threads is something like this: this is your input image, these are the different pixels, and I will typically have thread 0 reading this one, thread 1 reading this one, thread 2 reading this one. Why? Because in the end the amount of computation here is very small, so most of the cycles are going to be spent reading the image from global memory, and I want efficient access to memory — peak bandwidth, coalesced memory accesses — and for that I have to do this assignment. Now the thing is, these threads need to update this histogram in global memory, and I might have two of them inside the same warp updating the same bin of the histogram; that's going to be an atomic conflict, which entails serialization, and the performance of the program can go down very quickly. This is very frequent in histogram computation of natural images. Why? Because these pixels here are very likely to have the same value: for example, these threads are reading all these pixels, and these two are reading pixels with value 192, and the distance between these two pixels is two, so for sure — or very likely — they are going to belong to the same warp, and this will entail the serialization.
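A minimal sketch of the "count the 16s" kernel we used as the naive example — every matching thread hits the same counter, which is exactly the contention we are talking about (the kernel name and the target parameter are just for illustration):

```cuda
// Count how many elements equal 'target' using one global counter.
// When many threads of the same warp match, their atomicAdd updates on the
// single address are serialized.
__global__ void countEqual(const int *data, int n, int target, int *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target)
        atomicAdd(count, 1);   // heavily contended when matches are frequent
}
```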
So one way of dealing with all the possible conflicts that we have when generating this histogram is a very simple but effective technique called privatization. The example I'm giving you here is a very naive one, histogram calculation, but privatization is also applied, for example, to graph processing when you follow what is called a top-down approach. I don't know how familiar you are with graph processing — I don't really need to explain much about it here — but there are essentially two ways of doing it: top-down and bottom-up. Graph processing algorithms are typically iterative, so you go over the graph iteration by iteration. In principle, the way to be faster is the top-down approach, in which in each iteration you only need to worry about the frontier, meaning the vertices, the nodes of the graph, that you need to visit in this particular iteration. The other approach is bottom-up: you always check all the vertices of the graph, and each thread says, okay, I've been assigned this vertex — am I in the frontier or not? If I'm not in the frontier I don't do anything; if I am, I do something. The problem is that if I have a graph of, I don't know, 100 million vertices, I will need 100 million threads, and many of them will be idle most of the time, because they are not in the frontier. That is the problem of bottom-up. In theory it is more desirable to have a top-down approach, in which I make progress and only have threads exploring the nodes in the frontier; but the problem is that this frontier will typically be a queue, and if I want to enqueue a new vertex into the frontier, I will have to update some global, shared — I mean atomic — variable, so that I get a new index at which to enqueue into the queue. This is just another example of the usage of atomic operations and privatization: if I map this top-down approach onto a GPU, I could use privatization — small queues assigned to each of the thread blocks, such that each thread block only needs to update its own queue, and at the very end of the kernel I merge all the small queues. That is privatization, and it is also what we can do for histogram computation: instead of having one single histogram in global memory for all the thread blocks that I'm going to run, I can have multiple sub-histograms of the same size — in this case four bins each — in the shared memories of the different SMs, and when I'm done creating these sub-histograms, I merge them into a final histogram in global memory. And this merge is a reduction — recall the reduction; you already know how to do a very efficient reduction, from what we have just seen. So this is a way of alleviating the cost of these atomic operations. Any questions? Okay.
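Here is a sketch of the privatized histogram; the 256 bins and 8-bit pixels are assumptions for the example, and the grid-stride loop keeps the image reads coalesced:

```cuda
#define NUM_BINS 256

// Each block builds its own sub-histogram in shared memory (fast atomics, no
// inter-block contention) and merges it into the global histogram once at the end.
__global__ void histogramPrivatized(const unsigned char *pixels, int n,
                                    unsigned int *globalHist) {
    __shared__ unsigned int localHist[NUM_BINS];

    // Cooperatively clear the per-block histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        localHist[b] = 0;
    __syncthreads();

    // Each thread processes a strided range of pixels (coalesced reads).
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[pixels[i]], 1);   // shared-memory atomic
    __syncthreads();

    // Merge: one global atomic per bin per block, instead of one per pixel.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&globalHist[b], localHist[b]);
}
```

The merge at the end is exactly the kind of reduction step mentioned above, just expressed with a handful of global atomics per block.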
Okay, next thing: data transfers between the CPU and the GPU. Recall that this is one of the main overheads, one of the main bottlenecks, that we have in GPU computation. Why? Because in the traditional accelerator model we have the CPU, we have the GPU, and in the middle we have PCI Express — or, okay, NVLink, which can reach up to, I think, 25 gigabytes per second, but that is still clearly one order of magnitude lower than accessing the global memory. The problem is that if I need to compute on a matrix of 4K x 4K, I will have to move a lot of megabytes from the CPU memory to the GPU memory, and that takes time; depending on how much computation I'm going to perform on the GPU side, I will be able to amortize the cost of the data transfer or not. So there are some ways of trying to deal with this. One possibility is using asynchronous transfers and what are called streams in CUDA, or command queues in OpenCL; these are defined as sequences of operations that are performed in order — I will show you what we mean by that. In principle, as you know, we need to transfer from the CPU to the GPU, then do the kernel execution, then transfer the results from the GPU back to the CPU. The default stream is going to look like this: these are operations that belong to the same stream, because they are performed in order — I cannot start the execution of the kernel until I have finished copying the data from the CPU memory to the GPU memory. If this is, say, a matrix multiplication, the first thing to do is to transfer both matrices from the CPU to the GPU memory, and then I can start the execution; these operations represent a single stream, the default stream. The good thing is that in some cases the data I'm transferring in this part can be computed independently of the data I'm going to transfer in the rest of the time. So once I have transferred, for example, some rows of matrix A and some columns of matrix B — since I was giving you the matrix multiplication example — I could actually start computing those dot products and produce some of the output elements already. So what we can do is divide the computation into multiple streams such that they are independent: these two operations form one stream, they are in order, but they are independent of this stream, this stream, and this stream. As soon as I'm done with this transfer I can start the computation, and at the same time, because these streams can be asynchronous with respect to each other, I can overlap the data transfer of the next chunk of the computation with the actual computation of the first chunk, and by doing that I can save a lot of time. For sure this is something I cannot always do — I need to know my computation, my program, whether I can do it or not — but it's not so strange that you can. These are just some estimates of how much I can save: if I have N streams and the execution time of the kernel is longer than the time for the transfer, what I am hiding is the transfer, so the only thing I will see is this piece here, the transfer time divided by the number of streams, and then the kernel execution time; and it's the other way around in case the transfers are dominant — I'm not sure why you would end up in that situation, but you get the point. Obviously, dividing the computation into multiple streams brings a little overhead of managing all of this, but in principle it is really worth trying, and as I said, it will depend on the specific computation.
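A sketch of this stream pipeline — the kernel is a placeholder, n is assumed divisible by the number of streams and the block size, and the host buffers should be pinned with cudaMallocHost for the copies to be truly asynchronous:

```cuda
#include <cuda_runtime.h>

__global__ void scaleChunk(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];        // placeholder per-element work
}

// h_in/h_out are pinned host buffers, d_in/d_out device buffers of n floats.
void runPipelined(const float *h_in, float *h_out,
                  float *d_in, float *d_out, int n) {
    const int nStreams = 4;
    const int chunk = n / nStreams;          // assume n divisible, for brevity
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        // Copy-in, kernel and copy-out of chunk s are in order within stream s,
        // but independent of the other streams, so they can overlap.
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scaleChunk<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_out + off, d_in + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                 // wait for all streams to drain
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
}
```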
For example, video processing is a very good use case for this overlapping of communication and computation. You have a sequence of frames, and if you need to compute on the frames individually, on each of the frames of the video, you would otherwise need to transfer the entire video and then compute; but if you do this stream execution, you can transfer a chunk of frames, and while you are computing on this chunk you can already transfer the next chunk, and you save all this amount of time. Okay? Clear? The good thing is that over time — you see this is from 2012 — things have improved, and these days both CUDA and OpenCL provide something called unified memory. With this unified memory the GPU can request pages by itself from the CPU memory, and this is transparently handled by the driver; of course it has a little more overhead than what you could achieve by controlling it manually, but this overlapping can now be achieved in a completely transparent way. And this is the summary of what we have seen so far.

Okay, shall we continue? Twenty more minutes? Okay, let's talk about collaborative computing. These are new features that started to be available in CUDA maybe two or three years ago, and what they enable is a tighter collaboration between the CPU and the GPU — or, if we talk about OpenCL, a tighter collaboration between the CPU and other devices, which could be two GPUs, or one GPU and one FPGA, or multiple combinations like that. First, let's very quickly review what we have just seen; this is a typical program. Notice that we have here a malloc for the input array on the host and a cudaMalloc for the input array on the device; then we transfer from the host to the device, then we allocate the output, and then we launch the kernel. We have this cudaDeviceSynchronize — recall that this is the way we can make sure that the kernel has finished before we do something else. In principle, with a program structure like this we typically don't need it, because this cudaMemcpy is synchronous, in the sense that it will never happen before the execution of the kernel completes; but there is also cudaMemcpyAsync, which is the one we have to use with CUDA streams, and in that case the copy and the kernel would be asynchronous, so if we really want to make sure the kernel has finished, we do need the synchronization. A small clarification that you don't really need right now.
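For reference, this is roughly the traditional structure being described, with a vector addition used as the placeholder kernel:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host allocations (inputs and output on the CPU side).
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device allocations and explicit copies over PCIe / NVLink.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_c, d_a, d_b, n);   // asynchronous launch
    // cudaDeviceSynchronize() is not strictly needed here: the cudaMemcpy below
    // is issued in the same (default) stream and will not start until the
    // kernel has finished.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```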
So what is the problem that we have here? The problem is that actual collaboration between the CPU and the GPU is very difficult. When I say actual collaboration, or collaborative execution, I mean the CPU and the GPU both using their threads to do actual work, because if you look at this model, the only thing the CPU thread is doing is offloading the computation to the GPU and waiting for the GPU to finish. During all this time the CPU thread is doing nothing; the CPU cores are doing nothing, at least nothing related to this program. I could somehow try to get some benefit from these CPU threads by, for example, writing some computation here. A very naive thing: imagine that I need to do a vector addition and my two input arrays have one million elements. I know the GPU is much more powerful for parallel computation than the CPU, but the CPU can also do some of the work, right? So I could say: okay, one million elements — let's give most of them to the GPU and the rest to the CPU; I launch the GPU kernel for its part of the input, and the remaining elements are added here, on the CPU. I could do that, yes, but it is a very naive way of doing collaborative execution. And it is still fine for a vector addition, because I can clearly partition what is assigned to the CPU and what is assigned to the GPU; there is no overlap. But now imagine a more complex computation — for example a histogram calculation: I have the entire image, and I could say, okay, this small number of pixels goes to the CPU and this large number of pixels goes to the GPU; but now the problem is that they both need to update the same histogram. So that's probably not doable, at least with one single histogram, that's for sure. To deal with these kinds of cases we have this unified memory — I'm going to use the CUDA term, unified memory; in OpenCL this also exists and is called shared virtual memory, but it is exactly the same thing. This unified memory was introduced in CUDA 6 for the Kepler architecture, but at that time it was simply a kind of wrapper that did the data transfers for you; there was no actual unified memory. With CUDA 8 and the Pascal architecture — this was the end of 2016 — the CPU and the GPU can actually access the same arrays at the same time. The thing is that you only need to allocate once, and somehow the GPU can directly access the CPU memory; "somehow" means that every time the GPU needs to read something from the CPU memory, it raises a page fault, and this page — typically 4 kilobytes, I think — is transferred from the CPU memory to the GPU memory in a transparent manner. It can also happen the other way around, and it is memory coherent, meaning that if the GPU takes these 4 kilobytes and modifies them, and after that the CPU needs to update something in those 4 kilobytes, they will be transferred back to the CPU memory. As you can guess, this obviously has an overhead, but it is also a huge jump in terms of programmability. So the good thing about using the unified memory is that now you don't need the double allocation we had before: we still have this malloc here, but we don't really need it, because this cudaMallocManaged — observe that the API name changed a little bit — is effectively allocating in both places at the same time for you.
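The same vector addition with unified memory looks roughly like this — one cudaMallocManaged allocation per array and no explicit copies; the cudaDeviceSynchronize is still needed before the CPU reads the result:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // One allocation per array, visible to both the CPU and the GPU.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));

    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }  // CPU writes directly

    vecAdd<<<(n + 255) / 256, 256>>>(c, a, b, n);  // pages migrate on demand
    cudaDeviceSynchronize();                        // wait before the CPU reads c

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```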
That is why here we simply use a memcpy, and there is no actual cudaMemcpy, no explicit transfer between the CPU and the GPU memory. And this is an example of what I told you before: in the traditional model, if you want to partition these vectors and do some of the computation on the CPU, the CPU can do something here, but it is not actual fine-grained collaboration. The good thing is that with the unified memory as it already exists in the Pascal and Volta architectures from NVIDIA, but also in the HSA architecture from AMD and other vendors, the kernel is launched with these arguments — the output and input arrays — and as soon as I launch the kernel, which is a completely asynchronous operation, the CPU can continue computing and can directly operate on the very same arrays that are being operated on by the GPU kernel; all the memory coherence here is transparently handled. So these are the APIs: cudaMallocManaged, which you have already seen, and it is also possible to have system-wide atomics. Now it would be possible to have one single histogram that is updated by both the GPU threads and the CPU threads — for example using a fetch-and-add on the CPU side, while on the GPU side you would need to use this atomicAdd_system. It is a kind of atomicAdd, but it works over the entire unified memory; that is why it is different from the atomicAdd we have seen before for the shared memory, which also has the same syntax for the global memory. On the OpenCL side this exists in, or since, OpenCL 2.0, and this is C++ AMP, which is a related thing — I don't know if you have heard about it; it is a kind of high-level C++ programming model that in the end typically compiles down to OpenCL and from there to the actual assembly code. And with these features we are now able to implement collaborative patterns. This is what we originally have in the traditional accelerator model, in the traditional programming structure: we have, for example, two kernels — this is the computation that we need to carry out, this is kernel 1, this is kernel 2 — and what we have in the middle is a coarse-grained synchronization, which is simply the kernel termination and the launch of a new kernel; recall that this is the only way of doing global synchronization inside the GPU. So this is the program structure. The entire set of operations that we need to do here can be divided into multiple tasks — you can, for example, think of each of these tasks as being assigned to a single thread — and inside these tasks we can have subtasks: for example, adding two values and then reducing the partial results; those could be the two subtasks here. With these new programming features it is possible to do a very efficient partitioning across different devices — in this case device 1 and device 2 could be one CPU and one GPU — and because these are different devices, they are better for some tasks and worse for others: for example, in this particular case the yellow subtask runs faster on device 1 and the green one runs faster on device 2.
So this is one possible partitioning that I can do: I would have my coarse-grained synchronization, the kernel termination, and then I continue; and you can also see that this thing is faster on device 1 and this thing is faster on device 2. What I'm doing here is simply assigning to each device an amount of computation that is related to the compute power of the different devices: maybe I have four CPU cores here and eight GPU cores there, so I will assign many more of these tasks to the GPU threads than to the CPU threads. This would be the most naive way of partitioning the data. But now observe: I clearly see that some operations are faster on the CPU while other operations are faster on the GPU. So one thing I can do, if I first look at the entire tasks — each individual task there is faster on the GPU, while here each individual task is faster on the CPU — is some sort of coarse-grained task partitioning, where I run all of this computation on device 2 and all of this computation on device 1. In this picture you don't see any concurrency — you don't see device 1 and device 2 operating at the same time — but I could really have it, because, for example, if I'm operating on a whole video stream composed of thousands or hundreds of thousands of frames, while I'm here I could already be doing this computation for the previous chunk of video frames; so I can have this concurrency. But it could be even nicer if I do a fine-grained task partitioning, and this is something that is really doable these days with all these new programming features: I directly assign each subtask to the device where it is faster — the yellow thing on device 1, the green thing on device 2 — then at some point I need to do a synchronization, and then the light green thing runs on the CPU, the dark green thing on the GPU, and so on. And how can this communication, this synchronization, be implemented? Who can give me an idea? For example, using a flag: I assign this to thread 0 on the CPU, for example, and when thread 0 is done with this, it sets a flag in the unified memory; on the GPU side I have, for example, a GPU thread block checking that flag using atomic operations. That is possible with this atomicAdd_system that I just showed you: this guy updates the flag with an atomic operation, and this guy here sees that the flag has been set and that the data is ready, so it can start the execution on the GPU side. As you can see, this can be very flexible, because I can have CPU threads and GPU threads continuously running and communicating using these new features, like the system-wide atomics, which at the same time take advantage of the memory coherence across both memories. And that is more or less the idea of these collaborative patterns.
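A very rough sketch of that flag idea follows, assuming a Pascal-or-later GPU with concurrent managed access and system-wide atomics (compile for sm_60 or newer). It glosses over the acquire/release ordering you would want in real code (for example via cuda::atomic from libcu++), so treat it only as an illustration of the pattern:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// CPU producer fills a managed buffer and then raises a flag; one GPU block
// is already running and polls the flag with a system-scope atomic before
// consuming the data. On platforms without concurrent managed access this
// sketch would deadlock.
__global__ void consumer(int *flag, const float *data, float *result, int n) {
    if (threadIdx.x == 0) {
        while (atomicAdd_system(flag, 0) == 0) { /* spin until the CPU sets it */ }
    }
    __syncthreads();
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) acc += data[i];
    atomicAdd(result, acc);                        // accumulate into one result
}

int main() {
    const int n = 1 << 20;
    int *flag; float *data, *result;
    cudaMallocManaged(&flag, sizeof(int));
    cudaMallocManaged(&data, n * sizeof(float));
    cudaMallocManaged(&result, sizeof(float));
    *flag = 0; *result = 0.0f;

    consumer<<<1, 256>>>(flag, data, result, n);   // GPU starts and waits

    for (int i = 0; i < n; ++i) data[i] = 1.0f;    // CPU produces the data
    __sync_synchronize();                          // CPU-side memory barrier
    __sync_fetch_and_add(flag, 1);                 // then raise the flag

    cudaDeviceSynchronize();
    printf("sum = %f (expected %d)\n", *result, n);
    return 0;
}
```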
Here is a very simple example — I think I already mentioned it — a histogram calculation for an image. In the traditional approach I need to assign part of the image to the CPU and part of the image to the GPU; each of them has its own sub-histogram, and finally I need to merge them, to reduce them. This can be fine, but it is a little more complex to program. It could actually be much easier if they both shared the same histogram in the unified memory, and threads from these SMs and from these CPU cores directly updated that histogram. Don't ask me whether this is efficient or not — as you can guess, for a single histogram of, say, 256 bins it is not — but from the programming point of view it is actually very good. One more thing: this is not efficient on discrete GPUs. Why? Because you have the PCI Express bus in the middle, so if these threads here need to update this, you will have to move one page from the CPU memory to the GPU memory, and then back again when you want these other threads to update it; you're going to have this ping-pong that is going to destroy your performance. But if you have an integrated GPU, where the two devices are inside the same chip and this is the same memory space, the same DRAM, then it is going to be much more efficient. And you can see quite a few examples of this kind of collaborative implementation. Bézier surfaces — a kind of processing used in graphics — on an NVIDIA Jetson; I don't know if you have heard about this platform, it is a board that contains a CPU and a GPU, the CPU with four ARM cores and the GPU with two SMs; this is the speedup that you can get from doing a collaborative execution. Here you have the performance improvement on an AMD Kaveri, which is an integrated system-on-chip with four CPU cores and eight GPU compute units — GPU cores in AMD terminology — and here it is a 47% improvement over the GPU-only version. Here you have a few more examples, like padding; I don't want to go over all of them, but you can see some results showing performance improvements. Another one is stream compaction — in this case, how much? An 82% improvement on this AMD Kaveri over the GPU-only implementation. What else do we have? This is a breadth-first search; the way this is implemented — I cannot elaborate because we don't have time, but you can take a look yourself — is the top-down approach that I talked about before, and it applies several tricks, not only privatization but also persistent thread blocks, which is another nice way of optimizing GPU programs. Let me explain that one very quickly. Remember that from the beginning I told you: you have this whole array with, say, one million elements, and you're going to use thread blocks of 128 threads, so the number of thread blocks you need to launch is one million divided by 128, or the ceiling of that. In the end these many thousands of thread blocks are going to be scheduled, for example, four at a time: when one of them finishes, one more goes here, then the next one goes there, and so on. The idea of persistent thread blocks is that — it is a little more complex to program, but in the end quite easy to understand — instead of launching, let's say, a thousand thread blocks, I launch only four, and I have these four thread blocks go over the whole array: when thread block 0 finishes with this part of the computation, instead of retiring and releasing its slot for the next thread block, it moves on to this part here and computes this part.
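One simple flavor of persistent thread blocks is a grid-stride loop where the grid size is chosen from what actually fits on the device rather than from the problem size; real implementations often also grab work dynamically through a global atomic counter, but this sketch shows the basic shape:

```cuda
// Instead of ceil(n / blockSize) short-lived blocks, launch only as many
// blocks as can be resident and let each of them walk the whole array.
__global__ void persistentVecAdd(const float *a, const float *b, float *c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {      // stride = total number of launched threads
        c[i] = a[i] + b[i];
    }
}

// Launch sketch: grid size derived from the hardware, not from n.
// int numSMs, blocksPerSM;
// cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
// cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, persistentVecAdd, 128, 0);
// persistentVecAdd<<<numSMs * blocksPerSM, 128>>>(a, b, c, n);
```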
Now, the good thing about doing this is that, for example, I can control the computation in a better way: with a scheme like that I know exactly which part this block is working on and which part that block is working on, and it may be possible to synchronize them, so it can be a way of achieving the global synchronization that is otherwise not possible if you don't deploy these persistent thread blocks. But as I said, this is a little bit more advanced programming, let's say. You have more performance results here — for the integrated device, in this case graph processing, breadth-first search on this graph of New York roads, it is about a 15% performance improvement — and some sample code. A related one is single-source shortest path, on the AMD Kaveri, with a 22% performance improvement, and there are more examples. And this is the last thing I wanted to show you: if you want to take a look at these kinds of collaborative patterns, you can access the Chai benchmark suite, where you will find many of these patterns and different ways of using these new programming features. And that's all for now. I hope that you found it interesting. Thank you very much for your attention, and have a