but today we're going to cover another fascinating concept, which is the paradigm of single instruction, multiple data processing, which has influenced computer architectures quite a bit. Essentially all computer architectures have this today. That was not true in the past, but it was so important that every computer architecture has built in single instruction, multiple data support, either as part of a separate processor like a GPU, or as part of a separate functional unit like a SIMD functional unit. But before we get to that, this is where we are: essentially we're covering execution paradigms, and we have SIMD architectures and graphics processing units, which build on the architectures we've covered, left for today and next week. We've covered everything else except for decoupled access/execute, for which I'm going to refer you to this optional lecture — it's 11 or 12 minutes. It's another paradigm that's actually used in existing systems: the idea of decoupling memory access and execution so that they can run asynchronously to each other, so that you can overlap the latencies of memory and execution units in different ways. That's the idea, basically, and this short lecture goes into a little bit more detail if you will. Now, before we start SIMD, let's wrap up systolic arrays. We covered systolic arrays and VLIW yesterday, and we discussed that systolic arrays are specialized architectures to accelerate specialized computations; this was the computation we looked at yesterday. We also said these have influenced real systems, and I want to talk a little bit more about the real systems because we kind of rushed through that yesterday. Essentially this TPU that was developed by Google, starting around 2015 and described in the ISCA 2017 paper, is essentially a systolic array. You can even read that they say "systolic data flow of the matrix multiply unit": software has the illusion that each 256-byte input is read at once, and it instantly updates one location of each of the 256 accumulator RAMs. Basically they do multiply-and-accumulate to do matrix multiplication. Again, I won't go into the details, but you can see how they do it — it's transparent to the software. They basically translate a large multiply-and-accumulate operation, a matrix multiplication essentially, into this systolic computation, using a lot of support in the system. So this is the systolic computation unit, and it's actually somewhat similar to what we discussed; the exact data flow and exactly what is accumulated inside each unit is different, of course, because you can design systolic arrays for matrix multiplication in many, many different ways. But you can see the support they provide around this matrix multiply unit: there's a lot of support to input the data into the unit, and also to input the weights that they're multiplying things with, because this is really doing matrix multiplication for convolutions, for example, like we discussed yesterday. So you need to feed the weights and the data vectors in from here, and then get the data out from here. I won't go into the details — you can read the paper if you're interested — but it's doing essentially matrix multiplication. Yes — not necessarily, not necessarily; they're probably taking advantage of reordering internally. As long as you get the right result, it
doesn't matter which order you follow. But if you think about it, in a von Neumann architecture you have a sequential order dictated by the program, and you have to obey that order because that's the architecture. Here a lot of things can be happening concurrently, so that's another benefit of systolic architectures: you're not tied to an order. Okay. So the reason they built this was to accelerate inference first, and then training, in the Google data centers, because they're clearly doing a lot of data analytics internally. They basically said: we have so much infrastructure that we can amortize the cost of building our own chips, and that's why they did this — they amortize the total cost of ownership of building their own chips, as opposed to getting GPUs from someone else, for example. And they kept building on this. This is TPU v2; they actually do training on this one, as opposed to just inference in neural networks, and they basically increased the memory bandwidth, as you can see. So providing data into the systolic array is still important: even though a systolic array is more efficient than a general-purpose processor like a CPU in terms of how much computation it does on each data element it fetches from memory, like we discussed yesterday, it still needs a lot of data, because these are quite data-intensive applications. So they replaced the traditional memories with high-bandwidth memories, they added floating-point operations, and they increased the computational power. Then TPU v3 has even higher bandwidth, larger memories, and more matrix units — not just two systolic arrays but four systolic arrays now — and the teraflops keep increasing. And this is the latest one they have disclosed, at least, and they're targeting many, many applications: you can see that the total teraflops per chip has increased by almost three times over here. So they've been improving this over time, and you can read more papers about it — they've been quite public about it; even though they don't disclose exact details, they talk about some of the things they've done. I will also mention that we've been analyzing these TPUs in my research group together with Google, as you can see in this work, and we've been finding that a lot of the issues they have are really due to memory. We're going to talk about memory after the next lecture, hopefully — we're going to go into memory design — and this is to motivate the importance of memory: you can see that all of these chips are increasing their memory bandwidth, making memory more expensive, increasing their memory capacity, and yet they still need more memory. In large machine learning models — for example long short-term memories, Transformers, etc. — more than 90% of the entire system energy is spent on memory. Okay, we'll get back to that. So, this is systolic arrays. As I mentioned yesterday, systolic arrays are not the only machine learning accelerators. Now we're going to see the SIMD paradigm that people have been using to accelerate machine learning, because SIMD is actually a very good fit for doing convolution-type parallel operations and matrix multiplication — and yesterday we saw how you can convert a convolution into a matrix multiplication as well, right? Remember that slide where I showed you that; you can go back to it.
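(As a side note, here is the arithmetic that all of this hardware is computing, written as a minimal plain-C sketch. The systolic array and the SIMD units compute the same multiply-accumulate steps, just reordered and pipelined in hardware rather than looped like this; the 4x6-times-6x10 dimensions are only illustrative, not anything from the TPU.)

```c
#define M 4    /* rows of A (illustrative sizes)   */
#define K 6    /* columns of A = rows of B         */
#define N 10   /* columns of B                     */

/* Multiply-accumulate view of matrix multiplication: every element of C
   is a running sum of products, which is exactly what each accumulator
   in a matrix-multiply unit builds up step by step. */
void matmul_mac(const float A[M][K], const float B[K][N], float C[M][N])
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;              /* accumulator for C[i][j]      */
            for (int k = 0; k < K; k++)
                acc += A[i][k] * B[k][j];  /* one multiply-accumulate step */
            C[i][j] = acc;
        }
}
```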
So basically people have been using SIMD architectures, and later today I will show you that at the core of it, this has many processors inside — well, this is the newer version, if you will — and each processor is a SIMD processor internally, and they communicate with each other. So it actually consists of multiple paradigms: it's a multiprocessor of SIMD processors, or vector processors if you will. This is designed as a wafer-scale chip, and there are lots of units on it — according to Cerebras, 850,000 cores — and large amounts of on-chip memory, 40 gigabytes of SRAM on the chip. So they're putting large amounts of computation and memory together, but this is clearly a different approach compared to systolic arrays, and you can learn more about it in this talk delivered by the CEO — or CTO — of that company. I should also say that people have been using FPGAs for this. Microsoft has developed this Brainwave system, as they call it; essentially they use FPGAs as a substrate to accelerate machine learning operations, and they do multiple things. Partly they do some specialized dataflow mapping, as you can probably see somewhere over here — you'd need to read the paper, and the paper doesn't give a lot of detail — so they partly map the dataflow directly to accelerate part of the computation, but they also put soft SIMD units on the FPGA, just like you're doing in your labs right now: you're putting a soft MIPS processor on the FPGA, right? You're not mapping an application directly; you're implementing a processor on the FPGA substrate. They actually have a SIMD processor on the FPGA substrate as well. At the end of this lecture we may get back to it, even though we're not going to go into much detail, because they don't go into much detail anyway. But you can see that there are now multiple ways of accelerating things, and clearly SIMD plays a large role. The reason is that SIMD is very good at exploiting a particular type of parallelism called data parallelism: essentially you're doing many, many of the same operations on different data elements, and GPUs exploit the same type of parallelism. I call this regular parallelism also — of course there are many types of regular parallelism — and in this particular case you're doing the same operation on many pieces of data. So we're going to talk about that, but before we do, let's talk about the taxonomy of computers. We've been covering different types of computers, and this paper — a seminal paper by Mike Flynn, from almost 60 years ago as you can see — categorizes computers into four different categories, in terms of how many instructions do what to how many data elements. You have instructions on one axis and data elements on the other. The simplest machine is SISD: a single instruction operates on a single data element. This is a scalar machine like we have seen — you do an add, and each operand is a single data element. This is the traditional add we have seen, and we've accelerated it using out-of-order execution, for example. Then there's SIMD, which we're going to see: in this case, for a single operation, each operand is not a single element but really a vector of elements — many elements at the same time. So you're operating on
vectors, essentially. There are two types of these: array processors and vector processors — this is a purist's distinction, and we're going to see it. An array processor does the same operation on different elements at the same time, across different processing elements; a vector processor does the same operation on different elements across time. MISD is interesting: this is kind of similar to a systolic array. You have multiple instructions that operate on a single data element — basically you feed in a single data element and different instructions operate on it until you get the result. This is very similar to a systolic array, and I'll show you a somewhat similar, generalized picture soon; so the closest form is really a systolic array or a streaming processor. And finally MIMD: you have multiple instruction streams operating on multiple data elements. You can think of this as a multiprocessor or a multithreaded processor, essentially. So these are four different paradigms, and they can be combined with each other: you can have a processor that has all four of these inside. Some SoCs today actually have that — the Apple M1 has all of them, I believe. If you're interested you can look at this paper; the terminology is a little bit different, but I've given you the gist of it. This is the example the paper gives of what MISD is — multiple instructions operating on single data: you have data storage, you bring the data in, and then a processing element operates on the data and passes the modified data to the next processing element, which passes its modified data to the next one, and so on. So it's very similar to a systolic array like we've seen yesterday — a generalized systolic array, because, as I mentioned yesterday, these processing elements can be programmable using instructions as well. So if you haven't attended yesterday's lecture, that's yesterday's lecture, and this is SIMD, which we're going to cover today. If you look at this picture provided in that paper, you have this instruction unit that sends the same instruction to different execution units, and these different execution units get different data elements from the operand store. This picture doesn't make it very clear, but we're going to make it clear soon; this is similar to what we're going to call an array processor, and these processing elements or functional units have very limited communication between them — in fact, we'll see no communication initially in our processors. Does that make sense? Okay, so these are the paradigms. We're going to look at the SIMD paradigm, and I mentioned that SIMD exploits data parallelism. What does this mean? So far we've always been exploiting concurrency, different sorts of concurrency: out-of-order execution, for example, exploits concurrency that's irregular — it builds a dataflow graph and tries to figure out which instructions can be executed concurrently. Here, in SIMD, the type of concurrency we exploit is fundamentally different: it arises from the semantics of the program, performing the same operation on different pieces of data. That's why it's called single instruction, multiple data. An example: the dot product of two vectors — or whatever you do on vectors actually
applies nicely — that's why one class of these processors is going to be called vector processors. Now contrast this with dataflow: in dataflow, concurrency really arises from executing different operations in parallel in a data-driven manner. When data becomes available it fires an instruction, and the instructions that fire have no relationship to each other except that they happen to have their data ready at that point in time — they could be completely different instructions. Contrast this also with thread parallelism at a higher level — that's multiple instruction, multiple data: the concurrency arises from executing different threads of control, different tasks, in parallel, and again these may have nothing to do with each other. We're going to get back to GPUs, because GPUs combine SIMD and thread parallelism in a very particular way, soon. Basically, SIMD exploits operation-level parallelism on different data: the same operation applied concurrently to different pieces of data. This is really still a form of instruction-level parallelism, because you can think of these operations as instructions, except an instruction is collapsed such that it applies to many, many different pieces of data. The instruction happens to be the same, meaning that you don't need to decode the same instruction many, many times. Imagine that you're adding two vectors of one billion elements: in a scalar processor you would be fetching and decoding the same add instruction one billion times. Here, as we're going to see, you fetch and decode the add instruction only once, and it controls a billion data elements to be added. That's where the efficiency really comes from: you don't waste time figuring out, "oh, I'm doing this operation," a billion times — you know you're doing this operation on a billion elements, so you figure it out only once. You're basically amortizing the cost of control, because you're doing the same thing across billions of elements; that's the beauty of it. If your vectors are larger, this becomes more efficient; if your vectors are smaller, this starts resembling a scalar processor; in the limit, if your vectors are of size one, meaning a single element in each vector, this is a scalar processor. So this is the SIMD processing paradigm: a single instruction operates on multiple data elements, and this can be done in time and in space. Actually, a lot of things can be done in time and space — we could revisit some of the discussions we had for earlier paradigms in those terms — but here it fits very nicely. We're going to have multiple processing elements, execution units, just like we've been having so far, and we're going to explore this time-space duality in how we handle the vectors. In an array processor, the same instruction operates on multiple data elements at the same time, using different spaces, or processing elements: you have an array of processing elements that can do the same thing, and a single instruction instructs them to do exactly the same thing, concurrently, at the same time, to all of those different data elements. We're going to see this picture as well.
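(To make the amortization point concrete, here is a small sketch using x86 SSE intrinsics — an illustrative assumption, not the machines discussed here; a real vector machine would use much longer vectors. One SIMD add instruction covers four elements, so the fetch/decode/control work is amortized over four operations instead of being paid per element.)

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Scalar version: one add instruction fetched and decoded per element. */
void add_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD version: one addps instruction adds four elements at a time,
   so the control overhead is amortized over four data elements.
   (Assumes n is a multiple of 4; a tail loop would handle the rest.) */
void add_simd(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}
```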
A vector processor, on the other hand, is more frugal, in the sense that you don't need many different processing elements that do exactly the same thing. You can have only one, but you reuse that processing element in time: every single cycle you do the same operation on different data elements, in consecutive cycles. Basically, the same instruction operates on multiple data elements in consecutive time steps, using the same space — the same processing element. This is why vector processors were initially more popular: they were more hardware-efficient. You don't need to replicate the same functional unit many times; you can just have one add unit and pipeline the vector operations through it. But then array processors became more popular as we could put more transistors on a chip, and today's processors are a combination of both: if you look at a GPU, it's not a vector processor and it's not an array processor — it's a vector and array processor, basically, and we will see that also. You can do both time- and space-style processing. Before I show you the time-space duality, I need to define one more thing, because I think it's important. We're going to operate on vectors, not on scalar values. So far we've been operating on scalar values, and these scalar values are stored in registers: whenever we access a scalar value we say "register one," and register one gives us one element. Now the question is, how do you get many elements? If you're going to operate on many elements, your registers should also store many elements. Essentially, instead of having a scalar value in a register, we're going to have vector registers. Each vector data register holds N M-bit values, where N is the vector length, or the maximum vector length you can have in an instruction — each register essentially stores a vector. So this is register V0 — we're going to call it V0 because it's vector register zero — V1 is another vector register, V2 is another vector register. Not a single scalar value as we saw before: you can see this is V0's zeroth element, V0's first element, and V0's (N-1)th element. Then you feed the processing element one element every clock cycle, and the processing element can be pipelined, as you can see, because these are completely independent operations. You're adding, for example, V0 to V1: the operation on the zeroth elements — vector 0's zeroth element plus vector 1's zeroth element — is independent of vector 0's first element added to vector 1's first element. So you can deeply pipeline this, whatever functional unit it is — it could be square root, your favorite functional unit. That's the idea of a vector register, and the result is stored in a vector register as well: the sum of the zeroth elements gets stored in the zeroth element of the destination register V2. So we just extend the scalar value to a vector value. And this is our simple vector program, to illustrate the difference between an array processor and a vector processor to begin with. The instruction stream is something very simple: you have a four-element array A; you first load it from memory into a vector register — I didn't name the vector register, it's just one vector register, because that's the only one we're going to use — then we add one to each element (this is a vector add: every element is incremented by one), and then every element, after being incremented by one, is multiplied by two, as you
can see. So there are dependencies, clearly. And then we store the resulting elements back into the array in memory. You can basically take the scalar program and just vectorize it — that's the idea. Okay, let's see how this executes on an array processor versus a vector processor; the vector length in this case is four (I'm going to define that more formally soon). An array processor can do the same operation across multiple processing elements in the same cycle, at the same time. All of these processing elements are capable of executing all of these instructions, so these are beefy processing elements — expensive. In the first cycle we can execute four loads. What does this mean? We're loading element zero from memory into the vector register's element zero in this processing element, element one from memory into the vector register's element one in that processing element, and then element two here and element three here — concurrently. It looks beautiful, and we're done with the instruction because we have the parallelism. Then we take element zero and increment it by one in this processing element, element one in that processing element, and elements two and three concurrently again; then we multiply each element by two concurrently in the different processing elements; and then we store the resulting values concurrently to different, consecutive locations in memory. Makes sense, right? So this is beautiful, and this is the fastest you can get for this program using this paradigm. Now let's take a look at how a vector processor does it. A vector processor is frugal, as I said: the processing elements are simpler. You have only one processing element that can do loads, one that can do adds, one that can do multiplies, and one that can do stores. You don't have the same luxury, so you have to perform consecutive operations on consecutive elements in time. In the first cycle you start loading the first element from address A into the vector register's element zero; in the next cycle you load element one, and concurrently the first result is hopefully done, so you can add one to it — you're operating on element zero, adding one to it, in the add unit. In the next cycle you're loading the second element in the load unit, adding one to the first element in the add unit, and multiplying the zeroth element by two in the multiply unit. And then this is the steady state, if you will: you're loading the third element, adding to the second element, multiplying the first element, and storing the zeroth element — and you keep doing this. You can see that the vector processor takes longer, because its resources are, let's say, more frugal; it's not as expensive as the array processor. But this is the time-space difference: if you look at the array processor, you do the same operation at a given time — this is the time axis over here and this is the space axis — whereas in the vector processor you do different operations at a given time; in the steady state you're still doing four operations, but that steady state may take some time to reach, as you can see. And if you look at the space axis, in the array processor you do different operations in the same space, whereas in the vector processor you do the same operation in the same space, which is really the huge cost efficiency of a vector processor.
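(For reference, the same four-element example written with x86 SSE intrinsics — again just an illustrative assumption, not the notation we will use — so you can see that each step really is a single instruction operating on all four elements:)

```c
#include <xmmintrin.h>

/* Load four elements of A, add 1 to each, multiply each by 2, store back:
   one SIMD instruction per step instead of four scalar instructions per step. */
void example_vlen4(float *A)
{
    __m128 v = _mm_loadu_ps(A);            /* vector load  (4 elements) */
    v = _mm_add_ps(v, _mm_set1_ps(1.0f));  /* vector add   (+1 to each) */
    v = _mm_mul_ps(v, _mm_set1_ps(2.0f));  /* vector mul   (*2 to each) */
    _mm_storeu_ps(A, v);                   /* vector store (4 elements) */
}
```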
That's why vector processors have been really popular, until we were able to put significantly more functional units in our systems. As we will see later on, we're going to combine these at the end of this lecture, and GPUs will combine both. Okay, so now let's contrast this with VLIW. Yesterday I showed you this picture — you looked at VLIW: we had a single program counter that would fetch a long instruction word, with multiple independent operations packed together into a long instruction, and these independent operations had nothing to do with each other. In array processing, in this particular case, the program counter specifies a single operation, except that single operation gets expanded to multiple different data elements. So this is the contrast between VLIW and an array processor. VLIW can exploit more general-purpose parallelism, as long as it's discoverable by the compiler and packable into instructions like this; VLIW guarantees that these operations can be concurrently executed. The array processor also guarantees that this instruction specifies four operations — the vector length is four in this case — that can be concurrently executed, because they're independent of each other, but because of a different semantic property: in VLIW the compiler guarantees that the operations are semantically independent, whereas here the compiler or the programmer guarantees that they operate on different data values. So in this sense they're similar — they're doing multiple operations per cycle — except that here the relationship between the operations is that you're doing the same, single operation on multiple different data elements. Makes sense, hopefully. And we're going to see that VLIW was actually developed to fix the problems that array processors were having. I'm going to read a part of the VLIW paper that is, let's say, not harsh, but realistically hard on array processors. Unfortunately, array processors have been quite successful, but VLIW has not, because it relies on parallelism that's a lot harder to discover, even though it's much more general-purpose — no question about that, VLIW is much more general-purpose — but it's much harder to do from a compiler or programmer perspective, and this is much easier if your problem fits. This kind of parallelism exists in all images, all videos: if you're doing the same operation on many images, or on many parts of an image, you have millions of pixels that you're operating on — that's your vector, basically. Okay, we'll get back to VLIW. Now let's talk about vector processing in a little more detail. I'm going to start with vector processors, but everything I say is really applicable to array processors as well — you just need to expand it in space as opposed to time. So what is a vector? A vector is really a one-dimensional array of numbers — this is the math definition — and clearly many scientific and commercial programs use vectors; there are many of them, and that's why people have developed SIMD extensions, which we're going to talk about at the end of this lecture. But let's look at this example. This example averages two vectors: it takes the element-wise average of two 50-element vectors — A and B are both 50-element vectors, so you're working on 50 elements at a time — and stores the average into the C vector.
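(Written out in plain C, this is the loop we are going to vectorize — a minimal sketch:)

```c
/* Element-wise average of two 50-element vectors A and B into C. */
void average50(const int A[50], const int B[50], int C[50])
{
    for (int i = 0; i < 50; i++)
        C[i] = (A[i] + B[i]) / 2;
}
```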
So a vector processor is one whose instructions operate on vectors rather than scalar values, and for that you need vector registers. There are some basic requirements; let's define them clearly. Vector registers — that's why it was important to define them even before I defined array processors and vector processors: you need to load and store vectors, so you need vector registers, and these clearly contain vectors. Sounds good. Now, I've kind of introduced the vector length register: you need to operate on vectors of different lengths, and this is really why you need a vector length register. There are instructions that set the vector length register, so that you can say, at this point I'm operating on vectors of length four, at this point I'm operating on vectors of length a thousand. There's clearly a maximum vector length, and that's determined by the number of elements in a vector register — it could be 64, for example; we will see in Cray it's 64, but it could be much larger as well. Now, GPUs are nice: they eliminate this vector length notion, which makes them easier to program. Then there's a vector stride register, which is something I'll introduce. How many of you know the notion of stride? Okay, good, you're going to learn something again. Basically, elements of a vector might be stored apart from each other in memory, and the vector stride register denotes how far apart from each other they are — stored regularly apart from each other; irregular storage we're going to handle later with different types of instructions. So the stride is the distance in memory between two elements of a vector, and we're going to see why this changes. The stride is not always one. If it's one, that's actually nice: consecutive elements of the vector you're operating on are stored in consecutive locations in memory, which has a lot of benefits, as we will discuss. But your stride is not always one, and I'll show you a very basic example: matrix multiplication, our favorite program these days. Let's take a look at A and B matrices, both stored in memory in row-major order. Do you know this terminology, row-major, column-major? No? Okay, you're going to learn more. Row-major order means that consecutive elements in the same row are stored in consecutive locations in memory; column-major means exactly the opposite — consecutive elements in the same column are stored in consecutive locations. This has huge implications in terms of locality, prefetchability, caching, and so on; we're not going to deal with that here, we're going to talk about stride. So let's take a look at the two matrices we're going to multiply. This is the A matrix, with its (0,0) element over here; this is the B matrix, with its (0,0) element. They're in row-major order, so the first row of A is stored in consecutive locations, as you can see: the first element is at location zero, the next element at location one, the next at location two — these are offsets from the beginning of A, of course — and then the first element of the next row starts at offset six. If you look at matrix B, the first row is very similar: it has 10 elements in a row, stored in consecutive order, and the next row starts at offset 10. So this is what they look like — I didn't show the entire thing over here because I don't have space — but A starts from some location, and these are the
offsets of the elements, where they are, and this is what they correspond to in the matrix. B starts from some other location in memory, these are the offsets of its elements, and this is what they correspond to in terms of the visual structure of the matrix. So A is a 4x6 matrix — four rows by six columns — and B is a 6x10 matrix, and you can multiply them because these sixes match, and you get a result C that is 4x10. Now let's take a look at how you would do that. To multiply these, what you really need to do is take the dot product of each row vector of A with each column vector of B: to get one location in the C matrix, you take the dot product of row zero of A with column zero of B. To be able to do that, you need to load A's row zero into vector register V1, or whatever vector register — meaning A[0] through A[5], offsets zero through five. These are consecutive in memory: each time you increment the address by one to access the next element in the row (the next column). So this is a case where we have a stride of one — consecutive elements are in consecutive locations; the distance in memory is one in terms of the address. Now, if you're loading B's column zero, that's not the case: we want to load offsets 0, 10, 20, 30, 40, 50 into the second vector register so that we can do the vector-vector dot product, and clearly they're far apart from each other. Each time you need to increment the address by 10 to access the next row, so the accesses have a stride of 10 in this case. So you need support for different strides. You could say, why don't we lay out B in column-major order? Yes, you can do that, but what if you then start multiplying B with some other matrix? You cannot have the best of all worlds at a given time; you will always need some support for strides, and this is something that has bothered many people — people have tried to develop many ways of overcoming this issue, as we will see later on. Right now I'm just defining the stride, not talking about why it's good or bad, but a stride of one can be good: it has better caching characteristics, as we will also see later on. Okay, so that's stride. Now let's talk about some other characteristics of vector processors. A vector instruction performs an operation on each element in consecutive cycles — that's how we defined it — so we can pipeline the vector functional units, and each pipeline stage operates on a different data element. Vector instructions allow deeper pipelines: if you have a huge vector, say a million-element vector, a thousand-deep pipeline is not that bad, because you never flush this pipeline at all. The main reasons are: you don't have any dependencies within a vector — no intra-vector dependencies — so you don't need to check for them, and there's no hardware interlocking within a vector (we're talking about pipelining the functional unit here); there's no control flow within a vector, and a lot of good things are going to come from that, as you will see — we're going to eliminate a lot of the control flow from a scalar program soon; and there's a known stride — you know the stride, and this allows easy address calculation for all vector elements.
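(Here is what that address arithmetic looks like as a short C sketch, using the 4x6 A and 6x10 B from above: the address of the next element is always the base plus i times the stride, with stride 1 for a row of A and stride 10 for a column of B — element offsets, not bytes.)

```c
#include <stdio.h>

int main(void)
{
    enum { K = 6, B_COLS = 10 };   /* A is 4xK, B is KxB_COLS, both row-major */
    long A_base = 0, B_base = 0;   /* offsets from the start of each matrix   */

    /* Row 0 of A: stride 1  ->  offsets 0, 1, 2, 3, 4, 5 */
    for (int i = 0; i < K; i++)
        printf("A[0][%d] at element offset %ld\n", i, A_base + (long)i * 1);

    /* Column 0 of B: stride 10  ->  offsets 0, 10, 20, 30, 40, 50 */
    for (int i = 0; i < K; i++)
        printf("B[%d][0] at element offset %ld\n", i, B_base + (long)i * B_COLS);

    return 0;
}
```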
If you go back over here: you always calculate the address by incrementing it by one here, and here, when you're loading this column, you always calculate the next address by incrementing it by 10. So you have a known stride, and that helps — it enables easy loading of vectors into registers, cache, or memory, and even early loading, or prefetching, because these are really predictable addresses. We're going to talk about prefetching in one of the later lectures, but this sort of predictable, constant stride helps a lot. We will see that the address generation unit is essentially: you have a base address and then you keep adding the stride to it — and I'm going to show you an example of that. Everything clear so far? Now, it's beautiful, right? All of these things are good for designing a high-performance processor: heavily pipelined, no dependencies to check, no control flow — what else would you ask for? You should ask for large vector lengths, basically. If your vector lengths are small, then these unfortunately don't work very well: you can still pipeline, but you don't benefit from it, let's say. So what are the advantages of vector processors? The lack of dependencies within a vector enables pipelining and parallelization — pipelining is in time, if you will, and parallelization is in space — so you can do array processing as well as vector processing, which is nice. You can have very deep pipelines without the penalties of deep pipelines — except of course the sequencing overhead, and the hardware overhead, which I'm kind of ignoring, but the other performance penalties, like flushing the pipeline, you don't have. Each instruction generates a lot of work, lots of operations, and as we discussed earlier this reduces the instruction fetch requirements: you don't need to fetch and decode an instruction for every operation; you can amortize the cost of fetch and decode across lots of work. This actually leads to high energy efficiency per operation, and if someone asks you why GPUs are energy efficient, this is the main reason: they can amortize the cost of control across lots of operations. Otherwise there are a lot of reasons why GPUs should not be energy efficient — I can name you many — but because of this they can do a lot of work without a lot of control overhead, so you're really spending the energy on the computation, if you will. There's no need to explicitly code loops, and we will see this very soon with an example; you can have very few branches in the instruction sequence, because we're going to eliminate branches. If you think about a scalar program, what you do is loop and do the same thing on many, many elements of a vector; now we're going to specify this as: you do one operation on a vector, then the next operation, then the next, and the branch is only one thing you do at the end, after a million operations per vector, let's say. And you have a highly regular memory access pattern because of the strides we looked at earlier, and that is going to benefit us soon. Of course there's a disadvantage — like every idea, this has downsides and upsides — and the huge downside of vector processors is that it doesn't always work. It works if the parallelism is regular — data parallelism, SIMD parallelism — if you can
fit your parallelism to this paradigm — meaning vector operations — it works nicely, but unfortunately it's very inefficient if the parallelism is irregular, like in some of the programs we have looked at. If you're searching for a key in a linked list, there are lots of dependencies and you cannot parallelize very well — too bad. So you need to know whether you can vectorize the program; we're going to talk about vectorizable loops — your loops need to be vectorizable for this to work. And this is exactly why the VLIW paper says this: "To program a vector machine, the compiler or hand coder must make the data structures in the code fit nearly exactly the regular structure built into the hardware. That's hard to do in the first place, and just as hard to change. One tweak, and the low-level code has to be rewritten by a very smart and dedicated programmer who knows the hardware and often the subtleties of the application area." This is actually a very nice description of why some applications don't fit vector processors. If they fit, they fit greatly; if they don't fit, these problems happen. Josh Fisher poses VLIW as a solution to this, but it has its own problems, as you know by now, and he continues: "often the rewriting is extremely difficult," and so on. That's the recommended paper; you can read more about it. Before moving on, we should not get too excited, because there's Amdahl's law, which you have reviewed and we looked at before. Vector processing works well in the parallelizable fraction of the program; in the serial part you don't have a lot of data parallelism. In the parallelizable fraction, ideally you should be able to speed up by the number of processing elements you have. So this is Amdahl's law: the parallelizable fraction of the program can be sped up by n, assuming you have n processors — even this is ideal; if you take the advanced architecture course we'll talk about why that's ideal, but assume it's true for now. As n goes to infinity, the parallel term goes to zero, so your speedup in the limit is really determined by the parallelizable fraction f. If your parallelizable fraction is 50%, your speedup is at most two. That doesn't sound good, right? With an infinite number of processing elements your speedup is only two if your parallelizable fraction is 50%. So your parallelizable fraction needs to go very high if you really want to benefit: if it goes up to 99%, your maximum speedup is 100; if it's 99.9%, it's 1,000; if it's 99.99%, it's 10,000. That sounds great — now we're talking — but the problem is that 99.99% is very hard to get. It means you need to do very, very little work in the serial part of the program, and the serial part can be just distributing tasks to many, many processors, which takes time; you need to really minimize it. That's why Amdahl's law is so strong. So basically that's what this says: maximum speedup is limited by the serial portion — this is called the serial bottleneck — and this serial bottleneck term especially came about because of vector processors. People basically said, we have this serial bottleneck, how do we deal with it? As we will see, Cray's solution was to design the fastest scalar machine of its time: Cray designed the fastest supercomputers, which were the fastest vector processors, and they were also the fastest scalar processors at the time, so that they could get out of the serial bottleneck as quickly as possible.
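(Written out, with f the parallelizable fraction and n the number of processing elements, Amdahl's law is:)

```latex
\text{Speedup}(f, n) = \frac{1}{(1 - f) + \frac{f}{n}},
\qquad
\lim_{n \to \infty} \text{Speedup}(f, n) = \frac{1}{1 - f}
```

Plugging in the numbers from above: f = 0.5 gives a limit of 2, f = 0.99 gives 100, f = 0.999 gives 1,000, and f = 0.9999 gives 10,000.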
So basically all parallel machines suffer from the serial bottleneck: whenever you're parallelizing code, it doesn't matter how you're parallelizing it, you suffer from this. That's why this paper was assigned earlier — how many people did this review? Or you don't remember? Some people — okay, maybe it's been so long that you don't remember. These sorts of issues are really interesting, and this is not the only one; this is a lecture from our computer architecture course where we discuss a lot more, because it's not just about the serial bottleneck. I'll give you some critical thinking: even this f/n is not always achievable. In single instruction, multiple data processing, f/n is actually relatively accurate, because you're really parallelizing the operations concurrently across multiple data elements, and that's probably the closest you can get to f/n — in SIMD processors — but we will see a lot of other issues. If you're interested you can look at those lectures. So that was one limitation, basically parallelization: if you cannot fit your application to vector processors you suffer, you don't get the benefits, and Amdahl's law is one reason why you suffer. There could be other reasons — you might not be able to reach that parallelization level with your programming skills — and GPUs suffer from this also: if you want to program a GPU, you'd better be an expert to actually maximize your parallelism. Another limitation, or downside, as we will see, is that memory bandwidth can easily become a bottleneck, especially if you don't have balance between compute operations and memory operations, and especially if data is not mapped appropriately to memory banks. We're going to talk a lot about this, actually starting today, but we're going to see memory structure a lot more later. One of the major places where memory first became a bottleneck was really vector processors, because these push the limits: with one single instruction you're unleashing so many memory operations — I want to load a billion elements from memory; how does your memory system support that? The most sophisticated memory systems were initially in vector processors, and we're going to talk about some of those ideas, but of course now they're in all processors, because we're executing all sorts of code in all processors today. Okay, so let's go into vector processing in more depth, unless there are questions. We're going to talk about vector data registers a little bit more: each vector data register holds N M-bit values, and you can see vector register zero, vector register one — you can have a vector register set, essentially a vector register file, as we have seen. Then we have vector control registers: vector length and vector stride, which we already talked about, and then I'm going to introduce the vector mask, because you should be able to do conditional operations on vectors — meaning, if you have a vector, how do you do a conditional operation on it? If something is true, increment this element; otherwise, do not increment it, for example. Basically you have a vector mask register that holds a mask saying whether the operation should be done on each element of the vector or not, and then based on that mask you apply the operation to the vector: if the bit in the mask is one, the operation is done; if the bit in the mask is zero, the operation is not done.
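(A minimal C sketch of these vector-mask semantics — the register and vector names here are illustrative, not a real ISA: a vector test sets one mask bit per element, and a subsequent vector add only updates the elements whose mask bit is one.)

```c
#define VLEN 64   /* illustrative maximum vector length */

void masked_vector_add(const float Vk[VLEN], const float Va[VLEN],
                       const float Vb[VLEN], float Vd[VLEN])
{
    int vmask[VLEN];

    /* Vector test: VMASK[i] = (Vk[i] == 0) ? 1 : 0 */
    for (int i = 0; i < VLEN; i++)
        vmask[i] = (Vk[i] == 0.0f);

    /* Masked vector add: only elements whose mask bit is 1 are written. */
    for (int i = 0; i < VLEN; i++)
        if (vmask[i])
            Vd[i] = Va[i] + Vb[i];
}
```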
This is a form of conditional execution, or predicated execution, and it was also developed in vector processors first; predication later went into other high-performance processors, though it was not very successful there. We're going to get back to this — we're going to see how to program using vector masks. So: the maximum vector length can be N; this is, as I said earlier, the maximum number of elements stored in a vector register, and you can see N over here in these vector registers. The vector mask register, VMASK, indicates which elements of a vector to operate on, and it is set by vector test instructions — very similar to compare instructions in scalar ISAs. For example, here we set every element of the vector mask by comparing every element of a vector register Vk to zero. What is the result of this operation? If element i of vector register Vk is equal to zero, then element i of the vector mask is set to one; otherwise it is set to zero. So this is a bit vector specifying whether the condition you tested was true for each element of the vector, and then you can use it to perform conditional operations: after you set the mask, if you do a vector add, the addition will take place only for those elements where the corresponding bit of the vector mask is set to one. That's beautiful; we're going to get back to this. Now let's look at vector functional units a little bit more. Essentially we feed data from vector registers into two ports of the vector functional unit, which is pipelined, as you can see, and eventually the results come out and each element, in a pipelined manner, gets written to the destination vector register. The operation I'm pictorially depicting here is vector register 1 multiplied by vector register 2, stored into vector register 3, and in this case we're looking at a six-stage multiplication pipeline: after six cycles, the result for element zero is available and placed into element zero of the destination register. After that, because of the pipeline, you get one element per cycle: initially you fill the pipeline, but then you get one element per cycle of throughput from the functional unit, based on all the pipelining principles we've seen. And here the control of this deep pipeline of the functional unit is simple, because the elements in a vector are independent. Good. And this is an example of a real machine, the Cray-1 — you can actually see a version of it in the CAB building; how many of you have seen it? That's the yellow machine. Have you operated it? I don't know if it's operational — it's quite old, as you can see. But let's take a look at what it entails. A real machine is never only a vector processor — a vector processor, I should say, is also a scalar processor, because you need to do both kinds of operations. As you can see over here, there's a vector unit with vector registers, similar to what we have described, and this particular processor had eight 64-element vector registers, each element 64 bits. For its time that's actually pretty powerful — at the time we were talking about 4-bit and 8-bit microprocessors — and the goal was really to do supercomputer-type applications, scientific
simulations, etc. Okay — 16 memory banks; we're going to talk about that. Memory needs to be heavily banked so that you can actually get one element per cycle from memory. How you sustain one element per cycle is important for a vector processor — really, for every processor you need more than one element per cycle from memory, but for a vector processor we start with one element per cycle. This machine also has scalar registers, so there's actually a scalar unit over here: it has eight 64-bit scalar registers, very similar to a scalar ISA like the MIPS we have seen. There's also an address unit that does address computation, with address registers, and there are floating-point units as well. So it's a pretty beefy processor for its time — it's actually multiple chips, given the era, the 1970s — but it's a processor overall, and you can see that it's very heterogeneous; heterogeneity is really built into this processor to begin with. But the main core is really the vector processing part, which hopefully does most of the computation: if you're doing computation over here, in the scalar unit, you're really exercising the serial bottleneck, and that's why these folks designed the scalar processor to be the fastest of its time, which is fascinating. You should always be doing operations here, in the vector unit, but when you go over there, you've got to get out quickly — so the scalar unit has to be the fastest. That's the importance of Amdahl's law. If you're interested you can read this beautiful paper that describes the Cray-1 system; this is the machine that you may have seen, and you may have read about it. I'm not going to talk about it — this is a block diagram of the same thing in a nicer way, let's say. I think you can actually find more information on it; maybe there's a booklet over there, or maybe I'm remembering some other place, but there might be a booklet. Cray basically designed many, many vector processors; it's not that important what exactly they designed here, but it's important historically. They have a lot of functional units, as you can see: address functional units, scalar functional units, vector functional units — which we're going to talk a lot about today — and also floating-point functional units, and they specify the latency in terms of clock periods, for the vector functional units for example. There's more, and you can see that the memory is important and the I/O is important, so it's a full system, essentially; we're going to see more of this later. I'll give you a story here which actually conveys the idea of Amdahl's law really nicely. Seymour Cray was a very principled designer, if you will — he was the early designer of these supercomputers — and he asked this question, which is always good to think about: "If you were plowing a field, which would you rather use: two strong oxen or 1,024 chickens?" This is a fundamental problem that keeps coming up: do you use two powerful single-core processors, or millions of less powerful, simple processors? And the answer is, it depends, as always. Maybe for plowing a field the oxen are better: they know what they're doing, they're higher performance, and they can be a lot more energy efficient also, as we discussed yesterday, because they finish the job much earlier, whereas the chickens will run around and never finish the job — so you'll be expending energy on the chickens forever, and then eventually
you decide to eat them, maybe — I don't know, maybe not — but basically they will never finish the job. So that's the joke of it, and it's good to think about this picture when you have a problem at hand: what kind of processor do you use? Vector processors are trying to do this one, and a heavy out-of-order superscalar processor may be trying to do that one. Anyway, Seymour Cray was a smart person — or rather, the designers of early supercomputers were smart people — so they actually designed both: they wanted to have both the chickens and the strong oxen. Okay, let's talk about memory banks, because this is going to be important. How do you load and store vectors from and to memory? This requires loading and storing multiple elements — you don't want just a single element, you want many elements, and you want throughput. We saw that the throughput of the functional units can easily be one element per cycle; now, how do you ensure that memory gets you one element per cycle, every cycle? Because we're going to be loading millions and billions of elements, potentially. Elements are separated from each other by a constant distance — we're going to assume this initially; this is called the stride, and if this is not true it becomes a bit harder, as we will see. We're going to assume a stride of one for now, for simplicity, and I'll generalize later. Elements can be loaded in consecutive cycles if we can start the load of one element per cycle: every cycle you start the load of one element, and hopefully you get the result after N cycles, where N is the memory latency. Memory latency is hard to reduce, for reasons we'll discuss later, so we should be able to start one memory operation per cycle and get one element per cycle, every cycle — that's basically our goal. The problem is, how do we achieve this with a memory that takes more than one cycle to access? If you have a monolithic memory system, you ask memory, give me one element; let's say 11 cycles later it gives you one element, but during those 11 cycles it's busy — you cannot give it another request, so you need to wait for 11 cycles. So if you have a monolithic memory system with a single port, your throughput is one element every 11 cycles. Now, how do you get to one element per cycle if your memory takes 11 cycles? The answer is really banking the memory. Of course, you could say, add more ports to memory — yes, but that's expensive. You could have 11 ports, so that you can start one element per cycle, and once you get the first element you started, you can start the next one, because one port becomes free; so you'd need 11 ports, essentially, if your memory is 11 cycles. But another solution is banking the memory, which is a cheaper solution: interleave the elements across banks. We're going to use this solution, and we're going to look at it again — so if you miss this today, which hopefully you won't, we're going to see it again when we talk about memory systems. But this is the idea of memory banking. Remember, we had a memory address register and a memory data register, and we assumed memory was monolithic. Now we're not going to assume that: we're going to have 16 banks in this particular case, with a memory address register for each and a memory data register for each, but we're going to have a shared data bus and a shared address bus across the banks, because these
are actually very, very expensive. If you think about memory, this is really the memory chip — well, this is the memory chip boundary, I should have drawn it over here — and this is the CPU boundary, and they communicate with each other over the address and data buses. Increasing the size of the address and data buses actually hurts your cost a lot, because it adds pins to the chips, and if you add pins your cost increases. In this case we're not going to add pins: we're going to send one address per cycle, targeting one particular bank, and then we can get one data element per cycle from the MDR of one particular bank. But we should be able to send one address per cycle to different banks. That way, we can start the memory access of one bank in the first cycle, and then if we need data from the next bank we can start the memory access of that bank in the next cycle, and the next bank in the cycle after that — so we can start one memory access per cycle to different banks, and if each bank takes 11 cycles you just need 11 banks. But of course 11 is not a great number to distribute your addresses over, because our addresses are fundamentally powers of two — that's why I have 16 over here. So basically, memory is divided into banks that can be accessed independently, but the banks share address and data buses to ensure that the memory chip pins are not increased — we're going to talk about that a lot when we get to memory design, starting next week. This way you can start, and complete, one bank access per cycle. That sounds good so far, but under what conditions? That becomes the issue; we'll talk about it. Basically, you can sustain N concurrent accesses if all N go to different banks. So let me finish the vector memory system and then we'll take a break — almost done. If you look at a vector memory system, the next address is the previous address plus the stride, so address calculation is very simple: you have a base address, you add the stride to it, and that's your next address, and then you keep adding the stride to the next address you generate and latch it every cycle. Initially you start with the base address — hopefully that goes to this bank; you increment it by the stride — hopefully it's one, let's say — and that goes to this bank in the next cycle; you increment by the stride again, and that goes to this bank, and then this bank, this bank, this bank. So consecutive accesses go to consecutive banks, and as a result you can sustain one memory access per cycle and one word, one element, per cycle. But this of course requires what I just formalized over here: if your stride is one, and consecutive elements are allocated to different banks — in memory terminology, consecutive elements are interleaved across the banks — and the number of banks is greater than or equal to the bank latency, then you can sustain one element per cycle of throughput. If any of these conditions is not true, this equation gets broken and you cannot sustain that throughput. I'm going to generalize the stride later on — stride one is nice, but stride one alone is not enough: if consecutive elements are all in the same bank, somebody did not do the data mapping well, and too bad, you're back to one element every 11 cycles. This is what Josh Fisher was talking about in that paper: he was saying that if the programmer doesn't know where to map the data, you're going to get a terrible result — and that terrible result could be one element every 11 cycles.
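(A toy model of this, as a C sketch with assumed numbers — 16 banks, 11-cycle bank latency, elements interleaved across banks: with stride 1, each new access hits a different bank and there are no stalls; with a stride that maps every access to the same bank, you are back to roughly one element every 11 cycles.)

```c
#include <stdio.h>

#define NUM_BANKS    16
#define BANK_LATENCY 11   /* cycles a bank stays busy per access (assumed) */

int main(void)
{
    int strides[] = {1, 16};  /* stride 16 maps every access to bank 0 */
    for (int s = 0; s < 2; s++) {
        long busy_until[NUM_BANKS] = {0};  /* cycle at which each bank frees up */
        long cycle = 0, stalls = 0;

        for (long i = 0; i < 64; i++) {            /* issue 64 element accesses */
            int bank = (int)((i * strides[s]) % NUM_BANKS);
            if (busy_until[bank] > cycle) {        /* bank conflict: must wait  */
                stalls += busy_until[bank] - cycle;
                cycle = busy_until[bank];
            }
            busy_until[bank] = cycle + BANK_LATENCY;
            cycle++;                               /* at most one new access per cycle */
        }
        printf("stride %2d: %ld stall cycles for 64 accesses\n",
               strides[s], stalls);
    }
    return 0;
}
```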
So data mapping — this is what's being referred to: it's not really the mapping of operations to functional units, it's the mapping of data to memory banks, and if you mess this up, everything else is messed up in a vector processor, including a GPU. If you're programming a GPU — how many people here program GPUs? okay — you probably know the importance of data mapping, at least somewhat. It depends on what kind of performance you want to get; if you want the highest performance, this is extremely important, and we will see this again soon. Okay, this is a great place to take a ten-minute break, so we'll be back at 25 past, and then we're going to actually go through this code and execute it on a vector processor — but this is scalar code, so we're going to see the scalar execution first. There were some good questions about the programmability of systolic arrays and how much performance you would get. We did not cover a lot of programmability issues, but even for systolic arrays, if you really want high performance you need to orchestrate your data, as we discussed, and make sure you feed your matrix operations in a pipelined manner into that matrix unit we discussed with the TPU, for example. So programming these more specialized, different paradigms is not as easy as programming the von Neumann architectures we have seen. Okay, so now that we know about banks — don't forget about the banks; let me go back, actually, just to remind you and make things a bit clearer. Here we're dividing a monolithic memory into 16 pieces. Imagine the memory we thought of as monolithic: it has 2^N locations, and we divide it into 16 different pieces, so this one has 2^N / 16 locations, this one another 2^N / 16, this one another 2^N / 16, dot dot dot. Each piece holds some part of main memory, and now we need to lay out the data such that we can sustain one access — one element provided by memory — per cycle. For that, these conditions need to be satisfied: if the stride is one, consecutive elements need to be interleaved across banks, and the number of banks needs to be greater than or equal to the bank latency; then you can sustain one-element-per-cycle throughput. This is what I'm going to assume when I say you have N banks, if you will. Okay, but now let's start with an example of the benefits — at least the performance benefits — of a vector processor, and this is some scalar code; we're going to look at the performance of the scalar code on simple machines first. This is the element-wise average that I showed you earlier: if you write this element-wise average loop in scalar code, you write a loop, basically.
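Just as a reference for what we're about to execute, a C-level sketch of that element-wise average might look like this (the array names and the length of 50 match the example; the right shift by one is the divide-by-two).

```c
/* Element-wise average of two 50-element arrays, scalar version.
   This is the computation the scalar assembly on the slide implements. */
void avg_scalar(const int *A, const int *B, int *C) {
    for (int i = 0; i < 50; i++) {
        C[i] = (A[i] + B[i]) >> 1;   /* shift right by 1 = divide by 2 */
    }
}
```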
This shows each instruction and its latency in clock cycles. First you initialize 50 — the loop bound — and the addresses of A, B and C into registers, and each of those takes one cycle; that's the loop initialization. Then, within the loop, you need to load one element from array A and increment the address you load from. Here we actually combine multiple instructions using an addressing mode called auto-increment addressing: if you're always incrementing the address by one, then while you use the old value of R1 as the address, you also increment R1. So this instruction does multiple things: it loads the memory location specified by R1 — which is initially the address of A — and then increments R1 by one, so that you move to the next element of the vector. We're assuming the elements are stored consecutively. That way you load the zeroth element from vector A, and then the zeroth element from vector B; each of these takes 11 cycles, as you can see. Then you add them — so there's a dependence on those two loads — and put the result into R6. This is all scalar code; there's no vector anything happening here. The add takes four cycles, and then there's a dependent shift: you shift by one, meaning divide by two; assume that takes one cycle. Then you store the result of the shift to the memory location specified by the address of C — element zero to begin with — and increment that address by one as well; the store takes 11 cycles. And then there's this decrement-and-branch-if-not-zero instruction — a nice instruction: you decrement R0 and branch if it's not zero, meaning you keep iterating the loop until R0 becomes zero, so you count down from 50 to execute this on every single element — and that takes two cycles. So this is what you would do on a scalar processor; you could write it this way. And we actually saved some instructions, as you can see, using auto-increment addressing: we don't need separate increment instructions, we have a relatively special-purpose branch, and the store also auto-increments, by the way. If you count, you execute 304 dynamic instructions, which is a lot. We will reduce it down to seven with vector code — it's going to be beautiful from that perspective, and nice from the energy-efficiency perspective. But before we get to that, let's do some rough analysis of how long it takes to execute this code on a scalar in-order processor with one bank of monolithic memory. Essentially, with monolithic memory the first two loads in the loop cannot be pipelined, so each of them takes 11 cycles, and the entire loop takes roughly the sum of these latencies: 11 + 11 + 4 + 1 + 11 + 2, about 40 cycles. You could potentially overlap the branch, but eventually you need to resolve it, so let's take 40 cycles for the loop. You do 50 iterations, so that takes 2,000 cycles, plus the four initialization instructions, so in total it takes about 2,004 cycles. We're going to reduce this with a vector processor, but you could also reduce it with an in-order processor and a better memory system. So let's take a look: scalar execution time on an in-order pipelined processor with one bank but two memory ports — such that two different memory accesses can be serviced concurrently — or with two banks where arrays B and C are stored in different banks. Okay, no, that should be A and B, right? I'm fixing it online, because A and B are the arrays we actually load with the load instructions. Where were we? Okay, so basically A and B are in different banks, so you start the first load in the first cycle and then you start — pipeline — the next load in the next cycle, such that both loads together take 12 cycles: one of them takes 11 cycles and the other finishes one cycle after that. But then you cannot parallelize anything else,
because everything else afterwards is dependent, if you remember this code. So those two loads together take 12 cycles now, then 16 after the add, 17 after the shift, 28 after the store, and 30 after the branch — okay, about 30 cycles per iteration. You could potentially do some more pipelining, but it's not going to reduce this significantly, so without going into the machine details too much, it takes about 1,500 cycles total. We're going to reduce this with a vector processor, of course. Out-of-order execution and so on can help you a little more if you have a larger window, but in this case out-of-order execution is not that great, because everything in the loop except for the branch and those two loads forms a dependence chain. The next iteration of the loop is independent, of course, and out-of-order execution lets you go to the next iteration, so it can parallelize loop iterations — that's one of its benefits — but it's clearly going to be expensive with out-of-order execution. You can do more performance analysis, which we're not going to do right now; we're going to compare this against a vector processor. For this to work on a vector processor, the loop needs to be vectorizable, and what does that mean? A loop is vectorizable if each iteration is independent of every other. In this case that's absolutely true: if you look at this loop, each iteration is independent because it operates on different elements. So if you really want to program a vector processor, you need to do this loop dependence analysis carefully, and if you want to design a compiler that automatically vectorizes or parallelizes code, that compiler needs to do the dependence analysis — and it turns out that's not easy unless the code is as simple as this, for example. So if you vectorize this loop, this is what the vectorized loop looks like, and it's much simpler; this is it, basically. Our vector length is 50, so we initialize the vector length register with 50, and our vector stride is one because we laid out our arrays nicely. The first instruction is a vector load into V0, vector register zero, starting from the address of A, and it automatically says: load the next 50 elements starting from the address of A into vector register zero. How long will it take? The first element comes back after 11 cycles, assuming the same memory system, and then you get one element per cycle in the consecutive cycles, so it takes 11 plus 49 cycles — vector length minus one — because the first element comes after the memory latency and the remaining 49 elements take 49 more cycles to arrive. If your vector length were 30, you would just plug your vector length in over here; and if it were actually larger than your vector register length, you would need to restructure your program, as we will see later, but it's not that bad — the core of your program would still be this. Okay, so that's this instruction. The next instruction is a vector load that loads the elements of array B into vector register one. The next instruction is a vector add: it adds vector register zero to vector register one and puts the result into vector register two; the first result is available after four cycles, because that's the latency of the adder, and then the remaining 49 results of the addition come in 49 consecutive cycles — vector length minus one cycles. Next, you take the result that's in vector register two and shift each element by one.
This takes 1 + 49 cycles, as you can see: the first shift finishes after one cycle, and then you get the other 49 shift results in 49 consecutive cycles. The last instruction is a vector store, which takes the result of the shift in vector register three and places it in memory starting from the address of C. Its latency is very similar to the vector loads: it takes 11 cycles to write the first element and then 49 more cycles to write the consecutive elements. This sounds good, and you can see immediately that there's no branch here — we got rid of the branch — and we have only seven dynamic instructions: two of them are setup instructions and the remaining five actually do the bulk of the work. This is where the energy efficiency comes from: the control overhead of those 304 instructions is gone. And this is a very small vector — you're doing this on only 50 elements; imagine doing it on much larger vectors. Okay, so now let's look at how long it takes to execute this code. Initially we're going to assume a very simple vector processor without even forwarding of data — this is called chaining in vector processing; I don't necessarily like the terminology, but you will see why it exists. There is no data forwarding between vector functional units: you have to finish all vector-length operations in one functional unit before you can start another operation on the same vector register. Essentially, you cannot forward individual elements from one functional unit to another; the output of a vector functional unit cannot be used as a direct input of another, and the granularity of waiting is really the entire vector register, if you will — the entire vector register needs to be ready before any element of it can be used as part of another operation. We're also going to assume one memory port — one address generator — per bank; this is a good assumption in general, but we're going to add more to it later on. And we're going to assume 16 memory banks, word-interleaved, meaning consecutive elements of an array are stored in consecutive banks. This is going to help us get consecutive elements in consecutive cycles: each bank's access latency is 11 cycles, so the first element takes 11 cycles in bank 0, the next element comes back one more cycle later from bank 1, the element after that another cycle later from bank 2, and so on. So we're basically pipelining the memory system as well, using banking. This is the execution time of the code. If you remember, there are two setup instructions that I don't show over here — vector length and vector stride, one cycle each. Then this is the vector load of A into vector register zero: the first element comes after 11 cycles, and because the memory system is interleaved nicely and the data is mapped nicely, the consecutive elements come in the following 49 consecutive cycles, serviced by consecutive banks. It wraps around, of course, because you keep interleaving across banks; but because your access latency is 11, your bank count should be 16 — if your bank count were eight, you would have a stall of three cycles: you'd service eight requests, then stall three cycles, and keep doing that. But that's not what we depict over here.
Okay, so that's the vector load. And because we have only one memory port, we cannot start the next load concurrently — all of these banks are now busy servicing the elements of the first vector load. So the next vector load can start only at this point (you could potentially schedule it a bit earlier, but I'm keeping the analysis simple here): the first element of the next vector load comes after 11 cycles, and the consecutive elements come in the following 49 cycles. After that data is done, you can start the add operation, which adds vector zero to vector one: the first element finishes after four cycles, and the consecutive elements come in 49 cycles. Then we can start the shift — there's no data forwarding yet; we're going to fix that — so the first element of the shift comes after one cycle, and then you get the 49 remaining shift results in 49 consecutive cycles. And then the store: same story. So basically, without a lot of effort, this is the result: 285 cycles. Now, you could say maybe you can start this second load earlier, and you would be right — you can actually start it a bit earlier — but it's not going to save you more than about 10 cycles, let's say. So, not bad: now we have a banked memory system, and we've already reduced the 1,500 cycles down to 285 cycles, and we're probably also more energy-efficient, because we're not decoding a lot of instructions and we don't have any branches.
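If you want to check that 285 number yourself, a back-of-the-envelope model is enough: with no chaining (and a single memory port), each vector instruction effectively runs to completion before the next one starts, and each costs its startup latency plus vector length minus one cycles. The sketch below just adds those terms up with the latencies assumed in this example.

```c
#include <stdio.h>

int main(void) {
    const int VLEN = 50;

    /* Startup latencies assumed in the example: memory 11, adder 4, shifter 1. */
    const int startup[] = { 11, 11, 4, 1, 11 };   /* VLD A, VLD B, VADD, VSHF, VST */
    const int n_insts  = 5;

    int cycles = 2;                               /* the two setup instructions (VLEN, stride) */
    for (int k = 0; k < n_insts; k++)
        cycles += startup[k] + (VLEN - 1);        /* no chaining: instructions fully serialized */

    printf("no-chaining execution time: %d cycles\n", cycles);   /* prints 285 */
    return 0;
}
```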
So why 16 banks? Hopefully this is clear by now: we have an 11-cycle memory access latency, and having 16 banks ensures there are enough banks to overlap enough memory operations to cover that latency. The above also assumes unit stride — that's what we're assuming, and it's correct for our example program. If the stride is greater than one, how do you ensure you can still access one element per cycle when the memory latency is 11 cycles? Well, then you need to be more careful; we'll get back to that. Okay, so let's talk about how to improve the performance of this — people are fine with it so far, right? We're going to improve it with vector chaining, to begin with. Look at the timing of the first elements: this is the add of vector register zero and vector register one, and its first result is actually available over here; you could start the shift right after that and shift the first element by one. But we're waiting for the entire vector register to become ready, and we don't have to: we can forward this result to the shift unit, and similarly, after the shift, we can start the store right after its first element. We can do element-wise forwarding of results, and that's the idea of vector chaining — the fancy word for data forwarding. We forward data from one vector functional unit to another. Here's one example: you do a load into vector register one and forward it element by element to this multiply unit, and forward the result of the multiply element by element to the add unit. So the load unit is loading V1 and also sending each result on to the multiply unit — this is pictorial, of course; in hardware you need muxes and so on, which we're not showing, just like the forwarding we did for the pipelined processor. Basically, every element goes to the multiply unit in a pipelined manner, so you don't wait for the entire register to fill up; similarly, the result of the multiply unit is supplied to the add unit element by element in consecutive cycles, so the add unit overlaps its execution with the multiplier, which in turn overlaps with the load. So while the vector results are coming out, they're flowing into these different functional units, and this parallelizes the back end of our program: after you get the first element of the second load, you can start the — sorry, this add — because you have the first element of vector one and you already had the first element of vector zero; you add them, and in consecutive cycles you get the second, third, fourth, fifth element, up to the last element. You overlap the latencies: you start the vector add right after the first element of the load arrives, you start the vector shift right after the first element is produced by the vector add functional unit, and you could potentially start the store right after the first element of the shift is done — but unfortunately we have a resource conflict in memory over here: you cannot start this vector store, because we have only one port per bank, as we said, so that's why we wait at this point. But even with that wait — even without beefing up our memory system any more — this vector data forwarding brought us from 285 cycles down to 182 cycles, which seems like a good win; and again, if your vector lengths are larger, the benefits are larger. Okay, not bad. Now let's tackle the memory system a little bit. We could not parallelize this load with this load, and deficiencies in our memory system made us wait on the store: these two loads cannot be pipelined with each other, and this load and the store cannot be pipelined, essentially because we don't have enough ports in the memory system. Each memory bank has a single port — that was our strict assumption — and this led to a memory bandwidth bottleneck; we also assumed that port is shared by reads and writes. So now we're going to assume two load ports and one store port in each bank — a more expensive memory system — and this enables perfect pipelining. You start the first load in cycle one; the second load is completely independent of the first, so it goes to the other load port and you start it right away, and you almost completely overlap the latencies of the two loads, because they go through two different ports into memory. Of course, two different ports are expensive: two ports really means two more address buses and two more data buses — shared across the banks, but still expensive. But now you can parallelize, and you no longer have the problem with the vector store: you can start the vector store at any time, because it goes through its own port — sorry, port, not bank; the different elements of the vector store still go to consecutive banks, as we discussed. So you can almost completely parallelize any number of loads, as long as each has a port servicing it, and stores as well. This brings us to 79 cycles, and if you compare that to the 1,504 cycles of the scalar version, that's a 19x performance improvement, which is not bad. You could say this is maybe an unfair comparison — we beefed up the memory system — yes, but
you could also say that the in-order processor was not going to utilize that memory system anyway, because it's not capable of doing so. An out-of-order processor maybe — you'd need a bigger instruction window — but if you give this memory system to the in-order processor, it's not going to do much with it; it helps a little with the first two loads, but not that much. Does that make sense? Okay. So this is the advantage of vector processing, as you can see: large performance improvements and large efficiency improvements. Now let's tackle some questions that also make programming a bit harder. What if the number of data elements is greater than the number of elements in a vector register? This one is relatively easy: you need to break the loop such that each iteration operates on at most a vector register's worth of elements. For example, if you have 527 data elements that you're operating on for some reason, and you have 64-element vector registers, you basically need a loop around the code we've shown: eight iterations where the vector length is 64, and one iteration where the vector length is 15. The eight iterations with vector length 64 are nice because you don't need to change the vector length; but when you change the vector length, a lot of things in the machine change — you were assuming one vector length and now you change it, so you have to wait, which introduces stalling. I'll let you think about that. This is called vector strip mining, and the name comes from mining terminology: there's some really interesting part of the mine — the gold — and to reach the gold you have to get rid of the overburden, so you strip-mine the top of the mountain and then you reach the gold. What is the gold here? The gold is the eight iterations where the vector length is perfect; the rest is the dust, or whatever it is you're stripping away. If you read the definition, strip mining is removing the soil and rock overlying the mineral deposit — the gold. People are very inventive with terminology, as you can see.
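A strip-mined version of such a loop might look like the C sketch below; the maximum vector length of 64 matches the example, and process_chunk is a hypothetical stand-in for the vector instructions you would issue after setting the vector length register to len.

```c
#include <stdio.h>

#define MVL 64   /* maximum vector register length, illustrative */

/* Stand-in for the vector instructions executed with VLEN = len. */
static void process_chunk(int start, int len) {
    printf("vector iteration: elements %d..%d (VLEN = %d)\n", start, start + len - 1, len);
}

/* Strip mining: handle the odd-sized remainder first, then full-MVL chunks. */
static void strip_mined(int n) {
    int done = 0;
    int first = n % MVL;                              /* e.g. 15 when n = 527 */
    if (first) { process_chunk(done, first); done += first; }
    while (done < n) { process_chunk(done, MVL); done += MVL; }
}

int main(void) {
    strip_mined(527);   /* prints 1 iteration of 15 followed by 8 iterations of 64 */
    return 0;
}
```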
Okay, so now let's talk about some more thorny issues. What if the vector data is not stored in a strided fashion in memory — what if you have irregular memory accesses to a vector? The idea is to use indirection: remember the indirect load we introduced in LC-3; it's coming back to us now. We're going to do indirect loads — use indirection to combine, or pack, elements into vector registers. These are called scatter/gather operations, and they're very popular today, especially because of sparse matrices, which are used in machine learning quite a bit. Doing this also helps avoid useless computation on sparse vectors: if you have a vector with a billion elements and only three of them are nonzero, it doesn't make sense to go through all billion elements and operate on every one of them — assuming zero doesn't contribute to your computation, which is the case for multiplication, for example. You gather only those three elements, pack them into a vector register, and operate only on them; then you scatter them back to where they're supposed to be in memory. That's the idea, and it can be extremely advantageous in those situations. But that's not the only case where it's advantageous, as we will see; I'm going to show you another example, which is these indirect, indexed accesses. Let's assume somebody gave you this loop and asked: is this vectorizable? Your answer should be yes. Why? Because the loop iterations are independent of each other. Clearly there's an indirection here that makes things a bit more complicated, so we need to do something we haven't done before, but whether this is vectorizable is not in question — it is vectorizable. So we introduce an indexed load instruction. What we do is load the indices into this D vector, and then we do an indexed, indirect load based on rC: the instruction takes rC, adds each index from the index vector to it, and loads from the resulting addresses into the consecutive elements of vector C over here. Then you can do the rest — the rest is easier, of course; the rest is what we have already seen. This is the different part: you load the indices, and then you do an indirect load. Now let's look at an example. These are often implemented in hardware to handle sparse vectors, as I mentioned, or indirect indexing: vector loads and stores use an index vector, which is added to the base register to generate the addresses. Let's look at an example of a scatter, not a gather. This is our index vector — these are the offsets from a base address at which we're going to store this data vector. It's a simple example with only four elements, but what it means is that the vector should be stored in memory this way: at the base location (which needs to be specified somewhere) plus zero we store 3.14, at base plus 2 we store 6.5, at base plus 6 we store 71.2, at base plus 7 we store 2.71, and the rest we don't touch. So this is a vector and this is a vector, and the operation is a vector store that takes a base address, adds each element of the index vector to it to calculate an address, and stores the corresponding data element there. You're scattering the data using this vector scatter-store operation — that makes sense, hopefully. The gather is very similar, except you do the opposite; you can use the same example to gather from memory. You don't have the data vector — you're going to populate the data vector: you give it an index vector, and you do a load using the base address plus each of these different indices, and you load from those different locations into the data vector. You can convince yourself that this is relatively easy and clear. This example doesn't do it justice, because we show only eight locations and four of them are don't-cares; whenever you look at an example like this, think about the example I gave you: you have a billion elements, and only three of them are really numbers you operate on — nonzeros, let's say.
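Functionally, scatter and gather are just stores and loads through an index vector; this small C sketch mirrors the four-element example above (the index values and data values are the ones from the example, and the eight-entry memory array stands in for the region starting at the base address).

```c
#include <stdio.h>

int main(void) {
    double mem[8]  = {0};                      /* a small piece of memory starting at "base" */
    int    idx[4]  = { 0, 2, 6, 7 };           /* index vector */
    double data[4] = { 3.14, 6.5, 71.2, 2.71 };

    /* Scatter: data[i] goes to mem[base + idx[i]]. */
    for (int i = 0; i < 4; i++)
        mem[idx[i]] = data[i];

    /* Gather: read the same locations back into a packed data vector. */
    double gathered[4];
    for (int i = 0; i < 4; i++)
        gathered[i] = mem[idx[i]];

    for (int i = 0; i < 4; i++)
        printf("gathered[%d] = %g\n", i, gathered[i]);
    return 0;
}
```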
Okay, let's talk about some other things: conditional operations. We've already touched on this, actually, but I want to give you an example. What if some operations should not be executed on a vector, based on a dynamically determined condition, like this? Again, if somebody asks whether this is vectorizable, you say absolutely yes, because the loop iterations are independent of each other; but there is some conditional control flow: if an element of A is not zero, you store the result of this multiplication into the destination B vector, and if the element of A is zero, you don't do the store. So the idea is to use masked operations: you set the vector mask register — a bit mask determining which data elements should not be acted upon. This is how you would translate the code: you vector-load V0 and V1 from A and B as we normally do, and then you set the vector mask register by checking whether each element of vector zero is not equal to zero; based on this condition, the vector mask bits corresponding to elements that are not equal to zero are set to one. Everything after this takes effect depending on the vector mask: for example, when you do this vector multiply, only the elements for which the mask bit is set actually update the corresponding destination elements. The multiply takes effect only where the mask bit is set; for all the elements where the mask bit is not set, the multiply has no effect. That's conditional execution, and the same is true for the store. To go back to unconditional execution, you need to reset the vector mask register to all ones. And if your vector mask register is all zeros, these instructions are effectively no-ops: you still execute them, but in effect they are no-ops because all the mask bits are zero. This is actually predicated execution — we did not cover predicated execution much when we talked about branch prediction — but essentially this is converting a branch into data flow: the vector mask provides the conditions, and those conditions decide whether an operation takes effect on each element. There is no branch; it's all data flow, all dependent on the vector mask register. This is beautiful, and this is where predicated execution has been extremely successful: execution is predicated on the mask bit — it happens if the mask bit is one — and this actually applies to all operations in a vector processor. I did not mention it earlier, but in the previous examples the vector mask was all ones, which is why everything was unconditional; once you change the vector mask, any element whose corresponding mask bit is zero does not take effect. Here's another example with masking — this one is a bit more sophisticated, and I won't give you the code; you can write it yourself — and it's again vectorizable, but now we're checking: if A[i] is greater than B[i], we set C[i] to A[i]; otherwise we set C[i] to B[i]. What we need to do is compare A and B to get the vector mask, then do a masked store of A into C, then complement the vector mask — that's the "else" — and do a masked store of B into C; and once you exit this sequence, you set the vector mask back to all ones. So you can do operations on the mask vector itself — you're basically doing predicate computation on the mask vector — and this is very powerful. In fact, the ARM ISA had conditional execution on all instructions, but then they got rid of it, because it was very difficult to support in a scalar processor; in a vector processor, though, it makes a lot of sense, because it gets rid of these branch instructions completely.
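Functionally, this is what the mask machinery computes; here is a small C model of that second example (C[i] gets A[i] where A[i] > B[i], and B[i] elsewhere) with an explicit mask vector — a model of the semantics, not of how the hardware is built, and the array values are made up for illustration.

```c
#include <stdio.h>

#define N 8   /* illustrative vector length */

int main(void) {
    int A[N] = { 5, 1, 7, 2, 9, 3, 4, 6 };
    int B[N] = { 3, 4, 7, 1, 2, 8, 4, 5 };
    int C[N];
    unsigned char vmask[N];

    for (int i = 0; i < N; i++)      /* VMASK = (A > B), one bit per element */
        vmask[i] = (A[i] > B[i]);

    for (int i = 0; i < N; i++)      /* masked store of A into C where the mask bit is 1 */
        if (vmask[i]) C[i] = A[i];

    for (int i = 0; i < N; i++)      /* complement the mask, then masked store of B */
        if (!vmask[i]) C[i] = B[i];

    for (int i = 0; i < N; i++)
        printf("%d ", C[i]);         /* element-wise maximum of A and B */
    printf("\n");
    return 0;
}
```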
Okay, so this is the execution: if this is your A vector and this is your B vector, this is what your vector mask will be, and you can convince yourself by comparing whether A[i] is greater than B[i]. So how do you implement these vector mask instructions? A simple implementation is to execute all of the operations but turn off the result write-back according to the mask. These are the mask bits, corresponding to the elements going into the functional units; say it's a multiplication — you do all of the multiplications, but before you write the elements into the vector register, you check whether the mask bit is zero or one: if it's zero you don't write that element into the vector register entry, and if it's one you do. This can be very energy-inefficient, as you can see: if your vector mask is zero for almost all elements, you're doing lots of computation in your functional unit but not writing the results. Another implementation is the density-time implementation, where you first scan the mask vector, figure out which bits are actually ones, and execute only those operations. Which one is better depends on the complexity of the scanning and on how many zeros you have in the mask — I'll let you think about that. There are some other issues. Stride and bank count: let's revisit this. We assumed stride one, but actually, as long as the stride and the bank count are relatively prime to each other — they share no common divisor — and there are enough banks to cover the access latency, you can sustain one element per cycle of throughput. That is the real generalization: you need your stride and your bank count to be relatively prime. How do you make sure that happens? Good luck — especially if you have a power-of-two number of banks, this is very difficult to achieve. If your bank count could be a prime number, that becomes interesting, but then how do you do the address mapping with a prime number of banks? Okay, we also have the row-major versus column-major discussion, but I'm not going to go through it because I want to cover a few more things before we finish — you can leave if you need to leave, but I'm almost done. Basically, you need to change the stride when accessing a row versus a column, and we've already seen that the purpose of the initial example is very difficult to achieve: different strides can lead to bank conflicts. This is a stride of one, this is a stride of ten, but when you start doing something else — when you start multiplying this array with some other array — you may start getting bank conflicts. So how to minimize bank conflicts has actually concerned vector processor designers a lot, and a lot of what was implemented in vector processors has trickled into the systems we use today, because we still need to minimize bank conflicts. More banks? We have more banks today, but they're expensive. More ports in each bank? Expensive. A better data layout to match the access pattern? How do you do that — you need a hero programmer, and it's not always possible. A better mapping of addresses to banks? Now this becomes interesting: people have proposed a lot of mathematical, randomized address-mapping schemes, where you take an address and the memory controller randomizes the address mapping — with a predictable function, of course — such that the probability of conflict is reduced as much as possible.
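To see why relative primality matters and what randomization can buy, here is a tiny experiment: with 16 banks, a stride of 16 under the usual modulo mapping hits a single bank, while an XOR-folded mapping — just one simple illustrative flavor of such randomized schemes, not the specific function from the work mentioned next — spreads the same access stream over many banks.

```c
#include <stdio.h>

#define NUM_BANKS 16   /* power of two, as is typical */

/* Straightforward low-order-bits (modulo) bank mapping. */
static unsigned bank_mod(unsigned word_addr) {
    return word_addr % NUM_BANKS;
}

/* One simple randomized flavor: XOR-fold higher address bits into the bank index.
   (Illustrative only; real schemes are designed much more carefully.) */
static unsigned bank_xor(unsigned word_addr) {
    return (word_addr ^ (word_addr >> 4) ^ (word_addr >> 8)) % NUM_BANKS;
}

static int banks_touched(unsigned (*map)(unsigned), unsigned stride, unsigned n) {
    int used[NUM_BANKS] = {0}, count = 0;
    for (unsigned i = 0; i < n; i++) {
        unsigned b = map(i * stride);
        if (!used[b]) { used[b] = 1; count++; }
    }
    return count;
}

int main(void) {
    printf("stride 16, modulo mapping: %d distinct banks\n", banks_touched(bank_mod, 16, 64));
    printf("stride 16, xor mapping:    %d distinct banks\n", banks_touched(bank_xor, 16, 64));
    printf("stride  1, modulo mapping: %d distinct banks\n", banks_touched(bank_mod, 1, 64));
    return 0;
}
```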
And this is a beautiful paper that talks about the idea. You may recall the name Bob Rau — this is the person who developed VLIW processors as well — but he had actually worked a lot on vector processors before that, and you can see there's a lot of mathematical theory that goes into randomizing the address-mapping function to banks: which address should go to which bank. We're going to get back to this when we cover the memory system. So, shall we take a couple more minutes? I think we're already a couple of minutes over — who says we should take a couple more minutes to finish this part? Okay, some people; and if you want to go, you can go, no problem with that. Okay, let me finish, since more people raised their hands, because this is interesting and it finally puts things together. Array versus vector processor is really a purist distinction: most modern processors are actually a combination of both — they exploit data parallelism in both time and space. Let's take a look at that. One reason is that array processors are too idealistic: if your vector length is a thousand, it's very difficult to have a thousand processing elements — well, today you actually can — but say your vector length is a million; a million processing elements that can each do everything is very difficult to get. So in the end you really need to pipeline, like a vector processor does. Let's look at this instruction: vector add A, B, C. This is vector execution: consecutive elements go to the same functional unit in consecutive cycles — that's time. And this is what really happens in existing systems: you have some number of functional units that can execute the same function, and you feed some elements in parallel — the first four elements in parallel, as you can see — and then you pipeline the remaining ones, in parallel, across these functional units. So you're exploiting parallelism in both space and time. That sounds good. You can also partition the register file: elements numbered 0, 4, 8, 12, 16, 20, 24 go to this one, and the other ones go over there; there's a mapping, a partitioning of the register file across these functional units, so not all registers need to feed all functional units. This makes it a little harder to program, as we'll also see later. This is basically what a GPU kind of looks like today: you have partitioned vector registers, as you can see, and you have what's called a lane — a lane is essentially the functional units that operate on one piece of the register file — and you can have multiple functional units per lane, as you can see over here. You can also think of this as an array-processor-style functional unit where each slice is a lane: within each lane you pipeline the results, but you also exploit the spatial parallelism across lanes. So let's look at an example: you can overlap the execution of multiple vector instructions. We have an example machine with 32 elements per vector register and eight lanes, and let's look at the load unit. With eight lanes, in the first cycle you send eight elements, in the next cycle the next eight elements, then the next eight, and the next eight; and then you can start the multiply concurrently, and you can start the add in the next cycle.
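A tiny model of that space-time split, using the 32-element, 8-lane numbers from this example: element i goes to lane i mod 8 (space) and is issued in group i / 8 (time), and with a load, a multiply and an add all in flight, each producing eight results per cycle, you get the steady-state throughput discussed next.

```c
#include <stdio.h>

int main(void) {
    const int VLEN = 32, LANES = 8;

    /* Space: element i is handled by lane i % LANES.
       Time:  it is issued in group i / LANES, one group per cycle. */
    for (int i = 0; i < VLEN; i += LANES)
        printf("cycle offset %d: elements %2d..%2d, one per lane\n",
               i / LANES, i, i + LANES - 1);

    /* With a load, a multiply and an add executing concurrently,
       each completing LANES results per cycle in steady state: */
    printf("steady-state throughput: %d operations per cycle\n", 3 * LANES);  /* 24 */
    return 0;
}
```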
So if you look at this, the instruction-level parallelism you get in the steady state is 24 operations per cycle, while you issue only one vector instruction per cycle — that's quite good, if you will. Okay, I'll go through this quickly as well: as I said, you can automatically vectorize code. This is a perfectly vectorizable loop, and if you look at the scalar code, this is what you do: you load the first element from A, you load from B, you add them, you store into C — that's the first iteration (I'm ignoring the branch here) — and this is the second iteration. Automatic vectorization — or you, if you're vectorizing the program by hand — recognizes that these are independent and puts the loads from different iterations together. This is scalar code, but you can take scalar code, do the dependence analysis, put the loads together, and vectorize them: this becomes a vector load, this is another vector load, a vector add, a vector store. You're really fusing iterations; if you think in vector terms from the start, of course, you don't even begin with scalar code — that's the idea. So vectorization is really a compile-time reordering of the operation sequencing, but it requires extensive loop dependence analysis to make sure that different iterations are independent of each other. Let me summarize. Vector and SIMD machines are really good at exploiting regular data parallelism: you perform the same operation on many data elements, which improves performance, simplifies the design, and improves energy efficiency. But the performance improvement is limited by the vectorizability of the code, and scalar operations limit vector machine performance — always remember Amdahl's Law and the 99.9% example. The CRAY-1 was also the fastest scalar machine of its time, and "scalar" was written in large capital letters there for a reason. Many existing ISAs were actually influenced by vector machines: they started including these SIMD operations, and that's what we're going to talk about next, in the next lecture. And GPUs are clearly modern vector-plus-array machines with a different programming paradigm, which we're also going to talk about — but these SIMD extensions were there before GPUs became popular, as vector instructions in a limited way. Okay, so this is where I will stop today; next we're going to talk about GPU architectures. Is everything fine online? Okay. GPU architectures: we're going to build on the SIMD processing that we covered last time, and this is clearly a fascinating topic — GPUs are parallel computers that you have in your pocket, something nobody had the luxury of having in the past. People used to build supercomputers that were much, much less powerful than today's GPUs, and today you have one in your pocket, so you can imagine the computational power we have today. It's fun to learn about GPUs, but I'm first going to remind you of what we covered about SIMD processing, and then we're going to talk about SIMD operations in scalar ISAs — von Neumann architectures actually incorporate SIMD operations as well, as we will see — and then we're going to jump into GPUs, so you can see the impact of SIMD processing everywhere in our lives. These are some readings, and this is where we will start. Remember that SIMD processing is all about single instruction, multiple data: we're exploiting regular data-level parallelism, and we discussed the seminal paper that divided the computing spectrum into paradigms
in terms of the instructions and the data on different axes; a single instruction operating on multiple data elements falls into SIMD. And we discussed two types of pure SIMD processors, array processors and vector processors, and I mentioned last lecture that existing GPUs combine both of these types — it's really a purist distinction today; existing GPUs are both array processors and vector processors. If you get the trick question in an interview, for example: if someone asks you whether a GPU is a vector processor, you say yes; is a GPU an array processor? Also yes — except it's programmed in a completely different way, as we will see today, compared to traditional SIMD. And just to jog your memory: what's an array processor, what's a vector processor? Both have multiple processing elements and can execute multiple operations at a given time, except the array processor has much more powerful processing elements, where each processing element can do any operation. That way you can execute the same operation at the same time on all processing elements, on different data elements — different data inputs. You can see that at a given time all the processing elements are executing loads in the first cycle, adds in the next cycle, multiplies in the next cycle, and stores in the next cycle. The vector processor is more frugal, meaning the processing elements can be simple: as we discussed, you can have different, specialized processing elements — a load unit, an add unit, a multiply unit, a store unit, for example — so each unit doesn't have to do everything, and that's why it's easier to build. But here you cannot execute the same operation on multiple processing elements at the same time; you have to pipeline it over time. If you look at space, you execute one operation in the same space but different operations over time — that's how you exploit the parallelism. In real processors we combine both of them, and we said that SIMD is very powerful because it enables a single instruction to specify huge amounts of work: you're doing the same operation, specified by one instruction, on many, many elements — potentially billions of elements. Maybe you can turn off the admitting over here — I don't know why Zoom has this; I can hear myself here also. This is a bad design in software, as you can see: the software should not really interrupt the presenter. Why should the presenter be admitting some person over there? There are design principles, and one of them is broken in this piece of software, if you will. Okay, so basically you can see that vector processors are more frugal, but both of them really execute a lot of operations. What was I saying — a single operation specifies a lot of work, so you don't need to fetch and decode the same instruction many, many times, and we've seen that. But for the same reason — a single operation specifies a lot of work on large amounts of data — we need a more powerful memory system, and in the remaining six lectures we're going to talk about the memory system, and six lectures are really not enough for it, as we will see. This is really the most important part of today's machines, and a GPU has a very sophisticated memory system. We talked about the vector processor memory system: we said that you need to be able to sustain N concurrent
accesses — if you actually want N elements at a time; in an array processor you need N elements at a time if you're doing a vector load on an array processor — and we will see that GPUs have the same requirement, because they follow the same paradigm. And you remember memory banking; we're going to revisit that next week, probably, or maybe tomorrow. Okay, so this is how we combine the paradigms. You have a vector addition instruction, and this is how it's executed on a vector processor: it's executed on one addition unit, and each element of the vector is operated on in consecutive cycles — in time. But you can do both space and time in a combined array and vector processor: the first four additions, on the first four elements, are done concurrently across four different add units, and then the next four elements are pipelined, and the next four, and the next four. So this is basically an array processor whose array width is four, and also a vector processor whose vector length is specified somehow by the vector register. And we've discussed that this enables us to divide the register file such that elements 0, 4, 8, ... come from this piece of the register file, elements 1, 5, 9, ... come from this piece, dot dot dot. This way you have partitioned vector registers, as opposed to one huge monolithic register file, which makes the design easier and also clearly improves your parallelism. This is called a vector lane, and we're going to see that the GPU is going to be very similar to this — I'm just going to change the terminology a little. So these two pictures — this picture and this picture — keep them in mind; you're going to see them again, and in fact this picture you're going to see again as well. Here the terminology is vector processing, or SIMD processing: you can overlap the execution of multiple vector instructions. In this particular case you have 32 elements per vector register and eight lanes, so you have eight load units, if you will, and you can do eight loads per cycle. How do you do 32 loads? You start the remaining loads in consecutive cycles: a single load instruction does eight loads in one cycle, and since the vector length is 32, it launches three more batches of eight loads; then you have the multiply, then the add, and if you have another load it starts another 32 operations. You can see these are both pipelined and parallel, and this way you can sustain 24 operations per cycle in this particular example: if you look at a given point in time, in the steady state you have 24 operations going on, on 24 different data elements. These are all review; we've seen this. We've also seen the disadvantages — we talked about the advantages, so I'm going to focus a bit more on the disadvantages here. This works nicely if the parallelism is regular: if you have data parallelism — SIMD parallelism, as it's called — operating on very large vectors, this is great. Unfortunately, if the parallelism is irregular, you have problems: it's very hard to get efficiency out of this machine. It's still a general-purpose machine, so it can work — you can program it, you can write any program for it — but unfortunately the efficiency can be very low: your SIMD vector length may be, say, 32, but you may be utilizing only one element, so you may be running scalar code on a vector processor, and that's clearly very, very inefficient.
Right — that's why people have built special scalar units inside vector processors, so that scalar instructions don't get executed on the expensive part of the machine. I'm also giving you this quote from Josh Fisher on VLIW, and he's absolutely right in what he says here; we discussed it last time. And we discussed that if your program is nice, you can automatically vectorize it: for example, if you have this loop, which is vectorizable because the loop iterations are independent of each other, you can execute it like this on a scalar machine — we're going to see this again very soon — or you can identify that the loop iterations are independent by doing some analysis on the program, using compilation techniques called loop dependence analysis, and parallelize the iterations. You basically group multiple iterations together — say, vector-length iterations at a time — and then each instruction becomes a vector instruction: this green instruction is a vector load, this brown instruction is another vector load, this pink instruction is a vector add, this yellow instruction is a vector store, and you're doing 32 iterations in parallel using vector instructions, which is beautiful. Okay, just to summarize — we've also discussed this — this is very good at exploiting regular data parallelism, but the performance improvement is unfortunately limited by the vectorizability of the code. So I'm going to remind you of Amdahl's Law again, and then this is what we're going to cover next: many existing ISAs include SIMD operations; they don't go the full way of having a full SIMD processor inside the von Neumann machine, but they have instructions that operate using the same principles, as we will see. Before that, we should always remember Amdahl's Law — if someone says Amdahl's Law doesn't exist, you know what to say. As long as you have some sequential code — as long as your f is not zero — sorry, as long as f is not one; f is the parallelizable fraction of the program — as long as f is not 100%, you are limited in the speedup you can get. So you want to make the parallelizable fraction of the program, using algorithms or different compilation techniques, as high as possible; if it's not high, you know that you're dominated by this part of the speedup equation.
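Numerically, Amdahl's Law is speedup = 1 / ((1 - f) + f / N); the little sketch below evaluates it for a few values of f, so you can see why f = 99.9% caps the speedup at 1000 no matter how many processing elements you add.

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - f) + f / N). */
static double amdahl(double f, double n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    double fracs[] = { 0.50, 0.90, 0.999 };
    for (int i = 0; i < 3; i++) {
        double f = fracs[i];
        printf("f = %5.1f%%: speedup = %7.2f with N = 64, %8.2f as N -> infinity\n",
               100.0 * f, amdahl(f, 64.0), 1.0 / (1.0 - f));
    }
    return 0;
}
```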
We'll see this again very soon with GPUs. Any questions? This was a very quick overview of the last lecture. Now we're going to jump into how the last lecture applies to modern processors — the ones in your pockets, again. Modern von Neumann machines have single-instruction-multiple-data extension instructions: special instructions where one instruction acts on multiple pieces of data at once, as opposed to acting on two registers that each hold one piece of data, one element. You act on, say, four elements at a time; you can see an example at the bottom, from an early Intel multimedia-extension instruction. Why were these introduced? As your suggested reading describes very clearly, people wanted to accelerate graphics, multimedia and image-processing applications, which were becoming more popular in the 1990s, and they did not want to change the ISA or the execution model significantly; so they added these instructions that perform short arithmetic operations, also called packed arithmetic. If you look at this example, it's a packed add of 8-bit elements: you're adding four 8-bit numbers held in a packed register — the register itself is 32 bits, and each element you're adding is eight bits. You have two source registers, S0 and S1, and the operation is: add the bottom eight bits of each together and put the result into the bottom eight bits of the destination register S2; do the same thing for the next eight bits of each register and put that result into the destination register's next eight bits; and keep going. You can see this is a SIMD operation — a single instruction operates on multiple data elements, where each data element happens to be eight bits. The good news is that you can modify a 32-bit adder very easily: chop it up into 8-bit pieces and ignore the carries so they don't propagate into the next 8-bit chunk. That is almost zero cost in terms of additional hardware, which is beautiful. Of course, if you want a larger vector length — here the vector length is only four, if you will, and it isn't even specified, because there is no vector length: this single instruction just operates on four data elements, that's it, as we'll also see on the next slide — if you want to operate on 64 elements at a time, your registers need to become wider, at which point they look more like vector registers. And if you want to operate on 64-bit elements, you clearly cannot do it with 32-bit registers, so you're limited in what you can introduce. This was first introduced by Intel, as this paper describes; they called these the multimedia extensions initially — now they're named something else, as you will see soon. The basic idea is SIMD: one instruction operates on multiple data elements simultaneously, and in this sense it's really very similar to array processing, not vector processing — if somebody asks whether this looks like a vector-processing instruction, the answer is no, it's really array processing, because you operate on four elements at the same time. As I said, it was designed with multimedia operations in mind. Let's look at this next example: here we have a 64-bit register. In a scalar operation you can add two 64-bit values by referencing two registers — that's a scalar operation — but there are instructions in the ISA that can treat the 64-bit register as two 32-bit data elements, so that you add two 32-bit elements at a time, or as four 16-bit elements, or as eight 8-bit elements, so that you do eight additions per cycle on bytes. That's the idea, and it's beautiful. There's no vector length register: the opcode is different for each of these instructions, and the opcode determines the data type — whether you're operating on eight 8-bit bytes, four 16-bit words, two 32-bit doublewords (in Intel terminology), or one 64-bit quadword. And if you're doing a load and so on, the stride is always equal to one: these are basically consecutive elements. Okay, does this make sense? Hopefully — it's simple.
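You can even mimic that "chop up the adder and ignore the carries between chunks" idea in plain C on a 32-bit word; the sketch below adds four packed 8-bit values without letting a carry cross a byte boundary — just an illustration of the principle, not how the actual hardware instruction is implemented.

```c
#include <stdio.h>
#include <stdint.h>

/* Packed add of four 8-bit lanes inside a 32-bit word, with no carry
   propagation across lane boundaries: add the low 7 bits of each byte,
   then fix up each byte's top bit with an XOR. */
static uint32_t padd8(uint32_t x, uint32_t y) {
    uint32_t low7 = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    return low7 ^ ((x ^ y) & 0x80808080u);
}

int main(void) {
    uint32_t a = 0x10FF7F01u;        /* bytes: 0x10, 0xFF, 0x7F, 0x01 */
    uint32_t b = 0x01010101u;        /* add 1 to each byte            */
    printf("%08X\n", padd8(a, b));   /* prints 11008002: each byte wraps independently */
    return 0;
}
```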
And the operations are very similar to vector instructions. Here are the instructions; this one is called packed compare greater-than word. You chop the register into four words: these are the values of the four words in the first source register, these are the values of the four words in the second source register, and the result of the operation is this. Look at the ISA definition: it compares packed elements — in this case four 16-bit words — and the result is a mask of all ones if the comparison is true and all zeros if it's false. So you compare 51 to 73: 51 is not greater than 73, so the result is all zeros in those 16 bits; 3 is greater than 2, so the result is all ones in those 16 bits; 5 is greater than 5 — no, that's not true — so zeros; 23 is greater than 6, so all ones. What you're seeing here is that you're creating a mask. Remember the vector mask register — we had that as a special register in a SIMD processor; here we have it in a general-purpose register. The destination register holds the mask, and we're going to use it as a data element to decide which operations take effect, by programming things in a clever way. Does that sound good? Okay. You can see there are many, many operations, and these are the more advanced ones for the time — this is the 1990s. They added the packed multiply-add word-to-doubleword instruction: you have the same view of one register as four 16-bit words, and another register as four 16-bit words, and what this instruction does is multiply the corresponding 16-bit words — a0 times b0, a1 times b1, a2 times b2, a3 times b3 — and then add adjacent pairs of products: this is one addition and this is the other. Why? Because with these operations you can do a matrix-vector multiplication, as shown in the bottom part, so you can actually do matrix multiplications. This is how they describe doing it — I won't go into the details, but you use this packed multiply-and-add word-to-doubleword, and the instruction sequences are ugly, as you can see, but they make sense: you do the multiply-adds and then another packed add to reduce the sums, so that you can do the matrix-vector multiply. If you're interested, you can read the paper for more, but it's not magic — it's all doable with these instructions. Okay, let me give you another example: this is the image-compositing example they provide in this nice paper. Unfortunately they don't have colorful images in the paper, so we tried to add a little color. These are the kinds of operations they were targeting: you have an image X, say, with a blue background, and you have a background image Y — it looks like a tree in blossom over here — and the idea is to replace that blue background, with some image processing, by the blossom background, and obtain this final image. This is standard image processing — very simple operations by today's standards — but if you don't have parallel operations, it's actually slow. This is the loop you would write to do it: you go through every pixel in the image, you check whether the source pixel is blue, and if it's blue, in the new image you replace it with the pixel coming from the second image; otherwise the new image keeps the pixel of the original image. That way you end up with the person in the picture in front of a nicer background of trees instead of a blue background.
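In C, that per-pixel loop is just the select below; the names (BLUE, npixels, the three arrays) are illustrative — the point is the data-dependent branch in the loop body, which the MMX version replaces with mask operations.

```c
#define BLUE 0x1D   /* whatever 8-bit value encodes the blue background; illustrative */

/* Replace the blue background pixels of x[] with pixels from y[], one at a time. */
void composite_scalar(const unsigned char *x, const unsigned char *y,
                      unsigned char *out, int npixels) {
    for (int i = 0; i < npixels; i++) {
        if (x[i] == BLUE)
            out[i] = y[i];   /* take the blossom/background pixel */
        else
            out[i] = x[i];   /* keep the original foreground pixel */
    }
}
```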
Let me give you another example: the image compositing example they provide in this nice paper. Unfortunately the paper has no color images, so we tried to add a little color. These are the kinds of operations they were targeting. You have an image X, say, and a background image Y; image X has a blue background, and the idea is to replace that blue background, with some image processing, with a blossom background (it looks like a tree) and obtain this final image. This is standard, very simple image processing by today's standards, but without parallel operations it is slow. This is the loop you would write: you go through every pixel in the image and check whether the source pixel is blue. If it is blue, the new image gets the pixel from the second image; otherwise the new image keeps the original pixel. That way the lady ends up in front of the nicer background of trees instead of the blue background. With scalar operations this all happens one pixel at a time, but with MMX, the multimedia extensions, you can operate on eight pixels at a time, assuming each pixel is eight bits, using this nice instruction we saw earlier, packed compare equal byte. You set up a register, mm1, whose eight byte positions all hold the "blue" value (encoded some way, of course), and you load eight pixels of the source image into mm3; some of them are blue and some are not. The instruction compares each byte of one register with the corresponding byte of the other and produces a bit mask: where the two bytes are equal you get all ones in those eight bits, and where they differ you get zeros. So these two pixels are not blue and you get zeros in the mask, these two happen to be blue and you get all ones, these are not blue and you get zeros, and these two are blue and you get ones. How do you take advantage of that bit mask? Remember the goal: where the mask is one, the final pixel should come from the blossom image; where the mask is zero, the final pixel should come from the original image. That is exactly what the next instructions do. You take eight pixels of the blossom image and AND them with the mask; the result is a partial image containing the blossom pixels in the right positions and zeros everywhere else. Then you take the original image and AND-NOT it with the mask: where the mask is one the result is zero, and where the mask is zero the original pixels pass through, which is why it is an AND-NOT rather than an AND. Finally you OR the two partial results together, and by construction the final eight pixels have the blossom image wherever the original was blue and the original image wherever it was not. So the whole sequence is: load eight pixels from the original image and eight from the blossom image into two registers, do the packed compare-equal to set the mask in mm1, then a packed AND to place the blossom pixels, a packed AND-NOT to place the original pixels, and a packed OR to combine them into the final eight pixels.
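Here is a minimal sketch of that chroma-key composite written with SSE2 intrinsics (the 128-bit equivalents of PCMPEQB, PAND, PANDN, and POR), sixteen 8-bit pixels per iteration; the "blue" encoding of 0xE0 is an arbitrary assumption on my part.

```
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

void chroma_key(const uint8_t *src, const uint8_t *blossom,
                uint8_t *out, size_t n) {
    const __m128i blue = _mm_set1_epi8((char)0xE0);  // assumed blue pixel value
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(blossom + i));

        // mask = 0xFF where the source pixel is blue, 0x00 elsewhere
        __m128i mask = _mm_cmpeq_epi8(s, blue);

        // blossom pixels where the mask is set, original pixels where it is not
        __m128i from_blossom = _mm_and_si128(mask, b);
        __m128i from_src     = _mm_andnot_si128(mask, s);  // (~mask) & s

        _mm_storeu_si128((__m128i *)(out + i),
                         _mm_or_si128(from_blossom, from_src));
    }
    // A real implementation would also handle the tail when n is not a
    // multiple of 16; omitted to keep the sketch short.
}
```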
Then of course you need to iterate: if the image has a million pixels, you go through all of them, eight at a time. But now it is parallelized: before, you went through one element at a time; now you go through eight at a time, so you can hope to speed this up by close to 8x, assuming the code is perfectly vectorizable. It is perfectly vectorizable, though there is some serial code you may still need to deal with. Does that sound good? It is not magic; it is very similar to what we did earlier, except we did not deal with a vector length, because the vector length is encoded in each instruction, and we dealt with the vector mask in a very different way: the mask is just an ordinary register here, used as one of the operands of the instruction. You can read the paper; they report that MMX speeds up computation-intensive inner loops or subroutines on average by roughly two to five times, and they give examples. For instance, an MPEG-1 video decoding application at the time (today we are of course well beyond MPEG-1 in many cases) on a Pentium-class processor with MMX technology executes about 1.5 times faster than the same application on the same processor without MMX. You can also see that Intel planned to implement MMX technology on future Pentium and Intel architecture processors, and believed it would scale well with CPU operating frequency and future Intel microarchitecture generations. These folks were prudent to publish this paper right around the time they were releasing the technology. In fact, if you know the backstory at Intel, this was actually quite controversial to put out: some people said, we are changing the ISA significantly, who is going to use these operations? The more progressive people said, release it and there will be plenty of users, because this is useful in many systems; and clearly they made the right decision at the time. This is always the decision: you put something into hardware, will people use it? A lot of companies worry about that; if you have good evidence that many applications may potentially use it, go for it. And today people do not use these operations just for graphics. We do a lot of research on bioinformatics, for example: there is a lot of parallelism when you compare different genomes, and you can use these operations very nicely for such element-wise comparisons, for text comparisons, and clearly also for machine learning. These extensions have improved a lot over time; I will not go through all of it, and you do not need to know all of it, but initially you had 64-bit MMX registers, then over time Intel changed the names, to Streaming SIMD Extensions (SSE) and so on, partly for marketing purposes, and added a lot of new instructions, like shuffle operations and double-precision floating-point operations, and larger registers so you can exploit more parallelism. By now the AVX extensions are also the older version.
Now you can operate on 512-bit registers. 512 bits means that if you are operating on 8-bit values, that is 512 divided by 8, or 64 elements at a time; if you are operating on 64-bit values, it is still eight elements at a time, which is not bad. But of course, think about 512 bits: now your memory system needs to support all of that as well. More recently Intel introduced the Advanced Matrix Extensions (AMX), like everybody else has been introducing similar things: here you have two-dimensional (tile) registers and a tile matrix multiply unit, and you can read about it; basically it supports matrix multiplication. We will see that GPUs also introduce such specialized units, even though they are SIMD machines. Any questions? We are still talking about ISA extensions, not GPUs yet, but let me give you an idea of SIMD operations and their popularity: other machine learning accelerators take advantage of SIMD operations as well. We have talked about Cerebras's wafer-scale engine; let me very quickly give a brief overview just to show that it takes advantage of multiple paradigms. They have this many cores, and they map a neural network onto a whole wafer, so they have many processors to map things onto (you cannot see all the processors here). Multiple instructions operate on multiple data elements: as a multiprocessor, it is a multiple instruction multiple data (MIMD) machine, as we may see later, with different programs running on different processors. It is actually very interesting, and it incorporates some concepts that are more advanced than this course (future courses cover them): you connect many, many processors using a two-dimensional mesh interconnection fabric so that they can communicate with each other. But each individual processor looks like this: there is a floating-point multiply-and-accumulate unit, and, you guessed it, it is a SIMD unit. Each processor has some local memory, they communicate with each other, and each is a SIMD processor: four-way SIMD for 16-bit floating-point operations, so not very powerful, four floating-point operations per cycle, plus the local memory and the floating-point multiply-accumulate. The basic idea is that you map your neural network computations onto these floating-point multiply-accumulate operations, very similarly to what we did with systolic arrays, except this is a SIMD machine. Remember the picture I showed when we talked about systolic arrays: you can convert a convolution to a matrix multiplication. Essentially that is what they are doing: converting convolutions to matrix multiplications and doing them with these four-wide vector, or SIMD, operations. If you are interested, there is a lot more material out there, but I wanted to point out that SIMD operations exist in accelerators specialized for machine learning as well. Now let us jump into GPUs. How many people have programmed GPUs here? Okay, I think I asked this before; not many. Hopefully you will program them, it is fun (they need to be a bit better programmable). Basically, GPUs are SIMD engines underneath: what we have discussed applies to GPUs, and I am going to show you how GPUs are different.
The instruction pipeline operates like a SIMD pipeline, in other words like an array processor, as we will see; however, the programming model is completely different. It is not done using SIMD instructions; it is done using threads, and we will see this in a little bit. So let us go back to our parallelizable code example, which I really like. But before that, let us distinguish between the programming model, which is about software, and the execution model in hardware. These are very different things, and we have seen examples of this. You can have a von Neumann-based scalar ISA programming model, sequential code, and execute it in a data-parallel way; we converted the von Neumann program to a SIMD program in the earlier lecture. We have discussed dataflow. And there is multithreaded programming, where you break things into threads: how many people have done multithreaded programming here? Okay, I expected more, but good. In multithreading you break the program into tasks and parallelize them on multiple processors or multiple hardware thread contexts. The execution model refers to how the underlying hardware executes the code underneath; it does not have to have anything to do with the programming model. Of course, if it matches the programming model, that is better, because you can exploit the type of parallelism expressed by the programmer, but it does not have to. For example, the execution model can be out-of-order execution, as we have seen, vector processing, array processing, dataflow processing, a multiprocessor, a multithreaded processor, and so on; an execution model can be very different from the programming model. For example, the von Neumann model is implemented by out-of-order processors today: underneath, the execution model is not strictly sequential, it is dataflow, as we have discussed, except that, because you have to obey the contract, you eventually have to give results back to the software in sequential order. A GPU is somewhat similar: it implements a programming model called SPMD, single program multiple data, as we will discuss, which is a version of multiple instruction multiple data programming, but the GPU actually executes this code on a SIMD processor, single instruction multiple data. This may sound like magic, but the GPU takes advantage of the fact that there are many threads executing the same instruction at the same time and gathers them together somehow, because the programming model expresses it nicely, as we will see; and even where the programming model does not express it, the GPU can gather many threads that are executing the same instruction and put them into the pipeline together. So let us look at these examples. This is our code, very simple and vectorizable, and this is the scalar sequential code; this is how you might program it. I do not show the instructions, but you can imagine what they are, and we leave the branches out here. Let us examine three programming options to exploit the parallelism in this sequential code: sequential (single instruction, single data), data parallel (single instruction, multiple data), and multithreaded (multiple instruction, multiple data, or SPMD, as we will see). The sequential version looks essentially like this; there is a branch and so on, and hopefully you could write the sequential MIPS code for it. It can be executed on a pipelined processor and exploits some instruction-level parallelism.
We have also seen that it can be executed on an out-of-order execution processor. What an out-of-order processor does is find out that these two loads are independent, so it can execute them together; it can also find out that this add is independent of everything else here and execute it. By looking down into the instruction window, out-of-order execution can find all the independent operations; independent instructions are executed when they are ready. The big benefit of out-of-order execution in this kind of code is that the different iterations are completely independent of each other. Out-of-order execution does a lot of work, and it does not even explicitly find that out: it just finds out which instruction can be executed when. But the end effect is that it can take all of these different iterations in the instruction window and execute them in parallel on multiple functional units. So it really is taking advantage of the fact that the different iterations are parallel, and that even within an iteration there is parallelism between instructions. In other words, the loop is dynamically unrolled by the hardware: if you have heard of loop unrolling, you do not need to know it here, but essentially the hardware is unrolling the loop, iteration by iteration, dynamically. A superscalar or VLIW processor, which we have also seen, goes a step further; it is orthogonal to out-of-order execution (both of these concepts are). You can fetch multiple instructions per cycle, from the same iteration or different iterations, it does not matter. So, starting from the scalar sequential code, we have seen a lot of paradigms to accelerate it; VLIW, of course, does compile-time analysis to figure out which operations can execute in parallel, packs them together, and executes the different operations in parallel in a given cycle. Now let us take a look at SIMD, which we have also seen. We know that these iterations are independent of each other, so we can vectorize them, or create SIMD instructions: you put together a bunch of iterations, limited by the vector length, and either the programmer or the compiler generates a SIMD instruction that executes the same instruction from all iterations across different data. This is a SIMD instruction: you realize the iterations are independent, these loads can be executed in parallel, and a single vector load instruction can express all of them. The vectorized code looks like this: a vector load of A into V1 (you need a vector length specified, of course, but we assume N is large, so you can have a huge vector length if you like), then another vector load of B into V2, then a vector add of V1 and V2 with the result going into V3, and then a vector store of V3 into C. It is beautiful: what we have just done is automatic code vectorization. The compiler needs to do the hard work to make this happen, but it can do it for this easy piece of code, because there is no dependence across iterations; it is a very easy code to analyze. This code is best executed by a SIMD processor, a vector or array processor, as we have seen.
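To make the vectorized version concrete, here is the same loop written two ways: the scalar form and a hand-vectorized form using AVX intrinsics, eight single-precision elements per instruction. This is a minimal sketch; a vectorizing compiler would produce something equivalent automatically for this easy loop, and the helper names are my own.

```
#include <immintrin.h>
#include <stddef.h>

// Scalar: one element per iteration.
void add_scalar(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// SIMD: vector load, vector load, vector add, vector store -- eight
// elements per iteration, assuming n is a multiple of 8 for brevity.
void add_simd(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // VLD A -> V1
        __m256 vb = _mm256_loadu_ps(b + i);   // VLD B -> V2
        __m256 vc = _mm256_add_ps(va, vb);    // VADD V3 = V1 + V2
        _mm256_storeu_ps(c + i, vc);          // VST V3 -> C
    }
}
```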
And notice that you have only four instructions here, which is beautiful. Now, the same code can also be executed as multiple threads. Each iteration is independent, and again we take advantage of that: different iterations are executed by different threads. The programmer or compiler generates a thread to execute each iteration; each thread does the same thing, in this particular case, but on different data. You can execute this on a multiprocessor: each iteration goes to a different core, all cores execute, and of course at the end you need something to make sure everything is done. So this can be executed nicely on a multiprocessor, and if you have a million iterations, a million of them can go to a million different cores. The cores can potentially be very simple, and that is the beauty of multiprocessors. Now we are also going to exploit a special aspect of this particular model: each iteration does exactly the same thing, meaning each thread does exactly the same thing on different data elements; they have exactly the same code. This does not mean they execute exactly the same operations: there may be branches, so they may change what they execute, and we will see that this complicates things. But because each thread has the same code, this is also called the single program multiple data (SPMD) paradigm: you have a single program, you replicate it, and each copy operates on different data elements. And it can be executed on a SIMD machine, which is what a GPU is. A GPU is programmed using the single program multiple data paradigm, and it acts like a SIMD machine: whenever it realizes that these threads are executing the same load, it combines them. This is also called single instruction multiple thread (SIMT) in NVIDIA terminology, which I think is not bad terminology: a single instruction comes from multiple threads at a given time. So a GPU is a SIMD, or SIMT, machine, except it is not programmed using SIMD instructions like we did yesterday; it is programmed using threads, which everyone is familiar with (SIMD is a bit harder, if you will). But the threads need to obey this SPMD programming model: each thread executes the same code but operates on a different piece of data. This fits graphics very nicely: remember the image example, where we do the same set of operations on many, many data elements. Each thread has its own context, so it can be treated, restarted, and executed independently; that is a benefit, and it is more flexible than a SIMD machine, where it is very hard to change the vector length, for example. A set of threads executing the same instruction is dynamically grouped into what is called a warp; that is NVIDIA terminology, and wavefront is the AMD terminology. I will describe this in a little bit, but basically the hardware groups threads that are executing the same instruction, meaning they are at the same program counter. Initially this is easy, because the programming model expresses it, and it does; but you can do more. A warp is essentially a SIMD operation formed by hardware, as we will see. So let me define a warp: a warp is a set of threads
that execute the same instruction, i.e., that are at the same program counter. So this is warp 0 at PC X, assuming the program counter of this load is X: these different threads are executing the same code, they are all at program counter X, fetching the code from the same location in memory. This is warp 0 moving to the next instruction at PC X+1, then to PC X+2, and then to PC X+3. So you can treat these different threads that are executing the same operation as a warp executing that operation; we will see how that is executed soon. Basically, graphics processing units are SIMD machines whose SIMD nature is not exposed to the programmer: threads are exposed to the programmer, but underneath we do something else. So let me discuss SIMD versus SIMT a little more. SIMD is what we did last week, and also a few minutes ago: you have a sequential instruction stream of SIMD instructions, vector load, vector load, vector add, vector store, as we have seen. Each instruction specifies multiple data inputs, you need these instructions, and you need a vector length (I am ignoring the vector mask for now; we will treat that in an interesting way soon). SIMT, on the other hand, is multiple instruction streams, in other words multiple threads, of scalar instructions. What you do is really sequential scalar programming: you do not need to think about a vector length, you just say, this is the code I want to execute on a single element, and if I happen to execute this code on a million elements, that is great, I will specify the size of that block and execute the code on those million elements. I do not need to deal with the vector length, I do not need to deal with the vector mask; the GPU will do that automatically. So basically these are scalar operations, load, load, add, store, just like programming a sequential von Neumann machine, and I specify the number of threads, or actually, as we will see, the number of blocks; there is a little more to it, but it is the same idea. So what are the advantages? We changed only a few things, but this brings huge advantages, and it is why GPUs have been hugely successful whereas traditional vector and SIMD processors have not been as successful. Again, it is all about what you expose to the programmer: programmers can program in a von Neumann style, which they are used to, rather than something much more sophisticated; this is von Neumann programming extended just a little bit, and the hardware does the harder task. The two major advantages are these. First, you can treat each thread separately: you can execute each thread independently on any type of scalar pipeline, so you can actually do MIMD processing on a multiprocessor, and GPUs exploit that too. Second, you can group threads into warps flexibly, meaning you can group threads that are supposed to truly execute the same instruction, and dynamically obtain and maximize the benefits of SIMD processing. This is actually very interesting: say your hardware is 64 wide, and you find 20 threads sitting over here (you need to keep track of their state) that are at program counter X and are supposed to
execute program counter X, and another, say, 44 threads over there; you can combine them and have a 64-thread execution that maximizes SIMD parallelism, and we will see that soon. Let us start with the first advantage: how can we treat each thread separately, and what benefit do we get from it? Remember the notion of fine-grained multithreading; we are going to go back to it and do fine-grained multithreading of these threads, or rather warps, on a very simple scalar pipeline. Assume that a warp consists of 32 threads, because somebody specified that: there has to be a hardware limit, and it is not exposed to the programmer; the GPU automatically breaks a block down into warps. A warp is what the GPU can execute in parallel; it is the vector length that is not visible to the programmer. Assume we have 32,000 iterations of this code and we do one iteration per thread; that means we have 1,000 warps, each executing 32 threads, all at the same program counter at a given point in time. These warps can be interleaved on the same pipeline, so you can do fine-grained multithreading of warps; I will show you an example. This is warp 0 at PC X, the program counter of the load: iteration 0 is executed by one unit over here, iteration 1 by another unit, and so on, so you execute 32 of these loads, from 32 different iterations, concurrently. Then you may fetch warp 1 at PC X, which executes iterations 32, 33, and so on up to 63. I will show the fine-grained multithreading of these soon, but if someone tells you, for example, that warp 20 is at PC X+2, you can calculate which iterations of the scalar code it is executing: we divided 32K iterations into 1,000 warps of 32 threads each, so warp 20 covers iterations 20 times 32, which is 640, through 20 times 32 plus 31, which is 671; and since its PC is X+2, it is executing the add for those 32 iterations, after which it will move to the next instruction, the store. So let us look at the fine-grained multithreading; it is the same example. All threads in a warp are independent of each other, so they can be executed seamlessly in a fine-grained multithreaded pipeline. Initially you fetch warp 0; warp 0 executes program counter X, meaning the first 32 iterations, in its slot. Once that instruction is done, the warp is fetched again and executes program counter X+1, because its PC was incremented to the next instruction, for the same 32 iterations. So warp 0 is always executing iterations 0 through 31, but it enters the pipeline once in a while to execute its successive instructions. Does that make sense? In the steady state you can have warp 0 executing the store at PC X+3 across iterations 0 through 31 in one functional unit, while warp 1 is at PC X+2 executing the add across iterations 32 through 63 in another functional unit.
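Here is a small sketch of the bookkeeping just described: with 32 threads per warp and one iteration per thread, warp w covers iterations w*32 through w*32+31, regardless of which PC (load, add, or store) the warp is currently at. The numbers follow the example in the lecture.

```
#include <stdio.h>

#define THREADS_PER_WARP 32

int main(void) {
    int total_iterations = 32000;                         // the 32K iterations
    int num_warps = total_iterations / THREADS_PER_WARP;  // 1000 warps
    int w = 20;                                           // the warp in question
    int first = w * THREADS_PER_WARP;                     // 640
    int last  = first + THREADS_PER_WARP - 1;             // 671
    printf("%d warps total; warp %d covers iterations %d..%d\n",
           num_warps, w, first, last);
    return 0;
}
```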
You may have multiple functional units, so warp 2 may be executing PC X+1, which is a load, for iterations 64 through 95, and in the other load unit (or a pipelined load unit) warp 3 may be executing iterations 96 through 127. So different sets of iterations are being executed on different functional units. I do not even show the decode and fetch stages, but you could have, say, warp 4 being decoded and warp 5 being fetched, because these warps are all doing exactly the same thing: they flow through the pipeline nicely, and they are completely independent of each other. Hopefully this makes sense: it resembles the fine-grained multithreaded pipeline we had, except it was a scalar pipeline before and we have just extended it to a SIMD pipeline here. Remember, each pipeline stage has an instruction from a different, completely independent thread; we have done that with the scalar pipeline, and here you essentially have instructions from different warps in different stages, and you happen to have multiple functional units, many of them executing concurrently in a SIMD manner. That is the idea. We are combining multiple concepts here, and that combination is what makes GPUs so successful: you can have a very simple scalar pipeline and do no dependence checking for branches or any other instructions, because you know that the warps are independent of each other; you take advantage of fine-grained multithreading, and you have lots of threads. Remember the lecture on fine-grained multithreading. So, to put things together again: a warp is a set of threads that execute the same instruction on different data elements; this is the SIMT, single instruction multiple thread, model, and all threads essentially run the same code. Why is it called a warp? In English, when you are weaving a fabric, like this shirt, the warp is the set of threads that run lengthwise through the fabric; the analogy is that these are the threads, and each of them does the same thing, and that is why they call it a warp. Pictorially it may look like this: you have a thread warp with a common program counter; a warp consists of a hardware-limited number of threads, which is a function of the hardware you build, and each thread is scalar. You may buffer many warps, thousands of them (in modern GPUs you actually have tens of thousands of these threads buffered), and you feed them into a SIMD pipeline. Which one do you choose to feed into the pipeline? You make sure that the same warp does not go back into the pipeline until it finishes its current instruction; basically, no two instructions from the same warp should be in the pipeline at once. That is the basic principle of fine-grained multithreading, and it ensures that the pipeline stays simple: we do no branch prediction and no dependence checking.
Modern GPUs do none of that, no branch prediction, no dependence checking, assuming the code is regular and nice; if they want to execute more sophisticated code, they may need to add some of that in the future. So this is the high-level view of a GPU today. It consists of a bunch of what are called shader cores; I do not know if people really use that terminology anymore, but you may hear it, because old-style graphics did shading, for example, to manipulate the images displayed on a graphics unit. And then there is a huge, beefy memory subsystem, which we will talk about starting tomorrow and next week. If you look at a shader core, it is a very simple SIMD processor, in the sense that you have a pool of warps over here, each warp including a PC and a mask (we will talk about the mask soon). You fetch a warp, decode it, and then you have a bunch of scalar pipelines that operate in parallel in a SIMD manner. If you see four pipelines here, that means the hardware warp size is four: you can execute four copies of the same operation in different pipelines. The warp goes through the pipeline, eventually comes out, and goes back to this pool of warps that can be scheduled. So what a GPU does today is schedule these warps into the pipeline, and the pipeline remains very simple. Let me finish this and then we will take a break. What is the benefit of this warp-level fine-grained multithreading? We have already discussed warps, but basically you have one instruction per warp in the pipeline at a time, so the pipeline is very simple and there is no interlocking. This slide gives a little more detail: you have warps available for scheduling, you have a SIMD pipeline, and, although we have not talked much about the memory hierarchy, fine-grained multithreading handles memory nicely as well. If you get a cache miss, you get out of the way: there is another buffer over here where the warp waits. So here you have warps available for scheduling, here you have warps waiting for data to come back from memory, and the rest is the pipeline we have seen. What we do is interleave warp execution cycle by cycle: every cycle we fetch from a different warp, such that only one instance of a given warp is present in the pipeline at a time, and that hides the latencies. You may have a very large number of warps and, say, a ten-stage pipeline, and you keep the pipeline busy all the time; you can even make the pipeline deeper, as we have seen with vector processing, where very deep pipelines are possible. The register values of all the threads stay in the register file; we will talk about that, and it is actually partitioned, as we have seen with vector processing. Fine-grained multithreading enables two things: a simple pipeline with no dependency checking, as we have discussed (although you still need to figure out whether you need to wait for memory), and long-latency tolerance, meaning that while one warp is waiting, another warp is executing, and hopefully not all warps are waiting at once. This is why it is such a great fit for graphics: you have millions of threads operating on the same large image or video, all doing the same operations.
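Here is a hypothetical sketch of the scheduling policy described above: round-robin over the warps that are ready, skipping warps that are waiting on memory or that already have an instruction in flight. This is not any vendor's actual scheduler, just the principle of warp-level fine-grained multithreading; all names and fields are my own.

```
#include <stdbool.h>

#define NUM_WARPS 64

typedef struct {
    int  pc;            // common program counter of the warp
    bool in_pipeline;   // an instruction from this warp is already in flight
    bool waiting_mem;   // warp is parked, waiting for data from memory
} Warp;

// Returns the index of the next warp to fetch from, or -1 if none is ready.
int pick_next_warp(Warp warps[NUM_WARPS], int last_issued) {
    for (int k = 1; k <= NUM_WARPS; k++) {
        int i = (last_issued + k) % NUM_WARPS;
        if (!warps[i].in_pipeline && !warps[i].waiting_mem)
            return i;   // one instruction per warp in the pipeline at a time
    }
    return -1;          // everyone is busy or waiting -- the pipeline bubbles
}
```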
Okay, recall what happens in a GPU. This was a SIMD instruction, a vector add of A and B into C; this is a nice way of looking at a vector processor, and at a combined vector and array processor, which we have seen. What a GPU does instead is have a 32-thread warp executing the code "add a[tid], b[tid], c[tid]", where tid is the thread ID. We have converted the SIMD instruction into thread operations: each thread operates on the elements of the vector that are allocated to it, and the execution is essentially the same; you are really multiplexing different threads. That is the beauty of it: there is no difference in the underlying machine. Here it is SIMD, programmed using this add instruction; here we are programming with threads, and the thread ID immediately specifies which elements we are accessing. Now go back to the vector unit structure we have seen. In the vector processor we partitioned the vector registers and created lanes: elements 0, 4, 8, and so on are in this lane, elements 1, 5, 9, and so on in that lane. We have the same structure in a GPU execution unit: there are registers for each thread, so the threads with IDs 0, 4, 8, and so on have their registers in this lane, thread IDs 1, 5, 9, and so on have theirs in that lane, and so on; we partition the register file across threads, which is not that different. And recall the vector instruction-level parallelism picture: it is going to look very similar, except that instead of instructions like vector load, vector multiply, and vector add, we have warps, warp 0, warp 1, warp 2, warp 3, and so on, and eventually you get back to warp 0 doing its next operation. So hopefully what I have shown you at this point is that a GPU is a SIMD machine, a combination of array and vector processing, except that we do not program it that way. This is a great place to stop and take a break. We got multiple questions; let me tackle one: are the processing elements pipelined, as in vector processors? Yes. If you look at this picture of warp execution, the processing units can be deeply pipelined in GPUs, just like in vector processors. Okay, hopefully everything is clear. Let us talk about memory access, quickly, even though we touched on it: this code shows adds, but to do adds like this you first need to load the data into registers; the slide is an abstract view of the addition, but you first do a load, and whenever you do a load, each thread loads particular elements into particular locations in the register file. Thread 0 loads the element from iteration 0 over here, thread 1 loads its element over there, so where you load from depends on which thread you are. The same instruction in different threads uses the thread ID to index and access different data elements; that is the key. So assume a thread count of, say, 16, and four threads per warp:
what you are really operating on is these 16 threads, which together cover 16 data elements, and each warp looks like this: the first four threads load the data elements at locations 0, 1, 2, and 3, the next four threads load the data elements at locations 4, 5, 6, and 7, and so on. So while we are partitioning the program iterations across different threads, we are also partitioning the array that the iterations operate on across those threads, and the threads are grouped into warps; this does not change which data each thread accesses. Hopefully this makes sense. Now you should also start thinking: we have four threads per warp, each warp executes the same operations, and we have a partitioned register file, so the thread that operates on elements 0, 4, 8, 12, and so on lives in the lane that holds those registers. When you want to operate on particular data elements, you had better schedule the threads onto the lanes that hold their registers; this limits which threads you can combine into a given warp, as we will see soon. I am getting ahead of myself, but basically, when you are programming, you need to make sure you partition the data across the different threads. And, as we saw yesterday (I will not say much more about it today, except for some examples from real systems later), if you want maximum performance, memory must provide enough bandwidth. If you want to execute all four of these warps in parallel, you should really be getting 16 different data elements per cycle from memory, and that is with only one element shown per thread; if you are adding two vectors together, each thread needs two data elements, two loads, for example. So you really need a significant amount of memory bandwidth to supply the computation units: in other words, the elements-per-cycle throughput of your memory should match the throughput of your computation units. If you have, say, 32,000 add units or 32,000 load units in your GPU, what fraction of them can you actually supply with data? It is good to think about that; this is why the memory system is a huge bottleneck in these processors. Your computation bandwidth is huge, you can do a lot, but can you actually supply the data to those computation units? That becomes the problem, and it is why we will dedicate the remaining six lectures to the memory system. I should also say that in existing GPUs, warps are not exposed to programmers; I mentioned this, but let us clarify. You have CPU code and GPU kernels: sequential or modestly parallel sections run on the CPU, while massively parallel sections run on the GPU as blocks of threads. When you program, there is usually some serial code, and it usually makes sense to execute the serial code on the CPU, because executing it on the GPU is kind of a waste. In a later slide I will show an example where you can calculate the SIMD utilization.
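Here is a back-of-the-envelope sketch of the bandwidth argument above: to keep every lane busy, memory must deliver (lanes x operands per thread x bytes per element) every cycle. The numbers are made up for illustration.

```
#include <stdio.h>

int main(void) {
    int lanes               = 16;   // e.g., 4 warps of 4 threads issued per cycle
    int operands_per_thread = 2;    // two loads per add (a[i] and b[i])
    int bytes_per_element   = 4;    // 32-bit elements
    double clock_ghz        = 1.0;  // assumed clock frequency

    double bytes_per_cycle =
        (double)lanes * operands_per_thread * bytes_per_element;
    printf("memory must supply %.0f bytes/cycle (= %.1f GB/s at %.1f GHz)\n",
           bytes_per_cycle, bytes_per_cycle * clock_ghz, clock_ghz);
    return 0;
}
```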
We are not going to calculate it in detail, but for example: if you have serial code, one thread executing on one data element, and you have a warp capable of executing 64 threads in hardware, then your utilization when you execute that single thread on the GPU is 1/64, which is a very small number; your hardware is capable of executing 64 threads concurrently but you are executing only one. So it makes sense to execute the serial code on the CPU host. The serial code may be doing some very dependent computation, or it may be preparing and launching these massively parallel sections on the GPU, and that is usually how you launch things in modern GPUs, although there are different GPUs as well. I will not go into a lot of programming detail, but for example, to launch a kernel on the GPU you need to specify the number of blocks you have and the number of threads you have in each block, plus some arguments; these get executed, then some serial code runs back on the CPU, and then it launches another kernel. This GPU execution picture is very similar to the Amdahl's Law picture I showed you earlier: if you are doing this sort of computation, hopefully the parallel portions of your program dominate the execution time; otherwise you are limited by how fast you can execute the serial code, and that is Amdahl's Law, never forget it. If the timeline looks like this, you are in trouble: the serial code is taking too much time, and it does not matter how big a GPU you have, because you are spending your time in the serial code and waiting rather than computing on the GPU. That is why, when you offload code to a GPU, you try to parallelize the serial code as much as possible into parallel kernels, you try to overlap the serial code with kernel execution on the GPU, you overlap computation and communication, and so on: a lot of parallel programming techniques to minimize the impact of serial execution, which I am not going to go into, but I will refer you to some lectures if you are interested in learning GPU programming; it is a good idea. So let us look at sample GPU code and how it looks on a generalized GPU, inspired by NVIDIA if you will. It is very similar to CPU code, and that is what has made it so successful. This is the CPU code, similar to what I showed you: you are adding two vectors and putting the result in a third. On the GPU you have, say, 100,000 threads, and you just need to specify a thread ID, computed from a block index and a thread index, which I will not go into; that thread ID is what gets communicated to the GPU. If you compare this code and this code, there is not much difference: you just use a thread ID instead of a loop induction variable i, because this is programmed using threads, and each thread executes one iteration; there is no scalar loop over iterations.
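Here is a minimal CUDA sketch of the kind of code being described: the scalar CPU loop next to an SPMD kernel in which each thread computes one element, indexed by its thread ID. The kernel and variable names are my own, not the ones on the slide.

```
#include <cuda_runtime.h>

// CPU version: one loop, one element per iteration.
void vecadd_cpu(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// GPU version: the loop disappears; the thread ID plays the role of the
// induction variable i.
__global__ void vecadd_gpu(const float *a, const float *b, float *c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)                 // guard: n need not be a multiple of the block size
        c[tid] = a[tid] + b[tid];
}

void launch_vecadd(const float *d_a, const float *d_b, float *d_c, int n) {
    int threads_per_block = 256;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    // The programmer specifies blocks and threads per block; the hardware
    // splits each block into warps behind the scenes.
    vecadd_gpu<<<blocks, threads_per_block>>>(d_a, d_b, d_c, n);
}
```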
The next example is a little more sophisticated, but again, the whole point of showing it is not for you to understand exactly what is going on; it is to show you how similar the two programs are. This is the CPU program: you are doing matrix addition, adding two matrices, and it is a scalar program with two loops. And this is the GPU program: in the GPU version you basically care about the thread indices, which give you the array indices here, and then from the main code you call this addmatrix kernel, specifying the number of blocks you have and the number of threads per block. Again, I do not expect you to learn GPU programming here; the purpose is to show that these codes are very similar, whereas if you compare CPU scalar code and SIMD code, they are very different from each other; you do not program SIMD this way. If you are interested in GPU programming you can watch a lot of lectures; in fact we have a heterogeneous systems course every semester that goes into a whole lot of detail about how to program a GPU, how to make it work for different algorithms, and how to get high performance. Very interesting stuff, but not the subject of this course. So, I mentioned blocks: how do you get from blocks to warps? A GPU core is essentially a SIMD pipeline; in a modern GPU it is called a streaming processor, another fancy term you may hear, and a modern GPU has many such processors, grouped into what is called a streaming multiprocessor. I will break down an old and a new GPU soon, but it looks like this: this is one of NVIDIA's architectures, and you can see what they call a streaming multiprocessor. There is an instruction cache, there are multiple warp schedulers, there is a huge register file, and these are the streaming processors; when you see these, they are SIMD units, SIMD pipelines, and a SIMD pipeline means a bunch of scalar pipelines operating in a SIMD manner, as we showed earlier. You can also see that there are lots of load/store units, some special function units, and a cache hierarchy, so you can see the parallelism; I will show you more numbers later. Basically, when the programmer specifies the blocks, these blocks are divided into warps and sent to the SIMD/SIMT units; the width of such a unit determines the warp, so you may have 32 threads in a warp, for example. Blocks are specified by the programmer: they are blocks of code that do essentially the same thing using different threads, and each block is divided into a bunch of warps (block 0 has some warps, block 1 has some warps, block 2 has some warps, and so on), which get scheduled onto the streaming multiprocessors. Again, I do not expect you to know exactly what blocks are, but you should really know how warps are executed; there is also block scheduling, which we are not going to cover here. So, without going into further detail and complication, let us talk about the advantages and disadvantages of warp-based SIMD versus traditional SIMD, and then I will show you a particular advantage that we cannot get with traditional SIMD. Traditional SIMD really contains a single thread: it is sequential instruction execution in the end, with lock-step operation within each SIMD instruction.
We have discussed this: the programming model is SIMD, and there are no extra threads. What does this mean? When you do a vector load, you need to finish every single element of that vector load before you can move to the next instruction. Yes, you can pipeline instructions, yes, you can do data forwarding, and in fact you can even execute the instructions out of order, combining the concept of SIMD with out-of-order execution; it would be quite a heavy, hardware-costly processor, but you could do it. Traditional SIMD, however, does not do that. The programming model is SIMD, there are no extra threads, software needs to know the vector length, as we discussed earlier, and the instruction set architecture needs to change significantly: you need vector and SIMD instructions. Warp-based SIMD, on the other hand, alleviates a lot of these issues. Instead of a single thread, you have multiple scalar threads executing in a SIMD manner, the same instruction executed by all threads, but it does not have to be lock-step: each thread can be treated individually, and in fact you can take one thread and place it in a different warp; this is not visible to the programmer. The programming model is not SIMD. SIMD is very strict: "I am doing a vector load on these 64 elements," done; you cannot operate individually on each element unless you do masking. Here you can actually move the threads around, as we will see soon. Software does not need to know the vector length, and this enables multithreading and flexible dynamic grouping of threads; also, the ISA is scalar, so SIMD operations can be formed dynamically, and you do not need special vector instructions. Essentially, as we discussed earlier, this is the SPMD programming model implemented on SIMD hardware. So let us go back to SPMD: single procedure, or single program, multiple data. This is really a programming model rather than a computer organization, as we discussed. Each processing element executes the same procedure, except on different data elements, and the procedures can synchronize at certain points in the program, at barriers for example, which we are not going to go into, though you have seen the concept when we talked about dataflow. Essentially, multiple instruction streams execute the same program. So what is the difference between those instruction streams? Each program, or procedure, works on different data, and it can execute a different control-flow path at run time. This is where it gets interesting. I have given you very simple examples, but if a thread branches on a condition, it tests a data element and takes one path based on it, while another thread in the same warp may take another path, because it is testing a different data element: you might test whether the data element is greater than zero, and one thread finds that true while another finds it false. Now they diverge, meaning they are no longer at the same PC, and you have a problem: you grouped these threads together because they were operating at the same PC, but when they execute a branch instruction they diverge. So what do you do? You mask some of them out, meaning you may still treat them as executing the same PC, but some threads should not execute that operation, based on something very similar to the vector mask we have seen.
That is the idea, but you can also do something even more sophisticated, as we will see. Clearly, when you mask threads out, you waste execution slots. Say you have a warp of 64 threads that can execute concurrently, and only one of them is supposed to execute the instruction while the remaining 63 are not, because that is how the data values came out: you write "if a[tid] is greater than zero," and only one thread evaluates it to true. You are basically wasting those huge execution units: only one of the threads does something useful and the other 63 do something useless, so your SIMD utilization is 1 over 64. We will calculate that again; I am getting ahead of myself, but this point about control flow is really important, and we will see it pictorially soon. To pull back: many scientific applications are actually programmed this way and run on MIMD hardware. This SPMD programming model was quite popular even before graphics: people programmed large-scale parallel operations this way, and if they did not have a vector processor, if they could not afford one, they ran the programs on multiprocessors, on MIMD hardware. Modern GPUs enabled these programs to run on SIMD hardware and be more efficient. What is the difference? If you execute the same code across many different processors, you have overhead: you fetch the same instruction, decode it, and execute it on every processor, so you repeat the same fetch and decode work across many processors. In a SIMD processor you do not do that: you fetch and decode only one instruction and enable many operations. That is the difference between MIMD and SIMD hardware here: executing this programming model on MIMD multiprocessors is much more expensive in terms of energy, and potentially performance, whereas on SIMD hardware it is cheaper, because you save area as well as energy. I have shown you this example before, so I will not go into it again, but let us talk about the second part: you can group threads into warps flexibly in a GPU; in other words, you can group threads that are supposed to truly execute the same instruction, dynamically, and this way you can dynamically obtain and maximize the benefits of SIMD processing. So let us look at a more complicated example. Each thread can have conditional control-flow instructions, as I foreshadowed, so threads can execute different control-flow paths. In this more complicated code, each thread has the same code, so the programming model is still SPMD, each thread doing the same thing on different data elements; but thread 1 may happen to take this path, thread 2 may take that path, thread 3 some other path, and thread 4 a completely different path. You grouped them together initially; they all have a common PC. So how do they execute? There is a problem, because threads execute different control-flow paths, and some of them are not supposed to execute the instructions they are sitting at. In a SIMD processor we handled this through masking, if you remember; we are going to handle it in a similar way here, except it is purely hardware-based. And we already discussed that the GPU uses a SIMD pipeline to save area on control logic by grouping scalar threads into warps.
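Here is a hedged sketch of the kind of data-dependent branch that causes divergence inside a warp; the kernel is hypothetical, and the point is only that threads of the same warp may take different paths depending on the data element they test.

```
__global__ void divergent_kernel(const float *a, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    if (a[tid] > 0.0f) {
        // Path B: only the threads whose element is positive execute this;
        // the rest of the warp is masked off for these instructions.
        out[tid] = a[tid] * 2.0f;
    } else {
        // Path C: now the roles flip -- the other threads execute while the
        // first group is masked off. The warp reconverges after the if/else.
        out[tid] = 0.0f;
    }
}
```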
This is a more simplified version of the code, if you will. The problem is called the branch divergence problem: when threads inside a warp branch to different execution paths, you have divergence inside the same warp, across different threads. So let us assume we have eight threads in a warp (I hope this is eight). Initially they are at the same program counter, which is good: they are all supposed to execute this block, and they all execute the branch instruction. Once you go past the branch, some of them take this path; what happens to the other threads? They are masked out. When you execute the branch instruction, you also create a mask saying that the threads going to this PC, the ones that are actually supposed to execute this program counter, have a mask bit of one. This mask is created by the hardware and stored together with the warp. It is effectively a different warp now: you have the same PC but also a mask, so what identifies a warp is no longer just a program counter, it is also the mask associated with it. That is one path; path B has another program counter and another mask. Eventually there is some logic in the GPU that figures out that these paths merge, and at that point they form a single warp again: the PC is the same and the mask is all ones. I will not go into the details of how they merge, but there is hardware that does this, and in fact hardware that does more. This is basically what we saw earlier: conditional execution, predicated execution, masked execution. We did something very similar in vector processors, except it was not flexible: the programmer needed to handle it using vector masks. Here the programmer does not need to handle it; the hardware handles it. Now, what else can we do here? Look at this picture: it shows a small number of threads, eight. Imagine that you actually have millions of threads, millions of warps. What a GPU can do is examine which threads are at which program counter and merge different threads to populate the warps as much as possible, ideally so that the masks are all ones; and existing GPUs do that. If you have many threads, you can find individual threads that are at the same PC (you need some hardware support to do that) and group them together into a single warp dynamically. This reduces divergence; in other words, it improves SIMD utilization. So let us go back to SIMD utilization. In this cycle the SIMD utilization is 100%: all threads are executing this operation. Here it is 100%, here it is 50%, here it is 50% again, and here it is back to 100%. We want 100% all the time, ideally, and if you have many threads you can get closer to 100% by grouping threads together. SIMD utilization is essentially the fraction of SIMD lanes executing a useful operation, that is, executing an active thread: an active thread is a thread whose mask bit is one, one that is really supposed to execute that program counter. This is the idea of dynamic warp formation, or merging: dynamically merge threads executing the same instruction, in other words at the same PC, for example after branch divergence.
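Here is a small sketch of the SIMD utilization bookkeeping: per cycle, utilization is the number of active lanes (mask bits set) divided by the warp width, and the percentages quoted above come straight out of this ratio. The particular mask sequence is made up to mirror the eight-thread divergence example.

```
#include <stdio.h>
#include <stdint.h>

#define WARP_SIZE 8   // eight threads per warp, as in the example

// Count the active lanes in a warp's mask for one issued instruction.
static int popcount8(uint8_t mask) {
    int n = 0;
    for (int i = 0; i < WARP_SIZE; i++) n += (mask >> i) & 1;
    return n;
}

int main(void) {
    // Masks of the instructions issued over a few cycles: all-active before
    // the branch, then the two diverged paths, then reconverged.
    uint8_t masks[] = {0xFF, 0xFF, 0x0F, 0xF0, 0xFF};
    int cycles = (int)(sizeof(masks) / sizeof(masks[0]));

    int active = 0;
    for (int c = 0; c < cycles; c++) {
        printf("cycle %d: %d/%d lanes active\n", c, popcount8(masks[c]), WARP_SIZE);
        active += popcount8(masks[c]);
    }
    printf("overall SIMD utilization: %.0f%%\n",
           100.0 * active / (cycles * WARP_SIZE));
    return 0;
}
```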
Essentially, you form new warps from warps that are waiting — remember, you have a lot of warps that are waiting, and you can examine them — and enough threads branching to each path enables the creation of full new warps. A full warp means a warp with 100% SIMD utilization: all of its threads are active. So let's take a look. These warps, warp X and warp Y, are waiting; they're not in the pipeline, and these are their masks: these threads are supposed to execute this instruction in warp Y, and these threads are supposed to execute it in warp X. The GPU figures out that they can be combined into one warp, so it executes only that one warp, and that's nice because its SIMD utilization is quite high — higher than either of them. That's the idea. Similarly you can combine these two. Unfortunately, here you cannot eliminate warp Y, because these two threads in the two different warps happen to be in the same lane. Remember, the register file is partitioned across lanes; you cannot move this thread over here, and you cannot move that one over there either, because these are the lanes: the position of a thread in the warp determines which vector lane, which SIMD lane, it goes to. As a result, if you moved it over here, it would not access the correct register. Hopefully that makes sense — I was going to ask you a question about this later that I have basically just answered. But if you have more warps you can do better: here you can combine three warps into two, and if you have a million warps you may be in luck and can combine many of them together. That's the beauty of this execution model. So this is the idea of dynamic warp formation. Let's take a look at this example: you get to the branch that we saw earlier and you get to this path, but then there's another warp that is supposed to execute the same path. These two warps are executing the same path, so you can combine them like this. In this case you didn't save much, but let's look at a more complicated example, relatively quickly since we don't have a lot of time. This is an example that I will let you study on your own. We have two warps, warp X and warp Y, each with four threads, and these are the masks. In the baseline they execute like this — we're interleaving the two warps, as you can see — and the SIMD utilization drops wherever the masks are not all ones. You can go through this on your own; it's beautiful. Eventually the SIMD utilization goes back up because the masks become all ones again: you have divergence and then you re-converge at the end. This is the baseline: if you don't actively try to combine the warps, this is how much time it takes to execute. But now you can create a new warp from scalar threads of both warp X and warp Y executing in some basic block — in basic block D, for example, it will look like this, assuming you can combine them in time. So in dynamic warp formation, what you do is combine the warps that are at B. You need to do it before feeding them into the pipeline, of course, because once you execute, you figure out what the next instruction is and what your mask is, and then you can do this combining before you feed the warp into the pipeline again.
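Here is a tiny sketch of the merge rule just described, in my own hypothetical code rather than anything from an actual GPU: two waiting warps at the same PC can be fused only if no SIMD lane is active in both, because each thread is pinned to its lane's slice of the register file.

```cuda
#include <cstdint>
#include <cstdio>

// Illustrative sketch (hypothetical helper names): warps at the same PC can be
// merged only when their active masks do not overlap in any lane, because a
// thread cannot be moved to a different lane's register file partition.
bool can_merge(uint32_t mask_x, uint32_t mask_y) {
    return (mask_x & mask_y) == 0;   // no lane is active in both warps
}

uint32_t merged_mask(uint32_t mask_x, uint32_t mask_y) {
    return mask_x | mask_y;          // active mask of the combined warp
}

int main() {
    uint32_t x = 0x0F;               // lanes 0-3 active
    uint32_t y = 0xF0;               // lanes 4-7 active: disjoint, so mergeable
    uint32_t z = 0x1F;               // lanes 0-4 active: conflicts with x in lanes 0-3
    printf("x+y mergeable: %d (mask 0x%02X)\n", can_merge(x, y), merged_mask(x, y));
    printf("x+z mergeable: %d\n", can_merge(x, z));
    return 0;
}
```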
Here you can see that B is combined nicely, but unfortunately one of the warps cannot be combined because its threads conflict in a lane; as a result this is a poor warp with low utilization. These two can be combined, though: instead of having two warps with 25% utilization each, you have one warp with 50% utilization. The two Ds can be combined, as you can see, then the Es can be combined, although you don't save a warp there; the Fs can be combined and you do save a warp, and the rest is at 100% anyway. So basically you save cycles — first of all, you save time to execute the same operations on the same number of threads — and you also gain efficiency: you're not wasting functional units. As I said, you cannot group threads arbitrarily: if two threads need the same lane of the register file, you cannot put them together in the same warp, so flexibility is limited. Can you move any thread flexibly to any lane? The answer is no: you can move a thread to a lane only if there's no other thread in that particular lane in the given warp. So there are hardware constraints. Okay, analyzing GPUs is fun — it's so much fun that you can do it in your exam, and you probably will, sometime in August. I will not go through this because I'll let you have the fun, but we almost always have this sort of question: once you understand how this idea of divergence works, figuring out which instructions are executed and which are not executed in a thread is relatively easy. Okay, this is just foreshadowing. Let me see how much time we have — we don't have a lot, so I'll go through this relatively quickly, and you're not responsible for it, but we've actually been working on GPUs since a long time ago, when GPU microarchitecture was really important to look at. Branch divergence is very interesting, but there are other issues: long-latency operations are also a problem in GPUs. Even though you may have many threads, you need to schedule them appropriately; that's why the warp scheduler is really important in a GPU. This is work that we did with NVIDIA, and the idea is used in a lot of NVIDIA GPUs today, I believe all of them. The issue is that once you get to a long-latency instruction — and this foreshadows the next lectures as well — all warps may need to wait, and if all warps arrive at the same long-latency instruction at the same time, your whole machine waits, and that is a problem. The single program multiple data model we've discussed is actually very prone to this: everybody executes, eventually everybody gets to the same load, and they all wait. So if you do the scheduling such that all warps reach the same instruction at the same time, that is not good; you have to wait a long time. One of the ideas in this work is to do the scheduling at a smaller granularity: schedule some warps, and then schedule the other warps, such that they reach the long-latency instruction in a distributed way as opposed to all at the same time. That's the idea. Again, I don't want to go into the details — there are many interesting ideas here — but it saves a lot of cycles. So internally the GPU does a lot of interesting things; that's my whole point. Internally the GPU schedules warps at a smaller granularity. I'm going to skip over this; if you're interested, you can look at it.
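As a rough illustration of that smaller-granularity scheduling idea — this is my own simplified sketch, not the actual mechanism from the paper or the hardware — warps can be split into fetch groups, and the scheduler keeps issuing from one group until it stalls on memory, so different groups reach long-latency loads at different times and can hide each other's latency.

```cuda
#include <cstdio>
#include <vector>

// Simplified sketch of scheduling warps at a smaller granularity (illustrative
// only): warps are split into fetch groups, and we issue from the first group
// that still has a ready warp, so groups reach long-latency loads at different
// times instead of all stalling together.
struct Warp {
    int id;
    bool stalled_on_memory;
};

int pick_next_warp(const std::vector<Warp>& warps, int group_size) {
    int n = static_cast<int>(warps.size());
    for (int g = 0; g * group_size < n; ++g) {        // walk the fetch groups in order
        for (int i = 0; i < group_size; ++i) {
            int idx = g * group_size + i;
            if (idx < n && !warps[idx].stalled_on_memory)
                return warps[idx].id;                 // first ready warp in this group
        }
    }
    return -1;                                        // every warp is waiting on memory
}

int main() {
    // Group 0 (warps 0 and 1) has already stalled on loads; group 1 is still ready.
    std::vector<Warp> warps = { {0, true}, {1, true}, {2, false}, {3, false} };
    printf("next warp to issue: %d\n", pick_next_warp(warps, 2));  // prints 2
    return 0;
}
```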
But briefly, this shows that by having a large warp and breaking it into sub-warps, you can find a lot more parallelism internally, so that you can keep those sub-warps filled and increase SIMD utilization. Again, I don't have time to talk about it because I want to talk about some real GPUs relatively quickly; this is just to foreshadow that there are a lot of interesting things going on in GPU systems. Now let me give you an example GPU at a high level, very quickly, since you have a version of something like this in your cell phones — well, this is an old one, so what you have is actually much more powerful. This is an example GPU; over time they become beefier and beefier. This GeForce GTX 285 had 240 stream processors — I've already defined stream processors — essentially 30 streaming multiprocessors and eight SIMD functional units per core. That's what they had, and it's small by today's standards, but even though it's small, it looks big. You can take a look at this core, which has eight SIMD functional units; you can see that that requires a lot of registers, actually 64 kilobytes of registers, and then there's the memory hierarchy, as we will discuss starting tomorrow. You can see the SIMD functional unit: it can do multiply-add and multiply, etc. Essentially their warp size was 32 threads: they were able to group 32 threads to share an instruction stream — each group is a warp, as we've discussed — and they were able to interleave up to 32 warps in a fine-grained multithreaded manner. So you have 32 threads in a warp and then you have 32 warps; your pipeline is much smaller than that — it could be five or six stages, for example; they don't disclose it — so you have a lot more warps than you can feed into the pipeline stages. Up to 1,024 thread contexts can be stored per core; that's why you have a huge amount of storage in the register file, and it employs everything we've discussed: partitioned register file, different lanes, etc. And this is what happens if you scale up to 30 cores: 30 times that is 30,720 threads. That's a lot, as you can see, but it's nothing compared to what I'll show you soon. So this is the evolution. This is what we saw in 2009, the GTX 285, and these are the number of functional units and the gigaflops — I'm going to update the slide later on — and essentially it's been increasing; you can see that people have been building beefier and beefier GPUs. This is the V100, which is over here: they increased the stream processors to 5,120, essentially 80 cores and 64 SIMD functional units per core. Each core is more powerful in terms of SIMD and you have more cores, and they also added these tensor cores for machine learning, which I will briefly talk about; this is essentially very similar to Advanced Matrix Extensions, for example — a specialized unit for matrix multiplication. The block diagram looks bigger, as you can see. Later we're going to see the memory system, which is getting bigger and bigger also, and you can see that the teraflops — tera floating-point operations per second — is increasing significantly. The reason they added these specialized tensor cores for matrix multiplication is that they cannot get enough computation bandwidth using the more general-purpose cores; this is again specialization versus general purpose. So now let's look at these tensor cores.
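Just to sanity-check those counts, here is a small back-of-the-envelope sketch; the figures are the slide-level numbers quoted above, plugged into my own illustrative code.

```cuda
#include <cstdio>

// Back-of-the-envelope check of the numbers quoted above (approximate,
// slide-level figures).
int main() {
    // GeForce GTX 285
    int sms = 30, simd_fus_per_sm = 8, warps_per_sm = 32, warp_size = 32;
    printf("GTX 285 stream processors: %d\n", sms * simd_fus_per_sm);          // 240
    printf("GTX 285 thread contexts:   %d\n", sms * warps_per_sm * warp_size); // 30720

    // Tesla V100
    int v100_sms = 80, v100_fus_per_sm = 64;
    printf("V100 stream processors:    %d\n", v100_sms * v100_fus_per_sm);     // 5120
    return 0;
}
```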
The tensor cores look like this — this picture is a little bit old, but basically you can do matrix-matrix multiplication on floating-point values, etc., nothing you haven't seen. This is the tensor core microarchitecture; again, I don't expect you to know all of it, but you can see that even a GPU is not enough for specialized computation, because it's general purpose: if you want to do specialized computation, you add a specialized functional unit, and here it's called a tensor core. Interestingly, in each tensor core they have further SIMD units — 16 SIMD units per tensor core — and in each cycle they can do a 4x4 matrix multiply-and-accumulate; you can read more about it. There are some differences from conventional SIMD: they can share registers inside the warp, inside the tensor core, which makes programming easier and also helps efficiency, but I'm not going to talk about that here. There are other TPUs as well; basically everyone is building these specialized accelerators for machine learning, and you know the reasons — I'm not going to go into them here, but you can learn more in research lectures. The Google TPUs are systolic-array based, so that's a different approach: if you look at the tensor core, it's a very different approach from the TPU. Okay, let me show you where we're heading. This is the A100 — still not the state of the art — and everything has increased even more. You can see that new data types are being added, tensor data types and floating-point data types, and then support for sparsity, because in machine learning you operate on many matrices with zero values, so they have mechanisms to avoid operating on zeros. That's one way of accelerating machine learning: ignore zeros as much as possible, because some of these computations have a lot of them. This looks much bigger, as you can see, and the numbers have all increased: it was 125 teraflops, now it's 312, for example, for deep learning, and the tensor cores are beefier and the GPU itself is beefier. And this, I believe, is the state of the art right now, the H100 — I stand to be corrected, of course. You can see that the teraflops is much higher over here — this is general-purpose teraflops, not the machine-learning teraflops — and there are many more functional units, so they keep increasing. The machine-learning throughput is actually getting to the next prefix — is that petaflops or exaflops? I don't remember; you can figure it out. You can see that people keep adding more data types. Why? They figured out that operating on 16-bit or 32-bit floating-point data is expensive, and machine learning inference, and sometimes training, can be done on smaller data types so that we don't waste computation. So they've added FP8 floating-point units, for example. They play a lot of tricks; you can see that people are trying to make this accelerator much more efficient. And this is an interesting combination: even if you look at a GPU, it's not a pure SIMD machine anymore, it's a heterogeneous architecture — it has SIMD and it has these tensor core accelerators, if you will. I'm going to end with this food for thought: GPUs and systolic arrays are both used for machine learning acceleration right now, but even GPUs employ SIMD together with tensor cores, and there are many questions over here that are not answered yet.
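To give a feel for how these tensor cores are exposed to programmers, here is a minimal sketch using CUDA's warp-level matrix-multiply (WMMA) API; the 16x16x16 tile shape, FP16 inputs with FP32 accumulation, and the assumption of an sm_70-or-newer GPU are illustrative choices on my part, not specifics from the lecture.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp cooperatively computes D = A * B + C on a single 16x16x16 tile
// using the tensor cores: FP16 inputs, FP32 accumulation (sm_70 or newer).
__global__ void wmma_tile(const half *a, const half *b, const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                        // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // tensor-core MMA

    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

Launched with a single 32-thread warp, e.g. `wmma_tile<<<1, 32>>>(dA, dB, dC, dD)`, the whole warp cooperates on this one tile, which matches the point above that tensor cores are programmed at warp granularity with registers shared across the warp.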
So it's good to think about the tradeoffs, and again I will mention that if you're interested in these things, you may actually be interested in taking the future courses, because these are things we cannot cover in one course. But also, if you're really interested, I would recommend looking at other courses as well. Let me conclude here.