So VLIW took a bit longer than expected; let's see if we can finish systolic arrays today. But now you know the concept and you know its impact. I actually believe that VLIW is a concept that needs to be re-examined, especially within the context of these new machine learning models: I believe that with the use of sophisticated machine learning models you can extract more parallelism through the compiler, but that remains to be seen, of course. So some of these concepts need to be re-examined over decades, in my opinion; that's why they are so fundamental. Systolic arrays are another concept that I'm actually really proud of having been teaching from before they were really popular. When I started teaching I said I have to talk about systolic arrays: they are fundamental, they're extremely important, and nothing implemented them at the time. Today a lot of machine learning accelerators implement this. Why? Because it makes sense: it fits the model; it has a principled execution model that fits the execution of machine learning workloads. That's the beauty of computer architecture, in my opinion, and that's why we really need to study these concepts regardless of whether or not they're implemented out there. In a sense, as a purist, I don't really care whether they're implemented today, right? The question is really: does it make sense, and what kind of trade-offs does it provide? And that may actually enable someone to really implement it down the road.

Okay, but before I go into systolic arrays, I will advertise our Bachelor's seminar in computer architecture, if you're really excited about these topics and if you enjoy them. This is a rigorous seminar on fundamental and cutting-edge topics in computer architecture (I know you have to take seminar courses) where people present papers, we critically review them, and we discuss them. As I mentioned, in this course also we focus on critical thinking and brainstorming. This is not an easy seminar course, so if you want to take a seminar course and just pass it, this is not that course, because this one will make you do a little bit more work, and you can learn about that from people who've taken it before. But I think if you enjoy these concepts, this is something you may want to consider, and you can see some of the papers covered. I don't have time to go over them, but some of the neural network papers that we're going to mention today have been covered in the seminar in the past.

Okay, I will also mention that if you're interested in doing research (and I'm a big proponent of Bachelor's students doing research; I was that Bachelor's student at my university, the University of Michigan), please email me with your interest. Don't be shy; this is how you get into research, really. You can CC some important TAs as well, let's say. I would also suggest taking the seminar course and the computer architecture course that comes after this one, doing readings and assignments on your own, and talking with us. Basically it all boils down to these three red words: talk with us, or email us. That's the way things get started; that's how I got started in research, basically. And there are many exciting projects and research positions, and it's not just me, I should say: you can email any professor, in my mind. Of course, some professors work more closely with Bachelor's students, but everybody needs to do a Bachelor's thesis in the end. In my group we focus a lot on novel computing systems.
As you can see, it's everything about computing systems: new execution paradigms like in-memory computing, hardware security, safety, reliability, predictability, everything that I mentioned in the first lecture. So I'm not going to repeat these in detail right now; maybe in a later lecture we can discuss them more openly. But there are a lot of interesting things here, basically, and if you go to the thesis website that we have (it's unfortunately not very well updated; it's very hard to keep it updated), you can look at the topics over there, figure out what excites you, and then contact us, and you can learn more about us. This is my publications page at the end, which actually has a lot of information as well, and you can see a lot of the people actually teaching this course. Okay, I've shown you these slides before; you can also learn more about our research online; this is a short lecture that talks about our research. And at some point some of you may decide that you actually want to do a PhD or a postdoc, and we actually have a lot of PhDs and postdocs, and also Bachelor's and Master's students, that we're proud of (more PhDs and postdocs than fit on this list), and they won a lot of awards during their PhDs or postdocs, as you can see. Okay, you can join us, but the key is contacting us. You can also apply here, but that is mostly for people outside ETH; since you're at ETH, you kind of have direct or easier access, a privilege, if you will, to join more easily without going through a heavier application process. Taking these courses is actually a big part of that application process, if you will. Okay, so if you have questions about this, please email us; I don't want to take questions right now, maybe in a later lecture, because systolic arrays are quite important.

Okay, so let's talk about systolic arrays. What are these? Well, if you want to know what these are, I'd say this is required reading, but as I said there's no strictly required reading; it is a very nice reading by H. T. Kung, who is one of the inventors of systolic arrays. There's also a recommended reading from Google that I will mention, and next week we're going to talk about GPUs, so you may want to start looking at that. But systolic arrays, basically: I've shown you this slide before; we've been talking about general-purpose systems, CPUs, for a long time, and we've actually looked at the internals of at least the CPU over here a lot. This is not just a CPU; it's a heterogeneous system-on-chip: you can see that there's a GPU, there's a neural network accelerator, there's a video processor, etc., which they don't talk about. But on the other hand, there are special-purpose systems at the opposite end, and systolic arrays actually fall at that opposite end. So we're going to look at that opposite end today; tomorrow and next week we're going to look at this end; FPGAs you've been looking at for a while; and in the future we may look at this if we have time. But basically, general-purpose systems can execute any program, and they're easy to program and use; unfortunately, they don't provide the best performance and efficiency on any given application. You can always come up with a special-purpose system for a given application: these are extremely efficient and high performance for the application domains they're suitable for; unfortunately, they're not easy to program or use, and by definition they cater to a limited set of programs; not all programs can be executed on these. FPGAs and GPUs are somewhere in between.
FPGAs are actually quite flexible, but programming them is extremely hard. GPUs have become more general purpose over time, but they are still not as general purpose as CPUs.

Okay, so let's start with systolic arrays. Systolic arrays are actually an execution model; it's not just an accelerator, it's really an execution model, and we've seen multiple execution models: the von Neumann model, the dataflow model, and then more recently VLIW, which is a variant of the von Neumann model. Systolic arrays implement systolic computation, which will become clearer when we actually look at what it looks like. Matrix computation, for example, can be done as a systolic computation: you basically feed the data into a set of functional units, the data flows through those functional units, and you get the output; you can think of it that way. And this is different from von Neumann and different from dataflow, although it has characteristics that are somewhat similar to dataflow, but not exactly like the dataflow we have seen. Okay, so initially these were designed as special-purpose accelerators to accelerate specific applications. What are these? Convolutions, filtering, pattern matching, special-purpose matrix and vector computations in image and vision processing, signal processing, pattern recognition. This was the reason people designed these things: they were specialized computations in important applications. And they're currently heavily used in another application, machine learning, which happens to be the important application of the day, so this leads to specialized machine learning accelerators. That said, not all machine learning accelerators are systolic arrays, and not all systolic arrays are machine learning accelerators, right? There happens to be a Venn diagram where these two sets intersect, of course, and it's an important intersection. And their general execution model can be generalized (their bigger execution model can be, and is, generalized, as we will see later in this lecture), but they're not as general purpose as the von Neumann architecture.

Okay, so motivation: why were these designed? Basically, the folks who designed systolic arrays wanted to design an accelerator with three properties. One is a simple, regular design: you want to keep the number of unique parts small and the design regular. Another is high concurrency, and hence high performance; that's what you want from an accelerator. And high efficiency, hopefully, although at the time these were designed, power and energy were not that important; we're talking about the 1970s and 1980s. Power and energy became much bigger concerns in the late 1990s and 2000s, and clearly today they are among the most important concerns. And also, we want balanced computation and I/O, or memory bandwidth: basically, whenever you bring some data, you need some balance between the processing and the bringing of the data; data movement should not be the bottleneck. So they were actually thinking ahead about the memory bottleneck, if you will. So the idea is very simple: replace a single processing element with a regular array of processing elements, and carefully orchestrate the flow of data between the processing elements (this is really the idea of systolic computation, if you will), such that they collectively transform a piece of input data before outputting it to memory. We'll see examples of this, so if this is too high level, you'll see an example from convolution soon. And the huge benefit is this: you maximize the computation done on a single piece of data. You bring a piece of data from memory and you feed it into an array of processing elements.
The first processing element does something and gives the result to the next one; that one does something and gives it to the next one; and so on. So you have this array, and this could also be a matrix, a two-dimensional thing, maybe even a three-dimensional thing, actually. Okay, so that's the big benefit. Let's take a look at this pictorially, from the paper I recommended earlier. Here you have memory and a single processing element: you bring one element, do something with it, and then write the data back to memory; you can think of this as the von Neumann style of processing element, although here we're going to be much more specialized, of course. Whereas here, you bring the data to a processing element, which does some computation on it and gives the resulting data to the next processing element, which does some other computation and gives it to the next one, and you keep doing this until you finish whatever computation you want and write the data back to memory. For example, each processing element may take a piece of an image, and the element you bring from memory may be something you multiply with, or filter the image with: if you're looking for, I don't know, the blue parts of an image, each processing element may be doing some computation on a different piece of the image, or maybe looking for different things in the image. You may pass what you're looking for as the element from memory, or you may pass the pieces of the image from memory; it really depends on what you're doing. But basically, this is the idea: you bring one piece of data, you operate on it in each processing element, the processing elements communicate their outputs with each other, and eventually you write the data back to memory.
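Just to make that contrast concrete, here is a tiny sketch in Python. The stage functions are made up purely for illustration, and this is only a toy model of the idea, not anyone's actual design.

```python
# Toy model of the idea above: a single element that touches memory for every
# operation, versus a chain of processing elements where one piece of data is
# fetched once, transformed by every element in the chain, and only the final
# result is written back. The stage functions are made up for illustration.

def single_pe(memory, idx, op):
    value = memory[idx]        # fetch from memory
    memory[idx] = op(value)    # one operation, then write back right away

def pe_chain(memory, idx, stages):
    value = memory[idx]        # fetch once
    for stage in stages:       # data flows through every PE in the chain
        value = stage(value)
    memory[idx] = value        # write back only the final result

mem = [3.0]
pe_chain(mem, 0, [lambda v: v * 2, lambda v: v + 1, lambda v: v ** 2])
print(mem[0])   # ((3*2)+1)^2 = 49.0
```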
Okay, so why is this called systolic? Because people thought it looked like our circulatory system, how blood flows. You have memory, and that looks like the heart; the data is the blood, let's say; and the processing elements are the cells. Essentially, memory pulses the data through the processing elements. Now, this is important: it pulses the data. There is no instruction that says "I want the data"; the processing element doesn't demand the data; memory pulses the data, it sends the data. So we're not going to have instructions, initially, to ask for the data; you just keep pumping the data. If all you're doing is, let's say, detecting a pattern on different parts of an image, you pass the pattern across many, many different parts of the image that are distributed across these processing elements, and then you do something on those. I'm going to give you a better example, which is really the convolution example. But how many people here know about convolutions? Okay, not a lot of people. How many people do not know? Okay; when you take a machine learning course, you have to learn about it, of course. But why a systolic architecture? Essentially, the idea is that data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory, similar to blood flow: the heart pumps blood to many cells, and eventually the, let's say, transformed blood goes back to the heart to be cleaned up. At some point the analogy breaks down, of course: different cells process the blood, and many veins operate simultaneously. And this can be multi-dimensional, just like a systolic array: it doesn't need to be one-dimensional, it can be multi-dimensional. Okay, it's a beautiful paradigm, in my opinion.

So why is this good? Because special-purpose architectures need what we discussed earlier: a simple and regular design, high concurrency, and balanced computation and I/O. So the basic principle is to replace a single processing element with a regular array of processing elements, and carefully orchestrate the flow of data between the processing elements to balance computation and memory bandwidth. We're going to look carefully at this orchestration of the flow of data; you'll also see it in your homework. You need to input the data at the right times, at regular intervals, to make sure that the computation done in a PE matches the data. If you're doing matrix multiplication, for example, you need a way of inputting the rows and columns of a matrix at the appropriate times, in the appropriate clock cycles, if you will. And if you do this, you don't need any instructions: you basically send the data into this matrix or vector of processing elements, and after some time you read the output; that's it. Now, this is a very different model from what we have seen so far, because there are no instructions here, if you will. I mean, at a high level you can build instructions so that you can integrate with software, but then one instruction can actually initiate an entire matrix multiply, for example.

Okay, so how does this differ from pipelining? You may think, okay, this looks like pipelining. And yes, at a high level it is pipelining, but it's a different type of pipelining, if you will. The pipelining we have seen pipelined parts of instruction processing; these are complete processing elements, as we will see; they can be very specialized for a function; they're not executing part of an instruction. So that's a major difference. The other difference is that the array structure may be nonlinear and multi-dimensional, as we'll see, depending on the computation you specialize for, and the processing element connections can be multi-directional and of different speeds; we will see some example processing elements connected in different ways. In pipelining you don't do that, because pipelining is very specialized for instruction processing, and you break instruction processing into different stages. And also, in the generalization of this model of systolic arrays, each processing element can have local memory and can execute kernels, rather than what is done in pipelining, which is a piece of an instruction.

So let me give you an example systolic computation: convolution. Convolution is mathematically defined as you will see in the bottom part of the slide, and it is heavily used in many operations: filtering of images, or filtering in general, pattern matching, correlation analysis, polynomial evaluation, etc.; a lot of image processing tasks, and currently machine learning. In earlier times, image processing was the reason systolic arrays were developed, but now machine learning happens to use convolution too; in fact, modern machine learning has up to hundreds of convolution layers in convolutional neural networks. So this is the convolution operation: given a sequence of weights and an input sequence, you compute the result sequence as defined by this equation; basically, you multiply the weights by the inputs in a particular, shifted way.
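If I remember Kung's formulation correctly, the equation being referenced defines y_i = w_1 x_i + w_2 x_{i+1} + ... + w_k x_{i+k-1}. Here is a minimal sketch of that definition in Python, with made-up numbers; take the slide's exact indexing as authoritative, not this.

```python
# Minimal sketch of the convolution being defined (my recollection of Kung's
# formulation): y_i = w_1*x_i + w_2*x_(i+1) + ... + w_k*x_(i+k-1).
# The math is 1-indexed; the code is 0-indexed. Example values are made up.

def convolve(w, x):
    k, n = len(w), len(x)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(n - k + 1)]

w = [1, 2, 3]            # weights w1..w3
x = [1, 0, 2, 1, 3]      # inputs  x1..x5
print(convolve(w, x))    # [7, 7, 13], i.e. y1..y3
```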
Okay, again, it's not that important exactly how it's done; it will be defined properly when you get exposed to it, but it is very useful. So let me give you more motivation in terms of how it's used in machine learning today. This is a particular convolutional neural network which was very popular in earlier times; it was used initially for handwritten digit recognition, it was developed by Yann LeCun and his students, and you can actually find videos of him showing it. Basically, you have this input image, which could be of some size, and then you use convolutions, applying some filtering essentially, to go to the next step. A neural network is a multi-layer network: in each layer you apply some operations to the inputs; in this case the input is the image, and you try to recognize the image; in this case we're trying to recognize that there's an "A". So you subsample the image, you look at different parts of the image and perform some convolution, this mathematical operation, to detect what's going on. You can think of it as a sophisticated pattern recognition system; that's what's happening, essentially. In this particular example, the image is a 1024 x 8-bit input; if you really wanted to enumerate what's going on for every possible input, you would need a truth table with 2^8K entries, which is a lot. Instead, you have a multi-layer network that tries to piece together what might be going on; and even if you had that truth table, it wouldn't mean you have the full picture in the end, because you would still need to map the output of the truth table to something. But basically, this is the LeNet network, and you do a lot of convolutions; that's my takeaway here. My goal is not to teach you how this network works; you're going to learn that in a machine learning course. But it's a multi-layer network: if you think about the perceptron that we've seen, that's a single-layer network; it does some operations in a single layer and gives you a binary output. Here the output is, I believe, 10 bits; I don't remember, so don't quote me on that number. But basically you do some operations and try to make sense out of pieces of the image, and then you do some subsampling, more convolution, more subsampling, more convolution, some other operations, to really try to piece together what's going on, and eventually the output is a classification, that this is an "A", maybe with some confidence.

Okay, so here is another, maybe easier-to-look-at example. This is your image, you can think of it that way; it's the input feature map, or you could think of it as one layer, and the output layer, within a network. Essentially, this is a two-dimensional image, 5x5 as you can see, and the kernel is what you convolve it with, 3x3. Essentially you apply this kernel, the gray one, to all elements, all pieces, of the input, and you form an output: there's a mathematical function where you apply this 3x3 filter to every single location over here, the output is something, and then you do something to that output, and you keep doing this, eventually, to get the classification. Hopefully that gives you the idea, and this is really a fancier way of looking at it, as you can see: what you're applying is, I think in this case, a dot product, but don't quote me on that; or you may be taking the matrix product, multiplying two matrices, as you can see.
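To make the 5x5-input, 3x3-kernel picture concrete, here is a small sketch; the numbers are invented, and I'm assuming the usual no-padding, stride-1 convention, which may not match the slide exactly.

```python
# Small sketch of the 2D example above: slide the 3x3 kernel over every 3x3
# patch of the 5x5 input and take the element-wise multiply-and-sum (a dot
# product). Assumes no padding and stride 1, so the output feature map is 3x3.
# All numbers are made up.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            out[r][c] = sum(image[r + i][c + j] * kernel[i][j]
                            for i in range(kh) for j in range(kw))
    return out

image  = [[1, 0, 2, 1, 0],
          [0, 1, 0, 2, 1],
          [2, 0, 1, 0, 2],
          [1, 2, 0, 1, 0],
          [0, 1, 2, 0, 1]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]
for row in conv2d(image, kernel):
    print(row)    # the 3x3 output feature map
```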
Okay, if you want to learn more about this, there's a lot over here, but I'm not going to spend more time on it; you can learn more about neural networks elsewhere. But essentially, what you're doing is matrix-matrix multiplication, if you will, or vector-vector multiplication, depending on how you actually pose the problem. So you basically divide your image into input features, or divide a layer in the neural network into input features; it could be these three different input features. And then you have some convolution filters that were designed to try to figure out what this is; that's the inference path. You basically trained your network beforehand such that you decided these weights of the convolutional filters: these are the weights you apply at a given point in your network, and they're trained to recognize some particular images, for example. That's how you come up with them; there's a training process that's done separately. Assuming the training process gave you these convolution filters, this is your input: you apply the convolution filters to pieces of your input, you get an output feature, and then in the next layer you apply another convolutional filter, potentially. And you can basically express these as matrix-vector multiplications, or convolutions: these are the convolution filters over here, and these are the input features looked at in a different way, and what we're doing is multiplying this vector with the input features. So basically this is matrix-vector multiplication: convolution can be implemented as matrix-vector multiplication, and it can be implemented using systolic arrays as well, as we will see. That's why people are doing this on GPUs as well as on systolic arrays today. GPUs: hold on, we'll see.

But I will also mention the importance of people doing research. What enabled these neural networks to actually take off, in part, were some courses that were taught; for example, in 2010 some architecture courses were taught at the University of Toronto that talked about these massively parallel processors. And if you know the name Geoff Hinton, he's a Turing Award winner for neural networks; essentially he developed a lot of the early neural networks in the 1980s. Some of his students took that course, and they basically said, oh, there are these GPUs that we can use to train these neural networks much more efficiently, and to do inference on these networks, and they were able to do that after the course. And they won the competition; the ImageNet competition is an image recognition competition, and this is 2012 we're talking about. This is the paper they wrote at the end, and they won the competition with significantly higher accuracy than the state of the art; you can read the paper. And there are many, many other papers that were written to make convolutional neural networks better: the earlier networks had eight layers, then they moved to 22 layers over here, as you can see, and they were more accurate, and later this moved to even more layers. So basically, that was their paper; it was not at the level of human accuracy, since their error rate was still 16% in terms of image recognition, and later works surpassed human-level accuracy, as you can see. Human-level accuracy is supposed to be 5.1% error; that's not my number, that's somebody else's number, and it depends on the human, probably, also.
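Coming back to the point that convolution can be expressed as a matrix-vector multiplication: here is one standard way to do that lowering, often called im2col. This is a generic sketch, not necessarily the exact formulation on the slides; it uses the same made-up image and kernel as the previous snippet and produces the same output map.

```python
# Convolution as a matrix-vector product (the common "im2col" lowering; a
# generic sketch, not necessarily the formulation on the slides). Each 3x3
# input patch becomes one row of a matrix, the flattened kernel becomes a
# vector, and the product gives the same values as the direct 2D convolution.

image  = [[1, 0, 2, 1, 0],
          [0, 1, 0, 2, 1],
          [2, 0, 1, 0, 2],
          [1, 2, 0, 1, 0],
          [0, 1, 2, 0, 1]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

def im2col(img, kh, kw):
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[img[r + i][c + j] for i in range(kh) for j in range(kw)]
            for r in range(oh) for c in range(ow)]   # (oh*ow) x (kh*kw) matrix

def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

kvec = [k for krow in kernel for k in krow]          # flatten the 3x3 kernel
flat = matvec(im2col(image, 3, 3), kvec)             # one output per patch
print([flat[i * 3:(i + 1) * 3] for i in range(3)])   # same 3x3 map as before
```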
But basically, you can see that the first convolutional neural network is here, and modern neural networks have many, many layers; of course, there are other types of machine learning as well. So this is one of the motivations for why people are designing specialized accelerators: there's a lot of specialized computation, so let's design specialized accelerators for it. You can read more about neural networks; you can see that there are a lot of convolution layers, but there are other layers also, and some of those other layers are also accelerated in specialized accelerators today.

Okay, now let's take a look at convolution, because I'm going to use it to describe systolic arrays today. It's also used in machine learning, but in the past it was used for other tasks, and it's still used for other tasks. This is the equation, and this is a simple systolic array that actually calculates that equation. Now let's take a look at what this does. It's a linear array of processing elements, and it's described over here; this is one particular design. You can see that the W's, the weights, stay in each processing element; each processing element is exactly the same: this one, this one, this one. The W's are the weights they store; these weights don't move; they're programmed in there somehow, and that's the programming effort. You feed the partial results you want to calculate, y1, y2, y3, in from one end, and you feed the input vector x1, x2, x3, x4, x5 in from the other end, flowing in opposite directions, and this thing calculates y1, y2, y3 for you as they flow through. This sounds like magic, right? But it's not magic; it's very simple, because each building block looks like this. It's a very specialized computation engine, and it's specified over here: basically, every cycle it computes y_out equals y_in plus W times x_in. So you take y_in, you multiply x_in with the W that is stored over here, you add the two, and y_out gets that value. It's a specialized multiply-and-add; it's not an instruction, it's just what the hardware does: you design the multiplier and the adder and put them together. You just need to feed the inputs nicely, such that the data flow enables you to get the right outputs. And also, at the same time, x_out is equal to x_in: every cycle, x just gets transferred to x_out.

Now, if you know that this is what each element does, you will hopefully convince yourself that you can connect them this way and supply the inputs and outputs appropriately so that you get the values specified by these equations. You can do this yourself; I'm going to do it very quickly. Basically, because of the design, you need to supply the inputs and the outputs every other cycle; that's how it's designed, unfortunately. You can actually fix that problem, and if you read the paper, it fixes that problem; but basically you generate one output every other cycle, if you will. So in the first cycle, you supply things such that x1 reaches here at the time y1 reaches here: y1 reaches processing element one, let's call it that, at the time x1 reaches it. In this cycle there's nothing here and nothing here, so this processing element is doing nothing useful, if you will, and there needs to be some control logic to make sure it doesn't do anything harmful.
x2 is supplied, in the same cycle, to this processing unit, except it's not doing anything either, because y has not reached over here yet. So basically, every other cycle you're supplying x's, and every other cycle you're supplying y's. Let's take a look at what happens in this cycle, what happens to this output. We know that y_out equals y1 plus W1 times x1, and y1 comes in as zero, so this output will get W1 times x1: you have calculated this part of the equation. And then x1 goes here, but it's not connected anywhere, so x1 kind of dies over here; and that's good, if you look at these equations, because x1 is not used anywhere else in this particular convolution. Okay, so let's look at the end of the cycle: y1 is here, x2 is here; x2 moved over here. Good, because this processing element didn't do anything in the prior cycle while this one was doing something over here. So at the end of the cycle y1 is here, x2 is here, and what is y1? It's W1 times x1. Now this processing element will take y1, which is W1 times x1, and set this output to W1 times x1, which came from here, plus W2 times this input, x2. And then it passes it over here, and by the time this gets over here, in the next cycle, x3 will reach here, and in that cycle you will add W3 times x3 over here. Does that make sense? So this is basically calculating that. And while this is happening, y2 is actually also coming, and you can look at those calculations yourself. Just to tell you what's happening: when y2 comes here, x2 will come here, and this output will be W1 times x2; and similarly, when y3 comes here, x3 will be here, and this output will be W1 times x3. So, by supplying the data in a nice, clean fashion, this simple array of processing elements is calculating convolutions for you: no need for expensive CPUs, expensive GPUs, or whatever; you just supply the data nicely. That's the beauty of a specialized architecture: this is what it does, and if you ask it to do something else it will say, sorry, I cannot do that. Well, it won't even say that; you'll need to figure that out yourself.

Okay, so there is a lot of optimization you can do on top of this, but I'm not going to talk about it in detail; you've now seen the concept. Basically, this is systolic computation: you keep feeding the y's, and the inputs are used many, many times, as you can see, and you don't do a load. If you look at a general-purpose CPU, you would need to load y1, load the x's many times, load the w's many times, do the calculation with instructions, and write the result back. This eliminates all of that: the specialized array structure does all of that for you, and you can optimize further. So the key is that one needs to carefully orchestrate when the data elements are input to the array and when the output is buffered; this is really the key. Well, of course you need to design the elements, but once you've designed the elements it's really about data orchestration, and this gets more involved especially when the array dimensionality increases or the processing elements are less predictable in terms of latency, so people don't try to do that; these are very, very predictable elements. We will see that in matrix multiplication, which is used in Google TPUs, for example; they have a very similar structure to what I'm going to show.
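Here is a cycle-by-cycle software model of the linear convolution array just described, in Python. The per-cell behavior (y_out = y_in + w * x_in, x_out = x_in, inputs fed every other cycle) is taken from the description above; the exact orientation and feed schedule are my reconstruction of the figure, so treat those timing details as assumptions rather than as the slide itself.

```python
# Cycle-level software model of the 1D systolic convolution array described
# above. Each cell stores one weight and implements exactly
#     y_out = y_in + w * x_in      and      x_out = x_in
# every cycle. Partial results y flow through the cells in one direction,
# inputs x flow in the opposite direction, and both streams are fed every
# other cycle so that each y meets exactly the x values it needs. The feed
# schedule below is my reconstruction of the figure, not a quote from it.

W = [1, 2, 3]                  # w1, w2, w3, one per cell (they never move)
X = [1, 0, 2, 1, 3]            # x1..x5
P = len(W)                     # number of cells
NUM_Y = len(X) - P + 1         # number of outputs (y1..y3)

def x_feed(t):                 # x_j enters the x end of the array at cycle 2*(j-1)
    j = t // 2
    return X[j] if t % 2 == 0 and j < len(X) else None

def y_feed(t):                 # y_i enters the y end, initialized to 0, at cycle 2*i
    i = t // 2
    return 0 if t % 2 == 0 and 1 <= i <= NUM_Y else None

x_prev = [None] * P            # x_out of each cell at the end of the last cycle
y_prev = [None] * P            # y_out of each cell at the end of the last cycle
results = []

for t in range(2 * len(X) + 2):
    x_out, y_out = [None] * P, [None] * P
    for p in range(P):
        x_in = x_feed(t) if p == P - 1 else x_prev[p + 1]   # x moves toward cell 0
        y_in = y_feed(t) if p == 0 else y_prev[p - 1]       # y moves toward cell P-1
        x_out[p] = x_in                                     # x_out = x_in
        if x_in is not None and y_in is not None:
            y_out[p] = y_in + W[p] * x_in                   # y_out = y_in + w * x_in
        else:
            y_out[p] = y_in                                 # idle cell, pass y through
    if y_out[P - 1] is not None:
        results.append(y_out[P - 1])                        # a finished y leaves the array
    x_prev, y_prev = x_out, y_out

print(results)   # [7, 7, 13]: the same y1, y2, y3 as the direct convolution
```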
So let's take a look at the structure; let's take a look at how you can multiply two 3x3 matrices and keep the final result in the processing elements' accumulators. This is my basic processing element: there's an R, and R is the accumulated value where each of these outputs, c00 and so on, will be stored. These are the two matrices being multiplied, these are the results, and R is going to accumulate and eventually store these results; so this is another way of doing things, where I update inside the accumulator. Every cycle it calculates this: P gets M, meaning you pass the left input to the right every cycle; Q gets N, meaning you pass the top input to the bottom; and the computation is R equals R plus M times N. So you're really multiplying two values and accumulating them in R, which is essentially what matrix multiplication, a dot product, is. Now, if you want to do a 3x3 matrix-matrix multiplication, you have 3x3 processing elements, the result will be stored in each of these, and this is how you supply, every cycle, the different elements of each matrix: matrix A's first row goes this way, left to right, and matrix B's first column goes this way, top to bottom. The first element that's computed is a00 times b00, as you can see, and that gets accumulated; in the next cycle you add a01 times b10; and in the next cycle you add a02 times b20. And that's exactly what you want: you want the dot product of this row with this column, and that is c00, right? Now, this is beautiful: while this is happening, of course, this neighboring element only gets its first real input in the next cycle; in the first cycle you send zero over here, to make sure it doesn't do anything until it receives its first element. The same is true over here: this processing element receives zero in the first cycle, so it should not do anything; no harm. Similarly, this element over here receives its first element two cycles later, so it should not do anything for two cycles. That's how you orchestrate the data inputs, essentially. And you can go through the calculations yourself, but in the second cycle, to do a column calculation, what this element will compute is a10 times b00, which is eventually going to get you c10. You can convince yourself that this works. In your exam you may have a question that asks you to input some things, or you may see variants of these questions in your homework: for example, we can give you the computation and ask what a processing element does, or we can tell you what the element does and ask how you feed the data into the systolic array. There are many interesting variations; in the end, as long as you know how the data needs to flow and how a systolic array operates, it's very easy.

Okay, so I'll give you more examples. This is essentially what Google has built: they have built a matrix-vector unit, or matrix unit as they call it, and we're going to take a look at that; it's very powerful in their systems. But systolic arrays can be two-dimensional, as we discussed, and they can also look like this, or like this.
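And here is the same kind of software model for the 3x3 matrix multiply just walked through. The per-cell behavior (R accumulates R + M*N, M is passed to the right, N is passed downward, and the rows of A and columns of B are fed skewed by one cycle, with zeros before and after) follows the description above; the concrete orientation and cycle counts are my reconstruction.

```python
# Cycle-level software model of the 3x3 output-stationary systolic matrix
# multiply described above. Cell (i, j) keeps an accumulator R and, every
# cycle, computes R += M * N, passes M to the right and N downward. Row i of
# A is fed from the left delayed by i cycles; column j of B is fed from the
# top delayed by j cycles; zeros are fed before and after, so cells that have
# not received real data yet do no harm. Orientation details are assumptions.

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
N = 3

R = [[0] * N for _ in range(N)]       # accumulator inside each cell
h = [[0] * N for _ in range(N)]       # value each cell passed to the right last cycle
v = [[0] * N for _ in range(N)]       # value each cell passed downward last cycle

for t in range(3 * N):                # enough cycles for all inputs to drain through
    new_h = [[0] * N for _ in range(N)]
    new_v = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            k = t - i - j             # index of the element reaching cell (i, j) now
            m = (A[i][k] if 0 <= k < N else 0) if j == 0 else h[i][j - 1]
            n = (B[k][j] if 0 <= k < N else 0) if i == 0 else v[i - 1][j]
            R[i][j] += m * n          # R += M * N
            new_h[i][j] = m           # M flows right
            new_v[i][j] = n           # N flows down
    h, v = new_h, new_v

expected = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]
print(R == expected)                  # True: each cell holds C[i][j] of A x B
print(R)
```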
My favorite example is from the earlier paper, one of my favorite examples: you can chain together these systolic arrays to form powerful specialized systems. For example, this chained systolic array is capable of producing, on the fly, the least-squares fit to all the data that has arrived up to any given moment; it's a real-time processing system, from the early 1980s. So what does this systolic array do? You can see the carefully orchestrated data inputs. This systolic array does orthogonal triangularization, which is one way of helping with least-squares estimation (there are a lot of matrix operations; matrix operations are not just matrix-vector multiplication). Then you input the data into this weird-looking systolic array, which solves triangular linear systems, and you need to buffer the data carefully so that it arrives at the right time at the right processing element. So it's all about orchestrating the data.

Okay, so I've given you the concept. The huge advantages: this is a very principled design; it efficiently makes use of limited memory bandwidth; it balances computation with I/O bandwidth availability, and that's one of its biggest benefits. You basically chain lots of elements and orchestrate the data flow so that, hopefully, you don't need to access memory. You can implement systolic arrays on FPGAs as well; FPGAs are a substrate on which to implement things, but that's a subject for another time. They're very specialized, as you can see: the computation should fit the processing element organization and functions. And specialization buys you a lot of things: efficiency, simple design, high concurrency, and high performance; and it's good to do more with less memory bandwidth requirement. The big downside is the same thing: they're specialized. It's not that they're principled; being principled is not a downside. But being specialized is a downside, because they're not generally applicable: the computation needs to fit the processing element functions and organization, and if you want to make them generally applicable, they become more complicated.

Okay, so on to more programmability. People have tried to make this more generally applicable. Each processing element in a systolic array can actually store multiple weights, and the weights can be selected on the fly using multiplexers; this eases the implementation of more sophisticated techniques. If you just want to implement convolution one particular way, okay, you can do it with the technique we talked about earlier; but if you want to do adaptive filtering, adaptive sampling, etc., now you need to decide which weight you're going to use depending on what you're doing. So this enables more programmability, but it comes at more cost. Now, if you take this further, it enables the general systolic computation concept: each processing element can have its own data and instruction memory, the data memory can store partial and temporary results or constants, and this leads to stream processing, or pipeline parallelism; more generally, staged execution. So let me give you the idea. You can have a loop that looks like this: code in stage A, stage B, stage C, and in each iteration you execute A, B, C. These could be quite specialized computations; I'm not telling you what they are, but one of them may be a matrix multiplication, fine. Basically, if you keep doing this many, many times, then instead of executing it on a single processing element like this, you may actually have three different processing elements, each specialized for a different type of computation, A(i), B(i), C(i), and feed the data to them.
And this is a simple loop we're talking about, a general-purpose programming model. So you can actually transform a general-purpose program into a more systolic computation, and this is called staged execution. You can read papers about it; we don't have time to talk about it here, but basically you can divide a loop into stages, or code segments, called stages. This is essentially what I showed you: A, B, C, and these could be very specialized. These different stages can execute on different cores, and the cores can be specialized for the computations that execute on them. Okay, so I've kind of generalized the model here: this is not what the people who developed systolic arrays envisioned in the 1970s and 80s, but today this is how some things are being done. For example, the video processing hardware that Google built has some systolic components to it. File compression is another example: you may be compressing lots of files, or videos, for example, and you want to build a specialized compression engine, so you may divide your processing into stages, where each stage may be very specialized hardware, and then you operate on different data flowing through these stages. Okay, so I've talked about a lot of things, and you need to think about this a little bit more, of course, but this is really the high-level systolic computation idea. It's not exactly what was envisioned in the 1980s, but those folks also clearly didn't envision machine learning accelerators.
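Here is a minimal sketch of that staged (pipeline-parallel) execution idea, using Python threads and queues to stand in for specialized processing elements. The stage functions A, B, and C are invented placeholders; real staged-execution hardware looks nothing like this internally, but the data flow through the stages is the same.

```python
# Minimal sketch of staged (pipeline-parallel) execution of a loop. Each stage
# runs on its own "processing element" (a thread here) and data streams from
# one stage to the next through queues. The stage functions are placeholders.

import threading
import queue

def A(x): return x * 2            # stage A of the loop body (made up)
def B(x): return x + 1            # stage B of the loop body (made up)
def C(x): return x * x            # stage C of the loop body (made up)

STOP = object()                   # end-of-stream marker

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is STOP:
            q_out.put(STOP)       # forward the marker and shut down
            return
        q_out.put(fn(item))       # transform the item and pass it downstream

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
workers = [threading.Thread(target=stage, args=(f, qi, qo))
           for f, qi, qo in ((A, q0, q1), (B, q1, q2), (C, q2, q3))]
for w in workers:
    w.start()

for i in range(5):                # the loop: each iteration streams through A, B, C
    q0.put(i)
q0.put(STOP)

results = []
while True:
    r = q3.get()
    if r is STOP:
        break
    results.append(r)
for w in workers:
    w.join()
print(results)                    # [1, 9, 25, 49, 81] == [C(B(A(i))) for i in range(5)]
```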
So let's talk about advantages and disadvantages, and I'm going to talk about how this is employed in the field today. Basically, the huge advantages: they are special purpose, so you get high efficiency; you make multiple use of each data item, so you reduce the need for fetching and re-fetching and make better use of memory bandwidth; high concurrency; and a regular design, so it's easier to implement. Disadvantages: it's not good for irregular parallelism; forget about irregular code on this, essentially. It's special purpose, so it's not generally applicable, and it needs software and programmer support to become more general purpose. It also needs software and programmer support just to make it work, because you need to feed the data somehow, and there needs to be some system-level support to feed the data correctly. And it's difficult to program if the problem doesn't fit well.

These were actually implemented at CMU: H. T. Kung, who developed the paradigm, actually implemented these, and I'll give you a little bit more of the history. They basically used this to accelerate vision and robotics tasks, and they had a linear array of 10 cells at the time; each cell was very simple, as you can see, about 10 MFLOPS, and it was attached to a general-purpose host machine, just like an accelerator, just like machine learning accelerators are attached today. And they had a high-level language and an optimizing compiler to program the systolic array, which is interesting. This is what the system looked like: it's basically a systolic array for image processing, or vision processing, or robotics, which involves a lot of vision processing as well, and I'm not going to go into it. But this is a modern systolic array, and this is the paper from Google that basically says, we have built a systolic array; essentially it does what I showed you earlier, 2D matrix-matrix multiplication, and you can see the buffers that buffer the rows and columns of the matrices, and there are some additional structures over here to do additional operations. So basically, rows and columns go into this matrix unit. If we had time, we would actually go through this in more detail; we don't have time, so I'm going to flash some pictures over here. You can see that Google has improved this, but at the core these are all systolic arrays, if you look at them, and you can see that they actually use it for many, many things: reinforcement learning, recommender systems, natural language processing, computer vision; one exaflop per board, which is amazing, I think. And I'll finish with some readings: this is one of the readings that they actually published, where they talk about this publicly, and this is a reading that's coming up soon; we have actually worked on this, and I will motivate it later when we talk about memory. Tomorrow... I should also say that machine learning acceleration is not just about systolic arrays: tomorrow we're going to see SIMD concepts, and these machines that are built for machine learning acceleration use the SIMD concept, and these use FPGAs. So basically, all of these paradigms can be used to accelerate different things, and tomorrow we will see the SIMD paradigm. So have a good evening; I'll see you tomorrow.