Hello, can you hear me? Is the livestream running? Okay, I can hear myself — let's go. Hello everyone, I'm Geraldo, and today we are going to talk more about your favorite topic, which is in-memory computing. Onur has already covered a lot on this subject, so today I'm going to try to give a deeper dive into some of the topics he already covered, and I'm also going to announce and show what you need to do for Lab 3, which is going to be online later today — the handout will be posted — so you're going to have a lot of fun messing around with in-memory computing.

A really brief self-introduction (oops, I cannot click — okay): I'm Geraldo, a PhD student here in the SAFARI research group since 2018; I've been working with Onur since 2017, and I'm about to graduate, hopefully — I have dreams of that. I do research mostly on processing-in-memory, so if you are interested in those subjects, or anything I'm going to talk about today, feel free to send me an email; I eventually get back to you — we are all Onur's students, so it takes time.

Just a recap of what we are going to talk about here. The problem we are trying to solve is that data movement is a bottleneck in today's systems. I'm going to try to be quite fast here, so I won't drag this out, but data movement is a bottleneck in systems for a myriad of reasons. For example, your application might not have enough data locality to make use of the deep cache hierarchies or scratchpad memories in your CPU or accelerator; your memory device or storage might not give enough memory bandwidth for your application to maintain high throughput; or maybe the round-trip latency to main memory or to storage, when you sum all of those latencies, is too high, and then you cannot hide this data movement bottleneck.

Even though this is not a new problem, nowadays data movement bottlenecks are becoming key in system design, because we are getting a push from applications — and applications are where the money is, so we design following them. For example, applications like neural networks, or Transformers in this case, have been growing exponentially over time in their data usage, and we expect that to keep increasing over the years; the trend is just going up and it is not showing signs of slowing down anytime soon. We also showed in the lecture this work from Google that we did, showing that the problem is not only in servers or big-scale systems: even in your phone or your laptop, these data movement bottlenecks contribute significantly to the total energy consumption of your system or your accelerator — a Google TPU accelerator, for example, in this case.

The main problem behind these bottlenecks, both in terms of performance and in terms of energy, is that we have historically been designing systems in a processor-centric way, and this has led us to design really efficient processors. A processor doing a double-precision computation on-chip takes some picojoules of energy, but if you need to go off-chip — to your main memory or to your storage — that is going to consume two to three orders of magnitude more energy than doing the computation itself. This creates an imbalance in how we have built
the systems we use today. We have been trying to mitigate these data movement bottlenecks with many, many different solutions, and we are still trying: every single generation of your iPhone has new, fancier hardware prefetchers that try to predict, before you access the data, which data you will need and bring it earlier to the CPU cores; we have deeper and deeper cache hierarchies with larger and larger caches over time. So the trend of designing computer architectures that try to patch this data movement bottleneck problem is still there, and it is not going to go away — but it does not fundamentally solve the problem, because the fact remains that we are doing computation far away from where the data resides.

One solution to mitigate this problem is, instead of thinking in a processor-centric way, to think in a more memory-centric way: designing memory-centric architectures where we move compute resources close to the data itself. One example of such memory-centric architectures, which was mentioned in the lectures earlier this semester, is what we call processing-in-memory. Processing-in-memory hardware has, compared to its processor-centric counterpart, the benefits of access to larger memory bandwidth and abundant parallelism, and at the same time shorter memory access latency. So if your application is bottlenecked by any one of those components, a processing-in-memory architecture can probably help mitigate those bottlenecks.

We also talked about the taxonomy of processing-in-memory architectures, and we broke them down into two subgroups. The first was processing-near-memory architectures. Here we think of those architectures as similar to a regular von Neumann architecture, where you have a distinction between logic and memory; but differently from a processor-centric approach, in a processing-near-memory architecture we take this logic and move it closer and closer to the memory array itself. The closer the logic is to the memory array, the larger the memory bandwidth and the shorter the memory access latency that that particular logic can take advantage of. The second approach we mentioned in the lecture was processing-using-memory architectures, and here the approach is completely different from what we are used to in von Neumann architectures, because we do not distinguish the components of the system into computation and memory: the memory itself is used for computation, and we use the analog operating principles of the memory cells themselves to perform computation.

So that was an extremely fast recap of what we saw in the earlier lectures on processing-in-memory. I don't know if there are any questions — I guess there were already some homeworks on that, so you should be quite sharp on processing-in-memory, and hopefully the basic concepts have everyone on the same page. What I'm going to do today is elaborate more on those two subjects: we are going to talk in a little more detail about some processing-using-memory architectures that we have been designing here in SAFARI and some design principles we follow, and we are going to spend a lot of time as well on processing-near-memory architectures, which we didn't spend much time on before. In particular, I'm going to put some focus in this lecture on explaining
some of the real processing-in-memory architectures that are out there, either as prototypes or as commercially available chips you can buy on the market. Hopefully this motivates you to work in this field after your Master's or your studies, because these opportunities now exist in industry, which was not the case when I started doing this five years ago — so it's really good for me as well.

Starting with processing-using-memory: before I talk about it, you probably already know this by heart, but I need to give you a refresher on how DRAM is organized, otherwise nothing I say after this point is going to make sense. So I'm going to give a really quick background on how DRAM operates and is organized; if you have any questions about that, feel free to ask after I'm done.

As you probably already know at this point, DRAM is organized as a hierarchy of components. At the lowest level of this hierarchy is what is called a DRAM mat. A mat is a 2D array of DRAM cells, horizontally connected through wordlines and vertically connected through bitlines; the bitlines share a local sense amplifier, and the wordlines share a local row decoder. At the next level of the hierarchy we have a collection of those DRAM mats, which we call a DRAM subarray. A DRAM subarray shares a global row decoder and a global wordline, which spans across the several DRAM mats inside the subarray, a global sense amplifier, which is shared across the local sense amplifiers in the subarray, and some I/O interface, which is used to move data from the DRAM mats to the outside world — to the memory channel and the memory controller.

To access DRAM, two main operations are performed. The first is an ACTIVATE: when an ACTIVATE is issued to a given DRAM row, the data in the DRAM cells is read and amplified by the local sense amplifiers. After the data is sensed and amplified, the memory controller can issue READ or WRITE commands to that data, which bring portions of a column of the data to the global sense amplifiers and eventually to the memory channel.

We also saw in the previous lecture that we can use these operating principles of DRAM — this ACTIVATE command in particular — to perform some simple in-DRAM operations. We started really simple, with in-DRAM row copy operations. The goal is to copy one source row to a destination row without involving the CPU die at all, and to do that we presented the RowClone work, which proposes issuing back-to-back ACTIVATEs to DRAM rows to perform this in-DRAM row copy. When you issue this row copy operation, the first ACTIVATE to the source row copies the source row into the local sense amplifiers, and then the second ACTIVATE asserts the wordline of the target DRAM cells, which allows the local sense amplifiers to drive the previously sensed data into the destination row — hence realizing the in-DRAM row copy. We also saw that by doing so we can reduce the latency of copying four kilobytes of data inside the DRAM chip from roughly 1,046 ns to around 90 ns, and also significantly improve the energy consumption of this internal copy operation.
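To make the mechanics concrete, here is a minimal sketch I wrote for illustration — not the actual RowClone hardware or command interface — modeling an intra-subarray copy as two back-to-back activations. The class and method names, and the row size, are assumptions.

```python
# Toy model of a RowClone-style in-DRAM row copy (illustrative only).

class Subarray:
    def __init__(self, num_rows, row_bytes=8192):
        self.rows = [bytearray(row_bytes) for _ in range(num_rows)]
        self.row_buffer = bytearray(row_bytes)   # models the local sense amplifiers

    def activate(self, row):
        # ACTIVATE: sense the row into the row buffer; charge restoration then
        # rewrites the sensed value back into the activated row's cells.
        self.row_buffer[:] = self.rows[row]
        self.rows[row][:] = self.row_buffer

    def rowclone_copy(self, src, dst):
        # Back-to-back ACTIVATEs: the second activation connects the destination
        # row to bitlines still driven by the sense amplifiers, so the source
        # data gets restored into the destination row.
        self.activate(src)                        # latch src into the sense amps
        self.rows[dst][:] = self.row_buffer       # second ACT drives dst from the amps

sa = Subarray(num_rows=512)
sa.rows[3][:4] = b"\xde\xad\xbe\xef"
sa.rowclone_copy(src=3, dst=7)
assert sa.rows[7][:4] == b"\xde\xad\xbe\xef"
```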
We also saw that we can extend this approach to something fancier: we can do in-DRAM majority-of-three operations by simultaneously activating three DRAM rows at the same time. When you activate those three rows simultaneously, the three cells involved in the computation simultaneously perturb the local bitline, and since the local sense amplifier is a differential sense amplifier, it pulls the voltage toward the majority of the voltage levels stored in the three cells themselves. In the example here, the majority of blue, yellow, and blue is blue. By performing majority operations like this, we can significantly improve the energy efficiency of in-DRAM Boolean operations compared to a baseline CPU die. We talked about the RowClone and Ambit papers in detail in the first memory-centric computing lecture, so hopefully you remember them — there were already some homeworks on this subject.
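As a toy illustration of the intuition behind the triple-row activation — a sketch under simplifying assumptions, not the analog charge-sharing circuit — the sense amplifiers settle to whichever value the majority of the three activated cells hold, and charge restoration then overwrites all three cells with that result:

```python
# Toy model of an Ambit-style triple-row activation (illustrative only; the real
# mechanism is analog charge sharing on the bitlines, not digital logic).

def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three equal-width integers (one bit per bitline)."""
    return (a & b) | (a & c) | (b & c)

def triple_row_activate(rows, r1, r2, r3):
    """Activate three rows at once: the sense amps settle to the per-column
    majority, and charge restoration overwrites all three source rows."""
    result = maj3(rows[r1], rows[r2], rows[r3])
    rows[r1] = rows[r2] = rows[r3] = result      # destructive update
    return result

rows = {0: 0b1100, 1: 0b1010, 2: 0b0110}          # three 4-bit rows
print(bin(triple_row_activate(rows, 0, 1, 2)))    # 0b1110: per-column majority

# AND and OR fall out as special cases when one input row holds a constant:
ZERO, ONE = 0b0000, 0b1111
assert maj3(0b1100, 0b1010, ZERO) == 0b1000       # bitwise AND
assert maj3(0b1100, 0b1010, ONE)  == 0b1110       # bitwise OR
```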
Now we are going to elaborate a little more on how to extend those processing-using-DRAM operations to more complex computations, and next I'm going to talk about some works that are part of my PhD thesis.

The first one is SIMDRAM — Onur briefly mentioned it in the lecture. The problem we are trying to solve here is the following: okay, we can do in-DRAM copy and in-DRAM majority operations; that's really cute, that's really nice, but is this useful for me to accelerate my kernel that mostly does addition, or multiplication, or division? The problem is that before this work we could only support some really simple primitives for in-DRAM computation, which can be useful for some workloads but are not widely applicable to a broad range of applications. We also had a limited and fixed set of operations: you have those two primitives, in-DRAM row copy and in-DRAM majority, and that's it. What if I want to do something else — how do I provide a new, flexible primitive for my application on top of this substrate? At the same time, we want minimal modifications to the DRAM circuitry itself, because if you think about it, for addition you could just tell me "put an adder on each bitline next to the local sense amplifiers and then you can do addition" — that is easy, and it would solve the first two problems — but it comes at an extremely high area cost, which makes the substrate quite impractical to deploy in real scenarios. So, at the same time, we want to keep modifications to the DRAM chip as small as possible.

Basically, the task we went after was to design a framework for processing-using-memory that allows the user to efficiently implement complex operations in DRAM, in a flexible way, while not modifying (or only minimally modifying) the DRAM architecture. The key idea of this work was SIMDRAM: an end-to-end processing-using-DRAM framework that provides the programming interface, the instruction set architecture, and the hardware support to compute complex operations in DRAM, using an in-DRAM, massively parallel SIMD substrate.

To do SIMD, we use two main key ideas. The first is to use a vertical data layout for the processing-using-DRAM data. As you probably already know from the homeworks, in conventional systems data is stored in the DRAM chip in a horizontal data layout: if you have a cache line of 64 bytes, you take those 64 bytes and divide them across, for example, the eight DRAM chips in a DRAM rank, and the 64 bits that land in one DRAM chip are spread horizontally within one DRAM row, across many different columns. The problem with that is that if you need an arithmetic operation that requires moving data across different DRAM columns — think about a ripple-carry adder, where we need to propagate a carry from one bit to the next — we have no way of doing that without adding some logic or interconnect that moves data across columns, which comes at its own cost. The alternative is to not store the data in a horizontal layout: we can store it however we want, and we store it in a vertical data layout. Here we take those bits inside a single DRAM chip — in the example, four bits — and we store all of them in a single DRAM column, spanning multiple DRAM rows in a DRAM subarray. What this allows us to do is implicitly shift data using RowClone operations: if I want to move this bit to that bit position, I just RowClone this row to that other row. So if this is your most significant bit and this is your least significant bit, and I want to propagate a carry from here to here, I just copy these bits to those bits using the double-row activation from RowClone. The other thing this allows is to think of a subarray as a massively parallel SIMD substrate: if I keep doing in-DRAM operations across the different columns, I effectively have a single instruction — the in-DRAM majority — operating over multiple data elements, each stored in a single DRAM column. So now I can see my DRAM subarray as a single-instruction, multiple-data processing engine with as many SIMD lanes as there are DRAM columns in a subarray — around 65,000 columns in a DDR4 configuration. It's a very wide SIMD engine, and you are going to see the problems that causes later on.

The second key idea is to use majority-based computation for arithmetic. In the prior work, the Ambit paper, the main operation was the triple-row activation realizing a majority, but one of the inputs was always set to zero or one so that we could do Boolean AND or OR operations, and we could compose, for example, a ripple-carry adder out of AND, OR, and NOT gates. However, since the primitive we have for in-DRAM computation is fundamentally a majority gate, it is much more beneficial to use majority-based logic directly instead of conventional Boolean algebra to compose an operation. This gives us higher performance, because we need fewer triple-row activations to realize a given operation, and hence higher throughput, because the latency is smaller.
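Putting the two key ideas together, here is a small sketch — my own illustration, not SIMDRAM's actual micro-program — of bit-serial addition over vertically laid-out operands: each "row" holds one bit of every SIMD lane, and the only compute primitive is the bitwise majority shown earlier. The full-adder identities used (carry_out = MAJ(a, b, c_in), sum = MAJ(c_in, NOT(carry_out), MAJ(a, b, NOT(c_in)))) are standard majority-logic identities; names and the two-lane example are assumptions.

```python
# Sketch: bit-serial addition of two n-bit unsigned vectors in a vertical layout.
# A_rows[i] is an int whose bit j is bit i of lane j's value, i.e., one DRAM row
# holds bit i of every SIMD lane.

def maj(a, b, c):
    return (a & b) | (a & c) | (b & c)

def bit_serial_add(A_rows, B_rows, n_bits, lane_mask):
    """Lane-wise add using only MAJ and NOT, applied bitwise across all lanes."""
    carry = 0
    out_rows = []
    for i in range(n_bits):                          # one iteration per bit position
        a, b = A_rows[i], B_rows[i]
        carry_next = maj(a, b, carry)                # carry_out = MAJ(a, b, c_in)
        s = maj(carry, ~carry_next & lane_mask,      # sum = MAJ(c_in, !c_out, MAJ(a, b, !c_in))
                maj(a, b, ~carry & lane_mask))
        out_rows.append(s & lane_mask)
        carry = carry_next
    return out_rows

# Two lanes (columns), 4-bit values: lane 0 computes 5 + 6, lane 1 computes 3 + 7.
A = [0b11, 0b10, 0b01, 0b00]     # vertical layout of (5, 3)
B = [0b10, 0b11, 0b11, 0b00]     # vertical layout of (6, 7)
S = bit_serial_add(A, B, 4, lane_mask=0b11)
vals = [sum(((S[i] >> lane) & 1) << i for i in range(4)) for lane in (0, 1)]
assert vals == [11, 10]          # 5 + 6 and 3 + 7, computed fully in parallel per lane
```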
Based on those ideas, we devised this three-step framework — I don't know why all of my works end up being three-step frameworks; it's never four or two, always three. The framework takes as input the desired operation from the user, in the form of an AND/OR/NOT representation of the gates. This can be, for example, an RTL description of the circuit or a graph-based description — it does not matter, as long as it is in AND/OR/NOT form or AND/OR/NOT logic can be inferred from it. In the first step of the framework, we convert this input into a graph that uses only majority and NOT gates. What is left to do then is to map this graph into a sequence of in-DRAM row copies and in-DRAM majority operations. The result is what we call a micro-program: basically a recipe that tells the DRAM chip how to realize the user's input graph using processing-using-DRAM operations. We store this micro-program inside the DRAM chip for future use, and we add new SIMDRAM instructions to the CPU ISA so that the user can instantiate the operation in their application. During execution, the user includes the corresponding intrinsic in the application; a control unit in the memory controller recognizes that this is a processing-using-DRAM instruction, fetches the micro-program, and dispatches, one by one, the sequence of in-DRAM copies and activations that realize that operation inside the DRAM chip itself. I'm going to go into a little more detail on these three steps next.

The first step, as I mentioned, is simply to convert the input AND/OR/NOT graph into its equivalent majority-inverter graph. This can be done naively using the same equivalences the Ambit paper uses: if you want an AND gate and you only have a majority gate, you set one of the inputs to zero, and if you want an OR gate, you set one of the inputs to one. So you can take the initial graph, apply these equivalences, and each gate is replaced by its equivalent majority gate, yielding the majority/NOT version of the same circuit. However, as I'll show next, if you only do this you are not going to improve the performance of the circuit, because you need the same number of triple-row activations to realize this graph as the original one. But we can apply some greedy algorithms: think of the logic simplification you do in Boolean algebra — De Morgan's laws and so on. There is an equivalent set of laws for majority-based graphs; it's not exactly the same procedure, but it is the equivalent for majority-inverter graphs. We apply those laws in a greedy optimization algorithm so that we end up with a simplified version of the same circuit. I'm not going to walk through every input of this majority-inverter graph, but you can trust me that the output of this circuit with four gates is exactly the same as the output of this circuit with a single gate — so we go from four triple-row activations to only one if we apply this circuit simplification on top of the initial circuit. To summarize, the goal of this first step of the SIMDRAM framework is to generate an optimized majority/NOT implementation of the desired operation.
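Here is a minimal sketch of the naive step-1 rewriting (before the greedy optimization passes): every AND node becomes a majority with a constant-0 input, and every OR node a majority with a constant-1 input, which is exactly the Ambit equivalence mentioned above. The tuple encoding of the netlist is my own toy representation, not SIMDRAM's internal format.

```python
# Toy netlist nodes: ('AND', x, y), ('OR', x, y), ('NOT', x), or a leaf signal name.

def to_mig(node):
    """Rewrite an AND/OR/NOT tree into majority/NOT form (naive, unoptimized)."""
    if isinstance(node, str):
        return node
    op, *args = node
    if op == 'NOT':
        return ('NOT', to_mig(args[0]))
    a, b = (to_mig(x) for x in args)
    const = '0' if op == 'AND' else '1'           # AND -> MAJ(a, b, 0); OR -> MAJ(a, b, 1)
    return ('MAJ', a, b, const)

# (a AND b) OR (NOT c)
print(to_mig(('OR', ('AND', 'a', 'b'), ('NOT', 'c'))))
# -> ('MAJ', ('MAJ', 'a', 'b', '0'), ('NOT', 'c'), '1')
```

The optimization step described above would then apply majority-algebra identities greedily to shrink the number of MAJ nodes; this sketch does not attempt that.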
Then we move to the second step, whose goal is to generate the recipe I mentioned — the micro-program — which dictates how to implement that majority/NOT circuit in DRAM using only in-DRAM row copies and triple-row activations. This step is broken into two tasks: the first is to allocate DRAM rows to the operands — to each edge of the majority-inverter graph — and the second is to generate the micro-program in an efficient manner. I'll briefly describe both.

Starting with task one, the allocation of DRAM rows to operands: we need to consider two main restrictions of processing-using-DRAM architectures. These are, for the most part, engineering restrictions — at least the first one is; the second is more fundamental — and they could be lifted if you were willing to pay the cost. The first is that in the Ambit design, if you recall, we take the DRAM subarray and divide it into three groups of rows: data rows, which just store your data; constant rows — one row storing all ones and another storing all zeros; and a group of rows called the B-group, or bitwise group, which is connected to a special row decoder that allows us, with a single address, to activate three DRAM rows at the same time. To keep that row decoder simple, we have only a few of those compute rows in the DRAM subarray. So when I allocate the input rows for the majority-inverter graph, I need to take into account that I do not have an infinite number of rows for computation: you can think of them as registers, and the number of registers available for the operation is quite limited — something like 16 addresses, for example. The second constraint is more fundamental: a triple-row activation is destructive. If I activate three DRAM cells during a triple-row activation, the data initially stored in those three cells is simultaneously overwritten by the output of the majority. This happens because the activation connects the DRAM cells to the local bitline — that is what an activation does — and once the local sense amplifier finishes the amplification process, the charge restoration process starts automatically right after, so the value latched in the sense amplifier is written back to the three DRAM cells involved in the computation. If you need that data later, you have to copy it to a temporary row first, otherwise you destroy it.

Taking those two restrictions into consideration, we designed an allocation algorithm that first assigns as many inputs as there are free compute rows available. What we actually use for this is a register-allocation algorithm, which is quite common in compiler routines — I forgot exactly which one we use, but it's in the paper; there are many in the literature with different trade-offs, we picked one, and we treat those rows as registers, basically. The second thing we do is exploit the fact that, after the triple-row activation, the three DRAM rows
all hold a copy of the result. If a following computation needs that value as an input, I can use any one of those three rows as the input. This relaxes the allocation a bit, because instead of having only one register holding the output of the previous computation, I now have three registers available and can pick whichever is most convenient during allocation.

The second task is that, now that I know how the inputs and outputs map to DRAM rows, I can start generating the triple-row activations and in-DRAM row copies that realize the operation. Basically, we traverse the optimized majority-inverter graph and generate: in-DRAM row copies for the input rows (so we don't destroy the data), a majority operation — realized with a triple-row activation — for each majority gate, and finally a copy to the temporary or destination row of the computation. This already gives you a valid micro-program — a valid sequence of in-DRAM operations that realizes the computation inside the DRAM chip. But we can optimize it further by exploiting some properties of the special row decoder that Ambit uses, so that we reduce the number of in-DRAM operations: we can coalesce some of those in-DRAM row copies into a single row copy, and we can merge a majority followed by an in-DRAM row copy into a single activate-activate-precharge command, because the first ACTIVATE does the triple-row activation and the second ACTIVATE does the copy. This gives us an optimized micro-program, which can perform the target in-DRAM computation more efficiently.

What remains is that this only computes one bit. I forgot to mention: since we are computing in a vertical data layout rather than a horizontal one, we have to compute in a bit-serial manner — bit by bit, row by row, through the data word. So if you have 32 bits, you take 32 steps to get the output for, say, an addition. Multiplication is worse, because bit-serial multiplication scales quadratically with the number of input bits: a 32-bit multiplication takes on the order of 32 × 32 steps. So what is left is to generalize the one-bit operation to n bits. If it's something simple like addition, we just repeat the process as many times as the number of bits; if it's a bit more complicated — because the computation may depend on some of the bits you just computed — the micro-program may also need some control. At the end of this process, as I mentioned, we have a micro-program that implements the operation in DRAM; we store it for future use, and we create a new CPU instruction called bbop so that the user can interface with it from the application.
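To give a feel for what a micro-program looks like, here is a sketch that emits a command sequence for a single majority gate, using the two command idioms this style of substrate exposes: AAP (ACTIVATE-ACTIVATE-PRECHARGE, i.e., a row copy) and AP (ACTIVATE-PRECHARGE on the special triple-row address, i.e., a majority). The row names (T0-T2 for compute rows) and the absence of the coalescing optimizations described above are simplifying assumptions for illustration.

```python
# Sketch of micro-program emission for one majority gate (illustrative command
# and row names; the real framework further coalesces commands and allocates rows).

def emit_maj(src_rows, dst_row, compute_rows=('T0', 'T1', 'T2'),
             triple_addr='B_T0_T1_T2'):
    prog = []
    # Copy each input into a designated compute row so the destructive
    # triple-row activation does not clobber the original operands.
    for src, t in zip(src_rows, compute_rows):
        prog.append(('AAP', src, t))              # row copy: ACT src, ACT t, PRE
    # Triple-row activation on the special B-group address computes MAJ in place.
    prog.append(('AP', triple_addr))
    # The result now sits in all of T0/T1/T2; copy any one of them to the destination.
    prog.append(('AAP', compute_rows[0], dst_row))
    return prog

for cmd in emit_maj(('row_A', 'row_B', 'row_Cin'), 'row_Carry'):
    print(cmd)
```

As the talk notes, the final copy could be merged with the preceding majority into a single activate-activate-precharge whose first activation uses the triple-row address — that is exactly the coalescing optimization mentioned above.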
The final step, then, is to execute this micro-program, and we want to do it transparently to the user: we don't want the application to start issuing in-DRAM activations itself — the application doesn't have the ability to do that anyway. So we create a control unit inside the memory controller which, when you issue a bbop instruction, loads the micro-program for the corresponding operation, issues the ACTIVATE and PRECHARGE commands that realize the micro-program one by one, and also handles whatever control is required; at the end, the output is inside the DRAM chip.

In the paper we also discuss several system integration challenges that need to be taken into account to realize this in a real system, and the most important one, I would say, is the data transposition that needs to happen. As I mentioned at the beginning, for processing-using-DRAM operations you need to store the data in a vertical layout instead of a horizontal one. At the same time, you don't want to store all of the system's data in a vertical layout: there is a reason why we store data for the CPU in a horizontal layout — we want to exploit the memory-level parallelism of the DRAM module — and we don't want to break that either. So we now have the challenge of switching between the SIMDRAM layout (vertical) and the CPU layout (horizontal) within the same application. To do this efficiently, we implement a transposition unit that sits between the last-level cache and the memory controller and tracks objects — memory addresses — that will later be used for processing-using-DRAM operations, acting on the cache-line eviction path. If I have an address A that will later be used for processing-using-DRAM operations, I register that address, and every cache line of that address that is evicted to DRAM (as your application keeps running) is intercepted; I transpose that cache line and write the data in a vertical layout instead of a horizontal one. Later, if you need to read that cache line back from DRAM, I read the data, transpose it back to the horizontal layout, and move it to the cache already in the correct format.

Sorry, there's a question: is it when the data is about to be used that you transpose? — No, it's not when the data is about to be used, and it's also not the entire DRAM that gets transposed — only the registered memory objects. The model assumes that you, as the programmer, know which arrays will later be used in processing-using-DRAM operations. For those arrays, you register their addresses with this object tracker, and every time a cache line of that particular array is evicted to the memory chip, we transpose that cache line — but only the cache lines registered there. So if I'm going to do the addition A + B = C, you register A, B, and C, and the data of A and B is transposed as it is evicted; but if you do an assignment between D and F, which has nothing to do with A and B, that data is never transposed — it is always kept horizontal.
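Here is a minimal functional sketch of what the transposition step does to a cache line on its way out to DRAM — a software model only; in the design described above this is a hardware unit between the last-level cache and the memory controller, and the object tracker decides which lines it applies to. The element width and helper names are assumptions.

```python
# Sketch: transpose a 64-byte cache line holding 16 x 32-bit elements from the
# usual horizontal layout into a vertical (bit-sliced) layout, and back.

def to_vertical(words, bits=32):
    """Return `bits` slices; bit j of slice i is bit i of element j."""
    return [sum(((w >> i) & 1) << j for j, w in enumerate(words))
            for i in range(bits)]

def to_horizontal(slices, n_words, bits=32):
    """Inverse transform: rebuild the original elements from the bit slices."""
    return [sum(((slices[i] >> j) & 1) << i for i in range(bits))
            for j in range(n_words)]

line = list(range(100, 116))                 # 16 elements of a PIM-registered array
vertical = to_vertical(line)                 # what is written to DRAM on eviction
assert to_horizontal(vertical, 16) == line   # what the unit restores on a read-back
```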
So this is a best-effort approach. We could have done something different: we could have said, right before doing the computation, "I am going to transpose the data of A and B." But that puts the transposition latency on the critical path of the computation itself, because I first need to transpose and then compute. Here, since we don't expect the entire application to execute in DRAM — that has its own limitations — we expect the application to have phases: you initialize your data, you move on with your life, and at some point you trigger the processing-using-DRAM computation. The hope is that while you were "moving on with your life," the cache lines of that data were already evicted to the DRAM chip, so the transposition does not sit on the critical path of the processing-using-DRAM operation — it overlaps with, I don't know, reading a file or whatever else the application is doing. When you are just initializing the data, it is written in the traditional horizontal layout; the moment the data starts leaving the cache, it starts being transposed — you don't initialize data inside DRAM, right? We initialize data in the CPU, in the caches, first, and eventually it starts being evicted to the DRAM chip. — But the same A, B, C also need to be used to compute... — Yes, and we don't compute it twice. I wasn't clear because I didn't give the full sequence for the ripple-carry, but basically, since I know a given bit is going to be used both for the carry and for the sum, once I compute it I also copy it to a temporary row so I can use it later without destroying the bit needed for the sum. We move A and B once — actually, we move A twice because of how the circuitry is designed, but B is only moved once; A needs to be moved twice. Okay, thank you for the questions.

We evaluated SIMDRAM using the gem5 CPU simulator, comparing it against a CPU, a GPU, and against doing the same computation the Ambit way, using Boolean AND/OR/NOT operations. We also evaluated several configurations of SIMDRAM, using one, four, or sixteen DRAM banks for computation. We used this three-step framework to implement 16 different in-DRAM operations, including addition, multiplication, division, and subtraction, and we used those operations to accelerate seven real-world applications, including some neural networks and some database primitives. In terms of throughput for those 16 in-DRAM operations, SIMDRAM can significantly outperform the CPU, but it cannot outperform the GPU if you use only one bank for computation; if you scale the number of banks, eventually you outperform the GPU as well — this is aggregated across all 16 operations — and it also outperforms the state-of-the-art processing-using-memory system at the time, Ambit. In terms of energy efficiency, the results are much better: for those 16 operations, SIMDRAM outperforms all three baseline systems. And when you consider the end-to-end speedup for the seven real-world applications, SIMDRAM can outperform the CPU and the GPU, depending on
the number of DRAM banks you enable for computation, and it can outperform Ambit as well. There are many more studies in the paper — I don't know if it's required reading, but maybe it is, so please consult it. — You have a question? Sure. — [Question: is some lock needed — how does the CPU synchronize with this?] — We discuss this in the paper: the bbop instruction is simply integrated into the pipeline, so there is no need for a lock, because this does not operate as a side accelerator — the CPU still controls everything. The bbop instruction is treated like any other load instruction: it is registered in an MSHR entry, goes through the memory hierarchy, and so on, and eventually a response has to go back to the CPU. So if the CPU depends on that particular bbop instruction, it is as if it depends on a load: it waits for it to come back. In that sense, this is not asynchronous with respect to the CPU. — [Question: is there a limit to the size of the program, because the subarray has a limited size? And does it support moving data?] — Yes, and not in this work: for all of those operations the data fits, or can be partitioned in a way that you don't need to move data across different DRAM subarrays. But we discussed the LISA paper here in the lecture, which connects subarrays using isolation transistors, and we could move data using that approach, for example, if the data does not fit in a single subarray. — [Question: does the data have to be at least the size of one row for this to be beneficial?] — To be fully utilized, yes, it needs to be at least the size of a row. I'm going to talk about this later, because that is exactly the topic of the other work we are going to cover. So let's take a break before I go into the next one; I'll just follow the bell, so a regular 15-minute break.

[Break]

Welcome back, everyone. So, where were we? We were doing processing-using-DRAM operations: we can do in-DRAM copy, we can do in-DRAM majority operations using Ambit, and we can do arithmetic operations using SIMDRAM. If I want to make this a complete compute engine, can you tell me what type of operations are still missing? — [Control flow.] — That is one, for sure. Other than control, does anyone have an idea? How do I do a sine or a cosine? — Yes: there is a class of operations called transcendental functions that you can approximate using arithmetic operations, but you cannot necessarily compute them directly — or they are tabulated — because they come from a continuous space rather than a discrete one. The goal of this pLUTo paper is to close that gap. Basically, we want to further extend the capabilities of processing-using-DRAM so that we can do complex operations — complex here meaning, in particular, the kind required for transcendental functions. The key idea of pLUTo, our solution for this, is a quite old idea: instead of using arithmetic operations to do a particular computation, we precompute that computation, store it in a lookup table, and replace the operation with a memory read — basically a lookup-table access — and then we have our output. This is not new — it is what an FPGA does, for example — but we are going to realize
it inside DRAM in an efficient way, with what we call the pLUTo lookup-table query operation. I'm going to illustrate how pLUTo works with a running example. It's not necessarily a transcendental function, but the same principle applies. Say you have a lookup-table query where I want to return the second, first, second, and fourth prime numbers in the natural sequence. Before doing anything, I tabulate the sequence of prime numbers: the first prime is 2, the second is 3, the third is 5, the fourth is 7. I store this in a table and index the entries in order, so LUT index 0 points to the first prime number, which is 2; LUT index 1 points to the second prime number, which is 3; and so on. Then I translate the query into an input vector whose elements are indexes into the lookup table I'm going to query: a 1 indicates the second prime number, a 0 the first prime number, a 3 the fourth prime number; and I want the output vector to be 3, 2, 3, 7 — the second, first, second, and fourth prime numbers.

Now I map those structures onto DRAM structures — subarrays and so on. In one DRAM subarray I have the input vector that I want to use to query the lookup table; in another DRAM subarray I have the output vector, which will hold the result of my LUT query operation; then there is the DRAM subarray that stores the lookup table itself; and we add to the DRAM bank some matching logic, which compares the ID of the row I am currently accessing — row 0, 1, 2, or 3 — with the elements of the input vector. If there is a match between the row I'm currently accessing and one of the elements of the input vector, the matching logic asserts a connection from the row buffer to the output vector, which lets us copy the data at that position into the output vector. As I mentioned, the lookup table is stored in a DRAM subarray, and you can keep multiple copies of the lookup table inside the subarray so that we can exploit DRAM parallelism. In theory you can have as many elements in the lookup table as... well, it's related to the size of the DRAM row — not exactly the size of the row, it depends on the bit precision you use — but you get the idea.

To operate this, we create a new instruction called a pLUTo row sweep, which works similarly to the auto-refresh operation: we activate each of those DRAM rows, one by one, in order, with a single DRAM operation. Let's see how it goes. Upon receiving this pLUTo row-sweep operation, we activate the first row, which works like any other DRAM row activation: it copies the data in that row into the row buffer, and now my matching logic starts asking some questions — all of these questions happen in parallel, of course, but I'm going to show them serially here so we can follow along. So first,
the matching logic compares the address I am currently activating — address zero — with the elements of my input vector, and asserts the matching logic at the positions where there is a match. Checking the first element of my input vector: am I currently accessing the row it wants? No — it wants row one — so that position is not asserted. The second element wants row zero, and I am currently accessing row zero, so that position is asserted, and I copy the data from the row buffer into the output vector at that position. Am I accessing the row the third element wants? No. The fourth? No. So no other connection is made through those isolation transistors, and nothing else is copied into my output vector. Okay, this activation is done; I move to the next row. The process repeats and the matching logic asks its questions again. Now I have a match on the first element of my input vector — I'm accessing row one — and on the third element — yes, row one again — so those positions in the matching logic are asserted and I copy the data from the local row buffer into the output vector at those positions. Then I move to the next row, row two: the data is copied into the local row buffer, but there is no match in my input vector, because I'm not querying the third prime number in this lookup-table query. Finally I move to the last row, index three, and I have a match at the last position of my input vector. So at the end I have, in my output vector, the result of my lookup-table query: the second prime number (LUT index one) is three, then the first prime number is two, the second prime number is three, and the fourth prime number is seven. — Question: [are these index values stored vertically?] — Good question: here we went back to the horizontal data layout. This "2" here is just stored as regular horizontal bits — one, zero — and the same for the other indices.
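Here is a small sketch of the row-sweep query just walked through — a functional model I wrote for illustration, not pLUTo's circuit: one activation per LUT row, and on each activation the matching logic copies the row-buffer value into every output position whose index equals the currently activated row.

```python
# Functional model of a pLUTo-style lookup-table query via a row sweep.

def pluto_row_sweep(lut_rows, input_vector):
    output = [None] * len(input_vector)
    for row_id, row_value in enumerate(lut_rows):    # one ACTIVATE per LUT row
        # Matching logic: every lane whose queried index equals the activated
        # row id latches the row-buffer value into the output vector
        # (all lanes are checked in parallel in the hardware).
        for lane, wanted in enumerate(input_vector):
            if wanted == row_id:
                output[lane] = row_value
    return output

primes = [2, 3, 5, 7]                         # LUT contents, one entry per DRAM row
print(pluto_row_sweep(primes, [1, 0, 1, 3]))  # -> [3, 2, 3, 7]
```

Note that the cost is one sweep of activations over the LUT rows regardless of how many lanes query the table at once, which is where the throughput of this approach comes from.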
The question that would be natural to ask now is: how do I implement this matching logic? Different answers lead to different trade-offs, so we implemented three different versions of pLUTo. The first is what we call the buffered sense amplifier design, which is exactly what I just described in the example: it uses some auxiliary storage — extra flip-flops — to hold the matched values as I progress through the rows; when I am activating a particular row, the matched value (seven, for example) is stored in the corresponding flip-flop, and after the sweep is done the data in the flip-flops is copied into the output vector. The second design is what we call the gated sense amplifier: here we use the sense amplifiers of the output vector themselves to store the data as the row sweep progresses, but this is destructive to the data in the DRAM cells there, because once I copy the data into the sense amplifier, the charge restoration process kicks in and overwrites what was stored in those cells. The third design is what we call the gated memory cell: here we add an extra access transistor to the DRAM cell, connected to the matching logic, so I only overwrite the data when there actually is a match — before, the data was destroyed even without a match.

All three designs have different trade-offs in performance, energy efficiency, and area efficiency. The first design, the buffered sense amplifier with the flip-flops, is the middle design: middle performance, middle energy efficiency, and middle area efficiency — it sits between the other two. The second, the gated sense amplifier, which reuses the sense amplifiers to store the data, has the highest area efficiency, because I'm not adding any extra transistors beyond the matching logic, but the lowest performance and lowest energy efficiency, because I need to do a lot of copies to preserve data as I go. The third, the gated memory cell, has the highest performance and highest energy efficiency but the lowest area efficiency, because (I have so many animations here) I'm adding one extra transistor per DRAM cell, which adds a lot of area. So in the paper we cover all three designs, and depending on what you are designing for, you can select the appropriate pLUTo design.

In the paper we also discuss the system integration, which lets you write code using the pLUTo APIs; that code is converted into a series of in-DRAM operations — OR operations, bit shifting, and so on, which are required to realize the row copies and matching operations — and finally an execution engine triggers the appropriate in-DRAM row activations and precharges to realize the computation. For more details, I invite you to check the paper and this beautiful figure that I drew. Again, we evaluated pLUTo in simulation — the simulator is available online — and the paper has a lot of analysis, but here I'm going to focus on performance and energy. We used pLUTo to accelerate seven real-world applications that were not previously supported by prior works, including SIMDRAM, because they require some sort of transcendental computation, plus some synthetic kernels doing addition and multiplication, which were supported by SIMDRAM. Here is the performance of pLUTo compared to the CPU, the GPU, and a processing-near-memory architecture that uses 3D-stacked memory with some logic for computation: depending on the configuration, pLUTo can outperform those architectures by varying margins, depending of course on the baseline architecture. If you normalize to area, the results are much better, because the area of the CPU and GPU is much larger. We also see significant energy savings compared to the CPU and the GPU. So this was pLUTo; there are many more details in the paper.
So now we're good, right? We have in-DRAM row copy, in-DRAM majority and Boolean operations, arithmetic, and transcendental functions. Now I can take my C application, map whatever is in there onto this processing-using-memory substrate, and get great performance benefits, right? No — that is wrong, and this is actually related to some of the questions asked before. The main problem we identified with this processing-using-DRAM substrate is actually also one of its main benefits: the parallelism the substrate provides is really large. When you are doing bit-serial computation — or even bit-parallel computation with a pLUTo-like substrate — your granularity is a DRAM row, so you always need to operate on multiples of a DRAM row, and a DRAM row is really large: around 8 kilobytes (on the order of 65,000 bits) across a DDR4 rank. The question is: do I always have multiples of 8 kilobytes of data in my C-like application, so that I can fully utilize that width? The answer is no. The way I came to this realization was that, when I was looking for applications to accelerate in the SIMDRAM paper, I struggled a lot to find those seven real applications that could fully utilize the substrate, and then I realized: okay, this is good for seven applications, but it is not good for 200, because what actually happens in those other 200 is that applications — and the loops inside them — have varying degrees of parallelism. Sometimes you have exactly 8 kilobytes of parallelism, and that is perfect; but sometimes you have 1.5 times the size of the DRAM row, or 10.3 times. What happens to the tail that is left over? That tail generates underutilization: you lose throughput, and you also increase the baseline energy consumption, because you are activating DRAM rows that only produce zeros, which consumes more power in your system.

So the problems we identified in this follow-up paper are the following. First, current processing-using-DRAM systems suffer from a severe SIMD underutilization problem, because the granularity of computation is fixed and rigid — always 8 kilobytes. Second — I don't know if you noticed, but I have always been talking about computing over arrays: I have one array, a second array, and I generate a third array. What happens if I need a reduction? If I have one array and I need to reduce it to a scalar, I cannot do that on this substrate, because there is no communication across the different DRAM columns, which would be required. Finally — this applies to SIMDRAM and also to pLUTo — you need to program this as if you were writing assembly: you have to go into your code, identify which operations are good candidates for processing-using-DRAM computation, unroll the loop if there is one, and insert the processing-using-DRAM operations for addition, subtraction, a pLUTo operation, whatever you want to do. This puts a lot of burden on the programmer, who needs to be quite an expert to use this efficiently.

To solve these problems we proposed a system called MIMDRAM — Onur already mentioned it in his lecture, so I'm going to be quite fast here; I just want to give more details on the hardware design. MIMDRAM uses the idea of fine-grained DRAM for processing-using-DRAM computation. In fine-grained DRAM, basically, we have the DRAM subarray and, as I mentioned in the background, within the subarray there are several DRAM mats; but when I activate a DRAM row, the global wordline propagates across all of the mats and activates all of them at once. Instead of
doing that, we segment this global wordline: we add some isolation transistors which allow me, when I activate a DRAM row, to activate only the mats that are going to be involved in the computation (my animation is broken, but that's the idea). This gives us a more flexible SIMD granularity: if your loop has, say, a vectorization factor worth 1.5 mats, I activate only the mats I need for the second portion of the computation — I don't activate the entire set of mats. This improves the throughput and the energy efficiency of the substrate. The other thing this allows is that, if I am only activating half of my mats for one operation and I have another operation that is data-independent from the first, I can use the other half of the mats to execute that other computation. So now, instead of looking at this substrate only as a single-instruction, multiple-data engine, I can look at it as a multiple-instruction, multiple-data engine: multiple instructions operating across different DRAM mats, each at the same time operating over multiple data elements stored across the different DRAM columns. The second thing this enables is reductions: we can use the global sense amplifiers to move data from one mat to another, and hence reduce the data down to a scalar value. Finally — and this doesn't come entirely for free, because we need to change the design — the size of a DRAM mat is typically around 512 rows by 512 columns (depending on the DRAM organization it might be 512 columns by 1,024 rows, but it is always around that), because we don't want to make the bitlines and wordlines too long, otherwise the impedance and the capacitance get too high. Since each column is a SIMD lane, we have 512 SIMD lanes in a single DRAM mat — the same number of lanes you have in vector ISAs like AVX-512, exactly that. So we can reuse those vector instructions for the CPU, and consequently their compiler support, to map code onto this substrate transparently to the programmer.

Let me briefly talk about the hardware modifications we make to realize this substrate. Inside the array, what we need to do is segment that global wordline, as I mentioned, with isolation transistors and latches, so that I can address each of the DRAM mats individually, and we need some selector logic to tell which of those isolation transistors should be asserted during the processing-using-DRAM computation.
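To make the fine-grained activation idea concrete, here is a toy model — the names, the 16-mat subarray, and treating the mat mask as a simple prefix are assumptions consistent with the numbers in the talk: an activate now carries a mat mask derived from the SIMD width the operation actually needs, and only the selected mats' segments of the global wordline are driven.

```python
import math

MAT_WIDTH = 512          # SIMD lanes (columns) per mat, matching an AVX-512 vector
MATS_PER_SUBARRAY = 16   # illustrative subarray organization

def mat_mask_for(num_lanes):
    """Enable only as many mats as the operation's SIMD width requires."""
    mats_needed = min(MATS_PER_SUBARRAY, math.ceil(num_lanes / MAT_WIDTH))
    return [m < mats_needed for m in range(MATS_PER_SUBARRAY)]

def fine_grained_activate(subarray, row, mask):
    """Drive only the selected mats' wordline segments; the other mats stay idle
    and can serve an independent operation concurrently (the MIMD execution mode)."""
    return [subarray[row][m] if sel else None for m, sel in enumerate(mask)]

subarray = {7: [f"mat{m}-row7-data" for m in range(MATS_PER_SUBARRAY)]}
print(mat_mask_for(768))                                  # 768 lanes -> only 2 of 16 mats
print(fine_grained_activate(subarray, 7, mat_mask_for(768)))
```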
The next thing we want is to move data across the different DRAM mats, and for that I'll give a bit more background on a DRAM column access. When you access a DRAM column, the data does not go straight from the local row buffer to the I/O interface; it is further amplified along the column access path so that signal integrity is high enough to move the data to the memory channel reliably. For that, the DRAM chip uses some auxiliary structures: there are these things called helper flip-flops, which sit at the edge of the DRAM mats. Once you read a DRAM column, the column data is first moved into those helper flip-flops — amplifying the signal — and then from the helper flip-flops to the global sense amplifiers and eventually to the I/O interface. We are going to use this global and local amplification path, which already exists in the DRAM chip, to move data within and across the different DRAM mats. To move data across mats, we extend the global sense amplifiers with an interconnect network that lets us move data between consecutive, neighboring mats. To move data within a mat, we basically repurpose the helper flip-flops as a temporary buffer for the copy. Say you want to copy data from one column to another: you read the source column as in a regular column access, which latches that column's data into the helper flip-flop; then, instead of discarding that data when moving to another column, we keep it in the helper flip-flop and just change the column address to the destination column. This creates a path between the latched data in the helper flip-flop and the target local sense amplifier in the mat, and another property we rely on is that, by design, the driving strength of the helper flip-flop is higher than the driving strength of the local sense amplifier — so we can copy the data from the flip-flop into the local sense amplifier just by changing the column address. — [Question: is this related to the buffers on registered DIMMs?] — Not really; this is independent of whether you are using RDIMMs or LRDIMMs — the mechanism is employed regardless. It is a common mechanism, because these local sense amplifiers... this whole area is quite area-constrained, and you only want to make the local sense amplifier as strong as it needs to be to sense the small voltage perturbation that the cell puts on the bitline. They don't want to make them really big, because that would occupy a lot of space in the mat design. So what they do is add this extra stage — I like to think of it (it's not exactly correct) as yet another sense amplifier that is a bit stronger than the previous one, but there are far fewer of them: where the local sense amplifiers number 512, here there are only four, so the circuit can be stronger while not consuming much area, because there are fewer of them. As for RDIMMs and LRDIMMs: those buffers are also there for signal integrity, both for data and for command/address, but they are placed on the DIMM itself — you have your DRAM DIMM and, on the PCB of the DIMM, you place some extra buffers so the data or the command/address is amplified before it moves on, again improving signal integrity; but that is outside of the DRAM chip. Okay. So, basically, we use this inter- and intra-mat network to perform in-DRAM reduction operations. Say you want to reduce the element-wise addition of A and B into a single scalar:
Say you want to reduce the elements of arrays A and B into a single scalar. Basically, we first compute the element-wise addition inside both mats, and then, since we are implementing an adder tree, we move the partial results from one mat to a neighboring mat and do the addition again, now involving only that second mat. This process repeats across the columns inside a mat, using the intra-mat path, and across mats, using the inter-mat network. I did not draw the adder tree here, I just realized that, but hopefully when I say adder tree you picture more or less what I mean: you spread the data, add pairs of values at one level, feed each pair of outputs into another adder at the next level, and keep repeating that tree structure.

What is missing now is how we control this computation. The control is a bit more complex than SIMDRAM's control, because now we need to schedule the different mats depending on the target mat utilization for a given processing-using-DRAM operation. So we add some structures for that: a mat scheduler and scoreboards that keep track of which mats are currently in use, and we replicate those control units so that data-independent processing-using-DRAM operations can proceed independently across the different DRAM mats and DRAM subarrays. That is the hardware design of MIMDRAM.

On the software side, we implement compiler passes on top of LLVM that (1) identify loops that are candidates for processing-using-DRAM computation, (2) schedule and place the computation, so the control unit knows that data-dependent operations must execute in the same mat while independent operations can execute in different mats, and (3) generate code. The first pass is responsible for code identification: we basically reuse what LLVM already provides for loop auto-vectorization. We take a loop, auto-vectorize it, and that produces vector instructions; instead of emitting regular CPU vector instructions, we later convert them into processing-using-DRAM operations. The auto-vectorization engine also tells us the target vectorization factor for each instruction. For example, this loop operates over 1,024 elements, so the maximum vectorization factor we can use is the number of scalar operations, 1,024. If it were 8K you would fully utilize the subarray; if it were 512 you would fully utilize one mat; and since here it is 1,024 and the mat size is 512, I know I need to allocate at least two mats for this computation. In the second pass, after vectorizing the instructions, I generate the data dependence graph of those vector instructions. Going back to the C code: A and B produce C, and D and E produce F; those two operations can run in parallel because they are data-independent. But the third operation, C minus F, depends on the two previous instructions, so there is a data dependence. This means the first and second operations can execute in two different DRAM mats concurrently, while the third operation has to wait for the previous two and execute serially after them.
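As a concrete, illustrative version of that example (the array names here are mine, not necessarily the ones used in the paper), the loop body below has two independent element-wise additions feeding a dependent subtraction:

```c
#include <stddef.h>

#define N 1024   /* 1,024 elements, so at least two 512-wide mats are needed */

/* Two independent additions (mappable to two different mats, executable
 * concurrently), followed by a subtraction that depends on both results
 * and therefore must wait, and may require an inter-mat copy. */
void kernel(const int *A, const int *B, const int *D, const int *E, int *G) {
    int C[N], F[N];
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];   /* op 1: independent              */
        F[i] = D[i] + E[i];   /* op 2: independent of op 1      */
        G[i] = C[i] - F[i];   /* op 3: depends on ops 1 and 2   */
    }
}
```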
What we do in the scheduling routine is identify those patterns and then use metadata to tell the control unit, later on: when you fetch this instruction and that instruction, execute them in parallel; but when you fetch the third instruction, wait, and execute it serially after the previous two. I also need to recognize that, since the first instruction executes in one mat and the second in another, and the third depends on both, data has to move from one mat to the other, so we also insert the data-movement instructions that use those interconnects during the computation. Finally, the third pass just generates the binary code and emits the bbop instructions that trigger the computation.

The MIMDRAM paper is quite dense, because we tried to cover a lot of the system support needed for this to work in a real system, and one of the key pieces is data allocation and alignment. I need a way to guarantee that given data structures are placed in given DRAM mats, and aligned, so that the computation can fully utilize the substrate. Basically, I need to influence the memory allocator so that data that will be operated on together ends up inside the same DRAM mat, while operations that are independent of each other get spread across different DRAM mats. To do that we create a new allocation primitive, essentially a PIM-aware malloc, that builds on huge pages. Huge pages give you the operating-system guarantee that a physical frame is contiguous in the physical address space, but they do not guarantee that those physical frames are contiguous inside a DRAM subarray, or a DRAM row, or anything like that, because the DRAM interleaving scheme spreads a frame across the different elements of the DRAM hierarchy. So we need extra information: we reverse-engineer the DRAM interleaving scheme, which gives us a function that tells us, for a given address, how that address maps across the different DRAM channels, banks, columns, subarrays, and mats. If I want two arrays to land in the same DRAM mat, I basically apply the inverse of that function, so I know how the relevant address bits should be set, and I can steer the allocation accordingly. This is much more involved than I am making it sound, but I will not go into more detail here.
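As a rough sketch of that idea (the bit positions below are made up purely for illustration; the real, reverse-engineered interleaving function is device-specific), deciding co-location boils down to extracting the mat index bits from a physical address and making sure two allocations agree on them:

```c
#include <stdint.h>

/* Hypothetical DRAM address mapping, purely for illustration:
 * assume the mat index is formed by XOR-ing two small bit fields of the
 * physical address. Real mappings must be reverse-engineered per device. */
static inline unsigned mat_index(uint64_t paddr) {
    unsigned lo = (paddr >> 13) & 0x3;   /* assumed bits 13-14 */
    unsigned hi = (paddr >> 17) & 0x3;   /* assumed bits 17-18 */
    return lo ^ hi;                      /* assumed XOR interleaving */
}

/* Two buffers can be co-located for in-DRAM computation only if their
 * corresponding frames map to the same mat. */
static inline int same_mat(uint64_t paddr_a, uint64_t paddr_b) {
    return mat_index(paddr_a) == mat_index(paddr_b);
}
```

The allocator's job is then to pick physical frames whose addresses satisfy this constraint, which is why it needs both huge pages and the reverse-engineered mapping.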
We again evaluate this using the gem5 simulator, with 12 workloads from four benchmark suites. In terms of performance and efficiency, when executing those 12 applications, MIMDRAM significantly outperforms the CPU, the GPU, and SIMDRAM; and when multiple applications execute at the same time, MIMDRAM can outperform SIMDRAM even further, because we can now run those applications across the different DRAM mats of a single DRAM subarray instead of having to spread them across different DRAM banks.

We also compare MIMDRAM against two other processing-in-memory architectures. The first, DRISA, was published in exactly the same year as Ambit, in the same conference and even the same session, and it does the same thing in a different way: while Ambit performs triple-row activation without incurring much area cost, DRISA takes a completely different approach, goes full accelerator mode, and adds NOR gates to each bitline of a DRAM subarray so that you can do Boolean operations in DRAM. We compare against that substrate, and also against a processing-near-memory design that places a scalar ALU engine at the edge of the subarray to perform scalar operations. We see that MIMDRAM outperforms both in terms of performance per area, because we do not incur much area cost, whereas both of the others add circuitry to do the computation. There are many more results in the paper that I will not cover here. What I will cover is what is left to be done, and there are things left to be done, because I have not graduated yet, and hopefully even after that there will still be new things to do, since other students need to graduate too.

Yes, a question. The throughput is higher because we allocate only as many SIMD lanes as the operation needs, that is, as many as the vectorization factor the compiler identifies. If, for example, the vectorization factor of your loop is half of 64K, so 32K, the SIMD utilization of SIMDRAM would be 50 percent, while MIMDRAM's would be 100 percent, because we only allocate those DRAM columns; and on top of that we improve throughput, because the other half of the substrate can be doing something useful rather than computing on zeros.

Another question. All of this builds on prior work, FIGARO, also from SAFARI, which uses a similar mechanism to move columns across DRAM subarrays and banks. There we did SPICE simulations to measure the latency of that movement, essentially of asserting the column multiplexer, and since it is the same column multiplexer here we did not redo those simulations (we would get the same results); we reference and point to that paper instead.

Okay, so some limitations of the current substrate. Data layout and conversion is a big issue that still needs to be solved for these processing-using-DRAM substrates: if you do not have enough computation to amortize the data transposition overhead, that transposition starts to dominate the execution time of your application. And this is not only true for SIMDRAM or MIMDRAM; you might think "you created this problem yourself and now you are fighting it", but it is not only DRAM-based processing-using-memory that has it. SRAM-based processing-using-memory architectures suffer from exactly the same problem, because they also need to transpose the data and operate in a bit-serial manner if they do not want to blow up the area with interconnect, and the same thing happens for Flash-based substrates.
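To make the transposition overhead concrete, here is an illustrative (and deliberately naive) sketch of what converting data from the usual horizontal layout to the vertical, bit-serial layout involves: every word has to be sliced into bits and re-packed, which is pure data movement with no useful computation.

```c
#include <stdint.h>

#define WORDS 32   /* illustrative: transpose a 32 x 32-bit tile */

/* Horizontal layout: in[w] holds all 32 bits of word w.
 * Vertical (bit-serial) layout: out[b] holds bit b of every word,
 * which is what bit-serial processing-using-memory substrates operate on. */
void transpose_to_bit_serial(const uint32_t in[WORDS], uint32_t out[32]) {
    for (int b = 0; b < 32; b++) {
        uint32_t row = 0;
        for (int w = 0; w < WORDS; w++) {
            row |= ((in[w] >> b) & 1u) << w;   /* gather bit b of word w */
        }
        out[b] = row;
    }
}
```

If the kernel that follows only touches each element once or twice, this kind of re-packing can easily cost more than the in-memory computation saves, which is exactly the amortization problem mentioned above.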
It is just the execution model: an artifact of the execution model that spends the least amount of area on interconnect. We are already trying to solve this problem, and I will give a small spoiler on how it can be solved: take figure 4 of the SIMDRAM paper and rotate your head like this, and you will see how it needs to be solved. I am not going to say much more, because I am writing a paper for ISCA and I do not want to spoil it, but that is basically it: you just tilt your head.

The other problem is the high latency of bit-serial operations. As I mentioned, multiplication and division scale quadratically with the target bit precision of your computation, which means the latency gets quite high for those operations; quadratic scaling means, for example, that going from 8-bit to 32-bit operands makes a bit-serial multiplication roughly 16 times slower. We also did some work, together with a colleague here, that looks into alternative implementations of those micro-programs, leading to more efficient multiplication and division operations.

Application scope is a big thing as well. I personally try to be general purpose in my work, which is why you always see a bunch of benchmarks from various benchmark suites, but the fact is that industry works by finding killer applications, so we need to find what the killer application for these processing-using-DRAM architectures will be. Right now the trend is Transformers and neural networks, but what those models mostly do is GEMV, general matrix-vector multiplication, and as I mentioned multiplication is painful here, so it is not necessarily a good fit for this type of substrate. There are works in the literature showing that, depending on how you design the system, you actually can accelerate Transformers with these bit-serial processing-using-memory architectures, but more needs to be done there.

We also need more space-efficient LUT computation. The way we do lookup-table computation in pLUTo is by taking the lookup table and indexing it directly, which means the size of the lookup table is limited by the size of your DRAM subarray, and that is not great: if the bit width of your lookup-table inputs starts to grow, the table no longer fits in a single DRAM subarray. One way of solving that is hashing: you hash the input and map the hashed input into the DRAM subarray. We started looking into this but did not follow up, so this is another spoiler; if you want to work on it, feel free. You need to find a good hashing mechanism for those lookup-table computations.

Another spoiler is execution models for throughput-oriented execution. In MIMDRAM we take SIMD instructions and exploit SIMD parallelism only to execute processing-using-DRAM computation, but SIMD parallelism alone is not good enough for a throughput-oriented accelerator. We see this all the time with GPUs, which combine SIMD-level parallelism with thread-level parallelism so that they have good enough utilization to keep all of the SM cores on the GPU busy.
So we need to move to a more throughput-oriented execution model, so that we can fully utilize all of the DRAM subarrays, all of the mats, all of the banks, all of the chips, all of the modules, all of the channels; it scales up really quickly if you want to use the full hardware for computation. We are working on this, and we can improve it one step at a time, especially since there are PhDs still to be done on top of it. So that is MIMDRAM and pLUTo. I do not know if you have further questions on those; otherwise I will pull up one more topic and delay your break a little. I will just give this introduction to processing-near-memory architectures, and then after the break we get to the real-world systems.

This is going to be a shift: we move from processing using memory to processing near memory, where we want to add logic near the memory itself to do the computation. Something that is really important to know if you want to design this type of system is that you cannot do whatever you want, and this annoys me a lot as a reviewer sometimes: people propose processing-near-memory or processing-in-memory architectures as if they were a silver bullet for any problem, and they are not. When you do processing near memory you operate under very strict trade-offs when deciding which logic to add for computation.

The first trade-off is area. When you design a memory chip and then embed logic inside it, you always need to remember that the memory chip is not there for computation; it is there to store bits, densely. Once you start messing with that, you decrease the density of your memory chip. It might not look like a big deal ("okay, maybe I don't care about density"), but the moment you decrease density you are also potentially increasing the memory bottleneck your application suffers from: if you do not have enough space to store your data, at some point you will have to move data in and out to do the computation, and you end up back in a processor-centric setup. We will see how this plays out in the real-world chips I mention later, but one good example comes from the period, roughly 2013 to 2020, when there was a hype of designing processing-near-memory architectures around HMC, the Hybrid Memory Cube. HMC does not exist anymore, but today's HBM chips are quite similar in scope: 3D stacking of DRAM layers on top of a logic base die, and in that logic base die you could put some computation. But you were limited by the area available in the base die: in the HMC of that time it was about 4.4 mm² and 312 mW per vertical partition (called a vault) for computation, so it was quite limited. I tried to trace back where this number originally comes from and could not find who started it, so it might not be 100 percent accurate or traceable, but the fact remains that
the space is limited and the power delivery network is not strong enough to power whatever you want, so you need to take that into account. The second problem is thermal constraints. We are not used to having heatsinks on top of DRAM chips; heatsinks and active cooling are designed for processors or GPUs. We want to keep the DRAM passively cooled, otherwise we lose the energy-efficiency edge of processing-near-memory architectures, and if the processing capability of the chip becomes too power-hungry and starts generating a lot of heat, we need to move from passive to active cooling, which is an active research field even for conventional 3D-stacked architectures. That will be costly, so it also needs to be taken into account. The third constraint, and perhaps the most important one, the reason it took us so long to get processing-near-memory architectures, is that manufacturing logic together with DRAM is extremely difficult, because the two manufacturing processes have been heavily optimized for contradicting trade-offs: logic processes have been optimized for speed, while DRAM processes have been optimized for density. If you try to implement logic in the DRAM manufacturing process, it is not that you cannot, but since the process was not designed for that, you end up with logic that is slower than if you had manufactured the same logic in a proper CMOS logic process. There is an old paper that analyzes this and shows that Boolean gates built in a DRAM manufacturing process can only be clocked at a lower frequency than the same gates built in a logic process, and we will see that this strongly affects the clock frequency of the processing-near-memory hardware that manufacturers are putting out today. So what this means is: processing near memory is great, you will alleviate data-movement bottlenecks, but as a designer you have a really difficult task, because you are competing against much more powerful out-of-order cores or GPUs while your hands are more or less tied. You do not have much area to spend, you do not have much power to spend, you cannot aggressively cool the system, and whatever you design is going to be clocked slower. It is a really interesting trade-off point to be in; even though it is challenging, it is feasible. All of the works Onur mentioned, for example the Tesseract paper, the Google neural-network acceleration work, and the Google workloads for consumer devices, deal with those trade-offs, and in the end they reach design points that provide more performance or energy efficiency than the baseline CPUs or accelerators. Before going to the break: we are going to talk about real processing-near-memory architectures next, and this is a great time to think about working on processing in memory, and processing near memory in particular, because when I started doing processing-in-memory research in 2017 it seemed like a dream; I thought that maybe in 15 years we would have those chips available on the market, and
that by the time I graduated I would have to go work on something else entirely. But now, six years later, we have prototypes, and prototypes not only from memory manufacturers but also from server manufacturers, that treat processing-near-memory architectures as concrete, serious design points to accelerate key applications in the market. So it is great for the community that this is happening right now, and it is particularly great for me, because I need to find a job next year. We will talk more about those processing-near-memory architectures when you come back from the break; let's take a 10-minute break and be back at 3:18.

Okay, welcome back. I might go a little bit over time, but if you need to leave, feel free to leave. I will get to the lab three part before we finish, because it is important; we do have lab three (I said lab four earlier, I meant lab three). I guess there are no burning questions on the processing-in-memory architectures I just talked about, so let's talk about some of my future employers. The first commercially available processing-near-memory architecture that was publicized came from a French company called UPMEM. Just to give a brief historical perspective: this is actually not the first time prototypes of processing-in-memory architectures have been built. There was a big project in the late 1990s and early 2000s called IRAM, from people at Berkeley, that wanted to do processing near memory; they actually taped out chips, but it never reached commercial deployment. The UPMEM architecture is a bit different: they manufacture the chip, and you can go to their website, they have servers, you can buy time on those servers and run your applications on these chips, and you can actually buy the DIMMs and use them. UPMEM, as I mentioned, is a startup from France. They propose processing-in-DRAM engines built on 2D DRAM, so not 3D-stacked memory, just regular DIMMs; inside these DIMMs you have UPMEM chips, and inside the UPMEM chips you have both DRAM banks and simple in-order cores. I am going to give some details on this architecture, because it is the baseline architecture you will use in lab three. The UPMEM architecture follows an accelerator model: even though in theory these DIMMs could serve as your main memory, that is not the case, at least for now, because it simplifies system integration a lot if you keep the address space of the PIM accelerator disjoint from the main memory of your host processor. So they follow an accelerator model, where the UPMEM DIMMs coexist with the conventional DIMMs used as main memory for the host system, and in particular what is called a loosely coupled accelerator. A loosely coupled accelerator is any accelerator that requires explicit data movement between the host memory and the accelerator's memory; one good example of a loosely coupled accelerator is the GPU,
as opposed to the alternative of a tightly coupled accelerator, which sits in the same address space as the processor itself. A good example of a tightly coupled accelerator is the vector (SIMD) units in your CPU: they sit there as a component of the CPU die, but they are in essence accelerators. Initially they were called MMX, multimedia extensions, and they were added to accelerate multimedia applications, which at that time were identified as the key applications. So the UPMEM chips follow this loosely coupled accelerator model, where we need to explicitly move data between main memory and the accelerator, and explicitly launch the kernel that will execute on the UPMEM chip; again, this resembles how you would program a GPU. This is from the initial patent that the UPMEM founders wrote and published, I do not remember the year, I think 2016, but don't quote me on that, and it describes exactly this accelerator mode: the host loads the data into the DRAM memory bank, then transfers the command (the program) to the DRAM processor; the processing starts; and when the computation is done, the host can access the output data from the memory bank. And this is the pictorial view from that initial patent: here is the host, the "master", your regular host CPU with a regular DDR interface, and then the UPMEM chips with some DRAM memory and a processor with simple cores plus some scratchpad memory. As I mentioned, UPMEM-based PIM systems are organized in a DIMM form factor, so they fit into a regular DIMM slot on your motherboard; nothing fancy like the silicon interposers of 3D-stacked chips we will see later is required, and they sit alongside the main memory of your system. A single UPMEM DIMM contains 8 or 16 PIM chips, depending on whether the DIMM has one or two ranks. Inside each PIM chip there are eight 64-MB DRAM banks, called MRAM here, and eight DPUs, DRAM Processing Units, which are the cores used for the computation; that gives 64 DPUs per rank, since you have eight chips per rank and one or two ranks per DIMM. Inside each DPU there is also an instruction RAM (IRAM), which stores the instruction sequence the core executes, and a working RAM (WRAM), an SRAM scratchpad that holds the data the pipeline operates on. You cannot operate directly on the data in the DRAM bank: you move data from the DRAM bank to the scratchpad, and from the scratchpad into the registers of the DPU core. And this is a picture of one of the systems UPMEM gives you access to: a motherboard with regular CPU sockets, DRAM DIMMs holding the main memory, and the PIM-enabled memory; in this full configuration you have, in total, 2,560 of those DPU processing engines across the system.
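Just to sanity-check those numbers (the per-DIMM figures follow directly from the description above; the DIMM count for the full system is my own back-of-the-envelope inference, not a number stated here), a quick arithmetic sketch:

```c
#include <stdio.h>

int main(void) {
    /* Per-chip and per-DIMM figures from the description above. */
    const int dpus_per_chip  = 8;
    const int bank_mb        = 64;   /* one 64-MB MRAM bank per DPU          */
    const int chips_per_dimm = 16;   /* assuming a 2-rank PIM DIMM           */
    const int dpus_per_dimm  = dpus_per_chip * chips_per_dimm;   /* 128      */
    const int mram_per_dimm  = dpus_per_dimm * bank_mb;          /* 8192 MB  */

    /* 2,560 DPUs in the full system would imply 2560 / 128 = 20 such DIMMs
     * (an inference, not a figure from the lecture). */
    printf("DPUs/DIMM = %d, MRAM/DIMM = %d MB, DIMMs for 2560 DPUs = %d\n",
           dpus_per_dimm, mram_per_dimm, 2560 / dpus_per_dimm);
    return 0;
}
```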
We are going to use this vector-addition example later on to illustrate how you program the system. Say you want to add two input vectors, A and B, and produce an output vector C. To do that, you have to move the data of A and B to the MRAM of the DPUs, partition the data across the different DPUs in the system, and then partition the data inside each DPU across the software threads, called tasklets, that execute in parallel inside a single DPU, because, as I will detail, the processor inside the DPU is a multithreaded processor. Again, this resembles a lot how you would program a GPU, and we will see more details later in this lecture. Since this follows a loosely coupled accelerator model, you need to manually transfer data between the CPU and the DPUs, and there are three types of data transfers that we will use during the labs. The first is a serial data transfer, which moves data between the CPU and a single DPU: it moves data from main memory to one DPU in one of the PIM chips. The second is a parallel transfer, where you move chunks of data between main memory and a set of DPUs in parallel. The third is a broadcast, which works only in the CPU-to-DPU direction: you copy a single buffer to multiple DPUs at the same time. We will see how these transfers are programmed in more detail later on. One important thing about the UPMEM PIM system is that there is no direct communication between the DPUs, even within a single DIMM. This means that if you need to move data from one PIM chip to another, you must go through the CPU: copy the data back to main memory and then write it to the destination DPU. This matters for communication patterns like merging partial results or distributing intermediate results; all of these require ping-ponging data back and forth between the DPUs, the CPU, and main memory. I know this is counterproductive with respect to the goal of processing in memory, but this is the first realization of the architecture, and adding connections between those chips, like a network-on-chip, would be quite costly to implement; we know it is on their roadmap, though. Now some more detail on the microarchitecture of the DPUs. Again, this is a figure from the patent they initially published: we have, as I mentioned, the memory bank (the MRAM) and the processor, which has the instruction memory, the scratchpad memory, and the DPU pipeline. And this is the same chip in a slightly more detailed, cartoonish version: a control interface and a DDR4 interface that let the host system communicate with the PIM chip; the 64 MB of MRAM that store the data; a DMA engine that allows the pipeline to move data between the DRAM bank and the scratchpad; the 64-KB WRAM scratchpad that the pipeline reads; the 24-KB instruction RAM; and a 14-stage, in-order, multithreaded pipeline inside each
single DPU. As I mentioned, this is an in-order pipeline, and here you see one of the direct effects of manufacturing logic in a DRAM manufacturing process: the frequency of this processor is quite low, on the order of 400 to 425 MHz; the processor in your phone runs about four times faster than that. Question: yes, the power comes from the DIMM, from the DIMM slot, which is able to supply the power for the cores; they do not have a second power connector or anything like that. As I mentioned, this is a multithreaded processor, and you can have up to 24 hardware threads working at the same time. I am not going to show it here, but it is often not beneficial to have all 24 hardware threads active, because this is a fine-grained multithreaded pipeline: you only need roughly as many hardware threads as pipeline stages, at most, to keep the pipeline stages fully occupied during execution. Based on some microbenchmarks, we saw that 11 hardware threads working at the same time is enough to fully utilize the pipeline slots and achieve peak throughput. As I mentioned, there are 14 pipeline stages, and they are regular stages, not much different from the processors you have seen before: fetch, dispatch, register read, operand formatting, ALU, and writeback/merge stages. The processor has its own ISA, a 32-bit instruction set that resembles RISC-V but is not RISC-V, let's say RISC-like, and the UPMEM people extended LLVM and Clang so that you can write C code and generate binaries for this architecture. Inside the pipeline there is not much more that is interesting for us beyond how you program it, which we will see a bit later in this lecture. If you are interested in the microarchitectural aspects of UPMEM, and in benchmarking and understanding how different features interact on the UPMEM architecture, I invite you to check Juan's lecture on that. We also put out an analysis of this real chip, where we test the different components and see how they perform, and we provide a benchmark suite, called PrIM, to stress the different components of the UPMEM system; Juan has also given that talk elsewhere, and the lecture is available online. So there are many lectures on this topic. We got access to these UPMEM servers at SAFARI, and we have been using them to accelerate different memory-bound applications: we started by benchmarking the architecture and then moved on to more complex operations, like classic machine-learning training (decision trees, K-means, logistic regression), sparse matrix-vector multiplication, transcendental functions (which I will say a bit more about), sequence alignment, homomorphic encryption, reinforcement learning, and recently a paper on running distributed optimization algorithms on this system, plus an upcoming paper at MICRO that shows how to accelerate neural networks on the UPMEM architecture. So that was the UPMEM architecture; I am going to come back
to the programming part, which you will use in your lab, so even if you do not fully follow everything now, hopefully it becomes more concrete when I get there; for now I just want you to have an overview of the hardware design. Any questions on UPMEM? Okay. Now I am going to speed up a bit, because we are not going to use the following architectures in the labs, but it is still important for you to know what is out there on the market as alternatives to UPMEM. Right after UPMEM, in 2021, Samsung also came up with their own processing-near-memory architecture, and here they use 3D-stacked chips to build it. They presented this work at the ISSCC conference, and they keep publishing improvements on this architecture at other venues, which I will also mention. As background: a 3D-stacked chip, or HBM (High Bandwidth Memory), which is what they use here, is a memory chip with DRAM layers stacked on top of a buffer layer (in HMC times this was called the logic layer; in HBM it is the buffer layer). The buffer layer contains I/O circuitry, self-test, and test/debug logic; it is not there for free. Some components that used to be part of the memory controller were moved into this buffer layer to improve the connection between the host and the memory itself. To connect the different DRAM layers to the buffer layer, they use a technology called through-silicon vias, or TSVs. This is a pictorial view of what you get: the system-in-package, the HBM chips, the round structures being the TSV connections, the substrate, and the connection between the host and the DRAM chip, which goes through a silicon interposer, with micro-bumps connecting the host die and the memory die to that interposer. This is a bit different from the DIMM form factor, because you cannot plug and unplug it: once the device is in the package, it is there and cannot be removed. But this is what you would get, or rather would buy, except you basically cannot buy one because everyone has bought them already, in a high-end NVIDIA GPU that uses these HBM chips to provide high bandwidth. As I mentioned, the buffer layer connects to the host via the silicon interposer, and in the HBM2 realization of this chip you have what are called pseudo-channels, groupings of those TSVs, each with four banks. The access granularity differs from regular DRAM: the transfer is wider, because a DDR4 column transfer is 64 bits while here a column read transfers 256 bits, but the row size is smaller, 1 KB versus 8 KB in DDR4. It is a trade-off: you want more DRAM banks so you can operate more of them in parallel. In the original paper, Samsung called their architecture FIMDRAM; they keep changing the name, I cannot keep
up: it was initially called FIMDRAM, then it became HBM-PIM, and now it is Aquabolt with some XL appended at the end; I do not know how many Xs and Ls there are by now, but it is all the same thing, they are talking about the same architecture. Different from the HMC-like design I mentioned, where you would put the computation in the logic die, what Samsung does is a bit different, and I guess because they are a memory manufacturer they can do this: they take out some of the memory dies that are originally in an HBM2 chip and replace them with compute-capable dies. So what you have is regular memory dies together with PIM-enabled dies in the same stack. This is nice because you are no longer limited by the area of the buffer die, but now the density of your DRAM chip is reduced; I believe it goes from 8 to 4 (gigabytes or gigabits, I do not remember which, but it is half). So there is always the trade-off I mentioned: area, capacity, power, density. They have a custom circuit implementation inside, but that is not so interesting for us, we are not circuit people; what is interesting are the block diagrams. Here is the organization of a regular HBM2 memory die, and here the die of the FIMDRAM architecture. The difference is that between every two DRAM banks they place these PCU blocks, which are basically the cores that do the computation, and each of these cores is shared between the two neighboring DRAM banks. They had some design goals they wanted to maintain while designing FIMDRAM. The first is that the HBM chip stays JEDEC-compliant: there is a specification that defines the protocol for accessing this memory chip, and they want to keep that protocol unchanged. Changing JEDEC is always a mess, because it is basically a contract between the memory manufacturer and the processor manufacturer that dictates how the memory is operated; changing the protocol means a lot of meetings, "I want this", "I want that", and endless back and forth, so it is really complicated. They keep the protocol as it is, and this will impact how the chip is operated. The second goal is to minimally re-engineer, ideally not touch, the DRAM cell array. That area inside the DRAM chip is where they make their money; it is extremely optimized for density, and they do not want to change it or move things around. So they restrict themselves to adding computation in the peripherals of a DRAM bank; they do not love changing the peripherals either, but there are already decoders and other logic there, so adding more logic there is acceptable. As I mentioned, they have one PIM unit for every two DRAM banks, and those PIM units are
going to be fixed-function units; these are not general-purpose cores. The target application they want to accelerate in these chips is neural-network inference, so what they essentially design are MAC units, so that they can do multiply-accumulate operations inside the DRAM chip. The fixed-function unit is a SIMD floating-point engine that performs 16-bit floating-point operations. For this to work without changing the protocol, as I said, they repurpose the read and write commands to trigger the in-DRAM computation: the memory controller issues a particular sequence of read and write requests to certain memory addresses, and based on that sequence the chip is either in normal memory mode or in PIM mode; we will talk a bit more about those modes in a moment. Since this engine sits at the I/O of the bank, its data input is as wide as a column access of the bank, 256 bits; divided across the SIMD lanes, that gives 16 bits per FPU lane, because there are 16 of them at the periphery of the bank. As I mentioned, the way to trigger the computation is by issuing sequences of activate and precharge commands to predefined, fixed address locations in certain banks, and this lets you trigger the in-DRAM operations either in a single bank or across all of the DRAM banks inside the chip; the latter is called all-bank PIM mode in their nomenclature, and it lets you exploit bank-level parallelism. This is the design of the FPU itself (I think they call it the PCU now; they keep changing names). We have the interfaces to the banks: since the hardware is shared between neighboring banks, there is an interface for the even bank and an interface for the odd bank; then the control logic holding the instructions to execute, a register file to store the data read from the banks, and the execution engines for multiplication and addition; again, the target is multiply-accumulate. This is another view of the same thing: a command buffer, a pipelined decoder, a sequencer, and the register files. The instruction set is quite limited, because they are only targeting neural-network MAC operations: there are just nine RISC-like instructions, basically addition, multiplication, MAC (multiply-accumulate), MAD (multiply-and-add), data movement between the global register file and the datapath of the SIMD processor, a simple jump, and a no-op. So this is not fancy at all: there is no real branching, the jump always goes to a particular fixed location, so it is quite restrictive, but again, they are targeting a single kernel from an important application. And this is a comparison table from their original paper, where they compare their architecture at the time, FIMDRAM, with the UPMEM PIM chip. Some
interesting things to note in this table: they use different DRAM types, UPMEM uses DDR4 while FIMDRAM uses HBM2; the manufacturing-process row is not so interesting; the capacity is 6 GB per cube, versus 8 GB per DIMM for UPMEM, and it is important to say that a cube is much more expensive than a DIMM, because manufacturing the 3D stack is much harder than manufacturing a 2D DRAM DIMM. Since they are 3D-stacked, they have much more bandwidth per cube than the bandwidth available inside a DIMM. Similar to what we saw for UPMEM, the clock frequency of the processor is not very high, again because manufacturing logic in a DRAM process is complicated even for a memory manufacturer like Samsung, but thanks to the available parallelism they still achieve quite a high peak throughput, 1.2 TFLOPS per cube; and since they target neural-network inference they support floating-point operations, whereas the UPMEM chip mostly supports integer arithmetic. If you are interested in this architecture, we have more lectures on it. As I mentioned, they keep prototyping architectures built on top of this FIMDRAM design. In 2023 they came up with another architecture targeting Transformers specifically: the observation is that the generative (token-generation) step in Transformers is memory-bound, so they built an architecture that combines this HBM-PIM design with AMD GPUs, and they show that doing the computation collaboratively with the GPU is much more beneficial than doing it on the GPU alone. They also have a similar implementation for mobile devices, generative AI on edge devices, and there they no longer use HBM chips, because you do not put HBM on a small-form-factor device like a phone; instead they use LPDDR (low-power DDR) memory targeted at those devices. Even though the memory technology is different, the architecture of the PIM units is exactly the same: some control, a floating-point multiplier and adder, and PIM units shared between neighboring DRAM banks; the same organization, just a different memory technology. Again they show performance and energy gains compared to running GPT-2 inference on the baseline system. They also have a solution for scale-out systems: if you want to train a really large model, you need memory expanders that let your servers extend their memory capacity via a protocol called CXL, and here they have two setups, adding the PIM logic either to the controller of the memory expander or to the memory chips themselves, similar to what we just saw for HBM2 and LPDDR. For these they do not have an actual design, it is more conceptual, but it would probably look quite similar to the other two. Samsung also has another solution, for recommendation systems: the AxDIMM work they published in 2021. Someone asked earlier about LRDIMMs; this one is actually an LRDIMM. What they have is a DIMM, and in an LRDIMM you have a buffer that is used, as I
mentioned, for signal integrity, for both the data and the command/address signals. What they do is embed some simple compute logic into this buffer on the DIMM itself, so that they can accelerate recommendation systems. In the interest of time I will not cover in much detail how this works, but basically they have an FPGA-based framework: this is the DIMM, and inside the DIMM there is this custom fabric (the FPGA) plus two ranks, and they use the FPGA to implement logic that accelerates recommendation systems. The key operation they accelerate is element-wise summation for the target recommendation application. I will not explain the design in detail; what is important to know is that this is a custom fabric, so you could in principle implement whatever you want, but since they care about recommendation systems, the key component they implement is this adder unit, a SIMD-like engine that does the element-wise summation required by recommendation-system processing. They also use this architecture to accelerate other things, like sparse-length-sum and database operations, and again, if you are interested in this architecture, I invite you to check Juan's lecture on it. Another memory manufacturer, SK hynix, also has its own PIM chip: it is called AiM, and there is a follow-up called AiMX; they keep adding Xs to things. What changes here? Again the target is neural-network inference, and again they use a different type of memory: GDDR6, a high-bandwidth memory type typically used, for example, in GPUs. The design of the chip is fairly similar to the Samsung design I just described, but, unlike Samsung, they have one processing engine per bank, not shared between banks, and they also have a supplementary buffer that lets them move data into and out of the DRAM chip. I was going to go through the command set in detail, but it is not that interesting; what is important to understand is that this one is not JEDEC-compliant: they go to the extent of creating new DRAM commands so they can trigger the computation in this architecture exactly the way they want. For example, they add new activation commands, like activate-4 and activate-16, which activate rows in four or sixteen banks simultaneously. Their processing unit uses multipliers together with an adder tree, accumulation, and an activation function to accelerate GEMM and GEMV operations in neural-network kernels, and there are different ways of operating it, depending on whether the inputs come from outside the chip, from inside, or a combination of both. What I want to show next is the comparison table again: now they compare their design against the FIMDRAM architecture and the UPMEM architecture. What differs among the three is that this one uses GDDR6, manufactured in a smaller technology node; their density is 4 gigabits
per chip. Looking at the processing rows: for some reason they managed to reach a much higher frequency than both UPMEM and Samsung; they do not fully explain why, but good for them, they get to 1 GHz. The throughput of their system is 1 TFLOPS per chip, so 32 TFLOPS in total if you have 32 chips in the system, and they target the bfloat16 (brain floating point) precision for neural-network inference, with support for several different activation functions for neural networks. They have an entire computing stack for this, including SDKs and so on, and you can check their lecture online to learn more. Finally, I will very briefly talk about the one from Alibaba. Alibaba is not a memory manufacturer per se, they are a server and cloud provider, but they also came up with their own PIM chip, and here they use hybrid bonding to stack memory and logic. Hybrid bonding is a process, let me show it here, where you manufacture your DRAM wafer, manufacture your logic wafer in its own manufacturing process, then flip them and create a wafer-to-wafer connection, using copper pads, from one wafer to the other. This is quite a new technology, so it was impressive to see a server provider already looking in this direction. Note that you cannot keep stacking more layers this way, because the bonding is face-to-face: you take the two wafers and essentially glue them together. They target this architecture at recommendation systems, similar to the AxDIMM design. Again there is a comparison with the FIMDRAM architecture and UPMEM: they use LPDDR4 DRAM instead of HBM, manufactured in a somewhat larger technology node; their density is 4.5 GB per chip; and their frequency is, again, fairly low. I actually do not know why it is so low, since these are manufactured in different nodes. What is important is that their bandwidth-per-capacity, whatever that metric really means, is higher than the other two; everyone keeps defining metrics to show that, by that metric, their design is better than the others. But energy is energy, and the energy per bit here is genuinely lower than the other two, because they use LPDDR4. If you are interested in this one, again, check Juan's lecture on it. So, finally, we get to what is important for lab three. In lab three you will use the UPMEM SDK to program some kernels, measure some things, and then report what you measure. Overall, some programming recommendations we have learned over time for the UPMEM system: take your application kernel, its input data, and the tasks you need to execute, and parallelize the tasks so that the parallel portions of the code run on the DPUs for as long as possible, so you fully utilize
them. You also want to find data-independent blocks on which the DPUs can operate independently, because the moment they become dependent you have to move data from one DPU to another, and moving data like that requires going through the CPU, which adds extra latency. You should work with as many DPUs as are available in the system; again, this is a throughput-oriented processor and you want to extract as much throughput as possible. And, as I mentioned, you want to launch at least 11 software threads (tasklets) per DPU, so you can fully utilize the pipeline slots of each DPU. The way you program this is through a series of APIs that the SDK provides, and here I will list the key ones you will use during lab three. The first is dpu_alloc: it allocates the number of DPUs you request and creates a DPU set that you then operate on. The syntax is shown here; the surrounding macro is just an assertion that checks whether the call failed. Basically, you allocate the number of DPUs you want and get back a DPU set, which is what you use later to load code and transfer data. You can allocate different DPUs over the course of the program, for example inside a loop, allocating and deallocating as you go, and you deallocate DPUs with dpu_free, the counterpart of dpu_alloc, to which you pass the DPU set. Once you have allocated the DPUs, you need to load the binary of your application onto them: you point to the binary you are going to execute and load it into the DPU set with dpu_load, passing that path. You could launch different kernels onto different DPUs during the execution of an application, but we will not do that in lab three. Then come the data transfers I mentioned: serial, which copies data to or from a single DPU; parallel, which copies chunks of data between the CPU and many DPUs; and broadcast, which copies a single piece of data to multiple DPUs. For serial data transfers, the APIs are dpu_copy_to and dpu_copy_from, depending on the direction: dpu_copy_to copies from the host to the DPU, and dpu_copy_from copies from the DPU back to the host. To do a transfer you need the DPU set, and the syntax is: for each DPU in the DPU set, you do a copy specifying that DPU, the name of a symbol in the DPU's MRAM heap, the offset within that region, the pointer to the data in host memory, and the size of the transfer. Here we copy two buffers, A and B; the MRAM is a linear address space, so after you copy array A, array B starts right where A ends, and you just pass that offset to the dpu_copy_to call.
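To make this concrete, here is a minimal host-side sketch tying together the calls described so far (the kernel binary name "./vecadd_dpu", the array sizes, and the placement of A and B in the MRAM heap are illustrative choices of mine, not something prescribed by the lab handout):

```c
#include <dpu.h>
#include <stdint.h>

#define NR_DPUS        4
#define ELEMS_PER_DPU  1024

int main(void) {
    static uint32_t A[NR_DPUS * ELEMS_PER_DPU], B[NR_DPUS * ELEMS_PER_DPU];
    struct dpu_set_t set, dpu;
    uint32_t idx;

    DPU_ASSERT(dpu_alloc(NR_DPUS, NULL, &set));        /* allocate a DPU set  */
    DPU_ASSERT(dpu_load(set, "./vecadd_dpu", NULL));    /* load the DPU binary */

    /* Serial transfers: one dpu_copy_to per DPU. A goes at offset 0 of the
     * MRAM heap; B starts right after A (linear MRAM address space). */
    size_t chunk = ELEMS_PER_DPU * sizeof(uint32_t);
    DPU_FOREACH(set, dpu, idx) {
        DPU_ASSERT(dpu_copy_to(dpu, DPU_MRAM_HEAP_POINTER_NAME, 0,
                               &A[idx * ELEMS_PER_DPU], chunk));
        DPU_ASSERT(dpu_copy_to(dpu, DPU_MRAM_HEAP_POINTER_NAME, chunk,
                               &B[idx * ELEMS_PER_DPU], chunk));
    }

    DPU_ASSERT(dpu_free(set));                          /* release the DPUs    */
    return 0;
}
```

The kernel launch (dpu_launch) and the copy of the results back to the host (dpu_copy_from, or a batched dpu_push_xfer) are the pieces described next.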
For the parallel data transfers we have a similar process but different APIs: dpu_prepare_xfer and dpu_push_xfer, with a direction to or from the DPUs (DPU_XFER_TO_DPU or DPU_XFER_FROM_DPU). You need to prepare the data beforehand, because this is basically going to be a DMA transfer. The idea is similar: for each DPU in the DPU set you first prepare the transfer with dpu_prepare_xfer, pointing it to the location in host memory that you want to copy from, and then you push everything at once with dpu_push_xfer, passing the direction. Here we are pushing to the DPUs, but you can also push from the DPUs back to the CPU in the other direction. Again you need to pass the offset inside the MRAM; again it is a linear space, so for the first copy the offset is zero and the next one starts right after the data you just copied. You also pass the size of the data, and the last argument is always the default flag here, so you don't need to worry about it.

Finally we have the broadcast, and for that you use dpu_broadcast_to. Again you pass the buffer and the transfer size, but in this case you copy it to a whole set of DPUs, because it is a broadcast. As I mentioned, there is no direct communication across DPUs, so if you need that, you have to use these copies to go through the host and then copy to another DPU or DPU set.

After that you can launch the kernel. To do that you call dpu_launch on the DPU set, and the launch can be either synchronous or asynchronous depending on the parameter you pass here, but mostly you are going to use the synchronous one. You can also pass parameters to the DPUs; to do that you use the DPU transfers again, here for example dpu_copy_to, and you pass as an argument a struct with the parameters that you want to send to the DPU cores.

Going back to the vector addition, this is the DPU-side code for vector addition. Here we have the tasklets; this is similar to CUDA, so you have a tasklet ID, for example, for that given kernel. You tile the data that was moved to the DPU, you do the allocations in the working RAM (WRAM), and you have the data sitting in the MRAM as well. You move the data from the MRAM to the WRAM, specifically using the mram_read operations here, then you do the actual computation, and finally you write the data back from the WRAM to the MRAM. This might sound quite hectic or complicated, but you can always follow this same recipe for whatever code you are going to implement in the lab, and this will be available online so you can refer back to it. What is left is to run the kernel itself, the vector addition here, and this is just the vector addition operation.

You can also synchronize the different tasklets using synchronization routines that the SDK provides: mutexes, handshakes, barriers, and semaphores. You need those synchronization primitives if you are implementing something like a parallel reduction, where tasklets work together on a reduction operation. How do you do that? You divide the data across the different tasklets and each one computes a local sum, but at some point you need one single tasklet to compute the final sum. Basically, each tasklet has an accumulation variable, and once this local computation is done, you collect these accumulation variables and finalize the total sum in one given tasklet, doing a sequential accumulation.
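To make that reduction recipe concrete, here is a hedged DPU-side sketch in the same spirit (my own illustration, not the lab template; the input size, symbol names, and tile size are made up), using the SDK's tasklet, MRAM, and barrier primitives:

```c
// DPU-side sketch of a tasklet-parallel sum reduction (illustrative only).
#include <stdint.h>
#include <defs.h>        // me(); NR_TASKLETS is set at compile time (e.g. -DNR_TASKLETS=16)
#include <mram.h>        // mram_read()
#include <alloc.h>       // mem_alloc()
#include <barrier.h>     // BARRIER_INIT, barrier_wait()

#define ELEMS_PER_TASKLET 256   // hypothetical tile size (multiple of 2, so 8-byte aligned)

__mram_noinit uint32_t input[NR_TASKLETS * ELEMS_PER_TASKLET];  // filled by the host
__host uint64_t result;          // read back by the host after the kernel finishes

BARRIER_INIT(sync_barrier, NR_TASKLETS);
uint64_t partial[NR_TASKLETS];   // one local sum per tasklet, kept in WRAM

int main(void) {
    uint32_t tid = me();         // tasklet ID, similar to a thread ID in CUDA

    // Stage this tasklet's tile from MRAM into a WRAM buffer, then accumulate locally.
    uint32_t *tile = mem_alloc(ELEMS_PER_TASKLET * sizeof(uint32_t));
    mram_read(&input[tid * ELEMS_PER_TASKLET], tile, ELEMS_PER_TASKLET * sizeof(uint32_t));

    uint64_t local = 0;
    for (uint32_t i = 0; i < ELEMS_PER_TASKLET; i++)
        local += tile[i];
    partial[tid] = local;

    // Wait until every tasklet has produced its local sum,
    // then let tasklet 0 do the final sequential accumulation.
    barrier_wait(&sync_barrier);
    if (tid == 0) {
        uint64_t sum = 0;
        for (uint32_t t = 0; t < NR_TASKLETS; t++)
            sum += partial[t];
        result = sum;
    }
    return 0;
}
```

On the host side you would fill the input symbol with dpu_copy_to or a parallel transfer before dpu_launch, and read the result back with dpu_copy_from afterwards.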
That was an extremely fast description of the APIs, but hopefully it at least gives you a guide to which API routines you are going to use during lab three. So, for lab three, I am going to post the handout today. We are not going to use a real UPMEM system, because those sit in a server and we would need to buy time there, but the UPMEM SDK provides a simulator of the UPMEM system, and that is what we are going to use. The handout describes how to install and use the SDK and the simulator, but we also provide a Docker container which you can just launch; everything is already installed, the simulator already works, and you can start implementing your code right away.

Inside the handout you are going to find two template files, for task one and task two. In task one you are just going to play with those data transfers (serial, parallel, and broadcast) and measure their performance using instruction counts. In the second task you are going to implement a simple kernel which basically performs y = y + x, and then you are going to check this kernel with different data formats. You will see that in the files there is a Makefile, and the Makefile makes this quite easy: everything is parameterizable, so you can just change the Makefile, compile for the different data types, measure the instruction counts, and repeat that for each one of them. (I sketch the rough shape of such a kernel after the recap of the programming steps below.) Finally, you are going to implement the vector reduction, and for the vector reduction you will have to use the synchronization barrier. There is also a bonus portion of the lab, where you implement an RGB brightness kernel, if you want to do the bonus. That is it for lab three; I don't know if there are any questions about lab three. I tested it on my computer, which is Intel, but it shouldn't matter; if you have Docker it should just run, and we have a Docker image for Linux and for Windows as well, so it should work on both. Another question? Okay, so lab three is going to be posted today in the evening, after I finish here, and the deadline is in two weeks; that is a soft deadline.

So that was the main thing. As you saw, programming these DPU systems requires you to do a bunch of things: you need to split the data across the different PIM cores, you need to manually transfer the data between the DPUs and host main memory, you need to manually manage the data movement between the MRAM and the WRAM, and you need to manually transfer your output data from the PIM chips back to main memory. In practice, as a programmer you have to go through several steps to implement something: you need to align the data, because the data needs to be 8-byte aligned to fit a single DPU transfer at a time; you have to collect input parameters and transfer them to the different DPUs using those parallel or serial data transfers; you have to launch the computation and collect the results from the DPUs; you have to manage the scratchpad, meaning the data between the MRAM and the WRAM; and finally you have to orchestrate the computation and communication. We know that all of this is quite cumbersome, and you will experience part of that in the lab.
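For the y = y + x task mentioned above, here is the hedged sketch I promised of what a type-parameterized kernel with instruction counting can look like; it is not the lab template, just the general shape, and the names T, TILE_ELEMS, x, y, and nb_instructions are made up for illustration. It uses the SDK's perfcounter interface for the instruction count:

```c
// Illustrative DPU-side sketch of a type-parameterized y = y + x kernel
// with instruction counting (not the lab template).
#include <stdint.h>
#include <defs.h>          // me(); NR_TASKLETS is set at compile time (e.g. -DNR_TASKLETS=16)
#include <mram.h>          // mram_read(), mram_write()
#include <alloc.h>         // mem_alloc()
#include <barrier.h>       // BARRIER_INIT, barrier_wait()
#include <perfcounter.h>   // perfcounter_config(), perfcounter_get()

#ifndef T
#define T uint32_t         // element type, meant to be overridden from the Makefile (e.g. -DT=float)
#endif
#define TILE_ELEMS 256     // hypothetical tile size per tasklet

__mram_noinit T x[NR_TASKLETS * TILE_ELEMS];   // filled by the host before the launch
__mram_noinit T y[NR_TASKLETS * TILE_ELEMS];
__host uint64_t nb_instructions;               // read back by the host after the launch

BARRIER_INIT(end_barrier, NR_TASKLETS);

int main(void) {
    if (me() == 0)
        perfcounter_config(COUNT_INSTRUCTIONS, true);  // reset the DPU's instruction counter

    // Stage this tasklet's tiles from MRAM into WRAM, compute y = y + x, write back.
    T *xw = mem_alloc(TILE_ELEMS * sizeof(T));
    T *yw = mem_alloc(TILE_ELEMS * sizeof(T));
    mram_read(&x[me() * TILE_ELEMS], xw, TILE_ELEMS * sizeof(T));
    mram_read(&y[me() * TILE_ELEMS], yw, TILE_ELEMS * sizeof(T));
    for (uint32_t i = 0; i < TILE_ELEMS; i++)
        yw[i] = yw[i] + xw[i];
    mram_write(yw, &y[me() * TILE_ELEMS], TILE_ELEMS * sizeof(T));

    // Wait for every tasklet before reading the counter, then report it to the host.
    barrier_wait(&end_barrier);
    if (me() == 0)
        nb_instructions = perfcounter_get();
    return 0;
}
```

Recompiling this with different -DT values (as the lab's Makefile does with its own parameters) and reading nb_instructions back on the host is the kind of experiment task two asks for.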
But we have been working on designing frameworks that try to ease the programmability of the UPMEM system. This is the DaPPA framework, which we cannot use for the lab, because then you would not need to do any of this yourselves. Basically, this framework uses the concept of data-parallel patterns to abstract away those tasks that the programmer needs to do, and we implement five data-parallel patterns here. This is pattern-based programming: the user just defines how data is mapped between inputs and outputs and defines the function to apply, and then you can combine those different parallel patterns to express some form of computation. The way this works is that we have the concept of a pipeline, and this pipeline allows you to apply different transformations across different stages. Each stage represents a different data-parallel pattern, and since the data flows sequentially across the stages, the framework automatically handles the data-transfer and data-orchestration routines that the UPMEM system requires. So this DaPPA framework that we implemented compiles and executes each stage of the pipeline automatically, completely independently from the user. The way it does that: say you want to do this two-stage computation here; the only thing you need to do is define the stages required for the computation, in this case a map stage and a reduce stage, and then the framework dynamically compiles this code down to UPMEM binaries and runs the application on the UPMEM system itself. So this is the DaPPA framework, the framework we designed to try to reduce the complexity of programming real processing-in-memory systems. Onur also mentioned the SimplePIM framework, which is an earlier implementation of a similar idea.

We also implemented TransPimLib, which is a library of transcendental functions for the UPMEM system, because the UPMEM hardware does not provide hardware support for those transcendental functions. Basically, we created APIs that approximate transcendental functions using either lookup tables or CORDIC-style implementations, and we implemented several key transcendental functions there; the API is available online.

So this concludes my talk; I think that was the last lecture on memory-centric computing this semester. Sorry for going over time, but hopefully you enjoyed it. Again, this is a great time to do processing-in-memory research: there are still a lot of things to be done across the memory stack, and we can do all of them one step at a time. And with the deployment of PIM in the field right now, in real products, more than ever we have a great incentive to push for these memory-centric architectures. Next week I am flying to MICRO and we are going to have a PIM tutorial there, with invited talks from other PIM researchers, and it will be available online on YouTube. If you want to join, feel free to join; you will hear other people talking about memory-centric computing, so you will see that I am not the only one inventing all of these things out of my head because I want to get a PhD. It is good for you to see other people talking about their own problems related to processing-in-memory. And again, we are currently working on the new version of the PIM book chapter, which I need to finish now, and which will be available soon. So thank you so much for your attention, sorry again for going over time, and if you have any questions related to lab three you can post them on Moodle; I will try to answer everything. Thank you, everyone. I don't know if there are any burning questions; otherwise we are done.
thanks so much