make sure that all important topics and concepts are sufficiently described. As I said, you can already see the lecture on the website; it is more than 200 slides. The last part, about programming real processing-in-memory architectures, is something we may cover, or at least finish, tomorrow; it all depends on how fast we go through the first part, which is what we can call the processing-near-memory lecture. It seems that we are live streaming now. Over the lecture we will see when to make a break, and maybe we will finish five minutes earlier; we will see. Any questions before we start? Okay, then let's go ahead with lecture 3 in this course on computer architecture.

Today, as promised, we start a really fascinating part of the course: processing in memory. Processing in memory, as we are going to see in a few slides, is divided into different types. Today we are going to talk about the first main type, processing near memory. Remember that processing in memory consists of placing some sort of compute capability in the memory or in the storage. When we talk about processing near memory, what we mean is that we are placing processing elements, compute units, or processing cores near the memory arrays or near the storage.

Let me also clarify the plan. Tomorrow we will in principle have two lectures. In lecture 4A we will talk about programming a real-world processing-in-memory architecture, because that is what you are going to do in one of the labs, so you need this background. We will probably not go into a lot of detail, because we will also premiere the lecture from last year, which was pretty long, an entire lecture about how to program the UPMEM PIM system, and I will also share pointers to a lot of material that will be useful for you to learn how to program this system. The second part of tomorrow, lecture 4B, is going to be a kind of potpourri of different things that need to be done, different challenges that we need to tackle, in order to make processing in memory something real: something that can be used in the real world and that jumps from the academic or industrial research environment, where simulators and prototypes are developed, to real products that are available to all consumers, or potentially all consumers. A lot still needs to be done before we get that seamless, end-to-end system integration. Then next week, at least the first lecture will also be about processing in memory: it will cover the other big part, which we call processing using memory. In next week's lecture Geraldo Francisco de Oliveira, one of our PhD students, will be teaching. He is a real expert on the topic, because that is what he is doing in his PhD, so I am sure he will give a very good lecture.

Okay, so let's start with lecture 3, processing near memory. First of all we need to recap a little on things we covered last week: what is the motivation for processing in memory, what do we call processing in memory, and why processing in memory now?
Remember that last week we were discussing some major trends affecting main memory, and we were also talking about the need for intelligent memory controllers. We discussed three key system trends: first, data access is a major bottleneck, because applications are increasingly data-hungry; second, energy consumption is a key limiter in today's systems; and third, data movement energy dominates compute energy. We saw some interesting plots last week to motivate all of this, and we are going to recap them.

These challenges allow us to make some observations and also to take advantage of some opportunities. There is high latency and high energy caused by data movement, for several reasons: we have long and energy-hungry interconnects, these interconnects are based on electrical interfaces (so they are very energy-hungry), and we are moving large amounts of data, because data keeps growing and we handle bigger and bigger data sets these days. But this also represents an opportunity: the possibility of minimizing data movement by performing computation directly inside the memory or close to the memory. That is what we call processing in memory, or in-memory computation, or in-memory processing. Near-data processing is an even more general term that also encompasses storage: the caches, the SSDs, the main memory, even the network or the memory controllers. If we equip these devices, these parts of the system, with some compute capability, we can talk about near-data processing. In this course we mostly use the term processing in memory, probably because the initial focus was on equipping the main memory with such compute capabilities; we are inheriting the term and making it wider, encompassing all these possible memory spaces that we focus on.

This is one of the key directions in this course and also in our research. Remember this slide from the first day: one of the key topics is fundamentally energy-efficient architectures, memory-centric and data-centric architectures. In these lectures on processing in memory we are talking about exactly that: how to build memory-centric or data-centric architectures. There is also a motivating slide from Professor Mutlu about Maslow's hierarchy of needs, with the different needs depending on the level of accomplishment. If we translate this to computing systems, what we probably want at the very bottom, as the basic need, is everlasting energy. Why? Because we want a sustainable world, a world that looks like this and not like this, and that requires us to be very energy efficient: energy efficient, high performance, and sustainable, all at the same time. This is something we can achieve by making compute systems more data-centric or more memory-centric, and we are going to see in this course how. The problem we want to solve with this kind of system is data access and data movement, because current design principles cause great energy waste and also great performance loss. We have already seen some motivating examples; we are going to review them today, and we will go into the possible solution, which is processing in memory.
Processing, so far, is done very far away from the data. What is the reason for that? The reason is how systems have been built, from the top to the bottom. Think about the von Neumann bottleneck; sorry, the von Neumann model. Sometimes we say "von Neumann bottleneck" when we mean the memory bottleneck or the data movement bottleneck, which is why I said that, but what I wanted to talk about is the von Neumann model, which is arguably the cause of that bottleneck. There are three key components: computation, communication, and memory and storage, and they look like this: we have the compute unit here, we have the memory and storage units on the other side, and in between we have a channel, a communication unit, to bring data from the memory and the storage to the compute units and, once we are done with computation, to return the results. Typically this memory/storage unit is divided into the memory subsystem and the storage subsystem, and this is how such a system looks.

The real problem is that this design is overwhelmingly processor-centric. Why? Because all computation is done down here, in the processor, while the memory and the storage are dumb: they are not optimized to do anything other than keeping bits, storing bits. And that is where the problem appears, because these different units have evolved over time at different rates. I think I already mentioned these numbers last week as motivation, but if you look at how much compute units have improved in the last decade, or in the last 30 years, you will see that they have improved much more and much faster than the memory and the storage, according to different metrics such as energy consumption, performance, memory bandwidth, latency, and so on; we already discussed all of these. And in between, the communication unit is pretty narrow; it is like a funnel. We do not have a very wide highway where we can transport a lot of data at the same time. Every time we need data we have to enqueue requests and bring the data, one by one, from the memory and the storage to the compute units, and that is what creates this memory bottleneck, this data movement bottleneck. As I said, it is sometimes also called the von Neumann bottleneck, because the key reason for the bottleneck is the way the system has been built from the beginning.

This is something that has been known for many years. Remember this interview with Richard Sites, already in the 90s. Probably the most important excerpt of that interview, with respect to the current lecture, is this one: "I expect that over the coming decade memory subsystem design will be the only important design issue for microprocessors." He was thinking about the coming decade; roughly two decades later, the problem is still the same. Let's very quickly review some of the motivating results that I showed you last week. First of all, this nice plot from Professor Mutlu's PhD thesis, where we can see that more than 55% of the execution time is spent on bringing data from the main memory to the cache hierarchy.
Remember that this is the paper (this is a shorter version of it), and I also pointed you to that interview. More recently, we can check newer studies, for example this one from Google in 2015, where they use the Top-Down approach, the Top-Down methodology, to characterize their workloads and see where the pipeline slots, the pipeline cycles, are being spent. Remember that in the ideal world we would have 100% retiring, meaning that we make full use of the pipeline and are productive all the time. Unfortunately that is not the case: we see that in most of the workloads most of the time is spent in the back-end. The back-end means the compute units, but most of these back-end cycles are actually spent accessing the memory units. With a bit more detail, in this figure we can see that half of the cycles are spent stalled on caches, that is, waiting for data coming from the main memory.

This happens because current processor-centric designs are grossly imbalanced, in the sense that processing is done in only one place: data then needs to move all the time from the memory and the storage to the processing elements, and the results go back to the memory and the storage. This is energy inefficient, it is low performance, and it is complex. Trying to mitigate these issues within processor-centric designs, processors have become more and more complex over time, and in reality that does not really help. It was all designed to tolerate the data access in some way, but it made us create very complex hierarchies (remember that CPUs these days have three levels of caches) and complex mechanisms like prefetching. People keep working on improving these techniques, and they are effective to some extent, but unfortunately they did not solve all the problems: the system is still very energy inefficient, low performance, and complex. Remember also this picture from last week: even though most of the compute system is devoted to memory, we still suffer from these perils of processor-centric design.

In terms of energy, we already saw the slide comparing the total energy needed to perform a complex arithmetic operation to the energy spent on accessing DRAM for a read or a write, and there are two to three orders of magnitude of difference. We also saw the graph comparing the energy of a 32-bit operation, such as an integer addition, to the energy of an access to DRAM, and there is a huge difference: a memory access costs more than 6,000 times the energy of a simple integer addition. For even more motivation, there is another slide here: 41% of mobile system energy during web browsing is spent on moving data (this is a study from 2014), or a memory access costs more than 100 times the energy of an add operation, in line with the results we have seen in the previous slides. And you must remember this slide as well; we are actually going to discuss this paper today. It contains a very highlighted number: 62.7% of total system energy is spent on data movement. The main reason is that bringing data to the processor is much more expensive than computing on that data.

Okay, so we can fix this situation. We can make compute systems more memory-centric, and we can overcome all these perils of processor-centric designs.
But this requires us to think in a different way and to try different approaches. In the end, we need a paradigm shift: we enable computation with minimal data movement, and we compute where it makes sense. "Where it makes sense" means where the data is. Let's try not to move data; let's try to just send the computation to wherever the data is, and that might be in the processor itself, in the caches, in memory, or in storage. The goal is to make computing architectures more data-centric.

So now, instead of thinking about memory as just a dumb space that stores zeros and ones, we can think about it as something else, like an accelerator onto which we offload computation from the main system. For example, this could be a system-on-chip with CPU cores, GPU cores, and some video or imaging accelerators, and we could send some computation to the memory. We will see what type of computation; of course, this memory is not going to be ideal for every sort of computation, but for some particular operations it will be very good, much faster than the CPU or the GPU. So now we should see the memory as an accelerator, quite similar to a conventional accelerator. But do not forget that it is an accelerator: we will still have a host processor, the CPU or the GPU or both, that can access and store data in memory and also send some computation to the memory to be performed there. For example, we could have a workload such as a database running on the host and perform queries directly in memory: assuming we have some processing elements inside or near the memory, we can offload a query to the memory, and the memory will return the results when they are ready.

This sounds like a very good idea, but of course it is not so easy or so direct to enable. There are other design considerations that we have to think about, and we need to find solutions for the respective challenges. For example, if we make the memory compute-capable, what should we do with the memory controller? Should we also make the controller compute-capable, and how should the processor communicate through the memory controller to the memory units? How do we need to design the processor chip itself; do we need to change something in the cache hierarchy, for example? How do we design the in-memory processing elements, the in-memory units? And how do we program these systems? We need a new hardware/software interface, probably a new ISA, and we also need new high-level ways, high-level frameworks, to program this new system, which might require us to develop new system software, new compilers, or even new programming languages. We will also have to rethink algorithms. I think I already mentioned some examples of algorithm-hardware co-design last week, and we are going to go over them again today. So many things need to change in the system in order to enable and adopt processing in memory, and all of that requires changes across the entire transformation hierarchy. We are going to talk about these different challenges and the potential solutions in detail tomorrow, in lecture 4B about enabling processing in memory, but I wanted to mention it well in advance.
That way you will understand why, over the remaining part of this lecture, where we are going to discuss real-world systems as well as some academic proposals for processing in memory, we pay attention to certain aspects of system integration, for example how to deal with cache coherence or how to deal with virtual memory and address translation.

As an introduction to processing in memory, remember that this is a highly recommended reading. I do not know yet whether it will be required or not, that is still to be decided, but it is certainly a very good reference. Most of the things that I am going to explain today and tomorrow, and also what Geraldo will explain next week, are already in this book chapter. It is pretty long, but it is also very comprehensive and, I think, a very useful reading for all of you. This is the abstract, and here you can see the table of contents. It starts in a very similar way to these lectures and this course: motivating why processing in memory is needed and discussing the main trends affecting main memory; then it introduces processing in memory and the two main approaches, processing using memory and processing near memory; and finally there is the part on enabling adoption. All of this is about processing data where it makes sense, and it is the paradigm shift we need in order to make compute systems more data-centric.

But it is not a new idea that we just started here today, or two years ago. It is an idea that has been explored for 50 years. The first paper that we are aware of is the one by William Kautz in the IEEE Transactions on Computers in 1969, with the title "Cellular Logic-in-Memory Arrays": the idea of extending memory arrays with some sort of compute capability near each of the memory cells. One year later we find another one, "A Logic-in-Memory Computer" by Harold Stone. So processing in memory is an old idea, but it was really difficult to make it real. Why? Because there were many challenges to solve; remember that there are still many things we need to do in order to make processing in memory universal or ubiquitous, or at least available to the compute systems that can really benefit from it. I am not going to discuss all of those historical challenges, but they are mostly related to how advanced the technology was. Think about the way a processor is designed and fabricated in CMOS logic, and how DRAM is designed and fabricated with a DRAM process. If you compare the two, you will see that, because the requirements of each type of device are different, the way each technology has evolved is also different, and if you now want to integrate a processor or an ALU inside the memory, using a different technology, that is pretty challenging to design. Those are the key reasons; of course there are ways of overcoming these challenges over time, but you see, it took us about 50 years.

Now, however, is the right time for in-memory computation, because there are huge problems with memory technology. Remember the problems related to memory scaling, DRAM scaling for example, and the undesired phenomena, like RowHammer, that can represent a real security issue. There is also a huge demand from applications: we are running more and more applications, and a larger variety of applications, in our compute systems,
and accessing data is always an issue: it entails energy and power bottlenecks, and performance bottlenecks as well, while memory designers are somehow being squeezed in the middle. So now is the right time to try to overcome all these different issues with in-memory computation.

Regarding the recent developments in processing-in-memory systems: even though processing in memory was proposed 50 years ago, the more recent developments started maybe in the last decade, when the Hybrid Memory Cube Consortium appeared. It was led by Micron, one of the major DRAM vendors, and the Hybrid Memory Cube is, or was, a 3D-stacked memory: several layers of DRAM and, at the bottom, a CMOS layer called the logic layer, as you can see here. In this logic layer there is certain logic that is necessary to access the memory; there are small memory controllers to access the different memory banks in the different layers. By the way, if you look at this from the top, you will see that the logic layer, and also the layers on top, are divided into different parts, and each of them is called a vault; at the bottom of each vault there is a memory controller to access the data in that vault. But the logic layer also had some spare area, some silicon that was unused, so it was possible at least to think about embedding some compute capability in that logic layer. That is what inspired many people in industry and academia to do research in this direction: what would happen if we had access to this 3D-stacked memory with a logic layer where we can place a small CPU core, or a small accelerator for certain operations that we want to accelerate? As I said, this was very inspiring for a lot of research, and that is why we are going to review some of those interesting proposals.

There have also been more recent attempts from industry: from Micron, something different called the Automata Processor; the UPMEM PIM architecture that you are going to work with and that we start covering in more detail today; the prototypes from Samsung and from SK Hynix, about which we are going to see a couple of slides; the other one from Samsung called AxDIMM; and this one from Alibaba for recommendation systems. As I said, we are going to talk about them. There are many other experimental chips and startups that are not on this slide, but as soon as you search on Google you will find a few more. Are there any questions? No? Okay. Do you guys have any questions so far? Okay.

So, we are already discussing why in-memory computation today, and in-memory computation requires system integration: we will also need intelligent memory controllers to communicate with the processing-in-memory units, which is why this slide is here again. Remember that if you want to review all the different issues regarding memory scaling, this is a paper you can check. A few of them are also discussed in this work, which we are going to cover today, because it motivates why processing in memory is needed, why applications have issues when scaling, and how we can propose solutions for those applications. Okay.
Let's quickly go over these few slides about real-world processing-in-memory systems before we go into the details of the different approaches we can take to processing near memory. No questions? Okay.

The UPMEM PIM architecture: you have already seen this slide last week. It is based on DDR4 memory technology, and inside each of the chips you find not only memory but also small processors that are called DPUs. They are pretty slow, as you can see, but you have a bunch of them, more than 2,500, and that means you can accelerate applications a lot, because all these 2,560 cores together can enjoy a lot of memory bandwidth, actually more than two terabytes per second of aggregate bandwidth in the most up-to-date system. This is how the system looks: a dual-socket CPU, some conventional main memory (regular DRAM chips and DIMMs), and the PIM-enabled memory. We are going to talk about it in detail later today and also in tomorrow's lecture, because we have done a lot of work on this architecture, on this processing-in-memory system, and also because you need some background on how the system is built and how to program it for one of your labs, as I said.

This is the announcement from Samsung in 2021: they announced a processing-in-memory system for artificial intelligence and machine learning. Remember that this one is based on 3D-stacked memory, not HMC but HBM2, where some of the layers have been modified to integrate processing elements called PCUs, or PCU blocks. There is one of these PCU blocks in between every two banks, and these are relatively simple units, because this processing-in-memory architecture is targeted at a specific type of application, namely neural networks, artificial intelligence, and machine learning, and these workloads mostly need multiply-and-accumulate operations; that is why the units are so specialized. You have already seen this photo, where you can see how the DRAM layer has been modified to place the PCU block in between two DRAM banks, and this is another picture of the system with the PIM unit between two banks. If we take a closer look, we see that the PIM unit sits near the column decoder, the drivers, and the sense amplifiers, so that when we open a row, the data is right there in the sense amplifiers and this PCU, now called a SIMD unit because that is what it is, has direct access to it through its own registers. This is how one of these memory banks would look, and the PCUs themselves are SIMD units: in total they have 16 lanes, and each lane operates on 16-bit floating-point values. Why? Because that data type is pretty useful in machine learning and artificial intelligence. As you may know, and I think we will mention this later as well, there are different ways of applying quantization to neural networks in order to reduce the size of the parameters: you could start training a network with 32-bit floating point but eventually want to reduce that, in order to save storage and to compute faster. Networks are robust, so they can still produce accurate results even when we reduce the precision through quantization. 16-bit floating point is kind of a standard choice, and that is why they focus on it. Also observe that this is a SIMD unit.
SIMD means single instruction, multiple data: we have 16 of these lanes, each of them operating on different data, but all of them performing the same operation, for example a multiplication, an addition, or a multiply-accumulate. Why does SIMD computation make sense here? Because the type of workloads we are targeting have a lot of data-level parallelism; for example, matrix-matrix multiplication and matrix-vector multiplication are widely used in machine learning and neural networks, so it makes sense to have multiple lanes computing in parallel, because we can operate on multiple rows and columns at the same time. This is the conventional, and also smart, way of exploiting data-level parallelism.

Something else that is pretty interesting about this Samsung PIM architecture is that it can be integrated more easily into a real system, because it stays compliant with JEDEC controllers. JEDEC is the standardization body that creates the standards for the different types of memory, the different types of DRAM: it defines how the host system has to operate the memory, which timings and latencies need to be respected, how frequently the DRAM rows need to be refreshed, and so on. If you want to integrate a new type of memory, like this HBM-PIM (Samsung also calls it FIMDRAM), into a real system with a CPU or a GPU, you may need to modify the memory controller for the system to be able to communicate with the PIM units. But it is not that easy to change the standards and the way JEDEC-compliant controllers already work; observe as well that the parts are typically fabricated by different companies, since the companies that manufacture CPUs or GPUs are different from the companies that manufacture DRAM. For an easier integration, Samsung devised this HBM-PIM to be JEDEC-compliant. We are not going to go into the details of how the communication is done between the host system and the PIM units, but I will refer you to a longer lecture that covers this architecture in a lot of detail, if you want to learn how that can be done.

If we take a closer look at the PCU, the PIM unit, you will see that it has an execution unit with a relatively simple pipeline: an array of multipliers and an array of adders, plus some registers, and a sequencer that accesses the CRF, which contains the instructions themselves. Remember that here we are performing fairly simple computation, mostly dot products, GEMV, or matrix-matrix multiplication, so we do not really need many instructions to perform those operations; that is why the CRF holds only 32 instructions. The way to program this unit is with this instruction set: a multiplication, a multiply-accumulate, a multiply-and-add, and then a few more instructions for data movement, jumps, and so on. As I said, we have longer lectures about this real-world PIM architecture, and here you have a link if you are interested and want to take a look.
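To make the SIMD idea a bit more concrete, here is a minimal sketch in plain C of the kind of computation one PCU performs: a 16-lane multiply-accumulate, as a MAC instruction held in the CRF would trigger. This is only my illustration of the dataflow, not Samsung's actual code or instruction encoding; the real hardware operates on FP16 values coming from the open row and the unit's own registers, while the sketch uses float for portability.

```c
#include <stdio.h>

#define PIM_LANES 16   /* one PCU has 16 SIMD lanes */

/* One MAC step of the PIM unit, sketched in plain C:
 * acc[i] += a[i] * b[i] for all 16 lanes in lockstep.
 * On the real hardware the operands are FP16 and come from the
 * sense amplifiers (open row) and the unit's register file. */
static void pim_mac(float acc[PIM_LANES],
                    const float a[PIM_LANES],
                    const float b[PIM_LANES])
{
    for (int lane = 0; lane < PIM_LANES; lane++)
        acc[lane] += a[lane] * b[lane];
}

int main(void)
{
    float acc[PIM_LANES] = {0.0f};
    float a[PIM_LANES], b[PIM_LANES];
    for (int i = 0; i < PIM_LANES; i++) { a[i] = (float)i; b[i] = 0.5f; }

    /* Two MAC steps, as a short sequence of CRF instructions would do. */
    pim_mac(acc, a, b);
    pim_mac(acc, a, b);

    printf("acc[3] = %.1f\n", acc[3]);   /* 2 * (3 * 0.5) = 3.0 */
    return 0;
}
```

A matrix-vector multiplication then simply becomes a sequence of such MAC steps over consecutive columns, with the accumulation registers holding the partial results.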
Another real prototype, also from Samsung, is completely different: instead of HBM2 memory, AxDIMM is a DIMM-based solution, with a buffer on the DIMM that contains an FPGA, and you can program this FPGA for the specific operations you want to perform. In their first work they presented an accelerator for the sparse embedding operators that are used in recommendation systems. If you look at how a typical recommendation system works internally, some parts of the computation are dense and have a lot of data-level parallelism, typically the multilayer perceptron networks. You can execute those efficiently on CPUs or GPUs, because you can use optimization techniques like tiling, bring large tiles of data into the cache hierarchy, and compute on them; that is pretty efficient. But other parts require more irregular, sparse memory accesses, and those are not a good fit for the main processor, the CPU or the GPU; they are very good candidates for processing in memory. That is why in this first paper they propose an accelerator for this operation, which in reality is pretty simple: it is sort of a reduction, or even a vector-addition, operation. If you look at the NMP unit (NMP meaning near-memory processing), it is correspondingly simple: just an array of adders that brings data from the memory ranks, performs some additions, and stores the results in this buffer of partial sums, all of it designed to be integrated into the FPGA.

What is the key advantage of this approach? The key advantage is that we are exploiting rank-level parallelism. Typically in one DIMM you have more than one rank, usually two, but you can only access one of them at a time. If you place these near-memory processing units next to the ranks, they can operate independently, and you can access both ranks at the same time, exploiting more parallelism. This is another figure showing how the interaction with the CPU is done, because the overall execution is controlled by the host processor, and it is the host processor that offloads the computation to the near-memory processing units. The first thing the processor does is write the embedding tables; those tables are placed in the memory ranks and will later be accessed with irregular access patterns, as I said. When the tables are written, the CPU can change the mode, switching to the processing-in-memory mode, and launch the execution of the SLS operator, which is the operation that runs on the near-memory processing units. The NMP units then start executing; in the meantime, the CPU polls a specific status register that indicates when the computation is done. All these registers are memory-mapped, so the host processor can access them as if they were regular memory addresses. Again, if you want to learn more about this, here you have a full lecture, 32 minutes long.
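To make that host-device flow concrete, here is a rough sketch in C. Everything in it is hypothetical: the register names, the table layout, and the "launch" mechanism are invented for illustration, and the near-memory unit is simulated as a plain function, whereas on the real AxDIMM these are memory-mapped locations behind the DIMM buffer and logic inside the FPGA. The point is only the sequence: write the embedding tables, switch the rank to PIM mode, launch the SLS operator, and poll the status register.

```c
#include <stdio.h>
#include <string.h>

#define EMB_DIM    4     /* embedding vector width (toy size)      */
#define TABLE_ROWS 8     /* rows in the embedding table (toy size) */

enum mode { MODE_NORMAL, MODE_PIM };

struct nmp_regs {        /* simulated memory-mapped control registers */
    volatile int mode;
    volatile int start;
    volatile int done;   /* status register polled by the host */
};

static float emb_table[TABLE_ROWS][EMB_DIM]; /* lives in the DIMM rank */
static float psum[EMB_DIM];                  /* NMP partial-sum buffer */

/* What the near-memory unit computes: an SLS-style gather and sum of
 * the selected embedding rows, using its array of adders. */
static void nmp_run_sls(const int *indices, int n)
{
    memset(psum, 0, sizeof(psum));
    for (int i = 0; i < n; i++)
        for (int d = 0; d < EMB_DIM; d++)
            psum[d] += emb_table[indices[i]][d];
}

int main(void)
{
    struct nmp_regs regs = {0};
    int indices[] = {1, 3, 6};   /* sparse, irregular lookup indices */

    /* 1. Host writes the embedding table into the rank (normal mode). */
    for (int r = 0; r < TABLE_ROWS; r++)
        for (int d = 0; d < EMB_DIM; d++)
            emb_table[r][d] = (float)r;

    /* 2. Host switches the rank to PIM mode and launches the SLS op. */
    regs.mode  = MODE_PIM;
    regs.start = 1;
    nmp_run_sls(indices, 3);     /* in reality done by the NMP unit   */
    regs.done  = 1;

    /* 3. Host polls the memory-mapped status register until done. */
    while (!regs.done)
        ;                        /* spin */

    printf("psum[0] = %.1f\n", psum[0]);   /* rows 1 + 3 + 6 = 10.0 */
    return 0;
}
```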
Another interesting prototype, currently under development, is SK Hynix AiM, or Accelerator in Memory. It is similar in spirit to the HBM-PIM, the FIMDRAM from Samsung, because it targets the same type of workloads, machine learning and artificial intelligence, and it takes a similar approach, but here, instead of one processing unit for every two banks, we have one processing unit per bank, and the memory technology is not HBM2 but GDDR6. If you look at the internals of this PU, you will see that it has multiply-and-accumulate units, and it also has units for activation functions, which is something pretty interesting about this architecture. Here is where you can see the PU: you see an array of multipliers, and then it has an adder tree. Why do we need that? If you think about a dot-product operation, row times column, what you need to do is first perform the multiplications and then accumulate, in order to obtain a single scalar per dot product; that is what we obtain here. After that, it is possible to execute an activation function which, as you may know, is key in many machine learning algorithms and neural network layers: softmax, sigmoid, ReLU, GELU, and so on; you have heard of them. Something else interesting in this SK Hynix AiM proposal is the supplementary SRAM buffer inside the DRAM chip: there are two spaces working as a single buffer of 2 kilobytes that can be used, for example, to move data from one bank to another (temporary storage for data movement) or to store vectors. Think about a convolutional neural network, for example: the input is an image, and if you are running the neural network inference on these PUs, the image must come from the host processor. So you could write that image, or at least part of it, into this global buffer and then perform the inference using the processing units. That is the overall idea; again, if you want to learn more, there is a 35-minute lecture about this Accelerator in Memory.

The last one I am going to show you, before we go into the actual contents of this lecture, is this HB-PNM from Alibaba. It is pretty nice from the technology point of view, because it is a 3D-stacked architecture with one DRAM die and one logic die that are bonded using a technique called hybrid bonding, which allows a lot of connections between the DRAM die and the logic die; this way we can have a very large bandwidth between the two dies. What do we have in the DRAM die? If you look at it, it more or less looks like regular memory, with some I/Os, sense amplifiers, and some decoding and control logic. At the bottom, in the logic die, we of course have the logic necessary to access the DRAM, the memory controllers, but also some so-called engines for the specific operations that Alibaba wanted to target here. One of them is called the neural engine and the other one is called the match engine, and they are designed for different parts of the recommendation system: you see, for example, that this coarse-grained matching runs on the match engine, while this fine-grained ranking runs on the neural engine. By the way, if you look at the neural engine, what you see is mostly a GEMM unit. Remember that last week we were talking about Google TPUs and systolic arrays; this GEMM unit is kind of a systolic array, because what is going to run on that neural engine is a small neural network. But those two stages are different, and that is why two different types of engines are needed.
The requirements of these two steps of the recommendation system are different. What this part of the picture represents is that these three stages are typically done on the GPU, because they are very data-parallel, but these two were typically performed on the CPU, because they have less parallelism and more irregular accesses. That is why Alibaba's proposal is to replace the execution on the CPU with execution on the processing-near-memory architecture they are proposing. Again, if you want to learn more, you can take a look at this lecture from our processing-in-memory course. I wanted to have these slides here as motivation, showing existing real-world processing-in-memory systems, but, as you know, we are also organizing a tutorial on real-world processing-in-memory systems. It is going to happen on October 29th, and you can attend because everything is going to be live streamed; we will talk about these architectures in more detail, hopefully, or we will have different talks that might be even more interesting, who knows. And there are other real ways of doing processing near memory, or near-memory processing, for example the FPGA-based near-memory acceleration that we already talked about last week.

Okay, so we need to think differently from past approaches, and we are going to see how. As I have already mentioned, there are two main approaches, two main directions, to processing in memory, and we cover them in the book chapter: processing using memory and processing near memory. Today we focus on processing near memory; processing using memory is the topic of lecture 5. This is a first way of organizing the different types of processing in memory that we can have. Observe that, if you read each of these, you will see that we are putting them all together under the processing-near-memory umbrella, but there are certainly important differences between them: using logic layers in 3D-stacked memory is not the same as putting logic in the memory controller or logic near the caches. There are different challenges you will need to face depending on the type of processing near memory, but that would be the first approach to this classification of processing in memory. There is no widely accepted processing-in-memory taxonomy, but if you look at the different taxonomies that people have proposed, in the end they all focus on different characteristics that you can use to make the classification. The first one would be the nature of the computation, the type of computation: in processing near memory we are really placing compute elements near the memory, like a small CPU core, a small GPU core, an accelerator, or something similar; in the end, an ALU that was never there but is now placed near the memory arrays. In contrast, we talk about processing using memory when we take advantage of the analog operational properties of the memory structures. The classification can also be based on the memory technology: we mostly focus on DRAM in this course, but there are also proposals for SRAM, flash memory, and so on, and we are going to mention some of them in these lectures. And there is also the question of where exactly the processing-in-memory capability is placed: is it in the sensor, in the storage,
in the hard drive or the SSD, in the main memory, in the cache, in the network or the interconnect, and so on? If you take one of these combinations, let's say near-memory processing with DRAM in the main memory, you have a particular type of processing in memory, but you could classify even further, because one thing that is not here is the type of computation you are doing: is it more general-purpose or more application-specific? It depends. So in the end the classification is pretty complex. Anyway, the best thing we can do is to start discussing some of the approaches, some of the processing-near-memory proposals, and I think all of this will become much clearer.

Remember that a lot of the inspiration for the recent academic and industry research on processing in memory came from the Hybrid Memory Cube, a 3D-stacked memory technology with multiple layers of DRAM and a logic layer where we have memory controllers but can also have some processing elements. It is just one type of 3D-stacked memory; there are others, for example HBM, as you can see here. HBM has been more successful over the years than HMC, but from the research perspective they are really very similar. So, if we have one of these 3D-stacked memory technologies, where we have the possibility of placing compute units near the memory, what approach are we going to follow; how are we going to do it? One possibility is to create an accelerator, a sort of coarse-grained accelerator, based on 3D-stacked memory, and that would be roughly the equivalent of a GPU. Think about your system: you have your CPU, there is a PCI Express bus, and you have a large GPU on the other side, probably a discrete GPU for gaming or something else. That is what we can call a coarse-grained accelerator. Why is it coarse-grained? Because the amount of computation we offload to the accelerator is relatively large: we launch an entire kernel that runs for a few milliseconds or so. That is one approach to processing in memory as well: we can create an accelerator out of 3D-stacked cubes that contain a lot of memory but also contain execution units near the memory. That somehow requires changing the entire system, because now you would be integrating a large accelerator into your system. Or maybe we can do something different: instead of offloading a lot of computation to that accelerator, we can have some simpler units inside the memory and just perform simple function offloading, in a similar way to what you can do with some of the execution units a regular CPU or GPU has. If you think of a regular CPU, it has SIMD extensions, like AVX, and you can write a short program, a short function, that makes use of these SIMD extensions and offload the computation there. In a similar way, you could do that with some execution units near the memory; that is another potential approach. Or, even simpler, what would be the minimal processing-in-memory support, with minimal changes to the system and to programming? That is also a different proposal that we are going to cover. So, as you see, we started this part of the lecture talking about the PIM taxonomy: we can have processing near memory or processing using memory, possibly with different memory
technologies, or in different places of the system (the cache, the main memory, the storage). But those are not the only features we can use to make a classification. Even if we decide to focus on 3D-stacked memory, on DRAM, and on main memory, we still have to think about what the processing elements themselves should look like. Do we want these processing elements to execute a large amount of computation, like a big kernel, and consider them a coarse-grained accelerator? Or are we going to offload just simple functions, maybe 20 to 100 instructions, a relatively simple function offloading? Or maybe even finer-grained, just a single operation, as if we were using one floating-point unit in our compute system, or a different type of ALU in some sense? We are going to discuss these different approaches, starting from the biggest one and moving to the smallest. The biggest one is the coarse-grained accelerator, where we need to change more things in the system.

The motivation for this coarse-grained processing-in-memory accelerator is graph processing. Why? The first reason is that graphs are large, and they are becoming even larger. Observe that these numbers are from 2015, almost eight years ago; if you think about the graphs for Wikipedia, Facebook, Twitter (or X, as it is called now), or Instagram, they would be much bigger today, because the number of users has increased and the number of connections has increased as well. So these graphs are getting larger and larger. That is the first motivation: we need a lot of memory, and we will need a lot of data movement to process these graphs. The second reason is that scaling is very challenging; we have already mentioned this problem of application scaling. If you take relevant graph-processing algorithms, execute them on a multicore system, and increase the number of cores, increasing the number of threads according to the number of cores you have in the system, you will see that the performance saturates at some point. The reason is that even though you keep increasing the number of cores, the total bandwidth available to those cores is limited, very limited, because, remember, we need to access data through a narrow funnel, which is the memory channel, the memory unit in the von Neumann model. These are real example results: for 32 cores and a certain graph-processing algorithm you obtain some performance, but if you increase the number of cores four times, you only get 42% more performance, because we are saturating the bandwidth. So it is not as simple as using more cores in the system; we require different, smarter solutions in the end.

Let's discuss why this happens: what are the key bottlenecks in graph processing? The first reason is that graphs are very large, and the second is that graph algorithms are typically iterative, which means we require many iterations to process the entire graph. Two additional bottlenecks come from the fact that the accesses are typically random, because graphs are very sparse.
If you think about each node in a graph as representing, for example, a user in a social network, this user is connected to many other users, but a neighboring user might be connected to completely different users. Two of you might be sitting next to each other, but your friends might be in completely different parts of the world. That is what makes graph processing hard: when going over all the vertices, as you see in the outer loop, the vertices themselves may be in nearby positions in memory, but when it comes to visiting the successors, the neighbors, those successors might be in very sparse areas of the memory. This entails irregular, random memory accesses, meaning that you might bring an entire cache line to the core and only use a couple of bytes, or four or eight bytes, of it, because all the other neighbors are in completely different cache lines. This is a problem in terms of data movement, and an additional problem is that we will not reuse that cache line for a long time, because there is not much computation to do. If you think about this algorithm, these three lines of code correspond to the PageRank algorithm, where we just need a multiplication and an addition to update the rank of each successor of a vertex. So there are random, irregular memory accesses and very little computation, and the data movement bottleneck gets exacerbated here.

A potential solution is the Tesseract system for graph processing, a coarse-grained accelerator based on processing-in-memory technologies. This is how the entire system looks: each of these is a stack of 3D DRAM, and in each stack there is a logic layer containing multiple small processors, in-order cores, that communicate with each other through this crossbar network and can access data in the different layers of the stack. If we take a closer look at each of the cores, we see an in-order core that has access to DRAM through this DRAM controller, and it can also access the DRAM of other cores, or even DRAM in other cubes of the accelerator, through this network interface. There are some more units to perform communication across cores and also to prefetch data. One interesting thing about these cores is that you can either bring data from remote places (for example, this core here might need data that is directly accessible to that core there, which is something you could do using the prefetchers), or you can use this message queue to communicate actual computation, to send instructions from one core to another. Those are the key ideas of the Tesseract system: the remote execution of instructions through what are called remote function calls. We are going to see an example quickly, based on the PageRank algorithm. This is what I was saying: remember, in the PageRank algorithm we visit the successors of each vertex in the graph, and some of these successors might be in the same vault, in the same part of memory, while others might be in a completely different part of memory.
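Before we look at how Tesseract handles the remote case, here is roughly what that per-successor update looks like in plain C, with a CSR-like adjacency layout that I am assuming for the sketch (the real benchmark code differs). Note how little compute there is per edge, one multiply and one add, while new_rank[w] is the irregular access that most likely misses in the caches.

```c
/* Push-style PageRank update, sketched with a CSR-like graph layout.
 * For every vertex v we visit its successors w and perform exactly
 * one multiply and one add per edge; new_rank[w] is the irregular,
 * cache-unfriendly access discussed above. */
void pagerank_push(int num_vertices,
                   const int *row_ptr,      /* CSR offsets     */
                   const int *col_idx,      /* successor lists */
                   const int *out_degree,
                   const double *rank,
                   double *new_rank)
{
    for (int v = 0; v < num_vertices; v++) {
        double weight = 1.0 / out_degree[v];
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
            int w = col_idx[e];
            new_rank[w] += weight * rank[v];   /* one mul + one add */
        }
    }
}
```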
To perform that update, which consists of one multiplication and one addition, one thing we can do is replace the multiply-and-add you see here with this put() remote function call. What this does is send a request from one vault, from one core, to another core, asking it to perform the update operation; that is why it is called a remote function call. The good thing is that this remote function call, the execution of this function, can be done asynchronously: as soon as one core finds that a neighbor is in a remote vault, it sends the put instruction to the remote vault, and observe that there are no dependences. We can move on to the next successor, and then the next, and then to the next vertex of the graph, and there is no dependence across iterations: we send the update to the remote vault, it will get applied there at some point, but we do not have to wait until that update operation has been performed. That is why these calls are non-blocking: we do not have to wait until the computation is completely done. If we do need that to happen, we can use a barrier, and the Tesseract programming model also proposes the use of barriers. This is, in a bit more detail, how these remote function calls work: a local core, through the network interface, sends the function together with the necessary data to the remote core, which has direct access to the memory on top of it and performs the update operation there. That is also why we need the message queues: one core may be receiving requests from many other cores, and you need a queue to temporarily store those requests. What else? There is also the prefetching capability. Tesseract is very complete in that sense: you can either send computation to a remote core, to perform the put operation there, or you can bring data from a remote vault to the local core and perform the computation locally.
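Sketched in C, the Tesseract-style version of that loop looks roughly as follows. The put(), barrier(), and vault_of() functions below are my simplified stand-ins for the interface described in the paper (the real put() also lets the caller pass a prefetch hint and copies the argument into the remote vault's message queue; here it simply runs the function locally so the sketch stays compilable), so treat this as an illustration of the structure, not the exact API.

```c
#include <stddef.h>

typedef struct { int w; double contrib; } update_arg;

static double *new_rank;              /* owned by the (remote) vaults */

static void add_contrib(void *p)      /* function shipped by put()    */
{
    update_arg *a = p;
    new_rank[a->w] += a->contrib;     /* executed next to the data    */
}

/* Simplified stand-ins for the Tesseract primitives. */
static void put(int vault, void (*fn)(void *), void *arg, size_t size)
{
    (void)vault; (void)size;
    fn(arg);   /* real hardware: copy arg into the vault's message queue */
}
static void barrier(void) { /* real hardware: wait for all queues to drain */ }
static int  vault_of(int w) { return w % 512; }  /* toy vertex-to-vault map */

void pagerank_push_tesseract(int num_vertices, const int *row_ptr,
                             const int *col_idx, const int *out_degree,
                             const double *rank, double *next_rank)
{
    new_rank = next_rank;
    for (int v = 0; v < num_vertices; v++) {
        double weight = 1.0 / out_degree[v];
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
            update_arg a = { col_idx[e], weight * rank[v] };
            /* Non-blocking: enqueue the update for the vault that owns
             * vertex w and keep iterating; no waiting per edge. */
            put(vault_of(a.w), add_contrib, &a, sizeof(a));
        }
    }
    barrier();   /* all remote updates visible before the next iteration */
}
```

The barrier at the end is exactly the synchronization point mentioned above: the next PageRank iteration only starts once every vault has applied the updates queued for it.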
Okay, so let's take a look at some evaluation results. The authors of the Tesseract paper compared the Tesseract accelerator, which is here on the right-hand side, against several baselines. The first baseline is essentially a normal multicore CPU: in total 32 out-of-order cores running at 4 GHz, so really powerful, beefy out-of-order cores, with access to a lot of memory; but this memory is DDR3, which has relatively low bandwidth, so the nominal bandwidth of this system is 102 GB/s. That is not ideal, because DDR3 is not a fast memory, so as a second baseline they propose replacing DDR3 with HMC memory, a more sophisticated, high-bandwidth memory technology; in this case the total bandwidth is 640 GB/s. They also wanted another baseline, in which they replace the out-of-order cores with in-order cores, more similar to the Tesseract cores: in total 512 in-order cores running at 2 GHz. And here we have the same number of Tesseract cores, 512 in total.

What is the key advantage of Tesseract? The key advantage is that these Tesseract cores sit right under the DRAM layers, so they can enjoy much higher memory bandwidth. If you account for the total aggregate bandwidth available to the Tesseract cores, it is 8 TB/s, which is significantly higher than the external bandwidth of the HMC cubes, which in total, for these systems, is 640 GB/s. That is where Tesseract gets its key advantage: lower memory latency and much higher memory bandwidth. And that results in up to 13x performance improvement; notice that this is the normalized execution time, or the speedup, over the first baseline, the out-of-order CPU system with DDR3 memory. Using significantly faster memory (remember, going from DDR3 to HMC means more than six times more bandwidth) results in only 56% more performance, due to the scaling issues of graph applications, or 25% more performance if you use the in-order cores. With Tesseract, and especially when we also include the prefetchers, the speedup is much, much higher.

(In response to a question:) No, it is actually both types; that is why the fence, the barrier instruction, is necessary. I mentioned the most ideal scenario, which is the asynchronous calls, the put operations, but the programming model also supports a get operation, where you perform the update and return some value, because that value is an output result needed for further computation. You can check the paper (you probably have to read this paper) to see which benchmarks were used; this corresponds to the evaluation of workloads with different characteristics.

Okay, so where is the performance benefit, the performance improvement, coming from? It is coming from the increased memory bandwidth: if you look at the effective bandwidth consumption in all cases, you will see that Tesseract provides much more bandwidth to the cores, and that is where the benefit comes from. But it is not only bandwidth; the authors are also completely rethinking the system. Think about the graph: the graph is huge, so you need multiple HMC cubes and you are going to map the graph onto the whole memory space you have, onto multiple layers of the HMC memory. You also need a way of programming the cores and performing the computation, and in the end you are redesigning the entire algorithm, because in some cases you are going to bring the data from far away, or you need to send the computation there with the remote function calls we were talking about. So a lot needs to be rethought: not only the system itself, but also how we program it and how we change the algorithm to make it more suitable for the specific hardware we are using, and indeed some of the benefits come from there. One interesting analysis the authors did was to compare one of the baselines, the multicore baseline with many in-order cores using HMC memory, against a simulation of the same system that, instead of using the external bandwidth of HMC (remember, that was 640 GB/s),
simulated an ideal HMC that can provide in total the same bandwidth as Tesseract, the 8 TB/s. That improved performance a lot, by 2.3 times, but it was still really far from the performance of the actual Tesseract system, which raises the speedup to 6.5 times. What the authors want to show with this analysis is that a lot of the performance improvement comes from the programming model itself, from the way they redesigned the algorithm and made use of the programming interface provided by the Tesseract system. In terms of energy there is also a strong reduction in energy consumption, around eight times, so also pretty interesting results.

To start summarizing: Tesseract has advantages and disadvantages. As a specialized processing accelerator it can provide large performance and energy benefits, it takes advantage of 3D stacking for an important workload, and it can be more general than just graph processing. The authors focused on graph processing algorithms, but this kind of system could be useful for other applications as well, even more modern ones than the workloads used in this 2015 work; today, for example, graph neural networks could probably be a pretty good fit for this kind of accelerator. But it also has disadvantages. One of the key disadvantages is that you need a lot of changes in the system; you need a new programming model, and that is always challenging because users, programmers, need to learn how to use it. The Tesseract cores are specialized for graph processing, so if you want something more general purpose you might need a different core design. Cost is also a disadvantage: you can expect 3D-stacked memory like HMC or HBM to be more expensive than DDR memory, which also makes the accelerator expensive. And in the end there may still be some scalability problems, because we need to partition the graph, and if you need to bring data from a remote core in the same cube, or even from a different cube, scalability is not always great: you may need many remote accesses, and remote accesses have lower bandwidth than local accesses, which can also hurt performance. That is why some of the later works that build on Tesseract target graph partitioning, how to map the graph onto a Tesseract-like system with multiple cubes in order to minimize data movement even further, to minimize the number of remote accesses. So it is still not a perfect solution, but the paper is there, there are also some slides you can check, and we will continue talking about near-memory processing. Do you have any questions? No? Okay, then maybe it is a good time to make a break, about 15 minutes. It is 2:30 now, so we will continue at around 2:45. Let me know in the meantime if you have any questions or anything you want me to clarify.

Okay, I think we can continue. Remember, we are talking about processing near memory, and we are mostly focusing on processing near memory with 3D-stacked memories
like HMC or HBM, and we are covering different examples of how we can build processing-in-memory systems using these 3D-stacked memory technologies. The first one was a proposal for a coarse-grained accelerator for a specific type of application with very challenging requirements: very large data sets, frequent random memory accesses, and very little computation. These characteristics make the workload very suitable for processing in memory. But we are not going to design a large coarse-grained accelerator for every application that might benefit from processing in memory, because that would be really costly and we probably cannot have that many coarse-grained accelerators in our system. So there are other approaches.

One simpler thing we can do is extend the main memory with some compute capability that can perform certain operations that are important for the specific applications we want to focus on, and that may be closely tied to the type of system we are targeting. For example, in a mobile system, or in a relatively small environment, embedded systems and so on, we will for sure not need a coarse-grained accelerator, and we may not need a large variety of operations to perform in memory, because these systems are more specialized toward specific applications. So it makes sense to identify, for the specific applications that matter in the embedded or mobile environment, which functions, which calculations, can benefit from execution closer to memory. That is what was proposed in this Google workloads for consumer devices paper, presented in 2018. It is a paper we have already mentioned, because the famous figure that 62.7% of system energy is spent on data movement comes from this paper, and energy is in the end the main motivation for this work. Why? Because this work focuses on consumer devices, for example tablets, cell phones, or smartwatches. These are mobile devices that rely on batteries; there is no connection to the electric grid that can provide a continuous supply of energy. Batteries have a certain duration, and we want to maximize that duration. How? By accelerating the workloads and making their execution more energy efficient. That is why the work focuses on workloads typically used in consumer devices, for example the Google Chrome browser, TensorFlow Mobile, and the VP9 video playback and video capture applications.

The interesting thing about this work is that, first of all, a lot of characterization was done for these important, widely used applications, which are basically everywhere, since they come from Google and run on many of the mobile devices we use these days. There is a lot of analysis of the energy cost and the performance cost of executing these applications on existing consumer devices. There are some important observations: for a start, the 62.7% of total system energy spent on data movement that you already know. Another
important observation is that a significant fraction of this data movement comes from relatively simple functions. That inspires the authors to propose logic near the memory to accelerate those simple functions. There are different ways of doing this: you could have a more general-purpose core near memory that can execute the different operations, or you could have an array of specialized accelerators near memory, one for each specific function, and the authors evaluated both. What were the potential energy and performance improvements? As you see, more than 55% energy savings and around 54% execution-time savings; these are average results, and we are going to see more detailed results soon for the different workloads.

We start by focusing on TensorFlow Mobile. TensorFlow Mobile is used in consumer devices for neural network inference: whatever neural network is being used is fed some input and produces some prediction, some inference output. One first outcome of the workload characterization done in this work is that 57.3% of the inference energy is spent on data movement, and 54.4% of that data movement energy comes from two simple functions used in TensorFlow Mobile: packing (and unpacking) and quantization. So let's see what is going on with these functions, why they have such large data movement requirements, and what possible solutions exist.

Packing is a quite simple operation that just reorders the values of matrices so that matrix multiplication runs faster, minimizing cache misses. But doing it costs 40% of the inference energy and 31% of the inference execution time, and it is just moving data, just rearranging data to make the data structure more suitable for the later computation. Another observation is that packing's data movement alone accounts for up to 35.3% of the inference energy. So a lot of performance and energy is wasted on this packing operation, even though it is a relatively simple data reorganization that requires only simple arithmetic.

The other function is quantization. In quantization we convert floating-point numbers into 8-bit integers; remember that I talked about quantization earlier today, when we were introducing the Samsung HBM-PIM architecture. Neural networks do not require full precision to perform well, especially for inference, so quantization is a really good approach to reduce the amount of memory space we need and also to accelerate execution. But doing this quantization is costly in terms of execution time and energy; here you have some numbers from the workload characterization showing that a lot of the quantization energy comes from data movement. Quantization itself is pretty simple as well: it is a data conversion that just requires shift, addition, and multiplication operations, so not very costly operations.

So what does this work propose? First of all, move the computation for packing and unpacking, and for quantization, near the memory, and have there either a general-purpose PIM core or a specialized PIM accelerator; both options were studied.
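Just to show how simple the quantization step really is, here is a generic sketch of the kind of float-to-8-bit conversion involved; this is the usual affine quantization scheme, not TensorFlow Mobile's exact routine, and the scale and zero-point parameters are assumed inputs.

```c
#include <math.h>
#include <stdint.h>

/* Generic affine quantization of a float array to unsigned 8-bit
 * integers: q = round(x / scale) + zero_point, clamped to [0, 255].
 * Only a multiply/divide, an add and a clamp per element -- cheap
 * compute, but every element has to stream through the core, which is
 * exactly why data movement dominates its cost. */
void quantize(const float *in, uint8_t *out, int n,
              float scale, int zero_point) {
    for (int i = 0; i < n; i++) {
        int q = (int)lroundf(in[i] / scale) + zero_point;
        if (q < 0)   q = 0;
        if (q > 255) q = 255;
        out[i] = (uint8_t)q;
    }
}
```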
And indeed they obtained significant performance and energy improvements. You can see here normalized energy for the different applications used in this work, and for different steps within these applications. Focusing on TensorFlow Mobile, for packing and for quantization we see a strong reduction in energy, more than 50% in some cases and around 50% in others, when going from execution on the CPU core to execution on a PIM core or a PIM accelerator. So there are very good chances of improving the performance and energy consumption of the system if we do processing-in-memory acceleration for TensorFlow Mobile, and for the other applications as well. And these are the results for normalized runtime, where we see performance improvements in the same ballpark, around 50%, just by offloading those functions that are really memory bound, as the workload characterization step has shown.

In the paper you can find the analysis for all the different applications. In this slide we also have a few details about the Chrome browser. The way Chrome renders a web page has several stages: first loading and parsing, then the layout step, where the positions of visual elements and objects are calculated, and finally painting, which consists of rasterization, which paints the objects, and compositing, which assembles all the layers into the final screen image. Each of these steps has different requirements. For a satisfactory user experience, what you as a user of the browser expect is fast loading of web pages, smooth scrolling, and quick switching between tabs. But these operations can be costly, and that is why the focus here is on two of them, page scrolling and tab switching, because both involve loading entire web pages from main memory so they can be shown to the user.

For tab switching, something that was observed is that when you have Chrome with multiple tabs open, there is a lot of data movement coming from context switching and loading new pages. You have your Chrome browser with multiple tabs, and every time you switch tabs the browser needs to context switch and load a new page, because pages are not kept active all the time: if you are not using a tab, it is compressed and sent to memory. What the Chrome browser does is, when a tab is inactive, compress it and store it in a part of DRAM called ZRAM, and whenever the user wants to open that tab again, the CPU has to access the compressed tab, decompress it, and then show the web page to the user. All these compression and decompression operations require a lot of data movement, and this was studied with an interesting experiment emulating how a user switches through 50 tabs and measuring how much data movement was needed. The observations are that 18.1% of the total system energy is spent on compression and decompression, and a lot of data was moved between the CPU and the ZRAM, almost 20 gigabytes
of data for these 50 tabs. So how can we use processing in memory to mitigate this cost? Instead of the CPU-only approach, where compression and decompression are done on the host and require a lot of data movement, we can move the compression and decompression to processing-in-memory units, save a lot of data movement, and potentially free the CPU for other tasks, accelerating the overall system. This was also studied with PIM cores and PIM accelerators, and the results are in the plots we saw before. So that is tab switching, and that is basically what we have about this work: we have shown two of the examples, TensorFlow Mobile and the Chrome browser, and how some specific operations can be accelerated, greatly improving the overall performance and energy consumption of the system. Any questions about this work or the previous ones?

Okay, we have a few more works that I want to cover, and I think we have enough time for at least two more proposals. The last one is somewhat similar to this one, because it also focuses on specific parts of specific applications, and one more is about minimal offloading to processing in memory: what if we just want to offload single instructions to the processing-in-memory side, and what challenges do we need to face and what solutions do we need? There are many more slides in the presentation showing other interesting works on processing in memory, and not only for CPUs. There are also processing-in-memory proposals for GPUs. GPUs also suffer from the data movement bottleneck, even though GPUs these days use high-bandwidth memory in a configuration similar to the one on the slide; still, all computation is done on the main GPU, which is pretty beefy, with many cores and many threads running concurrently, but all of them need to go through relatively narrow funnels to access data from the HBM stacks.

There are several interesting works on how to do processing in memory for GPUs. One of them is this transparent offloading and mapping work, presented in 2016, which identifies the specific sections of GPU kernels that are more suitable for processing in memory and offloads them to small GPU cores placed under the vaults of the HMC memory. It is not the only one; there is also this interesting work on scheduling techniques for PIM-assisted GPU architectures, which considers a different offloading granularity: instead of relatively small sections of the kernel execution, it offloads entire kernels. It characterizes the kernels first; the authors saw that some kernels are more memory bound and others less memory bound, and depending on that they decide where to schedule them. And there are a few more works: I think tomorrow we will talk in a bit more detail about this one, because it has some interesting ideas about how to handle virtual memory management in processing-in-memory systems, and this one is an accelerator for pointer-chasing applications. And we are talking about
different types of processing in memory, processing near memory and processing using memory, but also different memory technologies and different places where we can put the processing-near-memory elements. We are mainly focusing on main memory and 3D-stacked memory technologies, but processing in memory can also be done in the memory controller. For example, this work proposes a way of accelerating dependent cache misses with an enhanced memory controller that does a sort of processing in memory, and this other one, called Continuous Runahead, places a runahead execution engine in the memory controller as well. These are all interesting readings and examples of the different places where we can have processing-in-memory, or near-data processing, capabilities. Or this one, which we have already mentioned in the past, an FPGA-based near-memory accelerator; or this other one, also near memory, in this case using HBM, which is a nice example of algorithm-hardware co-design, with an algorithm for approximate string matching that is perfectly tailored to the underlying hardware; or this other one for time series analysis. So we keep covering examples of processing-in-memory research.

We have already seen how to design a coarse-grained accelerator for a specific class of applications, and how to design processing-in-memory architectures for offloading relatively simple functions. But another question we can ask ourselves is: what is the minimal processing-in-memory support that we can provide and still benefit from, something that makes the integration into the end-to-end system simpler because only minimal changes are needed? That is the next proposal we are going to cover, called PIM-enabled instructions. Instead of designing an entire accelerator for relatively large kernels, or something that implements an entire function like packing or tab switching, as in the examples we saw in the Google consumer workloads paper, this one proposes to offload specific instructions, for example an addition or a dot product, something pretty simple, to the memory side, and still obtain significant performance and energy benefits. The idea is to develop mechanisms that get the most out of near-data processing with minimal cost, minimal changes to the system, and no changes to the programming model.

Key idea number one is to expose each PIM operation as a cache-coherent, virtually addressed host processor instruction, called a PIM-enabled instruction, or PEI, that operates only on a single cache block. Why is that important? If you think about how data is mapped onto memory, you will see that consecutive cache lines are typically mapped to consecutive banks, memory channels, and so on. So if your computation needs to access more than one cache line, you will likely need to access more than one memory bank, and that requires the processing-in-memory unit to have access to everywhere in some way, and accesses to different parts of the memory come with different latencies. Ideally we want to access the closest memory, the local memory.
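As a simplified illustration of that mapping issue, with 64-byte cache lines interleaved across banks, the owning bank of an address could be computed roughly as below; real DRAM address mappings are more involved and the bank count here is made up, but it shows that two consecutive cache lines already land in two different banks.

```c
#define CACHE_LINE_BYTES 64
#define NUM_BANKS        16   /* illustrative value, not a real system's */

/* Toy cache-line interleaving: consecutive 64-byte lines map to
 * consecutive banks, so touching addresses addr and addr + 64 already
 * involves two different banks, and potentially two different PIM units. */
static inline unsigned bank_of(unsigned long addr) {
    return (addr / CACHE_LINE_BYTES) % NUM_BANKS;
}
```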
Remember that in HMC, for example, the memory is divided into multiple vaults, and under each vault we can have one in-order core. In Tesseract, that in-order core prefers to access data in its own vault, because that is faster, but that is not always possible, and that is why Tesseract does remote function calls, or prefetching to bring data from other vaults. If instead we only perform computation on a single cache line, the big advantage is that we can offload the instruction to the specific processing element near the memory that holds that line, and this processing element only needs to access one cache line in its own local memory, in its own vault if you think of an HMC-like memory. That is the key idea. And because these are just simple instructions, we only need to replace some lines of code in our program with the specific PEIs we want; for example, if we just want to do one addition, we replace the addition with a pim.add instruction.

So, advantages. We can easily modify our programs; we do not need big changes to the programming model. There are no changes to the virtual memory system. Why? Because the PEI starts its execution in the CPU pipeline, and the CPU pipeline itself can find the memory address we want to update, for example with this pim.add operation. The CPU, which has a memory management unit, performs the address translation and sends the pim.add to the specific processing unit that resides close to the cache line we want to update; remember that here we are only updating the contents of one memory address. So no changes to the virtual memory system, and only minimal changes to cache coherence, precisely because we operate on a single cache line, and, as you will see, if that cache line is already in the cache hierarchy we will not go to memory, which also makes a lot of sense in processing in memory: compute where it makes sense, compute where the data resides. And there is no need to worry about data mapping, because each processing element executes PEIs on data that resides close to it, in its local memory, in a single memory module.

Key idea number two is related to cache coherence as well: we can dynamically decide where to execute the PEI. If we know the cache line is in main memory, great, send the pim.add to main memory and execute it there, in whatever PIM processing element you have. But if the cache line is already in the cache hierarchy, why go to memory? Just execute the operation on the host.

For a little bit of motivation, remember that the algorithm is the same one we used as an example earlier, the PageRank algorithm, which only needs one multiplication and one addition per update; the multiplication is now done earlier, but in the end the computation is exactly the same. If you execute this on a conventional architecture, remember that we typically have frequent random memory accesses, so we bring entire cache lines to the host processor, perform the update, and then write the cache line back, probably mostly unused because of the random access pattern. We are moving 64 bytes in and out for maybe just the eight or four bytes we actually want to update; that is not a great idea, we are moving a lot of data.
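A rough sketch of what that transformation could look like is shown below; the __pim_add and __pfence intrinsics are hypothetical stand-ins for the pim.add and pfence operations described in the paper, not an existing compiler interface, and the surrounding code is illustrative.

```c
/* Hypothetical intrinsics standing in for the PEI operations described
 * in the paper; they are not part of any real compiler or ISA today. */
void __pim_add(double *addr, double value);   /* single-cache-block PEI      */
void __pfence(void);                          /* wait for outstanding PEIs    */

/* Conventional version: the whole 64-byte line holding next_rank[w]
 * travels to the core just so we can add 8 bytes to it. */
void update_host(double *next_rank, int w, double delta) {
    next_rank[w] += delta;
}

/* PEI version: ship one pim.add toward the memory holding the line;
 * the hardware (or the locality monitor) may still execute it on the
 * host if the line is already cached. The call is asynchronous. */
void update_pei(double *next_rank, int w, double delta) {
    __pim_add(&next_rank[w], delta);
}

void end_of_iteration(void) {
    __pfence();   /* make all pending updates visible before reading them */
}
```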
If we replace the update operation with this pim.add, we can just send the value to main memory and perform the update, the pim.add, there, and that requires only about eight bytes going in. Okay, but executing in memory is not always a good idea, and the paper actually has a nice analysis for graphs of different sizes; further to the right means more vertices, so larger graphs. What the authors observe is that offloading the computation we just saw to main memory is not always the best solution, because sometimes the graphs are pretty small and they fit, or large parts of them fit, in the caches, so it makes no sense to offload to memory. That is where it makes sense to come up with ways of doing smart scheduling based on where the data really resides.

One advantage of PIM-enabled instructions is that they can be executed asynchronously in many cases; it is similar to the remote function calls in Tesseract. We offload the operation to memory and we can continue computing if there are no dependences, as in this case. If we eventually need to synchronize, we can use this pfence, which has a similar purpose to a barrier in Tesseract: it provides memory consistency guarantees, making sure that all updates have completed before the pfence. The idea itself is not limited to specific instructions; you could come up with your own PEIs. The authors proposed a few that were nicely suited to the specific applications they were targeting. They have the characteristics we have already discussed: they are cache coherent and virtually addressed, they target a single cache block, and they are atomic with respect to each other. That means that if two CPU threads each send a PEI to main memory and they want to update the same cache line, execution happens in an atomic manner: one update executes first and the second update from the second thread comes later, so atomicity is guaranteed. And at some point we may want to use pfences. Here you see more characteristics of PIM-enabled instructions in terms of localization, interoperability, and simplified locality monitoring. The key idea behind localization is that, because we target a single cache line, we do not need to perform remote accesses, which is pretty convenient, also for system integration.

And here you see some information about the different data-intensive workloads used in the experimental part of this work: graph processing, data analytics, machine learning, and data mining. The applications include average teenage follower, BFS, and PageRank, so several graph processing algorithms, but also data analytics and machine learning. The paper shows pretty good results. They evaluated three data sets of different sizes for each application, because if the data sets are very small they will likely fit in the cache
memory, so PIM might make no sense. The more interesting analysis is for the larger inputs, because that is where we can see most of the benefits from either PIM or the locality-aware execution. The locality-aware execution adds a small unit near the caches of the host CPU, a locality monitor: whenever we want to operate on a cache line with a PIM-enabled instruction, the first thing the locality monitor does is check whether that cache line is in the cache hierarchy; if it is not, then we offload the computation to memory. That is why you see somewhat higher performance improvements for the locality-aware execution, because it does not always make sense to offload the computation to memory. In fact, here you see the results for the small data sets: because the data sets are pretty small, they likely fit in the caches, so it makes no sense to offload to processing in memory all the time; that only causes a performance reduction.

Here are some more results and analysis to explain the performance we saw in the other plot. These are the off-chip memory transfers: with PIM-only we always offload the operation to memory, even though the cache lines might be in the host, and that requires a lot of data movement. So it is always good to understand the workloads, the data sets, and the system in order to make the right scheduling decisions. These are the results for the medium data sets, which you can check yourselves, and these are the energy consumption results: as expected, the larger energy savings are for the large data sets, up to 25% energy savings just by offloading some individual instructions, these PEIs, that can be performed near memory.

Advantages and disadvantages. One advantage of PEIs is that they are a simple and low-cost approach to processing in memory: there are no changes to the programming model or the virtual memory system, and it is possible to decide dynamically, via the locality monitor, where to execute an instruction. But there are also disadvantages: we do not take full advantage of the PIM potential, because we only offload individual instructions due to the single-cache-block restriction, and in the end, if you have PEI units near the memory, you may only be able to execute a relatively limited set of operations there.

If you compare this PEI proposal with the real-world processing-in-memory systems I presented at the beginning, you will see certain similarities. In PEI the main idea is to offload individual, simple instructions to memory that operate on a single cache block. Samsung HBM-PIM and SK hynix AiM implement SIMD processing elements that operate only on chunks of 256 bits; that is not exactly the size of a cache line as considered in the PEI work, but it is a relatively limited amount of data residing close to the processing element, so in that sense it is also a fine-grained type of offloading. One difference is that in PEI each individual instruction is offloaded from the host, if the locality monitor decides so, to main memory, while in a
system like Samsung HBM-PIM or SK hynix AiM, we would be offloading a little more: a few more instructions, performing multiplications and additions on a somewhat larger amount of data, for example to perform a dot product. So it is not exactly the same, but there are important similarities, and at the same time similar challenges for system integration. This is the paper; it is quite likely another reading you will do in this course, and I think it is very inspiring and pioneering work in processing-in-memory research. Any questions? Yes?

[Question from the audience.] It depends on the type of PIM. If you think about Tesseract, each core can access data in its own vault or in remote vaults, even in remote cubes, because there is an interconnection network and the architecture has mechanisms for this remote communication, either by offloading computation or by bringing data from outside. The idea in this PIM-enabled instructions work is to provide a way of doing PIM that requires minimal modifications to the system, because that is important for adoption and for end-to-end system integration. That is why the authors propose to operate on a single cache line; for many workloads that is fine. If you think about PageRank, every update touches just one neighbor vertex, so it is just eight bytes or so that you want to update. Or think about a histogram calculation: you have a large input image, but you can partition that input over the different cores in the system, and each core just needs to update particular elements of the output histogram. So there are not many requirements in terms of data access for the specific operations we are offloading here.

If you restrict the operation to a single cache line, first of all you can send the computation to the processing element closest to that cache line, so we avoid remote accesses, which are always more costly because you need an interconnection network and they have longer latency. That is the first advantage. Another important advantage is virtual memory support. If you send the computation to a processing-in-memory core and that core needs to access multiple cache lines in different parts of memory, it needs to do virtual memory address translation in order to figure out where in physical memory the data resides, and that adds complexity to the PIM core. It is already challenging to integrate cores inside the memory, and even more challenging if they need sophisticated mechanisms such as address translation; you would probably prefer to spend that area on more ALUs, which in the end is what you need there to compute. The good thing is that, when targeting a single cache line, the address translation can be done by the host: the host figures out the physical address and simply sends the instruction to the core next to that physical address, and that is it. So it simplifies the design of the system a lot.
Okay, so those are the reasons; that is a key reason for the single-cache-line restriction. Of course it is also more limited, and if you look at the performance improvements and energy savings of this PEI work, they are significantly lower than Tesseract's, for example, but Tesseract is more challenging to integrate and to adopt.

We continue talking about processing-in-memory proposals, and this next one we could again consider function offloading. It is also about Google workloads, in this case neural network models for edge devices, with a similar motivation: edge devices will likely rely on batteries, so we are very interested in accelerating their performance and minimizing their energy consumption. This is called the Mensa framework, and Mensa proposes different accelerators for the different kinds of layers we can find in ML models. If you think about a mobile device, potentially with an Edge TPU, it may need to run different ML models; back when this work was done, these were recurrent neural networks, convolutional neural networks, LSTMs, or recurrent convolutional networks, different types of neural network models that have some similarities but also significant differences. And again: understand your workloads before you accelerate them. Before you figure out how to improve the performance of the system, you need to understand the workloads well, and that is why some interesting workload characterization was done in this project.

This graph shows FLOPs per byte versus parameter footprint for different neural networks and different layers of these networks. FLOPs per byte is the arithmetic intensity: the bytes are what we bring from memory to the processor, or to the accelerator, in this case the Edge TPU, and the FLOPs are the number of floating-point operations we perform per byte of data brought from memory. As you see, this arithmetic intensity changes a lot from one model to another, and from some layers of a model to other layers; at the same time, the parameter footprint, the size of the layers themselves, the weights of those layers, also changes a lot, by several orders of magnitude depending on the type of network and the type of layer.
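As a back-of-the-envelope illustration of what FLOPs per byte means here, compare a fully connected layer with a convolutional layer; the shapes below are made up and the model ignores on-chip reuse and batching, but it shows how the ratio can differ by orders of magnitude between layer types.

```c
#include <stdio.h>

/* Rough arithmetic intensity: FLOPs per byte of weights + activations,
 * assuming 1-byte quantized values. Shapes are illustrative only. */

double fc_intensity(long in, long out) {
    double flops = 2.0 * in * out;                 /* one mul + one add per weight */
    double bytes = (double)in * out + in + out;    /* W, x, y                      */
    return flops / bytes;
}

double conv_intensity(long h, long w, long cin, long cout, long k) {
    double flops = 2.0 * h * w * cout * cin * k * k;   /* weights reused per pixel */
    double bytes = (double)cin * cout * k * k          /* weights */
                 + (double)h * w * cin                 /* input   */
                 + (double)h * w * cout;               /* output  */
    return flops / bytes;
}

int main(void) {
    printf("fully connected 4096x4096 : %7.1f FLOPs/byte\n",
           fc_intensity(4096, 4096));                  /* ~2: memory-bound  */
    printf("3x3 conv, 56x56x256->256  : %7.1f FLOPs/byte\n",
           conv_intensity(56, 56, 256, 256, 3));       /* ~1700: compute-bound */
    return 0;
}
```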
We can go even further, because even within the same network we have layers with very different characteristics; for example, in one of the CNNs that was analyzed there are large differences in the number of multiply-accumulate operations performed in different layers, and the arithmetic intensity, FLOPs over bytes, also varies a lot from layer to layer. So does it make sense to have a single type of processor or a single type of accelerator for all these different models and all these different layers? The answer is no. A monolithic accelerator such as the baseline Edge TPU may be a nice design, with a systolic array, because a lot of the computation in these networks is dot products and GEMV operations that map pretty well onto a systolic array; but, as we have seen, the size of the parameters and the number of operations performed per byte change a lot across models and layers, and that is what makes a monolithic accelerator inefficient in many cases.

So the idea in the Mensa framework is to characterize the models and, based on their characteristics, define different families of workloads, meaning layers of the different neural network models, classify them, and, depending on their characteristics, offload them to different types of accelerators. The first accelerator is pretty similar to the original monolithic baseline and sits near the CPU. Why? Because it accelerates workloads with higher arithmetic intensity, workloads that can take advantage of large on-chip buffers and caches: when you bring data from memory to this accelerator, you can reuse it several times, which amortizes the cost of bringing it and makes efficient use of the hardware. But that is not the case everywhere; we have strong arithmetic intensity variability across the families, and that is why it makes sense to design two more types of accelerators that sit near memory, in the logic layer of 3D-stacked DRAM. The paper, I do not know whether you will have to read this one or not, has a really interesting analysis of the different layers, how to identify the families depending on their memory footprint, their arithmetic intensity, and their MAC intensity, the number of multiply-accumulate operations. Some layers are more compute-bound, or compute-centric, and others more data-centric, or memory-bound. Observe that five different families were identified here: families one and two are more compute-centric, so they will likely be executed on accelerator one, while the others are more suitable for the near-memory accelerators.

And the overall energy reduction of Mensa, compared to either the baseline Edge TPU or the baseline with high-bandwidth memory: energy consumption drops by about three times compared to the baseline Edge TPU. There is a pretty nice breakdown of the energy consumption, where each part of the bars corresponds to the energy consumed in each individual component of the system, for the different models that were evaluated, with the average on the right-hand side. And this is the throughput improvement: 3.1 times higher inference throughput than the baseline; the plot is normalized to the baseline with HBM memory, and the Mensa framework gets pretty significant performance improvements. You can find much more in the paper itself. Do you have any questions about the Mensa framework?

Okay, let's keep making progress. I think we will have time to start introducing the UPMEM architecture and how to program the UPMEM system, which is good because it will give us more time to go deeper into tomorrow's lectures. So this is the Mensa paper, and this is another one I mentioned before, the TOM paper, transparent offloading and mapping;
as you will see, it is a pretty nice proposal to identify which code can be executed near memory and what the best ways of mapping data are to take advantage of a processing-in-memory system. Offloading of code, deciding when to make use of the PIM system, is a question in all types of proposals and something the different papers usually target, but there are more questions that need to be answered, and we will talk about them in more detail tomorrow. In these slides we already have some sort of introduction to these issues. For example, how to keep it simple: remember the main motivation of the PIM-enabled instructions work, keeping the modifications to the system small and simple while still being able to take advantage of processing-in-memory capabilities.

There are more questions to answer. One of the advantages of the PEI work on the previous slide is that it targets a single cache line, so it is much easier to handle virtual address translation and memory coherence. But in some cases, and this is what the LazyPIM work addresses, you have both the PIM units and the CPU, the host processor, accessing memory at the same time and potentially working concurrently on the same data, and then you need cache coherence mechanisms between the memory side and the CPU side. Why? Because a PIM core might be updating the contents of a specific memory address while the corresponding cache line is already in the cache hierarchy, and the CPU could end up reading stale data. We want to keep data coherent across the entire system, and that is what LazyPIM tries to do. I think we will talk about this one tomorrow, but here you can already see that it is indeed possible: the LazyPIM approach gets pretty close to an ideal PIM system with no coherence overhead. We will elaborate on this tomorrow, and this is a follow-up work, called CoNDA, that also proposes efficient cache coherence support.

In a PIM system we typically have multiple cores; that is the scenario we are considering in all cases, Tesseract for example, a coarse-grained accelerator with many Tesseract cores that need to communicate. So we need ways of performing this communication, or synchronization; remember that there are barriers. What a barrier guarantees is that the execution of each PIM core stops when it reaches the barrier, until all other cores, all other threads in the system, have reached it, and right after that execution resumes. There are different ways of synchronizing across PIM cores, and some of them may not be efficient: if you have a thousand cores in the system and you want to synchronize all of them, the barrier operation can take quite some time. That is why this more recent work from 2021 proposes efficient ways of performing synchronization in a processing-in-memory system with many PIM cores; I think we will cover this one in some detail tomorrow as well.
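To get an intuition for why naive synchronization hurts at this scale, consider a plain centralized sense-reversing barrier over shared memory, sketched below; this is a generic textbook scheme, not the mechanism proposed in that 2021 work, and with thousands of PIM cores every arrival and every spin iteration hits the same shared location, possibly in remote memory.

```c
#include <stdatomic.h>

/* Naive centralized sense-reversing barrier (generic sketch).
 * Initialize with count = 0, sense = 0, n = number of cores;
 * each core keeps its own local_sense, initially 0. */
typedef struct {
    atomic_int count;
    atomic_int sense;
    int        n;
} barrier_t;

void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_add(&b->count, 1) == b->n - 1) {
        atomic_store(&b->count, 0);            /* last arrival resets and  */
        atomic_store(&b->sense, *local_sense); /* releases everyone        */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                   /* spin on the shared flag */
    }
}
```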
Then there is support for virtual memory. We have already discussed one way of supporting virtual memory, in the PEI work, but we may need more sophisticated approaches, and in tomorrow's lecture we will talk about a work presented in 2016 that has a pretty nice way of supporting virtual memory in processing-in-memory cores. So those are some of the barriers we still need to overcome in order to support processing in memory in real systems. How to enable processing in memory in the real world is a question with many different facets, and for each of them we need different solutions. Another thing we need are real systems, systems that allow us to experiment with processing in memory and explore its potential for different workloads; that is why we can consider this part, processing in memory in the real world, as part of eliminating the adoption barriers.

In this course, as you already know, you are going to have a lab where you will program a real-world processing-in-memory system from UPMEM. The most interesting thing about that experience, I think, is that it is a parallel programming lab, so you can refresh your parallel programming a little, which is good, but at the same time you will see what the issues are when using a real processing-in-memory system: what potential it offers for performance improvement, but also what limitations it has. Working with a real system allows us to understand its true potential, identify its inefficiencies, and propose ways of improving the system and overcoming those inefficiencies. This way we can build more energy-efficient and higher-performance computing architectures with minimal data movement.

These are different points we will cover in tomorrow's lecture in more detail, but starting from the top: applications and software for PIM. If we want to enable processing in memory in the real world, we need to understand existing processing-in-memory systems, how to program them, how to make use of them, and how software needs to be written, because we will likely have to learn something new to use these systems, and that is okay as long as we get benefits from doing it. Fifteen years ago, when GPUs became general purpose, people had to learn how to program them for general-purpose processing, and that is why many people started learning CUDA or OpenCL; today you will find GPU programming courses in probably every university teaching computer science. That made sense because we use GPUs for important workloads these days. Of course, before GPU programming became mainstream, it was necessary to show the large potential benefits of using GPUs for general-purpose computation. Right now we are in a more or less similar situation: we need to understand the potential benefits of processing in memory, and at the same time we need to develop ways of making the use of these processing-in-memory systems easier. That is basically what we
are trying to do here, and that is why we want to experiment with a real-world processing-in-memory system. We have some experience with this UPMEM PIM system; we have studied it in several papers, this was the first one we published, and we have also made a lot of code open source for this system. This is a benchmark suite we developed in that first work, exploring different compute patterns on this processing-in-memory system. As you already know, it is based on DDR4 DIMMs with small processors inside the DRAM chips.

This slide actually comes from the vendor itself, from UPMEM, and you can already see some of the promises they make in terms of energy savings and speedup, estimates of around 10 to 20 times faster than doing the computation on the CPU. But that did not come for free. One thing they point out in this slide is that fabricating DRAM chips with processing elements is pretty challenging, something I mentioned briefly at the beginning of today's lecture: transistors are significantly slower inside DRAM because they are designed with different characteristics. We do not need DRAM transistors to be fast; they have a different purpose, they are just switches that connect a DRAM cell to the bitline. And the logic in a DRAM process is less dense than in an ASIC CMOS process; there are not as many metal layers for routing. All of that makes designing a processor inside a DRAM chip much more challenging. But in the end they made it, and they filed patents about it. I like the abstract of this patent because it already tells us what the system is about. First of all, it is a memory circuit that has a memory array, as we would expect, and inside there is a processor, called the first processor here; it is a small processor, called a DPU in the actual product, as you may remember. There is also a control interface. Why do we need a control interface? Because this first processor is in reality sort of a slave of a central processor, which is the CPU, the host. The central processor sends requests to the first processor, sends kernels to be executed by it, and while the first processor is computing, the DRAM banks are accessible only to the first processor; the banks become accessible to the central processor again when the first processors are done. That is what the system is about.

In the end we should see the UPMEM DIMMs, the UPMEM PIM DIMMs, as an external accelerator. If you think about the system we are starting to describe, you see that it is sort of a coarse-grained accelerator, in a similar way to Tesseract, even though the memory technology is completely different, this is DDR while the other was HMC, and the processors are different as well. But we can see this UPMEM system as an external accelerator, so it resembles Tesseract, and it also resembles GPU computing, where the GPU is an accelerator sitting on the other side of the PCI Express bus: when we want to use it, we have to send the data from CPU memory to GPU memory, perform the computation on the GPU, and finally return the results to the CPU
memory. This is what we are going to do with the UPMEM PIM system as well. First, the CPU, called the SoC, system-on-chip, in this slide, loads the data to be processed into the DRAM memory banks of the PIM-enabled memory. Then it transmits a data-processing command to the DRAM processors, the DPUs; this is like launching a kernel onto the DPUs. Then the DPUs start executing. They compute for a certain time, a certain number of cycles, maybe millions of cycles, and in the meantime the CPU checks whether the computation has finished, because when it finishes the banks become accessible to the CPU again, and the CPU can go there and retrieve the results. Observe that this is pretty similar to what we do in GPU computing as well.

Here you have the first picture of the system organization; it is pretty simple because it comes from the patent. You can see the CPU and then the different DIMMs, or DRAM chips, with memory arrays and processors. And here is a different picture of the entire system, with the host CPU, the main memory, because there are still conventional DRAM DIMMs used as main memory, and the PIM-enabled DIMMs. We can take a closer look at each of these DIMMs. A DIMM typically has one or two ranks, the most recent ones have two, of eight chips each, so in that sense it is pretty similar to a conventional DDR4 DIMM. Inside each chip there are eight DRAM banks, each with a size of 64 MB; they are called MRAM, even though they are essentially conventional DRAM. Next to each bank there is a small processor. If you look at it, it is a pipeline; it looks somewhat similar to the MIPS pipeline you may have studied in your bachelor courses. We will have more detailed slides on this, but as you see, next to each of the eight banks there is one pipeline and two small SRAM-based memories: one for instructions and one for operands. The operand memory behaves as a kind of cache, but it is not a hardware-managed cache like the ones in your laptop or cell phone; it has to be managed by the programmer. We must explicitly request data from the DRAM bank and bring it into the WRAM, the scratchpad or software-managed cache, and once the data is there we can operate on it using the ALUs. We will see this in more detail later.

This is another picture of the current UPMEM-based PIM system, which has up to 20 DIMMs, 40 ranks in total, so 2,560 DPUs in total. As you see, it is a dual-socket CPU connected to both main memory and PIM-enabled memory; in the PIM-enabled memory we have all these DPUs and, in total, 160 GB of PIM-enabled memory. And this is a picture you have already seen before; now it probably makes much more sense. You can see all the DIMMs there in the memory slots: some of them hold the PIM-enabled memory, others the main DRAM.
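Tying that offload flow to code, a host program in the UPMEM SDK looks roughly like the sketch below, here for a single DPU to keep it short; the dpu_alloc / dpu_copy_to / dpu_launch / dpu_copy_from calls follow the SDK documentation as far as I recall it, but treat the exact signatures, the binary path and the symbol names (which anticipate the vector addition example coming next) as approximate, and a real program would allocate many DPUs and split the data across them.

```c
#include <dpu.h>        /* UPMEM host-side SDK */
#include <stdint.h>

#define N (1 << 20)     /* one DPU's slice: 4 MB per vector, well under 64 MB MRAM */

int main(void) {
    static int32_t A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

    struct dpu_set_t set;
    DPU_ASSERT(dpu_alloc(1, NULL, &set));                 /* grab one DPU          */
    DPU_ASSERT(dpu_load(set, "./vecadd_dpu", NULL));      /* load the DPU binary   */

    DPU_ASSERT(dpu_copy_to(set, "A", 0, A, sizeof(A)));   /* CPU -> MRAM           */
    DPU_ASSERT(dpu_copy_to(set, "B", 0, B, sizeof(B)));

    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));         /* run the kernel, wait  */

    DPU_ASSERT(dpu_copy_from(set, "C", 0, C, sizeof(C))); /* MRAM -> CPU (results) */
    DPU_ASSERT(dpu_free(set));
    return 0;
}
```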
Any questions so far about this system? Okay, then let me continue for maybe four or five more minutes and we will be done until tomorrow. We are going to keep talking about this UPMEM PIM system and about how to program it. We will not cover the entire presentation about the system today, but I want to give you at least the basic background on UPMEM programming, because that is what you need for the lab.

Our first programming example is vector addition, the simplest thing you will have to program. In vector addition the only thing we do is the element-wise addition of two input vectors, A and B, storing the output in a vector C. If we have a system with multiple processors, and the processors here are called DPUs, what we do is partition the workload across the different processors. Imagine we have four DPUs: we divide the input and output vectors into four large chunks and assign one chunk to each DPU. What should the size of these chunks be? It depends on the amount of memory each DPU has. Do you remember how much memory each DPU has access to? A DRAM bank of 64 MB. That is the DRAM available to each DPU, and it defines the largest chunk we can assign to a DPU. Then, inside the DPU, I will call it DPU or sometimes PIM core, we run multiple threads, in the same way that you can program threads with OpenMP, pthreads, or C++ threads on your multicore CPU. We can program these DPUs with multiple software threads, which are called tasklets. So inside each DPU we partition the workload again and assign different chunks of the input to different tasklets. This way we can exploit the large data-level parallelism of this simple vector addition. By the way, the typical numbers are not four DPUs and two tasklets: we have more than 2,500 DPUs, as we have seen, and the typical number of tasklets is at least 11, we will see why tomorrow; in your lab you will use something like 8, 12, or 16, and the maximum number of tasklets per DPU is 24.
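A DPU-side kernel for this vector addition, with the work split across tasklets, could look roughly like the sketch below; the headers, the me()/mram_read/mram_write calls and the MRAM declarations follow the UPMEM SDK documentation as far as I recall it, so check the exact names, alignment rules, and the compile-time NR_TASKLETS definition against the SDK manual before relying on this.

```c
#include <defs.h>      /* me()                        */
#include <mram.h>      /* mram_read / mram_write      */
#include <alloc.h>     /* mem_alloc (WRAM heap)       */
#include <stdint.h>

#define N     (1 << 20)          /* this DPU's slice of each vector       */
#define BLOCK 256                /* elements per MRAM<->WRAM transfer     */

/* Buffers in the DPU's 64 MB MRAM bank, filled and read by the host. */
__mram_noinit int32_t A[N];
__mram_noinit int32_t B[N];
__mram_noinit int32_t C[N];

int main() {
    unsigned id = me();          /* tasklet id, 0 .. NR_TASKLETS-1
                                    (NR_TASKLETS is a compile-time define) */

    /* Per-tasklet staging buffers in WRAM, the small software-managed
       scratchpad; data has to be moved in and out explicitly. */
    int32_t *a = mem_alloc(BLOCK * sizeof(int32_t));
    int32_t *b = mem_alloc(BLOCK * sizeof(int32_t));
    int32_t *c = mem_alloc(BLOCK * sizeof(int32_t));

    /* Tasklets take interleaved blocks of this DPU's slice. */
    for (unsigned base = id * BLOCK; base < N; base += NR_TASKLETS * BLOCK) {
        mram_read(&A[base], a, BLOCK * sizeof(int32_t));   /* MRAM -> WRAM */
        mram_read(&B[base], b, BLOCK * sizeof(int32_t));
        for (unsigned i = 0; i < BLOCK; i++)
            c[i] = a[i] + b[i];                            /* compute in WRAM */
        mram_write(c, &C[base], BLOCK * sizeof(int32_t));  /* WRAM -> MRAM */
    }
    return 0;
}
```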
Here you can see a link to the user manual; that is actually an old version, there is a newer one from 2023. It is a key reference for you, and you can also access the SDK documentation, which explains everything you need to know to use this system and program it.

Before finishing, I would like to give you some general programming recommendations. They come from different places, but as you will see they are quite intuitive. If you want to make use of an accelerator, and the entire PIM system is an accelerator, remember that we can see it as a coarse-grained accelerator, you probably want to accelerate the workload as much as you can: we have a lot of bandwidth and a lot of PIM cores, so let's try to use them as much as possible. That is why one general recommendation is to execute on the DPUs, the DRAM processing units, portions of parallel code that are as long as possible. That is something you would do in this system, but also on other types of accelerators, because if you offload more computation, you have less communication between the host and the accelerator, between the host CPU and the PIM system, or between the host CPU and the GPU if we are talking about GPUs. So this recommendation is fairly general; it does not only apply to the PIM system we are working with here.

The second important recommendation is to split the workload into independent data blocks which the DPUs operate on independently. Why? First of all, because if you are a PIM core with direct access to some part of the memory that you can access faster than the rest, you want to stay there; performance is higher when accessing local memory. That makes sense everywhere, it makes sense in Tesseract as well, as we have discussed, but it makes even more sense in this PIM system, because each individual DPU has direct access only to its own DRAM bank. All communication between DPUs, as we will discuss in more detail tomorrow, has to go through the host CPU, and that is not a good idea, which is why we want to work on independent data blocks. Of course, if you have a parallel system and you want to accelerate your workload, you should use as many processing-in-memory cores as you can: there are more than 2,500 DPUs in the system, so use as many as possible. And finally, launch at least 11 tasklets, the software threads that run on the DPUs. Why 11? We will see this tomorrow, but it is related to the number of pipeline stages: remember that the DPU is an in-order pipelined processor with a certain number of stages, and we want to keep all the stages of the pipeline busy, which is something we can achieve when we have at least 11 tasklets. Any questions? No? Okay, then I think that is enough for today. We will continue tomorrow with this UPMEM programming that we have just started; it will probably take the first half of tomorrow's lecture, maybe one hour or so, in lecture 4A, and then we will continue talking about processing in memory in the enabling-PIM lecture, 4B. So thank you very much for your attention, and see you tomorrow.