Transcript for:
[Lecture 13] Exploring Computer Architecture Advances

Okay, yeah, we had some technical issues, what a surprise. Okay, great. First of all, welcome everyone to Lecture 13 of Computer Architecture, and we are live, I hope. Professor Mutlu is traveling, so I will be teaching today and tomorrow; one of my colleagues, John, will also teach. Today we are going to have two lectures. In the first, I'm going to continue memory controllers, our discussion on quality of service and performance, and after finishing that we will jump into emerging memory technologies, which is going to be the main topic for today. But before that, I'd like to conclude our previous lecture. Any questions before we start? Okay, great.

So we were discussing memory scheduling for heterogeneous systems, and this was one of the systems we were looking at. Current system-on-chip architectures have quite a lot of heterogeneity: you can have different CPU cores, large cores as well as small cores, GPUs, hardware accelerators, and DMA engines, and the memory controllers should be able to handle all of these devices with their different characteristics. The applications running on all these cores and processing elements also have different characteristics. (Could you please also close the door? Thank you.)

So we wanted to see how to allocate resources to heterogeneous agents to mitigate interference and provide predictable performance. I'm not going to go into detail about all of this again; I'm just going to flash some of the slides that we showed. If you want to learn more about heterogeneous-system scheduling, you can watch this lecture from Professor Mutlu, and in general, if you want to learn about heterogeneous computing systems, you can check this other lecture. We are going to have that kind of lecture this semester as well, but if you are impatient you can go ahead and watch it now. We have done a lot of work in this domain and designed different memory scheduling techniques. Staged memory scheduling is one of them, and it provides quite good performance in CPU-GPU integrated systems. I mentioned previously, a long time ago, in lecture two or so, that GPUs are usually throughput-oriented: applications running on a GPU issue a lot of memory accesses, and at the same time they are latency-tolerant. So for GPU applications, or for the GPU in general, what matters most is that you provide good bandwidth to memory; as long as the memory access latency stays within a tolerable range, you can hide it using thread scheduling. CPU applications, on the other hand, are quite latency-sensitive most of the time, so dealing with these two together is quite challenging, and this work looks into exactly that. We have done more work in this area that you can also take a look at, and on other topics like handling CPU and I/O interference: decoupled direct memory access (DDMA), where we isolate CPU and I/O traffic by leveraging a dual-data-port DRAM. As I said, I'm not going to go into the details of these.

Then we also jumped into predictable performance: how we can provide strong memory service guarantees. Again, we come back to this example where we want to allocate resources to these heterogeneous agents to mitigate interference and provide predictable performance. If you want to learn more about predictable performance, you can also watch this lecture from Professor Mutlu.
These are some papers, some of which I'm going to cover a little bit later. For example, source throttling: essentially you assume that your resources are dumb, that they don't do any intelligent scheduling, but you can throttle from the source to reduce interference. In my GPU-and-CPU example, one option, when you see that CPU threads are getting degraded because they cannot get enough scheduling slots, is simply to ask the GPU to send fewer requests. You can throttle the GPU threads so that they do not issue too many requests. That's an example of throttling, and it can be useful to avoid interference and provide predictable performance in many cases. As I said, we continued in this direction and have published quite a lot of different works.

But I want to look into QoS approaches in a little more detail. Remember the fundamental interference control techniques that we discussed: the goal is to reduce and control inter-thread memory interference, and we can do prioritization or request scheduling, data mapping to banks, channels, and ranks, core and source throttling, and application and thread scheduling. I'd like to touch upon techniques two, three, and four here again. If you want to learn more about QoS techniques, you can also watch this lecture from a previous offering; the memory channel partitioning that you're going to see is actually one of the techniques covered there.

So in this work, I'm going to discuss a little bit about how memory channel partitioning can help. In this example, consider that we have two applications, a red application and a blue one, and two channels to access memory. One case could be that all requests are scattered across the two channels, so channel 0 has to service requests from the red application as well as the blue application, and this can cause interference between the applications' requests. One way to handle it is to do channel partitioning: you send all requests from the red application to channel 0 and all requests from the blue application to channel 1, and this can eliminate interference between the applications' requests. So in memory channel partitioning the goal is to eliminate harmful interference between applications, and the basic idea is to map the data of badly interfering applications to different channels. It essentially also involves data mapping, the application-level data mapping, in this domain. There are some key principles you should consider here: we need to separate low and high memory-intensity applications, and we also need to separate low and high row-buffer-locality applications. I will show some examples of both in a moment.

Key insight one is that we want to separate by memory intensity: high memory-intensity applications interfere with low memory-intensity applications in shared memory channels. You can see here that the red application has requests to different banks and issues a lot of requests; at the same time, the blue application has only a few requests. But since we don't have any channel partitioning, the blue application's requests need to wait for the red application's, and that's not good.
With better channel partitioning, if you send all the requests from the red application to channel 0, you can service all of them there, and the requests from the blue application go only to channel 1. You can see that you save a lot of cycles for the blue application, and you also save cycles for the red application at the same time, so in general it gives better performance. Mapping the data of low and high memory-intensity applications to different channels is one way to provide better predictable performance and quality of service.

Another insight we'd like to focus on is to separate by row-buffer locality. There are applications with high row-buffer locality: you can serve their requests from the row buffer, so you don't need to close the row buffer by precharging and activating another row. But there are other applications that have poor row-buffer locality. If you mix these two kinds of applications onto the same channel, or the same bank, it's going to be a mess. If instead you partition these two applications nicely, you can see that the red application here keeps its quite good row-buffer locality while the blue one doesn't have any to begin with. So for the blue requests you need to wait more, because there is no row-buffer locality, but at the same time you don't have to wait nearly as long for the red application, and you can benefit from the good row-buffer locality that it has.

However, there is one thing you have to deal with here: you need to profile the applications, classify the applications into groups, and partition the channels between application groups; after partitioning, you assign a preferred channel to each application and then allocate each application's pages from its preferred channel. Profiling the applications is done in hardware, and all the other steps are handled by system software at the OS. If you look at the timeline, in the current interval you profile the applications, then you do the other steps (classifying, partitioning, and assigning), and in the next interval you enforce the channel preferences and benefit from them.
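To make this profile / classify / partition / assign flow a bit more concrete, here is a minimal sketch of the OS-side decision, assuming a two-channel system; the counter names, the MPKI and row-buffer-hit-rate thresholds, and the grouping are my own illustration, not the exact algorithm from the MCP paper.

```python
# Hypothetical sketch of the OS-side steps of memory channel partitioning (MCP).
# Profile counters would come from hardware; thresholds and names are assumptions.

def classify(apps, mpki_threshold=10.0, rbh_threshold=0.5):
    """Split applications into groups by memory intensity (MPKI) and
    row-buffer hit rate, as measured during the previous interval."""
    groups = {"low_intensity": [], "high_intensity_high_rbh": [],
              "high_intensity_low_rbh": []}
    for app in apps:
        if app["mpki"] < mpki_threshold:
            groups["low_intensity"].append(app)
        elif app["row_buffer_hit_rate"] >= rbh_threshold:
            groups["high_intensity_high_rbh"].append(app)
        else:
            groups["high_intensity_low_rbh"].append(app)
    return groups

def assign_preferred_channels(groups, channels=(0, 1)):
    """Map badly interfering groups to different channels; each application's
    new pages are then allocated from its preferred channel."""
    preferred = {}
    for app in groups["high_intensity_high_rbh"]:
        preferred[app["name"]] = channels[0]
    for app in groups["high_intensity_low_rbh"] + groups["low_intensity"]:
        preferred[app["name"]] = channels[1]
    return preferred

apps = [{"name": "red",  "mpki": 40.0, "row_buffer_hit_rate": 0.8},
        {"name": "blue", "mpki": 25.0, "row_buffer_hit_rate": 0.1}]
print(assign_preferred_channels(classify(apps)))   # e.g. {'red': 0, 'blue': 1}
```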
However, there are some observations we want to discuss now, because they enable a further optimization. Applications with very low memory intensity rarely access memory. When we do channel partitioning, we don't want to end up with one channel with very high utilization and another channel with very low utilization; that's not good. Dedicating channels to these applications wastes precious memory bandwidth. On the other hand, they have the most potential to keep their core busy: applications with low memory intensity are probably quite compute-intensive, so they can keep the core busy for a long time. And they interfere minimally with other applications, so prioritizing them does not hurt others: if you prioritize these low memory-intensity applications, they don't interfere much with other applications, because they don't have many memory accesses, and at the same time prioritizing them keeps the core busy, which is good. We'd like to keep all the elements in our system busy, so this is a useful observation.

That brings us to a new technique, integrated memory partitioning and scheduling (IMPS): you do partitioning, but at the same time you also do scheduling, so you apply the two techniques together. We always prioritize very low memory-intensity applications in the memory scheduler (that's the scheduling part), and at the same time we also do channel partitioning to mitigate interference between the other applications. (I'll show a small sketch of this prioritization rule below.) Here are also some numbers for hardware cost: for memory channel partitioning we do the profiling in hardware, so we need some counters, but we need no modifications to the memory scheduling logic. For the integrated memory partitioning and scheduling technique, which also does scheduling, there is some storage cost per core, some logic you need to implement, and essentially a single extra bit per request.

Now let's take a very quick look at some results. All results are normalized to FR-FCFS, and these techniques, MCP and IMPS, provide better performance compared to the state of the art back then: better system performance than the best previous scheduler at that time, at lower hardware cost. And we should always avoid bad channel partitioning; there is a very good example at a Swiss airport, where you can see how poor channel partitioning creates long waits for everyone. It's useful to keep this analogy in mind when we design computer architectures and, more generally, computer systems.

So combining multiple interference control techniques is a good idea. You saw that channel partitioning alone provides some performance, but when we combine it with a scheduling technique we can get much better performance. It's always good to think about combining different ideas or techniques so that you can hopefully combine their strengths. There are some key challenges, of course, like deciding what technique to apply when, or partitioning the work appropriately between software and hardware. And there are upsides and downsides. On the upside, with MCP and IMPS we keep the memory scheduling hardware simple, we combine multiple interference reduction techniques, and that provides performance isolation across applications mapped to different channels. The general idea of partitioning can also be extended to smaller granularities in the memory hierarchy; it doesn't need to be channel partitioning, it can be across banks or subarrays, and so on. But there are downsides as well. Reacting is difficult: we profile and then make a decision for the next interval, so you should always keep in mind that reaction is not fast, and sometimes you cannot really react to workload changes that happen frequently. And the overhead of moving pages between channels can restrict the benefits: sometimes, to obey a new channel partitioning, you need to move data from one channel to another, and that causes actual data movement, which we don't like either.
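Here is the small sketch of the IMPS prioritization rule promised above: a request from a very-low-memory-intensity application always wins, and everything else falls back to a baseline FR-FCFS-style order within its (partitioned) channel. The queue format and the single "low intensity" flag per request are illustrative assumptions.

```python
# Illustrative sketch of the IMPS request prioritization rule: low-intensity
# applications first, then the usual FR-FCFS order (row hits first, oldest first).
# The one extra bit per request mentioned above is the 'low_intensity' flag.

def pick_next_request(queue):
    """queue: list of dicts with 'low_intensity' (bool), 'row_hit' (bool),
    'arrival' (int). Returns the request to service next."""
    def priority(req):
        # Lower tuple sorts first: low-intensity requests, then row hits,
        # then oldest arrival.
        return (not req["low_intensity"], not req["row_hit"], req["arrival"])
    return min(queue, key=priority)

queue = [{"low_intensity": False, "row_hit": True,  "arrival": 3},
         {"low_intensity": True,  "row_hit": False, "arrival": 7},
         {"low_intensity": False, "row_hit": False, "arrival": 1}]
print(pick_next_request(queue))  # the low-intensity request wins
```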
Those data-movement overheads should always be considered, and they can restrict the benefits of this technique. If you want to learn more, you can check this paper. But let's also talk a little bit about source throttling. Any questions? Am I going too fast? Okay. I wanted to finish the first lecture in 30 minutes, but let's see — I mean, 30 minutes from the time we were supposed to start.

Okay, so source throttling. The key idea is that we want to manage inter-thread interference at the cores, the sources, not at the shared resources. We dynamically estimate unfairness in the memory system, feed this information back into a controller, and then throttle the cores' memory access rates accordingly. Back to my example: you can, for instance, throttle GPU threads so that they do not send memory requests. What does that mean? A very simple implementation is to just stall those threads, but there are other ways to do it. For example, you have the MSHRs for the L1 cache, and you can limit the number of outstanding requests in the MSHRs; that reduces how aggressively you access memory. So if unfairness is greater than a system-software-specified target — there should be some threshold here — we throttle down the core causing the unfairness and throttle up the core that was unfairly treated. You can of course check this paper to learn more about it.

There are two components here. First, there is runtime unfairness evaluation in hardware: there should be some mechanism to figure out whether we have unfairness or not. It needs to dynamically estimate the unfairness, for example by calculating application slowdowns in the memory system, and then estimate which application is slowing down which other one. The second component is dynamic request throttling: with that, you adjust how aggressively each core makes requests to the shared resources, and by throttling down requests you can remove some of the unfairness you observed. For example, you can limit the MSHRs, or limit the injection rate into the MSHR table.

So this is Fairness via Source Throttling (FST), a paper from ASPLOS 2010. We have different intervals at runtime during which we do this runtime unfairness evaluation. In each interval we do slowdown estimation, we find the application with the highest slowdown, and we find the application causing the most interference to that slowest application; we call these App-slowest and App-interfering. Now that we have this unfairness estimate and we know which application is the slowest one and which is the interfering one, we can enter dynamic request throttling: essentially, throttle down App-interfering and throttle up App-slowest. Let's see how we can do this dynamic request throttling. We adjust how aggressively each core makes requests to the shared memory system. As mechanisms, we can use the MSHR quota, which controls the number of concurrent requests accessing shared resources from each application, and we can also tune the request injection frequency, which controls how often memory requests are issued to the last-level cache from the MSHRs.
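To summarize the interval-based FST decision in one place, here is a rough sketch. Slowdown and interference estimation are the hard parts and happen in hardware, so here they are simply given as inputs; the unfairness target of 1.4, the 10% throttling step, and the 2% floor are made-up example values, not the paper's, and the "levels" are the per-core throttling levels discussed on the next slide.

```python
# Rough sketch of one Fairness via Source Throttling (FST) interval.
# Estimates are inputs; thresholds and step sizes are illustrative assumptions.

def fst_interval(slowdowns, interference, levels, target_unfairness=1.4, step=10):
    """slowdowns: {app: estimated slowdown}; interference[a][b]: estimated
    interference app a causes to app b; levels: {app: throttling level in
    percent, 100 = unthrottled}. Returns updated throttling levels."""
    unfairness = max(slowdowns.values()) / min(slowdowns.values())
    if unfairness <= target_unfairness:
        return levels                        # fair enough, leave everyone alone
    app_slowest = max(slowdowns, key=slowdowns.get)
    # Application causing the most interference to the slowest application:
    app_interfering = max(interference, key=lambda a: interference[a][app_slowest])
    levels[app_interfering] = max(2, levels[app_interfering] - step)   # throttle down
    levels[app_slowest] = min(100, levels[app_slowest] + step)         # throttle up
    return levels

levels = {"gpu": 100, "cpu": 100}
slowdowns = {"gpu": 1.1, "cpu": 2.0}
interference = {"gpu": {"cpu": 0.8, "gpu": 0.0}, "cpu": {"cpu": 0.0, "gpu": 0.1}}
print(fst_interval(slowdowns, interference, levels))  # the GPU gets throttled down
```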
Here you can see an example of the throttling level assigned to each core. For example, when the throttling level is around 10%, your MSHR quota is 12 and your injection rate is one request every 10 cycles. So you control this throttling level: a throttling level of 100% means you don't throttle at all (it works the other way around), and when it is 2%, you are throttling that thread or application a lot.

Fairness objectives can be configured by system software. One option is to keep the maximum slowdown in check: the estimated maximum slowdown should stay below a target maximum slowdown. Another is to keep the slowdown of particular applications in check to achieve a particular performance target, i.e., the estimated slowdown of application i should stay below its target slowdown. There might be applications whose quality of service is especially important to you, so you include them in your fairness objective; sometimes all applications are equal, and you can treat them the same and just use the maximum slowdown as the fairness metric. There can be other formulations as well. To support thread priorities, we basically need a weighted slowdown: each slowdown is multiplied by the weight of that thread. With that, if you have different applications with different priorities, this metric can be more useful.

(Answering a question from the audience:) My short answer is that I don't remember, but in my opinion nothing prevents you from going even finer-grained. I don't see any reason you cannot, for example, define a throttling level of 20% with an MSHR quota of, say, 20. I believe these were just example configurations. That said, I don't remember the exact mechanism, but fundamentally it doesn't need to be only these discrete levels.
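As a small illustration of how a throttling level could translate into the two enforcement knobs (MSHR quota and injection rate), and of the weighted-slowdown metric for thread priorities, here is a sketch; apart from the 10% point from the slide, the table entries are hypothetical.

```python
# Illustrative throttling-level table; only the 10% row is from the slide,
# the other entries are assumed values for the sake of the example.

THROTTLING_TABLE = {
    100: {"mshr_quota": 128, "inject_every_n_cycles": 1},   # unthrottled (assumed)
    50:  {"mshr_quota": 64,  "inject_every_n_cycles": 2},   # assumed
    10:  {"mshr_quota": 12,  "inject_every_n_cycles": 10},  # from the slide
    2:   {"mshr_quota": 2,   "inject_every_n_cycles": 50},  # assumed
}

def knobs_for_level(level):
    """Pick the closest configured throttling level at or below `level`."""
    usable = [l for l in THROTTLING_TABLE if l <= level]
    return THROTTLING_TABLE[max(usable)] if usable else THROTTLING_TABLE[2]

def weighted_slowdown(slowdown, weight):
    """Thread-priority support: scale each slowdown by the thread's weight."""
    return slowdown * weight

print(knobs_for_level(10))                  # {'mshr_quota': 12, 'inject_every_n_cycles': 10}
print(weighted_slowdown(2.0, weight=1.5))   # 3.0
```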
Okay, so let's conclude with some takeaways about source throttling. Source throttling alone provides better performance than the combination of smart memory scheduling and fair caching, which is good. And neither source throttling alone nor smart resources alone provides the best performance. (When we talk about smart resources, we mean scheduling techniques: once you have a shared resource, that resource also tries to schedule requests in a smart way so that you can avoid interference.) So, as we discussed, combined approaches are even more powerful, and we observed that in our earlier example, where combining memory channel partitioning with memory scheduling provided better performance.

Let's also discuss some ups and downs of source throttling. On the good side, core and request throttling are easy to implement, and there is no need to change the memory scheduling algorithm; these are really nice properties. It can be a general way of handling shared-resource contention, and it can reduce the overall load and contention in the memory system: you essentially just turn down the volume of that thread or application, the whole system gets some breathing room for a while, and other threads or applications can make progress. But of course there are disadvantages. It requires slowdown estimation, which is difficult, and the estimates are not always accurate. The thresholds can become difficult to optimize: how do you tune them? It can cause throughput loss due to too much throttling, and it can be difficult to find an overall good configuration for source throttling. Makes sense, right? Okay. If you want to learn more, you can check this paper. We have also done more work on source throttling; I'm going to flash some papers that you can check later if you are interested. They are also going to be part of your reading in homework three, I believe. We can apply source throttling in on-chip networks as well, and there are these other works too.

Now let's summarize this lecture. The goal, as you remember, was to reduce and control interference, and we discussed different ways to do that: prioritization or request scheduling, data mapping to banks, channels, and ranks, core and source throttling, and application and thread scheduling — and, of course, the best is to combine them, and we discussed how to do that in some examples. Among the approaches we discussed is smart versus dumb resources: a smart-resources approach could be QoS-aware memory scheduling, while for dumb resources we discussed approaches like source throttling and channel partitioning. Both kinds of approaches are effective at reducing interference, and there is no single best approach for all workloads, so in the end a combination of them makes sense. We designed techniques based on these ideas — request and thread scheduling, source throttling, and memory partitioning — and we showed that all of these approaches are effective at reducing interference and can be applied at different levels, hardware versus software. There is no single best technique for all workloads, and, as we discussed, a combination of them provides quite good performance. When you have a QoS-unaware memory, you have an uncontrollable and unpredictable system; that is why providing QoS awareness improves the performance, predictability, fairness, and utilization of the memory system. We discussed many new techniques to minimize memory interference and provide predictable performance, and many new research ideas are needed to integrate these techniques. We also didn't have time to cover a lot of other techniques; you can see some listed here, for example prefetcher-aware shared-resource management, DRAM controller co-design, cache interference management, and many more.

What may the future bring? Memory QoS techniques for heterogeneous systems are essential: many accelerators, processing in and near memory, better predictability, higher performance, and combinations of memory QoS and performance techniques like data mapping and scheduling. At the same time, we can use machine learning techniques to manage resources. This is quite interesting, and we also discussed it at the beginning of this course: data-driven architectures that use machine learning techniques to learn from your data. We also showed a self-optimizing memory controller that was learning from the application and data characteristics.
Okay, so with that I'm going to conclude this lecture. Any questions? Okay, I think it's too early for a break — who wants a break now? Okay, no. Can we switch to the next lecture? Everything looks good on Zoom? Okay.

Finally we are into this topic, which I'm quite excited about: emerging memory technologies. In this lecture we're going to discuss what these emerging memory technologies are, and we're going to see some case studies and techniques that we and others have developed to enable them. As we discussed in the past, we showed a lot of scalability issues for DRAM, and it's true that we need to work on DRAM and make it smarter, for example using intelligent memory controllers, so that we can push the boundaries and make DRAM better and better. But at the same time it's also good to think about other memory technologies that could potentially replace, or complement, DRAM and other memory modules. Think about flash memory: flash was also an emerging memory technology in the past — not the very distant past, the 1980s or so — and people worked on it to make it work. There were also a lot of people who were quite pessimistic about this technology, but at some point all that research and development paid off, and now flash memories are, I would say, one of the key reasons we can all have these smartphones in our pockets. Otherwise, how could I carry a smartphone that had, for example, a hard disk? It would be rotating in my pocket; it's just impossible. We're going to have a very thorough lecture on flash memory and solid-state drives next week, so stay tuned for that. But today we are going to look into memory technologies that are still emerging — flash memory is not considered emerging anymore, I would say; it was emerging in the past.

Okay, so we discussed a lot about the limits of charge memory; here is a very quick reminder to jog your memory. We have difficult charge placement control in flash memory: there is a floating gate and you need to control the charge in that floating gate. (We're going to learn about the background of flash memory in detail next week, so I don't want to spend much time on it today.) For DRAM, there is a capacitor in which you need to store charge, and there is transistor leakage. In flash you also have some leakage; it's not as fast as in DRAM — in DRAM you need frequent refresh, but in flash memory you also need to refresh at some point. If you store data in your flash memory and don't touch it for a few years, it's very unlikely that you can recover your data cleanly; there may be some data that you cannot recover at all. Fortunately, our flash memories are now integrated into SSDs, and SSDs use quite intelligent controllers — the SSD controller and the flash controller — and those controllers do a lot of intelligent things to make flash work. We're going to learn about them next week, hopefully.

Okay, so reliable sensing becomes difficult as the charge storage unit size shrinks, and that's quite understandable: when you scale it down, the cell gets smaller and smaller, and at some point you cannot do reliable sensing, because you cannot reliably store the charge and you cannot reliably sense it.
That's why these architectures fundamentally have scaling issues. We discussed different solutions. Solution one was new memory architectures: overcome memory shortcomings with memory-centric system design, for example novel memory architectures, interfaces, and functions, and better waste management — for example, you have seen in our latency work that we don't always need to design for the worst case, and the same applies to voltage or to refresh. There are some key issues to tackle, and we tackled many of them in the works we discussed in the past few weeks: enabling reliability at low cost, reducing energy, reducing latency (you have seen a lot of this in our lectures), improving bandwidth, reducing waste, and enabling computation close to data. There were also works we didn't discuss, like compression: we didn't cover memory compression, where you compress your data to improve your effective capacity. That can be quite useful, because you can use your current memory capacity for more data, which relieves some of the pressure for ever more scaling — one reason for more scaling is to provide capacity and density, but with compression, with the same capacity you can store, say, four times more data. These are some of the works, and we have discussed some of them in the past weeks.

Now we want to focus on solution two, which is emerging memory technologies. There are emerging resistive memory technologies; they seem more scalable than DRAM, and they are also non-volatile. Remember that DRAM is volatile, meaning you need to keep it powered to keep it charged, and you also have to refresh it; these resistive memories are non-volatile, so they don't need power to keep their data. There are some examples here. One is phase-change memory (PCM): data is stored by changing the phase of a material. There is a material here, which we'll see in the next few slides, and depending on its phase — it can have two phases, amorphous and crystalline — you can encode zero and one; data is read by detecting the material's resistance. PCM was actually developed in the 1960s, if I'm not mistaken, and it has been used in rewritable CDs, because this material also has different optical reflectivities: if you shine light on the material, then depending on how the material reflects it you can encode zero and one, and that has been used for rewritable CDs. Of course, that is not a fast writing, programming, or reading process, but it worked for rewritable CDs. Later on, people worked on it and realized that the phase of the material also changes its electrical resistance, which can be translated into an electrical current. People designed sensing circuitry, and based on that circuitry they could decode the state of the material and encode it as zero and one. IBM, for example, has done a lot of work in this area; Intel and other companies have as well.
This is one of the works — one of the earliest, I believe — from IBM, published in the IBM Journal of Research and Development in 2008: they designed a prototype of PCM at 20 nanometers. It's quite interesting that this prototype in 2008 was at 20 nm, while DRAM technology only reached around 20 nm in 2015 or so; that shows these architectures are easier to scale. And when DRAM also reached the 20 nm scale, you started to observe more issues as well. It's also interesting that these architectures are expected to be denser than DRAM, because they can store multiple bits per cell. In my example I said you encode the resistance as zero or one, but if you chop that resistance window into, say, four regions, you can encode two bits, or into eight regions for three bits. Of course, at some point it may not be reliable, but this is possible, and people have done it — in flash memory, for example, which we're going to see later.

However, emerging technologies have many shortcomings, and the question is whether they can be enabled to replace, augment, or surpass DRAM. Replacing DRAM completely is quite hard to achieve, but augmenting DRAM might be easier. This is going to be the topic of this lecture, and we're going to touch upon all these options: can we replace DRAM completely with PCM, or can we augment it, using PCM and DRAM together? There are also some characteristics of these technologies that are not available in DRAM. As we discussed, these memory technologies are non-volatile, which means you can use them, for example, as persistent memory. Currently, when you have persistent data you always need to rely on your disk or SSD, but if you have a main memory that is non-volatile, you can also treat it as persistent memory and program against it as such. That can improve performance for many database applications, for example, because you get much faster access to persistent data. We're going to see a lot about these topics, and here are some example works that we're going to discuss today — not all of them, of course.

Industry has also looked into these architectures, and this was one that was commercialized: Intel Optane persistent memory, in 2019. This is a non-volatile memory based on what they call 3D XPoint technology, but if you really look into the datasheets and the papers about it, it's essentially very similar to PCM. I'm not sure if it's still available to buy, because Intel discontinued Optane; in the past it was available, and maybe it still is — does anyone know? Back then it was also quite expensive, but people bought these DIMMs and put them into their computers, for example as a replacement for DRAM, and they did a lot of research on them to see how they could help. In any case, it's good that we are also observing movement from industry toward these emerging memory technologies. This paper is going to be one of the main papers we discuss today.
In it, we discuss how we can architect phase-change memory as a scalable DRAM alternative. Here we are quite aggressive: we want to just replace DRAM completely with PCM and then see how it works, and if it's not good, how we can fix or eliminate the shortcomings to make it work. If you want to learn more you can check the paper, but we will also discuss it today.

Before going into too much detail, let's talk a bit about charge versus resistive memories. In charge memory, like DRAM and flash, you write data by capturing charge and read data by detecting voltage. In resistive memory, like PCM, STT-MRAM (which you're going to see later), and memristors (which you will also see later), you write data by pulsing current and read data by detecting resistance. That's the main difference. There are several promising resistive memory technologies. In PCM, you inject current to change the material phase, and resistance is determined by the phase. In STT-MRAM, you inject current to change the polarity of a magnet — there is a very small magnet, call it a nanomagnet — and resistance is determined by the polarity of that magnet. Memristors and RRAM, which we're not going to discuss much today, have a similar fundamental approach: you inject current to change the atomic structure, and resistance is determined by atomic distance. So in each case there is a way to change and then determine a resistance, and that is the key difference that provides a lot of good properties: resistance, as we discussed, is non-volatile, and fundamentally you can scale it down more easily. But it also brings issues: when you want to inject current to change the magnet polarity in STT-MRAM, or the material phase in PCM, you need to spend a lot of energy and reach high temperatures, and those cause a lot of issues that we're going to see.

We will start with PCM. What is phase-change memory? The phase-change material is a chalcogenide glass that exists in two states: amorphous, with low optical reflectivity and high electrical resistivity, and crystalline, with high optical reflectivity and low electrical resistivity. We already discussed that using these optical reflectivities you can build rewritable CDs, but now we are more interested in the differences in electrical resistivity. This is the PCM material, this is the heater, and the metal here connects to the access transistor; you can think of the cell as a variable resistor whose resistance you can tune. So PCM is a resistive memory: high resistance can be encoded as zero, for example, and low resistance as one, and a PCM cell can be switched between states reliably and quickly — though, as we'll see, "quickly" is not as quick as DRAM. How does PCM work? To write, you need to change the phase via current injection. There are two fundamental operations, SET and RESET.
For SET, you sustain a current to heat the cell above the crystallization temperature: you apply a current and you need to keep it sustained so that the material heats up above the crystallization temperature, and it's important that you keep it applied for a longer time — that's the main difference. For the RESET operation, you inject quite a high current so the cell is heated above the melting temperature, and then you quench it quickly; that gets you to the amorphous state. For reading, you detect the phase via the material's resistance, amorphous or crystalline. The interesting thing is that after a SET, in the crystalline state, you get a resistance of roughly 1 kΩ to 10 kΩ, while in the amorphous state you have high resistance, around 1 MΩ or even 10 MΩ. So there is a clear difference between the two resistance states, and that large difference is what makes it possible to read reliably from these memories.

There are some opportunities that PCM provides. It scales better than DRAM and flash: it requires current pulses, which scale linearly with feature size, and it is expected to scale to 9 nm — that was the ITRS projection for 2022 — and we even have the 20 nm prototype from IBM's work. It can be denser than DRAM: first, it can scale down better, but it can also store multiple bits per cell thanks to the large resistance range. There are prototypes with two bits per cell (a paper at ISSCC) and four bits per cell in another prototype from 2012. But there is a trade-off. When you increase the number of bits per cell — you had a quite large difference between the resistance values when you had only two states, but if you chop the range into four states the difference is not that big, and if you chop it into 16 states to store four bits, the differences are quite small, and then the problem is that your sensing circuitry needs to be very precise to detect them. Any questions? (A question about how expensive the sensing circuitry is.) Yeah, I cannot say it's cheap — it's expensive, of course — but, similar to DRAM, the overhead can be amortized when you design a main memory, because you share that circuitry across a whole array. If you wanted many small emerging-memory arrays and put full sensing circuitry next to each of them, then yes, the overhead might be huge, but as long as the sensing circuitry serves a fairly big array, the cost can be amortized, very much like in DRAM. I don't have the exact number on top of my head. Any other questions? That was a very good question.

PCM is also non-volatile: it can retain data for more than 10 years at 85°C, which is quite good. So without any refresh, the retention time is greater than 10 years: no refresh needed, and low idle power.
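As a toy illustration of reading a single-level PCM cell from the resistance ranges just mentioned (roughly 1-10 kΩ after SET, roughly 1-10 MΩ after RESET), here is a sketch; the 100 kΩ decision threshold is an assumption, and real sense amplifiers of course work with currents and timing rather than a direct resistance readout.

```python
# Toy model of a single-level (SLC) PCM read based on the resistance ranges
# discussed above. Threshold and exact ranges are illustrative assumptions.

SET_RANGE   = (1e3, 10e3)    # ohms, crystalline -> logical 1
RESET_RANGE = (1e6, 10e6)    # ohms, amorphous   -> logical 0
THRESHOLD   = 100e3          # assumed sense threshold between the two ranges

def read_slc_cell(resistance_ohms):
    """Map a sensed resistance to a bit: low resistance = 1, high = 0."""
    return 1 if resistance_ohms < THRESHOLD else 0

for r in (2e3, 8e3, 3e6):
    print(f"{r/1e3:8.0f} kOhm -> bit {read_slc_cell(r)}")
```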
So far we have chopped the cell's resistance range into just two values, one and zero, but as we discussed we can also have multi-level cell (MLC) PCM: you can chop the cell resistance window into four, for example, and then you are storing two bits per cell. Of course, you then have less margin between values and need more precise sensing and modification of the cell contents, which means higher latency and energy: roughly two times for reads and four times for writes compared to SLC (single-level cell) PCM. So you need to spend more time and more energy to read and write, which makes sense, I think, because you need to control the cell more precisely.

Now, phase-change memory properties. We wanted to learn about PCM properties, so we did a survey of prototypes from 2003 to 2008 from different venues, including ITRS, the Electron Devices Meeting, VLSI, and so on. People were working on PCM, designing prototypes, and reporting characteristics; we looked into all those numbers and came up with this table. You can see that for some prototypes you don't know much, unfortunately. There are numbers like the set time, around 100 ns; the set current, 200 µA; the reset time, 50 ns; the reset current, 600 µA; and of course the write endurance. Do you know what endurance is? We're going to learn about it properly, but it essentially means how many times you can write to a cell reliably. In this prototype, if you write to a PCM cell more than 10^7 times, you can basically consider that cell dead. There are different reasons for that, and sometimes the cell is not dead but is no longer reliable, which means that to avoid errors you need much more powerful ECC, and that causes a lot of issues. There are also memories where, after the endurance limit, cells are completely dead — stuck at zero, for example, so every time you access them you only read zero. Endurance is unfortunately one of the problems of all these emerging memory technologies, and you can see that the endurance reported for PCM across the different works in our survey ranges from 10^4, which is quite pessimistic in my opinion, to 10^9 or so, which seems a bit optimistic.

So when you want to do research on emerging memory technologies, you don't know much, unfortunately. We looked into many reports from prototypes, from industry, and from academia, and we wanted to get some numbers, but you can see that these numbers are not very consistent, because people used different methodologies and may have designed different sensing circuitry, so the sensing path can differ as well. How are you going to use these numbers? That brings me to this point about doing research on emerging technologies in general: it's not easy, and you always need to make some estimates, some guesses, and working with a range of parameters is always a good idea. It's very hard to pick one parameter, or to say that the read latency of PCM is, for example, exactly four times longer than DRAM; that might not be a good decision. It's better to consider a range and show the sensitivity of your design while you vary that parameter.
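One way to act on that advice is to structure your experiments as a sweep over the uncertain PCM parameters rather than a single point. In the sketch below, run_simulation is a hypothetical stand-in for whatever architectural simulator you use, and the latency multipliers are example values, not measurements.

```python
# Sketch of a sensitivity sweep over uncertain PCM timing parameters.
# run_simulation is a placeholder; replace it with a call into your simulator.

def run_simulation(read_latency_mult, write_latency_mult):
    # Placeholder model: in a real study this would run your simulator and
    # return, e.g., normalized execution time for the chosen parameters.
    return 1.0 + 0.05 * read_latency_mult + 0.02 * write_latency_mult

results = {}
for read_mult in (2, 4, 8):            # PCM read latency = N x DRAM (assumed range)
    for write_mult in (6, 12, 24):     # PCM write latency = N x DRAM (assumed range)
        results[(read_mult, write_mult)] = run_simulation(read_mult, write_mult)

for (r, w), t in sorted(results.items()):
    print(f"read {r:2d}x, write {w:2d}x -> normalized exec time {t:.2f}")
```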
Okay, let me discuss these next few slides and then we can take a break. Where can PCM fit in the system? We have main memory, caches, and storage, and we wanted to consider it as main memory. If you look at this plot, the x-axis is the typical access latency in terms of processor cycles for a 4 GHz processor. The L1 cache is here, with an access latency of around two cycles, then we have the last-level cache, main memory (DRAM), and also the disk. You can see that between DRAM and the hard disk drive there is quite a large gap, and that's why flash found its way in so nicely: flash fills the gap between the hard disk drive and DRAM. Now the question is how PCM can find its place. PCM latencies are close to DRAM's, but not as good. In this work we'd like to see how we can use PCM for main memory. The read latency of PCM is around 50 ns — using all those studies — which is about four times longer than DRAM but faster than NAND flash. The write latency is around 150 ns, which is about 12 times longer than DRAM. Bandwidth is also lower than DRAM and similar to NAND flash. Dynamic energy is unfortunately also higher than DRAM: about two times for reads (I assume the 2x figure is for reads), while the dynamic energy for writes is about 43 times that of DRAM; the dynamic energy is similar to NAND flash. Endurance is on the order of 10^8 writes, many orders of magnitude below DRAM — for DRAM we don't actually know the endurance, because there is no report for it; at least we haven't observed an endurance problem in DRAM before the rest of the system wears out, and from what we can see the DRAM architecture doesn't have that kind of issue, whereas in PCM we clearly have this problem. As for cell size — actually, the cell size is larger than DRAM's, and it's also larger than NAND flash, but you can scale it better with feature size and by using the MLC technique.

So the takeaway here is that it's very unlikely PCM can replace DRAM outright, because there are many shortcomings. Let's summarize the pros and cons versus DRAM again. The pros are better technology scaling, as we discussed, non-volatility (persistent memory), and low idle power — no refresh issue. But we have higher latencies, higher active energy, lower endurance, and reliability issues. We don't, for example, have issues like RowHammer, the read-disturbance problem, but there are similar issues in PCM: accessing your PCM, or simply time, can cause some resistance drift, so your resistance window can shift, and if you don't update your references you're going to read incorrectly. That's also the case for flash memory, and it's why we always need to update the reference voltages for flash as well. So there are challenges in enabling PCM as a DRAM replacement.
To enable it, we need to mitigate PCM's shortcomings and find the right way to place PCM in the system. Here are different examples: this system is entirely DRAM; this one is most likely the way we're going to see PCM in the future, if we get it at all — a hybrid in which you have both DRAM and PCM; and this is the more aggressive solution, where you completely replace DRAM with PCM. One of the works we discuss today considers the hybrid solution, and the other considers the complete replacement of DRAM; the latter was one of the initial studies asking whether we can replace DRAM completely with PCM. Okay, before going into this paper, any questions? (Answering a question:) Yes — you have some circuitry, the sensing circuitry, and that is probably CMOS-based, so it also consumes some power. That's one reason I can think of, but there can be other factors as well. Any other questions? Okay, so let's take a break until 2:35, about 10 minutes, and then we will continue from this point. Thank you.

(Break.)

Okay, let's continue. Am I audible on Zoom? So in this work we want to completely replace DRAM with PCM, and here are all the numbers we consider: for latency, 4x higher read latency and 12x longer write latency; energy is 2x for reads and 43x for writes; endurance is 10^8. These are the numbers we arrived at from all those surveys, and we did some simulations in which we replace DRAM with PCM in a four-core system with a 4 MB L2, with the PCM organized the same as DRAM. That is, we assume the PCM architecture is exactly like DDR DRAM — row buffer, banks, peripherals, everything the same — but when you access a cell you spend more time; at the simulation level you can model it that way. Long story short, the results are not good: around 60% longer execution time, 2.2x higher energy, and a 500-hour average lifetime. And that lifetime is an average: there are applications for which the lifetime is much lower, like fft, and there are applications for which the delay — the 60% is also an average — the performance overhead is much larger. So the results are not good, and that's roughly what we expected: by just plugging in these numbers, you cannot get good performance or good energy. But now the question is: can we overcome these issues? We looked into it. This paper discusses several ideas that we don't have time to go through, but I'm going to showcase two of them.

The first idea is to use multiple narrow row buffers in each PCM chip. In DRAM we have a big sense amplifier, because when you read from DRAM, the read operation is destructive, and you need the sense amplifier to do charge restoration and make sure everything stays correct. PCM doesn't have that issue, so you don't need a sense amplifier as wide as the whole row of the array. You can have smaller sense amplifiers, and with that you gain some area budget, which you can use to add some latches, some buffers; with those latches and buffers you can essentially hide latencies — they can act as a cache, if you will.
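Here is a very small model of this first idea: a handful of narrow buffers (latches) per PCM bank, managed like a tiny fully associative cache of row segments. The buffer count and the LRU policy are illustrative choices, not the paper's exact organization.

```python
# Toy model of "idea 1": multiple narrow row buffers per PCM bank managed as
# a small cache of row segments. Sizes and replacement policy are assumptions.

from collections import OrderedDict

class NarrowRowBuffers:
    def __init__(self, num_buffers=4):
        self.buffers = OrderedDict()        # (row, segment) -> data, in LRU order
        self.num_buffers = num_buffers

    def access(self, row, segment, read_from_array):
        key = (row, segment)
        if key in self.buffers:                # buffer hit: no array access needed
            self.buffers.move_to_end(key)
            return self.buffers[key], True
        data = read_from_array(row, segment)   # buffer miss: read only this segment
        if len(self.buffers) >= self.num_buffers:
            self.buffers.popitem(last=False)   # evict the least recently used entry
        self.buffers[key] = data
        return data, False

rb = NarrowRowBuffers()
fake_array = lambda row, seg: f"row{row}-seg{seg}"
print(rb.access(3, 0, fake_array))   # ('row3-seg0', False): miss, fills a buffer
print(rb.access(3, 0, fake_array))   # ('row3-seg0', True):  hit in a latch
print(rb.access(7, 2, fake_array))   # a different row can be buffered at the same time
```

Unlike a single wide DRAM row buffer, these buffers can hold segments from different rows at once, which is exactly the extra flexibility discussed next.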
Makes sense? Yeah — in PCM we don't need the sense amplifier to be as big as the whole row, so you can place sense amplifiers for only the few blocks you actually care about at that moment. (A question about row-buffer locality.) Yes, row-buffer locality still comes into it, but you can exploit it consecutively in time: there are different ways to fill these latches. They can be filled with blocks or words from the same row, by reading consecutively over time, or they can be filled from different rows — that's the key difference here. In DRAM you cannot do that at all, because the sense amplifier has to hold the whole row — although there are some new DRAM architectures, like Sectored DRAM, which Ataberk published recently, where by redesigning the DRAM architecture you can also implement lower-granularity access; but in general it's not easy in DRAM. That's a good question.

The second idea is to write into the array at cache-block or word granularity. In DRAM, when you want to write, you write into the sense amplifiers and then you need to wait so that the value in that row buffer is restored into the DRAM row. In PCM you don't have to update the whole row; you can update only the part that is different — maybe you only changed a byte, or two bytes, or a word, or a cache block — and updating only that can save a lot of write cycles. Remember that we have the endurance issue, so this also reduces the number of writes. There are works that push this even further and control individual bits — Samsung has done work on that: essentially, whenever you want to write, you update only the bits that are different. You read what is currently stored, you know what you want to write, and you program only the differences, which can eliminate a lot of programming (see the small sketch after this discussion).

By implementing these two ideas you get a performance improvement compared to the baseline, but you are still increasing the execution time by about 20%, your energy consumption is on par, and the average lifetime is longer; there are still some applications whose lifetime is not that long, but the average is good. There are some caveats, though. First, the worst-case lifetime is much shorter, so we have no guarantee here, which is not good. The second caveat is that memory-intensive applications see large performance and energy hits, which is, again, expected. And caveat three: did we perhaps assume optimistic PCM parameters? Because we didn't know — we surveyed a lot of papers and came up with this set of parameters, but are they optimistic or pessimistic? It turned out later that some of these parameters were actually quite optimistic, and some of them perhaps even pessimistic. So that's the situation: if you're thinking of completely replacing main memory with PCM, you don't get a performance improvement — you actually get a degradation — and there are also issues like having no guarantee on lifetime, and many other things we already discussed.
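Here is the promised sketch of the partial/differential-write idea: read the old contents, compare with the new data, and program only the bits that actually differ. This illustrates the principle only, not any vendor's exact scheme.

```python
# Sketch of a differential write: program only the bits that changed,
# reducing both write energy and cell wear.

def differential_write(old_word, new_word, width=32):
    """Return (number of bits to program, their positions) for writing
    new_word over old_word."""
    diff = old_word ^ new_word                      # 1 wherever a bit changes
    positions = [i for i in range(width) if (diff >> i) & 1]
    return len(positions), positions

old = 0b1010_1100_0000_1111
new = 0b1010_1100_0010_1101
count, positions = differential_write(old, new, width=16)
print(f"program {count} of 16 bits (positions {positions}) instead of all 16")
```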
By implementing these two ideas you get a performance improvement compared to that baseline, but you are still increasing execution time by about 20%, energy consumption is roughly on par, and the average lifetime is longer; there are still some applications where the lifetime is not that long, but the average is good. There are some caveats here, though. First, the worst-case lifetime is much shorter, so we don't have a guarantee, which is not good. The second caveat is that memory-intensive applications see large performance and energy hits, which is again expected. And the third caveat is: did we even use optimistic PCM parameters? We didn't know; we surveyed a lot of papers and came up with this set of parameters, but are they optimistic or pessimistic? It turned out later that some of these parameters were quite optimistic, and some maybe even pessimistic. So that's the thing: if you are thinking of completely replacing main memory with PCM, you don't get a performance improvement, you actually get degradation, and there are issues like no guarantee on lifetime, and many other things we already discussed. If you want to learn more, you can check this paper. And this is also the DIMM we discussed, the Intel Optane that was released in 2019; that technology, 3D XPoint, was very similar to PCM.

Of course, once you have this PCM-based main memory, you can also think about efficient data mapping and buffering techniques for multi-level cell phase-change memories; I have some slides on that. The thing is that with multi-level cells you are storing two bits in one cell, so you have different latencies when you want to write: writing '11' is quite fast, but writing '10' takes longer, '01' longer still, and '00' the longest. Then we have this observation that there is an asymmetry in reads: the read latency and energy of bit 1 are lower than those of bit 0. Why is that? Bit 1 is the most significant bit, and to resolve the most significant bit you only need to wait until time T2, but to be sure about the least significant bit you need to wait until T4, so you have this asymmetry. We can show it with this simplified example: there is a capacitor charged to a reference voltage, and we connect it to an MLC PCM cell with unknown resistance. Depending on that resistance you get a different current, and that current determines how fast the capacitor discharges; this is a simplified version of the sensing circuitry. Here you can see the initial voltage of the capacitor, and if it takes this amount of time to discharge, we can encode that as '01', for example. In existing devices both bits are read at the same time, and we must wait the maximum time to read both, but we can infer bit 1 before that time, so reading the most significant bit is faster. How can you benefit from that? There are different ways, which I'm not going into in detail, but for example you can store one page of memory in the MSBs and another page in the LSBs; the page mapped to the most significant bits is the fast page, and the LSB page is the slower one. Then you can think about how to manage it, a bit like a cache: data that is latency sensitive you map to the pages that use the MSBs, which are faster, and the rest to the LSB pages. This is just one example; you can read the paper for more detail. We also have an asymmetry for writes: depending on the initial state, you have different latencies to reach the final value; for example, when the cell holds '01' and you want to write a new value, some of those transitions are quite fast. So you can potentially benefit from that as well with data mapping and buffering techniques. I haven't explained how all of this can be implemented or exactly how you benefit from these techniques; if you want, you can check this paper, but you get the idea from the examples I mentioned.
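Here is a small sketch of that read asymmetry. The threshold times T1 through T4 and the mapping from discharge time to the two-bit value are illustrative assumptions, just to show why the MSB is available earlier; they are not the actual circuit behavior.

```python
# Simplified sketch of the MLC PCM read asymmetry described above.
# T1..T4 and the time-to-value mapping are hypothetical.

T1, T2, T3, T4 = 10, 20, 30, 40   # assumed sensing time steps (ns)

def read_msb(discharge_time_ns):
    """The most significant bit can be resolved as soon as T2 has passed."""
    return 1 if discharge_time_ns <= T2 else 0

def read_both_bits(discharge_time_ns):
    """Resolving both bits requires waiting until T4 in the worst case."""
    if discharge_time_ns <= T1:
        return (1, 1)
    elif discharge_time_ns <= T2:
        return (1, 0)
    elif discharge_time_ns <= T3:
        return (0, 1)
    else:
        return (0, 0)

for t in (5, 15, 25, 35):
    print(f"t={t}ns  MSB known by T2: {read_msb(t)}  full value: {read_both_bits(t)}")
```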
Any questions? Okay, now we want to see how STT-MRAM can replace DRAM as main memory. First of all, what is STT-MRAM? STT-MRAM is based on a magnetic tunnel junction, an MTJ device; you can see the device here. There is a reference layer with a fixed magnetic orientation, and there is a free layer that can be parallel or anti-parallel to it. When the free layer is parallel to the reference layer you can encode that as logical zero, and when it is anti-parallel you can encode it as logical one, for example. So the magnetic orientation of the free layer determines the logical state of the device, high versus low resistance. For a write you need to push a large current through the MTJ to change the orientation of the free layer, and for a read you need to sense the current flow. If you want to learn more about STT-MRAM, you can check this paper. People have also looked into using STT-MRAM instead of SRAM, because the latencies of STT-MRAM, as you're going to see, are much better than DRAM, though not better than SRAM. The problem with STT-MRAM is density: unfortunately it is not as dense as DRAM; its density is much closer to SRAM. That's why people have also looked into how to use STT-MRAM and SRAM together, and they have designed different architectures, like hybrid caches for GPUs, CPUs, and other devices. But in this work we are discussing how we can use STT-MRAM as a replacement for DRAM. There are some pros over DRAM, like better technology scaling in capacity and cost, and it's non-volatile, so we get persistence and low idle power; at the same time, STT-MRAM read latencies might even be faster than DRAM. But there are also cons: unfortunately STT-MRAM has higher write latency and high write energy, it has poor density, and its reliability is also questionable; there is still a lot of ongoing work to make it reliable, and that's why we have this question mark here. This memory technology is actually quite interesting because of a degree of freedom it gives you: you can trade off the non-volatility of this memory for lower write latency and energy. When you reduce the size of the MTJ, you reduce the retention time; at some point the retention time is not that high anymore, so you need to refresh the memory. But for memories like caches, where you frequently update data, or a register file, for example, you may not need that high a retention time anyway, right? So you can exploit this trade-off: by shrinking the MTJ you get a big reduction in write latency and energy, but at the price of non-volatility, meaning at some point your STT-MRAM actually needs refreshing, and people have looked into that. People have even looked into putting different STT-MRAM designs in one chip, for example as a cache: they have a high-retention STT-MRAM part, which is slower and which they use as a bigger cache, and also a smaller STT-MRAM cache which has a low retention time but is faster, and they try to keep the most important part of the working set in that smaller, faster STT-MRAM cache.
So it's quite interesting that you have this level of freedom, and you can do a lot of trade-off analysis here as well. By architecting STT-MRAM as main memory, with the same methodology as before, you can see that we have less performance loss compared to PCM, only about 6%, and we even get around 60% energy savings compared to DRAM, which is good. That's why STT-MRAM has some potential to replace DRAM, but at the same time, as I said, the problems are its density and its reliability, so it's still not clear; you can check this paper for more detail if you are interested.

Now let's look into the other approach, which is hybrid main memory. Remember this picture from before, where we discussed having a CPU and different controllers: you have a DRAM controller, for example, that controls DRAM, which is fast and durable, but small, leaky, volatile, and high cost; and there can be another memory technology, like phase-change memory, which is large, non-volatile, and low cost, but slow, wears out, and has high active energy. By using the two together, hopefully you can achieve the best of multiple technologies; that's the idea of hybrid memory. Of course, this also brings a lot of challenges. For example, how do you map your data? How do you decide which data should go to DRAM and which to PCM, and how do you migrate it? And do you architect DRAM as a cache for PCM, or is DRAM part of the memory address space, so that DRAM and phase-change memory together construct the whole address space? If that's the case, then the programmer may need to get involved as well, using some instructions to store data in the DRAM part and others to store data in the PCM part, or the compiler can help; or it can be a cache, where DRAM is a cache for PCM. I should also say that this doesn't have to be only at the main memory level: there has been work on hybrid STT-MRAM/SRAM caches, for example, and in my PhD I actually did research on designing a hybrid register file, using SRAM together with a non-volatile memory as a GPU register file, so it can also be at the register file level of a GPU. It can even be at the storage level, where you have different storage devices with different characteristics and you want to design such a hybrid system. So this is quite interesting: the goal is to provide the best of multiple metrics with multiple memory technologies, and that's where heterogeneity comes into play, and configurable, programmable memory systems. Of course, as I said, there are many questions and challenges to address: are we designing it as a cache or as main memory, as I discussed? What should be the granularity of data movement and management, fine-grained or coarse-grained? Should it be hardware controlled, software controlled, or hardware-software cooperative? As you can probably guess, it should be hardware-software cooperative. When do we migrate data, since migration is always costly, and when do we schedule it? And how do we design a scalable and efficient large cache?
Sometimes, if you want to use DRAM as a cache, that cache is going to be quite large, maybe tens or even hundreds of gigabytes, so how do you design it? The tag array alone is going to be huge. These are very important and hard questions to answer. One option is to use DRAM as a cache for PCM: PCM is the main memory and DRAM caches memory rows and blocks. The benefit is that it reduces latency on a DRAM cache hit, and it also does write filtering, so it can mitigate the write endurance problem of PCM. The memory controller hardware manages the DRAM cache, which eliminates system software overhead. But there are at least three issues: what data should be placed in DRAM versus kept in PCM, in other words how you manage your cache intelligently; what the granularity of data movement should be; and how to design a low-cost, hardware-managed DRAM cache. There are two idea directions here: locality-aware data placement, and cheap tag stores with dynamic granularity, which we're going to see a little later. So, DRAM as a cache for PCM: the goal, as we discussed, is to achieve the best of both DRAM and PCM (or NVM in general), minimizing the amount of DRAM without sacrificing performance or endurance; DRAM acts as a cache to tolerate PCM latency and write bandwidth, and PCM is the main memory, providing large capacity at good cost and power. So here is your DRAM buffer, and how you manage that DRAM buffer is a very good question to ask. There are some techniques this paper discusses, like write filtering: you can take the approach of lazy writes, where, when you bring data from flash or the hard disk drive, you don't write it immediately to PCM; you first write it into this DRAM buffer as a lazy write, and only later, if you realize the pages are needed, or they are dirty because you captured some writes in the DRAM buffer, do you write them back to PCM. And at some point you may not need to write them back to PCM at all; you just write them back to flash or the hard disk. So you can also do page bypass: you discard pages with poor reuse on DRAM eviction. If you want to learn more about the details, you can check this paper.
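To make the lazy-write and page-bypass ideas a bit more concrete, here is a minimal sketch of a DRAM buffer with those two filters. The eviction policy, the reuse threshold, and the data structures are simplified placeholders, not the mechanism from the paper.

```python
# Toy DRAM buffer with lazy writes and page bypass (illustrative only).
from collections import OrderedDict

class LazyDRAMBuffer:
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()   # page_id -> {"dirty": bool, "reuse": int}
        self.pcm_writes = 0

    def fetch_from_storage(self, page_id):
        """Lazy write: a page from flash/HDD goes into DRAM first, not PCM."""
        self._insert(page_id, dirty=False)

    def cpu_write(self, page_id):
        """Writes are absorbed in the DRAM buffer."""
        if page_id not in self.pages:
            self._insert(page_id, dirty=True)
        self.pages[page_id]["dirty"] = True
        self.pages[page_id]["reuse"] += 1
        self.pages.move_to_end(page_id)

    def _insert(self, page_id, dirty):
        if len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page_id] = {"dirty": dirty, "reuse": 0}

    def _evict(self):
        victim, meta = self.pages.popitem(last=False)   # LRU-ish victim
        # Page bypass: clean pages or pages with poor reuse never touch PCM.
        if meta["dirty"] and meta["reuse"] > 1:
            self.pcm_writes += 1   # write back to PCM only when worthwhile

buf = LazyDRAMBuffer(capacity_pages=2)
buf.fetch_from_storage("A")
buf.cpu_write("B"); buf.cpu_write("B")
buf.fetch_from_storage("C")          # evicts A: clean, so it is bypassed
buf.fetch_from_storage("D")          # evicts B: dirty and reused, goes to PCM
print("writes that reached PCM:", buf.pcm_writes)   # -> 1
```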
Here are some results. The simulation in this paper is at a somewhat higher level than the one in the other work I showed; it is partly combined with an analytical model. They simulate a 16-core system with 8 GB of DRAM, main memory access at 320 cycles, an HDD at 2 milliseconds, and flash memory at 32 microseconds, and they assume the flash hit rate is 99%, so 99% of requests hit in flash and never go to the HDD. They also assume that PCM is four times denser than DRAM, so they take this good property of PCM into account; in our work, the one I discussed earlier, we assumed PCM and DRAM have the same density and only accounted for the differences in access latency and energy, but here they use the density advantage to provide better performance. They further assume that PCM is four times slower than DRAM, which is quite optimistic in my opinion; in our work, for example, we assumed the write latency was at least 12 times longer than DRAM. The block size equals the PCM page size. You can see some results here. The y-axis is normalized execution time, so lower is better, and these are some workloads from IBM, I assume. Results are normalized to the 8 GB DRAM baseline. With a standalone 32 GB PCM main memory, on average you actually observe some performance benefit. Why is that? Because you have a larger memory, so you are seeing the benefit of the larger capacity: with a larger memory you go to flash less frequently, and that's why you observe better performance compared to DRAM. But comparing the two baselines with exactly the same capacity, PCM versus DRAM, you can see that DRAM provides better performance, and these results are very similar to what we reported in our work; that paper and this work were quite parallel, both published at ISCA 2009, which is also interesting in that sense. But with 32 GB of PCM plus 1 GB of DRAM as a cache, you actually get a performance benefit, and on average the performance is quite close to the 32 GB DRAM baseline. So you can almost get the performance of 32 GB of DRAM by using 32 GB of PCM plus 1 GB of DRAM as a cache, which is good; that's exactly what we wanted from a hybrid memory system. Clear? Okay, great. They also discuss power, energy, and lifetime: the hybrid design provides better energy consumption and better energy-delay, a kind of energy efficiency metric, and the average lifetime is around 10 years, but of course there are no guarantees, because a worst-case scenario can wear out your system quite fast. If you want to learn more, you can check this paper; it was from IBM Research: "Scalable High Performance Main Memory System Using Phase-Change Memory Technology". Any questions? Yes... yes, very good question, and that's true; we will have one brief slide on that at the end. In DRAM you would need to do a cold boot attack, and you really need to cool your DRAM down to a very low temperature so that the retention time becomes long enough for you to remove the DRAM from the system and put it in another system. But with non-volatile memory, everything is just there, it's non-volatile, so potentially someone can steal data and you have security issues, and people have done research on that as well. Very good comment, actually, not a question.

Okay, great. So let's think about data placement in hybrid memory at a slightly higher level of abstraction. Essentially you have these cores, caches, and memory controllers, and several channels: channel A can be memory A, which is fast and small, and channel B can be memory B, which is large and slow. The idea is that you want to allocate your pages across these different memories, so at some point you want to migrate page 2 to channel A because you want faster access.
But the problem is that this causes some data movement, and in addition you now also have some contention on channel A, so it can come at the price of lower bandwidth if you do it this way. So which memory do we place each page in to maximize system performance? That is the question, but at the same time you should also answer other questions, like not leaving channel B idle and potentially causing load balancing issues. Okay, so, data placement between DRAM and PCM: the idea is to characterize data access patterns and use that to guide data placement in hybrid memory. There are workloads with streaming accesses, and streaming accesses can be almost as fast in PCM as in DRAM; random accesses, however, are much faster in DRAM, because for random accesses you always pay the array access latency, whereas with streaming accesses you can hopefully hide those latencies by exploiting the bandwidth. That's why, in this example, applications with streaming accesses can be mapped to PCM, and random accesses should be mapped to DRAM. That's the idea of this work: place random-access data with some reuse in DRAM; reuse matters too, so you also need to consider locality. So random accesses with some reuse are mapped to DRAM, and streaming data to PCM. In more detail: row buffers exist in both DRAM and PCM, and the row hit latency is similar in both, but the row miss latency is small in DRAM and large in PCM; that's why a row conflict is costly for PCM, and that's why PCM doesn't like random accesses. With streaming accesses you can benefit from row buffer locality, so you can map that data to DRAM or PCM and both work, but with random accesses you're most likely going to have a lot of row conflicts, and then it's better to use the memory with the lower miss latency. Question? Okay. So, place in DRAM the data that is likely to miss in the row buffer, that is, data with low row buffer locality, where the miss penalty is smaller, and that is reused many times, so that you cache only the data that is worth the movement cost and the DRAM space. Those are the two criteria, and if they are met you migrate that data to DRAM. Here are some results: you can see a good speedup, and you can check this paper to learn more about how to make your hybrid memory management technique aware of row buffer locality and, yes, exactly, also of the memory access pattern.
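A minimal sketch of that placement rule follows, with hypothetical per-page counters and thresholds; the actual mechanism tracks row buffer locality in hardware and is more involved.

```python
# Sketch of the placement rule: put in DRAM the pages that miss in the row
# buffer a lot AND are reused enough to be worth the migration cost.
# The thresholds below are hypothetical.

def should_place_in_dram(row_buffer_misses, row_buffer_hits, reuse_count,
                         miss_rate_threshold=0.5, reuse_threshold=8):
    accesses = row_buffer_hits + row_buffer_misses
    if accesses == 0:
        return False
    miss_rate = row_buffer_misses / accesses
    # Low row buffer locality -> DRAM helps (row misses are cheap there);
    # high reuse -> the page is worth the movement cost and DRAM space.
    return miss_rate >= miss_rate_threshold and reuse_count >= reuse_threshold

# A streaming page: mostly row hits, so PCM is fine for it.
print(should_place_in_dram(row_buffer_misses=2, row_buffer_hits=98, reuse_count=100))   # False
# A random-access, reused page: many row misses, worth moving to DRAM.
print(should_place_in_dram(row_buffer_misses=60, row_buffer_hits=40, reuse_count=60))   # True
```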
There are some weaknesses in existing solutions: they are heuristic based, they consider only a limited part of memory access behavior, and they do not directly capture the overall system performance impact of data placement decisions. Since we don't have enough time, I'm going to skip these slides and just give you the idea: in the end, you want utility-based hybrid memory management, a memory management scheme that works for any hybrid memory. This work wanted to design a hybrid memory management technique that can be extended or applied to different baselines. Essentially, you define a metric called utility: for each page you use comprehensive characteristics to calculate an estimated utility, which is the performance impact of a decision, where the decision here is migrating a page from one memory to the other, and you only migrate the pages with the highest utility, meaning the pages that improve system performance the most when migrated. How do you define this utility metric and how do you capture and calculate it? Of course you also need some modeling, a performance model at run time; you can see an example formula here, and based on that you make your decisions. You can learn more about it as part of your reading, from homework three, I believe.

Okay, any questions so far? Yes... sorry, why is that? I think you can consider the row buffer to be the same size in that example. The thing is that when you miss in the row buffer you need to sense the PCM row again, so you pay the PCM access latency, but when you have a row hit you only pay something like an SRAM access latency, because the row buffer is essentially an SRAM-like structure.

Okay, so there is a challenge and opportunity here: enabling an emerging technology to augment DRAM, managing hybrid memories, and designing effective large DRAM caches is one of the issues. So let's look at this problem with large DRAM caches. A large DRAM cache requires a lot of metadata, tags plus block-based information, so how do we design an efficient DRAM cache? When you issue a load request, you first want to see whether the data can be served from the DRAM cache, so you need to check the metadata; if it's a hit you get the data from DRAM, otherwise you go to PCM. So the question is how we store the tags. One idea is to store the tags in the same DRAM row as the data: in your DRAM row you have the cache blocks and also their tags. The benefit is that you don't need on-chip tag storage, which would be a huge overhead; you can do the calculation at home, but because the DRAM cache is so big, you would need a very large tag array and metadata, so here you don't pay for that. But there are downsides: a cache hit is determined only after a DRAM access, which is quite time consuming, and a cache hit requires two DRAM accesses, one to get the tag and then, after checking the tag, another one to read the block. So that's an issue. The second idea builds on the first: we still store all the metadata in DRAM, because we want to keep the metadata storage overhead low, but we also cache frequently accessed metadata in on-chip SRAM. For the blocks you access more frequently, you cache their metadata in an SRAM structure, and that reduces the cost of accessing metadata, so now you get both fast metadata access and low storage overhead together.
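Here is a rough cost sketch contrasting those two tag-storage ideas. The latencies and the metadata-cache hit rate are hypothetical numbers, just to show the shape of the trade-off, not measured values from the paper.

```python
# Rough cost model for DRAM cache hit latency under the two tag-storage
# ideas above. All numbers are illustrative assumptions.

DRAM_ACCESS_NS = 50
SRAM_ACCESS_NS = 2

def hit_latency_tags_in_dram():
    # Idea 1: tags stored in the same DRAM row as the data.
    # A cache hit needs one DRAM access for the tag, then one for the block.
    return 2 * DRAM_ACCESS_NS

def hit_latency_metadata_cache(meta_hit_rate):
    # Idea 2: all metadata still lives in DRAM, but frequently used metadata
    # is cached in on-chip SRAM, so most hits skip the first DRAM access.
    tag_cost = meta_hit_rate * SRAM_ACCESS_NS + (1 - meta_hit_rate) * DRAM_ACCESS_NS
    return tag_cost + DRAM_ACCESS_NS

print(hit_latency_tags_in_dram())           # 100 ns per DRAM cache hit
print(hit_latency_metadata_cache(0.9))      # ~56.8 ns per DRAM cache hit
```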
The third idea is dynamic data transfer granularity. Some applications benefit from caching more data, because they have good spatial locality, so you could give them larger cache blocks; others do not, and a large granularity wastes bandwidth and reduces cache utilization. So the third idea is a simple dynamic caching granularity policy: use a cost-benefit analysis to determine the best DRAM cache block size; you can group main memory into sets of rows, have different sample row sets follow different fixed caching granularities, and have the rest of memory follow the best-performing granularity. You can check the paper for more detail, but by combining all of these techniques you can get performance very close to the ideal, but very costly, case where you store the whole tag array in SRAM. In that baseline the whole tag array is stored in SRAM; with this technique the tags are in DRAM, but with everything built on top of it you get performance very similar to that baseline, with much lower complexity. In terms of energy efficiency you also do better, because you don't reduce performance much but you reduce cost and power a lot thanks to the lower storage overhead. You can check this paper if you're interested in learning more; it's also going to be part of your reading. DRAM caches in general have many design options, and people have worked on them a lot, we have and others have as well, and there are many different schemes; this paper actually does a very nice comparison across them, comparing all these schemes in terms of how they handle DRAM cache hits, how they handle DRAM cache misses, how they handle replacement traffic and replacement decisions, and so on. If you're interested, you can check this paper as well.

Okay, any questions? Now I want to look into other opportunities with emerging memory technologies. I'm going to list some of these opportunities, and we're going to cover a few of them. One potential opportunity is merging memory and storage. Currently they are not merged: we have DRAM and we have the SSD or hard disk. With emerging memory technologies, because they are non-volatile, you can actually merge them and use a single interface to manage all data. You can also consider new applications, like ultra-fast checkpoint and restore, which is quite important for persistent memory, you can have more robust system designs by reducing data loss, and you can think of processing tightly coupled with memory, using processing-in-memory techniques. Recall our processing-using-memory lectures; I'm not going to go over them again, but we can do a lot of computation using DRAM, like Ambit and RowClone. People have shown that emerging memory technologies can also do bulk bitwise operations and RowClone-style copies; Pinatubo is one such work, which you can check if you're interested. Interestingly, these architectures can also do some operations that processing-using-DRAM cannot easily do, and that's because of the way they are built: in-memory crossbar array operations. These architectures are designed in a crossbar fashion.
Some emerging NVM technologies have a crossbar array structure, like memristors, resistive RAM, phase-change memory, and STT-MRAM, and these crossbar arrays can be used to perform dot-product operations using their analog computation capability. By doing dot products you can do matrix-vector multiplication, and with that you can potentially run neural networks, convolutional neural networks, and many other such workloads. They can operate on multiple pieces of data using Kirchhoff's law: the bit line current is the sum of the products of the word line voltages and the conductance of each cell, and the computation happens in the analog domain, inside the crossbar array. That's why you need peripheral circuitry, digital-to-analog and analog-to-digital conversion, for the inputs and outputs. Here is one example: you can see the digital-to-analog converters, here is the crossbar array, this is a sample-and-hold, if I'm not mistaken, which you can think of as a latch that samples your analog value, and here is the analog-to-digital converter. Essentially, what's happening is that the voltage you apply here is determined by your input vector: you have some input values, zeros and ones, and you use the digital-to-analog converter to map them to voltage values. These voltages cause some current through each cell, depending on the conductance of that resistor, and then the current on the bit line is the sum of those currents, by Kirchhoff's law. So you can see this is essentially a dot product: you can exploit this analog behavior and then convert the result back to digital to do dot products easily, and people have done it. You can also see it in this animation: this is the current value you get. Essentially, you have this input array, and this is your weight matrix; you store the weight matrix as the resistance, or conductance, values in your crossbar array, you apply the inputs as voltages on the word lines, and you get the output on the bit lines. Yes, exactly: you need to pay write energy and write latency to program the ReRAM cells, and that's also where the endurance issue comes in. People have shown that using a ReRAM array is quite good for inference, where you program your weights once and then do a lot of inference runs, but once you want to do training, where you need to keep updating your weights, your crossbar is going to wear out very quickly. This was first used for accelerating convolutional neural networks in the ISAAC paper, for example. I remember that paper very clearly, because the first author of ISAAC was my first mentor, back in 2011, when we were doing research together on adaptive routing algorithms for interconnection networks, which we're also going to see some lectures on later. He later joined the University of Utah and worked on this topic as a PhD student; I'm not sure where he is now, but I learned a lot from him.
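Before we move on to convolution, here is a tiny numerical sketch of that analog dot product, written with NumPy. Everything is idealized: the conductances and voltages are arbitrary numbers, and there is no DAC/ADC quantization or cell non-ideality.

```python
import numpy as np

# Bit line current = sum over rows of (word line voltage x cell conductance),
# i.e., Kirchhoff's current law gives a matrix-vector multiply for free.

G = np.array([[1.0, 0.5],      # conductances programmed into the crossbar
              [0.2, 0.8],      # (these play the role of the weight matrix)
              [0.6, 0.1]])     # shape: 3 word lines x 2 bit lines

v = np.array([0.3, 0.0, 0.5])  # word line voltages encoding the input vector

bitline_currents = v @ G       # I_j = sum_i v_i * G_ij
print(bitline_currents)        # the analog "output" the ADCs would read out
```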
So you can think about how to use this idea for convolution. Here is an example: you have an input feature map, and there is a window that you slide across the input, doing some computation at each step, and that gives you the output feature map. Of course, you also need some padding to make sure you always have data, because there are regions where the window would go outside the input feature map. So you can use this crossbar approach for the convolution operations in a convolutional neural network. But in a convolutional neural network you don't only do dot products; there are also other layers, like non-linear layers such as ReLU. So in the chip you have the NVM-based PIM arrays that do the dot products, and you also integrate a unit for the non-linear functions; you can think of it as a processing-near-memory engine that is integrated alongside to accelerate the non-linear operations, and with that they could improve performance. We have also done work on this topic, like GenPIP, which John might discuss a little tomorrow in our genomics lecture, and there is also Swordfish; I'm not sure if we'll discuss it tomorrow, but it's going to be part of one of your readings. This is quite interesting because people have used ReRAM, these memristors, a lot for neural networks, for convolutional and deep neural networks, but the problem is that these memristors are non-ideal, and as a result, whenever you do computation, you get some errors. This work looks into how those errors can invalidate your assumptions and how much accuracy or performance loss you get when using these memories. It's very good to know about all these issues so that you can design techniques to overcome them. This is the ISAAC paper I mentioned, and there are other papers in this area that people have developed; it's also covered, I would say, in the lecture on systolic arrays and convolution, which is a beautiful topic you can check in Professor Mutlu's lectures.

Okay, now let's look into the other... any questions? I would like to finish the lecture, and it's doable, but it might come at the price of no break. Is that fine, or do you prefer a very short break, two or three minutes? Who wants a five-minute break? Okay, then I will continue; apparently the topic is quite fascinating. Okay, so we discussed merging memory and storage; let's go into it a little more. This is the conventional system, essentially: you have memory and you have storage, and the processor accesses DRAM with load and store instructions. However, to access the disk, we need to go through the I/O subsystem, the I/O software stack: you need to open the file, read it, update it, and so on. That shows that whenever you want to access storage, you have to pay the operating system overhead. That wasn't the case very early on in computer systems, when we had core memory, if you know about it: there was core memory that people used as both the main memory and the storage.
It wasn't a very big memory, but it was the only memory people had. Later on, with advances in technology, DRAM and other memories were developed, and over time these interfaces diverged, so now we have different interfaces for accessing main memory and accessing storage. We know that the hard disk drive is non-volatile but slow and block-addressable, and we know that DRAM is volatile, fast, and byte-addressable. At the same time, we also know that there are non-volatile memories, like PCM or STT-MRAM, that are relatively fast, at least compared to flash memory or the hard disk drive, and they are byte-addressable and non-volatile. So now the question is: can we use these non-volatile memories to combine memory and storage so that we have one shared interface across them? That's the idea behind moving beyond the two-level memory and storage model. The traditional two-level storage model becomes a bottleneck with NVM. Consider this example: in the two-level model you have the processor and caches, and whenever you want to access your storage you need to go through the operating system and the file system. That overhead might be acceptable when the storage is a hard disk drive, because the hard disk is slow by itself, so the I/O stack latency may not be that long compared to the latency of accessing the hard disk. But it turns out that's not true for SSDs, and that's why people have developed new interfaces to access SSDs, like the NVMe interface, which we're going to learn about next week. And if we put PCM here, PCM is much faster than an SSD, than flash, and certainly than an HDD, so now you pay a lot of latency going through the operating system while the actual access to PCM is quite fast; clearly that's not a good design. So you can see that this two-level design is a bottleneck when we use non-volatile memory as storage. The goal, then, is to unify memory and storage management in a single unit, to eliminate the wasted work of locating, transferring, and translating data. That can improve both energy and performance, and it also simplifies the programming model: the programmer does not need to deal with different classes of memory, with whether data is persistent or not, or with different interfaces; everything is unified, and that naturally provides ease of programming. With NVM you can also provide persistent memory, as we discussed, so there is the opportunity to manipulate persistent data directly in the NVM. Here is an example: you have different memory modules, say DRAM, flash, NVM, and HDD, different memory and storage technologies, and they all sit at the hardware level, but you unify the interface through a persistent memory manager. You access everything with loads and stores, and this manager then accesses DRAM, flash, NVM, or HDD; the processor only sees loads and stores in the end. There are also hints from the software, the OS, and the runtime, and you can use these hints to map your data intelligently across DRAM, flash, NVM, and HDD. Here is a very simple example of a persistent object: there is a file that you allocate as a persistent object, and then you can map that persistent object to flash, NVM, or HDD depending on its characteristics.
As I said, the persistent memory manager uses access and hint information to allocate, locate, migrate, and access data across this heterogeneous array of devices. The persistent memory manager exposes a load/store interface for accessing persistent data, and that is the key difference here: applications can directly access persistent memory, and there is no conversion, translation, or location overhead for persistent data. It manages data placement, location, persistence, and security to get the best out of multiple forms of storage, and it also manages metadata storage and retrieval; this can introduce overheads that need to be managed, of course. And it exposes hooks and interfaces for the system software. Regarding efficient data mapping among heterogeneous devices, just to give you some idea: persistent memory exposes one large persistent address space, but it may use many different devices to satisfy it, from fast, low-capacity, volatile DRAM to slow, high-capacity, non-volatile HDD or flash, with other NVM devices in between. So you have different memory technologies and storage devices that you can manage nicely, and performance and energy can benefit from good placement of data among these devices; that's why you really need to learn from those hints, map your data well, and utilize the strengths of each device while avoiding its weaknesses where possible. For example, consider two important application characteristics: locality and persistence. One example: in a database application, columns in a column store that are only scanned through infrequently can be placed on flash. They clearly have low locality, because they are accessed infrequently and scanned through, but they are persistent, because they are part of your database, so in this two-dimensional characterization you classify them at this point. At the same time, there can be something like a frequently updated index for a content delivery network, a CDN, which has much more locality and is also temporary, so you place it at the other corner, and that's how you can manage your storage, your persistent memory.
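A minimal sketch of that kind of hint-based placement, using just the two characteristics from the example, is below; the device choices and the policy itself are illustrative assumptions, not the actual persistent memory manager.

```python
# Toy placement policy driven by (locality, persistence) hints.

def choose_device(locality, persistent):
    """locality: 'high' or 'low'; returns a hypothetical target device."""
    if persistent:
        # Persistent data: keep it on a non-volatile device.
        return "NVM" if locality == "high" else "Flash/HDD"
    else:
        # Temporary data: DRAM if it is hot, otherwise NVM is enough.
        return "DRAM" if locality == "high" else "NVM"

# Columns of a column store, scanned infrequently but part of the database:
print(choose_device(locality="low", persistent=True))    # Flash/HDD
# Frequently updated CDN index, temporary:
print(choose_device(locality="high", persistent=False))  # DRAM
```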
We evaluated this idea using the following systems. There is an HDD baseline, the traditional system with volatile DRAM memory and persistent HDD storage; this has the overhead of the operating system and file system code and buffering, plus the HDD access latency. We also consider an NVM baseline, which is exactly the same as the HDD baseline but with the HDD replaced by NVM; the problem is that we still have the operating system and file system overhead, because of the two-level storage model, but you get faster access compared to the HDD. Unfortunately, back then we didn't include an SSD baseline; there should also be a flash or SSD baseline here, and that's one of the limitations of this work. And then there is persistent memory, which uses only NVM and no DRAM, to ensure full system persistence, and all data is accessed using loads and stores, so there is no two-level storage model; it does not waste time on system calls, and data is manipulated directly on the NVM device. You can see that going from the two-level HDD system to the two-level NVM system gives a 24x speedup, which is quite good because NVM is quite fast, and going from two-level NVM to persistent memory gives another roughly five times speedup, which shows the effectiveness of this idea. I believe a flash-based system would land somewhere around here if we had included it, because the latency of flash memory is much better than the HDD: accessing an HDD takes around 10 milliseconds, while the read latency of an SSD, considering everything else going on, can be around 200 microseconds or so, so it's orders of magnitude faster. In terms of energy you also observe a very large improvement when using persistent memory. If you want to learn more, you can check this paper for details. So these are the challenge and the opportunity we discussed: combining memory and storage with a unified interface to all data. Industry has worked on this; I have shown this slide multiple times today, but with Optane memory people showed how they can insert it as one of the DIMMs in the system, you can see it is basically a DDR-style DIMM, and then they use it as persistent memory. People developed libraries to use this Optane memory as persistent memory, and they report results compared to having DRAM as non-persistent memory with an SSD as the persistent storage. People have also added a processing-in-DRAM engine; I'm not sure why that is on this slide, but anyway. So there are many challenges in persistent memory. One key challenge I'd like to point out is how to ensure the consistency of system data if all memory is persistent. There are two extremes: one is programmer-transparent, let the system handle it, so the programmer doesn't have to do anything, which provides great programming ease but is of course very hard to implement; the other is to let the programmer handle it entirely; and there are many alternatives in between. I believe the good solution is probably somewhere in between, but in research it is sometimes good to take extreme positions, because you want to see what challenges you're going to face and how you're going to deal with them when you push the boundaries, and that's also why doing research on emerging memory technologies is quite interesting. Here is an example, the crash consistency problem, which is quite well known. Say you want to add a node to a linked list; there is a link to the next node and a link to the previous node, and you always want to do this atomically, because if you don't, and a crash happens, you may end up with a linked list where the connection to the previous node is there but the connection to the next node is not, and then your linked list is broken. That is a well-known problem and it causes an inconsistent memory state. The current solutions use explicit interfaces to manage consistency: you need to do the updates atomically, so there are atomic begin and end markers.
You put your code between these markers to make sure that all the operations happen atomically, meaning either all of them happen or none of them; that's the key property you need to ensure. But of course this limits the adoption of NVM, because you have to rewrite code with a clear partition between volatile and non-volatile data, and that's why putting the burden on the programmer and the compiler is not a great idea. There is also some code on these slides that I'm not going to walk through, but you can check it later if you're interested. Essentially, you need to use transactional memory to make sure you're doing the updates atomically, and there are library calls that you also have to wrap in transactions so that the work is done atomically. So you get the idea: it requires manual declaration of persistent components, sometimes you need new implementations, and third-party code can still be inconsistent, so you need to wrap it in transactional memory accesses as well; you rely on these libraries and this style of code to write consistent programs.
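Here is a sketch of that linked-list example with a hypothetical undo-log style "atomic section". It is illustrative Python pseudocode, not a real persistent-memory API and not the mechanism from any particular paper; it only shows why both pointer updates must commit together.

```python
# Toy crash-consistency sketch: undo-log based atomic linked-list insert.

class Node:
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

undo_log = []   # in a real system this log itself would have to live in NVM

def tx_set(obj, field, value):
    """Log the old value before updating it (hypothetical undo logging)."""
    undo_log.append((obj, field, getattr(obj, field)))
    setattr(obj, field, value)

def insert_between(a, b, new):
    # --- atomic begin ---
    tx_set(a, "next", new)
    tx_set(new, "prev", a)
    tx_set(new, "next", b)
    tx_set(b, "prev", new)
    undo_log.clear()          # --- atomic end / commit ---
    # A crash before the commit is rolled back by recover() on restart,
    # so the list is never left half-linked.

def recover():
    """Undo a half-finished transaction after a crash."""
    for obj, field, old in reversed(undo_log):
        setattr(obj, field, old)
    undo_log.clear()

a, b = Node("A"), Node("B")
a.next, b.prev = b, a
insert_between(a, b, Node("X"))
print(a.next.value, a.next.next.value)   # -> X B
```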
But we wanted an approach with software-transparent consistency in persistent memory; we wanted it to be completely software transparent. The key idea is to periodically checkpoint state and to recover to the previous checkpoint when a crash happens. It is a new hardware-based checkpointing mechanism that checkpoints at multiple granularities to reduce both checkpointing latency and metadata overhead, it overlaps checkpointing with execution to reduce checkpointing latency, and it adapts to DRAM and NVM characteristics. We did a lot of work on this, and it was not easy; it's still not easy, and I'm not even sure I would recommend it, because it adds a lot of complexity to the system. In this work we also assume a single node, so you only deal with your own memory and storage; but if you think of a distributed system, with data coming from I/O and from the network, from your network card, how are you going to checkpoint that? People have looked into checkpointing in distributed systems, in databases for example, but you can imagine how much overhead that can cause: all the checkpointing can cause a lot of data movement and energy. That's why finding a solution in the middle, between putting everything on the programmer and being completely transparent, is probably the best way to go. Still, we got very good results: our performance was within 5%, 4.9% to be precise, of an idealized DRAM with zero-cost consistency. We have also done work on overlapping checkpointing and execution, because if you run, then stop to checkpoint, then run, then checkpoint again, that is quite wasteful: you add a lot of overhead for checkpointing. The idea is to overlap checkpointing with execution by running other kernels or other threads, so you should be able to schedule them nicely such that checkpointing and execution overlap as much as possible. If you want to learn more, you can check this paper; it's also going to be one of the readings in your homework. There is another key challenge in persistent memory, which is programming to exploit persistence; people have worked on that a lot, and if you're interested you can check these papers as well.

Now we reach the security and data privacy issues that your colleague brought up earlier. In NVM we have security and privacy issues. First of all, with the endurance problems we have wear-out attacks: people can design attacks that cause your chip to wear out quickly. For example, if an attacker issues a lot of writes to your PCM-based memory, that can wear out the chip. Of course, there are techniques like wear leveling that try to balance the number of writes: for example, in your PCM, assuming the endurance is 10^8, if you write to one cell 10^8 times, that cell is going to wear out, but if you balance your writes, using a wear-leveling technique that spreads writes over the whole memory array, then you can tolerate many more writes overall. But when you're facing an attack, the attacker can be intelligent or creative enough to trick your wear-leveling technique and still concentrate a lot of writes on particular cells.
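As a toy illustration of that wear-leveling idea, here is a sketch that periodically rotates an explicit logical-to-physical remapping table. Real mechanisms such as Start-Gap do this algebraically in hardware without a full table, and all the parameters below are made up.

```python
# Toy wear leveler: rotate a logical->physical line mapping every few writes.

class ToyWearLeveler:
    def __init__(self, num_lines, rotate_every):
        self.mapping = list(range(num_lines))    # logical -> physical line
        self.wear = [0] * num_lines              # writes seen per physical line
        self.rotate_every = rotate_every
        self.writes = 0

    def write(self, logical_line):
        self.wear[self.mapping[logical_line]] += 1
        self.writes += 1
        if self.writes % self.rotate_every == 0:
            # Rotate the mapping by one so hot logical lines move around
            # (data would be migrated accordingly in a real system).
            self.mapping = self.mapping[1:] + self.mapping[:1]

wl = ToyWearLeveler(num_lines=8, rotate_every=10)
for _ in range(10_000):
    wl.write(0)            # adversarial pattern: hammer a single logical line
# Without remapping one physical line would take all 10,000 writes;
# with rotation the worst line takes about 10000/8 = 1250.
print("max writes on any physical line:", max(wl.wear))
```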
In hybrid memories you can also have performance attacks, and people have worked on that as well. And there is the point your colleague mentioned, that data is not erased after power-off: in DRAM, when you power off the system, your data is gone, unless someone does a cold boot attack and cools the DRAM a lot; in general, when you power it off, everything is erased. But in non-volatile memory that's not the case, so you may have privacy issues there that you need to look into. So let's conclude. Any questions? I was a bit fast on some slides, but I tried to convey the key ideas as much as possible, so hopefully you got some good insights from this lecture. The future of emerging memory technologies is bright, regardless of the challenges in the underlying technology and the overlying problems and requirements; it can enable orders-of-magnitude improvements and new applications and computing systems, and yet we have to think across the stack, as we discuss all the time, and co-design the enabling systems. Here is a very good example: you have a new device, this emerging memory technology, and you can design different techniques, as we have seen, at the microarchitecture and software levels, and people have also worked on algorithm design, changing algorithms so that they perform fewer writes, for example. Normally, when you design an algorithm, you don't consider that at all; with DRAM you never thought about how many writes you have. But with this new angle, computer scientists, even people working on theoretical computer science, need to be aware that their algorithms should also account for the number of writes, and people have done interesting work in that direction as well. And if in doubt, refer to flash memory, a very doubtful emerging technology for at least two decades, and now flash memory is everywhere, in all of our pockets; we're also going to learn a lot about flash memory next week. There are many research and design opportunities in this area: enabling completely persistent memory, as we discussed; computation in or using NVM-based memories; hybrid memory systems; security and privacy issues in persistent memory; reliability and endurance related problems; and virtual memory systems for NVM, an example being the Virtual Block Interface. This is also interesting as food for thought: the reason virtual memory emerged is that main memory was not large enough, and that made life quite hard for programmers given the way they wanted to write programs, so people developed this brilliant idea of virtual memory, which gives the illusion of a much bigger memory. But with NVM you can actually have a very large memory, tens or hundreds of terabytes, so we need to rethink all of these things: do we really need virtual memory anymore? All of this is quite interesting to look into. Such topics are fascinating because, when you work on emerging memory technologies, in order to overcome the issues you really need to rethink the whole stack and revisit assumptions that have been made for decades, and that's quite insightful; it can push the boundaries, and you can come up with many out-of-the-box solutions to the issues. The Virtual Block Interface is one of the works we have done to think about virtual memory a little differently; you can check it if you're interested, or you can watch Nastaran's talk in one of the computer architecture lectures from Fall 2020, four years ago. And with that, I'm concluding this lecture, and we are five minutes early. Thank you. Questions? We could take that five-minute break. Okay, then I'll see you all tomorrow, have fun.