Yeah, okay, you can start it. It's very loud here. What do you think, how about now, better? People in the back can hear me? Okay, it's all good. Okay, we still have one minute, so I guess other people are having the same problem you had: automatically locking doors. Okay, let's get started. So today we're going to cover whatever we didn't cover yesterday, at least a portion of it, and then we're going to jump into memory latency, which is really a critical topic. Latency is often ignored, as we discussed in an earlier lecture, but it's really the root cause of a lot of the complexity that we have in our systems, and we should change that mindset; there is a lot of mindset that we need to change, I think, in designing computing systems, but latency is very well known, yet it's still a problem.

Okay, let's finish with data retention. Remember, we were discussing how to reduce problems that are caused by data retention issues, with a focus on DRAM, but I mentioned that this issue exists in any type of memory. I will quickly mention this idea of refresh-access parallelization: parallelizing refreshes with accesses so that you reduce the impact of refreshes on accesses, in terms of latency and in terms of performance. I think that's a very nice idea in general. It was described in this paper; I'm not going to talk about this paper in detail, but what this paper does is, while a refresh is going on in one subarray, you can actually do an access to some other subarray. I will talk about this paper that Giray led in the recent past, which showed essentially that you can do this in real DRAM chips, to a limited extent, by violating the timing parameters. So a lot of these ideas that were shown in simulation, like in this work in 2014, are possible to a limited extent in real DRAM chips by violating the timing parameters. You've seen this before, right, with RowClone, for example, in this case; you've also seen it with bitwise operations in the chips. Now we actually also see it in parallelizing accesses to different subarrays. This is actually a nice way of doing research, in my opinion: you first build a theory, and the theory says you should be able to do this based on the circuit principles and architecture principles; then you show it in simulation; and maybe after some time you develop the infrastructure to show that, actually, you can do it in real DRAM chips, with limited capability.

Okay, so we'll talk about this hidden row activation idea. Given that this work was done in 2022, refresh is actually becoming a bigger problem, as we have discussed. There's periodic refresh, but there's also preventive refresh that happens when you have RowHammer bit flips and you want to prevent those bit flips; several mechanisms have been proposed, as you have seen, to refresh, or preventively refresh, physically adjacent rows that might be vulnerable to these bit flips if a row is being hammered enough times. That's all familiar, right? It all makes sense. So the refresh problem is becoming worse, in a sense, because of RowHammer. I think we've covered this, so I'm not going to go through it in detail, but periodic refresh is getting worse, and this is a projection: at large DRAM chip capacities you actually get significant slowdowns. This is based on some newer projections, more reasonable projections than we had done in 2012, let's say; things have changed, and you also have more data points.
And with RowHammer, you've also seen that as chip density increases, the RowHammer vulnerability worsens; we have shown that there is real data from the field that demonstrates this. And if you do preventive refreshes, as the RowHammer threshold goes down from 1,024 to 64, with some mechanism, PARA I believe in this case, you get a lot of slowdown, because you have to ensure that you don't get bit flips. That means that if you can get a bit flip after 64 activations, you need to do a lot of refreshes to make sure you don't get that bit flip, especially with a probabilistic mechanism like PARA. So essentially the overhead of refresh is increasing, due to both periodic refreshes as well as preventive refreshes.

So the idea of refresh-access parallelization is essentially to enable some part of the DRAM to do refresh while other parts are servicing accesses. That's the major idea, basically, and there are multiple ways of doing it. One way is changing the DRAM; another way is basically violating the timing parameters and seeing what happens. Giray showed that if you activate two rows in quick succession that are in different subarrays in the same bank, you can refresh one row while concurrently activating the other row. That's the idea: you do two activations in quick succession, meaning violating the timing parameters, and if these two activations go to different subarrays, the first activation refreshes row A while the second activation activates row B. Does that make sense? The details are in the paper, and I will not go through them here, but it makes sense if you think about it, because each of these subarrays has its own local row buffer. While the first activation is bringing data from row A to the local row buffer of subarray X, the second activation enables row B to be brought into its own local row buffer as well as onto the global I/O, and you can read data from the second activated row even though you cannot read data from the first activated row. That actually accomplishes a refresh, because you brought row A to its local row buffer, and once you do the precharge, all of the data will be written back to the rows.

Okay, so that's the idea. What's the benefit? If you look at the baseline system, you first activate row A, which accomplishes a refresh (a refresh you can think of as an activation), and then you precharge it. Because these go to the same bank, you need to activate row B after the precharge of the bank is complete, and only then can you start reading from row B. If you use hidden row activation, you first activate row A, then quickly precharge the bank, and then you activate row B almost at the same time. Is it three nanoseconds? Yes, okay, those are the delays; he remembers it, that's good, maybe I remember it too. This way you can hide row B's activation while row A is being refreshed, or you can think of it the other way: you can hide row A's refresh with row B's activation. They're overlapped, essentially, and you can issue a read earlier, so this mechanism saves time.
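To make the command sequence concrete, here is a minimal back-of-the-envelope sketch in Python. All the timing values are assumed, illustrative numbers, not the measured parameters from the HiRA paper; the point is only to show how overlapping row A's refresh with row B's activation removes the serialized activate/precharge time from the critical path.

```python
# Illustrative timing values in nanoseconds (assumed, not vendor numbers).
tRCD = 15          # ACT -> READ delay
tRAS = 35          # ACT -> PRE delay (row must stay open this long to fully restore)
tRP = 15           # PRE -> next ACT delay
tBACK_TO_BACK = 3  # reduced ACT-to-ACT gap when timings are intentionally violated

def baseline_ns():
    """Refresh row A (ACT + PRE), then access row B in the same bank."""
    return tRAS + tRP + tRCD   # ACT A, PRE, ACT B, then READ B can issue

def hidden_row_activation_ns():
    """ACT A, then PRE and ACT B issued back to back, violating tRAS/tRP.
    Row A keeps restoring in its own subarray while row B is activated."""
    return tBACK_TO_BACK + tRCD

print("baseline:", baseline_ns(), "ns")                             # 65 ns with these numbers
print("hidden row activation:", hidden_row_activation_ns(), "ns")   # 18 ns
```

With these made-up numbers the access to row B starts tens of nanoseconds earlier; the exact savings of course depend on the real timing parameters of the chip.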
Okay, and Giray tested this on many DRAM chips, and he found that on SK Hynix chips you can actually do this. He can tell you why it doesn't happen in other chips; probably some of them have idiosyncratic internal mechanisms that don't allow you to do back-to-back activations like this, which is not nice, I think. I think the manufacturers should be better at exposing things, so that people can actually discover these things or enable new functionality in DRAM, as opposed to preventing this kind of experimentation, which I think goes against their interest. But frankly, I don't know what their interests are; it doesn't seem to be intelligent business, for sure. Even though I'm not a businessman, I can see that. Okay, but basically you can see that after testing this on real DRAM chips, we found that you can get more than a 50% reduction in the time spent on refresh operations. There's a lot more detail in the paper; you can read it, or you can talk to Giray. I think this is also very interesting because we had seen similar things in simulation roughly eight years before this paper, but now this shows that you can do it for a limited set of subarrays inside a DRAM chip. It doesn't work for all subarrays, because there's some internal circuitry that we don't quite understand and that you clearly cannot easily manipulate by violating the timing parameters. But this shows that if you actually go and design the DRAM chip to do this carefully, you should be able to; there's no reason why you cannot.

And then we designed a memory controller that essentially tries to exploit this. I'm not going to go through it in detail, but the memory controller buffers refresh requests and tries to hide refreshes under accesses to the same bank: whenever you're activating a bank, you don't just activate a row in that bank, you also schedule a refresh that goes to a different subarray of the same bank, concurrently with that activation. So you need to design a memory controller mechanism that takes advantage of this, of course. I'm not going to go through it, but if you do that, and if you simulate it using Ramulator, you get significant benefits, especially on preventive refreshes: if your RowHammer vulnerability is quite bad, you actually get a lot of speedup by reducing the performance impact of preventive refreshes on accesses. But even if you don't have preventive refreshes for whatever reason, by doing this on periodic refreshes alone you still get significant speedup, as you can see. And there's a hardware complexity analysis: if you actually implement this in the memory controller, it's a small amount of logic and very little added latency. Okay, so if you're interested, there's the paper, you can ask Giray questions, or if you have a burning question I can answer right now as well. Thoughts? Yes, do you have a question for yourself? Okay, go ahead. That's true, yeah, exactly. I think, as I discussed in the memory-centric computing lectures, and as we will discuss more once we talk about processing using memory, we now actually know how to activate many rows in different subarrays, and we know how the hierarchical decoder is organized, or at least we have a much better understanding of it, so maybe there is a better way of doing things like this. Maybe you can do access-access parallelization across subarrays, which is what we're going to talk about later in this lecture, called subarray-level parallelism, but that's going to be very tough, I think, without changing the DRAM.

Okay, I should also say that these issues are discussed by industry. Industry is clearly trying to improve refresh, and there are a lot of papers, well, maybe "a lot" is an overstatement, but there are papers in circuit-level conferences that DRAM manufacturers write to optimize refresh in different ways.
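Going back to the memory controller mechanism described a few paragraphs above, here is a very rough sketch of the scheduling decision it has to make. The interface, the refresh buffer, and the subarray-mapping function are all my assumptions for illustration; the actual design and the conditions under which it may issue such command pairs are in the paper.

```python
class RefreshAccessParallelizer:
    """Toy model: buffer pending refreshes and piggyback each one on an access
    that goes to the same bank but a different subarray."""

    def __init__(self, subarray_of):
        self.pending = []               # list of (bank, row) awaiting refresh
        self.subarray_of = subarray_of  # function: row -> subarray index (assumed)

    def enqueue_refresh(self, bank, row):
        self.pending.append((bank, row))

    def on_activate(self, bank, row):
        """Called when the controller is about to activate `row` in `bank`.
        Returns a row to refresh concurrently (back-to-back ACT), or None."""
        for i, (rbank, rrow) in enumerate(self.pending):
            if rbank == bank and self.subarray_of(rrow) != self.subarray_of(row):
                del self.pending[i]
                return rrow
        return None

# Example: rows 0-511 in subarray 0, rows 512-1023 in subarray 1 (made up).
sched = RefreshAccessParallelizer(subarray_of=lambda row: row // 512)
sched.enqueue_refresh(bank=0, row=700)
print(sched.on_activate(bank=0, row=42))   # 700: its refresh is hidden under this access
print(sched.on_activate(bank=0, row=42))   # None: nothing left to hide
```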
So the manufacturers really care about it, and there's a good reason: it's a scaling challenge, as we have discussed, and that's why we started looking at it. This is a paper that I mentioned in earlier lectures, maybe Muhammad mentioned it, that was written by Samsung and Intel for the Memory Forum in 2014. It's a nice paper; I would still recommend it to people. If you look at it, they talk about three major challenges. The first is VRT; we already talked about variable retention time, and this paper says they add error-correcting codes to handle VRT, so that part of the paper is really about error-correcting codes. The second part is write latency, which I'm not going to talk about right now, but it's the subject of, let's say, the next part of this lecture on memory latency. They care about it because write latency is increasing, and their solution is actually one of the solutions that we will discuss, subarray-level parallelism: you can tolerate the write latencies by doing one access in one subarray while doing a write to another subarray. They actually evaluated it and found good results; this is essentially a validation of our work that was published two years before theirs. But one important part over here is the refresh part. As you can see, they say that it's very difficult to handle the refresh problem, because the leakage current of cell access transistors is increasing and cell capacitance is decreasing, so we need better solutions to this problem. In this paper they did not talk about RowHammer, because it was exposed two days after this paper was published in 2014, but I would probably add that as well. There's another interesting thing in this paper, which essentially goes along with what we have been saying for a long time: to overcome these challenges, they say it's very difficult to do circuit- or device-level solutions alone; we should really design the controllers and the DRAM together, essentially intelligent memory controllers. And there's one more interesting thing: this is the only paper that I know of where Samsung and Intel collaborated and wrote a paper. If you find something else, let me know. Usually these companies don't talk to each other without a lot of lawyers present in the room, but these folks were actually forward-looking enough to collaborate, and I have more stories about this paper later on. Any questions? Okay, how many people read this paper? It's four pages. I think it's part of the homework, right, or one of the suggested papers, I don't remember. Again, in the homeworks you have some required papers, but you also have a lot of suggested papers that are for extra credit; if you actually review those papers, you get a lot of extra credit. The goal of this course is not to nickel-and-dime you with every single grade; the goal of this course is to enable you to learn, so that you don't have to worry about a lot of grading while learning. These extra grades are all opportunities.

Okay, so this is something we will not cover in detail, but I will flash some slides, as you can see, on flash memory, because I don't think we have time right now. This is an issue in flash memory also. Very quickly: flash memory, even though it's nonvolatile memory, is still charge-based memory. There are floating gates; you trap the charge in the floating gates and then you basically sense the charge with different mechanisms. Unfortunately, charge essentially leaks from the floating gate.
It leaks through mechanisms we will discuss later on, but this problem also becomes worse in flash memory as the charge storage unit size reduces, and as a result retention becomes a problem. In fact, even though it's nonvolatile memory, flash memory gets refreshed in real commercial SSDs, because as things age, retention becomes a bigger problem, and over time people need to refresh flash memory, not at the same time scale at which DRAM is refreshed, but at a much longer time scale. I will flash some papers and we'll talk about them later. One of the issues is that as flash memory gets older, as you write more data, retention errors start increasing and accumulating, so this refresh starts increasing, and the other mechanisms that tolerate these errors in the flash controller start working harder; as a result, you start getting performance degradations. These are actually very well known and very well reported, which I'm not going to talk about right now, but as I said, I will flash some of these slides. So basically, charge leaks over time and this leads to a retention error. We have analyzed this a lot in our early and later papers, and to overcome it you either need to make the reads take longer, to sense a smaller amount of charge, or you need to remap the data somewhere else, or you need to add ECC, or there are many other mechanisms you can think of, including variable-rate refresh, et cetera. We characterized these errors as early as 2011 (the paper was published in 2012) using an FPGA-based infrastructure for flash, and essentially we looked at read errors, erase errors, and program errors, which we will talk about later on, and also retention errors. Essentially, whether an error happens depends on the retention time, how long you want the data to be retained in flash, and this becomes worse in flash memory where a single cell encodes more bits. This is called MLC, multi-level cell, which is perhaps not the best name; multi-bit cell is probably a better name. But essentially you use the same voltage range and you chop up that voltage range into smaller intervals so that you can encode more bits: two bits, three bits, four bits, et cetera. If you do that, then the width of the voltage window allocated to a particular encoding reduces, and the probability that, when you get charge loss, one encoded value will shift into the voltage domain of another encoding increases.
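As a toy illustration of why packing more bits per cell makes retention errors more likely: assuming a fixed usable threshold-voltage range (the 3 V below is a made-up number), the voltage window allocated to each encoded state shrinks exponentially with the number of bits per cell, so a smaller amount of charge loss is enough to drift into a neighboring state.

```python
V_RANGE = 3.0  # usable threshold-voltage range in volts (illustrative only)

for bits_per_cell, name in [(1, "SLC"), (2, "MLC"), (3, "TLC"), (4, "QLC")]:
    states = 2 ** bits_per_cell   # number of distinct charge states to distinguish
    window = V_RANGE / states     # voltage window allocated to each state
    print(f"{name}: {states} states, about {window:.2f} V per state")
```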
Okay, as I said, we did a lot of studies here; this is just to whet your appetite, if you will. Program/erase cycles is how many times you have written to a particular location in flash memory, and you can see that the raw bit error rate we observe in flash memory increases as you keep writing to it; this is the endurance problem. And you can see that the biggest contributors are really the retention errors: most of the errors responsible for the raw bit error rate, even when the flash memory is young, meaning it has not been written to very much, are still retention errors. So if you want to retain data for three years, you cannot do that easily, especially if the flash memory is old, as you can see over here. And if you look over here, this one is the one-day retention error, the yellow one: if you want to retain data for only one day, you still get a bit error rate that's quite high, around 10^-4, assuming you have written to the flash a lot; but even if you have not written to it a lot, say only 100 times, you still get a fairly high bit error rate, around 10^-7. So you can see that retention errors are still problematic in flash memory, especially MLC. This is just to whet your appetite; we'll talk about some of this when we get to flash memory.

So what is the solution? Essentially, the solution is to refresh periodically, but then what is the period? You can change the period based on how old the flash cell or flash page is, meaning how many times you have written to it, because clearly there's a relationship between how many times you have written to it and how much charge it can retain. The reason it becomes harder to retain charge is that as you write to the cell many, many times, you wear out the cell, and this wear-out essentially hurts the retention mechanisms; we will discuss this more, as there are interesting tunneling mechanisms that happen in flash memory. So basically you refresh more often at higher program/erase cycle counts. And the question is, how do you refresh? In flash memory you can actually have multiple different mechanisms. In DRAM, when we refresh data, we did it in place, right: you access the data, you activate the row, which brings the data to the row buffer, and then you precharge the array. That's the way to refresh; I call it a kind of in-place refresh, though it's not exactly in place, since you're moving the data between the sense amplifiers and the row you're refreshing. In flash you can do that also, by programming the cells; in flash you need to operate on things by programming. So in-place means that you program the cells so that you compensate for the lost charge: you add more charge to each cell that has lost charge. But you can also do something else, which is to remap the data from one page to another page somewhere else, and this is feasible in flash because the flash controller is an intelligent controller: it maintains a mapping table that tells you where each block is in the physical address space. This doesn't exist in DRAM. In DRAM, you get an address from the processor and that address is directly sent to the DRAM chip; internally there's a mapping mechanism in the DRAM chip, but it's very limited. In flash, the address you get from the processor is remapped completely using a logical-to-physical page mapping table, so you can change where each logical page is located in the physical domain of the flash memory. This is an example of an intelligent memory controller, and it enables remapping-based refresh in flash memory. Again, if you're interested, you can read some of the original papers.
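Here is a minimal sketch of the two refresh options just described, in-place reprogramming versus remapping through the logical-to-physical table, together with a refresh period that shrinks as program/erase cycles accumulate. The thresholds, periods, and helper functions are all invented for illustration; the real policies are in the original papers.

```python
def refresh_period_days(pe_cycles):
    """Made-up schedule: refresh more often as the block wears out."""
    if pe_cycles < 1_000:
        return 90
    if pe_cycles < 5_000:
        return 30
    return 7

def refresh_logical_page(lpn, l2p, pe_cycles_of, reprogram_in_place, allocate_fresh_page):
    """lpn: logical page number; l2p: logical-to-physical mapping table (dict)."""
    phys = l2p[lpn]
    if pe_cycles_of(phys) < 3_000:
        # In-place refresh: reprogram the same cells to top up the lost charge.
        reprogram_in_place(phys)
    else:
        # Remapping-based refresh: copy the data to a fresh physical page and
        # update the mapping that the flash translation layer already maintains.
        new_phys = allocate_fresh_page(data_from=phys)
        l2p[lpn] = new_phys
```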
Yes? Yeah, a lot of these papers study that; there's basically spatial variation, certainly, yes. Exactly: there certainly is spatial variation in flash. Not all flash controllers try to exploit it, but it's possible to exploit it much more easily in flash memory than in DRAM, because of the remapping mechanism, exactly. If you had that remapping mechanism in DRAM, we could also exploit it, like when Giray discussed the paper on spatial variation in RowHammer vulnerability: today that is very hard to exploit, because you need to have some sort of table to keep track of what kind of vulnerability each row has. But that's a very good observation. Okay, so this is the original paper that discusses how to do refresh in flash memory using in-place and remapping approaches, and it has a lot of data as well. I'm not going to go through this in detail, but there's actually a more detailed characterization, and there are different mechanisms that we have proposed to handle data retention. Some of these are actually recovery mechanisms: even if you get errors in flash memory, meaning somehow you pushed your flash memory for too long and eventually you got some errors, you can still recover data by doing post-processing. You take your disk and post-process it in a probabilistic manner, and you can still recover a lot of the data. This is actually very interesting in storage: how you recover data could be used for good or bad purposes. That's why you shouldn't lose your flash memory. Okay, so if you want to whet your appetite, this is the paper that I would recommend reading; it's a long paper that summarizes a lot of the research. It doesn't cover everything we have done, but it covers a lot of the state of the art in industry and academia at the time it was written, and a lot of those things are still employed in the field. There's a more up-to-date version also. Okay, we're going to get to flash memory later in this course.

So I think, to conclude: we discussed the RAIDR idea early on, right? I think RAIDR is a good idea, but one can come up with many good ideas, in my opinion, assuming you have the right mindset and the right preparation; it's preparation and mindset and a little bit of luck. But then good ideas may be difficult to implement, right? The key is really finding the good ideas; I think that's a good first step, but after that point, making them work is actually an even harder step, potentially, depending on the idea. So we've kind of seen the story of how to make variable-rate refresh work as much as we can, and that concludes the data retention lecture. I'm happy to take more questions if you have any; otherwise we're going to switch gears to memory latency. No? Okay. It's still too early to take a break, so we're going to switch. Let's see how we switch. Okay, there's the desktop... okay, screen share, I forgot that's what happens... better, I don't want this display thing... okay, looks good.

Okay, so the last part of this lecture was really about the latency impact of refresh, right? We talked about parallelizing refreshes and accesses, and in that case we're not saving energy; well, you could be saving energy by reducing execution time, but we're not directly saving energy by eliminating refreshes. In that sense, variable-rate refresh is very nice because it eliminates refreshes, which is good for energy as well as performance, but the refresh-access parallelization mechanism we talked about doesn't eliminate refreshes; it eliminates, or tolerates, the performance overhead of refreshes. In a sense, it tolerates the latency. So now we're going to talk about more direct mechanisms to reduce latency, which will be applicable to both refreshes and accesses, and I think this is a very important area, as we discussed. I'm going to motivate this a little bit. We've talked about robustness, we've talked about energy efficiency, which we're going to get to.
Now we're talking about low latency, and later in the lectures we're going to talk about these other directions as well, so stay tuned; right now we're over here. I'm not going to cover the predictability part, because that's a separate thing: latency and predictability are two different things. The way I like describing it is this: I'm waiting for my train, and according to the schedule it comes at 13:55, 15 minutes from now. If it's "low latency" and it comes at 13:53, that's not good, right? I don't want that; in that case I don't want low latency, because I'll miss it. This has happened to me multiple times in Zurich: the train was supposed to come at a particular time, but it came one minute early and I missed it. That's where you want predictability; I don't care about low latency in that particular case. So that's the difference, basically. But in many cases you want low latency and, on top of that, predictability. It's good to keep this in mind, but now we're going to focus on low latency.

So why do we want low latency? As I said earlier, I think latency is the cause of a lot of the problems in our computing systems, and we probably don't pay enough attention to it. Low latency is needed to solve problems: clearly, a lot of problems demand high performance, and a lot of them are bottlenecked by latency in the end. I'm just flashing these slides without making an obvious connection to latency, but you can imagine it. I'm going to use an analogy which I have not used in this year's incarnation of the lectures, because I wanted to combine everything together. There's something weird in the slide; that "theory of human motivation" title is kind of doubled, or are my eyes playing tricks? Okay, anyway, we'll try to fix that. How many people know about this hierarchy, Maslow's hierarchy? Okay, that's good. It's a famous one, because it's not just about psychology; it's really used in many domains. This American psychologist, Maslow, came up with it, and he basically says the first things you need to care about are safety and security and food, et cetera, and if you don't have those, then you cannot think about anything else. I think humans need to be better at understanding that this is the case for everyone in the world; these days, unfortunately, we don't have that sort of empathy in the world, but this needs to be applied everywhere, I think. But here we're applying it to the computing domain. Maybe you argue that you start with energy; what is energy? It's food for the computer, right? If a computer needs to run, you need energy. Is that the most important thing? I don't know, maybe robustness. Which one is more important, which one do you care about more: robust computing or energy-efficient computing? If you don't have energy, maybe you don't care about robustness, right, because it's not even running. But then maybe low latency is enough: maybe you don't have enough energy or enough robustness, but your latency is low, you're so quick at fixing things, that you may actually be better off. Anyway, this is where the analogy kind of ends; you should think about it a little bit. Maybe if you are very speedy in correcting things, or maybe speedy in generating energy, who knows. Okay, basically, latency is important. I'm not taking sides on which one's more important; we really need everything optimized, right?
As we discussed earlier: energy, robustness, and latency. So we're going to talk about latency now, but if you're interested in this, there is actually a very nice talk by my former colleague Satya, who was a professor at CMU and works on both distributed computing and edge computing; he's done a lot of great work in this area. I'd recommend that you watch this talk, where he talks about the importance of latency, as well as other things, when you do edge-type computation: basically, if you want to be highly responsive, you have to optimize for latency. There's one slide that I borrowed from him that basically says that if you want to interact with humans, if you want to, let's say, augment the real world, these are the speeds of human cognition, and you need to be faster than these speeds. He argues that you need to beat these speeds because you need to add value: if a human does something, and you give the same experience to that human with computing, in virtual reality for example, that's not interesting. You probably want to provide a better experience, right? If I go out in virtual reality, I would like to really understand many things about my environment that I cannot understand today; if I get the same experience, I'm not going to buy anything related to your virtual reality environment. So you need to be more superhuman, in that sense, if you want to enhance the experience, and that requires really low latencies, because human cognition is actually very fast, even though it may look slow. And then there are other reasons for caring about latency, of course, but these are his reasons over here. Certainly latency is important in critical situations where, for example, you need to analyze a genome and make decisions for a critically ill patient, and you really need to understand what kind of drug to deliver that's personalized to that person. You cannot wait days and days for the genome analysis in this case; you actually need to do it right away, and you cannot make a mistake. So both latency and accuracy are important. We're going to talk a lot about genome analysis in a later lecture, but just to whet your appetite: if you want to do this in a real-time setting, latency is really critical, and we have seen that in our work also.

Okay, so if we talk about data-centric architectures, this is one slide where I try to summarize the properties of a data-centric architecture. If data is really important, which we discussed, you want to process data where it resides, which is what we discussed earlier, but you also want to enable low-latency and low-energy access to data. Processing data where it resides reduces the access latency to data compared to a processor-centric system, but if the data is still long-latency to access, that's a problem. That's why these two are important to do together, in my opinion. But independently of where you process the data, the latency and energy to access the data are important. So we're going to see that. A lot of the studies that I'm going to show you are not in the context of memory-centric computing; a lot of them are in processor-centric computing, and even then we really need to reduce access latency.
Okay, I'm not going to cover these two, but we'll talk about them later on. I should also mention that here I say energy: it's not just latency, it's also energy, and these are interesting goals that can be synergistic with each other. Sometimes when you reduce latency you also reduce energy, because you reduce the amount of time the circuits are active, and active circuits are bad for energy in general; so fundamentally, reducing latency is good for energy as well. Keep that in mind. It's not always the case, because depending on the latency reduction technique, you may add more complexity to the system, which may actually increase energy; depending on the trade-off, part of it will increase and part of it will decrease the energy, but it's important to aim for both in general. Okay, any questions?

Now let's take a look at some fundamental trade-offs. You've seen some of these slides, so I'm going to go through them relatively quickly. In an earlier lecture, the first lecture actually, I showed you this graph, which shows how DDR memory capacity, bandwidth, and latency have improved, and we discussed some of the reasons. Clearly, capacity has been the focus in memory; latency has not been as much of a focus, and if you look at some specific latencies that we're going to cover today, activation, precharge, restoration, some of them have actually been increasing. In fact, that Samsung and Intel paper I mentioned says that write latencies are going to increase, so we need to somehow figure out how to tolerate them; the write latencies are going to increase because they cannot push enough current through the access transistor quickly enough, so there are circuit-level reasons why these things increase. And this is the other graph that we showed. It's a longer-term study, which I would recommend everyone do: if you're interested in an area, it's always good to investigate what has happened in that area. This was done by my PhD student Minesh Patel, and you can see that over the course of 54 years or so, the capacity has increased essentially exponentially; in recent years maybe we're struggling comparatively, let's say, but it's still a million-times capacity improvement in a DRAM chip, which is fascinating. It's essentially a similar story to Moore's law: you can pack more transistors, and pack more capacitors, in a given area with better technology scaling. Unfortunately, latencies look very different. Overall, over those 54 years, based on datasheet information, you get only a modest latency improvement, and if you look at the more recent past, the last 24 years, the latency seems almost constant. Does that make sense? Well, it doesn't have to make sense, because it's real data. You can argue with the data, but the only way to show that it doesn't make sense is to show that the data is incorrect, and it's very hard to argue with real data. Okay, but this is the state, basically; this is the state we're in. I should probably change that phrasing, because today we're in the misinformation age and people don't care about data in general, but we're scientists over here, we're at a university, we're in an academic institution, and we should not be part of that misinformation. I had to add that, sorry, because I think there is a very big danger that the whole world is facing. Okay, but basically, if you look over here, this is the state.
And as we discussed, this is partly a conscious choice, in the sense that everybody wants capacity and that's the major design goal in these chips, but it's partly also because latency is the harder thing to improve: reducing latency is really harder than improving capacity, because technology scaling is not on your side. In general, when you go from a bigger technology node to a smaller technology node, latencies do not magically reduce, especially the interconnect latencies. If you're interested in this, and in other things we discussed, for example the relationship between the memory controller and DRAM, and how DRAM should expose more to the memory controller so that we can get better reliability and better characteristics, this paper actually makes a nice argument for why that should be the case, with a lot of case studies and a lot of interesting data. I'd recommend people look at it, and maybe we can include it in a future homework as a potential paper to review.

I've already said this in earlier lectures, but there are also a lot of works, and these are a select set of papers, that show that DRAM latency is important for performance in many workloads. In many workloads latency is important, in many workloads bandwidth is important, in many workloads capacity is important, and there are different parts of a workload that are affected differently by these different metrics. Sometimes you have a really critical latency requirement: you have a database transaction and you want it to go through right away. Sometimes you have a bandwidth requirement, because you actually have lots of these transactions. And you have capacity requirements pretty much all the time, especially with large data. So workloads are actually a complex mix; it's very hard to say that a workload is completely latency-bound, especially if it's a complex workload, because it's really a combination of these different characteristics, and you can be bound by multiple things. But basically, long memory latency is a performance bottleneck.

We studied a lot of workloads in this particular work, where we were trying to understand the effect of different characteristics of different DRAMs on different types of workloads. I'm going to mention this quickly, but it's a nice study that we did. What did we try to do? Essentially, there were a lot of different types of DRAM that people were developing, and today we do have a lot of DRAM types: we have HBM, we have LPDDR, we have Graphics DDR, we have regular DDR. If you want to learn more about those, this paper does a very nice job explaining what the differences between them are. I'm not going to do that, because I think the differences are important but not that fundamental: high-bandwidth memory gives you high bandwidth, and we're going to talk about it at that level; if you're interested in more detail, you can look at the paper. But basically, there is a very diverse set of applications that we execute, and it's very hard to take a one-size-fits-all approach. Some applications really want more bandwidth, and this is why high-bandwidth memory was developed; clearly, high-bandwidth memory is having a lot of success today because it's coupled with GPUs that execute machine learning workloads. So the question we asked in this work was: which DRAM type works best with which application, at least on average?
We did not look at hybrid memory; that's another exploration we will talk about later in this course: how do you combine different memory technologies, and how do you take advantage of them to improve an application or the system in general. Here we're talking about one type of memory and how it works with different types of applications. We wrote this paper and submitted it to a conference, and I'll tell you some stories as well. Basically, the reviewers came back and said: oh, you did a very extensive study, that's great, you studied nine different types of DRAM, et cetera, but why didn't you do it on a real system? Now, that's an interesting question: how do you study a given application with nine different types of memory in a real system? We discussed this in the past. Somebody yesterday asked why the memory controller is on the processor chip. Because the memory controller is on the processor chip, there's no way you can change the type of DRAM that you attach to the processor chip: that processor chip has a memory controller that can communicate with one specific type of DRAM. It could be DDR, it could be HBM, it could be LPDDR, but you don't have a processor chip with, let's not even say nine, but three of those. So you can see that sometimes reviewers just pull things out of their hat and say things; that's why you need to believe in the work, and you need to really understand what you're doing, and you cannot trust the reviewer to really understand what you're doing. Basically, this study cannot be done on a real system, because there is no real system that can support so many memory controllers. This will become even clearer when we talk about memory controllers, because each memory type requires a lot of pins, and if you really want to add nine different types of memory controllers and support all of them, you need hundreds of additional pins on your processor chip, and that makes no sense. There are multiple reasons why it doesn't make sense: one is that it's expensive, clearly; and then, what are you going to do with it? Why would you design a chip that can communicate with nine different types of memory, with nine different memory controllers? Now, this problem would be alleviated if the memory controller were not on the processor chip: if you had another, separate chip, then you could plug in different types of memory by changing the kind of chip that you communicate with. But we'll have more stories about this in the memory controller lectures.

Okay, so in the end we studied a bunch of applications and workloads, and we looked at nine DRAM types, nine, as you can see; not all of them exist today. I'll go through this relatively quickly. These are DDR3 and DDR4; today we have DDR5, but at the time we didn't have it. This is Graphics DDR. You can see that they have different characteristics, different upsides and downsides. For example, going from DDR3 to DDR4, the latency actually increased, and we wanted to understand whether this is good or bad for many workloads. GDDR5 increases area and power, and we wanted to understand the effect of that on energy. And then there's high-bandwidth memory; at the time we also had the Hybrid Memory Cube, which doesn't exist today, for various reasons.
Power was one of the issues, actually. But HBM and the Hybrid Memory Cube are, in my opinion, relatively similar: they're both 3D-stacked types; their interfaces are a little bit different, but essentially these are 3D-stacked DRAMs. At least in the ideal incarnation they can be 3D-stacked; HBM right now is stacked in a 2.5D manner, using interposers between a memory controller and the HBM stack, but there's no reason why the memory controller cannot be in a dedicated logic layer, and in fact in HBM4 that's going to happen, I believe. They made different design choices; for example, the Hybrid Memory Cube made the choice of having narrower rows and higher latency. And then there were some other low-power technologies. Some of these are high-bandwidth and low-power, like Wide I/O, which I'm not sure anybody uses these days; does anybody use Wide I/O anymore? Maybe not. I think HBM kind of overtook a lot of these; they are lower-power, high-bandwidth technologies that didn't survive, let's say. But LPDDR survives: these are low-power, not-so-high-bandwidth technologies, but they're low power, so there's a benefit there. Okay, without going into a lot of detail, there's a lot of analysis in the paper that I'm not going to talk about here; you can read the paper for more detail. But I'm going to point out one thing that we found: new DRAM types often increase access latency in order to provide more banks and higher throughput. Essentially, they make the trade-off of accepting higher latency to get higher throughput, and we wanted to understand whether that's a good trade-off. Essentially, we found that many applications do not benefit from it. For example, a lot of operating system routines, file I/O, process forking: they actually lose performance when you go from DDR3 to HMC, for example. There are also other applications that lose performance when you go from DDR3 to DDR4; even though it's a slight performance loss, they essentially do not gain any benefit from going from one memory technology to another. Ideally, you would get benefits; if you don't get benefits, why do you go from one memory technology to another? This curve should not be flat, basically. It probably should not go down either, as you can see, but it should not be flat. So it's good to think about this: what benefits, and what doesn't?

Okay, so these are some conclusions that we draw in the paper. Essentially, DRAM latency is still a critical bottleneck for many applications, and the additional parallelism provided by some of these new technologies, like high-bandwidth memory, is not really fully utilized by a variety of applications. If you want to utilize it, you need to change your application somehow. At the same time, applications rely on spatial locality, so some of the technologies that reduce spatial locality, meaning they reduce the row buffer size, actually hurt the performance of many applications, because the applications have spatial locality that they can no longer exploit; and if you want to exploit the larger parallelism, you need to actually change the application. We also found, and this had been found earlier too, but we replicated the finding, that low-power memory can provide significant energy savings without sacrificing significant performance.
And I think it's good to think about that going into the future: can we actually reduce latency as well as power at the same time? Okay, I'm going to skip this because we already talked about it, but you can find more information in the full paper. We also released a lot of tools related to this work; they were used by industry and academia. If you're interested in this sort of study, I think it's important to do, especially with a modern set of applications. The study was published in 2019, after being rejected a couple of times, so you can see that the research was actually done even earlier. It's good to revisit what's happening in the DRAM space, as well as the storage space, for emerging and new applications, because clearly the application landscape has also changed, given that we're in 2024 right now. But basically the conclusion is: even though new DRAM types are coming, some of them actually increase latency, and this hurts performance.

Maybe this is a good time to take a break; I have a nice clean cut over here. After the break I'm going to motivate the memory latency problem a little bit more. Hopefully this motivation was enough, but I'm also going to give you my personal story, because I actually worked on memory latency for a really long time. My PhD thesis was all about tolerating memory latency in processor-centric systems; even my first works before my PhD were about prefetching, about how you tolerate the latency that's visible to the processor. So I'm going to give you that story first, and then we're going to jump into how to fundamentally reduce latency, as opposed to tolerating the latency. Because tolerance, I think, is nice, but it has limited reach; how should I say it: it's important, it's fundamental, it's nice, but unfortunately it's not really solving the problem. The real problem is the fact that you have high latency to begin with. You can do all sorts of things to tolerate it, and all of that adds complexity; if you don't solve the latency at the heart of the problem, then you're going to keep solving the toleration of the latency, and you're going to make your systems more and more complex. I'm giving you this because it's my personal story and personal belief, based on what I have done over, let's say, 25 years or so in computer architecture, and I think it's fascinating. This doesn't mean that those tolerance techniques are not important, don't get me wrong; we'll see that also. Okay, so now let's take a break. Oh, is there a question? Okay, let's take it before we take the break, given that we're going to go to the second part of the lecture afterwards. Okay, sorry, you couldn't read it? I can read it, I need a microphone; okay, you need a microphone. Basically, this person asks: with NAND SSD speed scaling much faster, will we have an edge system that omits DDR memory totally and relies solely on the NAND SSD, considering that it doesn't wear out much faster? I'm not sure about the last part, but on solely relying on the NAND SSD: I think the difficulty is that if you want to get rid of the main memory, you're now increasing the latency a lot, to SSD levels, so I think it's not going to be easy to get rid of main memory, especially in a high-performance system. If your goal is to have a low-power system, yes, maybe. But reducing the latency of NAND to the level of DRAM today, I think, is very difficult. I think it's good research to try to reduce NAND latency, but bridging the latency gap between NAND flash and DRAM is a very, very tough proposition.
It's very hard to see that happening anytime soon, in my opinion. That's why DRAM is a very nice technology that has its own place, and that's why we have this memory hierarchy today. Of course, if you start doing in-storage processing or in-memory processing, that changes the game in a sense, but still, the latency to access flash, as we will see in later parts of the course (maybe Muhammad will talk about that, because he's done a lot of research on it along with us), is very hard to reduce to the levels of DRAM. So if you want to do in-storage computation, we will also see that you may actually want to take advantage of the DRAM that's inside the storage system, because there is DRAM in the storage system. Okay, no more questions? Okay, let's take a break until 14:19 then, and then we'll continue with the memory latency problem.

Okay, I think it's time to get started, am I correct? 2:19? Okay, I thought I said 19, but predictable performance, you know. Okay, let's get started. Any questions before we move on, anything that came to your mind? Okay, otherwise: clearly, high memory latency is a significant limiter of performance and energy efficiency, and this is becoming increasingly so with higher memory contention, which we will discuss in a later lecture on multicore and heterogeneous architectures. It exacerbates the bandwidth need and it exacerbates quality-of-service problems; basically, higher latencies are bad for quality of service and predictability, and also for bandwidth, as we will see later on, because if you want to tolerate the higher latency with more parallelism, more prefetching, and more waste, then you need more bandwidth. It also increases processor design complexity, because you need a lot of mechanisms to tolerate memory latency. This is a slide that I used first in my PhD proposal and also in my PhD defense; these are conventional processor-centric latency tolerance techniques, and you know a lot of these; we're going to cover some more of them later on. Caching, for example, is widely used. Unfortunately, it doesn't work all the time, although this is what people do: they keep adding caches and caches and caches. Locality is strong, but there's a lot of inefficiency in those caches; there are a lot of studies that show that more than 80 to 90% of the data stored in a cache is useless, basically: you could replace it without having any impact on performance, but you brought it over there, so there's a lot of waste in caching. Prefetching: these are all ideas that were developed in the 1960s, as you can see, and they have been refined over many years, but prefetching is also a nice idea, and I think we should work more and more on prefetching, because it's an area that has not been worked on as much as caching has. Unfortunately, we don't have the best prefetching mechanisms yet. And these are processor-centric again: prefetching reduces the latency from the perspective of the processor, and caches also reduce the latency from the perspective of the processor. Multithreading tolerates latency: if you have one long-latency access happening in one thread, why not have other long-latency accesses overlapped in other threads? This tolerates the access latencies, and it's a very well-known technique; it was initially employed in the CDC 6600, which is a nice processor.
If you don't know about it, there's a beautiful book by Jim Thornton about that machine. But essentially, multithreading is also wasteful in a sense: it requires more bandwidth. This is one of the reasons why we need more bandwidth, because there are heavy levels of multithreading employed by some architectures, like GPUs, for example, and also machine learning accelerators, not just GPUs. There's another issue as well: if you want to improve single-thread performance with multithreading, that's a tough call; a lot of people have tried it, and it's not very easy. And then there's out-of-order execution, which is essentially executing instructions out of program order such that if one instruction cannot proceed because it has a long-latency memory access, hopefully some other instructions can be executed, and now you tolerate the latency of that long-latency memory access with the execution of those other instructions. Unfortunately, this also requires extensive hardware resources, especially if you want to tolerate very long latencies. So basically, none of these really fundamentally reduce latency. This doesn't mean that they're not important; they're employed in our existing systems, and they have been refined a lot, including caching, but they don't really reduce latency. Yes? Is there any progress anymore in this research? Yeah, certainly there's progress, and we're going to discuss some of that in later lectures; we hope to have a lecture on prefetching specifically, and we'll probably cover caching as well. There's certainly progress in all of these areas; prefetching probably has the most potential, from my perspective, and there's a lot of work, for example we're doing a lot of machine-learning-based prefetching to do things better, and there are also machine-learning-based caching mechanisms, but we're going to get to that later on, in the processor-centric part of the course, maybe.

Okay, but I'm going to give you one idea. As I said, I want to give you a little bit of what I think about this and what I have worked on. This is my PhD thesis; it's runahead execution, and I really like this idea. I'd recommend that people work on PhD theses on ideas that they are still going to like after 20 years or so; it would be nice. Basically, ideally we want perfect caches: everything hits in the caches and you keep computing, you never stall for any access. That's what we really want; prefetching tries to get there, but it's not easy to do. If you have out-of-order execution with a small instruction window, you get a cache miss, and computation continues until your instruction buffer becomes full; after that point the processor stalls, and then, after the miss gets serviced, the processor can continue, until it gets to the second long-latency miss. This sort of timeline is very common in many applications today; that's why, in earlier lectures, when we looked at data from Google, for example, most of the time the processor is waiting. This is real; this happens. Now, the idea that we developed, which we called runahead execution, is a simple idea. When you get to the load miss, at some point this load miss becomes the oldest instruction in the window, but you cannot retire it, because the load hasn't finished; you're waiting for data from memory. Instead, checkpoint the architectural state, meaning take a checkpoint so that you can go back to that point later, and start speculatively executing instructions.
And hopefully that speculative execution will lead to some cache misses that you could not have discovered by stalling; clearly, you're not executing anything while you're stalling. While you're in this runahead mode, you're in a speculative processing mode where you're not waiting for cache misses; you keep executing instructions. And what happens to the value of the cache miss that you did not wait for? You basically mark it invalid in some way, or you put in a value, you predict the value. This is purely speculative execution, and you're really executing on the program path, and this leads to another cache miss that you can generate, so you can start that access, and it can be parallelized with the first miss. At some point the first miss returns from memory; you flush the pipeline and restore the checkpoint, so that you go back and re-execute from load one. Load one now hits, and when you get to load two in the real instruction stream, load two also hits, because you actually prefetched it in the speculative mode, as opposed to waiting for a long-latency cache miss. And so you save cycles. Sounds good, right? Of course, there are a lot of things you need to do to make this really work nicely, but this basic idea works, and it's been implemented in multiple processors; we're going to cover this sort of prefetching mechanism later on. For example, it was in Sun ROCK; this is from Sun, back when it used to design processors; they got bought by Oracle afterwards. You can see that if you do runahead (their name for runahead is "scout"; essentially it's a scout thread from their perspective, since everybody needs to rename things for commercial reasons, and they renamed it to scout, which is not a terrible name actually, they could have done a much worse job), on their processor, with the commercial workloads they were interested in executing, you get 40% better performance, which is actually very much in line with what we had reported in our papers. Or you could add runahead on top of a cache and get better performance, or you could just do runahead and save on the amount of cache that you add to the processor: if you want to reach a given performance level, you can either get it with runahead and a one-megabyte cache, or with no runahead and an eight-megabyte cache. That sounds good, right? These are interesting trade-offs that it enables, and runahead is much less costly than eight megabytes of cache, or seven megabytes of extra cache. Okay, so if you're interested, there's the paper, and we're going to cover some of this later on when we talk about prefetching; there's a lot more work on making it better, and in an earlier incarnation of the course, you can see that's four years ago, we had more prefetching lectures; we'll have those again. This paper was given a Test of Time award, so we're very happy about that, because it influenced industry and it influenced the thinking as well; we will discuss that later on.
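To make the benefit concrete, here is a deliberately simplified, back-of-the-envelope model of runahead in Python (not a real simulator, and not the mechanism's actual implementation): a baseline core stalls for the full miss latency on every miss, while a runahead core pre-executes some number of instructions under each miss and turns the misses it finds there into hits. The latency and reach numbers are made up.

```python
MISS_LATENCY = 300    # cycles a memory access takes (assumed)
RUNAHEAD_REACH = 200  # instructions pre-executed under one outstanding miss (assumed)

def baseline_cycles(trace):
    """trace: list of ops; 'M' is a load that misses, '.' is any other instruction."""
    cycles = 0
    for op in trace:
        cycles += 1
        if op == 'M':
            cycles += MISS_LATENCY        # stall until the miss returns
    return cycles

def runahead_cycles(trace):
    trace = list(trace)
    cycles = 0
    for i, op in enumerate(trace):
        cycles += 1
        if op == 'M':
            cycles += MISS_LATENCY
            # While this miss is outstanding, runahead pre-executes the next
            # RUNAHEAD_REACH instructions and prefetches any misses it finds,
            # so those later misses overlap with this one and become hits.
            for j in range(i + 1, min(i + 1 + RUNAHEAD_REACH, len(trace))):
                if trace[j] == 'M':
                    trace[j] = '.'
    return cycles

trace = list("................M................M................")
print("baseline:", baseline_cycles(trace))   # two serialized misses
print("runahead:", runahead_cycles(trace))   # the second miss is prefetched
```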
But now I'm going to deconstruct it: even runahead execution, as a latency tolerance technique, does not fundamentally reduce latency, right? I gave you one more mechanism that really tolerates the latency. So now we're going to talk about how to fundamentally reduce latency, and I think this is really necessary, because you cannot keep tolerating latency and keep increasing the latencies — in the end, there's no end to this, right? We should really start from a different point. We talked about thinking like a ten-year-old: if you gave this to a ten-year-old, the ten-year-old would say, maybe start with a low-latency system and then optimize it. That's what we want to try to do.

So basically, where is the inefficiency coming from? Where is the overhead in latency coming from? I think there are at least two reasons — there could be more, but we're going to cover at least two. One is that modern memory is not designed for low latency, especially DRAM, large memory. I'm not talking about small memories like SRAM — those are somewhat designed for low latency, but they're also small, and once they become larger they also become longer latency. Still, because they're expensive — costly in terms of how much area you need to store one bit — they have lower latency; but they're too expensive, and we want large capacity and low latency at the same time. So essentially, modern DRAM is not designed for low latency; the main design focus is cost per bit and capacity, as we have seen. If you want to solve this first problem, we need to rethink the DRAM microarchitecture, perhaps. The second source of inefficiency is maybe easier to handle: even though you don't design DRAM for low latency, you determine some latency parameters, and those latency parameters are worst case — based on worst-case operating conditions and worst-case devices, to maximize yield in DRAM chips. Maybe that's not the right thing to do; much of memory latency is actually unnecessary if you take this into account, and we're going to show that with real DRAM chips as well. So basically, the goal is to reduce memory latency at the source of the problem, and all of these ideas are applicable to SSDs and other large-capacity memories as well, but we're going to look at it from the perspective of DRAM.

So how do you truly reduce memory latency? These are the two reasons I already showed you, and we're going to look at how to address them. We'll tackle reason one first: how do you design the DRAM microarchitecture for lower latency? As opposed to maximizing capacity per area, we're going to make latency a design consideration as well — maybe we spend a little more area, but hopefully get much lower latency. It's good to explore this spectrum, or continuum, of designs between latency and capacity. Then we're going to tackle the one-size-fits-all approach to latency specification: today we use standard latency parameters for all temperatures, all DRAM chips, all parts of a DRAM chip, all voltage levels, and also all application data. We are actually going to talk about the application data as well — you can do something there too: for example, if an application can tolerate some errors, you can reduce the latency to match the error rate that the application can tolerate for that part of its data. We're going to exploit all of these. Does that sound good? Hopefully that's interesting, and as I said, these ideas are fundamental — there's nothing that makes them applicable only to DRAM. There are works that have applied them to SSDs, and to emerging memory technologies like phase-change memory and STT-MRAM; there are other works that I'm not going to cover which are still interesting. Okay, let's look at the DRAM microarchitecture — but
before we go into it, let me give a brief overview of what's inside a DRAM chip. We kind of talked about this, but I want to give you a ground-up view. If you look at a DRAM module, each chip looks kind of like this — this is a die photo of a chip — and there are many goals in the design of a chip; usually cost and density are the top goals, and latency comes near the bottom, let's say. People have actually produced low-latency DRAM, which we're going to talk about, but it's not a major part of the market, as you can see — it's very expensive today. You can perhaps buy some of those parts, but not easily; some of these low-latency DRAMs are used in routers, for example, which need to do very fast routing. Okay, so this is the DRAM chip; you can see a bank over here, and this is a cartoonish overlay — clearly a bank has many more subarrays than two, but we're going to look at some of these subarrays.

You know how a sense amplifier operates by now: it's a cross-coupled inverter pair, and there's an enable signal over here. This is the logical-one state — the bitline at the top is at VDD and the bottom is at zero — and the opposite is the logical-zero state; there are two stable states. What the sense amplifier does, as we have discussed, is that when you enable it, it checks whether the voltage level above is larger than the voltage level below; if that's the case, the voltage at the top is driven to VDD and the voltage at the bottom is driven to zero. That's the sensing process, essentially; we could go to a lower level, but there's no need right now. And this is the capacitor you've seen: this is the empty state and this is the fully charged state — these are the ideal states — one is logical zero and one is logical one, as you can see. The capacitor itself is small, so it cannot drive circuits, and reading the capacitor destroys its state; that's why you need to couple it with a sense amplifier to be able to read it. So this is what we want: when you see the empty capacitor and couple it with a sense amplifier, you want to get zero out, and when you see the full charge level and couple it with a sense amplifier, you want to get VDD out.

This is how things are connected, as you know already: you have an access transistor, and initially the bitline and the other side of the bitline are at a reference voltage, VDD/2 — that's the precharged state. When you connect the capacitor to the bitline through the access transistor, the capacitor's charge perturbs the voltage on the bitline, and that small perturbation gets sensed when the sense amplifier is enabled: the sense amplifier drives the top part of the bitline to VDD and the other side to zero. Makes sense, right? We've seen all of this, but this is just to make sure everyone is on the same page. Then, in turn, the charge level that is driven to VDD gets restored into the capacitor — that's the restoration process, called charge restoration — and that's how the destructive read is overcome: whenever you read the capacitor you destroy the value, but after sensing you restore the charge, so the destruction of the value is undone.
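As a rough illustration of the perturbation described above, here is a quick charge-sharing calculation; the capacitance and voltage values are assumptions chosen only to show the direction and rough size of the bitline deviation that the sense amplifier then amplifies.

```python
VDD    = 1.2       # volts (assumed)
C_CELL = 20e-15    # farads, storage capacitor (assumed)
C_BL   = 100e-15   # farads, bitline capacitance (assumed)

def bitline_after_charge_sharing(cell_voltage: float) -> float:
    """Bitline voltage after connecting a cell to a bitline precharged to VDD/2."""
    return (C_CELL * cell_voltage + C_BL * VDD / 2) / (C_CELL + C_BL)

for state, v_cell in (("logical 1 (full charge)", VDD), ("logical 0 (empty)", 0.0)):
    v_bl = bitline_after_charge_sharing(v_cell)
    print(f"{state}: bitline moves from {VDD/2:.3f} V to {v_bl:.3f} V "
          f"(delta {v_bl - VDD/2:+.3f} V)")
```

The sign of that small delta is what the cross-coupled inverters amplify to full VDD or zero, after which restoration writes the full value back into the cell.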
Okay, so now you can build larger blocks. A DRAM subarray consists of many of these: this is one single bitline, but you attach a lot of cells to the same sense amplifier. Normally only one of them is enabled at a time — though multiple of them can be enabled, as we discussed in the memory-centric computing lectures, and as Ismail will tell you maybe in a future lecture, he's trying to enable even more. In existing operation, this is what a subarray looks like: a small 6-by-4 example, 24 bits. These are the sense amplifiers, with an enable signal, and these are the wordlines; there's a row decoder to decide which wordline should be activated when you get a row address. A row could be 8 kilobits in size, for example — it depends on the size of the row buffer and on the design. These are two different cell rows that belong to the same subarray, and there are local row decoders: one local row decoder for each subarray, and different row decoders for different subarrays — imagine you can have 128 or 256 of these. Of course there needs to be another decoder that decides which subarray the access should be routed to, so it's a hierarchical decoder, basically. Then there's circuitry that takes a small chunk — a column — of the data being read from a subarray and sends it out of the DRAM chip; that's called the bank I/O. We're not going to talk a lot about that, but you can read papers on the topic. So the address gets routed: the row address goes to the row decoders, the column address goes to the column decoders over here, and eventually you get the data. The memory channel aggregates bits from different chips to get 64 bits, for example, and there's internal prefetching going on in a DRAM chip that I'm not going to talk about right now — there are a lot of interesting things happening inside the DRAM chip. But this is what the memory chip interface looks like: assuming the channel to this chip is 8 bits, you get 8 bits with a given access, and there's a shared internal bus, which we saw in prior works — RowClone, for example, took advantage of that shared internal bus.

Okay, so how does it operate? This is going to be a quick review. First, we activate the row — that's part of the latency, right? You supply the row address and drive the wordline; the bitlines get connected to the cells, all of them get sensed, and the data gets latched into the sense amplifiers, the cross-coupled inverters. That's the activation process. Then, if you're targeting one column, you read or write that column, which means you supply the column address; that column gets transferred through the global bitlines to the bank I/O circuitry and then out of the DRAM chip — so there are read and write latencies as well, clearly. And then, if you want to activate some other row — in the same subarray or a different subarray of the same bank — you need to precharge the bitlines, or precharge the bank, which brings the bitlines back to VDD/2, and that takes time too. So these are the three major time-takers, if you will, when you do a DRAM access; you can target reducing the latency of every single one of them, and we're going to target all of them. Does that sound good? Yes — that's right.
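A simple way to see how these three time-takers combine is to model the latency of a single read as a function of the bank state; the timing values below are placeholders, not datasheet numbers.

```python
tRCD = 15   # ns, ACTIVATE -> READ (row activation)        (assumed)
tCL  = 15   # ns, READ -> data out                          (assumed)
tRP  = 15   # ns, PRECHARGE before the next ACTIVATE        (assumed)

def access_latency(row_hit: bool, bank_open: bool) -> int:
    """Latency of one read, depending on the state of the bank."""
    if row_hit:                       # wanted row is already latched in the row buffer
        return tCL
    if not bank_open:                 # bank already precharged: ACTIVATE then READ
        return tRCD + tCL
    return tRP + tRCD + tCL           # row conflict: PRECHARGE, ACTIVATE, READ

print("row hit     :", access_latency(True,  True),  "ns")
print("closed bank :", access_latency(False, False), "ns")
print("row conflict:", access_latency(False, True),  "ns")
```

The later mechanisms in this lecture attack each of the three terms: activation, read/write transfer, and precharge.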
Okay, so if you want to learn more about DRAM operation, there's actually a nice section that we wrote with my former PhD student Seshadri, who did a lot of these works — RowClone, Ambit, etc.

Yes? The question is about the global bitlines. Well, they're global bitlines that are not shown over here. We called this structure the global row buffer in our earlier papers, but I try to refrain from that terminology because it's not really a row buffer — it's much smaller. Basically, the issue is this: you have local bitlines, you have 8 kilobytes of data in a row, let's say, with very short bitlines, and you can put many of them together; but if you want to take a small piece of it out — or all of it out — it's very hard to drive very thin wires, so global bitlines need to be wide. Fundamentally, you can take out only a small portion of the local row buffer at a time; those global bitlines are thick and require a lot more power, so there are fewer of them. Hopefully that makes sense. In some of the earlier papers you will see the "global row buffer" terminology, which is a bit confusing — it should really be the global I/O. Do you agree? Okay, you're looking skeptical — okay, let's discuss a better name. It's good to think critically about this, too: the goal is not to just accept that this is the case; the question is always, can you do better? Maybe whoever is designing Sectored DRAM will come up with a much better DRAM soon, even before Sectored DRAM is presented.

Okay, so now that I've given you the basics, let's take a look at how the DRAM microarchitecture is designed, and I'm going to give you one of the earliest ideas we developed here: Tiered-Latency DRAM. The basic question is: what causes the long latency? Clearly there's a lot happening in DRAM: you have the cell array I've shown you, organized into subarrays, and there's the I/O that connects to the outside world. So there's a subarray access latency and an I/O access latency, and DRAM latency is really a combination of both. We argue in this work that the subarray access latency is dominant, because you can tolerate a lot of the I/O access latency, and there are methods to reduce the I/O access latency — prefetching, etc. So let's look at the subarray first: why is a subarray slow?

This is another picture of a subarray: we have the row decoder, the sense amplifiers, and the subarray itself — this is the cell. Normally, the sense amplifier needs to be large: it has a lot of circuitry to sense things — it's at least a cross-coupled inverter pair, but there's also other circuitry to do better sensing, better restoration, better precharging — whereas the cell itself can be very small. So what DRAM manufacturers do, for cost or capacity purposes, is avoid putting in a lot of sense amplifiers: they want a long bitline, say 512 or 1024 cells per bitline. This amortizes the sense amplifier cost across many cells and leads to a relatively smaller area. You could also put 64,000 cells on the same bitline, but then that interconnect becomes overly loaded and its reliability suffers, so there's a sweet spot in terms of how many cells you put on a bitline, and usually
that sweet spot is around 512 or 1024 — it can vary. But of course this is not good for latency, right? If you really want low latency, you want a short bitline: a long bitline has a lot of capacitance, and that leads to high latency and power. If you wanted to get rid of that, you could have one sense amplifier per cell, meaning a single bitline has only one cell on it — but that sounds terrible for cost. So now you can see the trade-off between latency and capacity, and this is a very fundamental trade-off; it manifests itself essentially everywhere. There's always a trade-off between latency and capacity: if you want to fill this room with many people, it takes longer — that's a latency-capacity trade-off, right?

Yes? The question is what the main cost of having a longer bitline actually is. Basically, it's capacitance, and that capacitance affects the precharge time; it also affects the reading time — whenever you're sharing charge and sensing it, it affects the sensing time. It affects everything: sensing, restoration, as well as precharge. A large capacitance is not good for latency — it's the RC delay you get — and it's also not good for power: dynamic power is essentially activity factor times capacitance times voltage squared, and if your capacitance is larger, you clearly burn more power. If your bitline is longer, you may also need to operate at a higher voltage, so power can increase in an almost cubic manner with a longer bitline if you don't keep things under control — you potentially need to increase both the capacitance and the voltage. Does that make sense?

Okay, so this is the trade-off: area versus latency. If you want to optimize for area, you want long bitlines like this; if you want to make it faster, you chop the bitlines into smaller pieces and have smaller subarrays — shorter bitlines, but many more sense amplifiers — and clearly that's more costly. This is a nice trade-off; these are beautiful pictures from Donghyuk, who did his PhD thesis on this topic. So let's take a look at this trade-off. If you do some estimations — not necessarily detailed simulations — this is the estimate we get. Ideally, you would like cheaper and faster: you'd like to be at this operating point over here, right? This is normalized to commodity DRAM, which we assume makes the trade-off at 512 cells per bitline, optimized for area. Now, if you reduce the latency by having smaller subarrays, you can get reduced latency — clearly you can reduce it a lot, maybe even further depending on what you do. We labeled this point "fancy DRAM"; these are short-bitline designs. There's actually FCRAM — which sounds like "fancy DRAM" but stands for Fast Cycle DRAM — one of the low-latency DRAMs, and there's also RLDRAM, Reduced Latency DRAM; I think that one was from Micron. You could buy these things, except they're extremely expensive, for two reasons. One is that the area cost is much higher — and maybe we underestimated the area; I should say these things are not easy to estimate all the time — but you get the idea, right? As you go from here to there, the cost increases a lot.
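Here is a crude scaling sketch of the trade-off just discussed: more cells per bitline amortize the sense amplifier area, but they increase bitline capacitance, which grows both the RC-limited latency and the dynamic power (activity factor × C × V² × f). All constants are illustrative assumptions, not circuit data.

```python
ALPHA, V, F = 0.1, 1.2, 1e9      # activity factor, volts, switching frequency (assumed)
C_PER_CELL  = 1e-15              # added bitline capacitance per attached cell (assumed)
SA_AREA     = 20                 # sense-amp area, in cell-sized units (assumed)

for cells_per_bitline in (32, 128, 512, 1024):
    c_bitline   = cells_per_bitline * C_PER_CELL
    rel_latency = c_bitline / (512 * C_PER_CELL)     # RC delay ~ C, normalized to 512 cells
    power       = ALPHA * c_bitline * V**2 * F       # dynamic power of one bitline swing
    area_per_cell = 1 + SA_AREA / cells_per_bitline  # amortized sense-amp cost per cell
    print(f"{cells_per_bitline:5d} cells/bitline: latency x{rel_latency:.2f}, "
          f"power {power*1e6:.2f} uW, area/cell {area_per_cell:.2f}")
```

The 512-cell point roughly corresponds to the commodity design point in the talk, and the 32-cell point to the expensive short-bitline ("fancy") designs.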
There's also another reason why these parts are not cheap today: they're not commodity, meaning they're manufactured at low volume, and as a result you get more expensive DRAM. You may go and buy this DRAM, but you have to pay a lot for it — that's why the people who really, really want low latency today and actually buy these are building, for example, very fast network routers.

So this is our goal, basically: can we break this trade-off? Can we have some low-latency part inside DRAM? And this is the idea: if you want to get the best of both worlds — or at least approximate the best of both worlds — sometimes you need to think about heterogeneity, because you really want two different things, you want to optimize for two different metrics, and heterogeneous designs are usually good for this purpose. So basically: a long bitline leads to small area but high latency; a short bitline leads to low latency but large area. We would like small area, ideally, but we also want low latency, and there's no magic, unfortunately. So what you want to do is create a low-latency region — enable an area inside the DRAM subarray that gives you low latency. That's the idea here. It's not perfect, but maybe there's no perfect thing — and maybe the perfect thing is going to be your idea; maybe next year there'll be a much better version of this. But this is the idea, and I think it's a clever one: you get high capacity — the same capacity as the long-bitline design, at slightly higher cost — and you get the low latency of the short-bitline design when you're accessing only the near part. But then you need to isolate the two parts, the high-capacity part from the low-latency part, and to isolate them we add isolation transistors. When you're accessing the lower (near) part, you turn off the isolation transistor so that the bitline looks short to the sense amplifier — the capacitance is low. When you're accessing the top (far) part, you turn on the isolation transistors, and then the bitline is still long. So if you want to take advantage of this, you'd better put as much of your frequently accessed data as possible into this near area, right?

Okay, so that's the idea — I think I've already said everything I'm going to say here: small area by keeping the long bitline overall, and low latency when you turn off the isolation transistors, because the loading on the bitline is low at that point. When you turn on the isolation transistors, unfortunately, the latency is a bit higher, because now you also have to go through the capacitance and the resistance of the isolation transistors.
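A minimal sketch of why the isolation transistor creates two latency tiers, under assumed numbers: a near-segment access turns the transistor off, so the sense amplifier sees only the lightly loaded short bitline; a far-segment access turns it on, adding the full bitline plus the transistor's own loading.

```python
CELLS_TOTAL = 512          # cells per full bitline (assumed)
CELLS_NEAR  = 32           # cells in the near segment (as in the talk)
C_PER_CELL  = 1.0          # relative capacitance contributed per attached cell
C_ISO       = 10.0         # extra loading of the isolation transistor when on (assumed)

def relative_sensing_load(row_in_near_segment: bool) -> float:
    """Relative capacitance the sense amplifier must drive for one access."""
    if row_in_near_segment:                    # isolation transistor turned off
        return CELLS_NEAR * C_PER_CELL
    return CELLS_TOTAL * C_PER_CELL + C_ISO    # transistor on: full bitline plus transistor

commodity = CELLS_TOTAL * C_PER_CELL
print("near segment load vs. commodity:", relative_sensing_load(True) / commodity)
print("far  segment load vs. commodity:", relative_sensing_load(False) / commodity)
```

Since sensing, restoration, and precharge times all scale with this loading, the near segment gets faster than commodity DRAM while the far segment gets slightly slower — which is exactly the asymmetry in the results that follow.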
So this is what it looks like. This is, at the time, commodity DRAM, and this is the row-cycling latency: the near segment reduces latency by 56% — this bar is wrong, it should really be somewhere over here; I still haven't corrected it, there's a version of this talk where I corrected it but I cannot find that talk [Laughter] — but basically you reduce the latency of the near segment by 56%, assuming you have 32 rows in the low-latency (near) segment, while the far segment's latency increases by about 20%. This is the power: power also reduces significantly in the near segment, but it increases when you're accessing the far segment. The area overhead is noticeable — it's not tiny, but it's not huge; it's not as large as adding more row buffers — and the isolation transistors still need to be reliable, so we estimate about a 3% area overhead in the DRAM.

So this is how you can break the trade-off: our goal is this point, and if you can come up with an architecture that actually puts all of the cells there, that's great — but magic is not allowed, and it's very tough to truly break the trade-off; that's what I'm trying to get at. Basically, the near segment brings us off of that Pareto curve, as you can see, but the far segment puts us a little bit farther away, still assuming 512 cells per bitline.

Okay, so how do you take advantage of it? We envision this as a substrate that can be leveraged by both hardware and software, or either of them, and there are many potential use cases that we showcase in the paper. For example, you could manage the near segment as a hardware-managed (memory-controller-managed) inclusive cache to the far segment: the far segment is really your main memory, and the near segment is an inclusive cache — it caches part of your main memory. Or it could be an exclusive cache: both segments are part of your main memory, but then you need to swap data between them, because exclusive means data is not replicated between main memory and the cache. Or you could do profile-based page mapping by the operating system: everything is main memory, and the operating system decides what goes where. This is also interesting — now the operating system needs to be aware of banks, and of the near and far segments. I think these are very interesting ideas. We were talking yesterday, for example, about not allocating pages — or not refreshing pages that are not allocated. I think we really need to rethink the operating system–architecture interface going forward, and this sort of architecture enables some of that rethinking. That interface is one of the most important ones, and it's very rigid today, even though the operating system and the architecture could be co-designed really nicely. I believe we're doing a very poor job of this in our systems, and we're going to talk about virtual memory later on — virtual memory is one example of that. What's happening — am I not doing well? Oh, we have a question — okay, let me finish the slide and then you can interrupt. Interrupts, by the way, are another thing we need to rethink — fundamentally a very single-core concept, very weird — but okay, I don't want to get started.

You could also simply replace DRAM with TL-DRAM and test your luck: today your memory's properties look like this; replace it with a memory whose properties look like that, and hope that you get the benefits of the near segment. This is hard to make work, because some data will end up in the near segment and some data in the far segment, and if your really latency-critical or frequently accessed data is in the far segment, you're in trouble. So just magically replacing DRAM with TL-DRAM didn't work in our experiments.

Yes — the question is what fraction of the data goes to the near segment and what fraction to the far segment. Very good question; basically it's a trade-off, and we will see that. Are you going to interrupt me or not? So the question is: can we put a sense amplifier on both ends of the subarray and optimize the data path based on the location of the data? I think we're going to see something like that later on — hold on to that; maybe you'll ask it again later. But yes, okay, it's an interesting question.
Okay, so I'm going to show you some results with the inclusive-cache use case. This is what we're going to look at: you have the far segment, you have the near segment, and clearly you have the row buffer, the sense amplifiers. The far segment is going to be our main memory, and the near segment is a cache — so we lose some main memory capacity; that's not necessarily the best way of doing it, but we'll see some performance improvements anyway. So there are two questions. One is how you efficiently migrate a row between segments — but you've already seen something like this: RowClone, remember? This paper takes advantage of a RowClone-like mechanism. The second is how you efficiently manage the cache, meaning what you cache in the near segment, and that's an area where you can develop a lot of mechanisms — people have developed many caching mechanisms, which we're probably not going to cover in this course (we covered some of them in Digital Design and Computer Architecture). You could apply previously developed caching mechanisms, but you could also be a bit more intelligent and recognize that this caching happens at a very large granularity, like 8 kilobytes or so, and design better caching mechanisms for that. In this case we didn't design better ones — we tried, but we actually just used a very simple LRU-based mechanism, and we'll see that it already provides good performance.

So how do you do the inter-segment migration? It's basically RowClone — I'm not going to go through this again; we've talked about it. You activate the source row, which brings the data into the sense amplifiers, and then activate the destination row in quick succession, which brings the data into the near segment. The source and destination just need to be in the far segment and the near segment, respectively, if that's the direction you want, and this is low latency. How do you manage the cache? Basically, we do LRU — the paper talks about some different mechanisms — least recently used: the eviction policy is to evict the least recently used row in the near segment. I believe there are much better mechanisms, but even this one gives good performance: across the workloads — I believe there are 80 workloads or so — we get significant performance improvements, about 10%. There could be more, but you run into bandwidth constraints; I guess you tolerate things better if you increase the number of channels, etc. This also reduces power consumption, and that's also significant — in fact, the power reduction is perhaps even more significant. I think this sort of idea is very interesting because it's another idea that reduces power and reduces latency at the same time, and improves performance; as I said earlier, we should really be shooting for those ideas. If this were a trade-off between performance and power, it would not be as interesting, I think. Okay.
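Here is a small sketch, under assumed latencies, of the inclusive-cache use case just described: the near segment holds copies of recently used far-segment rows, migration is modeled as a RowClone-style in-DRAM copy, and eviction is plain LRU. The class name and all numbers are mine, for illustration only.

```python
from collections import OrderedDict

NEAR_LATENCY  = 22    # ns, access served from the near segment (assumed)
FAR_LATENCY   = 60    # ns, access served from the far segment (assumed)
MIGRATE_COST  = 90    # ns, RowClone-style far -> near row copy (assumed)
NEAR_CAPACITY = 32    # rows in the near segment (the 32-row configuration from the talk)

class NearSegmentCache:
    """Near segment managed as an inclusive, LRU cache of far-segment rows."""
    def __init__(self):
        self.rows = OrderedDict()              # far-row id -> True, kept in LRU order

    def access(self, row: int) -> float:
        if row in self.rows:                   # near-segment hit
            self.rows.move_to_end(row)
            return NEAR_LATENCY
        latency = FAR_LATENCY + MIGRATE_COST   # miss: serve from far, then cache the row
        if len(self.rows) >= NEAR_CAPACITY:
            self.rows.popitem(last=False)      # evict the least recently used row
        self.rows[row] = True
        return latency

cache = NearSegmentCache()
trace = [1, 2, 1, 1, 3, 2, 1] * 10             # a reuse-heavy row access pattern
print("avg latency:", sum(cache.access(r) for r in trace) / len(trace), "ns")
```

With any reuse at row granularity, most accesses end up in the near segment, which is where the roughly 10% performance and larger power improvements come from.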
Now, here's the answer to your earlier question. This is some data showing what happens — with the same LRU management policy and the same workloads — to the performance improvement over the baseline as you vary the number of rows in the near segment; the assumption is that the near segment plus the far segment together still hold all 512 rows. Clearly, as you go from 1 to 256 rows, you get larger cache capacity but also higher cache access latency, and higher power, so the benefits decrease after some point. We found the sweet spot at 32 rows, but even one row is not bad, as you can see — that's actually interesting, because even one row gives you some ability to tolerate row buffer conflicts: if you have a bank conflict, you don't always incur the very high latency of accessing the far segment; you can serve some accesses from the near segment. It's good to think about this. But you cannot configure this dynamically, as far as I know — if you can come up with a dynamic architecture without a lot of cost, that could be interesting too; Nisa is thinking about that now. Okay, so this is the paper — I still like this paper a lot — and then we tried to make it a little bit better, as you will see later on. Any questions?

Yes — we're going to talk about that; there are later papers that address it. Basically, yes, this doesn't work very well with the open-bitline architecture, and the LISA paper covers that, but I'm not going to cover it in detail — I'm going to cover something else that I think answers your question in a slightly different way. Very good question; I don't want to go into those details right now. Yes, exactly — basically you have sense amplifiers on both sides, and you do the sensing in different ways; we'll get to that when we talk about CLR-DRAM, so I'm saving it for that part — when you read the papers you will see it better. There's a lot of intricacy in DRAM — it has been optimized for many years — so some ideas that work for some architectures may not work for others, or you need to engineer things better to make them work. This idea, off the bat, may not work in one setting, but some other idea may.

Okay, we talked about LISA before, so I'm not going to go through it in detail, but I'm going to show you some of the benefits you get. If you remember, LISA was targeting inter-subarray data movement within a single bank, because there's not enough connectivity between two subarrays: if you want to copy data from one subarray to another, you have to go through the internal data bus. I don't know what happened here — is this my fault or the computer's fault? I thought I fixed this slide. Okay, maybe it's the computer's fault — while trying to get rid of the bad slide I made a copy of the bad slide; redundancy. Okay, now the bad slide is gone — okay, now we have a lot of bad slides — okay. The basic idea of LISA is to have isolation transistors between two subarrays: you connect the row buffer — the sensing and precharge circuitry — of this subarray to the bitlines of the other subarray, and you control that connection with the isolation transistors. When the isolation transistors are turned on, the bitlines of the two subarrays are essentially connected. We've seen that before, and our goal was to really enable movement from one subarray to
another subarray, and we have seen the benefits of this. Basically, we have this new command called row buffer movement (RBM), and the idea is to move a row of data in an activated row buffer to the next subarray. Assuming the data on these bitlines is all VDD, you enable the isolation transistors using this RBM command, which means charge sharing happens between the top bitline and the lower part of the bitline — it's now one extended bitline — and based on that charge sharing, the entire length of the connected bitlines becomes VDD, because the sense amplifier also amplifies the charge. That's how you can transfer the value.

This actually has more benefit than it might appear, and I think we had a question in the previous lecture about how many subarrays you connect using these isolation transistors. We have a lot of analysis in the paper on this, and we decided to connect at most two with a single row buffer movement command; if you want to go farther, you issue another RBM command. So you can move data from subarray one to subarray three with a single command, and if you want to move it from subarray three to subarray 125, you need to do many RBM commands. The paper has the analysis of why we decided it that way. But basically, this gives you a lot of internal bandwidth: the external bandwidth coming out of DDR4 is relatively low — it's been increasing — but now you can get more than 25x that bandwidth internally for this data movement. That's the beauty of adding interconnect inside DRAM, and this is something I think we should be doing better, because one of the problems with both moving data inside DRAM and doing computation inside DRAM is the low connectivity: if you have low connectivity, it's very hard to do data movement, which is really needed both for data copy and for communication across different units that might do computation inside DRAM. In a sense, DRAM needs to be rethought for this purpose: the original purpose of DRAM was to read data out, and if your goal is only to read data, maybe you don't need these interconnects; but if your goal is to copy data, or to move data when computation is done in one subarray and the result is needed in another, you need better interconnects. There is of course a DRAM chip area overhead, and you can read the paper for that.

So this enables multiple things that reduce latency. Clearly, we've seen copy: you can reduce the copying overhead — essentially the same thing as RowClone, but now the clone happens across subarrays, which is good — and we see significant reductions in row copy latency and row copy energy.
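A tiny sketch of the copy cost under the hop rule just described (one row buffer movement command spans a pair of linked subarrays, so longer moves chain multiple commands). The per-hop latency and the channel-copy number are assumptions, not the paper's measurements.

```python
RBM_LATENCY  = 8     # ns per row-buffer-movement hop (assumed)
CHANNEL_COPY = 1000  # ns to copy an 8 KB row over the memory channel instead (assumed)

def lisa_copy_latency(src_subarray: int, dst_subarray: int, hop_span: int = 2) -> float:
    """Copy a row between subarrays of one bank using repeated RBM hops."""
    distance = abs(dst_subarray - src_subarray)
    hops = -(-distance // hop_span)           # ceiling division: each hop covers 'hop_span'
    return hops * RBM_LATENCY

print("subarray 1 -> 3  :", lisa_copy_latency(1, 3), "ns")
print("subarray 3 -> 125:", lisa_copy_latency(3, 125), "ns")
print("over the channel :", CHANNEL_COPY, "ns")
```

Even a many-hop move stays far cheaper than dragging the row out over the external channel and back in, which is where the large internal-bandwidth and copy-energy gains come from.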
But there's also another benefit to this, which is kind of similar to Tiered-Latency DRAM without some of its disadvantages, and without even getting into the open-bitline issue. Essentially, we have long bitlines and short bitlines — we've seen this trade-off: short bitlines are good for latency, but they come with a high area overhead. The key idea of using the LISA substrate for caching is that you can add a fast subarray as a cache in each bank. So, as opposed to partitioning a single subarray like Tiered-Latency DRAM did — TL-DRAM partitioned a single subarray into a fast portion and a slow portion — we're going to do something different: we keep the slow subarray unchanged, which is good for capacity, and we add a fast subarray next to it, whose role is caching. That sounds good, right? You can still employ different kinds of caching policies — exclusive, inclusive, operating-system-mapped — and the good news is that even if you just use this and change nothing else, including the caching policy, you will still get some benefit, because some data "magically" ends up in the fast subarray. So we're overcoming a disadvantage of Tiered-Latency DRAM — actually multiple disadvantages. The reason you can do this well is that you have isolation transistors between the slow subarray and the fast subarray, so you can take data from the slow subarray and move it quickly to the fast one. Even without that — with just a heterogeneous design that looks like this — you can still gain performance, but that's the idea of the other paper, the Son et al. ISCA paper; that's not really the contribution of LISA. The contribution of LISA is the ability to move data quickly between the slow subarray and the fast subarray.

Yes? The question is whether we expose the fact that there are fast subarrays to the programmer — the programmer knows address X, and X was in the slow subarray and was then moved to the fast subarray. Normally the programmer doesn't get exposed to anything like this, right? But if the programmer somehow knows which data is latency-critical, or which data is frequently accessed and could benefit from caching, they could use hints — assuming they're available, they could use pragmas in the programming language to say "I expect this to be latency-critical" — and those get communicated to the hardware architecture. That's not what we discuss in the paper; what we do in the paper is caching that is transparent to the programmer. But I think if the language provides mechanisms to the programmer, and the programmer really understands what they're doing and can correctly identify what is latency-critical, and that information is communicated all the way from the language to the hardware, then the substrate can certainly take advantage of it.

Yes? The question is whether the fast subarray has a separate address space. That's a design choice, basically — that's what I meant earlier: it could be exclusive, meaning addresses are not replicated, or inclusive, meaning addresses are replicated and the fast subarray is just a cache; but the fast subarray can also be part of main memory. It's a very similar design choice to TL-DRAM, except this overcomes a lot of TL-DRAM's disadvantages.

Yes? Yeah, that's a very good thing to think about — isn't this just another cache? There are multiple answers. One is this: if you're using it as main memory, it's not just another cache; part of your main memory is simply much faster now. Imagine you did nothing else and 30% of your main memory is fast subarrays: 30% of your data is accessed faster — think about it that way. Right now, all of your main memory is slow; we're making 30%, or
whatever fraction you want to add and can pay for, "magically" faster — it's not magic, of course, but you see the difference, right? If this is used as main memory, it has nothing to do with caching: your main memory's latency is fundamentally reduced. Now, if you want to take full advantage of that, yes, you're going to exploit locality — but you already need to do that when you do page replacement, for example. Think about it: all of this locality management is currently done in the bigger part of the memory hierarchy. If your memory is not large enough, you also need to decide which pages to evict to the SSD, so you're already doing page replacement and locality management for your main memory; we just want to do it a little bit better. Then there's a second answer, which I think is very interesting: this is a different kind of locality exploitation than adding caches on the processor side. On the processor side, you don't have caches with such large blocks — the blocks are 64 bytes, 128 bytes — so that locality is at a very small granularity. This is a different kind of locality, large-block locality, which many people don't really deal with when they add caches. So it's good to think about: the mechanisms to exploit the locality here need to be a bit more tailored to these large blocks, and that's also why this caching substrate benefits from such tailoring.

Okay, so I think I've already given you the idea — but these are very good questions, keep them coming. Essentially, by adding isolation transistors we can quickly move data from the slow subarray to the fast subarray, and we found that this reduces hot-data access latency by 2.2x, which is significant compared to not reducing it at all. If you're interested in this, I would recommend reading not only the TL-DRAM paper, which is a precursor to this, but also that other paper, because it's interesting: it was at ISCA 2013, and it basically proposed this heterogeneous-subarray architecture, but it didn't have these connections — meaning if you wanted to really exploit it, you had to do the data movement through software in some way, which was very expensive, or you had to get a bit lucky. This fast data movement capability between subarrays actually enables more things. One more thing it enables is precharge latency reduction: in conventional DRAM, you precharge the bitlines from one side, and the precharge time is limited by the strength of one precharge unit. With LISA we can do linked precharging: because the precharge units of the top subarray can be connected to the precharge units of the bottom subarray when you enable the isolation transistors, you can use two precharge units to precharge a single bitline. We call this linked precharging, and we see a significant reduction in precharge latency, about 2.6x. This is very interesting too: we increased connectivity and got additional benefits from it — and this one works with the open-bitline architecture as well, without going into those details at all. Okay, so there's more on LISA,
and again, apart from the specific mechanisms this paper proposes, maybe the even larger benefit is really rethinking the DRAM architecture: we're no longer thinking of DRAM as something designed purely for reading — it's really about connectivity — and I think we need to do more work in this area to enable better connectivity. Okay, let's see how much time we have — maybe we should take a break. Let's take a break until 15:28, and then we'll continue our exploration of different ideas.

Is it time? Did we say 15:28 or 29? I said 28 — this is important for predictable performance — but it says 29 now, so we can get started. Okay, I'm going to cover some more ideas that I think are interesting, and I think this field is really important for developing new, creative, innovative ideas — let's take a look at some of them. We saw how LISA overcomes some of the disadvantages of TL-DRAM; this next work actually makes the TL-DRAM approach a little more practical, as you will see, and then we'll see another work that's also interesting. This work proposes a very simple idea that can be used for multiple purposes: to reduce access latency, refresh overhead, and also RowHammer vulnerability a bit. We already looked at conventional DRAM; the idea of copy-row DRAM is to partition the subarray into two types of rows — regular rows and copy rows — without isolation transistors between them, so we get rid of the isolation transistor cost. This picture has a mind of its own — I should have turned off the animations — but basically you have two parts: regular rows with a regular row decoder, and copy rows with a copy-row decoder. The idea is that the copy rows can store a copy of any of the regular rows, and you have two separate decoders, without isolation transistors. So how do you get reduced latency? We'll see. If you identify that some row is frequently accessed, you can do a row copy operation and copy it to — I don't know what's going on; we should really disable these automatic animations. Let me see — I really dislike my presentation having a mind of its own. Anybody want to help me here? Slide show... there... oh, it's not doing that. So how do you turn off an automatic animation? Okay, I see — I'm not going to do that right now; I never set these up in my presentations, this is Hasan's fault, they're his slides, and I've had this problem before. But the basic idea is that you can copy a regular row into one of the copy rows, and you know how to do this: RowClone — we essentially use RowClone. Now, how do you get lower latency for that row? You use multiple-row activation: you activate both copies of the row, and that leads to lower latency, because now you share more charge, which enables faster sensing. That's the idea — a clever one — and you could even have more than two copies, assuming you copy the same row multiple times. Does that make sense? You're really driving the same bitline with two copies of the same value; you can think of it that way. This gets rid of the isolation transistors, but clearly it doesn't have the power benefits of TL-DRAM — it actually consumes more power
because you have multiple row activations. Okay, so there are multiple use cases for this. One is caching, obviously: you could use those copy rows for caching, like I showed you — cache frequently accessed rows, for example. I've already said this, so I won't repeat it; it's very similar to TL-DRAM. You could also use it to reduce the refresh overhead: for example, you profile your DRAM and find a weak cell in one of the regular rows; you can map that row to a strong copy row and always access the strong row. This way you can reduce the refresh overhead in a different way: you remap a weak row to a strong row — strong meaning it can retain data for longer — and then you can reduce the refresh rate. Assuming you can remap all of your weak rows, maybe you can reduce the refresh rate across all of your DRAM. Makes sense? Of course, this requires a modification for remapping — a remapping table for the copy rows. And then there's a mechanism we discuss briefly to protect against RowHammer: the idea there is that if you detect a row being attacked in the regular row space — a row that keeps being activated — you copy it to the copy-row space. That way you limit the damage someone can do to the regular rows, which is also interesting. This is maybe a heavy-handed solution to RowHammer, because you're moving data, but in some cases it has upsides. Does that make sense?

Okay, so without going into more detail, this also provides benefits: if you combine the caching use case with the refresh use case — the paper analyzes both individually and together — you can get significant speedups and DRAM energy improvements, and the energy improvements come from both the lower execution time and getting rid of refreshes. It comes at a cost in hardware overhead: DRAM chip area, because you now have two separate decoders and the multiple-activation circuitry; DRAM capacity loss, because the copy rows are used exclusively for these purposes rather than as part of main memory; and you also need to modify the memory controller — all of these ideas require modifications to both the memory controller and the DRAM, as you can see, and if you're exploiting something like this at the operating system level, then you need to change the operating system as well, as we have discussed. So this is another idea: lower cost, and potentially also more versatile. Any questions? Okay.
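The weak-row remapping use case can be sketched as a small table in the memory controller, as below; the structure and names are my own illustration of the copy-row idea, not the paper's exact design.

```python
class CopyRowRemapper:
    """Redirect accesses from profiled weak regular rows to strong copy rows."""
    def __init__(self, num_copy_rows: int):
        self.free_copy_rows = list(range(num_copy_rows))
        self.remap = {}                          # weak regular row -> copy row

    def retire_weak_row(self, regular_row: int) -> bool:
        """Copy a weak regular row into a strong copy row (in-DRAM copy), if one is free."""
        if regular_row in self.remap or not self.free_copy_rows:
            return False
        self.remap[regular_row] = self.free_copy_rows.pop()
        return True

    def translate(self, regular_row: int):
        """Return (is_copy_row, row_index) for an incoming access."""
        if regular_row in self.remap:
            return True, self.remap[regular_row]
        return False, regular_row

r = CopyRowRemapper(num_copy_rows=8)
r.retire_weak_row(1000)          # retention profiling found row 1000 to be weak
print(r.translate(1000))         # -> access redirected to a strong copy row
print(r.translate(42))           # -> regular row, unchanged
```

Once every profiled weak row is redirected this way, the refresh period can be set by the strong rows' retention time rather than the worst-case weak cell, which is where the refresh savings come from.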
I'm going to discuss one other idea. This was developed by Haocong while he was an intern with us — now he's doing his PhD; he also worked on RowPress, as you probably know, and Ramulator 2; we see him around, and we see him on Zoom. The idea here is very interesting, I think: we want to have some configurability in DRAM in terms of the capacity-latency trade-off. No DRAM really enables this today in a configurable way, and we wanted to have it in a nice, configurable way. Workloads may have varying capacity and latency demands, and the previous architectures we have seen, as well as commodity DRAM, make static capacity-latency trade-offs at design time — this is true for Tiered-Latency DRAM as well: TL-DRAM has a static near segment. Can we change that dynamically? That's a good question. Essentially, our goal was to design a low-cost architecture that can be dynamically configured to have either high capacity or low latency, at the fine granularity of a row. I think we should be thinking more and more in this configurable-memory direction, because it also has implications for processing inside memory, in my opinion. Basically, you should be able to switch a DRAM row from being a high-capacity row to being a (somewhat lower capacity but) low-latency row, and to do that we use the idea of replication: we replicate the same bit into two cells, for example, and then sense both of them. So, as I said, a row can dynamically switch between a high-capacity mode and a high-performance mode, and the idea is to dynamically configure the connections between DRAM cells and sense amplifiers in the open-bitline architecture.

This is where I'll introduce the open-bitline architecture pictorially; it's a little different from what we have seen so far, which was the closed-bitline architecture. This is what a lot of modern DRAM looks like, because cells have become so small and sense amplifiers are comparatively large: in a single subarray, one bitline is sensed by a sense amplifier at the top, and the next (consecutive) bitline is sensed by a sense amplifier at the bottom. It's called open bitline because, as you can see, this part of the bitline is "open" — not connected to anything — and this part of the other bitline is also open. So each bitline is connected to only one sense amplifier, which is a good trade-off for packing a lot of cells and sense amplifiers together, although you still need to do a lot of work to make it function well.

So the idea in CLR-DRAM is to make the connections between the bitlines and the sense amplifiers configurable, and the configurability is enabled by isolation transistors — here we call them bitline mode select transistors — of which there are two types. If you enable only type one, it looks exactly like the open-bitline architecture; if you enable only type two, it's a different open-bitline configuration; and if you enable both type one and type two, that's what we're going to take advantage of. That's the idea. Clearly this comes at an additional area cost, because you don't have just wires here, you have isolation transistors, but that's what gives you dynamic configurability — or even just configurability, if you ignore the dynamic part.

So how do you get the max-capacity mode? It's essentially nothing different from the existing open-bitline architecture: we enable only the type-one isolation transistors, and this mimics exactly what we do today. That's not that interesting, maybe, but you get the same storage capacity — you clearly lose a little density because of the extra isolation transistors, and there are more modifications you need to make, which are in the paper. How you get the high-performance mode is really the innovation and the interesting part: we enable both type one and type two, but in order to take advantage of that, we store the value in one of the cells and the complement of the value in
another one. What this enables is the following: these are called coupled cells, and if you activate them, you can now do better differential sensing through a single sense amplifier — or maybe both sense amplifiers; you could potentially use both. Basically, you can operate these cells, or do the sensing, much faster, because you drive both bitlines with differential values and enable the sense amplifier with that differential, and as a result the sense amplifier senses much faster. That's the idea, and there's a lot of circuit-level evaluation in the paper that you can look at. The sense amplifiers are also coupled, which further strengthens the sensing — it's similar to the precharge coupling we saw in LISA: there we coupled the precharge circuitry; here we couple the sensing circuitry and take advantage of the two complementary values stored next to each other. Of course, the difficulty is how you get the two complementary values stored in the first place — you need to populate these cells — and maybe there needs to be more future work in that area. So this enables reduced latency as well as reduced refresh overhead, with this coupled-cell plus coupled-sense-amplifier operation.

Okay, I'm not going to go into a lot of detail — to actually test this you need to do a lot of circuit simulation, which is in the paper — but these are some results based on those simulations, the values we report in the paper. You can reduce essentially all of the interesting latencies: activation, restoration, precharge, as well as write recovery latency, because having the strong sensing circuitry that takes advantage of coupled cells and coupled sense amplifiers makes essentially everything in the operation faster. Refresh is also affected, clearly, because reducing the latency of activation and precharge reduces the refresh penalty. The paper also evaluates a system-level approach and shows significant benefits. So I believe this is an interesting idea in an interesting direction; there should be more work in this area, and maybe you will come up with different ideas along these lines. It's a different way of thinking: it adds more configurability into the bitlines, and it can be employed in conjunction with other ideas — LISA, for example, is orthogonal; you could have yet another isolation transistor to connect to the next subarray.

Any questions? Yes — the question is about capacity: when you operate a given row in the high-performance (low-latency) mode, you give up capacity. Yes, but that's a very fundamental trade-off; you cannot get away from it. In a sense it's similar to single-level cell versus multi-level cell in flash, which we will see: if you want low latency, single-level cell is unbeatable, because you just sense against a single voltage reference; if you want high capacity, you pack more bits and more voltage levels into the same voltage range, but sensing becomes more complicated, which means it takes extra time. It's very similar, except here we're using two different cells, as opposed to a single cell, to store the same value. Okay, I think there's more work to be done in this area — maybe Alar is thinking of much better ideas now, or maybe others, who knows.
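A toy model of the configuration choice just described: max-capacity mode keeps one cell per bit with baseline timings, while high-performance mode couples cells (value plus complement), trading that row's capacity for reduced timing parameters. The scaling factors and timings are placeholders, not the paper's measured numbers.

```python
BASE     = {"tRCD": 15.0, "tRAS": 35.0, "tRP": 15.0, "tWR": 15.0}   # ns (assumed)
HP_SCALE = {"tRCD": 0.7,  "tRAS": 0.7,  "tRP": 0.8,  "tWR": 0.7}    # assumed reductions

def configure_row(high_performance: bool):
    """Return (timing parameters, usable capacity fraction) for one row."""
    if not high_performance:
        return dict(BASE), 1.0                    # max-capacity mode: baseline timings
    timings = {k: v * HP_SCALE[k] for k, v in BASE.items()}
    return timings, 0.5                           # coupled cells: fewer bits, faster timings

print("max capacity     :", configure_row(False))
print("high performance :", configure_row(True))
```

The interesting part is that the choice can be made per row and changed at run time, which is what distinguishes this from the static tiering of TL-DRAM.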
If there are no questions, I'm going to cover SALP a little bit, because I think it's such a fundamental concept that it should be covered — and it also makes for a good story. This is one of the earliest ideas we developed when we started looking at DRAM; it's called subarray-level parallelism. I've actually already given you the key idea when we talked about refresh-access parallelization, but SALP is much broader than that: it's really access-access parallelization, assuming the two accesses go to different subarrays. Basically, we target bank conflicts: whenever two accesses go to the same bank, today you serialize them. This is bad for performance, it's also bad for energy, it thrashes the row buffer, there's a lot of busy waiting, and so on. So our goal was to reduce bank conflicts without adding more banks — to do this at low cost — and the idea takes advantage of the underlying DRAM microarchitecture: a bank is not a monolithic structure; it really consists of subarrays, as we have seen, and each subarray has its own local row buffer, its own local sense amplifiers. So this is our old picture: logically a bank looks like this, but physically a bank is divided into a bunch of subarrays, and each subarray — this kind of ugly picture — has a local row buffer. Then there's the global I/O circuitry, which we've been calling the global row buffer but really shouldn't, going forward.

The second key observation is that subarrays are mostly independent. If you look at how they're organized, you have a global row decoder that's shared, and after the address is decoded, each subarray operates mostly independently; the other global structures are the global I/O circuitry and the global bitlines, which we'll see in a bit more detail. Those are the shared structures — everything else is mostly independent. So the key idea is to reduce the sharing of these global structures so that you can access subarrays mostly independently. Today, you cannot access one subarray while you're accessing another subarray in the same bank — well, that's not exactly true: later work by Giray showed that you can actually do that, in a limited way — but in this work, about ten years before Giray showed that part of it is possible in real DRAM chips, we wanted to do it by modifying the DRAM, so that we could access two subarrays mostly in parallel.

To do this, you need to reduce the sharing of the global row decoder to enable nearly parallel access to subarrays: you need to be able to activate two rows in at least two different subarrays, and a global row decoder that's shared between the subarrays doesn't allow that — so we need some latching, as we will see. The global bitlines don't enable reading from two subarrays concurrently, and we're not going to enable that — that's expensive — but we are going to reduce the sharing on the global I/O lines so that while we're reading from one subarray we don't disturb the other subarray; or, put another way, we want to cut the connection of one subarray to the global bitlines while we're reading from another subarray. Those are the two basic ideas. So let's take a look at how they work. You have two different subarrays here, both with their local row buffers, and then there's the global I/O — I'll show the global bitlines later. Let's take a look at
So let's take a look at how these work. You have two different subarrays over here, both with their local row buffers, and then there's the global I/O; I'll show the global bitlines later. First, how do we reduce the global decoder sharing? Normally, you decode the address and there's a latch over there — a global latch — but this global latch drives all of the subarrays, as you can see. So in order to activate two subarrays concurrently, or consecutively, we just need local latches. It's a little more expensive, but it enables the activation of two different subarrays. Makes sense, right? In one cycle you send the address to this subarray; in the next cycle you send an activation for another row in another subarray. That's the idea. This way you can activate two subarrays concurrently — more than two, as well. Okay, so we've solved one problem: the global decoder sharing problem is gone.

The other problem is the global bitlines. There are global bitlines running vertically — not as many as local bitlines, as we discussed earlier — and these are the ones that take the data from the local row buffers to the global I/O circuitry. I'm going to show them in a different way over here; this is a logical picture, clearly not a physical one — you don't really have global bitlines sitting outside the array, not touching it — but it demonstrates how things are connected. Normally, whenever you send a read command, you take the data from the activated local row buffer and move it over the global bitlines. Makes sense, right? But normally there's no switch between these global bitlines and the local row buffer. What you bank on is that only one row in one subarray is activated today; if that's the case, you're guaranteed that only one local row buffer will drive the global bitlines when a read command arrives. But now we're violating that: we have at least two activated local row buffers. If we get a read command, first we need to decide where to read from; and assuming we read from one of them, we have to be able to supply the data from that particular subarray's local row buffer to the global I/O. To do that we need additional connectivity, meaning switches, so that's what we add. We have a designated latch that designates which subarray we're going to read from. Assuming we're going to read from this subarray, that latch closes the switch. Normally the switch is open — meaning that when you activate, you don't immediately connect the columns over here to the global bitlines; none of them are connected. But when you actually issue a read to a particular subarray, you set the designated latch, and then you connect the data — the column you're reading — from the local row buffer to the global bitlines. Makes sense, right? That's the idea. Then, if you want to read data from another subarray, you deactivate — you set the designated-latch bit of this blue subarray to zero, so its switch is disconnected — and you designate the other subarray to be connected to the global bitlines. Makes sense, right? So there's clearly some switching overhead — subarray-to-subarray switching overhead — and you need another timing parameter for it, but it's much better than precharging the entire bank, activating another row, and only then reading. That's what we're saving, basically, when there's a bank conflict and the data is mapped to two different subarrays.
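Here is a small behavioral sketch of the bank state I just described: per-subarray activation state (the local latches) plus a single designated bit that selects which local row buffer may drive the global bitlines on a read. The class, method names, and command granularity are made up for illustration — this models the controller-visible behavior, not the actual circuit or DRAM command encoding.

```python
# Behavioral sketch of a MASA-style bank (illustrative, not the real interface).

class SalpBank:
    def __init__(self, num_subarrays: int):
        self.open_row = [None] * num_subarrays  # per-subarray local latch state
        self.designated = None                  # which subarray drives the global bitlines

    def activate(self, subarray: int, row: int):
        # Local latching lets this subarray open a row while others stay open.
        self.open_row[subarray] = row

    def read(self, subarray: int, column: int):
        if self.open_row[subarray] is None:
            raise RuntimeError("READ to a subarray with no open row")
        if self.designated != subarray:
            # Subarray-to-subarray switch: clear the old designated bit, set the
            # new one, connect that local row buffer to the global bitlines.
            # This costs a small switching delay, not a precharge + activate.
            self.designated = subarray
        return (self.open_row[subarray], column)  # placeholder for the data

    def precharge_all(self):
        self.open_row = [None] * len(self.open_row)
        self.designated = None

bank = SalpBank(num_subarrays=8)
bank.activate(0, row=17)        # "blue" subarray
bank.activate(5, row=4242)      # "yellow" subarray, activated right after
print(bank.read(0, column=3))   # designated bit -> subarray 0
print(bank.read(5, column=9))   # switch designated bit -> subarray 5, no precharge
```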
Now compare that to the baseline. Imagine a bank conflict where you keep reading from the blue subarray and the yellow subarray consecutively — blue, yellow, blue, yellow, blue, yellow. In the baseline you would be incurring the tRC delay for all of those accesses: you essentially have to precharge and activate, precharge and activate, precharge and activate. Now we reduce that to only the switching delays. You activate blue, you activate yellow, and they stay activated; assuming your access pattern is blue, yellow, blue, yellow, you just incur the switching penalty, which is still some time, but much, much lower than precharge plus activate. So it's very low-latency switching. I like this idea a lot, as you can see, and other people like it a lot too, but of course it comes at a cost: you need the per-subarray latches and the designated-subarray bits. The switching capability is already kind of there, actually, we just don't usually talk about it.

Okay, so this is the baseline bank organization, and this is how we change it: we distribute the global latch into local latches, we have the designated bits over here, and we add the switching capability such that a designated bit controls the connection from the local columns over here to the global bitlines. There are some overheads — maybe we underestimated the overhead a little, but it doesn't matter too much; it's very tough to get these overhead numbers right — but that's how it is.

Okay, results. The paper has a lot of results; it actually proposes three progressive mechanisms — one of them is quite similar to what Giray later did, though not exactly, since his was with refresh. We see significant dynamic DRAM energy reduction, because you reduce all that row-buffer churn — open, close, activate, precharge, and so on — and the row hit rate also improves, because now you can service requests from different subarrays and keep their rows active. We also see significant system performance improvements. These are different flavors of SALP: some have very low overheads but smaller benefits; the most aggressive version, multiple activated subarrays (MASA), is what I just described, and it gets you almost the same performance as having as many banks as there are subarrays. Having that many banks is very expensive, because you need to replicate the decoder circuitry across all of them: having the same number of banks as subarrays costs about 36% area overhead, whereas even if you believe we underestimated our area overhead, it's about 1%, maybe 2%, maybe 3% if you're terrible at designing — it really shouldn't be 3%, I think, but manufacturers sometimes exaggerate how much design effort things take. I think everybody in industry exaggerates their design effort in general; they don't want to change anything, and if you can make money without changing anything, that's the best mode of operation for industry — terrible for progress, of course. I'm joking, of course, but it's also terrible for staying in business in the long term, I think.
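Before moving on, here is rough arithmetic for the blue/yellow alternation savings described above, with illustrative timing values — assumed round numbers, not figures from the paper or from any datasheet.

```python
# Illustrative comparison: baseline bank conflict vs. MASA-style switching.
T_RC_NS      = 50.0   # assumed row-cycle time (precharge + activate + restore)
T_SWITCH_NS  = 5.0    # assumed subarray-to-subarray switching delay
ALTERNATIONS = 1000   # blue, yellow, blue, yellow, ...

baseline_ns = ALTERNATIONS * T_RC_NS                     # re-open the row every access
masa_ns     = 2 * T_RC_NS + ALTERNATIONS * T_SWITCH_NS   # two activations, then only switches

print(f"baseline: {baseline_ns:.0f} ns, MASA: {masa_ns:.0f} ns, "
      f"~{baseline_ns / masa_ns:.1f}x faster on this pattern")
```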
Okay, so more on SALP — this is the paper. The good news is that folks at Samsung and Intel picked up on our paper almost immediately. They did their own simulations and basically concluded that SALP should be used, and one of the reasons was not necessarily the whole package we proposed, but this: write latencies are increasing, and relaxing the write latency makes the design much more robust — if you don't relax the write latency, you start getting errors and you need to add additional ECC and so on. But if you relax the write latency — make it longer — you can then use subarray-level parallelism to overcome the increased latency. So you can tolerate the increased latencies, and that's what they show and evaluate on their performance simulators: with SALP they get much better results, which is nice. You can read their paper for more detail; they have a paper and a presentation as well.

So why has nobody implemented this yet? These two companies actually pushed for the adoption of SALP in the extended DDR4 standard at the time. Unfortunately, they lost the political game, even though they are two major companies, Samsung and Intel — two companies that don't normally talk to each other, but in this particular case they wanted the same thing, and they verified a lot of what we said in the paper. Yes, our overheads were a bit underestimated, but they still believed it could be done easily. Unfortunately, they could not convince the voting body in the JEDEC standards process, at least at the time — who knows, maybe this will come up again — and essentially they got outvoted because people didn't want to change. So you could have a much better DRAM with subarray-level parallelism today if that vote had gone the other way. This is one of the reasons I think these standards bodies work against innovation in general. These are innovative ideas, and there were people who recognized that they were innovative and verified them independently — this was all done completely independently of us; we didn't touch any of their simulators, they wrote the paper and discussed it with us afterwards, and it's published, you can find it. So there is clear innovation here, but a standards body is made up of many companies whose goal is not necessarily innovation. Some of them have ulterior motives — clearly many have the motive of making money, and some have very low margins, so they don't want to change anything: there are memory-controller companies that don't want to change the memory controller, and memory companies that don't want to change the memory, and in that situation innovation gets squashed. Hopefully this gives you a good story of how innovation sometimes doesn't get adopted: you may have a great idea, but it may not get adopted immediately; it may take some time. I still believe this is a good idea and it will get adopted at some point; we'll see how long it takes.

Yes? Yes — basically, we add a few separate DRAM commands; we try to minimize the changes to the interface, but you do need some new commands, yes. In the smaller versions, though — if you look over here, in the paper we have SALP-1, SALP-2, and MASA — some of them need even fewer commands. I'd recommend reading the paper, and I think the version these folks evaluated was not the most aggressive version we proposed, so even with the, let's say, less aggressive version they got significant benefits.

Okay, I think this brings us to the end of the lecture. I'm happy to take questions, but given that we have a time slack of only three minutes, I don't think I'm going to cover any more ideas. Maybe I'll just tell you this, unless there are questions: I'll
just give you a one-slide overview of the next topic, and then we can part ways. Basically, we're going to tackle the one-size-fits-all approach to latency specification. In the earlier part we talked about changing our thinking about the DRAM microarchitecture to reduce latency; now we're going to talk about not necessarily changing the DRAM microarchitecture, but getting rid of the inefficiency we already have. Reliable operation latency is actually very heterogeneous — there is no single fixed latency. Temperatures, chips, parts of a chip, voltage levels: they all lead to different reliable operation latencies. The general idea in the works we're going to cover is to find out, and use, the lowest latency at which one can reliably access a memory location. Ideally you would do this dynamically, as we will see later, but you could potentially do it statically as well. We're going to cover a bunch of ideas that do this across chips, across temperatures, within a chip, within different parts of a chip, and across voltage levels, and also use reduced latency for other purposes, like true random number generation. But to be able to do this, we really want to find the sources of latency heterogeneity and exploit them to minimize latency or create other benefits. That's the idea.

Why is latency heterogeneous? Basically, process and manufacturing variation: you have different manufacturing and operating conditions, and this leads to variation in the timing parameters. It leads to latency variation across chips — some chips have very strong cells that can be accessed faster reliably, some chips don't, so you need to access those a bit longer so the circuit settles and you get the correct result. Even within a chip there is some distribution — of course not exactly the distribution shown here, but some distribution: some cells are strong, some cells are weak. What DRAM manufacturers do today is specify a fixed latency; they don't take any of this into account. They specify a standard latency such that almost all the chips they manufacture are acceptable. Why? Because that's good for business — yield. If the standard latency were set lower, a lot of chips would not be acceptable, and as a result you would make less money. There could also be some binning — that's done with processors — but we can talk about that later. So we're going to see how we can take advantage of this latency slack, and not only that: we will see that you can do other things on top of it, for example to enable predictability. Okay, I'll see you next week — have a nice weekend.
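To make the "find and use the lowest reliable latency" idea a bit more concrete before we cover the actual mechanisms, here is a minimal sketch of the kind of profiling loop involved. The interface functions (set_trcd_ns, write_pattern, read_back) and the timing defaults are hypothetical placeholders; real studies do this with FPGA-based DRAM testing infrastructure and are far more careful about data patterns, temperature, aging, and guard bands.

```python
# Hypothetical sketch: shrink tRCD for one DRAM region until errors appear,
# then back off and add a guard band. All interfaces and values are placeholders.

def lowest_reliable_trcd(region, set_trcd_ns, write_pattern, read_back,
                         standard_ns=13.75, step_ns=1.25, trials=100,
                         guard_band_ns=1.25):
    best = standard_ns
    trcd = standard_ns
    while trcd - step_ns > 0:
        trcd -= step_ns
        set_trcd_ns(trcd)                      # program the reduced timing
        errors = 0
        for t in range(trials):
            pattern = 0xAA if t % 2 else 0x55  # alternate data patterns
            write_pattern(region, pattern)
            if read_back(region) != pattern:
                errors += 1
        if errors:                             # first failing latency: stop shrinking
            break
        best = trcd                            # last latency that passed all trials
    return best + guard_band_ns                # conservative margin on top
```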