Transcript for:
[Lecture 10] Understanding Memory Controllers in System Design

All good? Okay, shall we start? If you can fix the camera concurrently, that would be good; I don't know why these things are not very robust. The sound is better too; it was too loud earlier, I think. Okay, let's get started.

So today we're going to cover memory controllers. We've been touching on memory controllers for a while, but now we'll give a more in-depth treatment of what they do, why they're so important, and why they're getting even more important going into the future. From my perspective, we spend a lot of time optimizing the processor, but everything goes through the memory controller; everything in the system goes through the memory controller to access main memory, and I don't believe we apply enough diligence to optimizing the memory controller. One of the main reasons, in my opinion, is that we are very processor-centric: processor, processor, processor. Everybody has the processor in their head, everybody is educated with the processor as almost the only thing that executes programs, and as a result the other parts of the system are largely ignored by most people. This is inherent in the thinking, in the educational system, and also in the way industry operates in general; industry values processors a lot more. Why? I think there are business reasons that are not really nice, in my opinion, but I think we should apply a lot more diligence to memory controllers, and hopefully this lecture gives you a bit of that mindset as well.

Okay, so what's a memory controller? Essentially, long-latency memories have similar characteristics that need to be controlled. I will use DRAM as the example in this lecture, but the scheduling and control issues are similar in the design of controllers for other types of memories too, like flash memory, as we will see later on, and emerging memory technologies, which we will probably see a little earlier than flash memory. These other technologies potentially place other demands on the controller as well. Almost all of these technologies have endurance issues: whenever you write to a cell, you destroy it a little bit, so you cannot do more than some number of writes to a cell, and the controller needs to deal with that. This is a first-order job of the controller: how do you ensure that a cell doesn't wear out, so that you don't lose data? This is not a problem in DRAM currently, but there are other requirements that other memories impose. So you always need to design the memory controller to control a particular technology. Some of the things you do are more or less independent of the technology, like scheduling requests, but there are technology-dependent aspects too: if the technology is such that accessing one cell is faster than accessing another cell, the controller can take advantage of it. So the design of the controller has technology-specific aspects but also technology-agnostic aspects, and it's good to understand which is which, so that you can build effective controllers. And we've always argued in this course that you should design the controller so that you can make the underlying technology even more effective; we've seen that with RowHammer, for example, and with data retention.
Today we're going to look at issues like scheduling. Before I go into DRAM, let me give you an example from SSD controllers. We're going to see this in more detail later, but SSD controllers essentially control flash memories; SSDs today are made of flash memory, usually NAND flash. There is also NOR flash, but it's a very small portion of the SSDs manufactured today, because it's not as scalable as NAND flash. SSD controllers are flash-memory specific: they need to deal with the characteristics of the flash memory chips. For example, there are sophisticated error-correcting codes, as we will see later on. There are sophisticated mechanisms for wear leveling, meaning leveling the wear-out across different cells so that no one cell goes bad much earlier than the others. They do a lot of voltage optimization so that the data you read is read accurately: you don't get errors, or you minimize the errors so that the error-correcting codes can correct them. They do things like garbage collection: there's an erase operation, which we will see when we talk about flash, that doesn't really exist in DRAM, and as you write, pages become invalid, blocks become invalid, and you need to collect and reuse them later. And there are page mapping mechanisms, needed for multiple reasons: partially because of garbage collection and unallocated blocks, but also because of wear leveling. So an SSD controller is actually quite complicated, and we will see this more when we discuss flash memory, but keep in mind that there are a lot of tasks here, as you can see.

For that reason, an SSD controller usually has hardware components specialized for particular tasks, like hardware ECC and scheduling, but it also has software components for other things like wear-out management and garbage collection. So there's a full processor inside SSD controllers today; some SSDs actually have eight different cores in there trying to manage this memory. If you look at this, it's kind of a system by itself. Although there's a connection to the host, which could be the CPU, internally you have a lot of things: there's a buffer manager, and that buffer is DRAM, so there's a DRAM controller inside the SSD too. It's fascinating, I think. This picture should make you think a little bit: we're designing a whole system just to get data out of storage, the system is actually general purpose, and we're treating it as only a servant to the CPU or GPU or whatever we connect it to. Does that really make sense? You're designing all this complexity just so you can get the data out, send it to the CPU, and go back and forth. It's good to think about what else you could do with this system. That's why I think memory controllers are extremely important, but people need to think about them differently to really make them much better.

Okay, so this is a better picture, perhaps, from one of our papers, where we summarize a lot of the works on how SSD controllers operate to manage errors. If you're interested you can take a look right now, but we're going to look at it more later. There are other things this controller does too, like scrambling of the data so that you get a better power distribution, potentially encryption, potentially compression; there are a lot of tasks here that the processors or hardware units perform to manage the data.
And this is the paper the picture comes from. I've recommended it multiple times, and I believe it's a very good introduction to SSDs. Error management is a big task of these controllers, as we have seen in DRAM. In DRAM we talked about robustness issues, retention issues, refresh; those are essentially error management issues. Refresh, in my opinion, is an error management issue, an integrity management issue: if you don't do it, you get errors. The reason we do refresh today is to prevent errors and data corruption and maintain data integrity, and the same is true for RowHammer. So if you thought those errors were ominous and plentiful, once you look at other technologies you'll see a lot more errors. This is just our characterization from the paper I mentioned: you get wear-out errors, you get errors when you program cells, you get errors where doing something to one cell interferes with other cells, you get data retention errors like in DRAM, and you get read disturbance mechanisms that are slightly different from DRAM's. There are different mechanisms behind these errors, and different techniques have been developed to handle them. This paper categorizes the error types as well as the mitigation mechanisms, and the mitigation mechanisms target different errors, as you can see here; it's good to do this sort of characterization. The set of errors flash is subjected to is also growing, so at some point, maybe 10 years down the road, we'll write a paper that looks similar to this; 10 years ago the set of error types was relatively small. And this doesn't even include soft errors. We don't deal with soft errors here because they exist in any technology: soft errors are errors that can happen, for example, due to a particle strike; a neutron strikes your circuit and you get a bit flip. This is just random luck. It's not inherent to the technology, it's external, and it can happen to anything. It can happen to you too: a particle can strike you, and you'll hopefully be okay, unless you have some safety-critical electronic circuitry on you. Basically it's random, it affects every technology, and that's why it's not classified here.

Okay, this is a more up-to-date version of that paper that talks about controllers, and there's more work on SSD controllers that we have done. I'm just going to flash these slides, because we're going to cover them at some point, but there are interesting issues here that we're going to cover specifically in DRAM to begin with. Just to give you an idea, these issues also exist in SSDs: fairness issues and interference issues exist in SSDs, there are security issues in SSDs, reducing latency is important in SSDs, and reliability and temperature affect SSDs too; in fact, temperature has a big effect in SSDs. And there are good reasons to build SSD architectures to enable parallelism, like the subarray-level parallelism we talked about: similar ideas can be applied to SSDs, and you may actually have more freedom in the SSD, so that you can access many chips or subarrays in parallel.
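Before we leave SSDs: to make the wear-leveling task mentioned above concrete, here is a minimal, purely illustrative sketch. This is not any real flash translation layer; the class name, the endurance limit, and the policy (steer writes to the least-worn free block, retire blocks near an assumed limit) are all invented for illustration. Real FTLs add hot/cold data separation, static wear leveling, and interaction with garbage collection.

```python
# Hypothetical, simplified wear-leveling allocator (illustration only).

ENDURANCE_LIMIT = 3000  # assumed max program/erase cycles per block

class WearLeveler:
    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks       # P/E cycles seen so far
        self.free_blocks = set(range(num_blocks))  # blocks available for writes
        self.retired = set()                       # blocks considered worn out

    def allocate_block(self):
        """Pick the least-worn free block so that wear spreads evenly."""
        if not self.free_blocks:
            raise RuntimeError("no free blocks: trigger garbage collection")
        return min(self.free_blocks, key=lambda b: self.erase_counts[b])

    def erase_block(self, block):
        """Count an erase; retire the block once it reaches the assumed limit."""
        self.erase_counts[block] += 1
        if self.erase_counts[block] >= ENDURANCE_LIMIT:
            self.retired.add(block)
            self.free_blocks.discard(block)
        else:
            self.free_blocks.add(block)
```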
Okay, so we have lectures coming up on this. Mohammad may deliver some of them, or Rakesh in the back; they've been working on SSDs, and if you're interested you can talk with them. We also have an SSD course that goes into a lot more detail on these issues than we can in this course; in this course we'll probably have one, maybe two lectures on storage and SSDs. Okay, let's jump to DRAM now. Any questions on what I've covered so far? Yes?

[Student question, partially inaudible, about the latency impact of these mechanisms.]

Exactly, yes; that's a good question. Different studies actually examine the latency impact, so you can look at those studies and see how these mechanisms affect latency. But in SSDs those latencies may not be as important, because it takes microseconds to access the data, and if you add maybe one microsecond, that's okay as long as you hide it in some way. In SSDs you have a lot more freedom, so a lot of these techniques that are specifically designed for SSDs may not be immediately applicable to DRAM. But that's a real concern: if DRAM becomes much harder to control and you add mechanisms to overcome those scaling issues, latency is always a problem. We will also see that latency is a problem when we talk about scheduling; in DRAM, latency is a lot more critical.

Okay, so we've discussed this before, but I want to remind people that there are a lot of different DRAM types. These DRAM types are there to cater to different types of accesses, different use cases: for example low-power DRAM and high-bandwidth DRAM; I'm going to list them here. These DRAM types have different interfaces optimized for different purposes. Commodity DRAM is commodity: it's general purpose, not really optimized for anything other than capacity; capacity is perhaps the best way to put it. Low-power DRAM is specifically optimized for low power while keeping as much capacity as possible. High-bandwidth DRAM is specifically optimized for high-bandwidth, high-throughput access, and graphics DRAM in particular is specifically for graphics. Then there is low-latency DRAM, optimized for low latency, and we discussed that it usually comes at a cost. And there are 3D-stacked memories that also provide high bandwidth, not necessarily designed specifically for graphics, which are becoming more popular due to machine learning applications and their prevalence in GPUs. And there's more that may come. Today there's a proliferation of DRAM types because there are many use cases of computers, and a single DRAM type is not good enough for everything. If you demand high bandwidth, you cannot get it without buying, say, 3D-stacked or high-bandwidth memory. If you want extremely high capacity, you probably don't want only high-bandwidth memory in your system. If you want very low latency, well, good luck: you'll have to give an arm to buy one of these low-latency DRAMs today, and don't lose all of your arms. I'm not joking; these people charge a lot so that you can get low-latency memory. But in the end the underlying microarchitecture is fundamentally the same: the DRAM operation we have seen, banks, ranks, and so on, is fundamentally the same, the DRAM cells don't change, the DRAM subarrays don't change; internally the chips are mostly the same.
What's really different is the analog interface and the basic packaging characteristics, like 3D stacking, for example; but if you look at the DRAM dies themselves, they're similar. So ideally you'd like to design a flexible memory controller that can support various DRAM types, but this is very tough, as we have discussed in an earlier lecture. First of all, it complicates the memory controller in many ways, because you need to support multiple types. It's difficult to support all types because they all have different characteristics: different latency characteristics, different refresh characteristics, different numbers of banks, different widths, and so on. And maybe you want to be able to upgrade; that's difficult to do as well. But the most difficult part is really the analog interface. When you communicate with these DRAM types, you need to design an analog, or more accurately a mixed-signal (mixed digital and analog), interface that operates at quite high frequencies; for DDR it's more than 4 GHz today, and it keeps increasing faster than I can keep up with. This interface design is not easy because it's very high-frequency design. If you've done any analog or mixed-signal design, getting things working at low frequency is not easy to begin with, and at high frequencies like the two-plus gigahertz we're talking about, it's even more difficult; getting things reliable at those frequencies is hard. It's also extremely costly in terms of area, as we will see soon: to be reliable, these interfaces need digital-to-analog and analog-to-digital converters, as well as drivers and I/O pads where you drive the signals out of the chip at the memory controller. So if you look at a single interface to a single DRAM chip, it's actually pretty large, and you have to replicate it for a whole channel, maybe 64 bits, maybe 128 bits. Okay, you design one and replicate it; that's not that bad, hopefully it works. But say you designed an LPDDR5 or LPDDR6 interface, as happens in a lot of SoCs today: if you now want to add, for example, HBM on top of that to the same chip, you have to completely redesign the analog and mixed-signal interface. It adds to your cost, probably more than doubling it, because you also need to figure out where to lay these out on your chip; in the end your chip is limited by the I/O pins. So this is really the difficulty of adding more memory types to the same chip. Does that make sense? And on top of the analog interface you still need to design the controller, and the controller needs to be different for these different types. As a result, most chips today have a single interface to DRAM, for good or bad, and this direct interface from the memory controller, from the processor chip, to DRAM minimizes latency. One way of thinking about the memory controller is to put it on a separate chip; this used to be the case, as we mentioned, around 18 years ago. AMD was the first to bring the memory controller onto the processor chip, and then Intel followed suit, because that was the right thing to do: with VLSI integration you clearly have a lot of logic to use, and bringing the memory controller onto the processor chip was a good idea because it cut the latency.
It's much better than scheduling requests to a separate memory controller chip, which then schedules requests to a DRAM chip or DRAM module. Latencies to memory were reduced once by bringing the memory controller from a separate chip onto the processor chip. It's good to think about this, but then the problem is that you can no longer connect the processor chip to a different memory interface; it used to be much more flexible in the past, and it's much less flexible today. Now, today there are interfaces like Compute Express Link (CXL), where people are trying to communicate with the outside world with high bandwidth and "low latency" in quotation marks, because it's not clear that it can be low latency. With CXL you can be more flexible: you can send a request to memory through the CXL interface, and it goes to a memory controller chip, like it used to in the past. But this means there's additional latency. You can probably access different kinds of memory chips more flexibly with a CXL-type interface, but it leads to higher latency, and there's no way around it, I think: once you go off chip to another chip, and that other chip does the memory control functions and the scheduling, you're going to get more latency. People may say CXL will have the same latency as today's DRAM; I think that violates the fundamental laws of physics. I don't know how it can be possible, because you can design a much higher-bandwidth, lower-latency interface if you optimize that interface for a particular DRAM type on the processor, and the length of the interconnect is just fundamental.

Okay. You remember our Ramulator paper; this slide shows the DRAM types around 2015. Even in 2015 there were a lot of DRAM types. You can see that HBM was just coming around at that time (it was introduced in 2013), Hybrid Memory Cube was there, and there are other things that are perhaps not as important today, but it's good to think about them. This is from the original Ramulator paper, and we've also seen this picture where we evaluated many different types of memory and their interactions with workloads. So you're already familiar with this, with the new version of Ramulator, and with the study we discussed in the past.

Now let me give you some examples of DRAM control logic. If you look at DRAM control logic, it's actually pretty large. This is from AMD, in 2006: the DRAM interface is a lot of logic and analog interface. This is from Apple: the LPDDR interface; they have eight memory channels, and they occupy a lot of area. This is AMD again, what they call the global memory interconnect, which includes the memory controller: a lot of area. This is IBM, memory signaling: again, a lot of area. IBM actually has a flexible memory interface; IBM's interfaces have been more or less CXL-like for many, many years, because they needed to design these big-iron machines, meaning machines with lots of processing and lots of memory. They couldn't afford to have just a small amount of memory connected to a processor; they needed a flexible interface so they could grow the amount of memory, and this is part of that flexible interface. Even that flexible interface runs at high frequency, and even that occupies a lot of area, as you can see.
I could keep going through these examples, but you can see that memory controllers occupy a lot of area. Any questions? If not, let's go through some functions.

A memory controller has many functions; here, specifically, a DRAM controller. You need to ensure correct operation of the memory: in DRAM there's refresh, there's timing, and increasingly other things like RowHammer; correct operation requires thinking about a lot of things. On top of this, you need to service DRAM requests while obeying the timing constraints of the DRAM chips; that's part of correct operation, and there are many timing constraints, as we have seen, because you need to take into account resource conflicts (bank, bus, channel, rank), minimum write-to-read delays, and so on. We will see more of these. The memory controller also needs to translate requests into DRAM command sequences. Requests come from different agents, like processors (CPUs, GPUs) or DMA (direct memory access) engines, and they specify an address. One of the jobs of the memory controller is to issue the right commands: you break a request down into commands. The request doesn't come with activate, precharge, and so on; you need to figure out whether you need to activate a row, when you need to precharge, which column you access, and so on. All of that is part of the memory controller's job.

To be able to do that, the memory controller has an important task, which is buffering requests, because requests usually arrive at a rate faster than the rate at which memory can service them: processors are much faster and DRAM is much slower, so there needs to be buffering in the memory controller to account for that difference. And once you have buffering, you need to decide which request to schedule next so that you get high performance. This requires the memory controller to take into account the underlying characteristics: bank conflicts between different requests, row conflicts, row buffer conflicts, and increasingly, depending on the structure of the DRAM, bank-group conflicts (JEDEC introduced bank groups, which group banks together, and you can get conflicts at that level too), channel-level conflicts, and rank-level conflicts. You really need to take those into account to get high performance, so this requires reordering requests while accounting for all of the resources you're controlling.

On top of this, a good memory controller should manage power and thermals, so that you don't exceed some temperature and you don't consume unnecessary power. Ideally, you would turn off banks or ranks that you're not using at that moment; maybe you put them in self-refresh mode, even though you may not know exactly what's unused. As we discussed in an earlier lecture, somebody suggested not refreshing parts of memory that are not allocated: a great idea, but that information may not be communicated to the memory controller today. The memory controller can, however, say, "I haven't used this rank in a long time; I'm going to turn it off," meaning put it in self-refresh mode. There's a self-refresh mode in DRAM: the controller tells a rank, "I'm not going to touch you for a while," and when it needs that rank again, it sends a signal saying, "Get out of self-refresh mode, I need to access you." These are some tasks the memory controller can do to save power, but clearly this complicates things.
Okay, so this is what a somewhat modern memory controller looks like. It's not the best picture, but it's still a not-bad picture, I think. There are the DRAM chips over here on the side, and there's clearly electrical signaling that needs to happen: a signaling interface, an electrical interface, a mixed-signal interface; I would say this is the DRAM interface, which is not easy to design to begin with, and it's not the subject of this course. If you really want to learn how these things are designed, you need to take a high-speed I/O circuits course: mixed-signal, analog, digital. It's actually a beautiful design problem, but it's not the subject of this course; we're doing computer architecture, and we're assuming this part is done. But again, that's a very important skill that increasingly few people in the world have, and I think we need it to really design good systems. We need all kinds of skills in the world; we don't need only AI engineers or data scientists. Sorry, I had to say this, because sometimes people miss this important point: an AI engineer is not going to design that interface, and that interface is extremely important. Nothing against AI engineers; I'm just saying that not everybody in the world needs to be one, and machine learning is not going to design it either. It may help, but there really needs to be a great domain expert designing the circuit well.

On the other end, the memory controller interacts with agents: CPUs, GPUs, FPGAs, I/O engines, DMA engines. A lot of different things can be connected to the memory controller, and today many things are connected, as we will see soon. These all keep injecting requests into the memory controller. The memory controller's task is to take those requests, buffer them, and identify where they should go: which rank, which bank, potentially which bank group, depending on the structure of the DRAM, and which row and which column. The memory controller really needs to do that translation, and once the translation is done, there's some queueing, and at some point there needs to be a scheduling mechanism, because there may be many requests waiting.
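To make that translation step concrete, here is a toy sketch of decoding a physical address into DRAM coordinates. The field widths and bit positions are invented for illustration; real controllers choose their own mappings, and often apply XOR-based hashing to spread traffic across channels and banks.

```python
# Toy physical-address-to-DRAM-coordinate decode (illustration only;
# all field widths are made up).

def decode(paddr):
    # bits 0-5 select the byte within a 64 B cache block / data burst
    col     = (paddr >> 6)  & 0x3FF   # bits 6-15:  column
    bank    = (paddr >> 16) & 0x7     # bits 16-18: bank
    rank    = (paddr >> 19) & 0x1     # bit  19:    rank
    channel = (paddr >> 20) & 0x1     # bit  20:    channel
    row     =  paddr >> 21            # remaining bits: row
    return channel, rank, bank, row, col
```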
Then there's another design decision to be made here: do you have a single buffer that is used to distribute requests to different banks, different ranks, and so on, or do you have separate buffers for different banks or bank groups? This is a classic example of static versus dynamic partitioning of buffer space. Do you have a single buffer dynamically partitioned across different banks, or different buffers statically partitioned across banks? You can imagine the trade-offs: if one bank gets a lot of requests, its buffer becomes full while all the other banks' buffers may be empty, and that's the downside of static partitioning in general; you cannot utilize the buffers of the other banks. But dynamic partitioning of a large buffer may be more complicated. So this is a fundamental trade-off between distributing (partitioning) the buffers and having a unified buffer, and we will talk about it more when we talk about resource management. You can see that there are a lot of decisions to make when you design a memory controller; it's not an easy thing. It's a fun thing, but not an easy thing.

This is our picture, maybe a slightly more simplified picture, of a memory controller from a paper that we will cover, maybe today, probably not today, depending on the speed; I need to catch my flight too, so we may need to end a bit early today. This is an example from a multicore system. You get requests from some number of caches, different cores potentially, or different shared caches, and then each bank has a separate request buffer, and there's reordering that may happen: each bank has a separate scheduler. So these are partitioned schedulers and partitioned buffers, but at some point, because the channel itself is shared across all of these banks (the data bus, the command bus, and the address bus are shared across different banks), you need a global scheduler that decides: which bank scheduler's request should I take, or which refresh should I take, what should I do? There needs to be a global decision at some point; you cannot partition everything completely, because in the end your channel is shared across all of these entities, like banks. It's fun, right? A lot of design decisions. I've been designing memory controllers for a long time, and this is one of the hardest parts of the system.

Did I tell you the story of Chuck Thacker? Yes? You remember, or was it in the seminar course? It was probably in the seminar course. Who knows about Chuck Thacker? If you've taken the seminar course, your hands would be up. Anybody else, who hasn't taken the seminar course? Nobody knows Chuck Thacker? He's a Turing Award winner; does that help? And he won the Turing Award deservedly, I should say. Basically, Chuck Thacker was a pioneer in building personal computers: he designed them in the 1970s, maybe even the late 1960s, while he was doing research at Xerox PARC. The Alto system, for example, was one of the first personal computers, and he designed very complicated systems for their time. He also designed the Tablet PC, for example, while he was at Microsoft, and I met him at Microsoft; we talked a lot with each other and worked with each other. I was doing a lot of research on memory controllers at the time, and we were publishing papers like the ones I'm going to describe. He was, at the time, very interested in designing emulation platforms, FPGA emulation platforms, so they could accelerate simulation, system-level simulation, and one of the tasks he was busy with for a long time was designing the DDR3 memory controller for that FPGA platform. We used to discuss a lot of issues related to that, and he used to complain a lot, saying the memory controller was the worst thing he had designed in his life; worst in the sense that it was the most difficult. So, having designed many, many systems, he found memory controllers the worst and most difficult, and I kind of agree with him.
I think this is one of the most difficult parts of a system, and I believe it needs to become even more sophisticated, because it's so important that we cannot just leave it with simple, simplistic designs. Okay, so now you know Chuck Thacker; if someone asks you about Chuck Thacker, you can tell them about him. I think he won the Turing Award in 2009.

Okay, so one of the tasks of a memory controller is really to schedule requests, and we're going to look at some scheduling policies, starting with simple ones. The memory controller buffers requests; it could be a single bank-level buffer, it could be a global buffer across banks, it doesn't matter. You need a scheduling policy to decide which request to schedule next, and of course one of the goals is to make sure you don't waste bandwidth, that you don't waste data bus utilization; we'll come back to that later in a different way. It's not an easy task. What does not wasting data bus utilization mean? Keeping the data bus busy all the time. And how do you do that, given the many timing constraints we will see?

Let's talk about some general scheduling policies. First-come-first-serve: it's really not scheduling, in my opinion; you basically take things from a FIFO queue, which is relatively easy. Oldest-arriving-request-first is another way of describing first-come-first-serve. Most memory schedulers today are variants of first-ready first-come-first-serve (FR-FCFS). I'll simplify a little: the idea is that you schedule the requests that hit in the row buffer first, and if you have multiple such requests, you pick the oldest-arriving one among them. It's a prioritization order, in the end. This is slightly better because it maximizes the row buffer hit rate, meaning it tries to maximize DRAM throughput, at least in a myopic, short-term fashion: you're optimizing row buffer locality at the current time, which doesn't mean you're optimizing for the row buffer locality that's important in the future. It's good to keep that in mind; current-time and future decisions may be different. It may be better to proactively open another row and schedule the requests that go there, depending on the criticality of the requests, for example. This also gives you an idea of the difficulty of the problem.

Scheduling is actually done at the command level, meaning you don't normally schedule things at the request level. You could, but in the end you have to translate things into commands. Some schedulers do schedule at the request level and then issue the commands, so that they don't need to do command-level scheduling, but you can do scheduling at the request level as well as at the command level. At the command level, column commands (read, write) can be prioritized over row commands (activate, precharge), and within each group, older commands are prioritized over younger ones. This is a command-level formulation of the FR-FCFS policy.
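Here is a minimal, request-level sketch of the FR-FCFS prioritization just described: among the queued requests, row buffer hits win, and ties are broken by arrival time. The class and function names are mine, and this ignores all timing constraints; it's only the prioritization order.

```python
# Minimal FR-FCFS request-selection sketch (illustrative, not cycle-accurate).

from dataclasses import dataclass

@dataclass
class Request:
    arrival: int   # arrival timestamp (smaller = older)
    bank: int
    row: int

def fr_fcfs_pick(queue, open_rows):
    """Pick the next request from a non-empty queue: oldest row-buffer hit
    if any exists, otherwise the oldest request overall.
    `open_rows` maps bank -> currently open row (or None if closed)."""
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival)
```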
Okay, so I think we have seen this sort of picture before, so I'll go through it relatively quickly; it's just a review of DRAM bank operation. You have a two-dimensional array of rows and columns, and you have to bring the data into the sense amplifiers, the row buffer, which is initially empty. If the memory controller needs to access row 0, column 0, it needs to activate row 0 first, which takes some time, and then, once the data is in the row buffer, the memory controller sends a column command, a read in this particular case, to read the data out of column 0, which brings the data out of the DRAM chip. Now, if you have another access that goes to the same row buffer, this is a row buffer hit: the row buffer holds row 0, which is what we're trying to access. Of course, the memory controller needs to keep some metadata to track which row is open in each bank, for correctness purposes, not just for performance, because if somebody accesses row 1, for example, you need to write back row 0, precharge the array. So in this case it's a row buffer hit, the memory controller needs the logic to determine that, and then it sends the column address to get the data out of column 1; this is again a read request. And if the processor, or one of the CPUs, accesses row 0, column 85, the memory controller again realizes this is a hit, sends the column address, and gets the data out. So the memory controller's job is to ensure that this works correctly, and maybe also to optimize for row buffer locality.

Now, if the access coming to the memory controller, maybe sitting in the queue, is to row 1, column 0, the memory controller should realize that row 1 is not open in this bank (I'm assuming all these accesses are to the same bank, of course). This is a row buffer conflict, and the memory controller needs to first precharge the array; you know what that is by now: make sure the sense amplifier is set to VDD/2 on both ends and disabled, along with the bit lines. Then it sends an activate command for row address 1, which brings the data into the sense amplifiers, the row buffer, and then the memory controller can send the column address. Clearly this took a much longer time: a row buffer conflict takes much longer than a row buffer hit. As a result, if you want to maximize the throughput coming out of DRAM, you'd better optimize for row buffer locality: maximize the hits into your row buffer.
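The three cases in that walkthrough (hit, closed bank, conflict) map directly onto the command sequences the controller must generate. Here is a self-contained sketch of that mapping for a read request; the function name is mine, and all timing checks are omitted (see the timing sketch later in this lecture).

```python
# Sketch: command sequence for one read request, given row-buffer state.

def commands_for(bank, row, open_rows):
    """Return the DRAM commands needed to read (bank, row).
    `open_rows` maps bank -> currently open row, or None if precharged."""
    open_row = open_rows.get(bank)
    if open_row == row:
        return ["READ"]                           # row buffer hit: fastest
    if open_row is None:
        return ["ACTIVATE", "READ"]               # bank closed: row buffer miss
    return ["PRECHARGE", "ACTIVATE", "READ"]      # row buffer conflict: slowest
```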
We'll see this more, but that's one of the considerations. In the end, a scheduling policy is really a request prioritization order. I'm not going to cover every scheduling policy today, but just to give you ideas: prioritization can be based on the age of the request (its arrival time), on row buffer hit or miss status, or on the request type, if you have that information; maybe you treat read requests differently from write requests. Actually, in DRAM you have to: DRAM read requests and write requests are different because they do different things to the memory bus, the channel. You cannot schedule a read while you're scheduling writes, because when you're writing, you're pushing data in one direction over the memory bus, and when you're reading, the direction of the bus needs to be turned around. This is called the bus turnaround time. Unfortunately, there's no way to read and write concurrently through the bus; you have to switch the bus direction, and this introduces special delays like tWTR (write-to-read) and tRTW (read-to-write). Now you can see there's another scheduling decision to make; it's not just about the row buffer. I cannot schedule writes at the same time I schedule reads, so how do I handle writes? We know writes are not as important, because you're writing to DRAM and hopefully the processor is not waiting for them; a read, on the other hand, could be critical, because the processor may be waiting for it immediately. But what happens if my write buffer gets full? If the write buffer is full, I have to push these writes out to DRAM; otherwise, where am I going to put the next write? So you need to switch to a mode where the memory controller does writes for a while, and while it's doing writes on that channel it cannot do reads, meaning you're now delaying reads, which are supposed to be more critical, because you have to do the writes; at that point, the writes have become critical too. These are the design choices you need to make when you design a memory controller; one of the design choices, anyway.

And then, if you have this information, what about prefetches? A prefetch is not something the processor needs immediately; a prefetcher generated the request, and if you're lucky it tagged the request, saying "I want to prefetch from this location," which is a hint to the memory controller that this may not be as important. How do you deal with that? Do you prioritize prefetches over writes? Do you ever prioritize prefetches over reads? It's good to think about. We actually have papers on this topic, on prefetch-aware DRAM controllers, and sometimes prioritizing prefetches over reads is a good idea, especially if you're exploiting row buffer locality with those prefetches. You don't want a prefetch to cause a bank conflict later on; you schedule the prefetches that go to an open row and get them out quickly, maybe delaying some reads that you know are demand reads. The alternative is letting those prefetches sit unserviced for a while, servicing a read that will close the row (which takes time), and then servicing the prefetch, which closes the row and opens it again. So, as opposed to causing more bank conflicts, you may be better off prioritizing a prefetch request. It's good to think about these decisions, and what I'm building up to is this: making all of these decisions under a lot of timing constraints is not easy for a human. As a result, the scheduling policies we will see are relatively simple and heuristic-based, and they may not work under all conditions, even though you can generate a lot of insight into why some policies work. Putting everything together is not easy.
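Going back to the write-drain decision above: a common heuristic is a drain mode with high and low watermarks, so that writes are batched and the bus turnaround penalty is amortized. The sketch below is illustrative; the class name and both threshold values are invented.

```python
# Sketch of a watermark-based write-drain policy (thresholds are assumed).

HIGH_WATERMARK = 48   # assumed: start draining writes at this occupancy
LOW_WATERMARK  = 16   # assumed: stop draining and go back to serving reads

class DrainMode:
    def __init__(self):
        self.draining = False

    def serve_writes(self, write_queue_len, read_queue_len):
        """Return True if the controller should issue writes this cycle."""
        if self.draining:
            if write_queue_len <= LOW_WATERMARK:
                self.draining = False   # drained enough; resume reads
        elif write_queue_len >= HIGH_WATERMARK or read_queue_len == 0:
            self.draining = True        # buffer nearly full, or nothing to read
        return self.draining
```

Draining down to a low watermark, rather than issuing one write at a time, means the expensive read-to-write and write-to-read turnarounds happen once per batch instead of once per write.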
Okay, and then there could be the requester type; if you're lucky, you may have this information too. Is this read request coming from a store miss, meaning the processor needs to store to a location, it missed in the cache, and the cache block needs to be brought into the caches from memory? Or does it belong to a load miss, meaning the processor is trying to read from that location so it can execute something dependent on that read? These have two different criticalities. A store miss may not be that critical to the processor, because the store instruction may not have any dependent instructions (of course, a load may depend on a store, but that may be farther down the instruction stream), whereas a load miss may be immediately critical, because the data needed by the load may be needed by an add instruction right afterwards, or a branch instruction, and so on. So if you have this information, how do you prioritize requests? Without any other information, I would probably prioritize the load.

Then there's request criticality, which we've been circling around, but maybe there are other dimensions of it. How long is this request going to stall the processor? Will it stall the processor only one cycle, because its latency is overlapped with other requests, or will it stall the processor for 100 cycles, because its latency is not overlapped by any other request, because it's the only request the processor is waiting for? This criticality is an interesting thing, and how you approximate or measure it is actually not easy; later we may talk about criticality, but today we're not going to go into it that much. People have proposed different ways of thinking about criticality, and I don't think we have a very good way at this point, or at least not a perfect one. Basically: is it the oldest miss in the core? How many instructions in the core depend on it? Will it stall the processor? These are all good questions to ask. "Will it stall the processor" is, in the end, a good measure, but can you guess that easily? That's another question.

And then there's the interference cost to other cores. We've talked about all of these things from a single-processor perspective, but what damage is this request going to do to the performance of other agents trying to access memory? If I prioritize this request from thread A over another request from thread B to the same bank, I'm going to delay thread B's request; maybe thread B is more critical from the user's perspective, maybe it needs to be prioritized for other reasons. This is also something the memory controller needs to think about, and if you don't think about it, as we will see later on, you run into problems like denial of service and unfairness. There are many other things, but I think this is a good quick summary of what prioritization could be based on; there's also room for creativity here, and people should think about other things prioritization could be based on. Any questions? Yes?

[Student suggests tagging loads that feed branch instructions.]

Yes; now you're making things more complicated, potentially, but yes, potentially you could tag requests. That's one way of thinking about criticality, or non-criticality. If a branch instruction depends on a load, for example, that may be good to communicate, because if that branch is highly mispredicted, it can cause a problem: that's potentially a very highly critical branch, and because the branch is highly critical, the load is probably also highly critical. But if the branch is predicted correctly, maybe you don't care. So it's good to think about these things; there's no perfect answer, unfortunately. Maybe you can talk to Rahul, who is going to come up with the best mechanism to predict criticality soon, and hopefully use it in memory controllers, but in other parts of the system too.
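As a purely hypothetical illustration of combining the signals just discussed (load versus store miss, number of dependents, age), here is a toy criticality score. The weights are arbitrary; the lecture's point is precisely that no single measure is known to be perfect.

```python
# Toy criticality heuristic (all weights are invented, for illustration only).

def criticality_score(is_load_miss, num_dependents, age_cycles):
    score = 0
    if is_load_miss:
        score += 100              # load misses tend to stall the core sooner
    score += 10 * num_dependents  # more dependent instructions -> more critical
    score += age_cycles // 50     # slowly escalate old requests (avoid starvation)
    return score
```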
If you can really understand the criticality of instructions, then you can design much better resource management policies, because an instruction whose criticality you've quantified correctly needs to be prioritized everywhere: in the processor, the caches, the memory controller, storage, wherever it touches. The problem is that criticality is difficult to identify in the first place, and criticality is dynamic; it changes over time. If you prioritize something now, next time it may not be critical; and once you prioritize it, maybe something else becomes critical. So there are multiple things that can become critical relative to each other. Okay, these are good thoughts.

Okay, let me cover row buffer management policies relatively quickly, because we've touched on this. I've given you a lot of dimensions in memory controllers, not everything; we'll see a bit more, but we're not going to cover everything. Even row buffer management policies are subject to optimization. There are usually two major policies. One is keeping the row open after an access; that's called the open-row policy. This is good if the next access to the same bank needs the same row: that's a row hit. It's bad if the next access needs a different row: that's a row conflict, and wasted energy; in that case you would have been better off with a closed-row policy. The closed-row policy says: close the row after you access it, unless there are other requests already in the request buffer that need the same row. That's the not-so-dumb closed-row policy: always closing the row is kind of dumb; if you know there are other requests waiting for this row, you're probably shooting yourself in the foot as a memory controller. But if you look at your queue and see no other request waiting, maybe you should close the row so that someone else can access the bank. Then, if the next access needs a different row, you avoid a row conflict, which is good; but if the next access needs the same row, you pay an extra activation latency. There's no perfect choice here, unfortunately, unless you somehow predict the future, and that's what a lot of controllers do: they use adaptive policies, they try to predict whether or not the next access to the bank will be to the same row, and act accordingly. How you design that predictor is not easy; there could be different policies, maybe machine learning could be employed here. But this is only one aspect of the memory controller, as you can see: row buffer management is an important aspect, but it's only one aspect, and it clearly integrates with the scheduling policy you have in the memory controller. Do you exploit row buffer locality as much as possible? In general, it's useful for memory controllers to exploit row buffer locality.
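One simple form such an adaptive open/closed predictor could take is a saturating counter per bank, trained on whether the next access to the bank hit the row that was left open. This is a generic sketch, not any specific published policy; the 2-bit width and threshold are assumed.

```python
# Sketch of an adaptive row policy using a 2-bit counter per bank
# (a simple illustrative predictor; real policies and thresholds vary).

class RowPolicyPredictor:
    def __init__(self, num_banks):
        self.counters = [2] * num_banks   # 0-3; >= 2 means "keep the row open"

    def keep_open(self, bank):
        """Predict: leave the row open (True) or precharge eagerly (False)."""
        return self.counters[bank] >= 2

    def update(self, bank, next_access_was_same_row):
        """Train on the observed outcome for this bank's last open row."""
        if next_access_was_same_row:
            self.counters[bank] = min(3, self.counters[bank] + 1)
        else:
            self.counters[bank] = max(0, self.counters[bank] - 1)
```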
Maybe one more thing here: these policies don't need to live only in the memory controller. You could imagine a last-level cache policy that's aware of the row buffer: when the last-level cache is evicting a cache block, it says, "I have a bunch of potential victim candidates; I'm going to choose the one that maximizes row buffer locality downstream, given the information I have about the status of the row buffers right now." If you have that communication between the memory controller and the last-level cache, then you can co-optimize row buffer locality. Of course, this may not be the best decision for the cache eviction itself, but there's usually some slack in the eviction decision. This is a mechanism we proposed, which I'll talk about very briefly, in a paper I'll mention soon.

Okay, this is just a table that shows what happens if you have a given row policy: this is your first access, this is your next access, and these are the commands needed for the next access. I'm not going to go through the table; you can study it on your own. There's nothing new here that's different from what I just said; it just makes things more concrete by telling you which commands are needed for the second access, given a row policy, a first access, and a second access.

Okay, let's talk about power management very quickly; I'm going to take a break at a clean point, as I said. DRAM chips have power modes. The general idea is to power down a chip, or a portion of it if you have the capability, when you're not accessing it. This is, unfortunately, a scenario that in my opinion is not very well studied, but I think it will become more important, especially with 3D stacking of DRAM. Yesterday we were talking about DRAM being stacked on top of processors in SoCs, and this will really impact power consumption as well as temperature; temperature and power consumption are closely related things, with a direct relationship between them. We need to manage power much better in DRAM going into the future.

These are the power states that exist in a DRAM chip. Active means something is actually being accessed; that's when you consume the highest power. Active is the highest-power state: a row buffer is active and connected, and it's burning power at that time. Then a particular bank could be idle, assuming your chip supports it, meaning you close the row and the bank is idle; maybe you do some power management in that bank. All-banks-idle closes all of the rows, and maybe applies power management that reduces the power supply, the voltage, and so on. Then there's a power-down state, and then there's self-refresh, which is the lowest-power state. I mentioned self-refresh: you turn off as much as possible except for the internal refresh circuitry, and you set that internal refresh circuitry such that the DRAM refreshes itself independently, without requiring the memory controller to send refresh commands. That's the lowest power mode, and even then it clearly consumes some power, because some circuitry has to stay on and refreshes happen periodically.

So the trade-off is that when you transition from one state to another, low power to high power or high power to low power, you incur latency during which the chip cannot be accessed. This is dead time where you transition the power state, and it clearly has an impact on performance. So if you're optimizing for power, you need to be careful: when should I put the DRAM into the self-refresh state? This is again a kind of prediction mechanism.
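A common shape for such a policy is an idle timeout: the longer a rank has been idle, the deeper the power state it is moved into. The sketch below is illustrative only; both thresholds are invented, and picking them well is exactly the prediction problem being described, since every transition costs wake-up latency.

```python
# Sketch of an idle-timeout power policy (threshold values are assumed).

POWER_DOWN_AFTER   = 100     # idle cycles before entering power-down
SELF_REFRESH_AFTER = 10000   # idle cycles before entering self-refresh

def target_state(idle_cycles):
    """Pick a power state for a rank based on how long it has been idle."""
    if idle_cycles >= SELF_REFRESH_AFTER:
        return "SELF_REFRESH"    # lowest power, highest exit latency
    if idle_cycles >= POWER_DOWN_AFTER:
        return "POWER_DOWN"      # moderate savings, cheaper to exit
    return "ACTIVE_OR_IDLE"
```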
For example, I'm not using my phone right now, so its DRAM had probably better be in self-refresh mode; but when I am using my phone, maybe self-refresh mode is not that great, because you need to get out of it quickly. Going into self-refresh mode and getting out of it actually has the highest penalty in terms of latency, but the other states also have penalties. So this is not an easy optimization problem; there are papers on this topic, but I believe the power states in DRAM need to be re-examined as well.

Okay, so this is a much cleaner and nicer point to pause. Today I'm going to take a shorter break, because we're going to end early. Let's take a break until 14:20, ten minutes, and then we'll be back with more DRAM control.

[Break.]

Okay, I think it's time to get started. Now that we've covered some basics, let's talk about the difficulty of DRAM control. I've given a lot of this already, but essentially, we need to obey many DRAM timing constraints for correctness. We're going to see a bit more of these, and it turns out 50 is actually an understatement: there are more than 100 timing constraints in modern DRAM. This is because of the way the interface is designed: the interface is designed so that the memory controller knows everything about the timing of the DRAM, so that it can control it in a relatively optimized fashion. As we saw in the last lecture, it's not fully optimized, since all these timing parameters are worst-case parameters; but having many timing parameters is still better than having a single timing parameter that is the worst case across all of them. You could have one timing parameter that is the worst possible and say, "After every request I issue, I'm going to wait that longest time." That would be very bad. That's why people have arrived at this interface, with so many timing parameters for different fine-grained actions, and the memory controller needs to be aware of all of those timing parameters so that you avoid that single, even longer worst-case wait. Does that make sense? This is a conscious design choice in synchronous DRAM, where the memory controller is in complete control and needs to schedule all of these requests based on the timing parameters.

As I mentioned, there's tWTR, the write-to-read delay, which is the minimum number of cycles to wait before issuing a read command after a write command is issued; you cannot do reads and writes concurrently, except under some overlapping conditions I'm not going to go into. And there's tRC, which you already know: the row cycle time, which is what "latency" is usually measured as, though it's clearly not the only latency measure. It's the minimum number of cycles between issuing two consecutive activate commands to the same bank; it's really the row conflict time. It's called the row cycle time, but you can also think of it as the row conflict time. There are other timing constraints, as we will see in a little bit, but it's a lot to optimize for within these constraints. Then you need to keep track of many resources, as we have discussed, and the number of resources is increasing over time; bank groups are not on this slide, for example, but they were introduced relatively recently. You need to handle refresh, clearly; we've spent one full lecture on that topic. And you need to manage power consumption, as we have discussed briefly.
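To see what obeying timing constraints means mechanically, here is a small sketch of the bookkeeping a controller can do: for each resource, track the earliest cycle at which each command type may next issue. Only two constraints are modeled, and the numeric values are assumed; real DDR standards have far more constraints, as just discussed.

```python
# Sketch: tracking earliest-issue times to enforce timing constraints.
# Only tRCD (bank level) and tWTR (rank level) are modeled; values assumed.

T_RCD = 15   # assumed activate-to-read delay, in controller cycles
T_WTR = 8    # assumed write-to-read delay, in controller cycles

class TimingState:
    def __init__(self):
        self.earliest_read_per_bank = {}  # bank -> earliest cycle READ may issue
        self.earliest_read_rank = 0       # rank-wide write-to-read constraint

    def on_activate(self, bank, now):
        """After ACTIVATE, a READ to this bank must wait at least tRCD."""
        self.earliest_read_per_bank[bank] = now + T_RCD

    def on_write(self, now, burst_cycles):
        """tWTR is measured from the end of the write data burst,
        not from the write command itself."""
        self.earliest_read_rank = now + burst_cycles + T_WTR

    def can_read(self, bank, now):
        """A READ is legal only once both constraints are satisfied."""
        return (now >= self.earliest_read_per_bank.get(bank, 0)
                and now >= self.earliest_read_rank)
```

A real scheduler consults checks like `can_read` for every candidate command every cycle, which is part of why the scheduling hardware itself is nontrivial.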
optimize performance and quality of service in the presence of all of these constraints. And designing the controller hardware itself to do reordering is not simple: even something like reordering is not that simple in the end, because to be able to reorder, you need to figure out what to reorder, and you also need to do some matching of requests, which makes the hardware more complicated. And once you add other constraints like fairness and quality of service, as we will discuss either at the end of this lecture or at a later lecture, this really complicates the scheduling problem. And here we're not even considering a whole lot of other things, like energy, for example; well, power is in there, but maybe there are also other optimizations you could do for energy. Okay, so this is the paper that I mentioned, which basically proposes to make the last-level cache aware of the characteristics of DRAM, so that you can schedule writebacks from the cache in a way that doesn't hurt performance, or in a way that exploits row-buffer locality in DRAM. This is the paper reference over here; it's a technical report, it was never published, but it's actually a pretty interesting paper, I think. It basically shows how to schedule writes and reads in DRAM. And these are some of the timing parameters, over here, from that paper; if you're interested, you can take a look. More recently we've written papers that actually clarify the timing parameters. One way of learning about timing parameters is to go and read DRAM datasheets, which is not bad, but those are actually terrible for understanding and insight in general: they don't tell you why the designers made the design decisions they made. It's better to read papers that are written with insight as a first-class constraint. In datasheets, insight is not the goal; the goal is really to tell the memory controller designer: this is what you need to do, don't ask me why. But good papers usually don't just tell you what; in fact they almost always tell you why before they tell you what. That's why, when you're writing a good paper, you should not focus on the what first; you should really focus on the why first. What is your goal, and why are you making the design decisions you're making? In the end, I don't really care about the decisions you're making if you're not justifying them very well. And there are good reasons for these timing parameters. These two papers that we've discussed, the SALP paper and the TL-DRAM paper, explain the reasoning reasonably well, I think, and they show what the scope of each timing parameter is, etc. I'm going to talk about some of these in a little bit, but let's take a look at why they exist. This is also a picture that you saw yesterday, when we talked about true random number generation. Basically, these timing constraints exist because you need to give enough time to the circuit to settle; that's part of the reason for the timing constraints. You activate, and after activation the cell starts sharing its charge with the bitline. You can see the bitline voltage was at 0.5 Vdd and the cell is charged, so the bitline voltage is perturbed toward the higher state, and at some point the sense amplifier gets enabled. This is not controlled by the memory controller today, as we discussed with the CODIC paper yesterday.
The sense amplifier enabling happens internally; there's internal timing circuitry that kicks in the sense amplifier, and then the sense amplifier amplifies the bitline voltage as high as possible, and at some point you can do a read. This read timing constraint is determined so that you don't get the wrong data when you read; that's the basic idea of a timing constraint, and this one is called tRCD, the activate-to-read delay. As we have discussed, you could actually pull this in earlier, because the cell may reach a ready-to-access voltage level that's reliable much earlier than the tRCD timing constraint, but we add a guardband, for the various reasons we discussed yesterday (temperature variation, spatial variation, dynamic variation, etc.), to get the correct result while having a single timing parameter. And you have also seen yesterday that different cells behave differently. There are strong cells that could be accessed earlier; there are weak cells that actually eat into the guardband, and hopefully they don't go outside the guardband. Okay, so there's a reason for these timing constraints, as you can see: to make sure that the circuits settle and give you the correct result after you take an action. Clearly there are other timing constraints that we're not going to go into, but the SALP paper, the subarray-level parallelism paper, does a nice job, I think, of describing at least a major set of the timing parameters, by also showing pictures of what happens. So there are some timing constraints that are at the bank level, like what we have seen: tRCD is at the bank level, between activate and read and write commands, and this is its value in a particular standard, the DDR3 standard. Let's see, what should I pick; tWR: after doing a write, you need to wait for some time before a precharge, so that the data is actually guaranteed to be written, including the guardband that's added; in this particular case it's 15 nanoseconds, for example. Then there's a rank-level timing constraint, read-to-write. This is actually not necessarily specified this way in the standard, and we say it's not explicitly specified by the JEDEC standard, as you can see, but I think intuitively, insightfully, it is a read-to-write timing constraint; you need to obey it as a function of other timing constraints, but it's good to think about it as a read-to-write constraint. Basically, this is how long you need to wait before you issue a write command after you issue a read command. It's rank-level, meaning you need to enforce it at the rank level, for any bank: you cannot issue a write after you issue a read, to any bank in the rank, unless you wait this long, and it's 11 to 15 nanoseconds, as you can see. The other way around also exists, which is write-to-read, as we have discussed. And there are also interesting idiosyncrasies. For example, we say this is the delay between the write command and the read command, but it's not really from the command: there's an asterisk over here that you need to read. It goes into effect after the last piece of write data, not from the write command, because you send the write command and then you send a burst of write data, and you should really make sure that you schedule the read command tWTR nanoseconds after you send the last piece of write data to the DRAM.
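To see how a controller keeps track of all this, here is a small sketch of the bookkeeping a scheduler might do. It covers only a handful of the 100+ constraints, with placeholder values, just to show the "earliest legal issue cycle" idea and the bank-versus-rank scoping:

```python
# Placeholder timing values in controller cycles -- illustrative only.
T = {"tRCD": 10, "tRC": 34, "tWTR": 6, "tRTW": 8}

class BankTimer:
    """Earliest cycle at which each command may legally go to this bank."""
    def __init__(self):
        self.earliest = {"ACT": 0, "RD": 0, "WR": 0, "PRE": 0}

    def can_issue(self, cmd, now):
        return now >= self.earliest[cmd]

    def issued_act(self, now):
        # Bank scope: activate-to-read/write delay (tRCD), and
        # activate-to-activate to the same bank (tRC, the row cycle time).
        self.earliest["RD"] = max(self.earliest["RD"], now + T["tRCD"])
        self.earliest["WR"] = max(self.earliest["WR"], now + T["tRCD"])
        self.earliest["ACT"] = max(self.earliest["ACT"], now + T["tRC"])

def rank_issued(banks, cmd, now, burst_cycles=4):
    """Rank scope: these constraints apply to *every* bank in the rank."""
    for b in banks:
        if cmd == "WR":
            # Note: tWTR counts from the end of the write burst,
            # not from the write command itself.
            b.earliest["RD"] = max(b.earliest["RD"],
                                   now + burst_cycles + T["tWTR"])
        elif cmd == "RD":
            b.earliest["WR"] = max(b.earliest["WR"], now + T["tRTW"])
```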
So you need to really know all of these to design the memory controller; if you make a mistake, you can get a wrong data value, you can get data corruption. Yes? Yeah, that's a great question. It could be different: datasheets usually specify a variety, or a range, of timing parameters; they don't dictate a single set. Usually these timing parameters are read when the memory controller initializes: you read something called the SPD circuitry in the DRAM module, and you get the timings that are employed by that particular module. So it could be changed, as you suggest, but there is a range that's usually specified in the datasheets, under different conditions also; under different voltages these may change too. Basically, for different voltages and different temperatures there are different timing parameters. Yes, that's correct. Yeah, it's all from the memory controller's perspective, because in all of this the DRAM is a dumb agent; it can do nothing other than obey the commands. No, it doesn't exist; the assumption is that the DRAM is completely reliable. But in really extreme scenarios there is an alert signal that can be raised by a DRAM chip to say: something wrong has happened with me. Exactly: for writes you don't necessarily know; there's no acknowledgement. We're going to talk about some of that soon; maybe there should be a better mechanism. The assumption is that the DRAM handles all of that and doesn't need to acknowledge. This also reduces the protocol overhead, in a sense: if there were acknowledgements that you needed to wait for, the protocol overhead would be high. Acknowledgements are useful in an unreliable medium, of course; the assumption here is that the medium is reliable, mostly, hopefully. So you can see that it's not designed for complete robustness. Ataberk, do you have anything to add about the alert command? Do you know when it's used, how it's used? Yes, so that's basically a kind of negative acknowledgement, but it's not clear when you receive the alert, right? It may be a bit too late; you may have already proceeded a lot. So the protocol is really not fully optimized for robustness; I think it just assumes things are done. Okay, we can talk more about that later; there may be errors happening at different places where the alert may not be able to catch them. But these are good things to think about: how do you design a better protocol? Okay, what else do I want to say over here. There are also channel-level constraints that I'm not going to get into; for example, tCCD, between any read or write commands, needs to be enforced at the channel level. This is not a comprehensive set, but I think it's an insightful way of thinking about the commands. And there's another explanation in the other paper that I'm not going to go into, but basically my point so far is that DRAM controller design is actually becoming more difficult, and it's going to become more and more difficult. If you look at an SoC today, all of the processors go through a DRAM controller, and we're going to expand this to other memory controllers, like hybrid memory controllers, later on when we talk about emerging technologies; we've seen examples of hybrid memory controllers also. So essentially you have heterogeneous agents accessing main memory, and there's interference happening between them,
and the memory controller needs to do everything that we said, plus manage all of that interference. As a result there are many goals that you need to satisfy at the same time: performance, fairness, quality of service, energy efficiency, power, and who knows what else; maybe you'll come up with some other metrics. We will see more of this also. So, one of the things that we were working on while I was at Microsoft (going back to the conversation with Chuck Thacker; at the time, Chuck Thacker was complaining about the memory controller being the worst thing he had designed in his life) was to make the design easier for the people who design memory controllers. And I like this work; it's one of the relatively early applications of reinforcement learning to hardware design, and I think there needs to be more and more of this sort of approach, to hopefully make things more robust and also easier to optimize. So I'm going to describe this a little bit. This appeared at ISCA in 2008; it will be one of your readings, as you can see. The idea is that it's really difficult to design a policy that maximizes performance. We're going to focus only on performance here, but then there are other metrics; don't forget, there are too many things to think about. Thanks, Rahul: too many things to think about. Is this you? No? Okay, you just found it somewhere. Rahul enhanced my boring slides with pictures like this. Jackie, it's Jackie Chan, really. Okay, it's a meme. And why is he doing that? He's frustrated about memory controllers, apparently; I think Jackie Chan would break the memory controllers if he saw them. Okay, but basically there are too many things to think about, and there's continuously changing workload and system behavior too. You may design a policy that is heuristic-based and works for some workloads or some system behavior, but it may not work very well for some other workloads. When do you switch from read mode to write mode, for example? Do you use a high watermark, meaning if 80% of your write buffer is full, you start servicing writes? Is that a good threshold? Is it graceful? Usually that sort of hard threshold is not very graceful (I'll show a small sketch of this kind of watermark policy below). There are workloads that abuse it; abuse in the sense that they don't necessarily know the thresholds, but they send a lot of requests, and now your threshold is bad for one workload and good for some other workload, potentially. So it's very hard to design things such that you really take into account the changing behavior of different workloads, and there are many workloads. If you can imagine some legal sequence of actions happening to DRAM, it happens in real workloads. Real workloads are so varied and complicated that they do things you may never think would be done. I'll go back to the comment that we received on the RowHammer paper: show this on real workloads. That was one of the comments, and one of the reasons for rejection; we couldn't show it for real workloads. I mean, we could have, I guess, but that's a lot of work. Yet RowHammer bit flips do happen in real workloads; people actually found them. Maybe the natural tendency is to not expect that this would happen. So real workloads do a lot of things, basically, and it's very hard to optimize for everything that's done by real workloads. And if you think hardware is complicated, software is orders of magnitude more complicated: there are a whole lot more people writing software, there's all kinds of software, and it all executes on these systems, so it's very hard to optimize for everything.
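Going back to the read/write switching question for a moment, here is the small sketch I promised of that kind of hard-threshold write-drain heuristic. The watermark values are arbitrary tuning knobs, which is exactly what makes such policies graceless for some workloads:

```python
HIGH_WATERMARK = 0.8   # start draining writes at 80% write-buffer occupancy
LOW_WATERMARK  = 0.2   # switch back to reads at 20% occupancy

class WriteDrainPolicy:
    def __init__(self, capacity):
        self.capacity = capacity
        self.draining = False

    def mode(self, write_queue_len):
        occupancy = write_queue_len / self.capacity
        if occupancy >= HIGH_WATERMARK:
            self.draining = True     # writes backed up: service writes now
        elif occupancy <= LOW_WATERMARK:
            self.draining = False    # buffer drained: prioritize reads again
        return "WRITE" if self.draining else "READ"
```

A workload that hovers just under the high watermark, or bursts past it constantly, can make either choice of thresholds look bad; the policy itself never adapts.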
So basically the dream is: given all of this, wouldn't it be nice if the hardware designer didn't need to optimize for everything? Basically, figure out a way of designing a memory controller such that it automatically finds a good scheduling policy on its own. In this case we're only concerned with a scheduling policy to optimize performance, but this could be extended. So we're going to optimize the performance function. The memory controller sits between the cores and memory, and our goal is to resolve memory contention by scheduling requests; so, how do we schedule requests to maximize system performance? I think I've already said a lot of this, and the general idea is to really use machine learning: have an online machine learning policy that keeps updating while the memory controller is running, while the system is running, while the workload is changing, while the system is changing, potentially. That's the idea. And we found out, for reasons that are given in the paper, that there's a beautiful mapping of reinforcement learning to memory control. Reinforcement learning works especially nicely if the problem you're optimizing is a Markov decision process, and we motivate that, in this particular case, the memory scheduling problem is a Markov decision process; I'm not going to go into that right now. We're going to see Markov decision processes a little bit later, when we talk about prefetching; we'll go into a little more detail then, but you can read the paper. Essentially, the design that we have is a reinforcement learning agent, built in hardware, as the memory controller; then the question, of course, is how you design that thing. And reinforcement learning is nice because you essentially learn by getting reinforced, positively or negatively, after you take an action. Meaning: at a given state you take an action, and over time you observe whether that action was good for performance or bad for performance, and over time you accumulate a reward value for that action taken in that state. So over time you learn to associate state-action pairs with the rewards they lead to, assuming you've set up the system and the reward function right, and when you see the same state the next time, with some available actions, you pick the action that leads to the highest expected reward based on what you've seen. If that doesn't lead to a reward, again over time you get reinforced and you start, let's say, learning something else. So it's very adaptive: based on the rewards you've seen and the actions you've taken at given states, you change your policy over time. There's no reason to keep the same policy; you don't necessarily keep the same policy, you adapt. So it's much more adaptive than existing policies. Remember the existing policies that we've discussed: FR-FCFS, for example, picks the request that hits in the row buffer first, and then there's also a write scheduling policy, which we kind of discussed as well. This is much more adaptive, so hopefully it dynamically and continuously learns and employs the best scheduling policy to maximize long-term performance.
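For reference, the classic table-based update behind this family of agents is the textbook Q-learning rule below; I'm showing the generic form here, and the paper's exact formulation may differ in its details:

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big] $$

where $s$ is the observed system state, $a$ the scheduled command, $r$ the immediate reward (e.g., whether the data bus was utilized), $s'$ the next state, $\alpha$ the learning rate, and $\gamma$ the discount factor that makes the agent value long-term, not just immediate, data bus utilization.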
So, I think I've said this, but reinforcement learning is essentially one of the fundamental learning-based mechanisms. Sometimes it's forgotten, but the Nobel Prize went to AlphaFold, which is essentially a deep reinforcement learning agent, for discovering the structure of proteins, and I think that was a pretty good discovery. So reinforcement learning is actually employed, do you agree, in existing systems, for protein discovery. Alpha what? AlphaGo, yes, the game-solver kind. Okay, so AlphaGo is the game one; there are so many Alphas. Did it receive the Nobel Prize also? Okay, why not a Nobel Prize in games too? I think gamers would be happy. But these are really related to each other, AlphaGo and AlphaFold. In the sense of what it did to chess? Yes, and that's also deep reinforcement learning, yes. In this case, what we did was not deep reinforcement learning; this was before the time when deep learning became popular. Ours is a table-based, classical reinforcement learning, which I think still has a lot of applicability in hardware, especially when you need to make timing-critical decisions. Okay, so the basic idea: reinforcement learning is a fundamental learning technique. Actually, it was developed before computer science really developed, as a way of explaining how organisms learn: humans, dogs, animals, etc. They interact with the environment, because they're agents: they observe the state of the environment and take an action, and over time, if they get a reward given a particular state and action pair, they keep learning that reward, they keep associating that reward with the state-action pair, and they keep getting reinforced, and keep essentially taking the same action at the same state. That's how Pavlov, Ivan Pavlov, associated the ringing of a bell with food, and the dogs started drooling whenever they heard the bell, even when they didn't get the food, after some point. So that's reinforcement learning for you; it's very strong. And some people actually think this is the only form of learning; I think these are extremes, but for example B.F. Skinner, a very famous American psychologist, who was actually quite brilliant, trained pigeons during the Second World War purely based on reinforcement learning principles, and according to him, everything in the world was reinforcement learning, nothing else. I think that's extreme, but I think you need some extreme views to advance the state of the art and science, because a lot of people right now may not think that reinforcement learning is interesting, but I think there is actually a lot of value to it. Okay, but basically, applying this concept to a hardware scheduler like the memory controller: the scheduler can be thought of as an agent that observes the state of the system, and this could be anything in the system, actually, whatever you feed to this agent, and it takes an action, which is essentially scheduling a command, and that could also be a no-op. Then, based on that action, it observes the performance impact of the action, which is data bus utilization in this case; we're going to use data bus utilization as a good performance measure. You get a reward if you actually utilize the data bus, and over time you try to associate state-action pairs with these Q-values, or expected data bus utilizations.
But this is not short-term only: you don't look only at the data bus utilization that you got immediately, based on the action you've taken in the state; your reward function is weighted based on the impact of your decision over time. You may actually get the data bus utilized three cycles later, and your reward function gets affected by that. This enables you to optimize for a longer-term reward: even if you don't get an immediate reward right now, it may be okay if, ten cycles later, all of your data bus cycles are utilized; that may be indicative of what you did ten cycles ago, right? So there are design choices that need to be made, like this; this is why it's a longer-term optimization. Okay, so you get the idea without going into a lot of detail; you'll read the paper. The agent learns to associate system states and actions (commands, in this case) with long-term reward values. Each action at a given state leads to a learned reward, and at a given time the agent schedules the command with the highest estimated long-term reward value for the given state, and it continuously updates the reward values for state-action pairs based on continuous feedback from the system. So it's a very dynamic scheduling policy, as you can see, and the paper describes it in more detail. Basically: for the state, you can observe the state of the transaction queue; you could potentially observe other things, but we didn't do that; potentially anything in the system could be observed, even the processors. Actually, we do have information about the processors embedded in the transaction queue, as you will see. Then you take an action and you get a reward, and there could be different reward functions. To summarize: actions are relatively easy, but you still need to be careful. These are some actions that we distinguish: for example, a read caused by a load miss and a read caused by a store miss; we found it's important to distinguish between them. We also have a preemptive precharge command. Precharge-pending is when you need to precharge a bank in order to schedule a request, whereas preemptive precharge is a command that you can issue even though there's no request currently trying to access that bank. The reason to have this command is to give the memory controller the possibility to learn preemptive precharge: if the reinforcement learning mechanism somehow figures this out, it can precharge things early, so that it doesn't have to do the precharge on the critical path of a request. And then there's the no-op, which is also important: the memory controller can issue a no-op if it somehow learns that issuing a no-op is better than issuing a command. And this is possible: maybe there are two options, right? You don't do anything, versus you do something to close the row that's currently open. Maybe not doing anything is better, because you're going to get some other request soon to that row that is currently open; if you didn't have the no-op action, you would have to choose another action, which might be closing the row. So it's important to design the actions carefully, too, and there are good reasons why these are the ones chosen.
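Putting the pieces described so far together, here is a minimal table-based sketch of this kind of agent; this is my own simplification for illustration, not the paper's exact design:

```python
from collections import defaultdict

# Action set along the lines described above; the no-op guarantees there
# is always at least one legal action.
ACTIONS = ["RD_LOAD_MISS", "RD_STORE_MISS", "WRITE", "ACT",
           "PRE", "PRE_PREEMPTIVE", "NOP"]

ALPHA, GAMMA = 0.1, 0.95     # learning rate and discount (example values)
Q = defaultdict(float)       # Q[(state, action)] -> expected long-term reward

def schedule(state, legal_actions):
    """Pick the legal command with the highest learned long-term value."""
    return max(legal_actions, key=lambda a: Q[(state, a)])

def update(state, action, next_state, next_legal, bus_used):
    """Reward is +1 whenever a read or write actually used the data bus."""
    reward = 1.0 if bus_used else 0.0
    target = reward + GAMMA * max(Q[(next_state, a)] for a in next_legal)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```

The discount factor is what captures the "reward three or ten cycles later" effect described above: a data-bus cycle utilized later still flows back into the values of the earlier decisions that enabled it.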
Then there are the state attributes, which is one of the hardest parts. Actually, the hardest part is really the reward function in the end, but in this case we made it easy, because we assume that whenever we schedule read or write commands, we get a positive reward of one, and otherwise we get a zero reward. Essentially, the goal is to maximize long-term data bus utilization. That was our goal, but it may not be the best goal, for example if you're trying to provide quality of service to different threads. But our setting was running parallel applications, and maximizing data bus utilization is hopefully a good correlate of performance; it's not necessarily, even with parallel applications, but it's not a bad metric. We didn't want to deal with more complicated reward functions yet; our goal was really to show that with this reward function we get benefits. The state attributes are whatever you can imagine: in this paper we considered 200-plus different state attributes, and we basically did feature selection, through a lot of simulations, to figure out which state attributes are good things to consider when making the scheduling decision. And we found some interesting things: the number of reads, writes, and load misses in the transaction queue, for example; the number of pending writes to the referenced row, which is something memory controllers don't normally consider, and we found it's actually a good metric to use as part of the state attributes, because it may enable you to prioritize doing something with that row, maybe keeping it open, for example. Also, the number of reorder buffer heads waiting for the referenced row: the reorder buffer is the instruction-ordering structure in each core, and the oldest instruction is referred to as the reorder buffer head, so this counts the number of oldest instructions, across cores, that are waiting for the referenced row. If four cores are waiting for the referenced row, that row is probably important, and this indicates that maybe you should do something special in your scheduling for it; if zero cores are waiting for that row, maybe it's not that important. Again, we're reasoning about these after the fact, because they were selected by feature selection; we're not designing a policy that uses this information. The policy itself is dynamically found by the reinforcement learning agent, based on reinforcement learning principles, and the exact policy is something that we cannot really know. Actually, you can know exactly, at a given instant, what the decision may be; you could lay out all of those values as a decision tree. But the policy keeps changing, so this is a policy that you cannot design, in a sense, as a human. I don't know how to design it as a human, and I don't know a human who could; that's why this is learning, in a different way. But we know that this policy somehow makes use of this information to take better actions, as you will see. In fact, what we did, after doing the feature selection and figuring out that these are useful attributes, was to try to design a policy ourselves that uses all of this information, and you can see that in the paper: we came up with a prioritization order that takes into account all of these state attributes that we found are important for our reinforcement learning agent, a static policy designed by a human. So this is where feature selection is aiding the human designer.
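Just to illustrate what such a human-designed comparison policy could look like, here is a purely hypothetical prioritization built from the state attributes above. The actual ordering the paper derives is in the paper, and the request fields here are my own assumptions:

```python
# Hypothetical request fields: rob_heads_waiting_on_row, row_hit,
# pending_writes_to_row, arrival_time. Purely for illustration.

def priority_key(req):
    return (
        -req.rob_heads_waiting_on_row,  # rows that stall ROB heads first
        -req.row_hit,                   # then row-buffer hits over misses
        -req.pending_writes_to_row,     # then rows with batched pending writes
        req.arrival_time,               # finally, oldest first
    )

def pick_request(queue):
    # min() with this key implements the fixed prioritization order above
    return min(queue, key=priority_key) if queue else None
```

Note how static this is: the ordering of the tuple is frozen at design time, which is exactly the contrast with the learning agent that follows.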
And we found that our human-designed policy is better than existing policies, but still much worse than the reinforcement learning agent making use of exactly the same information. In that sense, that's the fairest comparison you can get: you expose exactly the same information to a human, who is hopefully a domain expert, and that human uses the same state attributes to design a policy as best as they can, and you get some performance; and the reinforcement learning agent, non-human, in hardware, uses exactly the same information to arrive at a policy that's completely different, and the hardware wins in this case. Maybe not as dramatically as AlphaGo, which Rahul is very excited about, but I think this is a tough problem also. Okay, there's also the request's relative reorder buffer order, which is basically how important the request is compared to other things in the instruction window. Okay, so I think I've covered all of these. How you actually design the hardware is something I'm not going to cover; that's also important and interesting, because there are decisions you need to make in hardware. For example, your hardware cost should not be high, and your decisions should be relatively quick: you need to satisfy the DRAM timing, and even though DRAM operates slowly compared to the processor, you still need to make decisions very quickly, because it's still pretty fast. So how do you make those decisions quickly while accessing large tables? There are large tables in hardware here; how do you minimize the size of those tables, etc.? For example, if your state-space granularity is very fine-grained, those tables become huge. Take the number of reads in the transaction queue: it could be up to 128. Do you keep one entry for every single value? That's actually very costly: then you have 128 things to keep track of, and if each constitutes a different state, that's very expensive. And how do you index into the table? These are multi-dimensional tables in the end, because you need to index using all of the state attributes, and we have eight of those; if each of them has 128 possible values, that's pretty expensive. So what we do is something called generalization, meaning we quantize things such that different levels of, say, the number of reads are considered the same state. This is quantization, basically, and it's good for hardware cost, but it's also good for generalization: you don't want to learn separately that a given action is important when you have one read in the transaction queue versus two reads versus three reads versus four reads versus five reads; maybe all of those are roughly equal, right? How you do the quantization clearly matters, because maybe equal-sized buckets are not the right thing to do, so there are actually a lot of design choices over here. But quantization is good for learning itself: otherwise you would have to learn all of those different values in the state space separately, whereas if you quantize, you generalize the state space, so that you learn once for that quantized value. These are actually fundamental things in learning in general, specifically reinforcement learning, but quantization clearly helps in other learning mechanisms too, as you probably know. So my point here is that quantization is not good just for hardware cost, but also for learning itself, because it enables generalization.
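A minimal sketch of that quantization idea, with arbitrary example bucket boundaries:

```python
BUCKETS = [0, 1, 2, 4, 8, 16, 32, 64]   # example thresholds for a 0..128 counter

def quantize(value, thresholds=BUCKETS):
    """Map a raw counter (e.g., reads in the transaction queue) to a
    small bucket index, so nearby values share one table entry."""
    for i, t in enumerate(thresholds):
        if value <= t:
            return i
    return len(thresholds)

def state_index(raw_attributes):
    # The quantized tuple indexes the Q-table: 8 attributes at ~9 levels
    # each is vastly smaller than 8 attributes at 128 levels each, and
    # nearby states now generalize to each other.
    return tuple(quantize(v) for v in raw_attributes)
```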
Okay, there are also other things. I should mention that in all machine learning systems there is exploration versus exploitation; this is a fundamental thing. Meaning: you're following a policy; do you keep following it all the time, or do you leave room for exploration as well? Whenever you need to take an action, you always consult the table that tells you: this is the action to take, given the state, so that you maximize the reward. That's good, but if you always keep doing that, you may be stuck in some local minimum in the policy space, and sometimes you need to explore: always exploiting may not enable you to learn better. This is also true in human learning, right? If you always exploit your know-how, how do you learn new things? Sometimes you should explore: go and do something completely different, and that puts you in a different space and you learn something else, which enables you to do whatever you're doing right now better. In research, for example, you read some completely different paper, and that gives you an idea, and that way you can do something better. Similarly, here we also explore. In this case the exploration is not as wild as reading a completely different research paper; you basically say: okay, my current policy recommends that I take this action, but I'm going to go against it, because with some small random probability I should be exploring, so that I can hopefully find a fundamentally better policy. So I'm going to take a random action, one that is also legal, of course, among the legal actions currently possible in the memory controller. This is important in learning systems so that you don't get stuck in one policy, and you need to design the hardware controller to do this sort of exploration as well. You can see the analogy with learning for humans, too: don't keep exploiting the same policy over and over; explore once in a while. I think that's good advice for human learning in general.
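Continuing the earlier scheduler sketch, epsilon-greedy action selection is the standard way to build in this kind of occasional exploration; the epsilon value is an arbitrary knob:

```python
import random

EPSILON = 0.05   # explore with 5% probability (example value)

def schedule_eps(state, legal_actions):
    """Mostly exploit the learned policy; occasionally pick a random
    legal action so the agent does not get stuck in one policy."""
    if random.random() < EPSILON:
        return random.choice(legal_actions)      # explore
    # exploit (Q is the table from the earlier sketch)
    return max(legal_actions, key=lambda a: Q[(state, a)])
```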
Okay, so after all of this is done (you can read the paper for more details, and getting the hardware right is not easy either, as you will see; there's a lot of hashing, indexing, etc. that you need to do), what are the benefits? We looked at parallel applications, and essentially the state-of-the-art policy at the time was first-ready first-come-first-serve, which is the 1.0 point here. On average, we found that reinforcement learning buys you about 19%, which is not a bad result, actually, because if you magically get rid of all of the timing constraints, all of them except the data bus conflicts, I should say (because data bus conflicts are fundamental: you cannot send multiple pieces of data at the same time, otherwise there is no bandwidth constraint, and scheduling is not a problem if you have no bandwidth constraint at all; when we talk about simulation, we'll talk about simulators that used to have no bandwidth constraints and as a result modeled memory in terrible ways), so if you have only the data bus constraint and no other timing constraint, you get about 70% performance benefit. We're getting 19% of that, which is not bad, and 70% is a pretty aggressive upper bound. I should also say the numbers may look low because these are the applications we could simulate at the time; today's applications are much more intensive, and we could have done maybe even more intensive simulations. Basically, we found that the performance improvements are consistent, and they're also robust across many human-designed policies; we also looked at policies that use the same information, as I mentioned. Any questions? Okay, I'm running much more slowly than I thought, so we'll keep going for a while. Yes? You mean the simulation infrastructure? Yes, we built it, but this is only simulation; building hardware for it would have been very expensive at the time. No, no: here you really have dedicated hardware that does reinforcement learning in the memory controller. Yeah, exactly: there are tables, not terribly large, I think 32 kilobytes overall, but it's really optimized to be built into the memory controller. You can think of this as a hardware accelerator for reinforcement learning, built inside the memory controller. Yeah, it cannot be generally used by any reinforcement learning mechanism, and you cannot build this in the CPU, because the communication latency would be too high. In other works, which I'll maybe mention later, we look at building things in software, so you can actually download those, but they're not applicable to memory controllers. Okay, so let's talk about the advantages and disadvantages of this. It clearly has performance advantages, because it enables continuous learning in the presence of a changing environment and workload, so you're not stuck with a single policy in the end. You also have a reduced designer burden in finding a good scheduling policy: the designer doesn't specify how the memory controller should schedule things; they only specify which system variables might be useful for scheduling, and what target to optimize, but not how to optimize it. I think this gives a lot of freedom to the underlying scheduler to find a good policy. So my criticism of existing systems is this: there is a memory controller on my laptop, which is five years old by now, and it was designed, let's say, three years before my laptop was built, so its age is eight years, and it hasn't learned anything. It's doing exactly the same thing all the time. Is that a good design? It's good to think about that. It has seen a lot: lots of workloads, lots of use cases, etc., and it has learned nothing.
It's the dumbest thing you could ever find, perhaps. Of course, maybe I'm exaggerating, but it didn't learn anything, right? Okay, so there are of course difficulties in this sort of approach; it was even more difficult when we were doing this 16, 17 years ago. Basically: how do you specify different objectives? There is work that comes after this, and I think it's important to continue that line of work. Hardware complexity: how do you make it lower complexity, or how do you generalize it a little bit so that other things can use it as well, potentially? But I think the hardest parts are really the mindset and the flow of the design; the mindset includes testing, verification, validation. How do you validate or verify these designs? It's not an easy question. Normally (actually, not sometimes, usually), when a hardware designer who is a domain expert designs policies like this, they also come up with a set of microbenchmarks. This is a good design paradigm, I would say: you come up with a set of microbenchmarks, and based on the policy you designed, you expect some performance or some behavior out of those microbenchmarks, and you do some functional testing based on that behavior. If you don't get the behavior you expect, either your microbenchmark is wrong or your policy is wrong, because something is wrong, so you figure out what's wrong, and it's relatively easy to figure out in that case, because you know your microbenchmark and you know your policy exactly. Here, both of those are hard. How do you come up with a microbenchmark? Okay, maybe you came up with one, and you have some expectation, but how do you know what to expect? Because the policy is changing; the policy has a lot of things that you have not controlled. Clearly you have not designed the policy, because it's acting on its own; it has a mind of its own, literally, in this case. So how do you test it? That testing paradigm is gone, in a sense. You could of course do lower-level testing that doesn't rely on intuition: you could exercise whether the update function is working correctly, whether you're doing the right operations, but you cannot really test and verify the policy at a higher, insight level. That's the difficulty in design, and this is one of the difficulties of adoption in general: how do I relinquish control of my testing to an agent, do I trust that agent, etc. This is a general issue with machine learning. In this case it's about performance, but in many other cases machine learning is being used for potentially safety-critical things: do I trust that? This trust issue is very fundamental; even if you do something just for performance, like this, there will be people who will not trust what's going on, because they don't understand it. So explainability is a big problem in general. Okay, I will not go into more detail, but this is an area that definitely needs more work; we need more ideas, certainly, and we need to work more on explainability of these systems. At some level, explainability is also very difficult, because it's very difficult to get complete explainability. And I should also say that debugging is difficult, because once you start debugging, again, performance debugging,
we could verify things at a very low level: whether the circuit is working, whether the multiplications are being done correctly, whether the update function is working. But how do you do the performance debugging if you're not getting the performance you expect, for example? Again, how do you deal with the question: what is the policy? You could, cycle by cycle, look at every single relevant value in the table, but that becomes very, very difficult. So we really need much better verification and debugging tools to enable machine learning as a fundamental component in hardware design in general. Okay, but that's the paper, and if people are inspired and want to do more work in this area, I think there's a whole lot of potential. This paper was invited as a retrospective, so if you're interested in our retrospective after 15 years, you can look at it; it was invited last year to ISCA as part of their 25 years of ISCA issue. So there is more work in this area; I'd also recommend people take a look at this paper, but I believe there needs to be even more work. Maybe Rahul knows more recent work; is that all, you didn't add it? Okay, yeah, I'm going to talk about that, yes, very briefly. So I think this is an example, while we're talking about memory controllers: this is a self-optimizing memory controller, but I think in general this concept should be applied to more things in architecture. So I'm going to take a critical view of systems, and especially hardware architecture, designed today: it's mostly human-driven. Humans design the policies, and they dictate exactly how the hardware should behave. As a result, we have many too-simple, short-sighted policies all over the system: memory controller, cache, interconnect, schedulers, whatever you find; instruction schedulers, all kinds of schedulers everywhere. They're all short-sighted and very simple policies, and there's no automatic, data-driven policy learning. The same is true for branch predictors, for prediction mechanisms as well, though people have looked at perceptron predictors, as we covered in DDCA, for example, so things are getting better there. Essentially there's almost no learning, as I said: all of these hardware components are designed once, and they see a lot; they have the potential to learn a lot from the data that's flowing through the system, but they don't learn anything over many, many years. The question is: can we take a fundamentally different approach to designing architectures, and can we make them better, more intelligent? I don't know what "fundamentally intelligent" exactly means, but this reinforcement learning agent is more intelligent than an FR-FCFS scheduler, for example. So what's an intelligent architecture? It needs to be data-driven, basically: it needs to learn based on data; over time it gets better; hopefully it learns the best policy and adapts; and hopefully it has sophisticated, workload-driven, changing, and far-sighted policies. We call this automatic data-driven policy learning, and if all controllers are data-driven agents, maybe you have a much better system. And maybe they need to coordinate with each other, too, but that's true even in existing systems, as we have discussed: maybe the last-level cache policies should coordinate with the memory controller policies. Similar to that, I think intelligent controllers should also be communicating. So in the end, I think we need to rethink the design of all controllers, and how we design hardware, in this way.
So this was one of the papers. More recently, Rahul has been leading some work where we applied reinforcement learning principles to prefetching, in particular offset-based prefetching. Rahul can talk more about that; we'll actually have some lectures on prefetching, so stay tuned. He's also more recently looked at using not necessarily reinforcement learning but simpler things, like perceptrons, to decide, or to predict, which load request will go off-chip. And we've also been looking at using reinforcement learning in storage systems; this is where we have artifacts, and it's completely in software, because in storage systems you need to make a decision about where to place the data across multiple different storage types, for example, and that decision-making needs to be intelligent if you want better performance, better fairness, better tail latencies, etc. This paper looks at that issue, and you can actually build on it if you're interested. Okay, so basically this is the data-driven part of the three things that we have discussed in this course so far. We've clearly discussed data-centric, and we're going to discuss more of that; data-driven is what we discussed just now; and data-aware is basically, how do you make architectures more aware of the data? I'm not going to go through this in detail, but you can look at it, and we have papers on this topic. So I believe (I mean, we're not saying that we're imitating what's going on over here) that if you design architectures that are more intelligent, maybe we're getting closer to what this thing is trying to do. I don't think it's using fixed policies everywhere; it's kind of weird to imagine that, because that would go against the fundamental learning nature of beings, human beings. Okay, so I have a bunch of stuff over here; I wanted to talk about a few things, but before that we'll take a break. I will mention that we have more material on memory-centric computing; I think next week, Friday, hopefully, we're going to cover more memory-centric computing, but we also have an interview on memory-centric computing that was published just today, so if you're interested in a little bit of the history of processing-in-memory and what we think about it, you can take a look at that. The reason I wanted to bring in memory-centric computing is that I'm going to make the memory controller and the memory interface a little bit more memory-centric when I talk about this next work. But let's take a quick break; let's be back at 15:21, and I'm going to cover the remaining slides, and then we're going to part for today. Does that sound good? All right.

Oh, right on time, okay. So, if people want to learn more about reinforcement learning, you can certainly read those papers, especially the hardware-oriented ones, but there's also a nice book by Rich Sutton and Andrew Barto on reinforcement learning; it's all free online, and you can find the PDF if you search for Rich Sutton, who's done a lot of work on this topic. Maybe we can put it on the website and link it; whoever can help. Okay, so I want to conclude the memory controller lectures with a work that rethinks the interface a little bit. Not hugely, but I think it's an improvement over the existing interfaces. Essentially, if you look at the interface that I described,
the memory controller dictates everything, right? A DRAM chip has no freedom; it's not autonomous; it's really just obeying what the memory controller says. The DRAM chip itself cannot power down anything; the DRAM itself cannot refresh anything; it basically obeys what the memory controller does. Maybe internally it does something like RowHammer protection; it potentially can, but it has to hide it from the memory controller; it cannot make big changes. So it has no autonomy, in that sense. And if you think about it, this goes against all the principles of memory-centric computing that we discussed earlier; that's why I have these slides over here. We talk about memory-centric computing, memory-centric computing, but memory is an incapable agent, and that's how we've been treating it for many, many years. Another way of thinking about the interface is to have equal partners. Equal partners means they can communicate with each other and do a good job executing a function together: the memory controller says, memory, can you do this for me, and the memory says, no, I'm busy right now, I cannot do it, but I'll do it later; something like that. This is more of an equal-partner, or request-reply, type of interface, and memory can sometimes request things from the memory controller as well, perhaps. God forbid: you cannot ask for that in today's systems, because you'd be violating something really important, which is the processor-centric design paradigm. The processor cannot be asked anything by someone else, but the processor can ask everything of others. If you think about it, the current system design paradigm makes little sense in this respect. And then we ask a lot from memory. We have seen a lot of issues: refresh, RowHammer, latency; and we tell the memory manufacturers, it's your problem, solve RowHammer. Now, does it really make sense to inflict such pressure on memory manufacturers while giving no freedom to memory manufacturers? This is a very bad state of affairs, in that sense. I'm no defender of memory manufacturers, as you probably know; I actually think there are a lot of issues in memory manufacturing, but there's something wrong that's even bigger in the system, which is really how we treat memory and how we interface with it. That's why I think this work is important. It took six rejections to get this paper accepted to MICRO; you'll find it online, and it's going to be presented at MICRO by Ataberk. Ataberk, do you want to present it now? No? Yeah, Ataberk is going to have much better slides than what I have over here. I asked now, and you gave the right answer; unless you want to present it? Okay, you will do a much better job at MICRO with much better slides. So basically, that's the idea of this work; I think I already said the idea. The memory controller manages everything, including the maintenance operations that, someone can rightfully argue, are the job of the DRAM itself. What are those maintenance operations? There are at least three: DRAM refresh, RowHammer protection, and memory scrubbing. Memory scrubbing is basically, once in a while you go through memory and correct things that may have been corrupted, using ECC: you read and write memory locations to scrub them, and make sure that errors are corrected before more errors accumulate than the error-correcting code can handle.
That's the idea. These are operations that a CPU chip should really have nothing to do with, in my opinion, in a sense; but because of the way we divide responsibilities between memory controllers on the CPU chip and memory chips, these operations have to be controlled and dictated by a CPU chip today. So if you want to change these things, it's a painful process. It requires changes to the memory controller, because the memory controller needs to issue new commands, to do RowHammer protection, for example, and changes to the DRAM interface so that the memory controller can issue those new commands. People have introduced, for example, refresh management commands and dynamic refresh management commands to do RowHammer protection better, for some definition of better, and more recently people are introducing per-row activation counting, which also requires some support from the memory controller. Basically, these all require changes to the interface, the memory controller, and the standards, and they take a really long time; essentially they become difficult to realize. It requires a new standard, or an update to the standard, and we've already discussed how painful and how political a process this is: you have to go through this committee, there are hundreds of companies that are part of it, they don't want to change anything, and any new idea that you want to push through the committee usually gets shot down. As a result, going from the DDR4 to the DDR5 standard took I don't know how many years; eight years, maybe longer; and who knows how long the next standard will take. Of course, these are not the only changes that go into a standard, but basically, change is very slow this way. The idea in this work is to decouple the concerns: have a better interface such that some things can be done internally in the DRAM, especially those things that the DRAM manufacturer can do much better alone, and some other things are done in the memory controller and don't need to be exposed to memory. This is the idea of Self-Managing DRAM, to enable more autonomous and efficient DRAM maintenance operations. The key idea is very simple: we don't change the interface completely; it's a very simple, small change, which basically prevents the memory controller from accessing a DRAM region that's under maintenance; it could be one or more regions. And the mechanism by which the DRAM chip prevents the memory controller from accessing a region is by rejecting an activate command: if it gets an activate command to a region that it's refreshing, for example, the memory chip says, try again, memory controller. That's it, basically. And you can change the interface such that this delay is predictable, or bounded, etc.; the paper discusses all of those issues. This way, the memory controller doesn't need to tell the DRAM to refresh: the DRAM decides what to refresh and when to refresh, based on the characteristics of the cells, based on whatever it thinks is useful, and hopefully it keeps the refresh of the particular region that it locks very quick, and the memory controller hopefully doesn't even access that region while the DRAM is refreshing it. Otherwise, the memory controller needs to send refresh commands periodically, as it does today.
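Here is a tiny sketch of that reject-and-retry handshake as I understand it from the lecture; the region granularity, the signaling pin, and the retry policy are all illustrative assumptions, and the paper defines the real interface:

```python
class SMDChip:
    """DRAM chip that may lock a region while it maintains it
    (refresh, scrubbing, RowHammer mitigation, ...)."""
    def __init__(self, num_regions):
        self.under_maintenance = [False] * num_regions

    def activate(self, region):
        if self.under_maintenance[region]:
            return "NACK"   # "try again" -- e.g., signaled over one added pin
        return "ACK"        # row opens as usual

def controller_activate(chip, region, request, retry_queue):
    """Controller side: on a reject, simply re-queue the request. Because
    the lockout is bounded, worst-case service time stays analyzable."""
    if chip.activate(region) == "NACK":
        retry_queue.append(request)
        return False
    return True
```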
Okay, so what does this require? It requires a single pin added to the interface, or you could abuse the alert pin that we discussed earlier. I don't like abusing things, meaning using something that was designed for some not-completely-well-specified purpose for some other purpose; that's probably not great. Or maybe you redefine the pin such that it says: I'm going to NACK, meaning, I cannot do what you're asking me to do right now, try again. That's the idea. This way, essentially any maintenance operation can be implemented completely within a DRAM chip. But this could also enable other opportunities: perhaps the DRAM chip is doing computation in a region, and it basically says, I'm doing computation in this region, I'm not going to honor your request right now, I'm busy with computation. So this sort of interface could enable other things as well, I think. In the paper we talk about how to do DRAM refresh, RowHammer protection, and memory scrubbing. We look at refresh as it is done today, fixed-rate refresh; we also look at RAIDR-like mechanisms, variable-rate refresh, which we have discussed multiple times; and we look at probabilistic as well as deterministic RowHammer protection. Essentially, you can do all of these, optimized or non-optimized versions, inside the DRAM, almost without changes to the mechanisms proposed in the past, except that the interface is much more flexible and the DRAM can do it internally, without the memory controller being in charge. Also memory scrubbing. The paper is more detailed; these are some old results, and you should read the current paper, but I think this is essentially one of the major benefits: by making this single change to the DRAM interface, hopefully you enable other changes to be made by the DRAM manufacturers, nicely, without changing anything else in the interface, without going to the JEDEC standards committee, etc. I like this paragraph from the Self-Managing DRAM paper; I'm not going to read it, but essentially this gives the DRAM chip breathing room, so that it can autonomously perform maintenance operations, and hopefully interface modifications for implementing future maintenance operations become unnecessary in the end. Okay, so if you're interested: this was part of Hasan's thesis, and he has a talk on this topic, and Ataberk will soon have a talk on it too, which is going to be much nicer than what I just covered. Any questions? Okay, let's see. In the remaining time I'll quickly go through memory interference. I won't go through a lot of it; this is a foreshadowing of what we will discuss in the lectures next week. This is still about memory controllers, but memory interference is a bigger problem. Essentially, multiple threads interfere with each other when they access the memory system, and, more and more, this is becoming a recognized problem. If you design a memory system that doesn't distinguish between different threads' requests, you may have problems, meaning one thread can deny service to another. If your memory control algorithms are not aware of different threads' requests, for example different hardware threads' requests, they may be unfair to different threads, and as a result aggressive threads can deny service to others,
and you may also lose system performance because of this. I'm going to give you one example that we're going to cover in more detail in the next lecture, but since we're talking about memory controllers and we've talked about scheduling policies, let's take a look at how you design a scheduling policy for multiple cores. I'm going to keep this simple: cores will interfere with each other only in the DRAM memory controller in this case, and we're going to run two applications, say one streaming application and one random access application. This is what we will observe: the random access application gets denied service, and it keeps waiting and waiting and waiting in the memory controller. And this happens in real systems; well, it used to happen more in the past, I don't know, I haven't tested the most recent systems yet. This is because there's unfairness in the memory controller, right? The memory controller is designed for a metric that's not fairness. This is the stream application that we ran at the time, we're talking about 2006, on real systems. Essentially it has sequential memory access, very high row-buffer locality, and it is very memory intensive because we ensure that everything misses in the caches. And this is the control application, a random access application, which has exactly the opposite characteristics except for memory intensity: it has similar memory intensity, but it does random memory accesses and as a result has very low row-buffer locality. This is not exactly how you would write these applications, because the random function call takes a long time; if you really want to measure the time of these applications carefully, you should do some optimizations here, but you can find that information online in our source code. So what does the memory hog do? We call this the memory hog program because it hogs the memory bandwidth against other applications in general, but specifically against the random access application. If you run these two applications together, assume they are accessing the same bank. The streaming application opens the row buffer and keeps accessing sequential columns in the same row; it accesses row zero. Then a request comes from the random access application, and the memory controller deprioritizes that request, because it prioritizes things that hit in the row buffer. So the streaming application keeps accessing the same row that is open in the row buffer, and its requests keep getting prioritized over the random access application, which is trying to access a different row in the same bank but doesn't get its chance, because the memory controller is designed to maximize row-buffer locality. You could keep animating this, but it gets boring after some point. Essentially, as you can see, this is a denial-of-service problem, until the streaming application switches to another row or stops accessing this row, or a timeout is enforced, which is not nice in general, timeouts, but it's one way of solving this problem. The streaming application hogs the memory bandwidth within this bank, and the random access application doesn't get its requests serviced. You can do the calculations if you want: essentially a large number, hundreds of requests, of the streaming application may be serviced before a single request of the random access application.
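As a concrete illustration, here is roughly what the two microbenchmark kernels could look like; the array names and sizes are my own illustrative choices, not the exact code from our repository. Note the point above about rand() being slow: the random indices are precomputed outside the timed loop. As a rough feel for the "hundreds of requests" calculation: with a typical 8 KB row and 64 B cache blocks, a single open row alone can supply 8192 / 64 = 128 consecutive row hits before the streaming thread even needs to open a new row.

```c
/* Sketch of STREAM-like and RANDOM-like microbenchmarks (illustrative only).
 * STREAM walks large arrays sequentially, so nearly every DRAM access hits
 * the currently open row; RANDOM touches random locations, so nearly every
 * access conflicts with the open row. Both are equally memory intensive as
 * long as the arrays are far larger than the caches. */
#include <stdlib.h>

#define N (1 << 24)               /* 16M elements: far larger than caches */
static int A[N], B[N], idx[N];

void stream_kernel(void)          /* high row-buffer locality */
{
    for (long j = 0; j < N; j++)
        A[j] = B[j];
}

void random_init(void)            /* rand() is slow: precompute indices,
                                     keep the call out of the timed loop */
{
    for (long j = 0; j < N; j++)
        idx[j] = rand() % N;
}

void random_kernel(void)          /* low row-buffer locality */
{
    for (long j = 0; j < N; j++)
        A[idx[j]] = B[idx[j]];
}
```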
We actually did this on real systems at the time we were studying this problem. We ran a streaming application, MATLAB, which does a lot of streaming accesses to matrices, and we also ran GCC, which is not exactly random access, but there are enough random accesses to see the problem at least. These are the slowdowns that we measured, on a system that looks very similar to what I showed: caches are not shared, and you have sharing at the memory controller. You can see that one application slows down a lot more than the other, so it's quite unfair; MATLAB doesn't slow down. This may or may not be a problem, right? Maybe you don't care about fairness; maybe what you're really trying to prioritize is MATLAB. But if what you really care about is GCC, then you have a problem. Basically this is due to the unfair scheduling policies of the memory controller, the FR-FCFS policy that we've discussed, which is commonly used as a building block of a bigger policy in memory controllers today. It prioritizes row hits first and then oldest requests first, and both of these scheduling decisions are actually unfair. The policy tries to maximize DRAM throughput, but when multiple threads share the DRAM system, it's unfair. Row-hit-first unfairly prioritizes threads that keep hitting in the row buffer. The oldest-first rule is also unfair, because it implicitly prioritizes threads that access memory a lot: if one thread floods the memory system with lots of requests and another thread has only one request once in a while, the flooding thread is really denying service to the other, because all of those flooding requests will, at some point, appear older to the memory controller than the single request from the other thread. So both of these rules are quite unfair policies. We actually ran these workloads, stream and random, as you can see, and we see significant slowdown differences. These are the stream and random applications that I showed you earlier; you can think of them as microbenchmarks, but they are similar to kernels used in real applications. Stream slows down a little, random slows down a lot. And when you run stream with other applications, you again see similar denial-of-service behavior, and this becomes a bigger problem with more cores: as you add more cores, these slowdown spreads actually increase. libquantum, for example, is a quantum computing simulator that is very streaming, and it acts like a streaming memory performance hog, so other workloads slow down a lot more. This is another benchmark from the SPEC suite, a video encoder/decoder, as you can see. Basically, if you have this sort of unfairness in the memory controller, you're vulnerable to denial of service; someone can actually maliciously write these programs. Whenever you're vulnerable due to unfairness, malice can be at play also: someone can exploit that unfairness to deny service and starve the others, just like with any resource, even in human systems, right? Humans are terrible at resource sharing in general, so starvation happens in the real world also, sometimes because of unfairness issues, but sometimes it can be exploited maliciously, and it is exploited maliciously here with the streaming application.
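To see the two rules in one place, here is a minimal sketch of the FR-FCFS selection logic described above; the struct layout and names are my own illustration, not any vendor's implementation. Notice that thread_id is carried in each request but never consulted, which is exactly why the policy is thread-unaware and can be unfair.

```c
/* Minimal FR-FCFS request picker: among queued requests, prefer hits to
 * the currently open row (rule 1); break ties by age, oldest first
 * (rule 2). No rule ever looks at which thread issued the request. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t row;        /* DRAM row this request targets   */
    uint64_t arrival;    /* timestamp when request arrived  */
    int      thread_id;  /* present, but FR-FCFS ignores it */
} request_t;

int frfcfs_pick(const request_t *q, int n, uint64_t open_row)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (best < 0) { best = i; continue; }
        bool hit_i = (q[i].row == open_row);
        bool hit_b = (q[best].row == open_row);
        if (hit_i != hit_b) {
            if (hit_i) best = i;                  /* rule 1: row hit first */
        } else if (q[i].arrival < q[best].arrival) {
            best = i;                             /* rule 2: oldest first  */
        }
    }
    return best;   /* index of the request to schedule next */
}
```

A streaming thread that keeps the queue populated with row hits wins rule 1 on every cycle, so a row-conflict request from another thread can wait indefinitely.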
So basically it could be denial of service, and system-level control may be lost: maybe at the system level I really want this particular video encoding application to get priority, but I cannot enforce that, because I have no way of communicating to the memory controller to say, you, prioritize this thing. And you also get low system performance, because your system doesn't progress very well: this core makes some progress, maybe, but it's very memory intensive, so it's slow progress, and the other cores may not make a lot of progress. In fact, if you prioritize some of the less intensive cores, maybe they're going to make much better progress. So this is really bad for overall system performance, or throughput, as well. In the end, if you don't control such things, if you design controllers that don't take these into account, you have an uncontrollable, unpredictable system, and this matters for many applications; you can see many works on this. We're going to talk more about this in the next lecture, but I'm going to give you an overview of what we're going to talk about. We talked about memory performance attacks; essentially, we're going to look at mechanisms to control this. How do you solve this problem? If you have this uncontrolled behavior, unfairness, starvation, denial of service in these memory resources, you need to build mechanisms to control it. This is not rocket science, it's obvious, and it applies to much broader things that are not the subject of this course: resource sharing is really a big problem. We're going to talk especially about memory controller mechanisms, but we'll also briefly talk about mechanisms that are applicable to interconnects and caches. Essentially, we need to design interference-aware, quality-of-service-aware memory systems, so we'll talk a lot about quality-of-service-aware memory scheduling. We talked about performance earlier, right? But now we're not going to talk about only performance, though it's still going to be important: how do we schedule requests such that we provide high system performance, high fairness to applications, and also configurability to system software, such that the system software can designate which applications get priority, or what kind of service level each application should get? I think this is a fascinating and important problem. To be able to do this, your memory controller needs to be aware of threads, threads meaning hardware execution contexts that are sending requests to the memory controller. We have done a lot of work in this area. I'm not going to cover it, because we have very little time, but these are some of the works that we're going to cover next time: a lot of scheduling, a lot of interesting ideas, and a lot of theory behind these scheduling mechanisms as well. And sometimes you get good comments from reviewers also: this is a comment from ISCA 2008 that we got for the PARBS paper, which ended up being implemented in some memory controllers. You can read that short review; a positive review, sometimes it happens. And then there are other papers that we're going to cover, and a lot of readings that I'm not going to go through right now.
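As one way to picture what thread awareness plus system-software configurability could look like inside the scheduler, here is a hedged sketch. This is not PARBS or any specific published scheduler, and the per-thread priority register interface is an assumption; the idea is simply that an OS-set priority is consulted before the throughput-oriented row-hit-first and oldest-first rules.

```c
/* Hedged sketch of a thread-aware, QoS-aware request picker (illustrative
 * only). System software writes a per-thread priority, e.g. through
 * memory-mapped registers; the scheduler applies it first, and only falls
 * back to the FR-FCFS rules among equal-priority requests. */
#include <stdint.h>

#define MAX_THREADS 64

typedef struct {
    uint64_t row;        /* DRAM row this request targets  */
    uint64_t arrival;    /* timestamp when request arrived */
    int      thread_id;  /* now actually consulted         */
} request_t;

static int thread_prio[MAX_THREADS];   /* written by system software */

int qos_pick(const request_t *q, int n, uint64_t open_row)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (best < 0) { best = i; continue; }
        int pi = thread_prio[q[i].thread_id];
        int pb = thread_prio[q[best].thread_id];
        if (pi != pb) {
            if (pi > pb) best = i;                /* rule 0: OS priority   */
        } else if ((q[i].row == open_row) != (q[best].row == open_row)) {
            if (q[i].row == open_row) best = i;   /* rule 1: row hit first */
        } else if (q[i].arrival < q[best].arrival) {
            best = i;                             /* rule 2: oldest first  */
        }
    }
    return best;   /* index of the request to schedule next */
}
```

With such a rule 0, the system software can, for example, guarantee that the video encoder's requests are never starved by a streaming hog of equal or lower priority.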
But I will end with something that looks like this: the problem is getting much worse. We have all different types of cores, general purpose and special purpose, massively multi-threaded, etc., and then there are also the I/O access engines, direct memory access, I/O accesses; they're all sharing the memory controller, everybody's going through the memory controller today. The question is how you design the memory controller to provide good values for all of these metrics, and also configurability, such that the system software can actually configure things. So hopefully I've convinced you that memory controllers are critical to research and there needs to be more done in this area. I believe we actually need more breakthroughs here, because memory controllers will become more important; they're not going to go away. Increasingly, they will also have to facilitate computation offloading coming from these different cores, so that will be another thing: on top of memory accesses, you also need to deal with computation offloading coming from these cores. That's why we dedicated one lecture to this. I thought it would be a shorter lecture, but it ended up being longer; I think it was important, though. Next week we're going to start with quality-of-service-aware memory systems, service quality, but we're not going to forget performance, because service quality is easy if you don't care about performance: say we over-provision the resources a lot and don't share. Then it's very costly; you get service quality, but performance is terrible if you think about the resources you're dedicating. Performance is always a function of how efficiently you utilize resources, and we should never forget that. You can get very high service quality if you don't keep all of your resources busy, but that doesn't mean high performance, because so many resources are getting wasted. Okay, any burning questions? No? Okay, everybody's ready to go for the weekend. I'll let you go; have a good weekend, I'll see you all next week.