Anyway, we'll discuss that later. Sounds good? Okay, we should get started; let's close the door. Let's get started, shall we? It's a smaller class today; there must be some event going on that I don't know of, in the middle of the semester. Today we're going to cover yet another exciting topic, prefetching, which I think is going to become even more important going into the future as the memory problem keeps growing in importance. We'll probably cover some of prefetching today and some of it tomorrow, and I intend to cover simulation as well tomorrow, because that is such an important lecture: a lot of what we do for future architectures is simulation, because we cannot build hardware for everything, and we cannot even have hardware for everything we can imagine. So it is important to do simulation for future architectures, and you've been reading a lot of papers that rely on simulation; you're going to see some more today when we talk about prefetching. A lot of the ideas don't get implemented right away. It's very difficult to have a big idea and also have the platform to test it. That's why a lot of big ideas get developed by people imagining things and modeling them in a way that may not be perfectly realistic, but that's okay: if the idea is big enough, you can model it, see its effects, and maybe 10 or 20 years down the road people will implement it. Some of the things we're going to discuss here in prefetching are like that.

Okay, so let's get started. We have a busy agenda, meaning we have a lot to cover in prefetching. We're going to talk about what prefetching is, then the four questions in prefetching, and then basic prefetching schemes: software, hardware, and execution based. Execution-based prefetching is usually software-hardware cooperative, but the other schemes can be software-hardware cooperative as well. We'll talk about metrics to evaluate a prefetcher with, then throttling, and then advanced prefetchers; I don't know if we'll get to those today, but we'll definitely get to them tomorrow, for example reinforcement learning, and runahead execution, which was my own PhD thesis. We're also going to talk about issues in multi-core and multi-threaded systems, probably some of that today, but let's see. Does that sound good? Okay.

So why are we prefetching? Essentially we have a memory latency problem, as you have seen in earlier lectures, and you have seen this picture I showed you where capacity and bandwidth have improved significantly but latency has not improved as much in memory. We've discussed the reasons for this, so I'm not going to go back to them. Clearly part of it is design choice, industry choice, historical choices, and part of it is that it is fundamentally more difficult to improve latency. So this is a fundamental problem in the end, and that's one of the good things to focus on in research: if I have to pick a research problem, I would pick the difficult one, because that's where you can have a lot of impact; an easy problem, maybe a lot of people can work on it. Anyway. You also see in this picture that memory capacity, DRAM chip capacity, has improved greatly over 54 years or so, whereas latency has been reduced not so greatly, by even less
than an order of magnitude, and if you look at recent years it has not been improving; in fact it has been increasing in some cases. We've also seen that memory latency is critical for performance; these plots are there to jog your memory, and it matters for many applications. We've also seen more recent studies that look at new DRAM types, different DRAM types, and how they actually increase latency and how that affects performance: the performance of some workloads actually decreases when you go from an old DRAM type to a new DRAM type, because latency increases. Some other characteristics become better, like throughput and bandwidth, but if your workload cannot exploit that throughput, which may require significant rewriting of the workload, then your workload's performance decreases. This is clearly not good news. Ideally you would like performance to get better with better memories, with better hardware of any type; if it goes the other way around, you have a big problem.

Okay, so people have recognized latency as a problem, and they have developed many techniques. We're going to cover some of these techniques in this lecture, and I'll refer you to other lectures for some of the other techniques. In fact we've covered some techniques in Digital Design and Computer Architecture; caching, for example, is a very common technique. But I'm going to classify them, a little more or less rigorously, into latency reduction, latency hiding, and latency tolerance techniques. So let's take a look at what these mean. I think we've seen three things, essentially. The first set of techniques fundamentally reduces latency, as much as possible, at the source, and we have seen those when we talked about memory latency. I would call this a more data-centric or memory-centric approach: I have a latency source, and I want to cut that latency down as much as possible so that I can access data quickly. Fundamentally this is data-centric in a sense, because you want to reduce the access latency to the data, and we've seen a lot of techniques to do that in the memory latency lecture; there are more, and you can read about them, and maybe you have been reading about them in the papers, homeworks, etc. The other two approaches are more processor-centric, and today we're going to be processor-centric, but this doesn't mean these two approaches cannot be employed elsewhere; for example, a processing-in-memory system can also try to hide latency even more, because maybe you cannot fundamentally reduce latency as much as possible, or as much as required, to get even higher performance.

Okay, the second approach hides the latency seen by the processor. Fundamentally, by definition, this is processor-centric: the processor needs to access the data, and you hide the latency as much as possible. Caching is one example: you cache data so that you don't need to access the same data element again and again and again. This reduces latency from the processor's perspective, but it doesn't reduce the fundamental latency of bringing the data into the cache to begin with. That's the difference between the first approach and the second approach; they can of course be employed together, but that's a different matter. Prefetching also hides the latency seen by the processor, and we're going to focus on that today. But there is a third approach, which is tolerating the latency
seen by the processor. I think this is fundamentally different from both of the other approaches, and the idea here is to overlap latencies, meaning amortize the cost of the latency. Again, this is a processor-centric approach: the processor is accessing some data, it has some latency, you couldn't hide the latency, you couldn't cache it, you couldn't prefetch it, you didn't reduce the latency fundamentally as much as possible, or maybe you did and there is still a lot of latency left. How do you deal with that latency? One way is to do other things in parallel, so that none of those latencies becomes the bottleneck altogether; you parallelize everything. I actually had my favorite example as I was coming to lecture today. Because I didn't have enough time, I didn't have lunch. One way of having lunch, which has its own latency, would have been to serialize: take the latency, sit in my office, eat my lunch, and then come here. But then I wouldn't make it in time. So what I did was eat my lunch while walking. That's a perfect example of tolerating the latency of eating lunch: I really hid the latency of that lunch by doing something else. And we do this all the time; otherwise everything would be extremely serial. Processors do it as well, by trying to parallelize things as much as possible. Multithreading is one example: while one thread is accessing one data element, another thread can be executing and accessing some other data element, and another thread after that. You may have thousands of threads executing, accessing different data elements, and all of those latencies can be overlapped such that you effectively see only one access latency per thread. Out-of-order execution, which you have seen in Digital Design and Computer Architecture, is another example. And we're going to talk about runahead execution, which is actually a special type of prefetching mechanism that tolerates the latency seen by the processor; it's an interesting one because it combines both of these approaches. Does that sound good?

Okay, so you've seen this slide before; it's a slide from my own PhD thesis proposal, actually, because I was working on memory latency in my PhD thesis, and clearly these techniques have been employed in many systems. I'm not going to go through this in detail, but caching is very widely used; we've seen examples of it. Unfortunately, it doesn't always work, and it doesn't work for random access patterns, for example. If you have a purely random access pattern, your caches are essentially useless, because in a purely random access pattern you don't have temporal locality, and you likely don't have spatial locality either, if it's really random access at a small granularity. We're going to talk about prefetching; we're going to see interesting thoughts there. Multithreading is also good to think about: it works well if there are multiple threads, because while one thread is waiting you execute some other thread to hide the latency. This works well for parallel performance in general, but if you cannot parallelize your program very well, if it is serial, if for some reason what you have is only a single thread (and we will see those reasons in later parts of the lecture), there is a serial bottleneck in many
programs, and with that single thread it's very hard to hide latency using multithreaded hardware. People have tried this, and when we get to multithreading we may actually talk about it; it's an ongoing research effort, and it's not easy. Then there's out-of-order execution, where you execute instructions out of program order, such that while you're waiting for the completion of an older instruction, you can overlap the latency of its completion with other instructions that come later in the program. That's the core of out-of-order execution, and the Digital Design and Computer Architecture lectures go into a lot of detail on how to do it, but it requires extensive hardware in the end. You can also see that a lot of these techniques were developed in the 1960s, which is very interesting. These are not even the earliest papers; especially for caching, there's an earlier paper from 1962. I should really fix the slide; I wasn't fully aware of it when I made the slide 20-plus years ago. But a lot of these techniques are very fundamental, people have kept improving them, and there's still innovation going on in many of them. Okay, so there are lectures on all of these; today we're going to focus on prefetching.

So what is prefetching? Essentially, prefetching means that you load the data before it is needed by the program; that's why it's called prefetch. It's also been called pre-loading; some people have used that term, but prefetching came to dominate the terminology, I think. Why? Essentially, memory latency is high. If we can load the data early enough and accurately enough, then we can reduce or eliminate that latency from the perspective of the processor. Again, keep in mind that the processor could be the main processor or a near-memory processor; it could be the processor next to an HBM bank, for example, in a processing-in-memory engine. So this idea is applicable across many different paradigms. The upside is that it can eliminate compulsory cache misses. What is a compulsory cache miss? It is a miss to something you have not seen before: you access some data for the first time, you've never seen that address generated by this program, but you can predict it. Does that make sense? Caching usually works by seeing an address once; you bring it into the cache, and if you access it again, hopefully it's still in the cache. It may not be, if it was evicted by some other data, but if you've seen this address once before and it got evicted, it must have been evicted because you didn't have enough cache space or you didn't manage your cache well. So you can categorize cache misses into compulsory misses (you've never seen that address before), capacity misses (your cache was not big enough), and conflict misses, which I think of as essentially not managing your cache well: even though you had the capacity, you could have managed it better and kept the block in the cache. You have a question? Okay, no, good. So prefetching, in that sense, can be a very proactive technique: as a prefetcher, you don't need to have seen the address before. There are some particular types of prefetching that do require seeing the address before in the program, but there are other prefetching techniques, as we will see, that will generate the
address themselves; they will run programs to prefetch for the main program, as we will see. So it can be very powerful, and it's an area that is really ripe for a lot of creativity and innovation, in my opinion.

Okay, so can it eliminate all cache misses? I already discussed what a capacity miss and a conflict miss are. If you think about caches, there's also another kind of miss, called a coherence or communication miss, which doesn't really fall into any of those categories. This happens when you have multiple processors, as we will see in later lectures. One processor accesses address A and caches it, but then another processor wants to write to that location, which means you need to invalidate: one way of doing it is invalidating the data in the first processor's cache and bringing the data to the second processor. Now, the first processor has seen that address, it didn't mismanage its cache, and its cache is big enough, but the next time it accesses that address it will get a cache miss, because somebody else needed that data, and that somebody else was a different processor that shared the data. Does that make sense? This is called a communication miss or coherence miss, and it is the fourth category of misses; we will see a lot more of this when we talk about multiprocessors.

Okay, so caching is interesting, as you can see, and in all of this we're going to assume caching. That's also something to think about: why do we assume caching? Caching is good in the sense that locality is present in many programs, and that locality is very hard to exploit without caches. You used to be able to disable caches in the BIOS in the past; has anyone tried that? If you go into your BIOS, you could try to disable all caches. Does that setting still exist? Maybe it doesn't anymore; I don't know, but I'm old enough to have seen these things in BIOS settings on general-purpose computers. If you disabled those caches, you would see the effect on performance, and it is terrible: you would wait for minutes and minutes for Windows to boot up, for example. So caching is extremely effective, even though it may not be efficient; it's still effective because there's a lot of locality in programs. So we're going to assume caching exists in these prefetching lectures.

Okay, so prefetching clearly involves predicting which address will be referenced, accessed, or needed by the program in the future, and this works if programs have some sort of predictability in their miss address patterns. This is interesting: what is predictability? That's also good to define. Maybe I cannot define it fully, because predictability also depends on the technique you use to predict things, but there is a fundamental predictability problem: if the program's accesses are completely random, like the entropy is 100%, 1.0, you're probably going to have a very tough time coming up with a mechanism to prefetch those random addresses. But most programs don't have random accesses, and even if the addresses may look random to you, maybe you can find patterns in them by mining them. So we're going to see some ideas, some of them quite creative I think, that try to figure out the predictability in the miss address patterns, and I think there's more room
for improvement here, so keep your minds open. I should say something at this point: we have covered, for example in Digital Design and Computer Architecture, branch prediction. We have a lot of prediction and speculation mechanisms. Speculation means doing something before you know that it is necessary to do it. Prefetching is one example; branch prediction is another. Essentially, when you want to push instructions into the pipeline, you hit a branch and you don't know which path the branch will take, taken or not taken for a conditional branch, so you predict it. If you predict one way and the branch turns out to go the other way, you're wrong, but you already fetched instructions on the wrong path, and if you don't do something about it, you will execute wrong instructions. So branch prediction is a speculation mechanism that affects correctness immediately if you don't do something about it: if you mispredict a branch, you're on the wrong path, and you flush the pipeline, for example, and invalidate those instructions. Prefetching is nicer, or kinder in that sense, kinder towards the designer of the architecture, because if you mispredict an address, this hopefully doesn't affect correctness: prefetched data at a mispredicted address is simply not used. Does that make sense?

It's also good to think about the interaction of prefetching and virtual memory at this point. If you prefetch some address and you get a TLB miss, what do you do? TLB misses are okay, because those are maybe manageable. If you get a page fault, what do you do? And if you get an access protection fault, for example you're not allowed to access the location whose address you generated, what do you do? Since this is purely speculative, you should really not handle the protection exceptions, especially. Whether you handle page faults speculatively or non-speculatively is an interesting issue that we're not going to discuss in this lecture, but start thinking about it. Essentially, with a prefetcher you don't need state recovery, if you design it reasonably, and that makes things nicer in the sense that when you design the mechanism you don't need to worry about its correctness as much. There can be other issues: prefetching, for example, can cause side channels. People may try to figure out, for example, how long an access to an address takes, and from that try to infer the data you're accessing. I'm not going to talk about that right now; it's not a correctness issue, it's more of a privacy issue. Okay, so I'm contrasting this to branch misprediction or value misprediction.

Okay, let's talk about basics. As I said, I'm going to assume caching. In modern systems, prefetching is done at cache block granularity, meaning you prefetch cache blocks; you can prefetch at smaller or larger granularity than cache blocks also, but what you bring in, in general, is a cache block. It can reduce the miss rate of a cache, or the miss latency, or both. And it can be done by many parts of the stack: by the hardware, the microarchitecture; by the compiler, in which case it is architecturally visible; by the programmer, which again requires architectural visibility; or by the system, for example the operating system, and a lot of operating systems actually do this. I don't know if they still do, but there used to be, back when I was
at Microsoft, a feature you could enable where the operating system would learn what kinds of applications you launch at what times, actually using machine learning techniques, and it would prefetch the application from the disk into physical memory before you, hopefully, start the application. Clearly this saves some latency. So you can prefetch at different levels of the hierarchy, and in that case it makes sense to learn from system behavior and user behavior; user behavior affects prefetching accuracy in that case. We're going to talk a lot about hardware prefetching today, but I gave you that example to show that the prefetching idea is much more general: it really spans all parts of a system, and even our own actions, as we have seen.

Okay, this is an example; I will not dwell on it much because it is relatively old, one of our old papers, as you can see. It's an interesting paper because back then the memory controller was on the other side of the memory bus from the processor chip; it later got pulled into the processor chip due to advances in VLSI. But that's not the point here. The point is, you have a bunch of caches, and in fact the cache hierarchy keeps growing, and you have the memory controller; the question is, where do you do the prefetching? You could do prefetching in all of those places, and in fact in existing systems there are prefetchers in all of those locations. When we talked about memory-centric computing, I said that the cache hierarchy is part of the complexity, but there's also a prefetcher hierarchy, and there's also a metadata hierarchy, which is virtual memory; our memory systems are extremely complex today. Here is one example: a prefetcher that sits at the L2. It observes requests that go into the L2 cache, and it also sees what misses in the L2 cache, and this particular prefetcher can start tracking which addresses are going to be accessed by the processor. It can predict: the processor is accessing this region of memory, so I should keep prefetching before the processor accesses things in that region. So it can see, for example, addresses A and A+1 missing in the cache, and it can start prefetching A+2, A+3, A+4, A+5, and so on. Now imagine putting prefetchers across the whole hierarchy; that's what we have today.

Yes, that's a good question: could you also prefetch on a cache hit? Yes, you could, because a hit potentially gives you an indication that you're heading in one direction. Usually what happens in a lot of hardware prefetchers is that you create a tracking entry when you get a cache miss, meaning you're probably in a region you haven't brought into the cache before, and then you track the accesses that happen in that region. If you then get a cache hit in that region, that's an indication that you're really touching that region, so the prefetcher may say, okay, I'm going to prefetch ahead. We're going to see a prefetcher like that; it is actually employed in a lot of Intel processors today. These are all design choices, and that's exactly the kind of question we're going to cover: when do you prefetch, when do you actually start tracking regions, and so on.
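To make that answer a little more concrete, here is a minimal, simulator-style sketch of such a region-tracking prefetcher. All the names, the table size, the 4 KB region size, and the degree of four are my own illustrative assumptions, not a description of the paper above or of any real Intel design:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_REGIONS     16   /* tracked regions                          */
    #define REGION_BITS     12   /* 4 KB regions                             */
    #define BLOCK_BITS       6   /* 64 B cache blocks                        */
    #define PREFETCH_DEGREE  4   /* blocks prefetched per triggering access  */

    typedef struct { uint64_t region; bool valid; } region_entry_t;
    static region_entry_t table[NUM_REGIONS];

    /* Stand-in for handing a prefetch request to the memory system. */
    static void issue_prefetch(uint64_t block_addr) {
        printf("prefetch block 0x%llx\n", (unsigned long long)block_addr);
    }

    /* Called on every L2 access; 'hit' says whether it hit in the L2. */
    static void on_l2_access(uint64_t addr, bool hit) {
        uint64_t region = addr >> REGION_BITS;
        uint64_t block  = addr >> BLOCK_BITS;
        int idx = (int)(region % NUM_REGIONS);

        if (table[idx].valid && table[idx].region == region) {
            /* Access (hit or miss) inside a tracked region: prefetch ahead. */
            for (int d = 1; d <= PREFETCH_DEGREE; d++)
                issue_prefetch(block + d);
        } else if (!hit) {
            /* A miss to an untracked region allocates a tracking entry. */
            table[idx].region = region;
            table[idx].valid  = true;
        }
    }

    int main(void) {
        on_l2_access(0x10000, false);  /* miss: start tracking this region       */
        on_l2_access(0x10040, true);   /* hit in the same region: prefetch ahead */
        return 0;
    }

The point of the sketch is only the trigger structure discussed above: a miss to a new region allocates a tracking entry, and later accesses to that region, including hits, launch prefetches ahead of the access stream.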
Okay, so before we go into examples of how to do prefetching, let's talk about the four questions. There are mainly four questions in prefetching; there are of course details in each of them, but I call them what, when, where, and how, and there are nuances, especially in "where". "What" is: what addresses to prefetch; this is also called the address prediction algorithm, and it's very interesting, and we're going to talk about it. "When" is: when to initiate a prefetch request. Do you do it early, do you do it late, do you do it on time, can you do it on time? It's good to think about that. Clearly this is tightly integrated with the algorithm and with what you do for the other questions; these questions are all tightly coupled. "Where" has multiple dimensions: when you prefetch something, where do you put it, in caches or in a separate buffer? Where do you place the prefetcher, at which level of the memory hierarchy? And there are a bunch of other "wheres" that we will see in more detail. Finally, "how" is: how does the prefetcher operate and who operates it? This is more about principles: does software do it, does hardware do it, do you use an execution-based or thread-based prefetcher, do you have cooperative, hardware-software cooperative, or hybrid prefetchers? Today the answer is probably hybrid and all of the above, because we have a lot of prefetchers. We're going to look at these in a little more detail right now.

Okay, let's take a look at "what". This is usually called the prefetching algorithm; it deals with what addresses to prefetch, and it's important because if you generate a lot of addresses that are not going to be used, and you prefetch them, you waste resources, and these are precious resources: we're potentially talking about accessing main memory. Of course, you could be prefetching from the L2 cache to the L1 cache, in which case those resources are less critical, though probably still important. But if you're prefetching from main memory to the L3 cache, or to anywhere in the processor, you're wasting precious memory bandwidth, you're wasting cache or prefetch buffer space, you're wasting energy, you're moving data. So if you generate a wrong address to prefetch, it can be bad for performance, and there are many real instances where this has been bad for performance and people have turned prefetchers off. This has been common, I don't want to call it a myth, but it's kind of a myth in server systems: people say, oh, I don't trust the prefetcher, so I'm going to turn it off. The main reason is that they don't want to waste resources, and they may think, after doing some profiling of the hardware, that the access patterns of the workloads they run on these server systems do not match the access patterns predicted by the underlying hardware prefetcher. So basically you need to be careful, because you waste resources, and those wasted resources could otherwise have been used by demand requests or by more accurate prefetch requests. It's good to remember that not all prefetch requests are equal, as we will see later: some prefetch requests are much more accurate, some are much less accurate, and if you waste resources on less accurate prefetch requests, the more accurate requests that could have improved performance will not get those resources. So accuracy is important.
Accuracy is one metric by which we measure prefetchers; it's not the only one by any means. Essentially, prefetch accuracy is defined as the number of prefetches that are later used by the program, divided by the number of prefetches sent by the prefetcher. Ideally you want 100%, but we will see there's a trade-off here: if you want 100% accuracy, you become conservative and prefetch less, but once you start prefetching less, the other important metric gets worse, which is coverage. Coverage is what fraction of the cache misses you are intending to remove are actually removed. We're going to get to that; let me not get ahead of myself, I just want to define accuracy here.

So how do we know what to prefetch? There are multiple ways of doing it, and this really ties into how you do the prefetching. You can predict based on past access patterns: for example, you look at all the addresses you've seen in the past, or the recent addresses, and predict based on those; this uses information generated in the past, and we're going to see a lot of prefetchers that do this. Or you can use the compiler's or the programmer's knowledge of the data structures; for example, you could create a thread to do the prefetching, as we will see later.

Yes? Yeah, we're going to talk about that; that's basically timeliness. How do you define timeliness? It's basically about how much latency you actually reduce; there are multiple ways of defining it. But I understand what you mean: why do we have separate accuracy, coverage, and timeliness metrics; is there a single metric that can encompass them all? It's very difficult to do that, I think. Ideally you would like a single metric. We've seen stall-time fair memory scheduling, and I like that metric: ideally you would like a single metric saying I've reduced the stall time by this much. But it's very difficult to calculate in general, as you have seen with stall-time fair memory scheduling; that's one downside. And even if you can calculate it, it may still not be enough to understand what to improve in your prefetcher. The prefetcher may be reducing stall time, but why is it reducing stall time? Maybe it is highly accurate, or maybe it has high coverage to begin with. Yes, you can use other metrics to evaluate prefetcher performance, such as how much stall time it reduces, but they may not provide as much insight into how to improve the prefetcher. That's a very good point, yes, exactly.

Another question? Yes. Yes, there are works in the literature, for example, on prefetching for graph analytics, for graph applications, and they do exactly what I just said: they usually design some hardware that can be configured to exploit the programmer's or compiler's knowledge of the structure of the graph. And we're going to see some other domain-specific approaches, for linked data structures, for example. So yes, I think specialization of prefetchers is an interesting direction; you could also do this for different machine learning models, for example, and people have been doing that a little bit.

Okay, so essentially the prefetching algorithm determines what to prefetch, and this is clearly a first-order thing. It's good that people are thinking and questioning; keep the questions coming, maybe you'll come up with a much better prefetcher.
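Since accuracy, coverage, and timeliness came up together in that exchange, here is a minimal sketch of how the first two could be computed from event counters in a simulator or from performance counters; the counter names and values below are hypothetical, not from the slides:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical counter values, for illustration only.                     */
        double prefetches_sent    = 1000.0; /* prefetch requests issued            */
        double prefetches_used    =  700.0; /* prefetched blocks later demanded    */
        double misses_no_prefetch = 2000.0; /* demand misses with prefetching off  */

        double accuracy = prefetches_used / prefetches_sent;
        /* Each useful prefetch removes one demand miss that would otherwise occur. */
        double coverage = prefetches_used / misses_no_prefetch;

        printf("accuracy = %.2f, coverage = %.2f\n", accuracy, coverage);
        return 0;
    }

Timeliness is harder to reduce to a single counter like this, which is part of why the single-metric question above does not have a clean answer.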
And this is actually important: people spend a lot of time designing prefetchers in industry. I worked with a lot of these people, I still work with some of them, and they put a lot of effort into getting this right, because it affects performance. This is an area where you can still see significant single-thread performance gains today, and there's more to do, simply because memory is a huge bottleneck, as we know.

Okay, so let's take a look at some of these predictable, or not so predictable, access patterns. Again, I won't go through this in detail, but you can see that this access pattern is quite predictable: you can easily design something called a streaming prefetcher or a stride-based prefetcher, as we will see. This one is maybe a little harder to predict: you don't access the next address every time (these are cache block addresses), and you need to identify that the stride is +42. This one may be a little harder still, but you can probably still find the pattern: you get +2, +3, +4, +2, +3, +4, +2, +3, +4. It looks like an IQ test or something; I hated those things when I was a kid. It's interesting in a sense, but come on. Well, this is more interesting: now there's an application for the IQ test. You can see another example here that maybe looks unpredictable, but the same sequence of addresses is repeating, as you can see; if you can somehow predict that, maybe you'll be fine, or if you can somehow keep it in the cache, if that's possible. Existing prefetchers are actually quite sophisticated and can predict some of these repeating patterns, which is good news: there are hardware prefetchers in processors today that can predict some of these patterns. We're going to get back to that.

Let's talk about the other question: when. This is equally important; actually, I think all of these questions are quite important in the end, because you may generate an address correctly but generate it too late, and if you do it too late, it's useless. So when do you initiate a prefetch request? Prefetching too late is bad, as I have alluded to, but let's start with prefetching too early. Prefetching too early can also be bad, because you prefetch the data, bring it into a buffer or cache, and it may get evicted from that buffer or cache before the processor actually needs it and generates the address; that's also a problem. Prefetching too late means, in the end, you may not hide any of the memory latency: if you generate the address in the same cycle the processor actually requests it, that's not useful for reducing latency. So you want to prefetch somewhere in the middle: ideally, you would like to get the data right before the processor, the program, actually needs it. But it is very difficult to time that. As we have seen, there's a lot of interference and variability in memory latency, so it's very difficult to time exactly when to generate a request. And that's just part of the problem: it's not only about the memory latency, it's also about the program execution timeline. You need to deal with the memory latency variability to bring in the data, and program execution has a lot of variability as well.
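As a brief aside before continuing with "when": coming back to the stride patterns a couple of slides ago, here is a minimal sketch of how a hardware stride detector could train on the +42 pattern. The single global stream and the small confidence counter are my own simplifications; real stride prefetchers typically track many such streams, often indexed by the load's program counter:

    #include <stdio.h>
    #include <stdint.h>

    /* One training entry: the last address seen and the last stride observed. */
    typedef struct { uint64_t last_addr; int64_t last_stride; int confidence; } stride_entry_t;
    static stride_entry_t e;   /* a single global stream, for simplicity */

    static void on_access(uint64_t addr) {
        int64_t stride = (int64_t)(addr - e.last_addr);
        if (stride != 0 && stride == e.last_stride) {
            if (e.confidence < 3) e.confidence++;   /* same stride seen again   */
        } else {
            e.confidence  = 0;                      /* retrain on a new stride  */
            e.last_stride = stride;
        }
        e.last_addr = addr;
        if (e.confidence >= 1)                      /* stride confirmed: prefetch one ahead */
            printf("prefetch 0x%llx\n", (unsigned long long)(addr + (uint64_t)stride));
    }

    int main(void) {
        /* The +42 pattern: prefetching starts on the third access, once the
         * same stride has been observed twice in a row.                      */
        for (uint64_t a = 0x1000; a < 0x1000 + 8 * 42; a += 42)
            on_access(a);
        return 0;
    }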
The program may reference the data much earlier than you expect, because it took some other path: if it took this branch and this path, it would reference the data maybe 500 cycles later, but if it took that branch and the other path, it might reference the data 10 cycles later. So there's a lot of variability in program execution, and prefetching sits in the middle: it has to deal with both kinds of variability to get the data right on time. So it's usually very difficult to prefetch perfectly on time.

Okay, so basically, when a data item is prefetched affects the timeliness of the prefetcher, as we briefly alluded to earlier. A prefetcher can be made more timely (I'm not going to define timeliness right now, we'll get back to it) by making it more aggressive, meaning it tries to stay far ahead of the processor's demand access stream. For example, say it's a hardware prefetcher and the processor is accessing a perfectly predictable address pattern: A, A+1, A+2, A+3, A+4, A+5, A+6, A+7, and so on. You decide that whenever you see A, A+1, A+2, you prefetch everything from A+3 to A+60, so that you can get ahead of the access stream. That's a lot of requests to get ahead, but assume you got ahead: what if the processor stops accessing after address A+10? You generated 50 prefetch requests that are useless. So improving timeliness requires making the prefetcher more aggressive, which usually reduces accuracy. Increasing aggressiveness also improves coverage, as we will see later.

In software, if you're doing software prefetching, what you can do to prefetch earlier, to try to make the prefetching more timely, is to move the prefetch instructions earlier in the code. Say you have a function you want to prefetch for, and you have determined the prefetch instruction to use; you don't execute it right before the function, you execute it maybe ten functions earlier, maybe 500 instructions earlier. If you do that, you will generate the prefetch request early, hopefully with the right address, assuming you can even calculate the right address at that point. But now the processor may never reach the function you're prefetching for, because there may be branches in between that steer the processor away from that function due to the program path. We'll see some examples of this in a little bit.

Okay, that was "when". Now we're going to talk about "where". "Where" has multiple dimensions: where do you place the prefetched data, in what structure? You can place it in a cache or in a separate prefetch buffer. Most processors today place it in the cache; I think that's kind of the consensus, and the main reason is that the complexity of adding yet another buffer is too high. Our memory hierarchies are so complex that people don't want to keep adding more buffers, in general; there could be some exceptions. If you place the prefetched data in the cache, it's a simple design, no need for separate buffers, but you can evict useful demand data, you can cause cache pollution. On the other hand, a lot of caches today are
not very well utilized, meaning that if you snapshot a cache, most of the data in it, even demand data, is not going to be used soon, because caching efficiency is not very good. So maybe this is okay, but you may get unlucky: you may evict something that is going to be used soon, so you need to be careful. If you put prefetched data in a separate prefetch buffer, you protect the demand data from prefetches: no cache pollution. This should remind you of something: we have either dedicated, private structures (one cache for demand data, one buffer for prefetches) or a shared structure, and we've seen this trade-off earlier, sharing resources versus dedicating resources. Whenever you dedicate a resource, you provide more isolation, meaning demand data is protected from prefetches and there is no cache pollution, but now you have a whole slew of design choices to deal with and a more complex memory system. Where do you place the prefetch buffer? Do you have a prefetch buffer for every cache level? That's not going to be nice, I think, to begin with. When do you access the prefetch buffer, in parallel with the cache or serially after the cache? When do you move data from the prefetch buffer to the cache, and do you move it at all? These are all choices you have to deal with. How do you size the prefetch buffer? In fact, in one of our papers, which we're going to talk about, from HPCA 2007, we looked at sizing the prefetch buffer, and these prefetch buffers can get pretty large if you have a very aggressive prefetcher; they become bigger than some of the caches. Is that a good idea? And then there's another issue: how do you keep the prefetch buffer coherent? We have cache coherence when you have multiple processors, and as a result, if someone, meaning some other processor, modifies data that you prefetched into your prefetch buffer, you need to make sure you don't get the wrong data from your own prefetch buffer, because in the meantime someone else modified it. All of these are hairy issues, as you can see, and this is an example of how complex our memory system designs are today. For these reasons, most of the time the first choice, prefetching into the cache, is preferred. I already said this over here.

Okay, so "where" has another dimension. You may say, okay, I'm going to prefetch into the cache, and then immediately the question appears: which cache? I have so many caches in the system. Do I prefetch into the L4, L3, L2, L1? You can pick your choice, and clearly these have trade-offs. For example, L1 caches, as we have seen in Digital Design and Computer Architecture, are very tightly integrated with the pipeline; their design choices are really dictated by the pipeline, whereas L2 caches are decoupled from the pipeline because you don't need to access them as fast. As a result, L1 caches are very small and don't have very sophisticated management policies, and if you put prefetches in there, you may easily destroy the locality of the demand data. That said, maybe you can design a very specialized prefetcher between the L2 and the L1, so that you can hide that latency, because that latency still matters to the pipeline, and existing processors actually do that: L2-to-L1 prefetchers, and so on. So do you have separate prefetchers between levels? I think the answer is yes in many cases today, which
shows that there is, I don't want to call it a prefetcher hierarchy, but a set of prefetchers that are working together. Okay, so that's which level of cache. Then there's the question within the cache: you've decided, okay, I'm going to prefetch into the caches; pick one cache. I have a prefetch coming in; where do I place it in the cache? I have a set-associative cache, say four-way associative; if I have a direct-mapped cache, good luck, although you still have a choice, you may decide to drop the prefetch, for example. But with a set-associative cache you have multiple ways: into which way do you put the prefetch, and with what priority do you insert the prefetch into those ways? Is it equal priority with blocks that you know have actually been requested by the processor? If it's not equal priority, what is the priority? These questions basically come up. Or do you change the priority based on the accuracy of the prefetcher, or some other metric of the prefetcher, as we will see? I think that's reasonable in general, but you can see that the design choices multiply. Prefetched blocks are not known to be needed, in general, unless you have some other information, about the accuracy of prefetching, for example. If you have a least-recently-used based policy to manage a set-associative cache, a demand block is usually placed into the MRU, most-recently-used, position. Even this has changed: in a state-of-the-art caching lecture you would see that existing caches try to do more intelligent things than placing every demand request into the MRU position, but assume that this is the case, a pure vanilla traditional LRU policy. When the program accesses a cache block, you place that block into the most-recently-used position, because the theory is that if your program has a cyclic reference pattern, the most-recently-used block will be evicted last, and hopefully it will remain in the cache, so that you maximize your cache hit rate; that's the assumption. But with prefetched blocks you don't have that information, so where do you place them? There used to be some processors that placed all prefetches into the LRU position: least recently used, lowest priority. You have an eight-way cache, you place the prefetch into the least-recently-used position of a way, and if another fill comes to that set in the cache, you evict that prefetch immediately. Now, this may not be a good choice if the prefetch is accurate, for example. So this is an interesting question to think about, and I don't think we have found the best policies here yet; well, maybe Rahul has found them.

Okay, then there's another "where", which is where to place the hardware prefetcher in the memory hierarchy. Another way of looking at it is: what access patterns does the prefetcher see? This relates to your earlier question, but it doesn't cover everything there. Do you examine, for example, L1 hits and misses? Do you examine only L1 misses, only L2 misses, or a combination of both? Do you examine only memory misses, i.e., do you have a prefetcher in the memory controller, for example? It's good to think about that. What is the effect? Essentially, this affects how complete an access pattern you see, and what the
bandwidth or throughput requirements of your prefetcher are. If you see a more complete access pattern, meaning all the misses and hits the processor generates, whatever is coming out of the processor's load/store units, and you feed all of that to the prefetcher, the prefetcher needs to keep up with it, first of all, but it could also identify patterns better: it could achieve more accurate, higher-coverage prefetching. But now the prefetcher needs to examine more requests; it is much more bandwidth-intensive, and there may need to be more ports into the prefetcher, because the processor may be generating, say, four addresses per cycle if you have four load and store ports. And this is actually interesting, because maybe you don't want to see the most complete access pattern: maybe there is some predictability you want to exploit in the outer parts of the cache hierarchy. If you see a more complete access pattern, you analyze that access pattern and try to figure out what to prefetch, but not all of those accesses may be equally important: some of them may hit in the L2 cache, and you may not really want to consider those. If you really want to specialize the prefetcher for long-latency cache misses, then you may not want to see the fully complete access pattern. This doesn't mean the complete access pattern isn't useful for your prefetcher, but you really need to balance the trade-off between complexity and potentially perfect prefetching in terms of accuracy and coverage. If you design a prefetcher, you will know what I mean.

Okay, so a modern memory hierarchy looks roughly like this; I think you've seen this picture before, definitely in DDCA. There are many places you can put the prefetcher, as we discussed; you can also place a prefetcher between main memory and the disk. We're not going to cover those, but people have proposed a lot of operating-system-based prefetching techniques, as we've alluded to. Essentially, you can have prefetchers all over the place. I should also say that this is becoming more complicated: this is kind of the vanilla memory hierarchy, but we're making it more complicated. We're adding different types of memory: DRAM plus other types of memory. It could be phase-change memory, but it doesn't have to be; it could be, for example, high-bandwidth DRAM versus high-capacity DRAM. That's already happening. How do you prefetch data between those? Do you prefetch data between those at all? Usually, if you design something like that, people try to overlap communication and computation, and while they're doing that, they're effectively prefetching; so overlapping communication and computation can be seen as a form of prefetching. And in larger systems the hierarchy extends even further. You may have seen this picture earlier, but the memory hierarchy today is growing. Why is it growing? You may have a compute node with local memory, and that memory is not enough, so to extend it you have some remote memory locations. This could be a memory node, or it could be some other compute node, but in existing systems you usually have memory nodes versus compute nodes, and memory nodes essentially hold a lot of memory, so
that if your local memory is not enough, you go over this low-latency network to get the data from the remote memory. Clearly this enables even higher memory capacity, and modern data-intensive workloads in large data centers make use of this: databases, for example, graph analytics, and increasingly machine learning as well. But you can see that even though this is a low-latency network, and today's networks are actually quite low latency, it is still latency: there is some physical latency you cannot avoid, you need to search for the data over there, and there's additional latency in the memory controllers over there. Now, if you get a miss in the local memory, how do you tolerate the latency of the remote memory? Prefetching is applied to those remote memory latencies as well, and there are papers related to this; we have a paper that talks about it, for example. Any questions so far? Otherwise I'm going to switch to "how". Fascinating, huh? I find this fascinating, even though I've worked on it for almost 20 or 25 years; I still think it's fascinating.

Yes? Well, okay, it's good to think about. You're thinking about a scenario where memory bandwidth is completely, 100% utilized, right? Yes, if you're completely memory-bandwidth-bound, then putting more pressure on it, adding more unnecessary requests, is probably not good. But usually that's not the case, and there's another issue here, which is the ordering of requests. By prefetching, you may actually change the ordering of requests: the original, bandwidth-bound request stream may arrive in an order that is not very nice for the memory system, whereas if you prefetch nicely, the order may be much better. Does that make sense? I agree that there are interesting issues here that have not been examined as much, especially on the memory system side, because we have very complicated memory systems. Usually when people develop prefetching techniques they don't consider these, I don't want to call them second-order effects, because they could be first-order effects today, like row buffer hit rates. If your prefetching algorithm is not friendly to row buffer locality, then your memory bandwidth pressure will increase, for example; but if your prefetching algorithm nicely exploits open row buffers, then you can actually take better advantage of the available memory bandwidth. So I wouldn't take it as a given that prefetching makes memory bandwidth utilization worse. Yes, if you send useless requests, it makes things worse, but if you carefully optimize how you schedule your prefetches, you may get more out of the row buffer than in the baseline. Let me give you a concrete example. In the baseline, you may have requests that conflict in a bank, opening different rows in the same bank, so you get a lot of row buffer conflicts. With a prefetcher, what might happen is that you predict: okay, I opened this row, and I'm going to access it again later, so let me do all of those prefetches first, before opening the row for the next demand request. That could improve your bandwidth efficiency. Does that make sense? So that's a concrete example, I think, but these are issues that need to be carefully understood, and I don't think enough has been done in this area; maybe Rahul will tell
me that there is enough. But yes, absolutely, memory controllers are very important, as we discussed, I agree; it all comes back to the memory controller, but prefetching matters too. I think especially prefetching of high-latency requests is really important to co-design with the memory controller. If you're prefetching between the L2 and the L1, that's a different issue, of course; those are also low-stakes prefetches in a sense, shorter latencies that you're trying to hide. Here you're trying to hide high latencies over lower-bandwidth communication channels; those are higher-stakes requests, higher stakes in terms of performance, and the costs, in energy for example, are also very high.

Okay, so let's talk about "how" now. I'm going to classify prefetching into three categories: software, hardware, and execution based. Clearly software and hardware can also be combined without being execution based, but let's talk about software first. In software prefetching, the ISA, the instruction set architecture, provides prefetch instructions, and the programmer or the compiler inserts those prefetch instructions. This requires some effort, and it usually works well for regular access patterns. In hardware prefetching, you have specialized hardware that monitors memory accesses; this hardware memorizes, finds, or learns in some way the address strides, patterns, or correlations, and generates prefetch addresses automatically. In execution-based prefetching, which can be software based or hardware based, or again cooperative, you generate a thread that is specialized for prefetching, and it is executed to prefetch data for the main program; I already mentioned that second part.

So let's start with software prefetching. This is an old idea. I actually changed these slides, but I think you got the old version, because I was dealing with multiple copies; coherence is a problem, as you can see. If you have two copies of the same presentation and you update one but not the other, this is what you get. There are actually multiple papers on this topic; there's one from ASPLOS 1991 by Callahan, Kennedy, and Porterfield that is also very interesting. I'll update those slides; Rahul will remind me. The basic idea here is that the compiler, in these papers, or the programmer, places prefetch instructions at appropriate places in the code, and these prefetch instructions hopefully, well not just hopefully, these prefetch instructions prefetch data into the caches, and they can be manipulated to be timely or not timely, etc. I'm going to show you some examples. This is an old version of x86, but it's probably still the same; does anybody program x86 today? You have essentially a prefetch instruction, and you can see there are different versions of it. For example, this one moves data from a memory address closer to the processor using a T0 hint. Now you go and look at what a T0 hint is, and you find: prefetch data into all levels of the cache hierarchy. Then you look further: Pentium III processor, first- or second-level cache; Pentium 4 and Intel Xeon processors, second-level cache. The specification is violated immediately: it says all levels of the cache hierarchy, so what happened to the first-level cache in the second case? This basically shows that people don't want you to prefetch into the first-level cache. So in software prefetching, this ISA description was probably written by a hardware designer, and the hardware designer doesn't trust the software, as you can see.
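For reference, the way these instructions are usually reached from C is the _mm_prefetch intrinsic, which takes exactly these hints (_MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, _MM_HINT_NTA). The two toy loops and the lookahead of 16 elements below are my own illustrative choices, not something from the slide:

    #include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */

    float sum_with_t0(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            /* T0 hint: ask for the block in all levels of the cache hierarchy. */
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }

    void copy_once(float *dst, const float *src, int n) {
        for (int i = 0; i < n; i++) {
            /* NTA hint: ask for a non-temporal fill that pollutes the caches
             * as little as possible, for data that will be used only once.   */
            _mm_prefetch((const char *)&src[i + 16], _MM_HINT_NTA);
            dst[i] = src[i];
        }
    }

Running a few elements past the end of the array in the last iterations is harmless at the hardware level, precisely because these are non-faulting hints; and, as noted above, the hardware is free to ignore them entirely.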
They basically say: no, I'm not going to allow you to prefetch into the first-level cache. So how do you prefetch into the first-level cache? Well, I think you need to use the PREFETCHNTA instruction, but again, if you look at the specification, it's kind of fuzzy; it doesn't guarantee anything. In fact, none of these may be guaranteed; that's the nature of speculative instructions. You can use the instruction and hope that the processor prefetches the data into whatever it kind of promises, but it's not a full promise, as you can see. If push comes to shove, the hardware designer implementing this prefetch instruction may say: this is a speculative instruction anyway, I'm not going to do anything. That doesn't happen, of course, but these are instructions that are not necessary from the perspective of correct execution, so they can be dropped. Does that make sense? That's the nature of speculative instructions. And again, if you're designing your hardware and trying to push it out for the next deadline, you don't have time, you have to drop something; what do you drop? Well, the prefetch instruction looks like a good candidate. I wouldn't drop a load instruction, for example, but I guess Rahul likes dropping some loads, so maybe he'll talk about that. As long as you can guarantee that a load is droppable, I guess you could drop it, but guaranteeing that a load is droppable adds extra complexity; with a prefetch you don't need to worry, you are guaranteed that you can drop it. Jokes aside, this is the truth; this is how hardware is designed in general, especially if you have a complex processor, and if you've designed one, you know what I mean.

Okay, so let's look at some software prefetching. This can work very nicely for regular, array-based access patterns. Here we are adding two arrays together, and while we're adding the elements at index i, we issue prefetches for the elements at index i+8; so we prefetch eight iterations ahead. How do you determine how many iterations ahead to prefetch? Is that your question, or do you have a separate one? Okay, let me finish this. Well, you need to figure out what your memory latency is, whether these elements are in the cache, how long each iteration takes, and do some calculation: to make the prefetch timely and accurate, I prefetch eight iterations ahead. If I prefetched four iterations ahead, I wouldn't fully hide the memory latency of that request by the time I get to that iteration; if I prefetched 16 iterations ahead, maybe that would be too early, and maybe my accuracy would also drop, because at some point the loop ends and you will have prefetched extra things. Yes, that's right, exactly: I'm assuming these prefetches are cache-block based, and you're right, you need to do a separate calculation to make sure you're prefetching the next cache block rather than every element. I kind of assumed one prefetch per cache block, but it doesn't need to be that way; that's a very good point. Any other thoughts here? So you can see that even with this simple loop you need to do a lot of calculation, and we're not even talking about executing this program together with other programs. Once you do that, your memory latency becomes, well, not completely unpredictable,
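Here is a sketch of that array-add example using GCC/Clang's __builtin_prefetch, which the compiler lowers to the target's prefetch instruction when one exists. The distance of eight iterations is the one from the slide; the sizing rule in the comment is the rough calculation just described, not an exact formula:

    /* Rough rule of thumb for the lookahead:
     *   distance (in iterations) ~ memory latency / time per loop iteration.
     * Per the question above, one prefetch per cache block is enough, not one
     * per element; this sketch keeps one per element for clarity.            */
    void add_arrays(const int *a, const int *b, int *c, int n) {
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 8]);     /* read prefetch, 8 iterations ahead       */
            __builtin_prefetch(&b[i + 8]);
            __builtin_prefetch(&c[i + 8], 1);  /* second argument 1 = prefetch for write  */
            c[i] = a[i] + b[i];
        }
    }

Near the end of the loop these prefetches run past the arrays, which is exactly the steady-state waste discussed a bit later; it is wasteful but not incorrect, since prefetches cannot fault.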
You may optimize your program to be perfect for one configuration, but if your memory latency increases, this prefetch may no longer be effective. That's the downside of software prefetching in general: it's very difficult to insert the prefetch instructions at the right places when your memory latencies, cache hierarchy, and so on are unpredictable or hard to predict. But there are other issues too. These prefetch instructions are part of your program now, which means they increase your code size, they take up processing and execution bandwidth, and they take up space in your instruction cache. Think about it: you can start getting instruction cache misses, and once you get instruction cache misses you may need to prefetch instructions, so you may end up prefetching prefetch instructions in order to prefetch for data cache misses. It's interesting. We're going to talk about instruction prefetching later on; everything I've said applies to instruction prefetching as well, but I'm going to focus on data prefetching to begin with. I think instruction prefetching is still easier, but there are interesting things there too. So, how early to prefetch? Determining this is difficult, as I discussed. Prefetch distance is how far ahead you prefetch; in this particular case, which loop iteration do you prefetch for, the next one, the one after that, and so on. Then there's prefetch degree, which is how many prefetches you generate each time: for example, you may generate prefetches for i plus 8, i plus 9, and i plus 10, so your distance is at least eight and your degree is three whenever you generate prefetches. The right distance depends on the hardware implementation, the memory latency, the cache size, the cache block size as you said, and the time between loop iterations. You may perfectly optimize for one processor and one configuration, but your prefetching may not work well on another processor, another configuration, or alongside another application that changes all of those, so portability becomes an issue with these software-based prefetches. And if you go too far ahead, accuracy drops, because there are branches in between. You can also see the end-of-loop waste here: in the steady state you waste eight accesses, because you will always have prefetched n, n plus 1, up to n plus 7, whereas your loop terminates at n; of course, if n is very small you won't waste that many. The other issue is that you need to change your ISA: you need special prefetch instructions. People came up with interesting things here. PowerPC had a data cache block touch instruction; we saw the x86 prefetch instruction; Alpha was a bit clever and didn't add a prefetch instruction at all, they used the regular load instruction with register 31 as the destination, and register 31 is always zero in Alpha. Zero is such a nice value, you might want to initialize memory with all zeros for example, and that's one of the reasons a register is hardwired to zero.
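One back-of-the-envelope way to pick the distance, which I believe matches how these papers reason about it: issue the prefetch at least as many iterations early as the memory latency divided by the time of one loop iteration. The numbers below are purely illustrative.

```latex
d \;\ge\; \left\lceil \frac{l}{s} \right\rceil
\qquad \text{e.g.} \qquad
d \;\ge\; \left\lceil \frac{200~\text{cycles of memory latency}}{25~\text{cycles per iteration}} \right\rceil \;=\; 8
```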
Alpha took advantage of that: a load whose destination is register 31 has no architectural effect, so it acts as a prefetch, and you don't need to add a prefetch instruction to the ISA. Cute tricks, as you can see. Now, that was a regular array-based access pattern, and hopefully I've convinced you that even that is difficult. What happens if you have pointer-based data structures? Before we go into that, go ahead. Yes, that's right, absolutely, I didn't want to get into that, but exactly: if you're doing software prefetching and there's also hardware doing hardware prefetching, you may be redundant, and the hardware prefetcher may already be capturing all of these patterns much better than the software. That's going to be my argument, actually: in general, hardware prefetchers can be customized for the particular processor implementation, they don't require the code to change, and as a result you can make them work more easily; on top of that, if you do software prefetching you may actually be reducing your performance relative to not prefetching at all. That's a very good point. Okay, let's look at pointer-based structures. Here I'm doing some pointer chasing: I do some work on the data pointed to by this pointer, and then I fetch the next data structure linked to this one; before I do the work, I prefetch that next data structure. Not bad, I guess you could do that, but it stops there. Can you prefetch the one after the next? That's a bit difficult, because it requires calculating all of those addresses first. Pointer chasing always causes problems for prefetching; these are the hardest patterns to prefetch in general, but we're going to talk about that. People have developed methods for software prefetching of these structures, using skip lists and so on, which I'm not going to go through. Yes, exactly, it depends on how long the work is here: if the work takes long enough, maybe you get the next node early enough, and if the demand request arrives while the prefetch is still in flight, you normally merge those requests in the memory system; there are miss buffers, or miss status holding registers, that merge them. These are just to give you a glimpse of the issues you see with software prefetching; you can imagine other ways of prefetching these structures, but I won't go through them. Okay, so where should a compiler insert prefetches? That becomes a question in software prefetching. Do you prefetch for every load? That's a bad idea in general: it's too bandwidth intensive, in both memory and execution bandwidth, because not all loads miss in the cache. Usually what you do is profile the code and determine the loads that are likely to miss. But now you've profiled your code with some input set and determined that loads X, Y, Z are likely to miss, so you insert prefetch instructions for them; maybe your profiling input set wasn't good enough, and when someone runs your program with a totally different input set, loads A, B, C turn out to be the worst offenders in terms of cache misses. You inserted prefetches for X, Y, Z and not for A, B, C, so your performance actually degrades on the real input set. This is the classic profile input set representativeness problem.
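Here is a minimal C sketch of the pointer-chasing case just described; work() is a hypothetical per-node function, and whether the prefetch helps depends entirely on how long that work takes.

```c
struct node {
    int          data;
    struct node *next;
};

void work(int data);   /* hypothetical per-node computation */

void traverse(struct node *p) {
    while (p) {
        /* Prefetch the next node while we work on the current one.
           Prefetching p->next->next is the hard part: it would require
           loading p->next first, which is exactly the serialization
           problem of pointer chasing. */
        __builtin_prefetch(p->next);
        work(p->data);
        p = p->next;
    }
}
```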
Whenever you're doing compiler-based or software-based approaches, you need to be cognizant of this, and you need to get some sort of representative input set, which is a very difficult problem in general. Dynamic compilers, where the compilation and insertion of prefetches happens in a runtime system, for example in Java or some common language runtime where you can recompile your code, may do better because they work with the actual input, but the input can also change over time, so this remains an issue. Then, how far ahead of the miss should the prefetch be inserted? Assuming you've decided which loads miss, this question still remains. Again, you profile and determine the probability of use for various prefetch distances from the miss, and again your profile may not be representative of which misses will actually happen. Even if that's magically solved, you still need to figure out where in the program to insert the prefetch, and this is a difficult task because many different paths can lead to the same miss. If you insert it too late, you don't hide the latency; if you insert it too early, you may lose accuracy as well as timeliness, because your prefetched data may get evicted before you actually use it. In general, because memory latencies are so high, you need to insert a prefetch far in advance to cover hundreds of cycles of main memory latency, and this makes software prefetching very hard to do in the general case. With array access patterns you could try to write a loop that does the prefetching first and then does the computation, decoupling computation and communication, and somehow get away with it, but in the general case this is not easy. That's why we focus a lot on hardware prefetching. This is not to say software prefetching cannot be useful in constrained situations: if you have a machine learning accelerator, you know it's the only thing running on that machine, and you know the access patterns of the different layers and how they communicate with each other, you can insert these prefetches very nicely. But that's a constrained scenario; it doesn't generalize, because you would need to do it for every workload, and we don't have accelerators for every workload. How much time do we have? Okay, this is a good time for a break; let's take 15 minutes, be back at 14:40, and then start with hardware prefetching. All right, please close the door, let's continue. We're done with software prefetching, at least the portion in this lecture, and now we jump into hardware prefetching, which can be much more specialized. The idea is that you design specialized hardware that observes load, store, or cache miss and hit access patterns and prefetches data based on past access behavior. Let's look at the trade-offs of this general idea. Because you have specialized hardware, it can be tuned to the implementation of the processor, the memory hierarchy, and the memory controller; you can take all of that into account, assuming it's possible.
That's the big advantage here, whereas you don't have that information at the software level. It also doesn't waste instruction execution bandwidth, which is a big thing: you don't generate instructions and inject them, you generate memory requests and inject them into the memory hierarchy with specialized hardware. But of course you now have more hardware complexity to detect the patterns, and in some cases software can be more efficient, because in software you know the address patterns, though maybe not in all cases. Okay, let's look at some examples of hardware prefetchers. The simplest form is the next-line prefetcher, initially employed in the IBM 360/91 in the 1960s for its instruction prefetching engine. The idea is that you always prefetch the next cache line, or in the generalized form the next N cache lines, after a demand access or a demand miss. This is also called the next-sequential prefetcher, and it makes sense if you're going sequentially through some memory region. It works well with code, for example: normally you execute a sequence of instructions that sit at consecutive, increasing memory addresses, unless you take a branch, and then that sequentiality is broken. The sequential processing paradigm is exactly why this was first employed for prefetching instructions, but data can be sequential too: if you're going through an array, you access array indices in sequential order, and that sequentiality justifies prefetching the next line or the next N lines. Let's look at some trade-offs. It's very simple to implement; no pattern detection is needed: you see address A, you prefetch A+1, A+2, A+3, A+4; you see address B, you do the same thing, B+1, B+2, B+3, B+4. That's the next-line prefetcher. It works well for sequential or streaming access patterns (sequential is sometimes called streaming, although streaming has become overloaded more recently). The downside is that it can be wasteful with irregular access patterns: if you're never accessing the next cache line, in the worst case this doesn't benefit you at all, and it actually hurts you. And even with regular access patterns there are problems: what is the prefetch accuracy if your access stride is two and your N is one? You're accessing addresses A, A+2, A+4, A+6, A+8, and you always prefetch one ahead, A+1, A+3, A+5, so you're always prefetching exactly the thing you will not access. You can always find these worst cases, so it's not flexible in that sense. Of course you could increase N to two; then at least you'd be prefetching something useful, but your accuracy becomes only 50 percent, because half of what you prefetch is not useful when N is two and the stride is two. You can have fun with these; in the homework and maybe the exam you'll see questions about this sort of access pattern.
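As a minimal sketch, here is what a next-N-line prefetcher does on every trigger; issue_prefetch() is a stand-in for enqueueing a request into the memory hierarchy, and the degree is illustrative.

```c
#define DEGREE 4   /* N: how many next lines to prefetch (illustrative) */

/* Stand-in for whatever enqueues a prefetch request into the hierarchy. */
void issue_prefetch(unsigned long block_addr);

/* Called on a demand access (or demand miss) to cache-block address A. */
void next_line_prefetch(unsigned long block_addr) {
    for (int i = 1; i <= DEGREE; i++)
        issue_prefetch(block_addr + i);   /* block addresses, not bytes */
}
```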
Then there's another question: what if the program is traversing memory from higher to lower addresses? Who said your addresses must be increasing? What about decreasing addresses? Do you also prefetch the previous N cache lines just to be sure, so that whenever you access address A you prefetch everything from A+1 to A+N and from A-1 to A-N? That's not going to be good either. So next-line prefetchers, at least in this unoptimized form, are not that good. What people do, as we'll see later, is direction prediction, especially in more sophisticated locality-based prefetchers: you can predict the direction of the accesses. If you observe the processor's access pattern, you can figure out that the addresses are increasing or decreasing and predict that, giving you a more sophisticated next-line prefetcher. Makes sense? Okay, a little more sophisticated prefetcher is the stride prefetcher. This is the strided memory access pattern; again we're talking about cache blocks, and these could be misses. The distance between consecutive addresses is always N; we call that distance the stride, so the stride is N here, and this is relatively easy to detect. The idea in a stride prefetcher is to record the stride between consecutive memory accesses, and if the stride is stable, use it to predict the next M memory accesses; M could be one or something else. But even the stride prefetcher immediately raises a question: what do you base your stride on? Do you determine the stride on a per-instruction basis or on a per-memory-region basis? Here I didn't say anything about who generates A and A+N: A could be generated by one instruction, A+N by another, A+2N by yet another, different loads. If you determine your stride per single load, you would not capture this pattern, because the strides are formed by different loads accessing memory. We're going to look at both flavors. Let's take instruction-based stride prefetching first, which is how it was originally proposed. You have a load instruction, and using the program counter, the instruction pointer, of that instruction as a tag, you store the last address that instruction referenced, you compute the latest stride by comparing with the previous address, and you keep a confidence level, meaning you keep seeing the same stride over and over. You tag entries by the instruction's program counter: each load or store instruction can lead to a memory access pattern with its own stride, and you distinguish the instructions that way. The downside of instruction-based stride prefetching is that you can only detect strides caused by each instruction separately. There are also other issues, like the timeliness of the prefetches.
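Here is a minimal sketch of one entry of such a PC-indexed stride table and its training logic; the confidence threshold, and the fact that the entry is passed in directly rather than being looked up by hashing the PC, are simplifications of this sketch.

```c
/* One entry of a PC-indexed stride table. */
struct stride_entry {
    unsigned long last_addr;   /* last address this load referenced   */
    long          stride;      /* last observed stride                */
    int           confidence;  /* how many times that stride repeated */
};

void issue_prefetch(unsigned long addr);

/* Called when the load owning this entry accesses `addr`. */
void stride_train(struct stride_entry *e, unsigned long addr) {
    long observed = (long)(addr - e->last_addr);
    if (observed == e->stride) {
        if (e->confidence < 3) e->confidence++;
    } else {
        e->stride = observed;
        e->confidence = 0;
    }
    e->last_addr = addr;

    /* Once the stride looks stable, predict the next access. */
    if (e->confidence >= 2)
        issue_prefetch(addr + (unsigned long)e->stride);
}
```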
For example, you executed this instruction and figured out the stride; initiating the prefetch only the next time the load is fetched can be too late. You want to initiate the prefetch much earlier. In a tight loop, this load may be every fifth instruction you execute, and if these accesses actually go to memory, you're generating the strided accesses very quickly, while the memory latency may be 500 cycles. So how do you get ahead? Getting ahead basically means, going back to the earlier picture, that you start prefetching something like A+50N and keep doing that. But we already know the problem with that: if you prefetch that far ahead, maybe you're prefetching too far, and the processor will stop walking that stride at some point. A potential solution is to look ahead in the instruction stream and guess how far ahead you should prefetch. There are multiple solutions; I'm not going to go into the details, but people have proposed some complicated ones, like determining a look-ahead program counter. I think there may be simpler solutions, and you can read the papers and think about that. The other way of doing stride prefetching is memory-region based. Why is "based" repeated on this slide? Let me fix that right now... no, that was a bad idea... okay, that's better... much better now. In this case you don't keep track of strides per program counter; you don't distinguish instructions, you distinguish memory regions. You take a cache block address, chop off some bits, and say this address belongs to this region in memory; it could be a page, or a collection of consecutive pages. Now, per region, you keep track of the last address you've seen, the stride you've seen, and some confidence; you basically check whether you see a stable stride in that region, A, A+N, A+2N, A+3N, and there could be another region with B, B+M, B+2M, B+3M, and so on. So now you're tracking strides on a per-region basis; the regions could be pages or many pages. This can detect strided access patterns that appear due to multiple instructions, as I showed earlier: each access could come from a different instruction, which is more general, and this actually happens in real programs. There's also a special case of this, stream prefetching with stream buffers, which is memory-region-based stride prefetching where N is one; I'll talk about that because it's a very influential paper that introduced the idea. In the end you could do both: some load instructions are quite strided, and some memory regions are quite strided, so you could combine the two approaches. Makes sense? Okay, let me talk about the trade-offs between stride- and stream-based prefetching before we get to stream prefetching itself, which is the special case where N is one and you keep streaming, filling buffers called stream buffers. So: instruction-based stride prefetching versus memory-region-based. The word "based" is repeated again on this slide; I don't know what's
going on; it's a copy-paste error. Is this your copy-paste, Rahul, or mine? This is how errors get introduced into code too: copy-paste is how errors propagate. Okay, so as I said, the latter can exploit strides that occur due to the interaction of multiple instructions, and I think the latter can also more easily get further ahead of the processor's access stream, because you don't need to worry about whether you'll see a particular instruction again; there's no need for a look-ahead program counter. But the latter is also more hardware intensive, because you need to keep track of regions, and tracking regions can be a bit difficult, mainly because there are usually a whole lot more regions to monitor than instructions: the instruction working set is usually much smaller than the data working set. If you can design a stride prefetcher that works nicely based on instruction-based strides, your hardware may be less costly. Of course, you can make the regions coarser grained, for example large pages or one-gigabyte regions, but then the downside is that there may not be a single stride in a one-gigabyte region; there may be many strides, so as you make your regions coarser you may need to track more strides per region, and we'll talk about that. Actually, even with a small region you may have multiple strides: in a four-kilobyte page you may be touching different pieces of data structures, and that can produce multiple strides. In general, single-stride prefetching doesn't work very well; I think that's the consensus in the field, and that's why people have developed more complicated prefetchers. This is the original paper, or one of the original papers, on instruction-based stride prefetching; it was going to be on your list of papers to potentially read for the homework. It's a little old, but it's a nice one, and these are employed in real systems: there's an instruction-pointer-based prefetcher into the level-one data cache in Intel processors, for example; I don't know if it's still there, but I assume it is. There is also memory-region-based stride prefetching; in this case, stream buffers, which I'll briefly talk about. This is a very interesting paper from ISCA 1990, and the idea is to prefetch into the data cache; you can see that at that time there was one data cache, not many levels of caching. You have some number of stream buffers, I think eight in the paper; you see four of them drawn plus dots. We store the prefetched data into these FIFOs, and each stream buffer holds one stream of sequentially prefetched cache lines. On a load miss from the data cache, you check the head of each queue to see whether an address matches, and if one matches, you take the data from that queue, insert it into the cache, and move the head pointer to the next element. Clearly this works nicely if you're streaming through memory, with accesses to consecutive cache lines: if you're adding, say, eight arrays together, you're streaming through memory, you really need eight of those streams concurrently, and you can get the data from the head of eight of those FIFOs.
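Here is a rough C sketch of the check-the-head-of-each-FIFO operation just described; the number of buffers, the depth, and the helper function names are all illustrative stand-ins for the hardware.

```c
#define NUM_BUFFERS 4
#define DEPTH       4

struct stream_buffer {
    unsigned long tag[DEPTH];    /* block address of each prefetched line */
    int           valid[DEPTH];  /* has that line arrived yet?            */
    int           head;          /* oldest entry: the next expected line  */
    unsigned long next_addr;     /* next block address to prefetch        */
};

void issue_prefetch(unsigned long block_addr);
void move_line_to_cache(struct stream_buffer *b, int slot);
void allocate_new_stream(struct stream_buffer *bufs, unsigned long miss_addr);

/* On a data-cache miss: only the head of each FIFO is checked. */
void on_cache_miss(struct stream_buffer *bufs, unsigned long miss_addr) {
    for (int i = 0; i < NUM_BUFFERS; i++) {
        struct stream_buffer *b = &bufs[i];
        int h = b->head;
        if (b->valid[h] && b->tag[h] == miss_addr) {
            move_line_to_cache(b, h);            /* hit: fill the cache     */
            b->valid[h] = 0;
            b->head = (h + 1) % DEPTH;
            issue_prefetch(b->next_addr++);      /* keep the stream running */
            return;
        }
    }
    allocate_new_stream(bufs, miss_addr);        /* missed in all buffers   */
}
```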
If you don't find the data at the head of any stream buffer, you allocate a new stream buffer for the new miss address, and you may have to replace one of the stream buffers you're not using. The hope is that you're always streaming through memory; if that doesn't hold, you may end up with addresses duplicated across different FIFOs. It's a very specialized way of doing it, but people generalized it later; you don't have to be as strict as the paper proposes. Essentially, whenever there's space in a FIFO, you keep prefetching into it whenever the address bus is not busy, so you can think of this as a stream-based execution model: you keep prefetching and filling the streams, and hopefully the processor keeps popping data out of these queues. Does that make sense? It's a very strict model, but that's the original idea, and you can certainly criticize the paper for being too strict. What's important, though, is the idea; that's why I think people should be more accepting of ideas as opposed to particular implementations. This is one particular implementation of the general idea of stream-based prefetching, and it's not implemented this way in any processor I know of, but stream-based prefetching has influenced a lot of prefetchers since, so it's good to decouple idea from implementation when you think about either. The stream buffer design itself is relatively simple: you have the cache block with its associated address, a valid bit indicating whether the data is actually in the stream buffer, and a tag. It's a bit like a cache, except it's a FIFO queue: when the CPU generates an address, you compare it to the tag to see whether the block is in the stream buffer, and there's a prefetch address generator, always holding the next address to be prefetched into this buffer. Very simple logic, as you can see, and this figure from the paper shows multiple such stream buffers. In this particular design the prefetcher sits after the data cache and prefetches into these fairly rigid stream buffers, which is also interesting: later papers looked at, for example, whether it's even worth having an L2 cache in applications where all you do is stream through memory, because these stream buffers are good enough for that; but they're not general purpose, they're very specialized for streaming access patterns. There's a more generalized streaming prefetcher in the IBM POWER4 that doesn't use specialized stream buffers. What it does, as this picture shows, is identify memory regions and prefetch streaming access patterns within them: for example, you have lines l0 and l1 in the L1 cache; after you use line zero, you bring in line one; after you use line one, you bring line five in from the next cache level; after you use line five, you bring maybe multiple lines into the L3 cache, and there's logic to determine all of this.
It's the same concept: you're streaming, except you need some metadata to decide what to bring into each cache, and you don't need stream buffers; the mechanism is incorporated into the caches themselves. Makes sense? You can also read it here, which says essentially what I just said, and it notes that eight such streams per processor are supported. This is the recommended paper on stream prefetching, and it's interesting to look at for another reason as well. As I said, memory latency has always been a problem in computer architecture, and you can see some old numbers here: in the VAX 11/780, the fastest processor of its time, the memory access time was about 1200 nanoseconds, which is pretty large given that the cycle time of that processor was 200 nanoseconds. Clearly memory times have not improved as much as processor cycle times, and that's one of the reasons for the memory bottleneck. So the miss cost there is six cycles, or about 0.6 instructions, given that machine's cycles per instruction. Norm Jouppi, when he designed these prefetchers, was working on the Western Research Lab Titan processor, whose numbers were worse in terms of miss cost, around 12 cycles for the actual design, and there were projections of memory latencies reaching 70 cycles and beyond; today we're easily in the hundreds of cycles. This also shows that memory latencies have not reduced much: the projections of that era are not too far from today's systems. Today's general-purpose systems are easily around 100 nanoseconds, maybe 120, and if you use expanded memory it can easily go to 280 nanoseconds. And this paper was written in 1990, 34 years ago. Any questions? I'm going to generalize stream-based prefetching a little more; I'll call it locality-based prefetching, for lack of a better name. One issue with stream-based or stride-based prefetching is that they are relatively rigid: you determine you're streaming, plus one, plus one, plus one, or striding, plus N, plus N, plus N. Unfortunately, that plus N is not easy to determine in general; there are many applications where the access patterns are not perfectly strided, due to memory allocation patterns and due to the access patterns themselves. Your memory allocation may not be perfectly aligned with cache blocks, or when you traverse data structures you don't always walk straight through them: you look something up based on some value in the next data structure, or in an element of an array, so you end up touching locations at various offsets from a base value, and some patterns look random but stay within close-by addresses. These may not be easy to capture, so the question is how you capture such patterns. I call this locality-based prefetching; this is the paper that, in my opinion, describes it best. I'm not going to go through it in detail, but let me use the board here.
That doesn't matter. How do we switch? Stop sharing is good enough... okay, is that good? You don't want to see that part. A complicated mechanism. Now the other issue is how to write on the board; hopefully that's easier. Okay, it looks good. So, I'm going to give you the general idea, not a particular implementation; you can read the paper for a particular implementation, or actually read the patents, which are interesting. The basic idea is this: you get a miss to an address, say A, and it's determined to be in some region. You create a tracker, a hardware structure that starts monitoring whether addresses near A arrive; you might define the region as, say, A minus 16 to A plus 16 cache blocks. Then you wait for addresses that fall into this tracker's region. Say the processor accesses A plus 5; now you have some indication that the processor is going in the positive direction. You may want more confidence, so you wait for another access before you start prefetching, and say the processor then accesses A plus 1. Now you have another indication that the processor is moving in the plus direction, so you set the direction to plus and transform the entry so that you no longer monitor the negative part. Now I have another issue: how do I delete? It's like a Bloom filter, how do you delete elements from a set? I don't want to use my hand; that's a very heavy-handed solution, you have to have the right tools for it. A whiteboard would have been better, I guess, but I like the blackboard. So basically you stop monitoring that side. You now have some confidence that the processor is streaming, which is why I call it streaming, toward this direction in this region, and now you can start prefetching. There are multiple prefetching algorithms at this point, but for example, one thing you can do is expand the region: since you've figured out that the processor is not going downward but upward, maybe you expand the region up to A plus 64. I like 64 because 64 times 64 is, anybody want to guess? Interesting math: 4096, and that's the typical page size, four kilobytes, which is why some of these prefetchers try to stay within a page. Now you have a tracker that watches this region and believes you're going in the positive direction, and you want to get ahead of the processor. One way is to say, I'm going to prefetch up to A plus 8 (maybe the prefetches should be a different color on the board), and try to get ahead of the processor: so you prefetch A plus 8, plus 7, plus 6 (I didn't space these well), plus 4, plus 3, plus 2.
You prefetch all of these in between and hope that you're ahead of the processor by eight, so your prefetch stream is hopefully eight ahead; you could decide to go up to twelve, for example. Then you keep tracking, meaning you keep waiting for more evidence that the processor is moving in the direction you're prefetching in, and if you get that evidence, you prefetch more. For example, if the processor accesses A plus 17, you've got more evidence, so maybe you prefetch a couple more blocks beyond that. It's not perfect, as you can see: there is directionality, streaming in this direction, but maybe after that the processor accesses A plus 6, which you've already prefetched. Again, I'm not showing a particular algorithm here; real implementations are more optimized than what I've drawn, and you can optimize a lot of this, but this is the idea of locality-based prefetching: you determine a direction the processor is streaming in, or you think it is, you prefetch in that direction, and you wait for evidence that the processor keeps streaming that way so you can keep prefetching and hopefully stay ahead of it. If you see A plus 17 here, maybe you prefetch even more aggressively: you had prefetched up to A plus 8, then you saw A plus 17, so what do you do? Maybe you prefetch everything in between plus some more beyond. That would be aggressive, because you generate a lot of prefetches, but if the processor eventually touches many of these blocks, maybe it's a good idea. Variants of this sort of prefetcher are actually employed in a lot of Intel systems, and we talk about that in our paper; of course they're much more sophisticated by now, and they're surprisingly effective. I've worked on these a lot. Of course, if you have a nice stride, A plus 5, A plus 10, A plus 15, forget all of this: detect that stride and do stride prefetching, and filter those cases out. These trackers are really for the difficult access patterns, and if you use them for the difficult access patterns, you gain a lot of performance, as I'll show you on the next slide. Okay, do you need to do something? I'm sharing... oh, now I see it; watch out, don't hit your head or your hands. As you can see, heterogeneity always comes with extra complexity: we have heterogeneous modes of presentation, and switching between them costs more complexity, more overhead, and more latency. That's always true. This slide is a description of one of these prefetchers; you can read it. We have actually implemented a lot of these prefetchers in our own work, and I also worked at Intel looking at their impact: the performance improvements are quite interesting and impressive. They are bandwidth intensive, but the bandwidth intensity can be reduced with stride detection and feedback mechanisms.
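Here is a very rough sketch of one such region tracker, just to make the board example concrete; the region handling, the confidence threshold, and the prefetch-ahead distance of 8 are all illustrative, not taken from any particular implementation.

```c
/* One region tracker in a locality-based prefetcher (illustrative sketch). */
struct region_tracker {
    unsigned long base;       /* block address of the triggering miss A    */
    int  direction;           /* 0 = unknown, +1 = upward, -1 = downward   */
    int  confidence;          /* accesses seen in the guessed direction    */
    long prefetched_to;       /* offset (in blocks) already prefetched to  */
};

void issue_prefetch(unsigned long block_addr);

/* Called for every access that falls inside this tracker's region. */
void region_access(struct region_tracker *t, unsigned long block_addr) {
    long offset = (long)(block_addr - t->base);
    if (t->direction == 0) {                     /* first hint of a direction */
        t->direction = (offset > 0) ? +1 : -1;
        return;
    }
    if ((offset > 0) == (t->direction > 0)) {
        t->confidence++;
        if (t->confidence >= 2) {                /* enough evidence: get ahead */
            long target = offset + 8 * t->direction;
            while ((target - t->prefetched_to) * t->direction > 0) {
                t->prefetched_to += t->direction;
                issue_prefetch(t->base + t->prefetched_to);
            }
        }
    }
}
```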
One disadvantage of this prefetcher is that it's limited to prefetching close-by addresses. You're looking at a window, say between A and A plus 64; what about large jumps? What if you jump to A plus 1024? That could be a different page, and these prefetchers are usually not good at detecting it; they treat it as a different region. It's good to think about those cases, but in practice they work quite well. This paper from Intel in 2004 shows significant performance improvements with this sort of prefetcher, and I can vouch for those improvements because I've done a lot of these studies myself. I wanted to point this out because it's not written up much in academic papers; other than some of the papers we and Intel have written, I haven't seen this sort of prefetcher examined much in academic work. You can think of it as a more chaotic way of designing a prefetcher, maybe. Okay, so what about more complex access patterns? We've covered simple regular patterns, where stride and stream prefetchers work well, and somewhat simple irregular patterns, where locality-based prefetching works well; let's look at complex regular patterns too. This is interesting because if you can design a prefetcher that handles complex regular patterns nicely, maybe you reduce the need for locality-based prefetching, which is bandwidth intensive, as we've discussed: if you really do have multiple regular strides that look like this, you should just use that information, and you can detect it. Beyond that you can have irregular patterns: linked data structure traversals, indirect array accesses, multiple data structures accessed concurrently, and maybe random accesses; we'll talk about some of these. Let's talk about multi-stride detection. I'll spend only a little time on this, but the basic idea is that the pattern repeats, and you can design hardware to capture it; the hardware can be similar to correlation-based prefetching, which we'll cover soon. What you do is find a correlation in the address deltas: after 7 comes minus 6, after minus 6 comes 12, after 12 comes 6, after that comes minus 5. These are deltas from the previous address, and the pattern repeats. If you somehow learn, or memorize, or decide based on past experience, that 7 is followed by minus 6, minus 6 is followed by 12, and 12 is followed by 6, then the next time you see 7 you can guess that minus 6 is coming. You can make this more accurate by correlating on a longer history: after you see 7 and then minus 6, the next delta is 12, so you correlate the pair (7, minus 6) with 12. You can build a table out of this using hashing, and you can increase the length of the history you correlate on. So this is a complex but predictable set of strides, and people have developed prefetchers for exactly this. The key idea: given a history, a signature, of strides, you learn and predict which stride is likely to come next; after you see 7, minus 6, 12, you predict that 6 will come next, and after you see minus 6, 12, 6, you predict that minus 5 will come next.
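Here is a minimal sketch of such a delta-correlation table, keyed by the last two deltas; the table size, the hash function, and the choice to store a single successor per entry are all illustrative simplifications.

```c
#define DELTA_TABLE_SIZE 256   /* illustrative; must be a power of two */

struct delta_entry { long d1, d2, next_delta; };
static struct delta_entry delta_table[DELTA_TABLE_SIZE];

static unsigned hash_deltas(long d1, long d2) {
    return (unsigned)(((unsigned long)(d1 * 31 + d2)) & (DELTA_TABLE_SIZE - 1));
}

void issue_prefetch(unsigned long block_addr);

/* Called with the last three deltas observed and the current block address. */
void delta_train_and_predict(long d1, long d2, long d3, unsigned long addr) {
    /* Learn: "after (d1, d2) comes d3". */
    struct delta_entry *e = &delta_table[hash_deltas(d1, d2)];
    e->d1 = d1;  e->d2 = d2;  e->next_delta = d3;

    /* Predict: what tends to follow the most recent pair (d2, d3)? */
    struct delta_entry *p = &delta_table[hash_deltas(d2, d3)];
    if (p->next_delta != 0 && p->d1 == d2 && p->d2 == d3)
        issue_prefetch(addr + (unsigned long)p->next_delta);
}
```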
I'm not going to go through the implementation, but there's a particular implementation in this paper where, after seeing a history of strides and making a prediction, you can use that predicted stride to predict the one after it, and so on. Of course your confidence gets lower and lower as you keep chaining predictions speculatively, but this can still be a pretty good way to get ahead, and at some point you say, my confidence is too low, I'm going to stop prefetching. You can imagine the implementations, which we won't go into; going through the implementation of every prefetcher would take a long time. Any questions? Yes, path correlation or path confidence, I don't remember exactly what they called it; I call it delta correlation. Like 7, minus 6, 7, minus 6, repeating? Then you adapt to whatever repeats later; you keep updating your tables, and if you always see, say, 7, minus 6, 4 and then 7, minus 6, 10, alternating, you can associate two different next deltas with the history 7, minus 6. I think this is a generalization of Markov prefetching, which I'll talk about next, generalized to address deltas instead of pure addresses. Yes, the tables are not too big; they're much smaller than caches, because they don't hold data, they hold deltas, and deltas are usually small; you don't get huge deltas, and you may simply decide not to track huge ones. So this sort of prefetcher is reasonably small. Of course, if you start training a neural network or some machine learning model, that becomes expensive; these are really table-based mechanisms, not learning-based ones yet, but there is work on neural-network-based prefetching, and we'll talk about reinforcement learning, and those do become more and more expensive. Good to think about. Okay, let's talk about more irregular access patterns. I've given you the idea of correlation-based prefetching, but I'll cover it a bit more; then I'll talk about content-directed prefetching, which I think is a nice, creative idea, and then about covering irregular accesses more generally with pre-computation, or execution-based, prefetchers. So, address correlation. Earlier you saw delta correlation; now assume you observe the following cache block addresses (I won't read them out). What you can establish is that, after referencing a particular address, say A or E, some addresses are more likely to be referenced next than others. From this you can construct a graph, a Markov model if you will, that connects which addresses tend to follow which other addresses and assigns probabilities to the transitions: for example, after you see A, based on the past, with 60 percent probability you'll get B next and with 20 percent probability you'll get A again. You can construct this online or offline, and that's also how you can do delta correlation: forget the addresses and do the same thing on the deltas.
Okay, so once you've constructed this, you can record the likely next addresses after a given address: you see address A, and B, C, D are the possible next ones, so you record all of them (I'm making these up; they have nothing to do with the previous slide). Next time A is accessed, you can prefetch B, C, and D. That could be expensive, so you can use the probabilities: with 80 percent probability I'm going to access B, so I prefetch only B, not C and D. Or you can say, I have some extra bandwidth on my bus, so I'll prefetch all of them; or, I don't have extra bandwidth, so I won't prefetch the low-confidence ones. If you know both the set of likely successors of an address and a confidence for each, you can play a lot of tricks. Clearly this is very costly, because you're storing large addresses, and if your address set is huge you have a problem; what delta correlation does is reduce that set to deltas: you may have a huge address set but a much smaller set of deltas. Okay, so in this case A is said to be correlated with B, C, D, and you can prefetch up to N next addresses to increase the coverage of prefetching, so you cover more misses, and you can improve prefetch accuracy by using multiple addresses as the key: instead of prefetching based just on seeing A, you prefetch based on seeing A and then B after A, so you correlate the key AB with the address that follows it, C; again, you can use probabilities and so on. This was proposed in ISCA 1997, and it influenced a lot of later work, including the delta-correlation-based prefetchers; I'm not sure they credit this paper as much as they should, maybe Rahul knows, but I think it's essentially the same thing, and the innovation in delta-correlation-based prefetchers is really reducing the amount of address state that has to be stored. These are also called Markov prefetchers, so let's talk about Markov prefetchers as originally proposed. The big advantage is that they can cover arbitrary access patterns, for example linked data structures; they can cover essentially anything, as long as you've seen it before. If you're going through a linked list A, B, C, D, E, F, G, the addresses of the linked list nodes are effectively random, with no predictable pattern, but you traverse it once, record it, and the next time you traverse it, all of them can be prefetched. That's the beauty of it. It can also cover streaming access patterns, but not efficiently: if you access A, A+4, A+8, A+12, A+16 with a stride of four, you can do that with Markov prefetching too, except it's a terrible way to do it; with a regular stride you should just use the stride, whereas here you need to store everything, an entry correlating A with A+4, A+4 with A+8, A+8 with A+12, A+12 with A+16. You can see the hardware requirements grow significantly, whereas delta correlation would reduce them, very much like a stride-based prefetcher. That's the beauty of delta correlation: address correlation is expensive, delta correlation is cheap.
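To make the table concrete, here is a small sketch of an address-correlation (Markov) table: each trigger address remembers a few successor addresses with small counters that roughly stand in for the transition probabilities. The table size, the indexing, and the confidence threshold are illustrative.

```c
#define MARKOV_ENTRIES 1024   /* illustrative table size                     */
#define SUCCESSORS     4      /* successors remembered per trigger address   */

struct markov_entry {
    unsigned long trigger;
    unsigned long next[SUCCESSORS];
    unsigned      count[SUCCESSORS];   /* rough stand-in for probabilities  */
};
static struct markov_entry markov_table[MARKOV_ENTRIES];

void issue_prefetch(unsigned long block_addr);

/* Training: we just observed cur_addr immediately after prev_addr. */
void markov_observe(unsigned long prev_addr, unsigned long cur_addr) {
    struct markov_entry *e = &markov_table[prev_addr % MARKOV_ENTRIES];
    if (e->trigger != prev_addr)
        *e = (struct markov_entry){ .trigger = prev_addr };  /* evict on conflict */
    for (int i = 0; i < SUCCESSORS; i++) {
        if (e->next[i] == cur_addr || e->count[i] == 0) {
            e->next[i] = cur_addr;
            e->count[i]++;
            return;
        }
    }
}

/* Prediction: prefetch the recorded successors of the current address. */
void markov_predict(unsigned long cur_addr) {
    struct markov_entry *e = &markov_table[cur_addr % MARKOV_ENTRIES];
    if (e->trigger != cur_addr)
        return;
    for (int i = 0; i < SUCCESSORS; i++)
        if (e->count[i] > 1)               /* skip low-confidence successors */
            issue_prefetch(e->next[i]);
}
```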
That's why delta correlation is implemented in existing processors and address correlation is not, which also goes back to your earlier question. A correlation table needs to be very large for high coverage, which is clearly a disadvantage, and recording every miss address together with its subsequent miss addresses is infeasible; delta correlation solves part of that, but it gives up some things in return. While we're comparing the two: the correlation patterns need to repeat to be predictable; you cannot prefetch truly arbitrary, non-repeating patterns. The beauty of address correlation is that the pattern itself can be completely random, but if it repeats, you can prefetch everything; there doesn't need to be any structure in the deltas, so address correlation is strictly stronger in that sense, whereas delta correlation may fail to capture such a pattern. Timeliness can also be limited, since the prefetch for the next access or miss is initiated only right after the previous one; you can improve timeliness by prefetching several steps ahead, as we've discussed. And it can consume a lot of memory bandwidth, especially when the model's probabilities are low: if after address A you've seen ten different next addresses, all with low probability, prefetching them all is wasteful. Still, it's a very interesting idea. One more disadvantage I'll mention: it cannot reduce compulsory misses; you have to have seen the address correlation at least once. Delta correlation may reduce compulsory misses, because you only need to have seen the same deltas before, not the same addresses. Hopefully that's clear; any questions? Otherwise let's move on to another idea. Let's talk about content-directed prefetching. As I said, we need more creative ideas, and I think this was a creative one, developed in the late 1990s. The idea is a specialized prefetcher for pointer values; there are many ideas for linked data structures and pointer values, and I'm picking this one because I find it very interesting. The idea is simple: identify the pointers among all the values in a fetched cache block and issue prefetch requests for them. Sounds good, but what does that mean? You're fetching a cache block; you scan it and figure out which parts of the data look like addresses, meaning pointers, and then you issue prefetch requests for them. It's described in this paper. If you do this, you get rid of one problem we discussed: there's no need to memorize or record past addresses. You just scan the cache block you're fetching and prefetch; then you can scan the blocks you prefetched, look for pointers there, and prefetch more, and so on, so your prefetch depth can be much longer without recording anything. It can even cover compulsory misses, pointers you've never followed before, even when the deltas are not predictable. The downside is that it indiscriminately prefetches all the pointers in a cache block; we're going to try to fix that with more information about the software's data structures, or with some hardware profiling.
A key question, then, is how to identify pointer addresses. One way is for the software to say, this is a pointer; software can potentially do that with some accuracy, depending on the language or the runtime system, but that means communicating more information between software and hardware. The original paper had a cute mechanism instead. The basic idea: you look at the cache block, and you have the virtual address of that cache block; you need the virtual address because pointers live in the virtual address space. Then you look at all the address-sized values inside the cache block and compare the most significant bits of the block's virtual address to the most significant bits of each address-sized value; if they match, you say, aha, I found a pointer. This is clearly a heuristic, but it's not a bad one, because pointers into the same virtual address space tend to share their high-order bits. It's called a virtual address predictor: it predicts whether a value inside the cache block is a virtual address or not. So, assuming this is the virtual address of the cache block you're bringing in (you need to store it somewhere), you take the address-sized values in the block and compare the top, say, 12 bits of each to the top 12 bits of the block's address. In this particular example only one of them matches, the others don't, and you issue a prefetch request for that one. Makes sense? This turns out to be a decent heuristic; I've used it in some of my own other works. But what if you find a lot of pointers, and what if a given pointer is not very useful? Then this can be quite bandwidth intensive. So we did some work to provide more information. The problem is that hardware doesn't have enough information about the pointers: hardware can guess what a pointer is, but it has difficulty guessing whether prefetching the address that pointer points to is actually going to be useful. You could make the hardware more complicated, profile that information, and build confidence, but it takes a lot of effort; software, on the other hand, has that information, or can profile to get it. So the idea in this work is to have the compiler profile or analyze the code and provide hints about which pointer addresses are likely useful to prefetch, and the hardware uses those hints to prefetch only the likely-useful pointers: in a cache block, some pointers are marked as likely useful (the implementation is a bit different, as you'll see), and the content-directed prefetcher prefetches only the cache blocks pointed to by those pointers.
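Here is a minimal sketch of that virtual-address-predictor heuristic; the 12-bit comparison matches the number used on the slide, while the block layout (eight 8-byte values per 64-byte block) and the function names are assumptions of this sketch.

```c
#include <stdint.h>

#define WORDS_PER_BLOCK 8     /* 64-byte block seen as eight 8-byte values */
#define TOP_BITS        12    /* how many high-order bits to compare       */

void issue_prefetch(uint64_t virt_addr);

/* Scan a fetched cache block for values that look like pointers: if the
   top bits of a value match the top bits of the block's own virtual
   address, guess that the value is a pointer and prefetch what it
   points to. */
void scan_block_for_pointers(uint64_t block_vaddr, const uint64_t *block) {
    uint64_t mask = ~0ULL << (64 - TOP_BITS);
    for (int i = 0; i < WORDS_PER_BLOCK; i++) {
        uint64_t v = block[i];
        if (v != 0 && (v & mask) == (block_vaddr & mask))
            issue_prefetch(v);
    }
}
```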
Let me give you some motivation; the implementation is in the paper and I'm not going to go through it, but look at this interesting data structure. This is a cache block containing some pointers, and each pointer points to a linked list of keys; this is really a hash table. The hash table has pointers to its buckets, each bucket is a linked list, a linked list can be quite long, and its nodes are clearly in different cache blocks once you start following it. The node structure is a key, then two data pointers, and then a next pointer to the next element in the linked list. I don't know why the animations aren't working; Rahul, you didn't fix the animations last year? It's fine, don't worry; Rahul gave this lecture while I was on sabbatical, and I've always had trouble with these animations. The basic point is what the layout of this cache block looks like. Each node has three pointers, the next pointer and the data-one and data-two pointers, plus the key; laid out sequentially, a node looks like key, data-one pointer, data-two pointer, next, and let's say two consecutive linked-list elements are laid out back to back in the same cache block. Now, if you knew this, you would probably say, I don't want to prefetch the data pointers, because, depending on the data structure of course, usually only one key will match in a hash table lookup, and if you have a reasonable design your linked list is reasonably long; so you probably don't want to fetch D1 and D2 for every single node. But you probably do want to fetch the next pointer, because the likelihood of going to the next element is much higher than the likelihood of actually needing the data-one and data-two pointers, which you touch only when the key matches. That's the idea. If you don't have this information, all of these are pointers and the content-directed prefetcher prefetches the six cache blocks they point to; if you do have this information, you direct the content-directed prefetcher, through the software, to prefetch only through the next pointer. You can mark this on the load instruction, the load that fetches the next pointer; we have a way of marking it with a bit vector, which I won't go through, but you can specify which of these addresses are likely to be useful. Here I explained it from data-structure knowledge; the paper is more general: you profile the workload, figure out which pointers are more likely to be used, and set up your bit vectors accordingly. I'll skip the animations; you basically prefetch only through the next pointer. I think this is very interesting, and there need to be more approaches like it; it complicates the mechanism, but it solves a real problem with content-directed prefetching.
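To make the layout argument concrete, here is the node from the example written as a C structure, with a software-style prefetch of only the next field standing in for the hint the compiler would hand to the content-directed prefetcher; the field names and the lookup logic are hypothetical.

```c
/* Hypothetical hash-bucket node: key, two data pointers, next pointer. */
struct hnode {
    long          key;
    void         *d1;      /* data pointer 1: useful only on a key match */
    void         *d2;      /* data pointer 2: useful only on a key match */
    struct hnode *next;    /* followed on almost every iteration         */
};

int consume(void *d1, void *d2);   /* hypothetical handler for a match */

int lookup(struct hnode *p, long key) {
    while (p) {
        /* Prefetch only through the next pointer: on most iterations the
           key does not match, so d1 and d2 are never dereferenced,
           whereas next almost always is. */
        __builtin_prefetch(p->next);
        if (p->key == key)
            return consume(p->d1, p->d2);
        p = p->next;
    }
    return 0;
}
```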
In general they have some downsides and upsides, but people do like to use these sorts of prefetchers, so if you come up with creative ideas they may get adopted. In fact, I think there was some recent work that figured out some security side-channel issues with these content-directed prefetchers on some phone models — do you know that recent work? Exactly, yes — some sort of side channel, right. Okay, so it's good to think about this; maybe we should put that on the slides next year, or later this year, so that we can actually use it next year, because by next year it'll be forgotten — it'll be out of the cache. [audience question] Yeah, that's a valid concern. Basically, these sorts of interactions have not been explored as much. They could — but we will actually talk about prefetch throttling: if you have a good throttler, it could detect that some of the prefetching you're doing may not be useful, so it could throttle the application that's generating those prefetches. I think that could be solved indirectly by doing good throttling, but maybe there could be other solutions also. Okay. So, hopefully by now you've seen that there's no single prefetcher that covers all the access patterns, so existing systems actually use hybrid prefetchers. They not only use different prefetchers at different levels, which can be customized for the access patterns at those levels, but a single level may have multiple prefetchers as well. Clearly this gives you better prefetch coverage: one prefetcher may be working on easy-to-predict access patterns, another prefetcher could be a correlation prefetcher, another could be a content-directed prefetcher, and you put all of them together and hopefully you cover a lot of the access patterns. You can potentially get better latency this way, because you don't need to design a single monolithic prefetcher that covers everything; you design customized prefetchers for different access patterns that you can more easily make timely. A small sketch of this composition follows below. But of course, as with the other kinds of heterogeneity we've seen earlier, heterogeneity of prefetchers also leads to more complexity: there are many design and optimization decisions, it could be more bandwidth intensive, and now prefetchers can also interfere with each other. They will of course also interfere with demand requests, but they start interfering with each other as well, so they can cause contention and pollution, and now you may need to manage accesses from each prefetcher. But this is actually the state of the art: essentially you have hybrid prefetchers, and a lot of the ideas that we have discussed — at least better incarnations, better implementations of them — are in existing systems. Yet we still have problems; they don't solve the entire problem, so that's why this is an area for more exploration. Okay, should we take a break, or should we go for 10 minutes and then stop? What do people think? No break? Okay, who votes for no break? Who wants a break? Who wants to stop right now? Okay, I see only no-breakers — I mean, I won't be offended if you say break or stop right now. Okay, let's try to cover a little bit more and then we'll continue. But are there any questions before we cover prefetcher performance? This should be relatively easy because we covered a lot of the metrics already; I just want to put them together. Okay, no questions, so let's talk about prefetcher performance a little bit more.
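As a loose illustration of the hybrid idea, here is a small sketch — the class names and the next-line/stride components are assumptions for illustration, not any real design — where several simple prefetchers each propose candidate addresses for the same demand access and the hybrid merges their proposals.

```python
# Hypothetical sketch of a hybrid prefetcher at one cache level:
# each component proposes candidate addresses for a demand access, and the
# hybrid unions the proposals (a real design would also arbitrate for
# bandwidth and prioritize among components).

BLOCK = 64  # cache-block size in bytes (assumed)

class NextLinePrefetcher:
    def propose(self, addr):
        return [addr + BLOCK]                 # easy-to-predict sequential pattern

class StridePrefetcher:
    def __init__(self):
        self.last_addr, self.stride = None, None
    def propose(self, addr):
        out = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride and stride != 0:
                out = [addr + stride]         # confirmed stride: prefetch ahead
            self.stride = stride
        self.last_addr = addr
        return out

class HybridPrefetcher:
    def __init__(self, components):
        self.components = components
    def propose(self, addr):
        candidates = []
        for p in self.components:
            candidates += p.propose(addr)
        return sorted(set(candidates))        # merged, deduplicated candidates

hybrid = HybridPrefetcher([NextLinePrefetcher(), StridePrefetcher()])
for a in [0, 128, 256, 384]:                  # a strided demand stream
    print(hex(a), "->", [hex(c) for c in hybrid.propose(a)])
```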
So this slide puts a lot of things together. We discussed accuracy: used prefetches divided by sent prefetches. We discussed coverage: what fraction of all of the misses have you prefetched? We discussed timeliness: what fraction of the prefetches are on time — what fraction of the used prefetches are on time, where used prefetches are prefetches that are actually demanded by the processor. Timeliness can of course be defined another way, and even coverage can be defined another way; you could have a combined metric — earlier we were discussing that you could use stall-time reduction due to prefetching, that could be another metric — but these three metrics are actually good for characterizing prefetchers. I like a timeliness definition that's not binary. This timeliness definition says whether a prefetch is in the cache at the time you use it or not, so it's binary; but you could instead determine timeliness based on what fraction of the miss latency you have covered — it's harder to measure, but that could be part of your timeliness metric for a given miss. [audience question] Yes, you could, but that may not mean much; to approximate the effect of that, we look at cache pollution, for example — I didn't talk about that yet. Usually you can design a prefetcher that's good at two of these metrics, but you cannot get all three, which is a bit unfortunate; this is one of those cases where you have three metrics but you can only get two — you can think about that. Usually improving accuracy means you reduce coverage; improving timeliness usually goes well with coverage but not with accuracy, so accuracy is usually on the other side of coverage and timeliness, though there could be other trade-offs too. And then there are other metrics we should be concerned about, as we discussed earlier. Bandwidth consumption: memory bandwidth consumed with the prefetcher divided by without the prefetcher. Usually it's greater than one; it could be less than one depending on how we account for memory bandwidth, like memory throughput — if your prefetcher is actually nice and exploits row-buffer locality, maybe you actually schedule requests in a nicer way, as we've discussed earlier. But if you're counting memory requests as opposed to memory bandwidth, then memory requests are usually greater than one with prefetching — though not always, so keep that in mind, especially if you're running multi-threaded programs where one thread is effectively prefetching for another thread; then you can actually get low bandwidth consumption sometimes. The good news is you can utilize idle bus bandwidth if it is available. It's also good to think about something people have not thought about as much: we're talking about bandwidth consumption from a processor perspective, but we have seen a lot of ideas in DRAM to improve throughput, for example subarray-level parallelism. If you could actually exploit subarray-level parallelism in the DRAM substrate for prefetching purposes, maybe you could do a lot of interesting things — I'm just throwing that idea out here so that people think about it. Okay, so that's bandwidth consumption; clearly not good for energy either. I didn't put energy explicitly over here, but energy is a metric that is affected by essentially all of these. Cache pollution: cache pollution is the extra demand misses caused by placing prefetches in the cache.
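To keep the definitions straight, here is a tiny sketch — the counter names are mine, and the example numbers are made up — computing the three prefetcher metrics plus bandwidth consumption exactly as defined above.

```python
# Minimal sketch of the prefetcher metrics as defined in the lecture.
# A simulator would collect these counters during a run.

def accuracy(used_prefetches, sent_prefetches):
    # fraction of issued prefetches that the processor actually demanded
    return used_prefetches / sent_prefetches

def coverage(prefetched_misses, total_misses_without_prefetching):
    # fraction of all would-be misses that the prefetcher eliminated
    return prefetched_misses / total_misses_without_prefetching

def timeliness(on_time_used_prefetches, used_prefetches):
    # binary definition: a used prefetch is "on time" if it is already
    # in the cache when the demand access arrives
    return on_time_used_prefetches / used_prefetches

def bandwidth_consumption(bw_with_prefetcher, bw_without_prefetcher):
    # usually > 1: the prefetcher adds memory traffic
    return bw_with_prefetcher / bw_without_prefetcher

# Made-up example: 800 of 1000 prefetches used, 600 of 900 misses covered,
# 500 of the used prefetches on time, 1.2x the baseline traffic.
print(accuracy(800, 1000), coverage(600, 900), timeliness(500, 800),
      bandwidth_consumption(1.2, 1.0))
```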
Right, so this is kind of getting at what you were trying to get at earlier: if you place a prefetch early in the cache, the likelihood that you'll get an extra demand miss because you did that early is probably higher. This is a bit more difficult to quantify, but it does affect performance, and there's a quantification method in the papers that I'm not going to go into right now. So prefetcher aggressiveness affects all of these metrics, and we've talked about aggressiveness. Aggressiveness depends on the prefetcher type; for most hardware prefetchers you have a prefetch distance, which we've kind of defined earlier — how far ahead of the demand stream, the processor access stream, the prefetcher runs — and then we have prefetch degree — how many prefetches you generate per demand access. Let me define these in a nicer way. You have an access stream — you access address X — and then you have a predicted stream, which is the stream the prefetcher predicted and is prefetching ahead of, so you have a prefetch distance; here you have Pmax over here. Now your prefetch distance can be small — you're not far ahead of the processor, so you're a very conservative prefetcher — your prefetch distance may be middle of the road, or your prefetch distance may be very aggressive. Clearly these have trade-offs in how they affect accuracy, coverage, and timeliness: as you get very aggressive, you can improve your coverage and timeliness at the expense of accuracy. Okay, so that's the prefetch distance. Prefetch degree: let's assume you have some prefetch distance; when the processor is accessing X you prefetch P, and when the processor accesses X+1, the prefetch degree says how many further blocks you prefetch. In this particular case I'm giving you an example of stream-based prefetching, clearly, but you can think about a prefetch degree for content-directed prefetching also — it's called depth in the content-directed prefetching papers. Okay, so how do these metrics interact? A very aggressive prefetcher — large prefetch distance and potentially large prefetch degree — versus a very conservative prefetcher, which has small prefetch distance and small prefetch degree. Very aggressive prefetchers are nice because they can be well ahead of the load access stream, so they can hide memory access latency better, but they're more speculative. Very conservative prefetchers are essentially exactly the opposite — I'm not going to read this — and they reduce the potential for cache pollution and bandwidth contention because they're less aggressive. So essentially, aggressiveness increases coverage and probably provides better timeliness, but likely lower accuracy, higher bandwidth, and higher pollution, whereas conservative prefetchers likely have higher accuracy, lower bandwidth, and less pollution, but they may not cover many misses and they may not be very timely. And whenever you see something like this, you want to think about getting the best of both worlds, right? Just to give you an idea, these are some studies that we have done. There's a good correlation between accuracy and performance degradation: if your prefetch accuracy is very low, like 10%, you can lose 20% performance — that's a lot of performance loss. This doesn't mean that if your accuracy is high you're going to get high performance, as you can see: here the accuracy is high, like almost 100%, but the performance improvement is, say, I don't know, 5%.
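Here is a rough sketch — the block size and parameter names are assumptions — of a stream prefetcher parameterized by prefetch distance and degree, to show how the two knobs differ: distance sets how far ahead of the demand address the prefetched region starts, degree sets how many blocks are issued per demand access.

```python
# Hypothetical stream-prefetcher sketch showing distance vs. degree.
BLOCK = 64  # cache-block size in bytes (assumed)

def stream_prefetch(demand_addr, distance, degree):
    """Return the block addresses to prefetch for one demand access.

    distance: how many blocks ahead of the demand stream the prefetches start
    degree:   how many consecutive blocks are prefetched per demand access
    """
    start = demand_addr + distance * BLOCK
    return [start + i * BLOCK for i in range(degree)]

# Conservative: small distance, small degree -> accurate but low coverage.
print([hex(a) for a in stream_prefetch(0x1000, distance=1, degree=1)])
# Very aggressive: large distance, large degree -> better coverage/timeliness,
# but more speculative, more bandwidth, more potential pollution.
print([hex(a) for a in stream_prefetch(0x1000, distance=8, degree=4)])
```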
Maybe it's not getting good coverage or good timeliness over there, or maybe the potential is not very high. [audience question] Yes, so there could be — well, here we're looking at single core; in single core you're probably less likely to see that, but if you go to multicore your accuracy may be high, yet because you're very aggressive you may be kicking out a lot of useful blocks, or you may be delaying a lot of useful requests from other cores. So definitely — if you read the later papers that I'm going to reference but not go into today, you will see that an accurate prefetcher can actually degrade the performance of other applications as well as system performance. It's less likely to see that in single core; you need to really construct the case, but it is possible, because the system is very complex — you may actually cause some row-buffer conflicts that were not there without prefetching — but you don't see it as exacerbated. Those are very good questions; I think this requires more understanding of the interaction between the prefetcher and the DRAM system, and maybe the core microarchitecture, as I mentioned earlier. Okay, so let me give you some data over here. These are results from a paper we wrote a number of years ago. These bars are instructions per cycle for single-core applications; blue is no prefetching, and then we look at very conservative, middle-of-the-road, and very aggressive prefetching. In some cases very aggressive prefetching is very good: for some workloads you can see it is better than any other option, including no prefetching. But for some workloads very aggressive prefetching is very bad; some of these workloads prefer no prefetching or very conservative prefetching — this is with a locality-based prefetcher and also a global-history-based prefetcher, as we show in the paper, but I'm not going to talk about the global-history-based prefetcher. So if you see something like this, ideally you would like to get the best of both worlds: even though very aggressive prefetching improves performance on average over here, it degrades performance significantly on some applications. You don't want to use the same aggressiveness level for all workloads — and we're not even talking about multicore here yet. That's why the idea of feedback-directed prefetcher throttling was developed. The idea is very simple: you dynamically monitor prefetcher performance metrics and throttle the prefetcher aggressiveness up and down based on past behavior. That's one way of modulating it; the other way is to change the location where prefetches are inserted in the cache, also based on past behavior. So, for example, you identify that prefetches are not that accurate, and you make the prefetcher less aggressive. It's a bit more sophisticated than that in this paper: there are a bunch of thresholds; you classify whether the prefetches are highly, medium, or lowly accurate, then you look at timeliness, and then you look at the pollution the prefetches are causing. If you're highly accurate, you're not late, but you're polluting, we decided decreasing the aggressiveness is a good idea; if you're highly accurate but late, we decided increasing the aggressiveness is a good idea, so that you become timely, for example. And then you can fill out this table — and you can argue with some of these decisions; you could use a machine learning model to derive this table, and so on.
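Here is a rough sketch of the flavor of that decision process — the thresholds, level names, and exact rules are made up for illustration and are not the actual table from the paper:

```python
# Illustrative-only sketch of feedback-directed prefetcher throttling:
# classify accuracy / lateness / pollution each interval and nudge the
# aggressiveness level up or down. Thresholds and rules are invented here.

LEVELS = ["very_conservative", "conservative", "middle", "aggressive", "very_aggressive"]

def classify_accuracy(acc):
    return "high" if acc > 0.75 else "medium" if acc > 0.40 else "low"

def throttle(level, acc, late_fraction, polluting):
    """Return the aggressiveness level to use in the next interval."""
    i = LEVELS.index(level)
    a = classify_accuracy(acc)
    if a == "high" and late_fraction > 0.10 and not polluting:
        i = min(i + 1, len(LEVELS) - 1)   # accurate but late: ramp up to become timely
    elif a == "high" and polluting:
        i = max(i - 1, 0)                 # accurate, not late, but polluting: back off
    elif a == "low":
        i = max(i - 1, 0)                 # inaccurate: always back off
    return LEVELS[i]

# One interval: 85% accuracy, 20% of used prefetches late, no pollution detected.
print(throttle("middle", acc=0.85, late_fraction=0.20, polluting=False))
```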
We didn't do that at the time — we're talking about 2007, when the paper was published — but if you do this, if you come up with a mechanism to modulate prefetchers, then you can get better performance, essentially. That's what we see over here: we've curbed the performance losses. The purple bar is feedback-directed prefetching with dynamic aggressiveness and dynamic insertion policy; you can see that on average it's better than either dynamic insertion or dynamic aggressiveness alone, and the comparison point is really the single very aggressive prefetcher. So on average we're much better than prior work, and you can also see that in cases where very aggressive prefetching doesn't do well, we do much better than very aggressive prefetching as well as no prefetching. There's a bigger version of this in the paper. There are some cases where our solution is not as good as the very aggressive prefetcher, so there's room for improvement, but this curbs a lot of the downsides of prefetching. Unfortunately, this doesn't directly apply to multicore; we're going to talk about that maybe tomorrow. But I should mention one more thing before we part, which is bus bandwidth. We also measure memory bus accesses per 1,000 retired instructions — it's called BPKI, bus accesses per kilo instruction. This includes all of the effects of prefetching: the L2 demand misses as well as pollution-induced misses and the prefetches themselves. Ideally you would like this to be the same as the baseline, or even lower if possible — that's tough to achieve unless you're doing multithreading, which we're not doing here. This is really a measure of bus bandwidth usage; a small sketch of the metric follows below. Let's look at how different prefetchers do in our measurements: IPC, instructions per cycle, and bus bandwidth, bus accesses per kilo instruction. With no prefetching, IPC is low; as you add conservative, middle-of-the-road, very aggressive, and feedback-directed prefetching, IPC keeps increasing — that's good, so feedback-directed prefetching improves IPC compared to all of them on average. Bus accesses per kilo instruction also keep increasing, and very aggressive prefetching is not good there; FDP on average gets you similar traffic to middle-of-the-road. So if you look at the efficiency of FDP — you can do the calculations — its efficiency is quite good: it gives you a lot of performance boost compared to both middle-of-the-road and very conservative prefetching, without as much increase in bus accesses. Increasingly that's going to be important, but one could argue that you still don't want 10.88 — you still want lower — and that's one of the goals in prefetching: it's not only about performance, it's also about bandwidth consumption and energy. Okay, that's the paper. This is probably a good place to stop, but I should say that this solution doesn't directly apply if you take it and put it into a multicore processor — you don't get a lot of benefit, as this work and later works show. The reason is that in multicore you have a lot more contention: you may throttle your own prefetcher alone and it may not actually impact performance that much, or you may make your prefetcher very aggressive and that hurts the performance of the other prefetchers. You really need a mechanism for coordinated prefetching across the different cores that share resources, and later works actually developed such coordinated mechanisms.
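Since BPKI is just a ratio, here is a one-function sketch — the configuration names and all numbers are made up for illustration — showing how it is computed and how you might compare prefetcher configurations on both IPC and BPKI.

```python
# Sketch of the BPKI metric from the slide: bus accesses per 1,000 retired
# instructions, counting demand misses, pollution-induced misses, and prefetches.

def bpki(bus_accesses, retired_instructions):
    return 1000.0 * bus_accesses / retired_instructions

# Made-up example: compare configurations by performance (IPC) and traffic (BPKI).
configs = {
    "no_prefetch":     {"ipc": 0.80, "bus": 7_500_000},
    "very_aggressive": {"ipc": 1.10, "bus": 13_000_000},
    "fdp":             {"ipc": 1.12, "bus": 10_900_000},
}
instructions = 1_000_000_000
for name, c in configs.items():
    print(f"{name:16s} IPC={c['ipc']:.2f}  BPKI={bpki(c['bus'], instructions):.2f}")
```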
We may talk about those tomorrow. Any burning questions? Yes? [audience question] Yeah — never say impossible, but yes, it's certainly important to consider the hardware overheads, I agree. Some works are, in my opinion, more theoretical in the sense that they try to understand the potential of using machine learning, so yes, they may not be realistic in their current implementation or on current hardware — I agree with that — but they may still have value in the sense that they show the potential of what you could get, assuming you could at some point implement it. Okay, I'll see you tomorrow.