Can you hear me on Zoom? Okay, let's get started. Hello everyone, welcome back to another lecture in Digital Design and Computer Architecture. Today Professor Mutlu is traveling to IMW, the International Memory Workshop, so I got the opportunity to teach this exciting lecture about caches.

First, to jog your memory, here are some readings for yesterday's lecture and today's lecture; I recommend you check them. Recall that the problem we focused on in the previous lecture was that we wanted to design an ideal memory: zero access latency, infinite capacity, and so on. But we realized that the requirements of an ideal memory oppose each other: bigger is slower, faster is more expensive, and higher bandwidth is more expensive. The way we dealt with this was to design a memory hierarchy. We want both fast and large, so we have multiple levels of storage, progressively bigger and slower as the levels get farther from the processor, and we ensure that most of the data the processor needs is kept in the faster levels. That was the idea of the memory hierarchy we discussed yesterday, and this picture shows that we can have many levels of caches plus main memory; in a modern memory hierarchy we should also consider main memory itself and even remote memory accesses. So the memory hierarchy can be quite heterogeneous, with many different levels of memory.

Remember that the main reason this idea works is the temporal locality and spatial locality we usually observe in applications. Temporal locality means that if you access an address, you will very likely access it again in the near future; it is about the likelihood of accesses over time. Spatial locality means that if you access an address, you will likely also access adjacent addresses; for example, when you work with arrays and vectors, or when you fetch instructions from an instruction stream, you usually have a lot of spatial locality. That is why caching and the memory hierarchy have been so successful and so helpful.

We reached the point yesterday where we wanted to design a cache but could not continue because of the time limit. So we will start with a very simplified cache design, then discuss some important design decisions, and hopefully get to more advanced topics in memory hierarchy and cache design.

Let me redefine the word cache, because we are not only talking about the processor cache; caches are used in many different domains in computer systems. Essentially, a cache is any structure that memoizes used or produced data, in order to avoid repeating the long-latency operations required to reproduce or fetch the data from scratch. For example, you access memory to bring in data, and you put that data into your cache so that you do not need to pay that long memory access latency again.
But you can also think of it the other way around: you do a lot of computation to generate some result, and you memoize that value in a structure, say a table, so that instead of reproducing the result you can simply fetch it from the cache. As I said, caches are used in many parts of computer systems. One good example is web caching, which you might be aware of: when you open pages in your browser, the software tries to cache some of the material needed to render those pages, so that for future accesses it can show you the pages much faster, loading them from a structure stored in memory or on disk. Of course, some things must still come from the network, but web caching improves the user experience.

A cache, most commonly in the processor design context, is an automatically managed memory structure, but it can also be programmer managed, i.e., software managed. We will spend most of this lecture on the hardware cache, which is automatically managed by the processor and the hardware, but there are also software caches, managed by the programmer or partly by the compiler. Such a structure is called scratchpad memory in some contexts; in NVIDIA GPUs, which we covered in previous lectures, it is called shared memory. I will show some slides later, but essentially not all caches are hardware controlled; they can also be software controlled. So if you get an exam question like "is handling the cache an ISA issue or a microarchitecture issue?", your answer should be: it depends on how you interact with the cache design. Next week we will also learn about prefetching, and there are instructions you can use to prefetch data into the cache earlier than the moment you need it, which can provide better performance. So start thinking now about the fact that the cache is not always purely hardware controlled.

The example in our processor architecture is that we want to memoize, in fast memory (SRAM), the most frequently or recently accessed DRAM memory locations, to avoid repeatedly paying the DRAM access latency; that is essentially the hardware cache we are talking about. Here is a conceptual picture of a cache: you chunk your data into cache blocks. Usually, for hardware implementations, we want all blocks to have the same size in order to simplify the implementation, even though some researchers have looked into variable block sizes; overall, to keep the design simple, hardware caches use a single fixed cache block size. When you implement a cache in software you have much more flexibility, for example variable block sizes.
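As a small illustration of this general notion of a cache as a memoizing structure, here is a Python sketch (not from the lecture slides; the function names and the use of a plain dictionary as the "table" are my own illustrative assumptions): it stores results of an expensive computation keyed by the input, so repeated requests skip the long-latency work.

```python
# Minimal sketch of a software cache (memoization), assuming a plain dict
# is an acceptable stand-in for the "table" mentioned in the lecture.
import time

_results_cache = {}  # maps input -> previously computed result

def expensive_computation(x):
    time.sleep(0.1)            # stand-in for a long-latency operation
    return x * x

def cached_computation(x):
    if x in _results_cache:            # "hit": reuse the memoized value
        return _results_cache[x]
    result = expensive_computation(x)  # "miss": pay the full latency once
    _results_cache[x] = result         # insert into the cache for next time
    return result

if __name__ == "__main__":
    cached_computation(7)   # miss, slow
    cached_computation(7)   # hit, fast
```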
But that is another topic that unfortunately we do not have time to cover. In each cache block you have data, which can be 64 bytes, for example, as we mentioned in the previous lecture, and there is also some metadata that we call the tag, which lets you determine which address the data stored in the cache corresponds to. The tag is not the only metadata we store; we also need some other information, for example whether we have accessed this cache block, or whether it is valid. All of this metadata lives in the tag array.

Logical organization of a cache: the key question is how to map chunks of the main memory address space to blocks in the cache, or in other words, in which cache locations a given main memory chunk can be placed. Here is an example where each chunk is 64 bytes, so you chop the memory address space into equal-sized 64-byte chunks; the size of your memory determines how many such blocks exist. But the cache has limited space, so you cannot keep all of these blocks in the cache at the same time. One of the first fundamental questions to answer is how we map a chunk of the physical address space to the cache, and the answer brings us to three major ways of implementing a cache.

I will start with fully associative. In a fully associative cache you can map each chunk, regardless of its address, to any cache block in the cache. For example, if your cache is completely empty and you bring in a chunk, you can map it to cache block 0; a future access brings another chunk, say block 100, and you map it to cache block 1, and so on. There is no limitation except the cache capacity: a block can go to any location in a fully associative cache.

At the other extreme you have a direct-mapped cache: for each memory block, depending on its address, there is only one possible location in the cache. You use a part of the chunk (block) address to map it to one specific cache block. We will see that this mapping function is usually very simple; it could be made more elaborate, but people keep it simple, again for simplicity reasons, and that makes it very strict. If several memory blocks map to the same cache block, that causes the issue we will call conflict misses. For example, block 0 maps to this first cache block, and later you access block 4, which maps to the same location,
so the mapping function tells you that block 4 must also go to cache block 0, and you have to evict the block previously stored there to make room for the new one. Later, if you access block 0 again, you now have a conflict miss: you miss because you evicted that block due to a conflict caused by your earlier accesses.

Something in between is set associative. You have several sets in your cache, and you use a part of the address, which we will call the index, to map the address to a set; within that set you have the freedom to place your block in any of the locations. Here we show a two-way set-associative cache, but it can be four-way, or n-way in general. The extreme case of set associative is fully associative, where you have only one set. In fact, you can describe both fully associative and direct mapped using the set-associative definition: fully associative has a single set in which every block can be stored, and direct mapped is one-way set associative, i.e., many sets with only one block per set. Any questions? We will see more detail and examples, but I wanted to prepare your mind first.

Okay, let's start with some cache basics. First, the definition of a block, or line, which I have already been using: it is the unit of storage in the cache. "Cache block" and "cache line" are used interchangeably; unfortunately there is no single terminology, and even within one paper or one book you might see both. Essentially, memory is logically divided into blocks that map to potential locations in the cache. On a reference, i.e., when the processor requests a memory address (it could be a single byte), you have a hit if the block is in the cache, and you use the cached data instead of accessing memory; otherwise you have a miss: you cannot find the data in the cache, so you need to bring it into the cache from the lower levels, and that may require evicting some other block, because the capacity of your cache is not infinite and you may have to make room for the incoming block.

There are several important cache design decisions that we will look into in today's lecture. One of them is placement: where and how to place and find a block in the cache. Another important topic is replacement: what data to remove to make room in the cache. Replacement is about what to evict, while placement is about where to insert a block when you bring it into the
cache. Another important topic is the granularity of management: do you manage large blocks or small blocks, or even subblocks, which we will also see today. How you manage writes, i.e., the write policy, is another important topic: what do we do about writes? And instructions versus data: do we treat them separately or in a unified cache? We will hopefully cover all of these today. These are some of the most important cache design decisions, but there are many other design decisions people have looked into when designing memory systems, and these topics have interested a lot of people for several decades, all with the goal of improving cache performance.

Okay, let's start with the cache abstraction and some metrics. I think we have already seen the definition of hit rate. When you have an address from the processor, you send it to the tag store and the data store. Remember that each cache block has data and also some metadata that we call the tag; usually people use two separate arrays for them, a tag array (the tag store) and a data array (the data store). You send the address to both. The job of the tag store is to answer: is this address in the cache? Have I cached this address before or not? It uses some bookkeeping to do that, but in the end the result from the tag store is hit or miss, and depending on that, the data you fetch from the data store may or may not be useful. There are also designs, which we will learn about later, where you access the data store only when you have a hit; but at this abstraction level, think of accessing both: if it is a hit, the data coming out of the data store is valid, and if it is a miss, you should not use the data you are reading from the data store.

Cache hit rate is one performance metric: the number of hits divided by the number of hits plus the number of misses, i.e., hits divided by accesses. Using that, another important metric is the average memory access time (AMAT), which is hit rate times hit latency plus miss rate times miss latency. From the previous lecture you should know that the miss latency is itself hierarchical: depending on the hit and miss rates of the other levels of the hierarchy, you calculate the miss latency, and that affects the total average memory access time. There is an important aside here that I want to mention, without answering it yet, just to get you thinking: is reducing average memory access time always beneficial for performance? In other words, if I make reducing AMAT my main goal, will that improve the end-to-end execution time of my application or not? Because that is the bottom-line performance metric we actually care about.
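As a quick illustration of the AMAT formula just mentioned, here is a small sketch (the latencies and hit rates are made-up numbers, not from the lecture) that also shows how the miss latency of one level is, to a first approximation, the AMAT of the next level, since the hierarchy is recursive.

```python
# AMAT = hit_rate * hit_latency + miss_rate * miss_latency
# Hypothetical latencies (in cycles) and hit rates, for illustration only.

def amat(hit_rate, hit_latency, miss_latency):
    return hit_rate * hit_latency + (1.0 - hit_rate) * miss_latency

# Miss latency of L1 is (roughly) the AMAT of L2, and so on down the hierarchy.
l2_amat = amat(hit_rate=0.80, hit_latency=10, miss_latency=200)   # 200 ~ DRAM
l1_amat = amat(hit_rate=0.95, hit_latency=1,  miss_latency=l2_amat)

print(f"L2 AMAT = {l2_amat:.1f} cycles, L1 AMAT = {l1_amat:.2f} cycles")
```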
Locally I care about AMAT, but globally I mostly care about how fast my processor runs my application. So when you reduce average memory access time, is that always beneficial for performance? Or, put another way, should I always shoot for a better AMAT or not? You can probably guess that the answer is no, but why that is needs some discussion, which we will hopefully get to by the end of this lecture. Questions?

Okay, we will start with the basic hardware cache design and then examine a multitude of ideas to make it better. Remember the picture where we chunk the memory address space into 64-byte blocks; we start with a toy cache example that can store four blocks. Each block address maps to a potential location in the cache, determined by the index bits of the address; we use the index to index into the tag and data stores, as we already showed. Here is the address space: we assume 8-bit addresses, and assuming the memory is byte addressable, an 8-bit address means you have 256 bytes of memory. This is one way of dividing the address bits for your cache: we use three bits to indicate which byte within the cache block we are referring to, meaning the block is eight bytes. We talked about 64-byte blocks in conventional systems, but for this toy example the cache block is eight bytes, which is why we use three bits to specify the byte. Then we have the index bits, three bits of the address that we use to index the cache arrays, both the tag array and the data array, and finally two bits of tag; we will compare the tag of the address against the tag stored in the cache, and if they match, we have a hit.

To summarize a cache access: first, you index into the tag and data stores with the index bits of the address. Second, you check the valid bit in the tag store, because you want to make sure the entry is valid; the valid bit simply indicates whether the block entry is empty or not. In the beginning your cache is completely empty, so all valid bits are reset, and as you insert cache blocks you set their valid bits to one to show they hold valid blocks. Sometimes the valid bit is also used for another reason, to say that a block that was there before is no longer valid; we may get to that later today with the topic of cache coherence. Third, you compare the tag bits of the address with the tag stored in the tag store; if the stored entry is valid and its tag matches the tag of the address, the block is in the cache, and we call that a cache hit.

Okay, let's work with our toy example. We start with direct mapped: a given main memory block can be placed in only one possible location in the cache; that is the definition of a direct-mapped cache.
We are still assuming the same byte-addressable main memory with 256 bytes, and our block is eight bytes, meaning we have 32 blocks in memory in total. This picture shows main memory, all the blocks from block 0 to block 31: your memory has 256 bytes, and since each block is eight bytes, you end up with 32 blocks. Notice that the address we show for each block is only five bits: we do not need to show the three least-significant bits of the address, because those three bits select which byte within that block we are referring to. So the memory address is 8 bits, but the block address is only five bits. Makes sense, right?

We assume our cache is 64 bytes, i.e., eight blocks. In a direct-mapped cache a block can go to only one location. We have a tag store and a data store; the data store has eight blocks, and in the tag store we keep the valid bit and the tag bits for each of these blocks. This is how you use the address to access the cache, very similar to the example on the previous slide. The address is 8 bits; you know the block is eight bytes, so the three least-significant bits select a byte within the block. Then we need to index both the tag store and the data store, which have eight entries, so we need three index bits (log2 of 8 is 3). The rest is the tag: since we have 8 bits in total, 8 minus 6 leaves two tag bits per address.

When you send an address, you index the tag store and the data store. In the tag store you check whether the entry is valid, and if it is, you fetch the tag stored at the location selected by the index and compare it with the tag in your address. If they are equal, you have a hit, and if it is a hit you can take your data from the data store. There is also a multiplexer here, because you read the whole cache block out of the data store but you may want only one byte of it; depending on the processor instruction it might be a load byte, load half-word, load word, or load double-word, so you might want to read one, two, four, or eight bytes, and the multiplexer selects the bytes you are looking for. Any questions? It looks simple, right? I hope so.

So in a direct-mapped cache, blocks with the same index contend for the same cache location. In this example they are highlighted in green: block 0, block 8, block 16, and block 24 all map to cache block 0.
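Here is a small sketch of this toy direct-mapped lookup (8-bit addresses, 8-byte blocks, 8 cache blocks). It is my own simplification, not a hardware model: it only extracts the offset/index/tag fields and mimics the tag comparison, and it omits the data store entirely.

```python
# Toy direct-mapped cache from the lecture: 8-bit addresses, 8-byte blocks,
# 8 blocks in the cache -> 3 offset bits, 3 index bits, 2 tag bits.
OFFSET_BITS, INDEX_BITS = 3, 3
NUM_BLOCKS = 1 << INDEX_BITS

# Tag store: one (valid, tag) pair per block; the data store is omitted here.
tag_store = [{"valid": False, "tag": 0} for _ in range(NUM_BLOCKS)]

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def access(addr):
    tag, index, _ = split_address(addr)
    entry = tag_store[index]
    hit = entry["valid"] and entry["tag"] == tag
    if not hit:                              # on a miss, fill (possibly evicting)
        entry["valid"], entry["tag"] = True, tag
    return hit

# Blocks 0, 8, 16, 24 (byte addresses 0, 64, 128, 192) all map to index 0.
for addr in (0, 64, 128, 192):
    print(addr, split_address(addr), "hit" if access(addr) else "miss")
```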
That contention causes the conflict problem I mentioned. For example, the processor accesses a word stored in block 0, so you move block 0 into your cache, and hopefully you get more accesses to block 0 in the future and enjoy hits. But then you get another access from the processor whose address belongs to block 8, so you need to move block 8 into the cache; since there is only one possible location for it, you have to evict block 0 from the cache, from both the tag store and the data store, and put block 8 there. Later, if you get another request that belongs to block 0, it is now a miss, because block 0 no longer exists in your cache. We call this situation a conflict miss. Direct mapping can be very bad, because it can cause conflict misses even though there may be empty space in the cache. A conflict miss is different from a capacity miss, which we will see later: sometimes you miss because your cache is simply full and you have to evict something, but with a conflict miss you might still have empty space in the cache that you should, in principle, be able to use; because the direct mapping is so strict, you cannot freely place your block anywhere, and that causes conflict misses.

To summarize the direct-mapped cache: two memory blocks that map to the same index cannot be present in the cache at the same time — one index, one entry — and this can lead to a 0% hit rate if more than one block accessed in an interleaved manner maps to the same index. For example, assume addresses A and B have the same index bits but different tag bits (in our example, block 0 and block 8). If you access A, B, A, B, and so on, they keep conflicting on the same cache index, all accesses become conflict misses, and your hit rate is zero. So a direct-mapped cache can sometimes be very bad, depending on your access pattern; on the upside, direct-mapped caches are very simple and easy to implement, which is why they are still useful in some contexts.

Okay, so how can we fix this problem? We allow more freedom: I should not have to map my block to only one place; I should be able to map it to at least two places, for example. That is the idea we call set associativity: we enable blocks with the same index to map to more than one cache location. For example, instead of one column of eight blocks, we can have two columns of four blocks, which we call a two-way set-associative cache. This is the tag store now: two columns, each with four entries, and the same for the data store. Now you need to compare more: on an access you check the valid bits of two tag locations and compare your tag with the tags stored in both places, and if one matches, it is a hit and you get your data from the data store. Again there is a multiplexer to decide which column, i.e., which way, you take the data from,
and then another multiplexer to select the byte you want. You can see that this complicates the design and adds more hardware, but it can heavily reduce conflict misses. The key idea is very similar to an associative memory within the set. You should remember from our out-of-order processor design that we talked about content-addressable memory, for example for implementing a reorder buffer. A content-addressable memory (CAM) is a memory you search by content: to find your data, you search through the entries and compare the address (tag) stored in each entry with the address you are looking for. In a set-associative cache you perform that search within the set: each set effectively behaves like a small content-addressable memory that you search to find your tag, if it is stored there. Any questions? Good.

We might need even higher associativity. In our example, with the pattern A B A B we have a 0% hit rate in a direct-mapped cache, whereas with two-way set associativity we get a very good hit rate if we ignore the first two misses: the first access to A is a miss, and the first access to B is also a miss, but after that, repeating A B A B, every access hits in a two-way set-associative cache. But the problem can get worse: the pattern could be A B C D, A B C D, and so on, and then two-way set associativity again gives a 0% hit rate. Now you know the trick: make it four-way set associative. You divide the tag store into four columns and the data store into four columns, and you compare more, adding four comparators, so the hit-checking logic becomes more complicated and more costly. This is how the address looks now: we still need three bits to select the byte within the cache block, but now we have only two sets, so to index two sets I need only one index bit, and the rest is tag. So you also need to store more tag bits when you have higher associativity, which increases the tag array capacity, but the likelihood of conflict misses is even lower, which is good.

In the extreme case you have a fully associative cache: a block can be placed in any cache location. In our toy example we have a single set in which all eight blocks can be placed freely, so whenever you want to find your cache block you need to compare eight tags. You can compare them in parallel, as shown here, if you want a good access latency, but you can also do the comparison serially, which reduces the number of comparators you need but comes at the price of a longer access latency.
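To see how the address fields change with associativity in this toy cache (64-byte cache, 8-byte blocks, 8-bit addresses), here is a small sketch I added that just computes the offset/index/tag widths for each configuration; the underlying formula is what the slides imply, the code itself is illustrative.

```python
import math

ADDR_BITS, BLOCK_BYTES, CACHE_BYTES = 8, 8, 64

def field_widths(ways):
    # Offset selects a byte in the block; index selects a set; the rest is tag.
    offset_bits = int(math.log2(BLOCK_BYTES))
    num_sets    = CACHE_BYTES // (BLOCK_BYTES * ways)
    index_bits  = int(math.log2(num_sets))
    tag_bits    = ADDR_BITS - offset_bits - index_bits
    return offset_bits, index_bits, tag_bits

for ways in (1, 2, 4, 8):      # direct-mapped, 2-way, 4-way, fully associative
    print(f"{ways}-way: offset/index/tag bits = {field_widths(ways)}")
# 1-way: (3, 3, 2), 2-way: (3, 2, 3), 4-way: (3, 1, 4), 8-way: (3, 0, 5)
```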
That is a trade-off people have also looked into. If you have a hit, you now need a larger multiplexer to decide which cache block, i.e., which way, you are reading from, and then another mux for selecting the byte. Any questions? Good.

In the address we have three bits for the byte, and the rest, five bits, is the tag; the number of index bits is zero, because you do not need an index at all in a fully associative cache: there is only one set, so there is nothing to index.

We already talked a little about the trade-off, but let me show a bit more. The degree of associativity is how many blocks can map to the same index, or set — the number of ways. Higher associativity likely gives a higher hit rate. Of course it depends on your access pattern: if your access pattern does not have much temporal and spatial locality, you might not benefit much from set associativity either. But on average, across the many workloads people have studied, higher associativity does give a higher hit rate. At the same time, it comes at the price of a slower cache access time — the hit latency and data access latency are longer — and you need more hardware: more comparators, more area, more power, and so on. The problem is that the return from higher associativity diminishes: going from direct mapped to two-way you see a big improvement, from two-way to four-way still a very good improvement, from four-way to eight-way less, from eight-way to sixteen-way even less, and at some point the benefit you get from a better hit rate no longer justifies the increase in access latency and complexity. Pictorially, the curve usually looks like this: as you increase associativity, the hit rate increases, but with diminishing returns, and it saturates at some point. This makes sense: the reason set associativity helps is that it reduces conflict misses, but conflict misses are not the only misses in your system. Going from direct mapped to, say, four-way reduces conflict misses a lot, which is very good, but going from four-way to eight-way does not reduce them that much more, and at some point it no longer justifies the complexity of the design. Questions?

Okay, let me go into some issues in designing set-associative caches. Think of each block in a set as having a priority indicating how important it is to keep that block in the cache. The key issue is: how do you determine or adjust block priorities? Now that we have several blocks in one set, whenever you want to evict one of them — because the set might be full; in a two-way set-associative cache, for example, you have two blocks in a set and both ways are already occupied by different
blocks — and another block arrives that you want to bring into the cache, there is no space, so you need to evict one of the stored blocks. Which one do you select? That is the problem: we need a way to assign priorities when we have set-associative caches.

There are three key decisions in a set that we will look into: insertion, promotion, and eviction, where eviction is also known as replacement. Insertion: what happens to priorities when you insert a block into the cache — where to insert the incoming block, or whether to insert it at all; sometimes you decide not to insert the block, as we will see later. That is a decision you need to make. Promotion: what happens to priorities on a cache hit — once you access a cache block and it hits, how does that change its priority; whether and how to change block priorities in the set after accesses. The third key decision is eviction, or replacement: what happens to priorities on a cache miss — which block to evict and how to adjust priorities. These three key decisions are the main thing we will focus on for the next few minutes.

Eviction and replacement policy: the question we want to answer is which block in the set to replace on a cache miss. Any invalid block first: if there is space in the set, meaning some blocks in the set are invalid, i.e., still empty, you just use them, and the answer is simple. But if all blocks are valid, you need a policy — the replacement policy — that tells you which block to evict in order to make room for the block you want to insert. The replacement policy can be random; people have shown that random replacement is actually not a bad idea on average across many workloads, although there are of course access patterns where random replacement does poorly. With random you essentially roll a die and evict a block based on that. It can be FIFO, first in first out: the block that was inserted into the cache earliest is the candidate for eviction. That might not be a good idea, because you may have inserted a block very early in the execution but kept using it — you kept accessing it and getting hits — and FIFO does not consider the fact that you are using the block; the only thing FIFO cares about is when the block was inserted. Let me finish this slide and then we can take a break. Another policy is least recently used, LRU — we will see how to implement it, and I will also ask you about that — where you want to evict the block that has not been used for the longest time,
that is, the least recently used block. How to implement that is a problem we will look into as well. Another policy could be not most recently used: you keep track of which block is the most recently used, make sure you never evict that one, and beyond that you do not care — you may select any of the blocks that are not the most recently used. It could also be least frequently used. Or it could be least costly to refetch. Why would memory accesses have different costs? When you evict a cache block, you could consider how costly it would be to refetch it: am I going to refetch this block from the next level of the cache hierarchy, say L2, or from main memory? Main memory has a longer access latency, so the cost you pay to refetch it is higher; sometimes you should take that into account to get better performance. You can also think of hybrid replacement policies — remember we used hybrid branch predictors — combining several of these replacement policies to build a better technique. And you can think of the optimal replacement policy, which is very hard to do, because what exactly does optimal mean, and with respect to which metric? Optimal could be in terms of miss rate, or access latency, or the execution time of your processor, or energy; the definition of optimal is not clear, and at the same time it is very hard to build an ideal replacement policy, because you would need information about your future accesses, and that information usually does not exist.

Okay, I think this is a very good point to stop; we will discuss implementing LRU in more detail. Let's get back when the bell rings, and then we will continue.

Okay, let's continue. Before I move forward, I want to remind you that next Thursday's lecture has been shifted by one hour — I think you also got the email from the department — so make sure you attend at 1 p.m.

We discussed LRU, least recently used, and now we want to see how we can implement it. The idea of LRU is to evict the least recently accessed block, which is based on the fact that we are benefiting from temporal locality and we want to evict the block whose eviction hurts temporal locality the least; that is the key insight behind why LRU can be a good replacement policy. The problem is that we need to keep track of the access order of the blocks. So the question is: for a two-way set-associative cache, what do you minimally need to implement LRU perfectly? How many bits do I need, minimally, to implement perfect LRU for two-way set associative? Just guess, it is not that hard. Yes — true, but there are only two blocks, so one bit is enough; one is the minimal number of bits you need to implement LRU here. How about four-way set associative? I am asking the same question. No, it is not two. No — what is your logic to get four?
No — essentially you need to count the number of possibilities; it is really a combinatorics question: how many different possible orderings can you have in a four-way associative set? How many orderings can I have with four blocks, if you remember some mathematics? Yes, exactly: four factorial. In the first position of the ordering you have four possible blocks, in the next one only three, then two, and in the last one only one, so it is 4 × 3 × 2 × 1 = 24, which means you need five bits to encode these 24 cases. But then you also need decoding logic that takes those five bits as input and tells you which ordering you are in, i.e., which block is the least recently used. There is another approach where you store a bit more but simplify the logic: for example, you store eight bits, two bits per block. You pay more storage cost, but you get much simpler decoding logic to figure out which block is the LRU. In general, for an n-way set-associative cache you can repeat the calculation: the minimal number follows from n factorial, but as I said, people often do not use that minimum because it complicates the decoding logic a lot; instead they use something like log2(n) bits per block, i.e., n times log2(n) bits, which simplifies the logic. The conclusion — unfortunately I do not have enough time to go deeper — is that implementing LRU is costly, both in storage and in the logic you need to add, especially when you have many ways; for example, implementing true LRU in an eight-way set-associative cache is not easy. So people in industry, and researchers, have worked a lot on approximating LRU: most modern processors do not implement true LRU, also called perfect LRU, in highly associative caches. There are two reasons for this. First, true LRU is complex, as we just discussed. Second, LRU is only an approximation for predicting locality anyway; no one said LRU is the perfect way to do cache replacement, so maybe it does not make sense to spend a lot of resources and time designing something that will not be perfect in the end anyway.
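Here is a quick sketch of the storage-cost reasoning above (my own calculation, following the in-class argument): it compares the minimal encoding of the full access ordering, which needs ceil(log2(n!)) bits per set, with the simpler n × log2(n) bits-per-block counter scheme.

```python
import math

# Bits needed per set to track a full LRU ordering of n ways.
def lru_bits(ways):
    minimal   = math.ceil(math.log2(math.factorial(ways)))  # encode one of n! orders
    per_block = ways * math.ceil(math.log2(ways))            # simple per-block counters
    return minimal, per_block

for ways in (2, 4, 8, 16):
    minimal, per_block = lru_bits(ways)
    print(f"{ways:2d}-way: minimal {minimal} bits/set, "
          f"per-block counters {per_block} bits/set")
# 2-way: 1 vs 2, 4-way: 5 vs 8, 8-way: 16 vs 24, 16-way: 45 vs 64
```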
So people approximate LRU. One way is not-MRU: you only keep track of the most recently used block — that is easy, you just need to know which block was used most recently — and you make sure you do not evict it; among the rest you can select, for example, randomly. You can also have hierarchical LRU. I am not going into the details, but the idea is that you build a hierarchy within the set: you chunk the set into smaller groups, and within each group you can have LRU — even true LRU, because there are fewer ways in each smaller group — while globally you do not need true LRU; for example, you can track the MRU group and the MRU way within each group. That is one way of implementing hierarchical LRU. Another example is the victim / next-victim replacement policy, where essentially you make sure you do not evict the most recently used block and one more designated block; there is a policy for selecting that other block, but I will not go into the details of all these implementations. The point is that people have tried to approximate LRU in order to simplify the cache design.

Another question is LRU versus random: which one is better? Take our example of a four-way cache with cyclic references to A, B, C, D, E. We get a 0% hit rate with the LRU policy: you access A, B, C, D, and all four are stored in your cache; then you get to address E, and when you want to insert E, the least recently used block is A, so you evict A and insert E. Because the references are cyclic, the next access is to A, which is now a miss again, so you bring A back, and now the least recently used block is B, so B will miss too — and so on. In the end you get a 0% hit rate with the LRU policy. This problem is called set thrashing: the program working set mapping to a set is larger than the set associativity — here we cyclically access five blocks that all go to a four-way set — and these blocks thrash the set, so LRU does not work at all. In such cases, people have shown that random replacement is better when thrashing occurs. In practice, the performance of a replacement policy depends on the workload: the average hit rates of LRU and random are similar when you average across many workloads, but if you check each workload separately you will see that for one workload LRU is better and for another random is better. If you look even deeper, you will see that inside a single workload some sets do better when treated randomly and other sets do better with LRU. People have actually tried using LRU and random together in a hybrid manner, and the granularity can even be per set: for each set, or for a sample of sets, you decide which policy to apply. So you can use a hybrid and choose between the two, you can use set sampling, where some sets use LRU and other sets use random, and so on. People have also looked into machine learning algorithms for implementing replacement policies — probably not for L1, because L1 must be very fast and you probably cannot afford the latency of complicated replacement logic there, but at L2 or L3 you may be able to afford a more complicated replacement policy, possibly even one based on machine learning.
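To illustrate the set-thrashing example above (cyclic A B C D E accesses landing in one 4-way set), here is a small simulation sketch I wrote; the random seed is fixed only so the output is reproducible, and the exact hit rate for random replacement will vary.

```python
import random

def run(policy, ways=4, trace=None, rounds=50):
    trace = trace or ["A", "B", "C", "D", "E"]
    cache, hits, accesses = [], 0, 0      # cache list ordered LRU -> MRU
    for _ in range(rounds):
        for tag in trace:
            accesses += 1
            if tag in cache:
                hits += 1
                if policy == "LRU":       # promotion: move to the MRU position
                    cache.remove(tag)
                    cache.append(tag)
            else:
                if len(cache) == ways:    # eviction on a miss with a full set
                    victim = cache[0] if policy == "LRU" else random.choice(cache)
                    cache.remove(victim)
                cache.append(tag)         # insertion at the MRU position
    return hits / accesses

random.seed(0)
print("LRU    hit rate:", run("LRU"))     # 0.0 -- set thrashing
print("Random hit rate:", run("random"))  # > 0 on average
```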
If you want to learn more about the idea of set sampling, you can check this paper; we will get back to this paper toward the end of the lecture for another reason, but you can also check it for this one if you are interested.

Okay, about the optimal replacement policy: people have looked into this a lot. As I said, an optimal replacement policy is not easy, first of all because we do not know what optimal means — there are many different metrics. In Belady's optimal replacement policy, called Belady's OPT, we replace the block that is going to be referenced furthest in the future by the program, so essentially you need knowledge of the program's future accesses. This was presented in the IBM Systems Journal in 1966, originally for virtual memory. But how do we implement this — is it even implementable? Probably not, because you need information from the future; if you think about it, that would violate the causality of the system: your system would not be causal anymore, because you would need information from the future to make your current decision. But you can predict — all the replacement policies we have discussed try to predict the future — and you can also simulate it. People have looked into how to simulate it: you take the trace of future accesses and use that information to implement Belady's replacement policy for your cache. This can be used to show the gap: if you consider Belady's policy optimal, with optimal defined as the lowest miss rate — and for minimizing miss rate, OPT is indeed the optimal one — then simulating it shows the gap, or the room, by which you could still improve your cache with better replacement policies. For example, you know that with LRU your average miss rate is, say, 10%; you also simulate Belady's OPT and see a miss rate of 4%; now you see a 6% difference between LRU and Belady's, and that is the room you have to work on and improve. It is always good to think about such oracle policies — Belady's OPT is a kind of oracle policy — to show what the room for improvement would be.

So, is Belady's OPT optimal for minimizing miss rate? Yes. Is it optimal for minimizing execution time? No, because, as we will also see later, cache miss latency and miss cost vary a lot from block to block: you do not pay the same price for different misses, partly because when you miss in L1 you might hit in L2 for some blocks, while for other blocks you might also miss in L2 and L3 and have to get the data from DRAM. There are also other reasons, like memory-level parallelism, which lets misses overlap; we will hopefully see that later today. These things are discussed in this paper; if you are interested, you can check it. We will hopefully get back to this paper later today.
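Here is a small sketch (my own simplified simulation, not the paper's code) of Belady's OPT on a single fully-associative set: on a miss with a full cache, it evicts the block whose next use lies furthest in the future, which requires looking at the rest of the trace — exactly the oracle knowledge discussed above.

```python
def belady_misses(trace, capacity):
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            # Evict the cached block referenced furthest in the future
            # (or never referenced again). Needs the future trace: an oracle.
            def next_use(b):
                try:
                    return trace.index(b, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

trace = ["A", "B", "C", "D", "E"] * 10
print("OPT misses:", belady_misses(trace, capacity=4), "out of", len(trace))
# LRU would miss on every access of this cyclic trace; OPT misses far less.
```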
Now let's get back to more implementation details. What is in a tag store entry? We showed that we have the valid bit and the tag, and there should also be some bits used to implement the replacement policy. We might also have another bit, a dirty bit, which is used in handling writes — we are going to see write-back and write-through caches. With a dirty bit you can mark that you have already written to this cache block, i.e., that you have changed its contents. You can have a bit for that, or you might not have it, in which case you need a different technique to handle writes, which we will see now.

This is important for handling writes: when do we write the modified data in a cache to the next level? You have a cache block in L1, the processor sends a store request, and that store hits your L1 cache, so you change the data of that block in L1. The question is: when do we write this modified data to the next levels — L2, L3, and DRAM? There are essentially two approaches. One is write-through: at the time the write happens. When the write happens, you update your L1 cache and you also update the other levels — L2, L3, and memory — if they also hold your cache block. The other is write-back: when the block is evicted. You delay updating the other levels until you have to evict this block from your cache. For that you need to keep the dirty bit, because when you evict the block you must check whether it has been modified: if the dirty bit is zero, the block is not modified and you simply evict it; but if it is dirty, you have to update the next level first, and only then can you evict it. That is the write-back mechanism.

Let's look at some pros and cons of these two techniques. In a write-back cache we can combine multiple writes to the same block before eviction, which is really good: you are not sending every write to the lower level, which can potentially save a lot of bandwidth between cache levels and also save energy — you filter out many writes in L1 and do not send them to L2, and likewise from L2 to L3, and so on. But you need a bit in the tag store that indicates the block is dirty, i.e., modified, for a write-back cache. A write-through cache is of course a simpler design, and all levels are up to date and consistent, which — as you will probably see in the next lectures — is simpler for cache coherence, because there is no need to check the tag stores of the caches close to the processor for the presence of a block; I will not go into detail right now because that is a topic we will see later. On the other hand, a write-through cache is bad for bandwidth: it is more bandwidth intensive, there is no combining of writes — every write also updates the other levels — so you send and consume a lot of bandwidth, and you also consume a lot of energy doing that.
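As a rough illustration of the write-combining benefit of write-back, here is a sketch I put together that counts how many updates reach the next level for a stream of writes to one block; it models only the dirty bit and a single hypothetical eviction point, nothing else.

```python
def next_level_writes(write_addrs, policy, evict_after):
    """Count updates that reach the next level. `evict_after` is the point in
    the stream where we pretend the block gets evicted (hypothetical)."""
    traffic, dirty = 0, False
    for i, _ in enumerate(write_addrs):
        if policy == "write-through":
            traffic += 1                 # every write also updates the next level
        else:                            # write-back: just set the dirty bit
            dirty = True
        if i == evict_after and policy == "write-back" and dirty:
            traffic += 1                 # write the dirty block back once
            dirty = False
    return traffic

writes = [0x40] * 10                     # ten stores to the same block
print("write-through traffic:", next_level_writes(writes, "write-through", 9))
print("write-back    traffic:", next_level_writes(writes, "write-back", 9))
# write-through: 10 updates sent down; write-back: 1 (combined at eviction)
```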
Questions? So which one do you prefer, write-back or write-through? It depends. It depends on the cache level and the processor you are designing, and essentially on the access pattern. For example, you might have an application that does not update data much: you write to a cache block and then there are not many writes after that; in that case maybe write-through is a bit better. But if the application's access pattern has many frequent writes to the same block, you might prefer write-back. The fact that there is a trade-off shows that there is no definitive answer to this question; it really depends on your design decisions and your access patterns.

Okay, another important question in handling writes: do we allocate a cache block on a write miss? You access L1 for a write, you cannot find your block in L1, and you have a write miss. Do you allocate on a write miss? If you do, we call the policy allocate-on-write-miss; otherwise it is no-allocate-on-write-miss, meaning that on a write miss you go to the next level, the level where you can find your cache block, and only update the block there; you do not bring that block into your L1 cache, the level where you missed. That is no-allocate-on-write-miss. If your policy is allocate-on-write-miss, you go to the next level, perform the write, and you also bring the updated data into your L1 cache, hopefully to get future hits on it. With allocate-on-write-miss we can combine writes, similarly to write-back, instead of writing each one individually to the next level; it is also simpler, because write misses can be treated the same way as read misses — for a read miss we know that we go to the next level, find the data, and allocate it in our cache, so we can handle write misses the same way. But that requires transferring the whole cache block, which might not be useful: maybe you only want to update one byte of that cache block, so why load the whole block into L1 just to update it, especially if you are not going to use the block much afterwards? Again, it depends on your access pattern: if you allocate on a write miss, update a block, and then also read from that block a lot, allocate-on-write-miss makes sense; but sometimes you write and then, for some time, you are not going to access that block at all, for reading or writing, and in that case you do not need to allocate it. In such scenarios no-allocate is better, because it conserves cache space when the locality of the written block is low: you utilize your cache capacity better, which can potentially give a better hit rate. Any questions?
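Here is a small sketch of the allocate-on-write-miss decision (my own toy model, with an unlimited-capacity L1): it counts misses and whole-block fills for a streaming-write pattern versus a write-then-read pattern, which is the access-pattern dependence just described.

```python
def misses_and_fills(ops, allocate_on_write_miss):
    """ops: list of ('R' or 'W', block_id). Toy L1 with unlimited capacity:
    returns (misses, whole-block fills into L1)."""
    cached, misses, fills = set(), 0, 0
    for op, block in ops:
        if block in cached:
            continue                      # hit
        misses += 1
        if op == "R" or allocate_on_write_miss:
            cached.add(block)             # bring the whole block into L1
            fills += 1
        # else: no-allocate -> the write only updates the next level
    return misses, fills

streaming_writes = [("W", b) for b in range(8)]       # written data never reused
write_then_reads = [("W", 0)] + [("R", 0)] * 4        # written block is reused

for name, ops in [("streaming writes", streaming_writes),
                  ("write then reads", write_then_reads)]:
    for alloc in (True, False):
        m, f = misses_and_fills(ops, alloc)
        print(f"{name:16s} allocate={alloc}: misses={m}, block fills={f}")
```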
Okay, another important question in handling writes: what if the processor writes to an entire block over a small amount of time? Is there any need to bring the block into the cache from memory in the first place? And more generally, why do we not simply write to only a portion of the block, which we can call a subblock? An example is writing four bytes out of a 64-byte block: you have a 64-byte block and you only want to update four bytes of it, so why not update only those four bytes in your cache block? The problem we have here is that the valid and dirty bits are associated with the entire 64 bytes: remember that in our tag store we keep the valid and dirty bits per block, so we do not have, for example, a valid and dirty bit for each individual four bytes. The way we handle this issue is to define the subblocked cache, or sector cache. The idea is that we divide the block into subblocks, or sectors, and we have separate valid bits for each subblock. We then allocate only a subblock, or a subset of subblocks, on a request: you bring only one subblock, or a few subblocks, into your L1 cache, or whichever cache you are accessing, and you update that. The other subblocks you do not allocate at the beginning; maybe in the future, if you access them and need them, you will bring them in. The upside, of course, is that there is no need to transfer the entire cache block into the cache, which is very good: a write simply validates and updates a subblock, and that gives you more freedom in transferring subblocks into the cache, so a cache block does not need to be fully present in the cache. How many subblocks do you transfer on a read? We can also think of applying this idea to reads: in the past, whenever we had a read miss, we moved the whole cache block into the cache, but here we can say, maybe I don't need to move the whole block, I can move just part of it, some subblocks, maybe only four words of that block, and that can save energy and also help performance because you are using less bandwidth. Of course, this is a more complex design, with more valid and dirty bits, and it may not exploit spatial locality fully, so you should be careful when implementing this kind of idea. When you miss on a byte or word access and bring the whole block into your cache, the whole point is to benefit from spatial locality, right? But now that you are not loading the whole block, only some subblocks of it, you may not exploit that spatial locality, so you should be careful not to end up with a low hit rate in your cache because of this. Any questions? Okay.
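Here is a small, hedged Python sketch of what a sectored block might look like, just to make the bookkeeping concrete. The sizes, the `SectoredBlock` name, and the assumption that a write stays within one subblock are all mine, not from the lecture.

```python
# Toy sketch of a sectored (subblocked) cache block: one tag for the whole
# 64-byte block, but separate valid/dirty bits per 16-byte subblock, so a
# write can validate and update a single sector without fetching the rest.

BLOCK_SIZE = 64
SUBBLOCK_SIZE = 16
NUM_SUBBLOCKS = BLOCK_SIZE // SUBBLOCK_SIZE

class SectoredBlock:
    def __init__(self, tag):
        self.tag = tag
        self.valid = [False] * NUM_SUBBLOCKS   # one valid bit per subblock
        self.dirty = [False] * NUM_SUBBLOCKS   # one dirty bit per subblock
        self.data = bytearray(BLOCK_SIZE)

    def write(self, offset, payload: bytes):
        # Assumes the write does not cross a subblock boundary.
        sector = offset // SUBBLOCK_SIZE
        self.data[offset:offset + len(payload)] = payload
        self.valid[sector] = True              # only this sector becomes valid
        self.dirty[sector] = True              # no fetch of the other sectors needed

    def covers_read(self, offset):
        # A read of this address hits only if its subblock has been brought in.
        return self.valid[offset // SUBBLOCK_SIZE]
```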
Another decision we need to make is how we handle the instruction cache and the data cache: do they need to be separate or unified? First let's look at the pros and cons of a unified cache. When they are unified, you get dynamic sharing of cache space: you have one shared space used for both the L1 data cache and the L1 instruction cache, which provides better overall cache utilization. Why do I say that? Because when you have separate cache spaces, meaning an instruction cache and a data cache, for some specific program you might not benefit much from the instruction cache capacity because your instruction working set might not be that big; similarly, for the data cache, there might be applications whose data footprint is not that large, so you are not utilizing the data cache much but you need more capacity for the instruction cache. If you design these two structures in a unified manner, managing instructions and data together, you can potentially get better utilization; there is no overprovisioning of the kind that can happen with a static partitioning of cache space. Unfortunately, this also comes with cons: instructions and data can evict or thrash each other, and there is no guaranteed space for either. Your data accesses can cause evictions of instructions, and vice versa. With your policy you can try to control that with some techniques; there are works on that. But more importantly, instructions and data are accessed at different places in the pipeline, and that is the main reason people decided to have separate instruction and data caches, at least at the L1 level. At the beginning of the pipeline you want to fetch instructions, so you access your instruction cache, and later in the pipeline you want to access your data cache or data memory to load or store data; you want each memory to be close to that part of the pipeline to have fast access. If you have a single unified structure, you cannot place it close to both the fetch logic and the data memory access logic. That is why first-level caches are almost always separate: L1 instruction and L1 data caches are usually separated in our systems, mainly for this last reason. Other cache levels are almost always unified: L2, for example, is unified for data and instructions, and L3 as well; in the designs we will look at, they are unified. So that is the idea behind this question. Okay, now let's look at multi-level cache design and management. We discussed this hierarchy: we might have different L1, L2, and L3 caches, and we already discussed in the previous slide, for example, that the unified-versus-separate decision can go differently depending on the level. The main reason is that you have different constraints and optimization goals at different levels. At the L1 level the most important thing is to be fast, because in order to keep your pipeline full you should be able to access your L1 cache very quickly, and that is the most important design consideration. That is why L1 caches are not very large; they are relatively small, and people use fast transistors, fast SRAM, because you can use different transistor technologies and different ways of implementing SRAM with different latencies. For the L1 cache, people try to make it as fast as possible, and that is the main reason for these choices.
So first-level design decisions are very much affected by cycle time and pipeline structure. First-level caches are usually small and use lower associativity, again because we want low access latency; latency is critical for us. The tag store and the data store are usually accessed in parallel: you access the tag store and the data store at the same time. If you hit in the tag store, perfect: the data you loaded from the data store is useful. But if you miss in the tag store, the data you fetched from the data store is not useful anymore; that is the overhead you pay here, and people decided to pay it because they want fast access latency to their L1 caches. For second-level caches, however, we have a different trade-off: decisions need to balance hit rate and access latency. These caches are usually large and highly associative, because latency is not as important as for L1, and the tag store and data store can be accessed serially. That is important here because the data store is quite big, a large array of several megabytes. You access your tag store first; if it is a miss, you do not access the data store at all, and that saves a lot of energy, which is very important for second-level caches because they are much larger. If you access your tag store and you hit, then you access the data store to fetch your data. This comes at the price of a somewhat longer hit latency, but that matters less, because L2 and L3 are much further from the pipeline, and the pipeline design does not depend heavily on the access latency of L2 and L3. For further levels and even larger caches, access energy is an even bigger problem due to the cache sizes, so the tag store and data store are usually also accessed serially, and there are many design decisions people make to reduce the energy consumption of these large caches. They may even use different memory technologies, emerging memory technologies instead of SRAM, that are more capacity-optimized and consume less energy. So there are a lot of design tricks you can think of when you have different trade-offs, constraints, and optimization goals at different cache levels. Any questions? Okay. We already talked about serial and parallel access within a cache, but this next point is actually different: it is about whether, when you access your L1, you also access the other levels in parallel. You access L1; do you also access L2 and L3 at the same time or not? With parallel access, the next-level cache is accessed in parallel with the previous level, which is a form of speculative access: you say that if I miss in L1, I at least want to have issued the access to the next level earlier, so that I can hide some of that latency. With serial access, the next-level cache is accessed only if the previous level misses. Of course, with parallel access we get faster access to the data if the previous level misses, but there are unnecessary accesses to the next level whenever the previous level hits. With serial access we get slower access to the data if the previous level misses, but there are no wasted accesses to the next level when the previous level hits, which is a good thing about serial accesses.
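A small, hedged Python sketch of this difference follows, treating each level as an object with a made-up `probe(addr)` method; real hardware overlaps the probes in time rather than issuing them one after another, so this only shows which levels see the request under each scheme.

```python
# Toy sketch of serial vs. parallel (speculative) lookup across cache levels.
# `levels` is an ordered list such as [l1, l2, l3, memory]; each element is
# assumed to provide probe(addr) -> (hit, data). Timing is not modeled.

def lookup_serial(levels, addr):
    for level in levels:
        hit, data = level.probe(addr)
        if hit:
            return data            # lower levels never see this access
    raise LookupError("address not present in any level")

def lookup_parallel(levels, addr):
    # Conceptually, every level is probed at the same time as L1;
    # the hit closest to the processor wins, the other probes are wasted work.
    results = [level.probe(addr) for level in levels]
    for hit, data in results:
        if hit:
            return data
    raise LookupError("address not present in any level")
```

The serial version wastes nothing when the upper level hits but pays the full lookup latency on a miss; the parallel version hides part of the miss latency at the cost of wasted bandwidth and energy in the levels below.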
Another important thing to think about is that, with serial access, the next level does not see the same accesses as the previous one. You only access the next levels when you miss, meaning that the upper level, L1 for example, acts as a filter for L2 and L3, and that affects the locality you observe at those levels. People usually see less locality, or at least different locality, in the L2 and L3 caches. So you need to be careful when you design your replacement policy and your prefetching technique for those levels, because the locality patterns are different at the L2 level, at the L3 level, and so on: the previous cache levels filter out a lot of accesses, and your observation of locality is going to be different. Management policies therefore differ across cache levels in the end. To give you some examples: in the Apple M1 architecture, both L1 and L2 are private to the cores, but at the same time there is a shared system-level cache that is shared across the CPU cores, the GPU, and the accelerators. That is another decision you might need to consider when you have different processing elements, not only CPUs: for some applications the GPU and CPU work collaboratively, and you use them together to accelerate the workload, so having a shared cache level that can serve both at the same time can be a good thing. At the same time, managing this shared cache across GPU cores and CPU cores is also very difficult, because the way the GPU and the CPU access memory is very different: the GPU is heavily throughput-oriented and sends a lot of memory accesses, while CPU applications are usually latency-oriented. In modern architectures we have deeper and larger cache hierarchies overall, because people make caches larger and hierarchies deeper in order to tolerate the data movement problem, the bottleneck of bringing data from slow memory into the system, and thereby improve the performance and energy consumption of the system. Here is another design from AMD, where you can see the sizes of the L1, L2, and L3 caches, and here is another AMD architecture that uses a 3D structure for the L3, to make it even larger: they stack an additional 64-megabyte L3 cache die on top of the processor die. This is also the IBM POWER10 architecture from 2020, which likewise has several levels of caches. In GPUs, as we discussed, we have an L1 cache and an L2 cache, but we also have a scratchpad memory, which in NVIDIA terminology is called shared memory. The L1 cache is hardware-managed, but the scratchpad memory, as I said, is software-managed, so the programmer can decide what to do with it. The structure is actually unified: for example, 192 kilobytes of storage per SM, and this capacity can be used as L1 cache and/or scratchpad, so the programmer can decide how much to allocate to the L1 cache and how much to the scratchpad memory out of those 192 kilobytes.
This is also the case for NVIDIA Hopper, where you can again see the L1 cache and the scratchpad: NVIDIA still keeps this unified structure for the L1 cache and scratchpad memory, the programmer can decide how to split it, and the L2 cache is there as well. I should also say about the L2 cache, and I think we partially mentioned this in our GPU lectures, that the L2 cache was not there in NVIDIA GPUs before around 2009, I believe; at some point, in the Fermi architecture, NVIDIA added an L2 cache because they realized they could not tolerate the memory accesses anymore: memory bandwidth was heavily saturated for many of these applications, so they had to make the cache hierarchy deeper and larger in order to reduce the memory bandwidth saturation a bit. Okay, and this is another NVIDIA architecture you can check; people have looked into many different ideas. Again for NVIDIA: in the A100, which is the Ampere GPU, there is a feature that lets you copy directly from L2 to your scratchpad memory, bypassing the L1 and the register file. When you want to copy data from L2 to your shared memory, which is the scratchpad, you do not need to go through the L1 and the register file and then move the data into shared memory; you can copy it directly. That is also one of the design decisions the designer and architect has to make, and then the software has to be able to support it; that is the whole mechanism they provide to support this idea. Another improvement in Hopper, the H100, is that they provide a direct copy from global memory, which is the GPU device memory, to the shared memory, and they also show that the shared memories of different SMs can communicate data with each other; that is another form of support they provide. So in multi-level cache design, as I discussed, there are a lot of decisions: which levels to place a block into when it comes from memory, which level to evict a block into from an inner level, and bypassing versus not bypassing levels, which we already saw in some of the examples, like when we wanted to bring data from L2 into the shared memory and simply bypassed the L1 cache. People have defined inclusive, exclusive, and non-inclusive hierarchies. Inclusive means that a block in an inner level is always also included in an outer level: if a block exists in the L1 cache, it definitely exists in the L2 cache as well, and if L2 is also inclusive, the block should be in L3 as well. An inclusive hierarchy simplifies cache coherence, which we won't get to today but can hopefully discuss later. In an exclusive hierarchy, a block in an inner level does not exist in an outer level: if the block exists in the L1 cache, it definitely does not exist in L2, because the hierarchy is exclusive. This better utilizes the space in the entire hierarchy: with an inclusive hierarchy you duplicate data a lot, since everything you have in L1 must also be in L2, and everything in L2 must also be in L3 if L3 is inclusive, and so on, whereas with an exclusive hierarchy you do not duplicate, which makes better use of the total space. Then there is another option, non-inclusive, which basically means you don't care: a block in an inner level may or may not be included in an outer level, and that relaxes the design decisions.
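As a rough illustration of where the extra bookkeeping goes under each policy, here is a hedged Python sketch of what an L2 eviction and an L1 fill might have to do; `l1` and `l2` are assumed to expose made-up `contains`, `insert`, and `evict` helpers, and coherence and write-back traffic are ignored.

```python
# Toy sketch of how the inclusion property constrains evictions and fills.

def evict_from_l2(l1, l2, tag, policy):
    l2.evict(tag)
    if policy == "inclusive" and l1.contains(tag):
        # Inclusive: anything dropped from the outer level must also be
        # back-invalidated in the inner level so L1 stays a subset of L2.
        l1.evict(tag)
    # Exclusive / non-inclusive: nothing else to do on an L2 eviction.

def fill_l1_from_l2(l1, l2, tag, block, policy):
    l1.insert(tag, block)
    if policy == "exclusive":
        # Exclusive: a block lives in exactly one level, so moving it up
        # into L1 removes it from L2 and frees that space.
        l2.evict(tag)
    # Inclusive / non-inclusive: L2 may keep its copy.
```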
This means that, for example, when you access L1 and miss, you go to L2; you may find your data in L2 and bring it into L1. If you have an inclusive hierarchy, you need to make sure that L1 and L2 both keep the data together; but when you are non-inclusive, you don't care: L1 may keep the data while L2, for some reason, may already have evicted it, so the block might not exist in L2 anymore because of its eviction policy. In an inclusive hierarchy, however, if you evict something from L2 and that block exists in L1, you also need to evict it from L1 as well, and that complicates the whole design. That is why many caches are non-inclusive: it relaxes a lot of design decisions for us. Okay, any questions? Okay, I think I will stop here. I wanted to cover more, but the topic is not easy to convey, so I think I went a bit slowly, and that is perfectly fine. Next week on Thursday we will pick up from this point, and then we will hopefully get to more advanced topics like prefetching, more on caches, and so on. So I wish you a nice weekend, and see you all next week.