Okay, let's get started. Welcome, everyone, to another lecture of Computer Architecture. Today we are going to discuss two interesting topics: memory ordering will be the first topic, and then we will get to cache coherence. All of these topics come up when we want parallelism, when we are using multiprocessor and multicore systems, which is why we cover these techniques right after the parallelism lecture.

I am going to start with a recall from last week. We were discussing the difficulty of parallel programming, and we said that there is little difficulty if parallelism is natural, as in embarrassingly parallel applications like multimedia, physical simulation, and graphics. The difficulty is in getting parallel programs to work correctly and in optimizing performance in the presence of bottlenecks. In the end, much of computer architecture is about designing machines that overcome the sequential and parallel bottlenecks to achieve higher performance and efficiency, which we discussed a lot last week, for example with bottleneck acceleration, and about making the programmer's job easier in writing correct and high-performance parallel programs. We are going to focus mostly on how to make things correct, and then also look a little at how to make them fast.

So we have this tradeoff between performance and correctness, which we have seen a lot this semester. These two metrics are fundamentally at odds with each other: you can always improve performance at the expense of correctness. For example, you could say "I want to multiply two matrices" and then guess the output instead of computing it; that is certainly high performance, but probably not very accurate. That is an exaggeration, of course, but you really can make such tradeoffs by guessing, by speculating, to get better performance, as long as you make sure you are not doing something wrong. For example, you can elide a critical lock in your program; we discussed this a little in the last session: you can speculatively execute critical sections assuming there are no conflicts, i.e., assuming only one thread is in the critical section, check at the end of the critical section whether what you did was correct, and redo it if it was not. These techniques can provide better performance, but you need a mechanism to recover from incorrect execution. Or, for example, you can design your architecture to ignore the ordering of operations, which is what we will look at today. We will see examples in the fundamental support needed for multithreaded (MIMD) operation: memory ordering and cache coherence.

I should also mention that sometimes there is a real tradeoff between performance and correctness: some applications are fine with reduced correctness, by some definition, in exchange for performance. Any thoughts, or any example you might
remember from this course? Yes. And in which work did we discuss some of them? Do you remember EDEN? Yes, EDEN is one of them: we showed that we can improve DRAM energy and performance by making DRAM a little bit faulty, but only for those layers, or those parts of your data, that do not need exact correctness; you can exploit that tradeoff. EDEN is one example; Heterogeneous Reliability Memory (DSN 2014) is another. You can check these papers; we already covered them, so take a look if you are interested.

When you think about performance versus correctness, it is also very similar to the latency versus reliability tradeoff. Reliability is at the hardware component level, whether each component is reliable or not, while correctness is at the program semantics or hardware function level, so there is a nice analogy between them. Some ideas exploit the latency-reliability tradeoff, like EDEN, which makes DRAM a little unreliable but provides better latency; and when you consider the use of EDEN at the application level, say for neural networks, you are also dealing with the correctness question, so you always have to consider both tradeoffs. We have seen examples of the latency-reliability tradeoff before, for example in Lecture 9 on memory latency; those are from previous lectures and you can find the links to the YouTube recordings. We also showed that you can sometimes use this tradeoff to do something useful: in the DRAM latency PUF work, for example, when you violate timing parameters there are some cells that always fail, and you can use them as a signature of your device; or you can use a similar tradeoff to provide random number generation, as in this work and also in QUAC-TRNG.

But today we want to look into memory ordering in multiprocessors, which is very important for the correctness of execution, and here we do not want to trade correctness away; we want things to be correct, although we will also relax the model a little bit in the end. These are the required readings for this topic. We have this two-page paper from Lamport in which the concept of sequential consistency is proposed and defined, so I suggest you check it; it will probably be one of the required readings. There are also some recommended readings on memory consistency that we will touch on today.

Let's start with some definitions: what is memory consistency and what is cache coherence? We will see cache coherence later today, but it is good to know both from the beginning. Consistency is about the ordering of all memory operations from different processors. First of all, we are considering several processors, not just one, so we have a multiprocessor or multicore system, and we care about the order of all memory operations coming from different processors; these operations can be to different memory locations, they are not restricted to a single location. So this
is about global ordering: the global ordering of accesses to all memory locations is what matters in memory consistency. Related to this we also have coherence, with a small difference: coherence is about the ordering of operations from different processors to the same memory location. In coherence we care about the local ordering of accesses to each cache block, which we will also see today, but we start with consistency, as I said.

We discussed that much of parallel computer architecture is about designing machines to overcome the sequential and parallel bottlenecks, and also about making the programmer's job easier so they can write correct and high-performance code. This is basically the problem statement for memory consistency: we have operations A, B, C, D, which can be memory operations; in what order should the hardware execute and report the results of these operations? There should be a contract between the programmer and the microarchitecture, specified by the ISA, and that protocol preserves an expected, or more accurately an agreed-upon, order that simplifies the programmer's life. Once the programmer knows that the ISA will obey certain rules, they can have some expectation about how things are going to run.

Why is this important? One reason is ease of debugging. If you do not know what is going on, you run your application, you see that the semantics are wrong, and you want to debug it; but the second time you run it, your program might happen to work, because of some ordering you do not know about. Once you know which orders are possible and which are not, you can debug much more effectively. It also matters for ease of state recovery and exception handling: you might remember from our DDCA course (Digital Design and Computer Architecture) that we discussed precise exceptions. Once you have an exception or interrupt during out-of-order execution, you need a precise state, so that when you are done handling the exception you know exactly which state to come back to. These properties are important; otherwise it again comes down to correctness and the difficulty of debugging.

Of course, providing this kind of expected behavior usually makes the hardware designer's life difficult, but it is very important from the programmer's perspective, so once again there is a tradeoff between programmability and hardware design. For this topic, we need to decide how much we make the programmer's life easier versus the hardware's. We will start from a model that makes the programmer's life very easy, then see that this model is actually quite costly, and then get to models that make the programmer's life a little harder but, as a reward, remove a lot of overhead from the hardware design.

Okay, so let's first talk about memory ordering in a single processor. You know that it is specified by the von Neumann model: we have sequential order in a single processor.
The hardware executes the load and store operations in the order specified by the sequential program. Even if you have out-of-order execution, it does not change the semantics, because in the end the processor retires instructions in program order; that is why we have the reorder buffer, to make sure things become visible in the sequence of the program. The advantages: the architectural state is precise within an execution, which, as I said, matters especially when you have an exception or interrupt in your application; and the architectural state is consistent across different runs of the program, which also matters, because when you see an error you want to reproduce it so that you can debug it, and if errors appear randomly it is very hard to localize and fix them. Of course, this comes with disadvantages: we need to preserve order, and that adds overhead, such as the large instruction window and the reorder buffer; it also reduces performance, because at some point, in order to preserve sequential semantics, you have to stall; and it increases complexity and reduces scalability. You can check our DDCA lecture at the link below for more on this.

Now let's look at the other extreme: the dataflow processor, which you might also remember from our DDCA course. How many of you took DDCA before? Okay; the others can check these links if you are interested. In a dataflow processor, a memory operation executes when its operands are ready; that is what a dataflow architecture is. There is no instruction-level control ordering: you have operations, each operation has operands, and when the operands are ready you fire the operation; it is true dataflow. Ordering is specified only by data dependences, and that is the only dependency you need to consider; two operations can be executed and retired in any order if they have no dependency between them. As a result, it is very hard to reproduce an execution: when you rerun a program on a dataflow architecture, the order of instructions might be completely different, because many instructions have no data dependences on each other and can run in any order. This is very good because it provides lots of parallelism and high performance, but unfortunately there is no precise state or ordering semantics, so it is very hard to debug and also hard to recover from problems.

Okay, now let's look at memory ordering in a MIMD processor. We know that each processor's memory operations are in sequential order with respect to the thread running on that processor, i.e., each processor obeys the von Neumann model. But now we want
to see: if we have a multiprocessor and each processor individually obeys von Neumann, is everything all right, or might we run into problems? How does the memory see the order of operations from all processors? In other words, what is the ordering of operations across different processors, and does it matter in the end?

Why does this even matter? There are three reasons: ease of debugging, correctness, and performance and overhead. For ease of debugging, it is useful to have the same order of execution when the program is run at different times, which is important for repeatability. Even though this is really important, many parallel processors, MIMD architectures, do not really provide it. There are works on record-and-replay: you record a sequence of memory operations while running in a debug mode, and then you replay from that record to find problems. But without that support you cannot really get the same execution at different times: depending on how busy the cores are, the status of your memory, the status of your pages, there can be many different orders across threads. In the end, with multiprocessors and parallel programming, debugging is always hard, and if you have ever written a parallel application you can probably confirm that it is not easy to debug.

Correctness is something we do want to provide. Can we have an incorrect execution if the order of memory operations looks different from the point of view of different processors? We will see that this is bad: if different processors observe different orders, there is a problem in your execution, and we will see this with some examples; so it is really important not to run into this situation.

As for performance and overhead: we could enforce a strict sequential ordering, which, as we will see, makes life harder for the hardware designer but easier for the programmer; that is the tradeoff. Such strict ordering adds a lot of overhead to the design, touching major architectural components like out-of-order execution and caches. For example, once you have caches, each processor has a local cache, and many of that processor's operations to its cache are not observed by other processors; if you want a sequential order in which all processors know what is happening, you need to communicate all of that information, and that complicates a lot of things, as we will see today. So we will focus first on correctness and then also see how to reduce the performance overhead.

Essentially, this ordering matters when we are protecting shared data. If you are operating on private data, say each thread is updating its own local pixels, the threads can work completely independently and you do not really care about the order
of their memory operations. Things become important when you are operating on shared data; then the order matters, for example when several writes are issued, which write should take effect earlier or later. As I said, protecting shared data is important: threads are not allowed to update shared data concurrently, for correctness. That is why accesses to shared data are encapsulated inside critical sections, or protected via synchronization constructs like locks, semaphores, and condition variables, which ultimately also lead to critical sections. We know that only one thread can execute a critical section at a given time; we call this the mutual exclusion principle.

This is at the software level. Now we want to see: if you write code that is correct at the software level, i.e., you have done a good job building the critical sections, the locks, and handling them, and you run that code on the hardware, is it going to work or not? A multiprocessor should provide correct execution of synchronization primitives; these primitives, such as locks, enable the program to protect shared data, so you need good support for them in the hardware. Any questions so far? Okay.

The programmer needs to make sure that mutual exclusion is correctly implemented; we will assume this, even though it is not easy and there is a lot of ongoing work on it. Writing correct parallel programs is an important topic in itself, and you can check this Dijkstra work, which is one of the foundations of the topic. I am not sure whether any of you have read it, but if you want to work on parallel programming you really cannot avoid Dijkstra, so do check it. You might also be aware of Dekker's algorithm for mutual exclusion; we are going to use a Dekker-style algorithm in this example, although of course there are many other ways to provide synchronization across threads. Essentially, the programmer relies on hardware primitives to support correct synchronization, and we want to see whether that actually holds. Any questions? Okay. If the hardware primitives are not correct or are unpredictable, the programmer's life is tough. If the hardware primitives are correct but not easy to reason about or use, the programmer's life is still tough. So it is important that the hardware, through the ISA, provides a contract to the programmer so that the programmer knows what to expect, and that contract should also be easy to understand; that puts a lot of the difficulty on the design of such a processor.

I want to explain this with the following example. Assume we have two processors, P1 and P2, and two synchronization variables, F1 and F2, both initialized to zero. Each processor has a critical section, and access to the critical sections is controlled by these synchronization variables. P1 sets F1 to one, meaning "I want to enter the critical section", but then it needs to test the F2 variable: it can enter its critical section only if the other processor is not in its own critical section, meaning F2 is zero. If F2 is zero, P1 enters the critical section, and at the end it resets its own variable F1 so that P2 can enter if it is waiting. The same thing happens on P2, with the difference that P2 sets F2 to one (each processor uses its own variable) and tests F1. This is a very simple Dekker-style algorithm, shown as a sketch below.
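To make the flag protocol on the slide concrete, here is a minimal C sketch of the two entry/exit sequences. The function names are mine, and the busy-wait loop is a simplification (full Dekker's algorithm adds a turn variable to avoid livelock). It deliberately uses plain loads and stores: it is only correct if the hardware really makes each processor's stores and loads visible in program order, which is exactly the question we examine next (note that `volatile` only constrains the compiler, not the hardware).

```c
#include <stdio.h>

/* Shared synchronization variables from the slide, both initialized to 0. */
volatile int F1 = 0, F2 = 0;

/* Runs on processor P1. */
void p1_enter_exit(void)
{
    F1 = 1;              /* A: "I want to enter the critical section."    */
    while (F2 != 0)      /* B: wait while P2 claims the critical section. */
        ;                /*    (simplified busy-wait; ignores livelock)   */
    /* ... critical section: at most one processor should be here ...     */
    F1 = 0;              /* exit: allow P2 to enter.                      */
}

/* Runs on processor P2 (symmetric, with the roles of F1 and F2 swapped). */
void p2_enter_exit(void)
{
    F2 = 1;              /* X */
    while (F1 != 0)      /* Y */
        ;
    /* ... critical section ...                                           */
    F2 = 0;
}
```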
Question? Can this work? Well, yes, there are issues with this scheme itself, but we do not want to discuss those now; you are right that this is not the best way to provide synchronization. That is why I said that at the software level there are different levels of providing synchronization and you have to work hard on it. For now, let's assume this version is fine; we want to see how it can still go wrong in the hardware.

So the question is: can the two processors both be in their critical sections at the same time, which should never happen, given that they both obey the von Neumann model? This is the baseline we consider: two processors, an interconnection network, and one memory. With a single monolithic memory this might work, but if your memory is banked, it can fail, and we will see that with this example.

Let's work through the example together; the y-axis is time. P1 first sets F1 to one: the write (call it instruction A) completes from P1's view, meaning P1 retires it, and the write is sent towards memory. In parallel, assume P2 has reached instruction X: it sets F2 to one, retires it from P2's point of view, and sends X towards memory. So A and B are P1's two instructions, and X and Y are P2's. Later, P1 reaches instruction B, which tests the value of F2: it sends a read request for F2 to main memory, and P1 stalls there because it does not yet know the value of F2. The same thing happens on P2: it needs to test F1, it does not have it, so it sends a read request to memory and stalls at instruction Y.

After some time, memory sends F2 to P1, and this can happen before the earlier write actually takes place in main memory. From the processor's view that write is done, because the instruction was already committed, but it has not yet been performed in main memory: memory still holds the old values, zero and zero. So memory replies to the read with zero and sends it to P1, and since F2 is zero, P1 enters its critical
section. The same thing can happen for processor P2: memory replies with zero to P2, and P2 also enters its critical section. Then, after some time, memory completes operation A and F1 becomes one, and on the other side memory completes operation X and F2 becomes one, but these are clearly too late. Note that this can happen because the two later operations are loads: in interconnection networks, for example, people use a lot of prioritization techniques, and reads are usually prioritized over writes because reads are on the critical path of execution; so it is quite possible that your read completes earlier than your store, even though you issued the store first.

So this execution is incorrect. Now let's look at each processor's view of what happened. From P1's perspective the order is: P1 sends A, then P1 sends B, then P1 sees X happen in main memory; so P1 sees A, B, X, meaning A happens before X. From P2's perspective, it sends X, then Y, and then sees A happen; so P2 sees X, Y, A, meaning X happens before A. These two views cannot both be correct at the same time: one processor sees A before X, the other sees X before A. That is why we say that in memory consistency all processors should see the same order; as long as they see the same order, things go correctly, and you can convince yourself of that. So: the two processors did not see the same order of operations to memory, the happens-before relationship between the updates to memory was inconsistent between the two processors' points of view, and as a result each processor thought the other was not in its critical section, so things went wrong. That is the problem.

How can we solve it? The idea is sequential consistency, proposed in that two-page Lamport paper: all processors see the same order of operations to memory. It is a quite strong consistency model: all memory operations happen in an order, called the global total order, that is consistent across all processors. It is crucial that all processors see that same order. Our assumption is that, within this global order, each processor's operations appear in sequential order with respect to its own operations, meaning each processor obeys the von Neumann model and is sequentially consistent on its own; but beyond that, all the processors observe the same order of operations to memory. You can read this in the Lamport paper; these are the two conditions I just mentioned. This is a memory ordering model, or memory model, as people call it, and it is specified by the ISA; how the microarchitecture implements the model is a separate issue. Once the microarchitecture obeys this model, our multiprocessor is sequentially consistent, and then
the programmer can easily run, for example, this Dekker-style algorithm, and everything will be all right. The catch is how we implement it: it is not easy, and that is why there are not really many sequentially consistent processors on the market; it adds a lot of overhead, as we will see, both to the design and to the performance of the system.

As a programmer's abstraction, you can think of a sequentially consistent multiprocessor as having a switch between the processors and the memory, over a shared bus: memory services only one request at a time, and that service is observed by all processors. It is very similar to what I said before about multi-banked memory versus a single bank: if your whole memory were a single bank you would get something like this naturally, but in reality your memory is multi-banked, so you need to provide this kind of illusion by enforcing sequential consistency. Any questions? Good.

There can be many potentially correct global orders, and all of them can be correct; as we said, we do not really care about debugging here, only about correctness. If you go back to our earlier example, all of these orders are possible, and the point is that each of them is consistent: if processor P1 observes a particular order, say A then X then Y, processor P2 will also observe that same order; that is the key. If you also want to make debugging easy, you need, for example, to always get one of these orders, or at least be able to repeat the same order when you hit an error in your program's semantics; you need to be able to control it, and that is why people came up with record-and-replay, which again adds overhead. Which order, also called an interleaving, is observed depends on the implementation and on dynamic latencies; there are a lot of dynamic latencies, as you can imagine: when you access memory you go through the interconnection network and the memory controller with its scheduling and queueing, so many different orders can occur. The important part is that each order is consistent across all the processors. Two more points: within the same execution, all processors see the same global order of operations to memory, so there is no correctness issue, and the happens-before intuition is satisfied; across different executions, different global orders can be observed, each of which is sequentially consistent, so debugging is still difficult because the order changes across runs. If you want to learn more, check the Lamport paper; it is only two pages but presents these concepts very nicely. Okay, questions? Nice.

Now let's see what the issues with sequential consistency are. We know it is a nice abstraction for programming, but there are two issues. First, the ordering requirement is too conservative. Second, it limits the aggressiveness of performance-enhancement techniques.
Once you want to make the machine sequentially consistent, you have to sacrifice a lot of performance-improvement techniques. Take caches: caches are important for performance, but if you really want sequential consistency you need to communicate the order of all memory operations across all processors, and many of those memory operations happen locally inside a cache. So you would need to communicate to everyone else updates that happen locally; at some point you would effectively want to get rid of caches, or, if you keep them, communicate every update that happens in your cache. That is also interesting because it is part of the problem we will see later with cache coherence; from the sequential consistency point of view, a cache filters out a lot of memory operations, and making all processors aware of all memory operations is very hard once you have caches. Clear?

Yes? Well, sequential consistency itself does not say anything about that distinction, and that is exactly why we will relax the model. Sequential consistency says that the order of all memory operations must be seen the same way by everyone, because we want to make life easy for the programmer. Once you relax it, as we will see, the programmer needs to differentiate between what is shared and what is not shared. How could the hardware decide what is shared? It cannot; in the end it is part of the semantics of the program. The programmer knows, for example, "this is my critical section" and "this is my synchronization variable, I want it handled consistently". So with a relaxed model the programmer has to do something extra; with sequential consistency, the hardware promises that everything is consistent and all processors observe the same order, so the programmer just writes the code, for example this Dekker-style algorithm, and as long as these simple primitives are written correctly, things will be all right.

But now the question is whether the total global order requirement is too strong, and I think we all agree that it is. Do we need a global order across all operations and all processors? How about a global order only across all stores? People have come up with all of these weaker consistency models; we will discuss weak consistency, but a weak consistency model is not just one thing: there are many works in academia and in industry, and the way people actually implement microarchitectures is to avoid being as strong as sequential consistency and to implement some sort of weaker consistency model. One option, as in this example, is a global order only across stores: for stores you may care about the order, while for reads you care much less. Or you can enforce a global order only at the boundaries of synchronization, which is quite interesting and we will see it today; these are relaxed memory models, and that last one is called the release consistency model.
So, which performance-enhancement techniques could make a sequential consistency implementation difficult? Out-of-order execution is one example I already mentioned: loads happen out of order with respect to each other and with respect to independent stores, and this makes it difficult for all processors to see the same global order of all memory operations. Another is caching: a memory location is now present in multiple places, which prevents the effect of a store from being seen by other processors, and again makes it difficult for all processors to see the same global order. We will also see that caching causes the coherence problem, but that concerns one specific memory location; from the perspective of consistency, i.e., memory ordering, even if you write to different addresses, once you have caches and you do not communicate those writes to the whole system, the other processors do not know the order of all these memory operations. That is why providing sequential consistency with caches is hard, and why people came up with weaker memory consistency models.

The ordering of operations matters when the order affects operations on shared data, i.e., when processors need to synchronize to execute a program region consistently. One idea is that the programmer specifies regions in which memory operations do not need to be ordered, using memory fence instructions in some ISAs to delineate those regions: all memory operations before a fence must complete before the fence executes, and all memory operations after the fence must wait for the fence to complete; fences themselves complete in program order. This is very similar to what some ISAs call a memory barrier. Okay, questions?

This is one way of weakening the model; people also came up with other ideas. Here is a comparison of models. In sequential consistency, an arrow means the later operation cannot perform until the earlier one has performed, and we have arrows between essentially everything: this load cannot happen before the previous load finishes, this store cannot happen before the preceding load, and so on. Other consistency models reduce the overhead: in processor consistency, for example, the ordering from a store to a later load is dropped. You can check these papers if you want to learn more; this paper also discusses different ways to provide weaker consistency, and here are two examples: a weak consistency model (WC, which is somewhat unfortunate naming) and release consistency. In the weak consistency model shown here, you have acquire and release operations around critical sections: by definition, everything before the acquire must be done, so once you reach this point you acquire the lock and then you enter the critical section.
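To connect this back to the fence idea from a moment ago: on a machine that only guarantees a weaker order, the flag protocol sketched earlier needs an explicit fence between the store to a flag and the load of the other flag. Here is a minimal sketch of one common way to express this in C11 atomics; the flags and function name mirror the earlier sketch and are my own naming, not anything from the slides.

```c
#include <stdatomic.h>

/* Same flags as before, now as C11 atomics, both initialized to 0. */
atomic_int F1 = 0, F2 = 0;

void p1_enter(void)
{
    atomic_store_explicit(&F1, 1, memory_order_relaxed);

    /* Full fence: the store to F1 must become globally visible before the
       load of F2 below is allowed to read its value. P2 does the symmetric
       thing with F2/F1; with fences on both sides, both processors can no
       longer read 0 and enter the critical section simultaneously.        */
    atomic_thread_fence(memory_order_seq_cst);

    while (atomic_load_explicit(&F2, memory_order_relaxed) != 0)
        ;  /* wait for the other processor to leave */

    /* ... critical section ... */
}
```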
Back to the weak consistency figure: inside the critical section you can operate in any order. Even though it is a critical section, you are already ensuring with the locks that only one processor is inside it, so within it you can fully benefit from out-of-order execution. That said, each processor still obeys the von Neumann model; but beyond that, you do not need to keep things sequentially consistent across all processors inside the critical section. The only thing that must be consistent across all processors is the acquiring and releasing of the locks. After the release you have some code that is not a critical section, and that also does not need to be sequentially consistent; then comes another critical section, for which you again need to acquire a lock. That is the weak consistency model this paper discusses.

People then looked at it a bit more and came up with the release consistency model, which is quite interesting; it is in some sense obvious, yet it is very hard to implement, so not many processors today actually have it, maybe none that I am aware of. In release consistency, consider the blocks in the figure: block 2, block 4, and block 6. They may contain many independent instructions that could execute out of order, but in the weak model you still run the blocks one after another: you acquire a lock, run block 2, release the lock, then acquire another lock, run the next critical section, release it; so blocks 2, 4, and 6 happen in a sequential order, even though they contain independent work. With the dependency graph shown, you can instead run them in parallel as much as possible: running block 4 only requires acquiring lock A. As long as a block does not touch the data protected by another lock, once you hold its lock you can run it; if block 4 does not depend on block 2, you can run them concurrently and benefit from the parallelism. The interesting part is block 6: to get to block 6 you need lock A and also lock B, so you request lock A and then lock B, back to back, and once you hold both you can run block 6. This is quite intuitive, but it is not easy to implement; it clearly provides very good parallelism, though.

Yes? No, you do need to be consistent for the acquire, exactly: you need to be consistent for your synchronization variables. Order matters when you are operating on synchronization variables, and hopefully you do not have many synchronization variables in your code, so for the rest it does not matter.
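As a rough illustration of what "acquire" and "release" mean at the language/ISA level, here is a minimal C11 sketch of a spinlock with acquire/release semantics. The lock variable and the function names are mine, purely for illustration; this is not the notation used in the papers, just a sketch under the assumption that one-sided ordering at the lock operations is enough for the code inside and outside the critical section.

```c
#include <stdatomic.h>

/* A minimal spinlock protecting one critical section (illustrative only). */
atomic_flag lock_a = ATOMIC_FLAG_INIT;

void acquire(atomic_flag *l)
{
    /* Acquire: nothing from the critical section may be moved above this
       point by the compiler or the hardware.                              */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;  /* spin until the lock is free */
}

void release(atomic_flag *l)
{
    /* Release: everything done inside the critical section must be visible
       to others before the lock appears free again.                       */
    atomic_flag_clear_explicit(l, memory_order_release);
}

void worker(void)
{
    acquire(&lock_a);
    /* Block 2: ordinary loads and stores here may be freely reordered and
       cached by the core; only the lock operations themselves are ordered. */
    release(&lock_a);
    /* Block 3 (not a critical section): no extra ordering guarantees needed. */
}
```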
If you want to learn more about weak consistency, you can check these papers; they are well-written, with a lot of interesting techniques. I think I already showed release consistency on the previous slide, so you can check release consistency in that paper as well.

Now let's conclude with some tradeoff analysis. With weak consistency, the advantage is that there is no need to guarantee a very strict order of memory operations, which is really good: it keeps the hardware implementation of performance-enhancement techniques, such as out-of-order execution and caches, simpler, and it can deliver higher performance than stricter ordering. The disadvantage is more burden on the programmer or the software: you need to get the fences and the labeling of synchronization operations correct, and debugging is harder, because it is harder to reason about what went wrong. Despite these issues for the programmer, over the years people realized that we need to make the programmer's life a bit harder here, because sequential consistency is just too strict and adds a lot of overhead to the architecture. It is yet another example of the programmer-microarchitecture tradeoff that you have seen many times in this course. Question? Okay, you can check these papers if you are interested.

This topic is also quite popular for the exam, so let me show some sample exam questions; I am not going to solve them, but you can solve them or check the solutions. Here is one example: two threads, thread A and thread B, run concurrently on a dual-core processor that implements a sequentially consistent memory model; that is the assumption. We state that the value at address 0x100 is initialized to zero, and these are the two code segments. The first question: list all possible values that can be stored in register R3 after both threads have finished executing. This depends on the interleaving of the instructions: with a sequentially consistent model there are many possible interleavings of the memory operations; the only constraint is that the observed order is the same from the perspectives of thread A and thread B. Depending on the interleaving, R3 can take different values, so you can practice this at home. The next part: after both threads have finished executing, you find these particular values in the registers; how many different instruction interleavings of the two threads produce these results? This is not that hard either, but you need to spend some time on it. Then: what is the total number of all possible instruction interleavings? You do not need to expand the factorials. And finally: on a non-sequentially-consistent processor, is the total number of all possible instruction interleavings less than, equal to, or greater than your answer to part (c)? That one is also interesting to think about. You can find the solutions for all of these, and there may also be some homework questions along these lines,
and you can check their solutions. Here is another example where we provide a more detailed piece of code and then ask, for instance: do you find something that could be wrong in the C code segments? Explain your answer; and then several other questions follow. So this topic is quite interesting for the exam, and of course for architects as well. That is all for the first part; let's take a 15-minute break and come back at 2:27 to continue with cache coherence.

Am I audible? Can you hear me on Zoom and YouTube? Okay, good. Welcome back; we are going to continue with cache coherence, whose definition we already discussed. Caching in a multiprocessor not only complicates the ordering of all memory operations, as we discussed: a memory location can be present in multiple caches, which prevents the effect of a store or load from being seen by other processors. It also complicates the ordering of operations to a single memory location, which is the topic of this part of the lecture. A single memory location can be present in multiple caches, which makes it difficult for processors that have cached the same location to have the correct value of that location in the presence of updates to it; we will see this with an example. So those are the definitions of memory consistency and cache coherence, which we already covered. These are some required readings; the Culler and Singh book is very useful for this topic and we will draw a lot from it, and you can check the recommended papers as well.

This is the shared memory model: many parallel programs communicate through shared memory. We also have another paradigm, message passing, where whenever you want to communicate across processors you send a message; we touched on it in one of last week's lectures. In the shared memory model, a part of your memory is shared, and you use that shared section to communicate and synchronize; everything of that kind happens in the shared memory. For example, processor 0 can write to an address, and that can be followed by processor 1 reading it; this way processors 0 and 1 can communicate information. Each read should receive the value last written by anyone, and that is the key property, which, as we will see, can be violated when we have caches.

So this is the basic question we want to answer: if multiple processors cache the same block, how do they ensure that they all see a consistent state? Let's look at an example. Processor P2 loads the value of X from main memory into a register; assume the value is 1000. After some time, processor P1 also loads it, and then P1 updates it to 2000. Later, P2 wants to load this block again; since P2 still has the old value in its cache, it simply reads 1000. But P2 should not load 1000, and the issue we will look at is how to ensure that you do not load such stale data. Is this example clear? You see the issue? Okay.
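Here is a small sketch of the access pattern behind that example, assuming a shared variable X initialized to 1000 and two threads standing in for P1 and P2 (the function names are mine). On a cache-coherent machine, P2's second load must return 2000; a core that silently kept using its stale cached copy would still report 1000, which is exactly the behavior coherence forbids.

```c
#include <stdatomic.h>
#include <stdio.h>

atomic_int X = 1000;          /* shared location, cached by both cores    */

void p2_first_read(void)      /* P2: brings X == 1000 into its cache      */
{
    printf("P2 sees %d\n", atomic_load(&X));      /* prints 1000          */
}

void p1_update(void)          /* P1: later updates the same location      */
{
    atomic_store(&X, 2000);
}

void p2_second_read(void)     /* P2: re-reads X after P1's update         */
{
    /* Coherence requires this to return 2000, not the stale cached 1000. */
    printf("P2 now sees %d\n", atomic_load(&X));
}
```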
So the first question is: whose responsibility is it to provide cache coherence? There are some software solutions; we will give a very brief overview of the software techniques and then dismiss them, because one can argue that a truly software-only implementation is almost impossible: in the end you need some support from the hardware, and people have indeed converged on hardware-based solutions. You can of course think of something in between, combining software and hardware, but a purely software solution that is also effective is, let's say, almost impossible; it is hard to claim anything is impossible, but almost.

So, if you want to do it in software: can the programmer ensure coherence if caches are invisible to software? Caches are essentially not visible to software; there are some ISA instructions that know about caches, but if you want the software or the programmer to ensure coherence, you need to expose a lot of information and detail about the caches to the programmer. One way to provide coherence in software is to do it at a coarse grain, with page-level coherence, and people have built such implementations, but they clearly have a lot of overhead: you lock the whole page when a processor wants to operate on data in it, meaning only that processor can access the page while the others wait, and coherence is enforced at page granularity. If you want to update a block or a word that has nothing to do with the words other processors are updating, you still cannot, because the coherence protocol is implemented at this coarse grain.

Another solution, or rather non-solution, is to make shared locks or data non-cacheable. Clearly that is not a real solution: you say "I am not going to cache this", and yes, that provides coherence because it erases the problem to begin with; if there is no cached copy, there is nothing to keep coherent. A combination of non-cacheable data and coarse-grain coherence is doable, and people have proposed it. You can also think of a fine-grain implementation in which the ISA provides cache flush instructions: something like flush-local A, which flushes or invalidates the cache block containing address A from the processor's local cache; flush-global A, which invalidates the cache block containing address A from all other processors' caches; and the more heavy-handed flush-cache X, which flushes all blocks in cache X. By adding all these instructions you might be able to provide coherence, but they are not really effective or efficient, and even where they are effective, they put a lot of burden on the programmer and make programming very hard: now the programmer has to deal with consistency and coherence, together with many other issues, and these things are not easy to handle. That is why people mostly use hardware-based coherence techniques, or at most combine hardware with some software support. Are any
of you aware of the IBM Cell BE processor? Have you played with a PlayStation? The IBM Cell BE is the processor used in the Sony PlayStation 3, and I have actually worked with it. On the PlayStation 3 you could put the machine into a kernel mode, install a kernel, and then run your own applications, so you could use it for parallel programming, and we did that: I programmed a lot with the Cell BE. I actually taught with it; when I was in Iran there was a course on multicore programming and I was teaching how to program GPUs, the Cell BE, and multithreaded processors. I love that architecture, honestly; there are many interesting things about the Cell, some of them quite advanced and intuitive. But the IBM designers wanted to keep the Cell quite simple, so they decided not to support cache coherence at all and to leave everything to the programmer, and that is why, in the end, even well-experienced programmers could not really use the Cell easily. I am not sure about other editions of the Cell; I only worked with the one used in the PlayStation 3, but it was a real headache to program, and I guess the IBM engineers also realized afterwards that it was not a great idea. So there should be something in between, not completely software.

If you look at GPUs, in the beginning they did not provide any kind of coherence either: you have a shared scratchpad memory inside each core in which you can do some synchronization, but across global threads the GPU tells you not to expect coherence or consistency. People have since worked hard on implementing different coherence techniques and consistency models on GPUs; you can look into that if you want to learn more. There has been a lot of work on using software to reduce the hardware cost, but in the end people realized that relying completely on software simply does not work well.

In hardware, you can greatly simplify the software's job. One idea: when a core writes to cache block A, invalidate all other copies of block A in the other caches. I should also say that here we are talking about blocks; that is the granularity of the cache coherence we discuss today. It could also be done at the granularity of words or bytes, but that creates a lot of overhead. Cache coherence at block granularity can, however, cause false dependencies, which show up in a lot of software work: when you map your data to different cores and the threads operate on different words, but those words share one cache block, the processor thinks there is a coherence issue, tries to handle it, and causes a lot of performance overhead for you, even though the threads are not really operating on the same words at all. This is usually called false sharing; see the sketch below.
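Here is a minimal C sketch of that false-sharing situation, assuming a 64-byte cache block (the block size, the iteration count, and all names are assumptions for illustration). The two counters are logically independent, but because they land in the same block, every write by one thread invalidates the other thread's cached copy, so the block ping-pongs between the two cores; padding each counter to its own block avoids this.

```c
#include <stdint.h>

/* Two logically independent counters that happen to share one cache block. */
struct { int64_t a; int64_t b; } shared_counters;

/* Padding each counter to its own (assumed 64-byte) block avoids the
   ping-ponging; 64 is an assumption about the block size, not a given.   */
struct { _Alignas(64) int64_t a; _Alignas(64) int64_t b; } padded_counters;

/* Each of these would run on its own thread/core. */
void *thread0(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000000; i++)
        shared_counters.a++;   /* invalidates the block in the other core */
    return 0;
}

void *thread1(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000000; i++)
        shared_counters.b++;   /* ...and vice versa, on every iteration   */
    return 0;
}
```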
If the coherence protocol worked at the granularity of words, you could potentially avoid this false sharing, but that comes at the price of huge overhead in the design; it is always a tradeoff and worth thinking about. For now, we consider it the programmer's job to map data nicely so that you do not run into false sharing much: hopefully different threads operate on different cache blocks, so when you partition your data you should not split one cache block across several threads. Is that easy? Not always, because sometimes you do not know how your data will map to cache blocks; that is one issue. Another issue is that some applications have irregular access patterns: you may have a lot of random accesses, and you will hit false sharing at some point. Sorry, say that again? Yes, that is true. Okay, I have given quite a lot of background here, but I think this perspective helps us follow the rest of the lecture.

Now let's look at a very simple coherence scheme, which we can call valid/invalid (VI). Caches snoop, i.e., observe, each other's write and read operations, and if a processor writes to a block, all the others invalidate their copy of that block. It is a simple protocol, and I will explain it in detail because we will build on it afterwards, so it is important to understand it well. Here we consider a write-through cache, meaning that when you update a cache block you also update main memory, and a no-write-allocate cache, meaning that on a write miss (which is clearly a miss) you just update main memory; you do not read the block from memory, put it in the cache, and then update it. So writes are not allocated: if you already have the block and get a write hit, you update your cache and also update memory because it is write-through; but if you do not have the block and get a write miss, you do not allocate it in your cache. Clear?

There are two types of actions here: actions of the local processor on the cache block, processor read (PrRd) and processor write (PrWr), and actions that are broadcast on the bus for the block, bus read (BusRd) and bus write (BusWr). Let's take a look at this very simple protocol, which is a correct protocol; you can prove that it works. Whether it is effective in terms of performance is another question, but it is correct. You keep this FSM for every cache block, in every processor, and the transitions are seen from each processor's own perspective. Let's start as one processor: you want to read a memory address; if you access the cache and the block is in the invalid state, you have a read miss.
processor read and also you send BS read to the to the whole processor and then you go to the valid state right this actually protocol doesn't really do much when you send bus rate but we have this signal here because we're going to use it for other protocols so now you are in a valid State and then um at some point you do processor write so you want to write to that block as long as you do process or read let's say first if you as long as you do process or read basically you stay in that valid right you don't do anything but when you do a processor right um You need to send bus right still you stay here because you have the cash uh copy so it's valid and you have right heat so you stay in this state but at the same time you also send bus right to the boss so from the another processor perspective if another processor has the data like it was in the valid State and it observe BS right then it goes to invalid State because it means that the copy that I have is obsolute so I'm going to make it invalid such that if my the processor like let's say P2 accesses that cach block it encounters a Miss such that you you need to go to the main memory and load it and we know that the main memory copy is actually up to dat because we have right through cach right so yeah that these things can work but the thing is that when you want to implement them you're going to get to a lot of also race condition which you you need to and you're actually going to exercise with that at some point in lab six uh which is again a bonus lab um but yeah you're going to implement M protocol that we're going to see today which is actually quite interesting okay question of course I didn't go into all these uh let's see if any of them yeah here also if you're in invalid State and you send you have a processor right this is actually the the the point that I mentioned we are in know when we are we our cash is no right allocate so if your cash block is invalid and you have a processor right you send boss right which is important because there might be other copies that other caches have and they need to go to invalid state but you also stay in the invalids because you don't allocate rights to your cash essentially question about this protocol good okay so I like to review very quickly some non solutions to Cache cerence um no Hardware based cerence that we need to keep keeping caches cerent is is a software's responsibility um this makes microarchitecture life easier but makes average program is life much harder and then we need to worry about Hardware caches to maintain program correctness uh which usually as a program you don't really care about that you need to care about consistency uh but you don't really care about coherence when you work for example with multiprogramming multi trading also there are overheads in ensuring cerence in software that can actually comes at the price of other overheads like a page protection as we discuss um also page based software cerence and non-cashable so these are some techniques that we discuss very briefly but you can you can imagine that the overhead of them yes sorry not necessarily faster because you know for example with this page based software coherence clearly your synch you're serializing all the all the cores if they are I mean if you can actually map your map the data to threads at a page level like each thread works on a page and they are not sharing then yes but then if you're not sharing then you don't really have a chus issue right so at some point if you're operating on 
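To make the state machine concrete, here is a minimal C sketch of the per-block VI controller we just walked through; this is only my illustration of the slide, not code from the lecture, and all of the names are made up. It assumes the write-through, no-write-allocate cache described above.

#include <stdbool.h>

typedef enum { INVALID, VALID } vi_state_t;

/* Local processor access to this block: returns the next state and says
 * which bus transaction (if any) must be broadcast. */
vi_state_t vi_on_processor_access(vi_state_t s, bool is_write,
                                  bool *send_bus_rd, bool *send_bus_wr) {
    *send_bus_rd = *send_bus_wr = false;
    if (s == INVALID) {
        if (is_write) {              /* write miss: no-write-allocate        */
            *send_bus_wr = true;     /* memory is updated, others invalidate */
            return INVALID;
        }
        *send_bus_rd = true;         /* read miss: fetch the block           */
        return VALID;
    }
    /* VALID: a read hit changes nothing; a write hit updates the cache and,
     * because the cache is write-through, memory, and broadcasts BusWr.     */
    if (is_write) *send_bus_wr = true;
    return VALID;
}

/* This cache snoops a transaction issued by another processor. */
vi_state_t vi_on_snoop(vi_state_t s, bool is_bus_write) {
    if (s == VALID && is_bus_write)
        return INVALID;              /* our copy just became stale           */
    return s;                        /* BusRd requires no action in VI       */
}

In a real snooping cache this logic sits next to the tag store and runs for every transaction observed on the bus, but the transitions are exactly the ones on the slide.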
Okay, so I'd like to very quickly review some non-solutions to cache coherence. One non-solution is no hardware-based coherence: keeping caches coherent becomes the software's responsibility. This makes the microarchitect's life easier, but it makes the average programmer's life much harder, because now the programmer needs to worry about the hardware caches to maintain program correctness, which, as a programmer, you usually do not want to care about; you need to care about consistency, but you do not really want to care about coherence when you work with, for example, multiprogramming or multithreading. There are also overheads in ensuring coherence in software, which come at the price of other costs, like the page protection we discussed, page-based software coherence, and marking data non-cacheable. These are techniques we only discuss briefly, but you can imagine their overhead. (Question: would the software approach not be faster?) Sorry, not necessarily, because with page-based software coherence, for example, you are clearly serializing all the cores if they share data. If you can map the data to threads at page granularity, so each thread works on its own page and they do not share, then yes; but then, if you are not sharing, you do not really have a coherence issue in the first place. So at the point where you are operating on shared data, if you want to protect that shared data at the granularity of a page (say you take that page and protect it), you are serializing all the operations. It might be faster in some cases: for example, the microarchitecture is sometimes simpler, which means it can operate at a higher frequency. But even then, these techniques add other system-level issues, so overall I do not think your system performance will be better, although there might be some cases. Do you have an example in mind? (Student: if the data is not shared...) Right, but if your data is not shared, hardware-based coherence control also does not need to do much. Well, it depends on how sophisticated the technique is: with the previous protocol, once you update a cache block you still send the BusWr command, and if the block you updated is not shared, the other cores receive that BusWr but do nothing, because they do not have that cache block; still, you did send that broadcast, and it is not free. You can, however, come up with a better implementation of the protocol to reduce that overhead.

Another non-solution is to make all caches shared between all processors. The reason we have the coherence problem is that we have local copies, like local L1s, whose contents might not be seen by other processors. If you assume all caches are shared, then everyone observes everything and indeed we do not need coherence, but the shared cache becomes the bottleneck, which is very easy to understand and very hard to fix: it is hard to design a scalable system with low-latency cache access this way. For an L1 you want a few-cycle access, and that clearly cannot be shared across hundreds of processors. Question? Okay.

So, to maintain coherence, we need to guarantee that all processors see a consistent value, or consistent updates, for the same memory location. A write to location A by processor 0 should be seen by processor 1 eventually; it does not need to be communicated immediately, but eventually it must be observed. And all writes to A should appear in some order. So coherence needs to provide write propagation (we need to guarantee that updates will propagate) and write serialization (a consistent order of writes to the same memory location, seen by all processors). These are the criteria for a coherence technique, and essentially for any protocol you need to show that it provides these two requirements. Of course, we need a global point of serialization for this write ordering, and we are going to see how we can provide that global ordering. Hardware cache coherence, the basic idea: a processor or cache broadcasts its writes or updates to a memory location to all other processors, and another cache that has the location either updates or invalidates its local copy. (Question.) Yes, there are techniques that simplify coherence using the consistency model; as I said, for GPUs there are many works that try to provide cache coherence techniques, but there are also many other works that implement coherence using the consistency model, which is interesting. (The other way around?) No; cache coherence only talks about one memory location. You care about cache coherence when you are talking about a single memory location, whereas with consistency you care about all memory locations, so if you think about it, consistency is in a sense a superset. Using cache coherence to implement consistency, at least I do not know of it; the other way around is possible, and people have done it, especially for GPUs; I have read a lot of those papers.

Okay, so: another cache that has the location either updates or invalidates its local copy. This is simple, so let's review our basic protocol and see how we can safely update replicated data. One option is an update protocol: we push the update to all copies, so once you update, you push the new value to all the caches in the system that hold the block, and they update their copies. The other option is an invalidate protocol: you just tell the others that you updated this block, and they invalidate their copies. On a read, if the local copy is invalid, you put out a request; if another node has a copy, it returns it, otherwise memory does. This depends on how you implement coherence: sometimes you can get the up-to-date value from another cache's copy, because there is a point of serialization; once you put your request on the bus, or on a shared fabric, all the other caches observe it (this is part of snoopy caching, which we will see more of), and if, say, processor 0's cache has the most up-to-date copy, it just puts the data on the bus. It could also be that you always read from memory: you can simplify things and say "I am not getting copies from other caches, I only get data from main memory", and that is why, for example, we needed the write-through cache in that earlier example. On a write request, we read the block into the cache as before, and then: for the update protocol, we write to the block and simultaneously broadcast the written data and the address to the sharers, and the other nodes update the data in their caches if the block is present; for the invalidate protocol, we write to the block and simultaneously broadcast an invalidation of the address to the sharers, and the other nodes invalidate the block in their caches if it is present. Makes sense? It is very simple: when you write to a cache block, you either send your updated data to the other caches, or you just send an invalidation command.
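Just to make the two write-propagation choices concrete, here is a rough C sketch; again this is my own illustration with hypothetical names ("remote" stands for the copies held by the other caches, reached over whatever broadcast medium you have, a snooped bus here).

#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct { uint64_t addr; uint8_t data[64]; int present; } cache_line_t;

/* Update protocol: the new data itself is pushed to every other copy. */
void on_write_update(cache_line_t *remote, size_t n,
                     uint64_t addr, const uint8_t *new_data) {
    for (size_t i = 0; i < n; i++)
        if (remote[i].present && remote[i].addr == addr)
            memcpy(remote[i].data, new_data, sizeof remote[i].data);
}

/* Invalidate protocol: only the address travels; other copies are dropped. */
void on_write_invalidate(cache_line_t *remote, size_t n, uint64_t addr) {
    for (size_t i = 0; i < n; i++)
        if (remote[i].present && remote[i].addr == addr)
            remote[i].present = 0;
}

The difference in what has to travel on the bus, a whole block versus just an address, is exactly the tradeoff we discuss next.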
Okay, which one is better? Like many questions we have seen so far: it depends. It depends on the write frequency and the sharing behavior. For the update protocol, if the sharer set is constant and updates are infrequent, we can avoid the cost of invalidating and re-reading for broadcast-update patterns: the updates are infrequent, so whenever one happens you just send it, and you do not impose a lot of overhead. However, if data is written without intervening reads by the other cores, those updates are useless: if you update a cache block and push the new value to another core, and that core also writes it without ever reading it, and then another core writes it again, the pushed updates were wasted. At least if the receiving processor reads the value before updating it, the update was useful. That is why the update protocol might not be ideal, and with a write-through cache policy the bus can become the bottleneck, because you also need to inform memory. With invalidation, after the invalidations the writing core has exclusive access rights, which is important: once you send the invalidation requests to the other caches, this core has exclusive access to that cache block, meaning it can operate on the data silently for as long as its access remains exclusive (we are going to see how to support that), which is good. And only cores that keep reading after each write retain a copy, which again is important. But if write contention is high, it can lead to ping-ponging: rapid invalidation and re-read traffic from different processors. Makes sense?

So there are two cache coherence methods, that is, two ways to ensure the proper caches are updated. One way is a snoopy bus: it is bus-based, and the shared bus is the single point of serialization for all memory requests, so processors observe other processors' actions through that bus. When a processor, say P1, makes a read-exclusive request for cache block A on the bus, P0 also sees it and invalidates its own copy of A; all the caches snoop, they observe the bus and take actions based on what they see. The other way is a directory, which is a single point of serialization per block. You have a directory (we are going to see the example very soon) in which, for each cache block, you store which cores, which caches, have that block; this can also be implemented distributed across nodes, and processors make explicit requests for blocks. In the snoopy model, when you send an invalidation you have to broadcast it: you do not know whom to send it to, you just put it on the bus, and if it matters to some of the processors, they take action. In the directory-based model, you know which cores are sharing that cache block, so you send requests only to those specific cores or caches. The directory tracks which caches have each block, and the directory coordinates invalidations and updates, of course.

So let's see a very brief introduction to directory-based coherence; then we will move to snoopy, and then we will get back to directories again soon. The idea of directory-based coherence: a logically central directory keeps track of where the copies of each cache block reside, and caches consult this directory to ensure coherence, so whenever you have a request, you consult your directory. An example mechanism: for each cache block in memory we store P+1 bits, where P is the number of processors and the extra bit records whether one processor holds the block exclusively; that extra bit is the exclusive bit. On a read, we set the reading cache's bit and arrange the supply of the data; on a write, we invalidate all caches that have the block, reset their bits, and we can set the exclusive bit for the writer.

Let's see an example. We have four processors, and we also have this exclusive bit. Processor P1 takes a read miss to block A; the directory entry is all zeros, so no one has the data; you just read the data from main memory and then set P1's bit in the directory, meaning P1 has this cache block. Then at some point P3 takes a read miss, so you also set P3's corresponding bit. At some point P2 takes a write miss; then you need to invalidate P1's and P3's caches, because you know these two have the data, so you send invalidation requests only to P1 and P3, then service the write request, set P2's bit, and also set the exclusive bit. This exclusive bit is really important, because it says that this processor has the data exclusively, and that makes it possible to do some operations silently. After some time P3 takes a write miss; then you reset the bits, set P3's bit, and set the exclusive bit again. After some point P2 takes a read miss, so you supply the data from P3 (it could also be supplied from main memory, depending on the way you implement your directory mechanism); but now two processors have the data, so you reset the exclusive bit. (Question: the exclusive bit is set only when a processor writes?) Yes, when that processor writes to the block; if you only read, you do not set the exclusive bit. In the context of this example it means, say, P3 updated the block and has the data exclusively, so when P2 encounters a read miss, the directory knows that P3 has the most up-to-date value of that cache block, and basically P2 asks P3 to supply that data. (Yes, it is expensive.) There are ways to reduce the cost, but in the end the directory is quite expensive, that is true. That does not mean a snoopy scheme is easier, though, because with snoopy you need a bus, and (tomorrow we will learn about interconnects) a bus might be okay to connect eight or sixteen processors, but not once we have hundreds of processors. Then again, the question is: do you need to make a thousand processors coherent? You may not want that. You can also think about clustering your chip and keeping coherence within every cluster; then, when you allocate your program, you do not allocate it across all the clusters, only to partitions that are coherent. So you can do some things, but if you want to keep a thousand processors coherent, it is not really easy. There are, of course, a lot of optimizations for directories. The directory is the coordinator for all operations to be performed on a given block by any processor, and it guarantees correct ordering, and there are many opportunities for optimization: for example, we can enable bypassing the directory and communicating directly between caches. We will see examples of this optimization later, when we cover the directory techniques in a bit more detail.
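Before we move on, here is a small C sketch of the P+1-bit directory entry and the read-miss and write-miss handling we just walked through; this is just my illustration, not from the slides, and the helper names are hypothetical.

#include <stdint.h>
#include <stdbool.h>

#define P 4                        /* number of processors in the example      */

typedef struct {
    bool sharer[P];                /* one presence bit per processor           */
    bool exclusive;                /* set when exactly one writer holds it     */
} dir_entry_t;

/* Read miss from processor p: record the new sharer; once more than one
 * cache holds the block, it can no longer be exclusive. Data is supplied by
 * memory, or by the exclusive owner if there is one. */
void dir_on_read_miss(dir_entry_t *e, int p) {
    e->sharer[p] = true;
    int count = 0;
    for (int i = 0; i < P; i++) count += e->sharer[i];
    if (count > 1) e->exclusive = false;
}

/* Write miss from processor p: invalidate every other sharer (only the
 * caches whose bit is set receive a message), then record p as the
 * exclusive owner. */
void dir_on_write_miss(dir_entry_t *e, int p) {
    for (int i = 0; i < P; i++) {
        if (i != p && e->sharer[i]) {
            /* send_invalidation(i): hypothetical directed message */
            e->sharer[i] = false;
        }
    }
    e->sharer[p] = true;
    e->exclusive = true;
}

Note that the invalidations go only to the caches whose presence bit is set, which is exactly what distinguishes this from a snoopy broadcast.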
But now let's get back to snoopy cache coherence. The idea here is that all caches snoop all other caches' read and write requests and keep the cache blocks coherent. Each cache block has coherence metadata associated with it in the tag store of each cache; in that earlier valid/invalid example, you already had to keep that metadata for every cache block. This is easy to implement if all caches share a common bus: each cache broadcasts its read and write operations on the bus, which is good for small-scale multiprocessors. But what if you would like to have, say, a 10,000-node multiprocessor? Clearly that is not easy. Here is the picture: you have N processors, each processor has a cache, you keep the coherence metadata in the tag array, and this is your shared bus; the operations are processor read, processor write, bus read, and bus write, and you can also have a bus read-exclusive (BusRdX), meaning the data is read with the intent to hold it exclusively, so you can think about some optimizations. The simple valid/invalid protocol we saw was implemented on top of a snoopy cache; now we want to see some extensions, how we can make it better.

What if we want write-back caches? In that example we assumed write-through. If we want write-back caches, that means we want a Modified state: there should be a state that says the data I have is dirty. That brings us to a more sophisticated protocol, which we call MSI, the Modified/Shared/Invalid protocol, which extends the metadata per block to encode three states. Modified means the cache line is the only cached copy and it is dirty. Shared means the cache line is one of potentially several cached copies; pay attention to "potentially": it means the block is potentially shared, it does not say it is shared for sure, and it is also clean, not potentially clean but definitely clean, because if it were not clean it would have to be in the Modified state. And Invalid means the cache line is not present in this cache. When you have a read miss, you make a read request on the bus and transition to S: whenever you have a read miss you go to Shared, and that is why we say "potentially shared"; we do not know, there might be only one copy, but in MSI we do not care, we just go to the Shared state. A write miss makes a read-exclusive request and transitions to the Modified state. And when a processor snoops a read-exclusive from another writer, it must invalidate its own copy, if any; that is how we implement the invalidation. Makes sense? These topics are overall easy, but sometimes not that easy to reason about, so you need to go over them multiple times and think about them to digest them nicely, because otherwise you may confuse all these states; but usually they make sense, and it is not very hard to reason about them. Okay, and we have this Shared-to-Modified transition, which we call an upgrade, because you are upgrading your privileges; the Shared-to-Modified transition can be made without accessing memory, via invalidations. (Question: why a read-exclusive on a write miss?) You send a read-exclusive when you have a write miss. On a write miss you want to update the cache block, but you do not have it, because it is a miss, so you do need to read it; but you do not issue a plain bus read request, because that would mean you do not want to update it, you just want to go to the Shared state. You send a read-exclusive request so that the other cores that have this cache block understand that this core is going to update it, and they all need to invalidate their copies. That is the way they can differentiate the two cases.

So this is the FSM for this more complicated protocol, and these are essentially the main differences. On a processor write you need to send a bus read-exclusive: if you are in the Shared state, meaning you have the copy, and you want to write it, you go to the Modified state and you also send this bus read-exclusive to the other cores, telling them that they all need to invalidate. You can also move from Invalid directly to Modified: that is the case where you get a write miss, and the previous one is the case where you get a write hit, but in both you go to the Modified state and you invalidate the other cached copies, and the other copies, when they receive this bus read-exclusive, all need to go to the Invalid state. This is important: if you are in the Shared state and you see a bus read-exclusive on the bus, you need to invalidate your copy, so you go to Invalid. (About the name:) this is just naming, a kind of terminology, and if you check different books and papers you are going to see a lot of different terminologies; but the reasoning is that you do want to read the block (I agree that you actually want to write it, that is true), and it is "exclusive" because your access is going to be exclusive, since you want to update it. (Question: which side?) This side: when you are in the Invalid state and you want to write to that cache block, you go from the Invalid state to the Modified state on the processor write. And once you are in the Modified state, you have the copy exclusively, so as long as it is your own processor reads and processor writes, you do not need to inform anyone, and that is the beauty of it. (Yes, and you still need to read the block in, to bring that cache block; that is right, thank you.)
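As before, here is a minimal C sketch of the MSI controller for one block; it is my own illustration following the standard MSI transitions we just discussed, and the names are made up.

#include <stdbool.h>

typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX } bus_req_t;

/* Local access: returns the next state and the bus request to broadcast. */
msi_state_t msi_on_processor_access(msi_state_t s, bool is_write, bus_req_t *req) {
    *req = BUS_NONE;
    switch (s) {
    case MSI_I:
        if (is_write) { *req = BUS_RDX; return MSI_M; }  /* write miss         */
        *req = BUS_RD;
        return MSI_S;                       /* read miss: "potentially shared" */
    case MSI_S:
        if (is_write) { *req = BUS_RDX; return MSI_M; }  /* upgrade, no memory */
        return MSI_S;                                    /* read hit           */
    case MSI_M:
        return MSI_M;                       /* silent read and write hits      */
    }
    return s;
}

/* Snooped request from another processor for the same block. */
msi_state_t msi_on_snoop(msi_state_t s, bus_req_t req, bool *supply_data) {
    *supply_data = false;
    if (req == BUS_RDX) {
        if (s == MSI_M) *supply_data = true;   /* our dirty copy must get out   */
        return MSI_I;                          /* another core is going to write */
    }
    if (req == BUS_RD && s == MSI_M) {
        *supply_data = true;                   /* downgrade: data becomes clean */
        return MSI_S;
    }
    return s;
}

The silent hits in M are exactly what the exclusive access rights buy you; every transition out of I or S still costs a bus transaction.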
Okay, but now the question is: can we do better than MSI? A block is in no cache to begin with, and the problem with MSI is exactly this: when you are in the Invalid state and you read, you go to the Shared state, so you just assume your data is shared, even though it might not be; you might have the only copy in the system. That is what is written here: on a read, the block immediately goes to the Shared state, although it may be the only copy and no other processor will ever cache it. Why is this a problem? Suppose the cache that read the block wants to write to it at some point. Then it needs a broadcast, whereas if it held an exclusive copy it could just update silently; if the cache knew it had the only cached copy in the system, it could have written to the block without notifying any other cache, which saves an unnecessary broadcast of invalidations. That brings us to the solution: MESI, Modified/Exclusive/Shared/Invalid. Essentially we add another state, the Exclusive state, and the idea is that this state indicates this is the only cached copy and it is clean. So when you are in the Invalid state and you want to read the data, if you can determine that you are reading it exclusively, you go to Exclusive. How can you realize that you have the only copy? One way is a wired-OR implementation: you have the bus, and all caches observe it. When you are in the Invalid state and you read a cache block, all the other caches also observe the request on the bus, and if any of them has a copy of the block, it asserts this wired-OR line; this is a wire whose value ends up 1 if any cache drives a 1, meaning your access is not exclusive. But if none of the caches asserts it, so all of them put a 0, the result of the wired-OR is 0, meaning "I can get the copy exclusively", and then you move to the Exclusive state. The good thing here is that you can transition silently from Exclusive to Modified on writes. MESI is also called the Illinois protocol; it was designed by Patel and one of his master's students in 1984. You can see it here: you are in the Invalid state and you issue a processor read; only if no other processor has the data do you go to the exclusive-clean state, and if you issue the processor read and others do have this data, you go to the shared-clean state. So now we have two different states, exclusive-clean and shared-clean, and the good thing is that you can now go silently from the exclusive-clean state to the modified state. Clear? There is also another way of drawing the same MESI state machine; it is very similar, and you can check it as well; it is from the Culler and Singh book.
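Here is the corresponding C sketch, again my own illustration with made-up names, focusing on the processor-side transitions and the E-versus-S decision; the snooped side behaves like MSI, plus a remote read drops E to S.

#include <stdbool.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state_t;
typedef enum { REQ_NONE, REQ_BUS_RD, REQ_BUS_RDX } mesi_req_t;

/* shared_line models the wired-OR "shared" signal: true if any other cache
 * asserted it while our BusRd was on the bus. (In hardware the line is
 * sampled after issuing the request; collapsing that into one call keeps
 * the sketch short.) */
mesi_state_t mesi_on_processor_access(mesi_state_t s, bool is_write,
                                      bool shared_line, mesi_req_t *req) {
    *req = REQ_NONE;
    switch (s) {
    case MESI_I:
        if (is_write) { *req = REQ_BUS_RDX; return MESI_M; }
        *req = REQ_BUS_RD;
        return shared_line ? MESI_S : MESI_E;    /* only copy? take it as E    */
    case MESI_S:
        if (is_write) { *req = REQ_BUS_RDX; return MESI_M; }  /* upgrade       */
        return MESI_S;
    case MESI_E:
        if (is_write) return MESI_M;             /* silent E -> M, no bus use  */
        return MESI_E;
    case MESI_M:
        return MESI_M;                           /* silent read and write hits */
    }
    return s;
}

Compared to the MSI sketch, the only change on the read-miss path is consulting the wired-OR line, but that is precisely what lets a later write avoid the invalidation broadcast.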
And this is your lab, lab six, which we will release at some point: you will implement MESI. Even though the protocol makes sense, you are going to encounter a lot of issues when you want to implement it and make it work. It is not that encountering issues is interesting by itself, but it is interesting to see how hard these things are to implement: this MESI protocol was used in the Pentium Pro, and it actually had some bugs; people found them later, even though the protocol was tested a lot and the protocol itself works. Making it actually work in an implementation is the hard part, and that is why this lab is quite interesting. I would also like to emphasize two pieces of terminology. A transition from a single-owner state, Exclusive or Modified, to Shared is called a downgrade, because the transition takes away the owner's right to modify the data. A transition from Shared to a single-owner state, Exclusive or Modified, is called an upgrade. It is just terminology you need to be aware of.

But let's see if we can do better; I'd like to discuss some tradeoffs. First: on the downgrade from Modified, should we go to the Shared state or to Invalid? You are in the Modified state and another core wants the data: do you go to the Shared state, meaning you remain one of the cores holding the data, or do you just invalidate your copy? It depends. Shared makes sense if the data is likely to be reused by you before it is written by another processor; Invalid makes sense if the data is likely not to be reused before another core writes to it. So again, it depends; you could think of some intelligent techniques that decide, but those can also complicate things even more. Second: cache-to-cache transfers. On a BusRd, should the data come from another cache or from memory? Another cache may be faster if memory is slow or highly contended; from memory is simpler, with no need to wait and see whether another cache has the data first, and less contention at the other caches, but this requires a write-back on the Modified downgrade: once you are downgrading from the Modified state, you need to write the block back to memory. A write-back on Modified-to-Shared is needed because the Shared state is clean by definition. Can we avoid that? One possibility is to define a state named Owner, as in the MOSI protocol: essentially, one cache owns the latest data. The thing is, when you go from the Modified state to Shared, all the sharers have the updated data, so reads are fine, but once the data is eventually evicted, who is responsible for updating memory? That is the thing you need to make clear, and that is what the owner attribute does: once you go to the Shared state, one processor remains responsible for writing this specific cache block back to memory, and the memory write-back happens when all caches have evicted their copies.

So MESI actually has some problems. Again, the observation: the Shared state requires the data to be clean; all caches that have the block have the up-to-date copy, and so does memory, and the problem is that you then need the write-back. We can improve this. Idea one: do not transition from Modified to Shared on a BusRd; just invalidate the copy and supply the modified block to the requesting processor directly, without updating memory. We already discussed when this can be useful: if your data is not likely to be read again by you, that is, you have no future accesses to it, it may make sense to just invalidate that cache block. Idea two: do transition from Modified to Shared, but designate one cache as the owner, which will write the cache block back when it is evicted. That is the key: now Shared means shared and potentially dirty. You can implement it by adding another state to your state machine, or you can change the definition of your Shared state to "shared and potentially dirty", which also implements it. This is a version of the MOESI protocol. Essentially, the tradeoff in sophisticated cache coherence protocols is this: the protocol can be optimized with more states and prediction mechanisms, so you can reduce unnecessary invalidations and transfers of blocks; however, more states and optimizations are more difficult to design and verify. It is not only about designing them, it is also about making them work and then verifying them; as I said, even for MESI, people observed bugs later, so it is very hard to make these work.
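To pin down the Owner idea, here is a tiny C sketch (again my own illustration) of the two places where the O state changes behavior: who answers a remote read of a dirty block, and who is responsible for memory on eviction.

#include <stdbool.h>

typedef enum { ST_I, ST_S, ST_E, ST_M, ST_O } moesi_state_t;

/* Snooped BusRd while we hold the block dirty: instead of writing memory,
 * supply the data on the bus and keep responsibility for it (M -> O).
 * A clean exclusive copy simply becomes an ordinary shared copy. */
moesi_state_t moesi_on_snoop_busrd(moesi_state_t s) {
    if (s == ST_M) return ST_O;
    if (s == ST_E) return ST_S;
    return s;
}

/* On eviction, only the dirty single owner (M) or the designated owner (O)
 * must write the block back; plain Shared copies can be dropped silently,
 * because either they are clean or the owner elsewhere still covers them. */
bool must_write_back_on_evict(moesi_state_t s) {
    return s == ST_M || s == ST_O;
}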
These optimizations also give diminishing returns: from the valid/invalid protocol to MSI you observe a huge improvement; from MSI to MESI you observe a bit less improvement; from MESI to MOESI you see some improvement, but not that much. Your returns get lower and lower as you optimize more and more, so at some point it is not worth it, depending on your tradeoffs. Questions? Okay, nice.

So those are the two approaches; now let's compare them quickly. With snoopy, the cache-miss latency on the critical path is short: the request is essentially just a bus transaction to memory. Global serialization is easy to implement as long as you have a bus, since the bus already provides it, and the scheme is simple and can adapt bus-based uniprocessors easily. But it relies on broadcast messages being seen by all caches in the same order, and having such a bus is not easy: a bus is not scalable, and at some point you need to design a virtual bus, that is, you have a network which is not a bus but on which you build a virtual bus; we will see a bit of how to do that tomorrow. So this is not really easy to implement at scale. With a directory, you add indirection to the miss latency, which can lengthen the critical path: the request goes to the directory and then to memory. It also requires extra storage space to track the sharer sets. That storage can be approximate, because false positives are okay, at least for correctness, so you can think of storing the sharer sets as Bloom filters: a Bloom filter has no false negatives, so it is fine for correctness, while false positives only impose some performance overhead, which we would like to minimize. The protocols and race conditions are more complex if you want high performance. The good things about directories: they do not require broadcasts to all caches, and they are exactly as scalable as the interconnect and the directory storage, which is much more scalable than a bus; you can easily distribute your directory across many cores, and as long as you can scale your interconnect, hopefully you can also scale the directory well.
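Since the Bloom-filter idea comes up here, here is a toy C sketch of an approximate sharer set for one directory entry (my own illustration; real designs size and hash this differently):

#include <stdint.h>
#include <stdbool.h>

#define FILTER_BITS 64

typedef struct { uint64_t bits; } sharer_filter_t;

/* Two cheap hash functions over the core id. */
static uint64_t hash1(int core) { return ((uint64_t)core * 2654435761u) % FILTER_BITS; }
static uint64_t hash2(int core) { return ((uint64_t)core * 40503u + 17u) % FILTER_BITS; }

/* Inserting a sharer can never be forgotten, so there are no false negatives
 * and correctness is preserved. */
void filter_add_sharer(sharer_filter_t *f, int core) {
    f->bits |= (1ull << hash1(core)) | (1ull << hash2(core));
}

/* A query may say "maybe" for a core that never cached the block; that false
 * positive only costs a useless invalidation message, never correctness. */
bool filter_may_be_sharer(const sharer_filter_t *f, int core) {
    uint64_t mask = (1ull << hash1(core)) | (1ull << hash2(core));
    return (f->bits & mask) == mask;
}

void filter_clear(sharer_filter_t *f) { f->bits = 0; }  /* after invalidating all sharers */

A full bit vector is the exact version of this; the Bloom filter trades a few spurious invalidations for much smaller directory storage.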
Okay, let me quickly revisit directory-based coherence, and then we can take a five-minute break, unless there are questions. Okay. So, a logically central directory keeps track of where the copies of each cache block reside, and caches consult this directory to ensure coherence. There can be different mechanisms; one example is the one we already discussed: for each cache block in memory we store P+1 bits; on a read we set the reading cache's bit and arrange the supply of the data; on a write we invalidate all caches that have the block and reset their bits; and we can also have the exclusive bit associated with the block. We have already seen that example. Essentially, directory-based coherence is required when scaling past the capacity of a single bus: when we know that a single bus cannot really scale, it becomes important to use a directory. And it can be distributed. Coherence still requires a single point of serialization for a given block, for write serialization, but the serialization location can be different for different blocks: when you distribute your directory, the metadata for different cache blocks can be stored in different slices of the directory, so the point of serialization is different for every cache block, but for each block there is still one serialization point. It is also quite nice that we can reason about the protocol, for a single block, as one server (the directory node) and many clients (the private caches). And now that you have this intermediate node, you can make it intelligent: the directory can make a lot of decisions, because you are consulting another node, like a server, and you can implement different algorithms in that server, which is quite interesting. Another point I wanted to mention is that with a directory you can also update your protocol more easily. With a snoopy scheme, if you want to go from MSI to MESI, you cannot just tweak something; you need to redesign the whole thing. With a directory, you can have some storage whose bits, or even a piece of software running at the directory, encodes how decisions are made; you just update that, and the policy changes.

For the directory data structure, we need it to support invalidations and cache-block requests. The key operation to support is a set-inclusion test, and, as I already discussed, false positives are okay: we want to know which caches may contain a copy of a block, basically in order to avoid broadcasting invalidations, but the false-positive rate comes at the price of performance overhead. The most accurate, and most expensive, representation is a full bit vector, but of course that does not work when we have many processors, so we can use compressed representations: linked lists, Bloom filters, they are all possible, and people have worked a lot on compressing directories. I should also say that the directory we have in hardware is actually a cache for the main directory, because the main directory might be huge; if you cannot keep it on chip, you cache it in hardware, and the main directory is stored in main memory.

The basic operations essentially follow the semantics of a snoop-based system; they are not really that different, but they use explicit request and reply messages: you communicate with your directory, which acts like a server, and it replies to you. The directory receives read, read-exclusive, and upgrade requests from the nodes; it sends invalidate or downgrade messages to the sharers if needed; it forwards requests to memory if needed; and it replies to the requestor and updates the sharing state. Let's see some examples. But before that, as I said, the protocol design is also flexible here, because the directory is another node and you can reconfigure it as much as you want; of course you need to build some reconfigurability into that directory, but as long as you do, you can change policies easily.

Okay, here is one example. Processor P0 wants to read an address, so it sends a request for that address to the home directory. The home directory then either sends the data to P0 exclusively or sends it in a shared manner; again, you might have different policies here, and the directory can implement different policies. Or, for example, the directory can ask another node, say P1, to supply the data: P0 wants to read the data, the home directory knows that P1 has the up-to-date copy, so it asks P1 to send the data to P0. Next, a read-exclusive with a former owner: P0 sends a read-exclusive request to the home directory; the home directory sends an invalidate request to the owner; and the owner sends a "revision" (as I said, there are many different terminologies; the revision is the updated version of that cache block) to the home directory and also to P0. And here is an example of contention and how you resolve it: both P0 and P1 send read-exclusive requests to the home directory. The home directory acknowledges one of them and sends it the data; for the other one it can send a negative acknowledgment (NACK), or it can queue the request and acknowledge and service it when it has time. That makes P0 happy and P1 unhappy, but at some point P1 repeats its read-exclusive request to the home directory; the home directory sends an invalidate to P0 and supplies the data to P1. Sounds good, right? And P1 is eventually happy too. Of course, when you have contention you have issues, so you need to escape race conditions by NACKing requests to busy entries (for example, entries with pending invalidations), or by prioritizing the requests, or by queueing requests and granting them in sequence.
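Here is a rough, heavily simplified C sketch (my own illustration, not the mechanism from the slides) of how a home node could serialize conflicting read-exclusive requests using a busy flag and NACKs, as in the contention example above.

#include <stdbool.h>

#define P 4

typedef enum { REPLY_DATA, REPLY_NACK } reply_t;

typedef struct {
    bool sharer[P];      /* presence bit per processor                       */
    int  owner;          /* index of exclusive owner, or -1                  */
    bool busy;           /* invalidations / ownership transfer still pending */
} home_entry_t;

/* Handle a read-exclusive request from processor 'req'. While the entry is
 * busy we NACK and the requestor retries later (it could be queued instead). */
reply_t home_read_exclusive(home_entry_t *e, int req) {
    if (e->busy)
        return REPLY_NACK;

    e->busy = true;                        /* one ownership change at a time  */
    for (int i = 0; i < P; i++) {
        if (i != req && e->sharer[i]) {
            /* send_invalidate(i): hypothetical directed message; real
             * hardware would wait here for the acknowledgments              */
            e->sharer[i] = false;
        }
    }
    /* If there were a different former owner, it would also be asked to send
     * its dirty copy (the "revision") to the home node and to the requestor. */
    e->owner       = req;
    e->sharer[req] = true;
    e->busy        = false;                /* really cleared when acks arrive */
    return REPLY_DATA;
}

The point of the busy flag is exactly the race-condition handling mentioned above: the directory entry is the single point of serialization for this block.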
Of course, that brings fairness issues: which requestor should be preferred in a conflict? Interconnect delivery order and distance all matter. As I said, the directory is actually distributed across nodes, so your processors are in a network and the directory is also in the network. Sometimes the cache block that two processors both request, say P1 and P100 both request block X, is stored in a directory slice that is close to P1 in the network, so P1 can reach that directory in one or two hops, while P100 needs a hundred hops to get there. Now you have different definitions of fairness: one requestor is closer and the other farther, so what should we do? Should we NACK the closer one just because P100 has to travel a hundred hops to get there? There are many challenges here. And ping-ponging can be reduced with protocol optimizations, of course, or with better higher-level synchronization; in the end there should be cooperation between hardware and software to make things nicer. Then there are the scaling questions: how large is the directory? How can we reduce the access latency to the directory? How can we scale a system to thousands of nodes? Can we get the best of the snooping and directory protocols with heterogeneity? For that, you can check Token Coherence; it is a very nice paper that takes advantage of both. And again, this is also a very nice topic for the exam; you can see an example question here, and I am not going to review the answer, so you can check it later.
This is also not that new: I think you have seen the MOESI-Prime paper, not in our cache coherence lectures but in our RowHammer lectures, where people showed that the coherence protocol in MOESI can actually cause RowHammer; that paper shows the problem and prevents coherence-induced hammering in commodity workloads, which is quite interesting. Okay, let's take a five-minute break and then continue, or we can continue now and finish early, whatever you prefer. A five-minute break? Okay, no one votes for that, so let's continue.

So now I want to very quickly talk about cache coherence in a recent example. Remember that in the lectures on processing-in-memory we said that to make this memory-centric design work you need to revisit the entire stack, and you also need to address a lot of adoption issues; we saw the slide listing the barriers to adopting processing-in-memory, and coherence, together with synchronization, is one of them. We actually worked on that: LazyPIM, and its extension with a lot of improvements, CoNDA, provide cache coherence support when we have near-data accelerators, or let's say processing-near-memory engines. I am going to cover this paper very quickly and not go into the details; hopefully you will read it as part of your readings, but I will give you the high-level idea. So, you know that we have specialized accelerators, and the NDA here is the near-data accelerator: you have NDA compute units in DRAM, and you have your CPU with its levels of caches. The challenge is how you provide coherence between the NDAs and the CPUs; these two should be coherent, because you can also think of the NDA as just another core. There are different cores, but some of them are not cores on the CPU side, and you still need to keep them coherent. NDA applications generate a large amount of off-chip data movement if you want to make them coherent, and it is impractical to use traditional coherence protocols; I am going to show some results for that. We extensively studied the existing NDA coherence mechanisms and made three key observations. First, these mechanisms eliminate significant portions of the NDA's benefits: you use processing near memory to improve energy and performance, but if you do not support coherence intelligently, you lose the improvement, and at some point you can actually reduce performance and worsen energy consumption. Second, the majority of the off-chip coherence traffic generated by these mechanisms is unnecessary, which is actually good news: you do not need to send a lot of coherence traffic if you do it intelligently. Third, much of the off-chip traffic can be eliminated if the coherence mechanism has insight into the memory accesses. So we use an optimistic approach; I guess you are familiar with the terminology: you are optimistic, you assume everything will be okay. You run on the NDA assuming that you will not run into coherence issues, and then, after the execution, you do the coherence checks and perform only the necessary coherence requests, meaning updates or re-execution.

Okay, let me provide some more background. I am not going to talk about near-data processing itself. These are the applications we look at, from different domains, for example hybrid databases and graph processing. We find that not all portions of applications benefit from the NDA: these applications have memory-intensive portions and compute-intensive or cache-friendly portions, meaning that you need to use both the NDA, the processing-near-memory engines, and your CPU at the same time; you do not offload the whole application to either of these engines. The first key observation is that CPU threads often concurrently access the same region of data that NDA kernels access, which leads to significant data sharing; that is not good news, and it means it is important to keep them coherent. But the second observation makes things interesting: CPU threads and NDA kernels typically do not concurrently access the same cache lines, which is good, and we benefit from this second key observation. (This data is from the graph application connected components.)

We conducted an analysis of three existing coherence mechanisms. One is non-cacheable: we mark the NDA data as non-cacheable, meaning that the CPU needs to access main memory whenever it operates on those cache blocks or memory addresses. Another is coarse-grained coherence: we get the coherence permission for the entire NDA region, so the NDA gets the permission and the CPU must not touch that region; if the CPU needs to access that region, it has to wait. The third is fine-grained coherence, where we implement an additional traditional coherence protocol, like the MESI protocol we already discussed, between the NDA and the CPU; you can check the paper for the details, but essentially all the read and write requests need to communicate through the off-chip bus. Clear? With non-cacheable, because the NDA data is marked non-cacheable, the CPU generates a large number of off-chip accesses, and this significantly hurts CPU thread performance; we observed that non-cacheable fails to provide any energy saving and actually performs 6% worse than CPU-only. So even though you have a processing-near-memory engine, overall with this non-cacheable approach your performance is 6% worse. Coarse-grained coherence unnecessarily flushes a large amount of dirty data, especially in pointer-chasing applications, and you need to get permission when you want to access that region. We use coarse-grained locks to provide exclusive access. Let me explain: the CPU is running and doing some computation, the NDA is also running, so they overlap for a while; but at some point the CPU reaches the region the NDA is operating on, so the CPU requests access to that region, and the NDA of course does not permit it, because we are using coarse-grained locks; as a result the CPU has to stop, and it can only continue when the NDA is done. That is how these coarse-grained locks work. (Question: yes, it needs support in the host core, but it also needs hardware support; both, yes, and yes, virtual addresses.) This coarse-grained scheme fails to provide any performance benefit from the NDA and actually performs 0.4% worse than CPU-only. Another way is fine-grained coherence, which has two benefits: it simplifies the NDA programming model, and it allows us to get permission for only the pieces of data that are actually accessed. The problem is that there is a lot of off-chip coherence traffic through this bus, and that causes a lot of issues. We observe that fine-grained coherence does provide some performance improvement compared to the other coherence mechanisms, but it eliminates almost 72% of the energy benefits of an ideal NDA mechanism, where an ideal NDA mechanism means you do not perform any coherence checks at all. And here are the results you can see: this is the ideal configuration, which shows the room for improvement, and everything is normalized to CPU-only, meaning you do not use the processing-near-memory engines at all. You can see that non-cacheable reduces performance; coarse-grained in many cases reduces it and in some cases slightly outperforms; fine-grained actually provides some performance improvement; and in terms of energy, fine-grained does reduce energy, but you can see a huge gap between ideal NDA and the fine-grained method. (Yes, you can walk through all these animations.)

So that is the motivation of this work: poor handling of coherence eliminates much of the NDA's benefits, and the majority of off-chip coherence traffic is unnecessary. Our goal is to design a coherence mechanism that retains the benefits of ideal NDA and enforces coherence with only the necessary data movement, and that brings us to CoNDA. I will just give the overall key idea of the technique. We leverage the two key observations: we know that having insight into the accesses enables us to eliminate much of the unnecessary coherence traffic, and, from the observation we had, there is a low rate of collisions between CPU threads and NDA kernels. So we propose optimistic execution for the NDA: the NDA executes the kernel assuming it has the coherence permissions, while gaining insight into its memory accesses, and when execution is done, it performs only the necessary coherence requests. That is the whole technique, and the way we implement it is as follows. The CPU threads are executing; at some point we start the optimistic execution on the NDA, so we offload the kernel to the NDA, and this execution is optimistic: we do not perform any coherence checks, so concurrent execution of the CPU and the NDA can happen and there are no coherence requests while the computation is going on. After that, we need to resolve coherence; we need to see whether things are okay or not, so basically we need to commit or re-execute. How do we do that? We check signatures. While the NDA is running, we record which memory addresses it touches; you keep a record of them, summarized with a hash function into a signature. We do the same for the memory requests on the CPU side. Then the NDA sends its signature to the CPU, the signatures are compared, and based on that comparison we decide to commit or to re-execute.
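Here is a very rough C sketch (my own illustration, far simpler than the hardware signatures in the paper) of the commit-or-re-execute decision based on comparing access signatures:

#include <stdint.h>
#include <stdbool.h>

/* Toy Bloom-filter-style signatures of the cache lines touched during the
 * optimistic NDA execution window and by the CPU in the same window. */
typedef struct { uint64_t bits; } sig_t;

static uint64_t line_hash(uint64_t addr) {
    return ((addr >> 6) * 0x9E3779B97F4A7C15ull) % 64;   /* 64-byte lines */
}

void sig_record(sig_t *s, uint64_t addr) { s->bits |= 1ull << line_hash(addr); }

/* A coherence violation is possible only if the NDA read a line that the CPU
 * wrote during the optimistic window; then the CPU's dirty data is flushed
 * and the NDA kernel re-executes, otherwise the NDA results are committed. */
bool must_reexecute(const sig_t *nda_reads, const sig_t *cpu_writes) {
    return (nda_reads->bits & cpu_writes->bits) != 0;    /* may be a false positive */
}

A false positive here only triggers an unnecessary re-execution, never an incorrect commit, which is why a compressed signature is good enough.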
So how do we identify a coherence violation? Coherence requests are only necessary if both the NDA and the CPU access a cache line and at least one of them updates it. Clear? In the paper you can read the details, but we discuss the three possible interleavings of accesses to the same cache line. NDA read and CPU write: this is a coherence violation, because the NDA read a block that the CPU wrote to. NDA write and CPU read, and NDA write and CPU write: from the NDA's point of view these are not coherence issues, because the NDA does not read anything that is wrong; you might instead have a consistency issue, and (you can check the paper for more detail) these two cases can be handled by the consistency model, because of the ordering you have. Here is an example: the CPU writes to cache block Z and the NDA reads Z; now we have a coherence issue, because the NDA read the old value of Z, so we need to ask the CPU to flush Z to DRAM and then re-execute the NDA. Here, on the other hand, we have a write and a read to Y on the CPU side and a write to Y on the NDA side, and this is not a coherence issue, because C4 and C5 are ordered before N5; you can handle that with your consistency model. Okay, now we have a bunch of slides about the architectural support that I am going to skip; you can check them for the details of the mechanism. The important thing here is that the L1 cache you add to the NDA core caches data but does not commit it: you commit to main memory only once you are sure that what you executed is correct, so after checking the signatures, if things are okay, you commit. There are a lot of slides you can look at. And here are some results: CoNDA bridges the gap nicely and reaches almost the performance of ideal NDA; CoNDA retains most of ideal NDA's benefits, coming within 10% of ideal NDA performance, which is good, and you can also see good results for energy consumption. There are many other results in the paper that you can check, and also the conclusion, which I am not going to repeat. So if you want to learn more about the details of this technique, since I only gave the overall idea, please check the paper; it is also one of your required readings, of course. And that brings me to the end of this lecture. Any questions? Okay. There are some other topics in cache coherence that Professor Mutlu will discuss next week, on Thursday I believe, but tomorrow I am going to talk about interconnects. You already have a sense of why the interconnect is so important; we discussed it a lot in today's cache coherence lecture, and we are going to learn a lot about interconnects, which is interesting. So see you all tomorrow, and have a nice evening.