Okay, with that said, I think we're going to start perhaps one of the most exciting topics, one we've been building up to in this class: out-of-order execution. This is a way of executing instructions out of their original program sequence, and it has been very influential; it's used in essentially all computers today. Even though some people started out saying "we're not going to use sophisticated hardware techniques to execute instructions, we're going to punt all of the difficult things to the compiler," they have all come to use out-of-order execution. After 10 or 20 years they figured out that, no, you have to use hardware to accelerate instruction processing by doing things out of order. So you'll see the importance of this into the future as well. We'll cover the basics; there are a lot of optimizations that build on what we cover, and unfortunately we won't be able to cover all of them. Clearly, today's out-of-order execution engines, like what's in my pocket, for example, or what's here, are very, very sophisticated engines, and I will show you a picture of what's in my pocket, somewhat reverse engineered, as the very last slide of this lecture. Does that sound good? Okay, cool.

So this is where we are. We've covered a lot of pipelining and precise exceptions. Yesterday we set up for today: we talked about register renaming, renaming the registers from the architectural register space to a microarchitectural name space, which we called the reorder buffer yesterday. Today we're going to use that concept and hopefully finish out-of-order execution, and I will release a recorded lecture on load/store handling. Unfortunately we don't have time to really cover load/store handling, but for your own benefit you can watch that; it's really the most sophisticated part of the machine today, how to handle loads and stores. It uses a lot of content addressable memory (you got introduced to content addressable memories yesterday), and unfortunately that part of the machine is very difficult to eliminate content addressable memories from.

These are your readings. I think you've seen a lot of these, except for this optional one. The optional one covers a state-of-the-art processor in 1999; a lot of the principles remain the same today, but today there are a lot more optimizations. Next week we're going to cover branch prediction, which is essentially how to keep the pipeline full in the presence of conditional branches, and this paper covers a lot of interesting ideas from the 1990s. Of course we're going to look at much more advanced ideas this week.

This is what we built up to yesterday: an in-order pipeline with a reorder buffer and multicycle execution units, as you can see. We did the instruction reordering for precise exceptions using a reorder buffer. I'll go through this just to jog your memory. Basically, in a pipeline like this, in the decode stage the instruction accesses the register file and the reorder buffer, allocates an entry in the reorder buffer, and you check whether the instruction can execute, meaning whether its source registers are ready. If that's the case, you dispatch the instruction into the functional unit. If one of the source registers is not ready for whatever reason, you cannot dispatch the instruction; that's a non-ready instruction, and it basically stops the pipeline.

Instructions can complete out of order, which is good, and when they complete they write their results to the reorder buffer first, which is microarchitectural; they don't update the architectural state immediately. The architectural state gets updated in program order: when the oldest instruction doesn't have any exceptions and has completed, some control logic writes its result to the architectural register file or memory. Otherwise, if there's a problem with the instruction, the control logic flushes the pipeline and starts from the exception handler, or the interrupt handler if you're handling an interrupt at that point in time. So this is an in-order dispatch machine (dispatching means sending instructions to the functional unit) and an out-of-order completion machine: you complete instructions out of order, and you still retire instructions in order, in program order, as specified by the sequential semantics of the program. This is also called an in-order execution machine; I like "in-order dispatch" better, because you're really dispatching an instruction to the functional unit in program order, but you're also starting the execution of instructions in program order. You don't start the execution of instructions out of program order, out of sequence if you will. Today we're going to start the execution of instructions in the functional units out of order.

Okay, so let's move on. This is what we've covered: we said that register renaming eliminates the write-after-read and write-after-write dependences, which are not real. These exist because there are not enough names in your architectural register file, and the only dependences we really need to deal with are flow dependences. To deal with flow dependences, what you really need is to make sure that the dependent instruction gets connected to the producer. You can think of it this way: this instruction is producing the value of R3 and this instruction is consuming the value of R3; you really need to connect the producer to the consumer, so that when the consumer comes into the machine, it knows where to get its value from. Here you can get rid of these dependences because they're not real dependences: you can rename this R1 to something else, say R1000, or some reorder buffer entry number, and as a result this value has nothing to do with that value, which makes sense, right? We've discussed this multiple times, and the same is true for this output dependence. We saw yesterday that output and anti dependences exist due to the lack of register IDs, or names, in the ISA, the instruction set architecture, and they're not real dependences, because the same register refers to values that have nothing to do with each other; there's no producer-consumer relationship between these two instructions. They just happen to be writing to the same register that is allocated to the values they're supposed to produce, but from a communication perspective there's nothing between these two instructions; they just happen to collide in the name space, if you will. We eliminated this collision in the name space by renaming the architectural register ID to the reorder buffer entry yesterday, and you remember the illustration we went through; we're going to go through a very similar illustration today.

Basically, we said that the architectural register ID is renamed to a reorder buffer entry ID; you can also think of this as the architectural register ID being renamed to a physical register ID. After renaming, the reorder buffer entry ID is used to refer to the register: you don't need to hold on to the architectural register ID until you update the architectural register state at the end of the machine, and we will see that. The benefit is that this eliminates anti and output dependences, so that when you're dispatching an instruction early on in a multicycle execute machine, you don't need to stall. You can dispatch instructions by eliminating anti and output dependences, as long as the instructions are ready to go, of course. This also gives the illusion that there are a large number of registers. Because of the ISA you have a limited number of architectural registers (the instruction encoding is limited), but now you can expand the register space by renaming that architectural register space to a much larger physical or microarchitectural register space, which yesterday was the reorder buffer. You will see today that the reorder buffer is actually just one place where you can rename architectural registers: you need a larger name space and you need to provide storage for it, but it doesn't have to be the reorder buffer. Today we will introduce the concept of reservation stations and rename the registers to reservation station IDs. It could be any name space, actually; in existing processors it's the physical register file, and we will build up to the physical register file in a little bit.

Yesterday we saw that you rename the architectural register ID to the reorder buffer entry ID, and this is how we did it. I'm not going to go through this example again, but remember that if an instruction is writing to R0, for example, it sets the valid bit of R0 to zero (the value becomes not useful for future instructions at that point), and the instruction, when it's being decoded, also sets the tag to its own reorder buffer entry ID, because every instruction gets allocated a reorder buffer entry. So this tag at that point refers to the renamed register: register zero is renamed to this tag, which happens to be the reorder buffer entry of the instruction that's producing that register. And we said yesterday that if all of the instructions in the machine are writing to a particular register, say register 3, there may be 16 definitions of R3 in this particular reorder buffer, because there are 16 entries, as you can see. So the register may be renamed 16 times, and later instructions will get the correct value because instructions are being decoded in order: they will see the tag that was updated by the latest instruction writing to R3. That's the beauty of in-order decode. Every instruction that's going to source R3 will get the correct definition of R3, the correct location in the reorder buffer that will produce the R3 value this instruction is supposed to take, because of the in-order renaming that happens at decode. So you can think of this tag as the new name of the architectural register. Makes sense, right?
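To make that bookkeeping concrete, here is a minimal sketch in Python of a rename-table entry with a valid bit, a tag, and a value, plus the source lookup and destination rename an instruction would do at decode. This is not from the lecture slides; the field and function names are illustrative assumptions.

```python
# Minimal sketch of a register alias table (RAT): one entry per architectural register.
class RATEntry:
    def __init__(self, value):
        self.valid = True    # if set, `value` below can be trusted
        self.tag = None      # if not valid: which ROB/RS entry will produce the value
        self.value = value   # architectural value, meaningful only when valid

def read_source(rat, reg):
    """Decode-time lookup of one source register."""
    entry = rat[reg]
    if entry.valid:
        return ("ready", None, entry.value)   # use the value directly
    return ("waiting", entry.tag, None)       # wait for this tag to be broadcast

def rename_destination(rat, reg, new_tag):
    """The decoding instruction becomes the latest writer of `reg`."""
    rat[reg].valid = False
    rat[reg].tag = new_tag                    # later readers of `reg` will wait for new_tag
```

Because decode, and therefore `rename_destination`, happens in program order, a later reader always sees the tag of the most recent writer, which is exactly the property in-order decode gives us here.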
And internally in the machine you don't need to know about R0, R1, R3, R4 anymore; you just care about this tag, and this tag can be used to communicate the readiness of a value. For example, an instruction that's producing R5 and that has, say, tag 14 (reorder buffer entry 14) can say: "I'm producing reorder buffer entry 14. Anybody who wants reorder buffer entry 14, compare your tag to it; if it's what you need, then grab the value that I'm sending to you." This is how we're going to accomplish communication: no more architectural register names, and you're going to see that soon.

This is also a slide we covered yesterday, and we're going to look back at it today as well. You can think of this as a renaming table; Intel calls it the register alias table. Essentially you have a bunch of architectural registers, some of them in the architectural register file and some of them in the reorder buffer, and you have pointers from this register alias table to the definitions of the registers that you're supposed to get. This picture is very, very similar to what I've shown, except Intel has decoupled things: only the tag is over here. What I have done is keep the tags, values, and valid bits together, plus a separate reorder buffer, whereas Intel has pulled the tags out; we will see the reason for that later on. But it's essentially the same thing, these two pictures are essentially the same; the values just happen to be separated into what Intel calls the retirement register file. We'll get back to this picture, and to a more advanced picture, as well. Any questions? It's a relatively short review, but I also wanted to motivate some things we're going to discuss today.

So now we're going to talk about out-of-order execution, which is really dynamic instruction scheduling. It's called dynamic instruction scheduling because the static scheduling of the instructions is done by the compiler: when you get the program, the compiler has done some reordering of the instructions and so on, and that's a static schedule. Now we're going to violate that schedule internally, to decide in which order instructions should dispatch to the functional units. So let's take a look at this pipeline we looked at earlier. Remember, dispatch is the act of sending an instruction to a functional unit. I ignored the reorder buffer allocation over here, but I will bring it back later; I'm going to eliminate the reorder buffer to explain things more simply, and then we're going to reconstruct it, hopefully. We used renaming with a reorder buffer to eliminate stalls due to false dependences: if you do renaming, when you get a false dependence you don't need to stall, because you've eliminated that dependence. The problem that still remains is that if an instruction is not ready, if there's an instruction being decoded over here and it doesn't have the values it needs, or any one of the values it needs, then unfortunately it cannot get dispatched, and as a result no later instruction can get dispatched either. An instruction stalls if it's not ready. We're going to fix that problem today, and I'm going to use, or abuse, the example I used earlier.

If you look at this tram over here, it's long latency for whatever reason. I don't care if it has an exception; maybe it's long latency because it's doing something important. Unfortunately, this instruction stalls the pipeline. There's another independent instruction: it's going somewhere else, let's say it's going to use some other functional unit, but unfortunately it cannot enter the pipeline because this long-latency instruction is stalling the pipeline. I think it's very similar to this concept, and then there are other instructions behind. One way we could have enabled this instruction to enter the pipeline is to bypass the stalled one: put that instruction somewhere else. And we're going to do essentially that. Of course it's much harder to do on the tram lines, unless you have parallel tram lines or somewhere to buffer; if you have somewhere to buffer this tram, like a bigger area, then other trams can bypass it, and that's the idea we're going to implement today. You can see the consequences of stalling: people start walking as opposed to taking the tram. Finally the independent instruction is dispatched and it's executing over here, but we don't want it to wait. If we know that an instruction can actually go ahead earlier than an older instruction, we should enable it to go ahead so that we get high performance. That's the idea, and that's the problem we're going to solve today: can we do better than what we built up to yesterday? The answer will be yes.

Before we discuss how, I'm going to talk about some alternative ways. Let's take a look at this example code. Both of these code snippets are actually very similar to each other; if you're observant, you will see that the last four instructions are exactly the same. The add instructions depend on the multiply here, and the add instructions depend on the load instruction there, and all three blue instructions are independent of the add or the load. When you fetch the first add, it stalls the whole pipeline because it's waiting for this R3 to be available, assuming the multiply takes a long time and the load takes a long time. If it takes one cycle it's not going to stall the pipeline, but unfortunately the pipeline may be long; it depends on when you actually generate the value. So the add cannot dispatch because its source register is unavailable (it's dependent on the multiply or the load), but later instructions also cannot get dispatched. These poor blue instructions have nothing to do; they're not waiting for the result of the add or the multiply, but unfortunately they cannot get dispatched because there's nowhere to put them, or nowhere to put the add. We're going to fix that problem.

I should also mention that these two code portions are different in one sense: load latency is quite variable, and it's actually unknown until runtime, and this causes more problems. The multiply latency may be known; it may be that all multiplies take four cycles or eight cycles. But the load latency will depend on where you service the load from: will it come from the L1 cache, the L2 cache, the L3 cache, main memory? Will it get a page fault? We will see all of that later on, starting with lecture 20 or so, so hold on to that. Loads are going to cause us a lot more headache in the future; that's why we're not going to deal with loads today. We're going to deal with an example with multiplies. But start thinking about this: this one is easier for the compiler to potentially try to schedule around and avoid the stall (it's still not that easy), whereas the compiler's job is a lot harder over here, because the compiler has no idea whether this load is going to hit or miss in the cache. Whenever you have a memory access, things become a whole lot worse. That's why a big focus in modern systems is figuring out how to deal with memory in a much better way.

Okay, so let's talk about preventing these dispatch stalls. You have this instruction that's not ready; it's not ready because it depends on a prior instruction that hasn't produced its value yet. How do we prevent these stalls? The problem is really in-order dispatch, or in-order scheduling, or in-order execution, and the solution we're going to pursue is out-of-order dispatch: figure out some way of dispatching these instructions out of order. Of course you need to make sure that everything still works correctly, and that's why we're going to add a lot of hardware. But we've seen the basic idea before: if you remember data flow, data flow says fire an instruction only when its inputs are ready. It's essentially the same thing: start executing an instruction when its inputs are ready. In fact, data flow even starts fetching the instruction when its inputs are ready, so data flow takes this concept to the next level. What we're going to do in this lecture is build an internal data flow engine inside a control flow machine, meaning that from the programmer's perspective it's a von Neumann machine, but from the microarchitecture's perspective it actually constructs the data flow graph of the program. In fact, I'm going to go through an example where I will show you that just by looking at the internal state of the machine, you can reverse engineer the data flow graph of the program that's executing. I think that's very beautiful. We're going to combine these two concepts, but we're not going to make life harder for the programmer by telling them "you're going to program in data flow"; the programmer is still going to program using sequential instruction streams. We will use principles similar to data flow, but not exposed in the ISA; that's the same thing I just said, in a different way.

But we've also seen other ways to prevent dispatch stalls. Can anybody name some? We've seen three ways of preventing dispatch stalls, without calling them that, of course. Yes, data forwarding helps, but that's not the first one I had in mind; think about it, maybe some other things. I've given you this one already: compile-time instruction scheduling, or reordering. If the compiler is able to schedule instructions such that the value an instruction needs is always ready when the instruction gets into the pipeline, that could work, right? But this is a very tough problem; people have tried to work on it for decades and decades, and it's still not enough, though it is possible to do in some cases, especially if the code is regular. Maybe related to data forwarding: you can predict the values. That's a tough problem also. If you go back to this picture, the compiler could have reordered these blue instructions earlier than the add and tolerated the latency of the multiply. That's compile-time reordering; if the compiler can do that, fine, but usually there's not enough parallelism that the compiler can expose. Value prediction says: don't do that, but predict the value of R3 that's needed here. Maybe you don't have the result of the multiply yet; predict that it's zero. Is that a good idea? How do you predict? Not so easy. Value prediction doesn't always work.

But we've seen one idea that really works very well, your last chance, really works very well under specific circumstances: fine-grained multithreading. If you do fine-grained multithreading, there's no problem, there's no dispatch stall, as long as you have enough threads: all of the instructions in the pipeline are independent because they're from different threads, so the pipeline never stalls. That's beautiful, but it requires you to have enough threads, and if you really care about the performance of a single thread, this is useless for you. If your goal is to improve the performance of a single thread, you cannot resort to multithreading, fine-grained multithreading at least, because you're fetching from a given thread only once every N cycles, where N is the number of pipeline stages you have. So we've actually seen all of these ways: the first two can help the performance of a single thread, but usually they're not very widely applicable; the third one is very useful when you have lots of threads, but it's actually hurtful when you have a single thread. That's why out-of-order execution has been so successful: none of these methods really solves the problem.

The idea of out-of-order execution is very simple, and I've hinted at it already: move the non-ready instructions out of the way of the independent ones. Add some hardware buffers, and if an instruction is not ready, no problem, move it somewhere else. In fact, we're going to do it more methodically: every instruction is going to go into a buffer first, and then we're going to schedule the instructions out of those buffers. These buffers will be called reservation stations, and we're going to examine which instructions are ready and which are not, and then schedule, or dispatch, the instructions from these buffers. So you can think of it as adding, at the beginning of the pipeline, a big buffer called the reservation stations, and all of the instructions get buffered there until they're ready to execute. You can also think of these as rest areas for non-ready instructions, a reservation station, if you will. If you're driving on the highway and you need to stop, you don't stop in the middle of the highway; if you do that, you'll have a lot of trouble with the people behind you. But that's what we were doing in the pipeline earlier: somebody needs to stop, this multiply instruction is taking a long time, it needs to go to the restroom or whatever, and it stops right in the middle of the highway and everybody else gets backed up. Well, go to the rest area and wait over there. That's the idea. Again, we're going to be more methodical: every instruction is going to go there, because otherwise some things become a bit harder to deal with, but we're not going to go into that.

In this resting area, this waiting area, instructions are going to monitor their source values: is my value ready, is the register that I'm sourcing ready? If both of them are ready (if it's a binary operation with two register sources), then the instruction becomes ready. When all source values of an instruction are available, we fire, in data flow terms, or dispatch the instruction to the execution unit. Sounds beautiful now, right? So instructions are dispatched in data flow order, not control flow order; that's the key realization. The big benefit here is latency tolerance, meaning that if an instruction is taking a long time, it's not going to stall the pipeline: independent instructions can execute and complete in the presence of a long-latency operation, which could be a multiply as I showed you earlier, or a load as I showed you earlier, or anything, actually, as long as it takes more than one cycle. You can tolerate that latency if this mechanism is able to find independent instructions in the program, and people have shown that there is actually a lot of independence, a lot of instruction-level parallelism, if you look at a large enough window of the program. This instruction-level parallelism may not be easy for the compiler to discover; it's easier for the hardware to discover, because the compiler may not be able to safely determine which instructions can be executed before which other ones. If it moves code from one place to another, it's potentially violating some semantics of the program. Here we're not violating the semantics of the program; we're just deciding which instruction should execute early or late. We're not moving code anywhere. The code is exactly what the compiler produced, and the compiler can say, "I'm not going to change the code that the programmer dictated." Is everything clear? Okay.

So let's take a look at this pictorially. This is the in-order dispatch machine with precise exceptions that we've built, and this is the code I'm showing you: red means dependent (this one is dependent on the multiply, and this one is dependent on the multiply), but the blue ones are independent, as you can see. In this pipeline you execute the multiply; the multiply takes four cycles in this particular case, and when you decode the second add, you stall the pipeline, and every instruction after that stalls too. With out-of-order dispatch and precise exceptions, instead of stalling the pipeline at the first add, you put it into an area where it waits, and then you keep filling the pipeline. You figure out that this instruction can execute; it executes and writes its result into the reorder buffer, and then the next instruction can also execute, while the dependent instruction keeps waiting. You can see that, just by looking at this piece of code, you can reduce 16 cycles of execution time to only 12 cycles, and that's a lot; that's a 25% improvement. On large pieces of code, people have shown empirically that out-of-order execution gains 30 to 50% performance improvement compared to a very aggressive in-order dispatch machine, sometimes even higher; my PhD thesis, for example, shows that it can be much higher.

Okay, so what do we need to enable out-of-order execution? There are four things. One: you need to make sure that the correct data value semantics are obeyed, meaning the consumer of a value needs to be linked to the producer, so that the consumer gets the correct value, the correct definition of a register; this is done via register renaming, as we will see. Two: you need to buffer the instructions until they're ready to execute, because you need to prevent the dispatch stalls, and you need to provide some place, some hardware, to keep these instructions. Three: while the instructions are being buffered, they need to keep track of the readiness of their source values, as we have discussed, and we're going to talk about that a little bit more. Four: when all the source values of an instruction are ready, we need to dispatch the instruction to the functional units. These four things are necessary, and when the instruction actually produces its value, it broadcasts the tag so that the consumers can get the value after it executes.

The first one, linking the consumer to the producer, is accomplished using the register renaming that we have seen: we associate a tag with each data value, and we're going to see more of that today. Buffering of instructions until they're ready is accomplished using reservation stations: you insert instructions into reservation stations after renaming. It has a historical name, unfortunately, reservation stations; some processors call these issue queues. I kind of like "reservation station" better; it's like a resting area, a resting station, if you will. How do the instructions keep track of the readiness of their source values? An instruction that produces a value broadcasts a tag, the name of the register, when the value is produced. Instructions that are waiting compare their source tags to the broadcast tag; if there's a match, that means the instruction was waiting for that value, and that source value of the instruction becomes ready. When both source values of an instruction are ready, or when all source values (some instructions wait for one value, some wait for two, and in rare cases three), then the instruction wakes up. And if multiple instructions are awake, there needs to be some logic to select one instruction to send to the functional unit, so there needs to be arbitration, because multiple instructions can wake up at the same time. "Wake up" means become ready: all of their sources are ready. So that's the idea. Does this sound good? Now we're going to implement this in hardware, and this is clearly going to cost us something; it's not going to come for free. We already had register renaming, so that's not going to cost us extra, but everything else over here is going to cost us, and the cost is a lot. That's why people have optimized this over the course of multiple decades.
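Since that wakeup-and-select arbitration comes up again in the example later, here is one way it could look, as a rough sketch. The oldest-ready-first policy and the helper names are my assumptions; the lecture only says that some logic must pick one awake instruction per functional unit.

```python
# Sketch of select (arbitration) logic: among reservation station entries whose sources
# are all ready ("awake"), pick at most one per functional unit per cycle.
def select_for_dispatch(reservation_stations, fu_can_accept):
    dispatched = []
    for fu_type, entries in reservation_stations.items():      # e.g. "ADD", "MUL"
        if not fu_can_accept[fu_type]:
            continue                                            # FU cannot start a new op this cycle
        awake = [e for e in entries if e.occupied and all(s.valid for s in e.sources)]
        if awake:
            winner = min(awake, key=lambda e: e.age)            # assumed policy: oldest awake entry
            dispatched.append((fu_type, winner))
    return dispatched
```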
I will also mention that this register renaming algorithm is one way of doing out-of-order execution, the dominant way today. It was invented by Robert Tomasulo, who was a chief architect at IBM, for the IBM 360/91 floating-point units. If you remember, I mentioned that machine as an example out-of-order machine, and you can actually read this beautiful algorithm that we're going to go over in this paper. The major difference today is that the IBM 360/91 did not have precise exceptions, and as I mentioned, it was not successful because of that. This was implemented in the 1960s. In fact, when these engineers wanted to implement it, they said it's too costly to support precise exceptions, so can we do imprecise exceptions? They went all the way up to IBM's big boss, who reluctantly said okay. That was a good decision for the progress of science, but it was not necessarily a good decision for business, in the sense that IBM did not make a lot of money out of these machines, because people could not program them easily, even though IBM was the leading company at that time in doing this kind of hardware execution. So the major difference today is precise exceptions: what we did yesterday is really critical for the success of this sort of machine, because you need to reorder the instructions at the end, and this was provided by later works that showed you can do out-of-order execution and precise exceptions together. I'm going to essentially describe things without precise exceptions first, because it's easier to discuss, but then we'll talk about precise exceptions a little bit more; yesterday's lecture is directly applicable to today's. As a result, out-of-order execution variants are used in most high-performance processors today, starting with the Intel Pentium Pro, with AMD following up with the K5; almost all high-performance processors today use it.

This is how things will look. We're going to decode instructions and put them into the reservation stations (abstractly, a scheduling buffer), so this part of the pipeline is still in order. Then we're going to dispatch instructions when they're ready, out of order, so that part of the pipeline is completely data flow driven. And we're going to reorder instructions at the end, writing the results into the reorder buffer. You can see that when an instruction finishes, it broadcasts its tag and value so that dependent instructions can get the tag and the value they're waiting for. That's how we will communicate the readiness of a value: when an instruction finishes, it broadcasts, "I'm producing this physical register X, and if you're waiting for physical register X, you'd better capture the value that I'm sending." That's the data flow communication internally. So there are two humps over here: one is the reservation stations, which is the scheduling window, and the other is the reorder buffer, also called the instruction window or active window. The reservation stations hold instructions that have not been scheduled yet; the reorder buffer holds all instructions that have not been retired yet, that have not updated the architectural state yet. I like thinking of this as a camel: if you want to remember an out-of-order execution pipeline, this is the scheduling buffer over here and that's the reorder buffer. Of course it doesn't work this way in a camel, I believe, but a camel is kind of a pipeline too. Anyway.

That's the general organization. This figure is from the paper I've assigned; a lot of things will become clearer today, but the purpose of showing this is that existing processors do this. This is the renaming and dispatching, for example; these are the reservation stations, which they call instruction buffers; these are the functional units; and then they have separate register files, which we'll get back to; and there's reorder and commit over here, as you can see. Today we're not going to talk about the memory interface; that would complicate things, and that's the optional lecture. Tomasulo's machine was similar, except it was looking at just the floating-point units, and we're going to look at something very much like Tomasulo's machine: a multiplier and an adder over here. You can see that this actually had impact in the real world, even though it was commercially not very successful: the IBM 360/91 was used in supercomputers at the time, in the 1960s, for space exploration for example. And you can see that at the time computers used to be huge; now this thing in my pocket is more powerful than that entire room. It's amazing, right? Okay.

So recall once more: we're going to rename registers. I've already gone through this slide multiple times, so I won't again, but a register ID is going to be renamed to a reorder buffer entry ID or a reservation station entry ID; for the rest of this lecture I'm going to use the reservation station entry ID, not the reorder buffer entry ID. After renaming, we're going to use the reservation station ID to refer to the register. For this we need a register rename table, called the register alias table, and you've seen this already; I've just reordered the fields a bit. You have register 0 to register 7, and this column says whether the entry is valid: if the valid bit is set, the value in the table is correct and can be trusted; otherwise, the tag specifies where to find the correct value. That tag is the new name, a unique name for the value to be produced, and that's how you do renaming. Recall from the precise exceptions lecture that we essentially had the same thing: a register file, or register alias table, you can think of it that way. We're going to separate things more later on.

So let's now go through an example. In this lecture we're going to look at the register file, or register alias table (RAT), where the tag is a pointer to the reservation station entry that will produce the value. I'm going to introduce reservation stations over here soon, and for simplicity we will ignore the reorder buffer; we'll add it later, and adding it is relatively easy, actually.

Tomasulo's algorithm: this is actually a fairly full definition, so maybe I'll go through it relatively quickly. First, you need to check whether a reservation station is available before renaming, and rename an instruction only if a reservation station entry is available, only if there's a buffer where you can put the instruction; otherwise you stall. While in the reservation station, each instruction watches a data bus, called the common data bus, for the tags of its sources; when a tag is seen, it grabs the value for that source and keeps it in the reservation station. When both operands are available, the instruction is ready to be dispatched (I think I've said this before), and you dispatch the instruction to the functional unit when it's ready. After the instruction finishes in the functional unit, there needs to be some arbitration for the common data bus, as we will see, or you need to have multiple common data buses, as I will show you. The instruction puts the tagged value onto the common data bus; this is called a tag broadcast. The register file is connected to the common data bus, as are the reservation stations, and each register contains a tag indicating the latest writer of that register, as we have seen. If the tag in the register file matches the broadcast tag, we write the broadcast value into the register and set the valid bit. Eventually, a rename tag needs to be reclaimed so that some other instruction can be allocated to that reservation station entry.
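To tie those steps together before the example, here is a compressed per-cycle skeleton of the algorithm as described above. It's only a sketch under this lecture's simplifications (no reorder buffer), and the helper names are placeholders of mine, not anything from the paper or the slides.

```python
# One-cycle skeleton of Tomasulo-style out-of-order execution, no reorder buffer.
def cycle(machine):
    # 1. In-order decode/rename, only if a reservation station entry is free.
    insn = machine.next_instruction_to_decode()
    if insn is not None and machine.rs_entry_free(insn.fu_type):
        machine.allocate_and_rename(insn)        # read RAT for sources, rename destination
    # else: the front end stalls this cycle.

    # 2. Wakeup + select: dispatch ready instructions to functional units, out of order.
    for fu_type, rs_entry in machine.select_ready():
        machine.functional_unit(fu_type).start(rs_entry)

    # 3. Completion: each finishing instruction broadcasts (tag, value) on its data bus;
    #    the RAT and all reservation station entries snoop the broadcast.
    for tag, value in machine.completions_this_cycle():
        machine.broadcast(tag, value)
```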
Okay, so that's a quick overview of the algorithm we're going to look at. Now we're going to do the exercise: we're going to execute this algorithm on this code. As you can see, this code has a multiply, and then there's an add over here that is dependent on the multiply because it's sourcing R3, and then there's a bunch of instructions that are independent of that add, but there are dependences internally between those instructions; for example, this last add is dependent on this second multiply. We're going to assume the add takes four cycles of execution and the multiply takes six cycles of execution, and we're going to assume one adder and one multiplier.

If you actually do the calculation, on a non-pipelined machine this takes about 50 cycles: you have four add instructions and two multiply instructions, each taking, let's say, seven cycles versus eleven cycles, and that comes to 50 cycles. That's an approximation, because you could actually make the non-pipelined machine nicer without the clock-cycle boundaries, but with this assumption it takes 50 cycles. We're going to look at an in-order dispatch pipelined machine with imprecise exceptions, without forwarding and with forwarding, and then we're going to simulate an out-of-order dispatch pipelined machine with imprecise exceptions (no reorder buffer, as we said), using forwarding.

So let's take a look. This is the in-order dispatch pipelined machine without forwarding. You can see that the second add is dependent on the first multiply, so without forwarding it needs to wait for the previous value to be written into the register file before it gets decoded, as you can see. If you go through this yourself, which you could, it takes about 31 cycles. So the in-order dispatch pipelined machine without forwarding is already much nicer than the non-pipelined machine: 31 cycles versus 50 cycles. This is cycle count; the clock cycle time is a totally different issue, and I'm assuming they're all equal for now. Now if you add forwarding, this instruction can start executing after the value is forwarded from the prior pipeline stage, depending on the forwarding rules, and you can reduce it to 25 cycles. I'm not going to go through these in detail, but you can do these calculations yourself, and we actually have a lot of exercises similar to this.

So can we do better? Can we reduce this 25 cycles, which is what we have built so far, even further? The answer is yes. If you use out-of-order dispatch, out-of-order execution, it's going to look like this: this instruction is going to wait over here, but the next instruction can be dispatched and executed while the prior instruction is waiting. With an out-of-order dispatch pipelined machine with forwarding we get down to 20 cycles. This picture is actually slightly misleading, because you don't really need the writeback stage; if you're going to use what I'm going to describe, with imprecise exceptions, you don't really need that, so it's actually 19 cycles, or about 19 to 20 cycles, which is not bad. The key difference is going to be here: there, the independent instructions wait, as you can see, but here the independent instructions can execute. Okay, so let's see how this works.
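For reference while following the simulation, here is my reconstruction of the six-instruction example program, inferred from the cycle-by-cycle walkthrough that follows (the slide itself isn't reproduced in this transcript, so treat the exact operands as inferred rather than authoritative):

```python
# (opcode, destination, source1, source2), reconstructed from the walkthrough below.
# Latencies as stated in the lecture: ADD = 4 cycles, MUL = 6 cycles, pipelined units.
program = [
    ("MUL", "R3",  "R1", "R2"),   # long-latency producer of R3
    ("ADD", "R5",  "R3", "R4"),   # dependent on the MUL above (the "red" add)
    ("ADD", "R7",  "R2", "R6"),   # independent ("blue")
    ("ADD", "R10", "R8", "R9"),   # independent ("blue")
    ("MUL", "R11", "R7", "R10"),  # depends on the two blue ADDs
    ("ADD", "R5",  "R5", "R11"),  # depends on the first ADD and the second MUL
]
```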
In prior times I would use the blackboard, but now we have a nice animation over here, so we're going to use the animation instead. This is going to be our first out-of-order machine simulation. Let's first set up the stage. This is our register file; we're going to call it the register alias table. Initially all values are valid, and the values are the same as the register numbers, just for fun, for now. This is the program we will simulate, which is exactly the same as what I showed you over here, just written in a slightly different font. And these are the reservation stations; these are things I have not introduced so far, so let me introduce them.

We're going to assume one adder and one multiplier, and we're going to assume they're pipelined, so you can start one instruction every cycle in the adder and one in the multiplier. The principle of pipelining applies to functional units also: if the adder is pipelined over, say, four clock cycles and you start one instruction in one cycle, you can start the next instruction the next clock cycle, and they execute in a pipelined fashion. We didn't talk about how to pipeline an adder, but the principles are very similar to pipelining an instruction processing engine.

This is the reservation station for the add unit; we're going to assume two sources for an add. These letters are the reservation station entry names, A, B, C, D, and each reservation station entry has, for each source, a valid bit (is the source ready?), a tag (if the source is not ready, who will be producing it, which tag will be broadcast, which tag should I be looking for?), and the value (if the source is actually ready). You have two sources, so you have valid, tag, and value fields for each source. If you look at this, it's very similar to the register file: the register file stores some values, and here we actually copy those values over and wait for the values that are not available yet, and we're going to do exactly that. The same is true for the multiply unit: you have valid, tag, and value for source one, and valid, tag, and value for source two.

When an instruction starts executing in the adder unit, we take the two values, which had better be ready for execution, and we also take the destination tag that it's going to produce. For example, if this instruction over here starts executing, we're going to broadcast tag A, saying, "this is the value the instruction at reservation station entry A is going to produce," and anybody who's waiting for A is going to capture that value. Does that make sense? So we're going to rename these registers to the reservation station entries that are holding the instructions that will produce the values of those registers. That's the idea, very similar to the reorder buffer, except now it's the reservation station. The same is true over here: you can see that the add and multiply execution units have separate tag and value buses, so we're going to be able to broadcast one tag and one value per cycle from this unit and one tag and one value per cycle from that unit. And you will see that it's going to be a mess: when these tags are broadcast, the broadcast tag has to be compared against the source tags sitting in every reservation station entry, and against the tags in the register alias table.
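A minimal data-structure sketch of one reservation station entry as just described; the field names are mine, but the per-source valid/tag/value layout is exactly what the slide shows:

```python
# One reservation station entry: per-source valid/tag/value, plus the destination tag
# this instruction will broadcast when it completes.
class Source:
    def __init__(self):
        self.valid = False   # is this operand ready?
        self.tag = None      # if not ready: the RS entry that will produce it (e.g. "X", "A")
        self.value = None    # operand value, meaningful only when valid

class RSEntry:
    def __init__(self, name):
        self.name = name              # "A".."D" for the adder, "X", "Y", ... for the multiplier
        self.occupied = False
        self.opcode = None
        self.age = None               # allocation order, for an (assumed) oldest-first select
        self.sources = [Source(), Source()]
        self.dest_tag = name          # the tag broadcast on completion is this entry's name

    def ready(self):
        return self.occupied and all(s.valid for s in self.sources)
```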
You need to compare these source tags to the broadcast tag, and these, and these, everywhere: if there's any instruction waiting for the tag you're broadcasting, and its source is not valid, then it had better capture the value that's also being broadcast. So values are also broadcast this way. There will be long buses that run over the entire machine, and you will see this soon; that's the cost, and it's expensive, actually. And if you're doing two broadcasts per cycle, you don't have only one set of buses like this, you have two sets. In existing machines there are more than eight or ten functional units, so we do eight or ten of these broadcasts every cycle, potentially. Existing machines are much more complicated than what we're going to simulate, but they exist; you can see how existing machines are so powerful.

Is this all good? I have not started the simulation yet; I've just set up the stage. Initially the reservation stations are invalid. I don't show the valid bits for the reservation stations, but there needs to be a separate allocation mechanism for them, tracking which reservation station entry is free or not. I'm not showing that over here, because we're not going to deal with it in this particular case: we have a small program that fits in the reservation stations.

So that's the setup of the stage. Any questions? Now we're going to simulate. This is cycle 0: clearly no instruction has been decoded, so let's fetch the first instruction. In cycle 1 we fetch the multiply instruction; nothing interesting really happens, because we just fetch the instruction. The interesting thing happens when we decode it. So now, in cycle 2, we're going to decode this multiply. What does decode mean? We've already gone through the algorithm; let's go through the steps. The multiply gets decoded and allocated into reservation station X. Why X? Because this is the multiply reservation station and this is a multiply (that's the function of the decoder), and X happens to be free, so let's allocate it over there. We need to check whether a reservation station is available; in this case we assumed it's available, so that's good, and we allocate the multiply there.

The second step is that we access the register alias table to get the sources. We access it with the first source, R1. R1 is valid, that's good, so we can trust the value; the tag doesn't matter, because it's valid. We copy the valid bit, saying the source is ready, and the value, which is one; the tag is don't-care, it could be any garbage value, it doesn't matter, we're not going to check it. We do the same thing for the second source. Earlier I said step three is to put the source register into reservation station X: I just showed this to you for the first source, and we do exactly the same thing for the second source. It's valid, the value is two, so we copy the register alias table entry into source two over here. Sounds good.

Now we also need to rename the destination register, which is R3, to X, just like we did yesterday. R3 is not valid anymore, because the multiply is going to produce the value of R3, and anyone who wants the value of R3 had better capture tag X when it's broadcast; that's what this means. So anyone who's going to read R3 from now on is going to get X as a tag.
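Putting this cycle-2 decode into one routine, building on the RAT and reservation station sketches above (again, the names are my own placeholders, not the lecture's):

```python
# Decode/rename of one instruction, as in cycle 2 of the example:
# allocate an RS entry, copy source state from the RAT, rename the destination.
def decode_and_rename(insn, rat, rs_entry):
    rs_entry.occupied = True
    rs_entry.opcode = insn.opcode
    for i, src_reg in enumerate(insn.sources):
        e = rat[src_reg]
        rs_entry.sources[i].valid = e.valid
        rs_entry.sources[i].tag = e.tag       # meaningful only if the source is not valid
        rs_entry.sources[i].value = e.value   # meaningful only if the source is valid
    # Destination renaming: this instruction becomes the latest writer of insn.dest.
    rat[insn.dest].valid = False
    rat[insn.dest].tag = rs_entry.dest_tag    # e.g. R3 is renamed to "X" for the first MUL
```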
So R3 is now renamed to X: its new value will be produced by the reservation station entry identified by tag X, and anybody who's going to consume that value is going to refer to it as X. We'll see that very soon. The other realization here is that the multiply in reservation station X is ready to execute in the next cycle, because both of its sources are ready. There needs to be some other logic that detects this; this is called the scheduling logic, or readiness detection, or instruction wakeup logic. Again, I'm not showing it, but there needs to be some logic to detect which instructions have both sources ready and which one should be selected among them. Okay, so that was cycle 2, which was a little more painful than cycle 1, as you can see. Cycle 8 will be the most painful for us, as we'll see soon.

Cycle 3: the logic I mentioned figured out that this multiply can execute, so let's start executing it. The multiply in reservation station X starts executing: you check the readiness, both sources are ready, you wake up the instruction (there needs to be some logic that does that), and then you supply the values and the tag it's going to produce, which is X. This multiply is going to take six cycles; it's dispatched to the multiply unit, and after six cycles it's going to produce the value and broadcast the tag everywhere, as we will see, and that's going to be cycle 8, I think. Concurrently, in the decode stage, we're going to decode the second instruction, the add. Let's take a look at how that happens. The add gets decoded and allocated into reservation station A, because it happens to be free; decoding is very similar. The add sources R3 as its first source: the valid bit is zero, it's not ready yet, the tag is X, and the value is don't-care. This means: "I'm waiting for the instruction that's producing R3, which is in reservation station X. The source is not ready, but reservation station X is going to produce it, and when it produces it, it's going to broadcast the tag and the value, and I'd better capture that." The second source (that went very fast) is R4, which is very similar to the earlier cases: it's valid and the value is four. But this add cannot execute, because one of its sources is not ready, so it's going to wait. We do the same steps one to four from the prior slide for the add, and we also rename R5 to A. So now R5 is going to be produced by the instruction that's in reservation station A, which happens to be this add, and anyone who's going to source R5 later on (one of these instructions, I think this one) is going to refer to tag A. So the add in reservation station A cannot execute in the next cycle, because one source is not valid. That's fine.

Now let's take a look at the interesting case in cycle 4. The multiply here is happily executing; the add in reservation station A has to wait because one source is not valid. Now we can decode the next add. At decode, we take the first source register, register 2: it's valid and the value is two. We take the second source register, register 6: it's valid and the value is six. Now this one can execute; it doesn't matter that the previous instruction couldn't execute, because it got out of the way. That's the good part. While we're decoding this add instruction, we also rename the destination register, R7, and we rename it to B, because we allocated reservation station B for it. So register 7 is now invalid and it's renamed to B. The add in reservation station B is ready to execute in the next cycle, the wakeup logic determines this, and we will see that it starts executing (it is dispatched) out of order, earlier than a prior instruction. I'm ignoring the fetch stages, because in the fetch stage nothing happens that's useful for the purpose of this picture.

Now let's go to cycle 5. In cycle 5 the multiply is still happily executing over here, and the first add is still waiting. The good thing is the second add: we're able to execute it, and that's the big benefit of out-of-order execution; this was not possible in our earlier pipeline designs. Here the wakeup logic determines that this add in the reservation station is ready, because both of its sources are ready: one source is two, the other source is six. After four cycles it's going to broadcast its tag B and the value that it computes, and whoever is waiting for B (this one, for example) is going to capture that value. We're going to see that. Then we decode the next instruction. I'm going to go through this relatively quickly, because the process is very similar: the next instruction sources R8 and R9, and we take the register alias table entries for R8 and R9; they both happen to be valid, and the values are 8 and 9, and its destination, R10, gets renamed to C. So this instruction in reservation station C is ready to execute in the next cycle as well; it's also independent of the prior instructions, so that's good.

In cycle 6, this multiply is still happily executing, because multiplication takes long; this add is waiting; the second add is executing, because it started execution out of order; and now this third add can also start executing. That's the good part again, enabled by out-of-order execution. This third add starts executing, and we don't show why over here, but you can guess: the reason it can start executing is twofold. One is the out-of-order execution machinery that we added, and the second is that the adder is pipelined: the previous add moves to the second stage of the pipelined adder, and this add moves into the first stage. So there's pipelining happening there too. The alternative is providing a second adder, of course, like we did in the earlier parts of the pipelining lectures, with the dryer example, if you remember. Then we decode this multiply, and this multiply is interesting because neither of its sources is ready. If you index the register file, R7 is not valid but it's going to be produced by tag B, and R10 is not valid but it's going to be produced by C. So neither of its sources is valid, and you cannot start executing it. And we rename R11 to Y, because that's the reservation station entry that's going to produce that value, so it becomes (0, Y). You can see that the register file has a lot of invalid values right now, because we've put a lot of instructions that are going to write to different registers into the reservation stations, and they have not produced those values yet.

Cycle 6 is done. Now cycle 7, let's take a look at it quickly: the multiply is executing, the first add is waiting, the next add is executing, the next add is also executing, the next multiply cannot start execution because it's not ready yet, as you can see (zero, zero), and then the next add is decoded. If you look at this next add, it sources R5... wait a second, no, sorry, something interesting happens here; I should have animated this more nicely. When it sources R5, R5 is invalid and it's going to be produced by A, so we put that over here; and R11 is invalid and it's going to be produced by Y, so this is going to be (0, Y). And we need to rename R5 also: it's reading the old value of R5, and it's going to produce a new value of R5. So it's going to read the value produced by A for R5, and it's going to produce a new value for R5, coming from reservation station entry D. So something interesting happened to R5. Nobody will get the wrong value here, because we're always updating, or renaming, the registers in program order.

So at this point all six instructions are decoded and renamed, as you can see. Note what happened to R5: it was renamed twice. First it was renamed to A, which is here, this instruction producing one definition of R5, and then it was renamed at the end, by this instruction, to D. All of the instructions in between these two renamings of R5 are going to get the first definition of R5, because they're referring to A; they were renamed using that definition of R5. All of the instructions that would source R5 after this add would get the value from D. But we don't have such instructions, because we're done with the instructions right now.

Now let's take a look at cycle 8. Cycle 8 is the fun part, because some instructions are completing. In cycle 8, the multiply in reservation station X is done, or rather, we're going to assume it broadcasts its tag in this particular cycle; that's the assumption over here, and it could be a different assumption in some other machine. This multiply is done; the result is two, because it's multiplying 1 times 2, and it's going to broadcast that value as well as the tag. Let's see what happens with the broadcast of tag X. X is broadcast to all of the fields that have a tag in them. As we animate, you can see that some of them are capturing X: whoever is waiting for X is highlighted in yellow over here. Register 3 is waiting for X (it's invalid and its tag is X), so it's going to become valid after this. This instruction is waiting for X, because it was renamed that way, and it's going to capture the value. The value is also broadcast concurrently; this happens concurrently, actually, but I'm showing it serially because we have bandwidth problems, let's say, or animation problems. So we broadcast the multiply's result, and anybody who was waiting for X captures the value too. That's how we communicated the value: by broadcasting the tag and the value concurrently, and any location in the machine that was invalid and waiting for that tag captures the value. So register 3 became two and valid, and source one in reservation station entry A became two and valid, and now this add in reservation station A is ready to execute in the next cycle. There's some logic that determines that: the wakeup logic.
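What that cycle-8 broadcast does can be written down compactly. Here is a sketch of the common-data-bus snoop, using the same illustrative structures as the earlier sketches:

```python
# Tag/value broadcast: the register alias table and every reservation station source
# field snoop the bus; whatever is invalid and waiting for this tag captures the value.
def broadcast(tag, value, rat, all_rs_entries):
    # Register alias table: only the latest writer's tag can match.
    for reg_entry in rat.values():
        if not reg_entry.valid and reg_entry.tag == tag:
            reg_entry.valid = True
            reg_entry.value = value          # e.g. R3 becomes 2 when tag "X" is broadcast
    # Reservation stations: each non-ready source has a comparator against the bus.
    for rs in all_rs_entries:
        for src in rs.sources:
            if not src.valid and src.tag == tag:
                src.valid = True
                src.value = value            # the waiting ADD in entry A wakes up this way
```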
So that's beautiful: this is how we communicate the value, and this is also how we communicate the dependence. We're communicating two things at the same time: the dependence (dependent instruction, you can execute) and the value (here's your value). But that's not all. It also happens that the add in reservation station B is done in this cycle, because of the latencies of the instructions, so it also needs to broadcast, and it needs to use a separate tag bus and value bus to do that broadcast. This is the add we're talking about: it's adding 2 and 6, as you can see; after four cycles it's done, the value is 8, and it's going to broadcast its tag B. In the same way, B gets broadcast to all of the locations that have a tag there, and if a location is invalid it becomes valid soon after the value gets in. It turns out an instruction is also waiting for B over here, but basically you have to broadcast it everywhere that may potentially be waiting for the tag. Now you can see the logic complexity and the wiring complexity: these wires need to run essentially everywhere, and the comparison that I'm showing is really a comparator, an equality checker, which we've seen in the combinational logic lecture. We also broadcast the value, of course, and everybody who was waiting for tag B captures the value 8. That looks good: our register R7 becomes 8, and one of this instruction's sources becomes valid, but unfortunately the second source is still invalid, so it cannot execute; it needs to wait for the second source. Nothing else that's interesting happens in cycle 8. In cycle 9, this write-back isn't strictly necessary, because we already wrote back the result of the multiply, which updated register R3. So this write-back is not strictly needed, but in a reorder-buffer-based machine you need to put the result into the reorder buffer somehow; that could also happen in the prior cycle, but I'll keep it here for completeness, in case you have questions about this write-back. The second instruction starts executing now because both of its sources became valid, so it gets sent into a functional unit, and this next instruction also does write-back. This instruction finishes and it broadcasts; I don't animate these further because it takes a long time to animate everything, but this instruction, the one in reservation station C, broadcasts tag C, so every location with tag C captures the value 8 + 9, which is 17. So this will become 17 and this will also become 17 in this cycle, and that's good. Then this one is still waiting, still waiting, and the multiply in RS Y is now ready to execute in the next cycle, because both of its sources are ready over here, and the wake-up logic determines that. Now you can keep the animation going further and further; the instructions are still executing, and when an instruction finishes it broadcasts its tag and value and updates things. This was this add over here, the add that's doing R3 plus R4 into R5, which is this one: it broadcasts the tag A.
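Since the tag and value broadcast is the core of the whole mechanism, here is a small sketch of it in the same style as the earlier rename sketch, under the same assumptions (made-up names, Python dictionaries and lists standing in for hardware). In a real machine every comparison below is a hardware comparator and they all operate in parallel in one cycle, not in a software loop.

```python
# Sketch of tag broadcast and value capture. Each waiting source field compares
# its tag against the broadcast tag (one comparator per field in hardware);
# on a match it captures the value and becomes valid. Names are hypothetical.

def broadcast(tag, value, rat, reservation_stations):
    """Producer finished: broadcast (tag, value) to the RAT and every RS source field."""
    # Register alias table entries waiting for this tag become valid.
    for entry in rat.values():
        if not entry.valid and entry.tag == tag:
            entry.valid, entry.value, entry.tag = True, value, None
    # Reservation station source fields waiting for this tag capture the value.
    for rs in reservation_stations:
        for i in range(len(rs.src_tag)):
            if not rs.src_valid[i] and rs.src_tag[i] == tag:
                rs.src_valid[i] = True
                rs.src_value[i] = value
                rs.src_tag[i] = None

def is_ready(rs):
    """Wake-up check: an instruction is ready once all of its sources are valid."""
    return all(rs.src_valid)
```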
Everybody who's waiting for A gets the value, and as you can see, that instruction over here is still not ready to execute. I'll go through the rest relatively quickly at this point, but basically, when an instruction finishes it broadcasts its tag and value, anybody who's waiting for that tag gets updated, and readiness is checked in the next cycle. For example, here in cycle 15 this multiply finally broadcasts its tag Y and the value 136, everybody who's waiting for Y and 136 captures it, and now this instruction, which happens to be this one, becomes ready to execute. If you actually follow its execution, eventually it broadcasts its tag in cycle 19; the tag is D and the value is 142, and at the end of the execution of all of these instructions, all of these entries are valid, as you can see, because we're done with this toy program. All of the reservation stations should be deallocated, but we didn't show the deallocation. Does this make sense? That's the beautiful part, and if you don't ignore the writes over here, it's 20 cycles, as promised. So now you know how out-of-order execution works; it's not that complicated once you understand the concepts of renaming and of value and dependence communication. Let me handle some questions here and then we're going to take a break; actually, we should probably take a break, but let me cover these questions very quickly. What is needed in hardware to perform tag broadcast and value capture? Basically you need to make a value valid and wake up an instruction, and for this you need wires, comparators, and logic: a lot of hardware, and expensive hardware. Does the tag have to be the ID of the reservation station entry? I said that this is not necessary: as long as you can link the producer to the consumer with a correct and common tag, that's fine. It could be the reorder buffer entry ID, for example; you just need to connect the producers to the consumers. In this case we happen to use the reservation station ID, but it could be any unique name that enables linking the producer to the consumer. It doesn't even need to be specific to any hardware buffer; it could be some name space that you happen to allocate and deallocate. What can potentially become the critical path? This is actually what people have optimized over years and years: you broadcast the tag, somebody needs to capture the value, and they need to wake up. That loop is actually very long, because the wires go through essentially the entire machine and there are comparators all over the place, and then you also need to determine the wake-up and do the selection. This is called the tag broadcast, wake-up, and select logic, and it's one of the most critical parts of the machine internally (there's a simplified sketch of the select step below). How can you reduce the potential critical paths? More pipelining and prediction, and existing machines do all of this. Existing machines are very sophisticated; they sometimes even predict which instruction is going to broadcast its tag, and the reason they do it is, for example, loads. We didn't discuss loads over here, but how do you decide when a load is going to broadcast its tag? Is it going to hit in the first-level cache, or is it going to miss in the first-level cache? That actually complicates the scheduling logic significantly. Okay, so basically this is our example; you can clearly draw the data flow graph of the code above.
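Here is a rough sketch of the select step mentioned in the critical-path discussion above: each cycle, among the woken-up reservation station entries (all sources valid), pick at most one instruction per free functional unit and dispatch it. The oldest-ready-first policy and all names here are assumptions for illustration only; real schedulers are far more sophisticated, and, as discussed later, entries are normally deallocated later than this sketch does.

```python
# Hypothetical select step: one ready instruction per free functional unit
# per cycle. Builds on the RSEntry sketch above; not the lecture's own code.

def select_and_dispatch(reservation_stations, functional_units):
    """Pick at most one ready instruction per free functional unit this cycle."""
    dispatched = []
    for fu in functional_units:              # e.g. {"name": "ADD", "busy": False}
        if fu["busy"]:
            continue
        # Candidates: ready entries whose operation this unit can execute,
        # scanned in allocation order so the oldest ready instruction wins.
        for rs in reservation_stations:
            if rs.op == fu["name"] and all(rs.src_valid):
                fu["busy"] = True
                dispatched.append((rs, fu))
                # Simplification: real machines deallocate the entry later,
                # after the tag/value broadcast (or even later if speculative).
                reservation_stations.remove(rs)
                break
    return dispatched
```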
But when we come back from the break, I'm going to give you just this picture. We actually constructed this picture: this is cycle 7; at the end of cycle 7 we had exactly this picture when we did the simulation earlier. Just by looking at this, you can construct the data flow graph, and you can start thinking about that during your 10 minutes of break; you can also derive the executing code in sequential order. We're going to start doing that after the break, and in the end we'll have something like this, basically. Okay, so let's take a break until 15:20... well, no, 15:30; 15:20 would be less than one minute; 15:30, and then we'll be back. There was a question related to reservation station deallocation, so let me cover that first. In the machine simulation that we went through, I did not show you how the reservation stations are deallocated, but they have to be deallocated at some point, and it makes sense to deallocate them when an instruction has broadcast its tag and value; it doesn't make sense to keep using them afterwards, so you can deallocate them at that point, unless the broadcast was done speculatively. In modern machines it actually is done speculatively, so they really need to keep the reservation stations for longer than that, but we didn't deal with that. You need to deallocate the reservation stations so that incoming instructions can be allocated into them, because if an incoming instruction that's being decoded cannot be allocated a reservation station, you need to stall the pipeline. So the size of the reservation stations, and how you manage them, is important for performance. In the end, you can think of the size of the reservation stations as how many instructions you can buffer before you stall: if the size of the reservation stations is 2,000, for example, you can buffer 2,000 instructions before you stall. So the larger the reservation stations are, the larger your latency tolerance is. It's actually important, and people are trying to increase the size of the reservation stations; in modern machines they're much larger. But the reservation stations that are most critical are the load and store reservation stations, which we don't deal with here; you can watch the optional premiere lecture about that, which will give you an idea of how complex the load and store reservation stations are. You need to deal with addresses for loads and stores, whereas here we deal with registers, and dealing with registers is a lot easier than dealing with large memory addresses, for multiple reasons; if you watch the lecture you will see why. Registers are statically determined, but the address of a load instruction is determined dynamically: remember the base register plus offset calculation; you determine the address dynamically, so you don't know which instructions are dependent on that instruction. Here we know exactly what the registers are, which is beautiful, but with memory it becomes very, very messy. I'll leave it at that for now. So we handled these questions. As I promised, if I give you this code, you should be able to draw the data flow graph; that's the forward problem, and I'm not going to do that. I'm going to do it the reverse way: this was the state of the register alias table and the reservation stations in cycle 7 of our simulated program, and somebody gave this to you and asked you what code is in the machine right now.
You could actually draw the data flow graph of it, and that's what we're going to try to do right now, very quickly. I have the answer on the next slide, but I'll do it over here quickly so that you can see. These are actually good problems for an exam, too; I know you're not studying just for the exam, but these reverse engineering problems are a good way of testing understanding of the concepts, because you really need to understand how the machine works to figure this out. Okay, let me switch over. This is basically cycle 7, just to remind you, and now we're going to draw the data flow graph of this thing. I have lots of colors, I love colors; I'll start with pink, or purple-pink. So how do we do this? It's actually pretty easy. What is a data flow graph? It's a graph with a bunch of nodes. First of all, by looking at this picture you know that there are one, two, three, four, five, six instructions in the machine; we're assuming somebody tells you that they have all just been decoded and renamed, but none has started executing, something like that. So all of the instructions are renamed, and we have six data flow nodes; let me draw those. Maybe I should color code them; actually, I don't want to do the color coding right now, because they're going to be connected in different ways. And you can start anywhere, because in the end this is data flow. Let me start with this add, this poor add. This is the pink instruction, which refers to this entry. So what do I know about this instruction? Source 1 is tag X; I don't know the register number yet, but I have a pretty good idea what it could be, and it probably is that. And source 2 is valid and its value is 4; I don't know where it's coming from, but I have a good idea that it may be coming from here, register R4. So maybe this is register R4 and maybe this one is register R3; I'm probably right. Now, somebody is going to produce X, so this add cannot be the first instruction that's executed, assuming the machine is correct. So let me go to X; maybe I started in the wrong place. I don't like this color because it's too similar; okay, let me go to X. I should also say: this one is an add, and it's going to produce A. And what's A? It's going to produce a tag that's not in the register file, so somebody overwrote the register that this instruction is writing. So I know a lot of information, as you can see, just by looking at this reservation station entry, and that also tells me to look at this other reservation station entry, because it's going to produce the input of the instruction that's in this reservation station. Let's take a look at that: it's going to produce X, and I know that this X is going to get connected over here, that's good. One of its inputs is 1 and the other input is 2, and they're both ready, that's good, and I have a suspicion, by looking at the register file, that they may actually be R1 and R2, maybe this is R1 and maybe this is R2, because there's no other 1 or 2 value
over here; it's constructed that way. Okay, so now I know these data flow nodes; this one's a multiply. Now, which one should we do next? This one looks interesting, I think; let's do another add, because both of its sources are ready. So let's put this add somewhere over here; again, I don't know much and I'm not thinking that hard. One source is 2, it's ready; the other source is 6, it's ready; how do I know that? These valid bits. And again I have a suspicion that this could be R2 and this could be R6, what do you think? And it's going to produce B, and B is probably the renaming of R7, so I'm going to write that. Let's keep the question marks; you can validate the question marks later. So it looks like this one is independent of all of these over here. Okay, let's go to another instruction, another color; thanks for getting these, because there are a lot of nice colors. What should I start with? I'll do this one, because I know that both of its inputs are valid, so this is another one. This is another add, as you can see, and this add (can you see that? okay, barely) has 8 over here as one input, which is ready, and 9 over here as the other input, also ready, and again I have a suspicion that these could be R8 and R9. It's going to produce the tag C, and C is probably R10; that sounds good also. Okay, now let's go to something else, this multiply. Can you still see everything? I don't like this color either; even with all of these colors, as you can see, we're running into color collisions, or maybe the granularity of my eye in determining colors is not very good. Okay, we can push this up. Now we're going to try to figure out what that instruction is. It's a multiply for sure, and one of its sources is B and the other is C; oh, look at that, this B can come over here and this C can come over here. It's kind of a messy data flow graph, but it's a data flow graph. Both of its sources are not ready, and what are they? R7 and R10, probably, and we already kind of guessed R7 and R10, so that's getting more solidified, if you will. Now let's take a look at the last one, in orange. The last instruction is this D, and what is it? It's an add. Where should I situate it? Its sources are A and Y. Who's producing Y? Oh, I forgot something in the previous one, sorry, that's why: this multiply is also producing Y, which, if you look at the tags over here, is R1. Okay, now for the last one I kind of have a guess: it's going to get A and Y; A is being produced by this one, Y is being produced by this one; it's an add instruction, so it's going to get connected like this. That's the beauty of colors: you don't overlap the lines. So its sources are A and Y, and the output is D, which is R5. So now we actually have the full data flow graph of the program that is in the machine. It makes sense, right? And I just drew this data flow graph by looking at the picture; I didn't even think about it much, and I actually started with the second instruction, not the first instruction in the program. Now you can start reordering these instructions to find the instruction sequence, and you can verify which registers are which, and eventually you'll come up with this thing. Basically the first instruction is this multiply: multiply R1, R2, and the result goes to X; X is R3, we kind of forgot that, but you can also verify it from here: R3.
The next instruction may be this one, although we're not sure, because this instruction is completely independent, so it could be anywhere; we don't know the complete instruction ordering over here, but we can guess some orderings. Clearly this one is before this one, and this is before this one, and this is before this one, and this is before this one, and this is before this one; so based on the dependences you know an ordering, but you don't know the complete ordering of the instructions. I don't want to go through this and reconstruct all of the instructions, since I want to cover other stuff, but you can basically write down all of the instructions that we have over here, so let's do that. This multiply is accounted for, and then what is this one? Add R2, R6... oh no, is this R2 correct? You've got to be careful with that one... actually, it is correct, I think; yes, add R2, R6, writing R7; it's correct, but maybe the ordering is not correct, as we discussed. And this add, let's see (I should have color coded these, sorry), this add is over here: add R3, R4, and the result is A. What is A? That's the part we don't know; how do you figure that out? I'll let you figure that out: some register got renamed again and tag A is kind of lost, so we need to recover it somehow. What register does it correspond to? I'll let you think about that, but basically there's something over here. And then you need to go through the remaining instructions: this one is add R8, R9, and the result is R10 over here. We accounted for that one, except for the destination. And then this one is a multiply, another multiply; I don't know if you can see it, I think you can, but there's a multiply. What is it multiplying? R7 and R10, and the result is R1, so it's going to be very similar to what we found. And then the final one is over here; this is an add, one of its inputs is A (I'll keep it as A for now), the other input is Y, which is R1, and the result is R5. So then the question is: what could A be? Can you figure out that it's R5? Does anybody remember that R5 got renamed twice? We actually simulated this before, and you should be able to figure out that it's R5. Why? Because the destinations of all the other instructions can still be found in the register alias table, but tag A cannot, so the register it defines must have been overwritten by a later rename; and this last add, which reads A, is overwriting R5, so it must be reading an older value of the very register it's overwriting. You figure this out by process of elimination. Does that make sense? I'll let you work that out, but this is your data flow graph, and unfortunately not the perfect ordering, because our real ordering was this one; I couldn't really decide the ordering of that instruction, because all I have is the data flow graph. Now, this also proves that this machine is a data flow machine internally: internally, what we have is a data flow machine executing a data flow graph. Hopefully this was clear; you can actually do this on your own, and I would recommend doing that, but now we have to move on. Any questions? Okay, so now we've constructed the data flow graph; this is a much nicer version of that data flow graph, let's say, maybe not as colorful, but what we've done is reverse engineer the data flow graph of the executing code, sometimes not perfectly, because we cannot recover the instruction ordering exactly, as I showed you.
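The reverse-engineering exercise we just did can be phrased very compactly: every non-ready source tag in a reservation station entry is an edge from the entry that produces that tag to the entry that consumes it. Here is a hypothetical sketch of that idea, reusing the RSEntry structure from the earlier sketch; the names and representation are mine, not the lecture's.

```python
# Sketch of the reverse-engineering exercise: given a snapshot of the
# reservation stations (as in the cycle-7 picture), recover the data flow
# graph by drawing an edge from the producer of a tag to every waiting consumer.

def reconstruct_dataflow(reservation_stations):
    """Return edges (producer_tag -> consumer_tag) implied by waiting source tags."""
    edges = []
    for consumer in reservation_stations:
        for valid, tag in zip(consumer.src_valid, consumer.src_tag):
            if not valid and tag is not None:
                edges.append((tag, consumer.dest_tag))
    return edges

# For the lecture's six-instruction snapshot (tags A..D, X, Y), this would
# recover the edges described above: X feeds A, B and C feed Y, and A and Y
# feed D; the instructions with all sources valid (X, B, C) have no incoming edges.
```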
I got the ordering wrong between these two instructions because there's no dependence relationship between them; I cannot figure out that relationship unless somebody gives me some other information. Okay, so there are a bunch of other questions over here. We discussed one of these: when is a reservation station entry deallocated? Another question: should the reservation stations be dedicated to each functional unit, or global across functional units? This is actually interesting because it goes into the question of whether a buffer should be centralized or distributed, and there are trade-offs. If it's centralized, it's a huge buffer and everybody goes through it, so the dynamic utilization of the buffer may be good. If it's distributed, each buffer is smaller, so the latencies may be easier to handle and the energy may be lower, and you can also specialize the buffer to the functional unit: for example, load instructions require addresses and add instructions don't, so you don't need to store an address inside the add reservation stations; you can specialize. But unfortunately, when you have distributed buffers, load balancing becomes a problem: if everything you're executing in the machine is add instructions, you may not have enough add reservation stations, because you've distributed the buffers across the functional units. So there are trade-offs over here. Another question is: should the reservation stations and the reorder buffer store data values, or should there be a centralized physical register file where all data values are stored? This is a loaded question, because we're going to build up to it: values are going to become problematic, as we will see. Timing is also problematic, and I'm not going to deal with timing: exactly when does an instruction broadcast its tag, when do you know this information, when should you broadcast the tag and when should you broadcast the value, should these be done concurrently or serially? There are many, many design choices that we're not going to tackle here, but existing machines tackle them in very interesting and nice ways; there are a lot of tricks you can play to make this more efficient, lower latency, and less costly, but again, we don't have time, and this is the first course you're taking on the topic. Okay, so we did this exercise, that's great; hopefully you can do it on your own, and I would recommend that. Another exercise for you is to do the same exercise with precise exceptions; it's not going to be that different, frankly: you need to add a reorder buffer stage into the pipeline, but it's not going to change the timing that much either. But let me talk about out-of-order execution with precise exceptions: what does it require? The idea is very similar: use a reorder buffer to reorder instructions before committing them to the architectural state, just like we did yesterday. Yesterday we added this machinery to a multi-cycle-execution pipeline with in-order dispatch; here we have out-of-order dispatch with reservation stations. So I'm going to describe something to you that is slightly different, something that is employed in almost all processors today.
Basically, an instruction updates the register alias table when it completes execution; this tells you where subsequent instructions can get the value from, and this structure is also called a front-end register file; we will see this with an example. An instruction updates a separate architectural register file when it retires, that is, when it is the oldest instruction and it finishes without exceptions, as I said earlier; in other words, the architectural register file is always updated in program order. So there are two register files, a front-end register file and an architectural register file: the front-end register file is updated when you broadcast your tag, and the architectural register file is updated in program order from the reorder buffer. On an exception, we flush the pipeline and copy the architectural register file into the front-end register file; that's the beauty of having two register files. The front-end register file is used for renaming; the back-end, or architectural, register file (Intel calls it the retirement register file) is used only for updating the architectural state, in order, when the oldest instruction finishes execution. And whenever there's an exception, you need to synchronize the architectural register file with the front-end register file: the front-end register file is speculative, because lots of instructions are in the machine, but you're flushing the machine, so you basically take the contents of the architectural register file and put them into the front end; it's a simple copy, plus the flush of the machine, of course. So basically, this is our initial out-of-order machine, which did not have a reorder buffer; we're going to add a front-end register file used for renaming, and then a reorder buffer, just like we did yesterday, and an architectural register file. You can see that the front-end register file has valid bits and values; the architectural register file doesn't have any valid bits, because everything over there is always valid, and it's just updated when the machine retires instructions. Does that make sense? So this is really the out-of-order machine with precise exceptions: you need reservation stations for scheduling, you need a front-end register file for renaming, you need a reorder buffer to update the architectural register state in program order, and you need an architectural register file to keep the architectural register values. Again, to make sure we're on the same page: the architectural register file holds the architectural state, and it's updated when the oldest instruction retires; the front-end register file holds the front-end state, the state that is seen by decoded instructions; when an instruction is decoded, it's renamed using the front-end register file, because there are lots of instructions in the machine that are potentially updating the registers. The architectural register state is what's visible to the programmer: the programmer can inspect this one, but they cannot inspect that one.
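Here is a minimal sketch of this two-register-file scheme, assuming the same kind of made-up Python structures as in the earlier sketches: completion updates the front-end register file (only if it is still waiting on that tag, since the register may have been renamed again), retirement updates the architectural register file in program order from the reorder buffer, and an exception flushes the pipeline and copies the architectural state back into the front end.

```python
# Hypothetical sketch of the front-end vs. architectural register file scheme.

frontend_rf      = {}   # arch reg -> {"valid": bool, "value": int, "tag": str}
architectural_rf = {}   # arch reg -> committed value (always valid)

def complete(arch_reg, tag, value):
    """On tag broadcast: update the front end only if it still waits on this tag."""
    e = frontend_rf.get(arch_reg)
    if e and not e["valid"] and e["tag"] == tag:
        e.update(valid=True, value=value, tag=None)

def retire(rob_head):
    """Oldest instruction, no exception: update the architectural state in order."""
    architectural_rf[rob_head["dest"]] = rob_head["value"]

def handle_exception(pipeline):
    """Flush the machine, then copy the architectural register file into the front end."""
    pipeline.flush()
    for reg, value in architectural_rf.items():
        frontend_rf[reg] = {"valid": True, "value": value, "tag": None}
```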
Okay, so that's basically this picture in lower-level detail, if you will, and that's the kind of picture I like showing you. Now, there are many, many interesting questions over here, and you can optimize this a lot, but one question that's actually very bothersome is that there's value replication all over the place. These are all values: you have values in the front-end register file, you have values in the architectural register file, and values get replicated into the reservation stations, as you can see, because whenever you rename an instruction you copy the tag and value into one reservation station or another. Values also get replicated in the reorder buffer, because you need to take the value that you're producing and put it into the reorder buffer so that you can later put it into the architectural register file. So if the sizes of these structures are relatively large, this value replication can be huge. Imagine reorder buffers on the order of a thousand entries, let's say; today we're marching towards that. Imagine reservation stations on the order of a thousand also. That's 2,000 64-bit values, and in some cases two of these values per entry, as you can see, and 64-bit values are actually quite large, so it's a huge amount of value replication. And imagine a worst case where every instruction is sourcing register R3 (maybe not every instruction writing to R3, but everybody sourcing R3): that's the same register R3 replicated all over the place. If no one is writing to R3 but everybody is sourcing R3, it's replicated everywhere, so you're basically wasting storage on the values. People have found a solution to this by consolidating the values in a physical register file: they centralize the value storage, and everything else holds pointers (see the code sketch a bit further below). For example, the front-end register file is not really a file anymore; it doesn't have values. The front-end register map holds the renamed registers, which is how you rename the registers, as pointers to the physical register file: register R1 points to physical register 18, and that's the latest definition of R1 that you can get from an instruction that's in the pipeline; register R10 points to physical register 4, for example. And the value is only here. That's true for the architectural register map also: architecturally, register R1 is stored in physical register 12, which is part of the architectural state of the machine; architecturally register R7 is stored in physical register 11, and so on. The reorder buffer also has pointers. This is how you can get rid of the values: values are only in the physical registers, which are centralized, and everything else stores pointers to the physical register file. This is also nice because it becomes your rename space: your rename space is the physical register file. You can have a physical register file whose size is 1,024, and those are the names you have for renaming the architectural registers. A lot of machines work this way today, because they don't want to replicate these values all over the place. Values are expensive for multiple reasons: they take up space, but they also consume a lot of power. Each value is 64 bits today, sometimes even larger, and whenever you're broadcasting, a value is huge; tags are much smaller. If your rename space is 1,024 registers, the tag size is only 10 bits, while the value is 64 bits. Does it really make sense to broadcast all those values and make the system very energy inefficient? Every broadcast takes energy and power, and that's why this is really important, and people actually get rid of the replicated values. So basically, in the end, this is what we build up to: modern out-of-order execution with precise exceptions. Most modern processors use the following: they have a reorder buffer to support in-order retirement of instructions, and they have a single register file, called the physical register file, to store all registers, both speculative and architectural.
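Here is the pointer-based organization just described, sketched in code with invented names: the only place a value lives is the centralized physical register file, and the front-end map, the architectural map, and the reorder buffer hold only small indices into it. The free-list detail is an assumption about how an allocator might track unused physical registers; the lecture doesn't go into that.

```python
# Hypothetical sketch of a physical register file (PRF) with pointer-based maps.

NUM_PHYS_REGS = 1024          # rename space; a tag is then only 10 bits wide

physical_regs = [0] * NUM_PHYS_REGS        # the only place values are stored
phys_ready    = [True] * NUM_PHYS_REGS     # has the value been produced yet?
free_list     = list(range(NUM_PHYS_REGS)) # physical registers not currently in use

# Both maps point into the PRF instead of holding 64-bit values themselves.
frontend_map      = {}   # arch reg -> latest (possibly speculative) phys reg
architectural_map = {}   # arch reg -> phys reg holding the committed value

def rename_dest(arch_reg):
    """Allocate a new physical register for a destination at decode/rename."""
    p = free_list.pop()           # a real machine stalls if the list is empty
    phys_ready[p] = False         # value not produced yet
    frontend_map[arch_reg] = p    # younger readers of arch_reg will use p
    return p
```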
Speculative means the value is going to be produced by an instruction in the pipeline, but it may not end up updating the architectural register, because there might be an exception, and, as we'll see later, there might also be a branch misprediction. Integer and FP registers are still separate; we've been saying that as well. There are two register maps that store pointers to the physical register file, just like I described: a front-end register map, also called a future register map, that's used for renaming (it's called "future" because it's not really the architectural state; future instructions come in and are renamed using it; it's also called the rename register map), and an architectural register map that's used for maintaining the architectural, or precise, state. This design avoids value replication in the reservation stations, the reorder buffer, and so on, and it's basically the design that we have just seen. So how does this actually impact the reservation stations? I didn't show you the reservation stations over here, but the reservation stations also don't have values anymore. If you look at this, these are the reservation stations for the add unit and the multiply unit; instead of the value you have the physical register ID, which means that you need to access the physical register file before you access the functional unit, because whenever someone broadcasts a tag, they update the physical register file with the value; they don't update any value over here, because there's no value storage over here. We've just moved the value storage to the physical register file, so before an instruction gets dispatched to its functional unit, it needs to read its source values from the physical register file. So this is how the pipeline changes (there's a step-by-step sketch of this below): at decode or rename, we allocate a destination physical register for the destination architectural register; also at decode and rename, we read and update the front-end register map, just like we did earlier; then we put the instruction into the reservation station, where it waits until both of its sources become valid; before execution the instruction accesses the physical register file to get its source values, and after execution it accesses the physical register file to write its result value. It still broadcasts the destination physical register as a tag, but the value doesn't get broadcast; the value just gets supplied to the physical register file. What gets broadcast is the destination physical register tag, which is much smaller, as we discussed: even with, say, 2^20 physical registers the tag would only be 20 bits, whereas a value is 64 bits; you'd need 2^64 registers before the tag got that large, which is clearly impractical. At retirement, we update the architectural register map with the destination physical register, if the instruction is the oldest and it didn't cause any exceptions. And this is basically a modern processor: the Pentium 3 was like what we discussed yesterday, and NetBurst, or the Pentium 4, from Intel is essentially what we just described. This is the front-end register map, the front-end RAT as they call it; it points to some registers in the physical register file, and this is the retirement RAT, or the architectural or back-end register map; it points to some other physical registers. They may be the same if the architectural register did not get renamed by an instruction that's producing a new value of that register.
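Continuing the sketch above, here is roughly how the per-instruction steps described in this pipeline could look: dispatch reads source values from the physical register file, write-back writes the value into the physical register file and broadcasts only the small tag, and retirement updates the architectural register map. Freeing the previously mapped physical register at retirement is a common policy I'm assuming here, not something stated in the lecture; all names remain hypothetical.

```python
# Continuation of the PRF sketch above: the per-instruction lifecycle steps.

def dispatch_reads_prf(src_phys_regs):
    """Before execution, read source values from the physical register file."""
    assert all(phys_ready[p] for p in src_phys_regs)   # selected only when ready
    return [physical_regs[p] for p in src_phys_regs]

def writeback(dest_phys_reg, value, waiting_sources):
    """After execution: write the PRF and broadcast only the small tag."""
    physical_regs[dest_phys_reg] = value
    phys_ready[dest_phys_reg] = True
    for rs_source in waiting_sources:                  # wake up waiting consumers
        if rs_source.get("tag") == dest_phys_reg:
            rs_source["ready"] = True                  # no 64-bit value is copied

def retire_update_arch_map(arch_reg, dest_phys_reg):
    """Oldest instruction, no exception: commit by updating the architectural map."""
    old = architectural_map.get(arch_reg)
    architectural_map[arch_reg] = dest_phys_reg
    if old is not None:
        free_list.append(old)   # assumed policy: free the previously committed mapping
```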
So this is essentially the same picture that I showed you over here, just drawn in a different way, and they have a reorder buffer, of course, to keep the status and do the reordering. Does that make sense? This design with value replication was actually in the Intel Pentium Pro, but that replication got eliminated in the Intel Pentium 4, and most modern processors are like this. Let me quickly go over some things and then we're going to conclude. Basically, these are the major concepts in out-of-order execution: we link the consumer of a value to the producer using register renaming, which associates a tag with each data value; we buffer the instructions until they're ready, which requires inserting instructions into the reservation stations after renaming; inside the reservation stations we keep track of the readiness of the source values of an instruction, which means that instructions broadcast their tag when the value is produced, and instructions waiting in the reservation stations compare their source tags to the broadcast tag, and if there's a match the source value becomes ready; and when all source values of an instruction are ready, you dispatch the instruction to the functional unit, which means you wake up, select, and schedule the instruction. So I think we've summarized these concepts. These later slides are nice; I'm going to cover them next time, and I'm going to ask you these questions later, but let me actually show you some real designs. Real designs look like this. If you go through these diagrams, these are the reservation stations, you can pick them out; these are the memory reservation stations, the memory schedulers, and so on; and there's a register file, as you can see: after you get scheduled, you access the register file. This is the Intel Pentium 4, and then you go to the functional units, and you can see how many functional units there are; even in something that's 22 years old you get a whole bunch of functional units, so this part of the machine is actually quite complex. They don't show everything; this is my simplified version of it from my own paper. This is the Alpha 21264; if you're interested, you can read the paper that describes how it operates, which is explained a bit more cleanly. It has four integer execution units over here, but essentially there are reservation stations over here, and then you access the register file after the reservation stations. MIPS was designed such that you would never do something like this in hardware; remember, MIPS stands for Microprocessor without Interlocked Pipeline Stages: they didn't even want to detect dependences in hardware, the compiler should detect the dependences. But you can see that a modern MIPS from 1996, about 15 years after MIPS was first introduced, is using out-of-order execution, because they quickly figured out there's no way to get the same performance otherwise; if you want to compete, you have to do out-of-order execution like everyone else. That's why this concept is very powerful. This is the IBM POWER4, one of the first multi-core machines, and it uses out-of-order execution again. You can read these slides, they're not that important, but it basically has an instruction window of about 100 instructions; I'm going to mention this soon. IBM POWER5 is similar.
This is AMD Zen 2, which is more recent; it has much bigger windows. And I'll conclude with this slide. Again, these are not official AMD or official Apple numbers; enthusiasts go and benchmark these machines and try to reverse engineer what is going on inside, and according to the reverse engineering of some people over here (you can read their work if you'd like), they guessed that the reorder buffer on this machine is about 630 instructions, the integer physical register file is about 350 entries, and the FP physical register file is about 380. You can also see that the load and store structures are much smaller, 150 or so, because this is actually the most complicated part of the machine. But you can also see how many execution units there are: we had a toy example with one multiplier and one adder, but this machine easily has more than 15 or 20. Okay, that's the end, basically. For learning more about the even more complex part of the machine, please watch the lecture we're going to premiere, and look at the backup slides; you're not going to be responsible for them, but they're really fun to look at, I would say. With that, have a nice weekend.