Transcript for:
[Lecture 12] Understanding Pipeline Processor Design

All good? Okay, let's get started — can people hear me okay? Today we're going to finish, hopefully, pipeline processor design. Not all the aspects of it, because we're going to continue with issues in pipelining next week, but hopefully the basics of pipeline processor design, data dependence handling, and control dependence handling will be clear by the end of today. So, this is I think your last reminder for this review, which is due April 1st, before our next lecture. And this is what we're covering: we've covered multicycle microarchitecture, we started building a pipeline after that, and now we're going to cover some issues in pipelining. These are some of your readings, and I'd recommend doing them — they're relatively short and they explain the concepts nicely, especially the H&H readings, which we covered in lecture last time, if you remember. Just to jog your memory: this was our multicycle processor and this was our finite state machine — we built most of it, at least — and we said that we can easily handle multicycle memory access. This will come up again in pipelining, so don't forget these issues; they are fundamental. We asked the question, can we do better? And we basically said, oh, we can actually have different instructions in different stages — in this case four instructions in four different stages, though we actually built a five-stage pipeline, as I'm going to show you soon. Of course, we said we need to be more careful than this; we need to add pipeline registers, and so on. So essentially pipelining is doing multiple instructions concurrently, at different stages of execution, in a single machine. We contrasted this with multicycle, and you see that we improve throughput significantly compared to multicycle processors. Later, if we have time today, we're going to do some more performance analysis, a little bit more carefully than what we've done
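That multicycle-versus-pipelined throughput comparison can be sketched as a toy Python model (my own simplified illustration, not from the lecture: it assumes unit-latency stages and an ideal, stall-free pipeline):

```python
# Toy throughput model: multicycle vs. an ideal 4-stage pipeline.
# Assumptions (mine, for illustration): every stage takes one cycle,
# the pipeline never stalls, and instructions are independent.

def multicycle_cycles(n_insts, n_stages=4):
    # A multicycle machine runs one instruction at a time,
    # spending n_stages cycles on each.
    return n_insts * n_stages

def pipelined_cycles(n_insts, n_stages=4):
    # The first instruction needs n_stages cycles to drain through;
    # after that, one instruction completes every cycle.
    return n_stages + (n_insts - 1)

for n in (4, 100, 10_000):
    speedup = multicycle_cycles(n) / pipelined_cycles(n)
    print(f"{n:>6} instructions: speedup {speedup:.2f}")
    # the speedup approaches 4x only as n grows; it is never quite 4x
```

The speedup is bounded by the number of stages and only approaches that bound for long instruction streams — which is exactly why the "4x" figure below is the optimistic, ideal-pipeline number.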
over here. And we'll see that pipelining actually improves throughput, but maybe not by 4x. Over here, with four-stage pipelining — fetch, decode, execute, write back; these are the stages — you get a 4x improvement in throughput, but life is not that beautiful, as we will see today. We're going to cover those issues today especially, but before that, let me cover a little bit more about how we actually designed the pipeline. We started with a single-cycle microarchitecture — that's the beauty of the single-cycle microarchitecture; as I said yesterday, we designed it for educational purposes a few lectures ago, and this is something hopefully you remember. There are no internal microarchitectural registers, as you can see; everything happens in a single cycle for every instruction. We designed the data path for it, then we designed the control logic for it, and we went through the exercise of deciding which control signal should be asserted for which instruction. So this was designed; there's no magic. So how do you do a pipelined design? Basically, if you want to design a pipelined processor, you go through the same process we have gone through for both single-cycle and multicycle: you design the data path first, and then you design the control logic. That's what we've done for every processor we have designed so far. So, pipelined data path design. Essentially, you start with the single-cycle data path — we've already designed it, so let's start with that — and then you decide how to break it down into stages. We decided to break it down into five stages, if you remember from yesterday: fetch, decode, execute, memory access, write back. We're going to keep that five-stage pipeline. It doesn't need to be five stages; someone could decide to break it down into four stages, eight stages, 20 stages — some existing processors are actually close to 30 stages. The goal is to break it down into stages so that you improve the clock frequency. Once you've decided how to break it down into stages — and there's actually a lot of science in this area; exactly how you do it is not the subject of this course — you add pipeline registers to separate the stages from one another, so that the work done in one stage in one cycle doesn't affect what happens in another stage in the same cycle. Each stage has a pipeline register at its output, which is the input of the next stage. During the clock cycle you do some processing in that stage, and the contents of the register don't change; only at the end of the clock cycle, at the rising clock edge, do we update the contents of the pipeline registers, such that the instruction finishes the previous stage and moves to the next stage. Does that make sense? Our goal is to keep moving instructions from stage to stage: every cycle each stage does something, at the end of the clock cycle you finish what you're doing and latch it into the pipeline register at the end of the stage, and the next cycle starts with what you latched into that register in the previous cycle. That's the purpose of the pipeline registers; I'm going to show you this again pictorially. Once you add the pipeline registers, you need to propagate the data signals. We already have a data path, so data signals are clearly propagating — we designed the single-cycle data path — but now you need to decide where those data signals need to go; not all data signals are needed by every stage
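The latch-at-the-rising-edge behavior described above can be sketched with a toy model (again my own simplified Python, not from the lecture — it tracks only which instruction occupies each pipeline register and ignores data values, control signals, and stalls):

```python
# Toy model of the four pipeline registers in a 5-stage pipeline
# (IF/ID, ID/EX, EX/MEM, MEM/WB). Only the instruction name is
# tracked; data, control signals, and stalls are omitted.

def tick(pipeline_regs, fetched_inst):
    """One rising clock edge: every pipeline register latches what the
    stage before it produced, so each instruction moves forward exactly
    one stage. Returns the new register contents and the instruction
    (if any) that just left the writeback stage."""
    retired = pipeline_regs[-1]
    return [fetched_inst] + pipeline_regs[:-1], retired

STAGES = ["IF/ID", "ID/EX", "EX/MEM", "MEM/WB"]
regs = [None] * len(STAGES)           # the pipeline starts empty
program = ["lw", "add", "sub", "and", "or"]

retired_order = []
# Feed the program, then enough empty slots to drain the pipeline.
for inst in program + [None] * len(STAGES):
    regs, done = tick(regs, inst)
    if done is not None:
        retired_order.append(done)

print(retired_order)   # instructions retire in program order
```

Each call to `tick` models one rising clock edge: the instruction in the writeback stage leaves the pipeline, and every other instruction advances exactly one stage — nothing moves mid-cycle.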
and whenever you have data signals available, they are not necessarily needed in the next stage; they may be needed in the stage after that. So you need to propagate those data signals to the appropriate stage, which means you need to store them in the pipeline registers. We've seen that also; I'm going to show it to you again. On top of this, you need to make sure the correct action is done in the correct stage. What do I mean by this? You need to ensure that the data signals go to the right place; you cannot send a data signal to the wrong place. And this is actually true for control signals also, because control signals are part of the design of the data path, not necessarily the control logic. Essentially, whenever a wire crosses stage boundaries — for example, when we're writing the ALU result to the register file, there's a control signal that says you need to write to the register file — that control signal should be generated in the right stage, meaning the instruction that's in the writeback stage should be the one able to write to the register file, and not an instruction somewhere else. This will become clearer when we actually go through the design. Once we design the data path correctly, hopefully, you need to design the pipeline control logic. We already said the control signals are essentially the same as in single-cycle — that's the beauty of single-cycle: you have the control signals and the control logic, and they work for pipelining also. Well, not quite, because pipelining executes multiple instructions concurrently, which means that when you generate the control signals, you need to propagate them to the stages where they are needed. For every control signal, you need to decide: this is the stage where I actually need that signal. Let's assume you generate the control signals in the decode stage — this is one of the options we discussed yesterday; I'm not going to go through the options again. You don't need them in the next stage, which is the execute stage, but you need them in the memory stage or the writeback stage, so you propagate those control signals in the pipeline registers. The goal is to make sure that each control signal travels with its instruction, such that you take the right action at the right time, based on the partitioning that you've done for the pipeline. So you need to extend the pipeline registers to do so, and I've given you a picture of that; I'm going to show you again. Is this all clear? Yes? "Why don't we use multicycle and pipelining at the same time? We could increase the frequency and still have the pipeline as a benefit." Okay, but how would you combine both of them? In pipelining, you're really multicycle already: each instruction is multicycle, so pipelining by itself is a multicycle paradigm. Each instruction goes through five stages, for example, in our five-stage pipeline, so it is by itself multicycle. Does that make sense? Good questions, though — you should think, and thinking makes you ask questions like this. There actually is a combination, but I'm not going to go into it right now; that's a different kind of combination. So let's go through what I just said. I went through this yesterday, but I'll go through it a little more today. This is our single-cycle processor — really the data path, mostly, one version of it — but we built it earlier. You remember: we have the instruction memory over here, the register file, the ALU, the data memory, and the write-back path over here. We basically first need to divide this into stages; that's the first thing you
need to do. Let's assume we decided the first stage is fetch; the second stage is register file read, decode, and control signal generation, which I'm going to show later; the third stage is essentially the ALU, execute; the fourth stage is memory access; and the last stage is write back. Those are the stages each instruction goes through: each instruction does fetch here, decode here — that's why these are called IF and ID — then execute here, memory access here, and write back here. So we don't write data back into the register file from anywhere but the writeback stage. Make sense? We write to memory in the memory stage — that's a different write — but we write to the register file in the writeback stage. Every instruction goes through these stages, so by definition it's a multicycle thing, but every instruction goes through the same stages. So once we divided our single-cycle data path into five pieces, we also added these pipeline registers, as you can see. These are registers that are going to store both data signals and control signals. Let's take a look at some of the data signals — these may not be all of them. For example, here, after you fetch, you store the instruction bits; we called this the instruction register in the past, so you can think of part of the pipeline register as storing the instruction register. We also store PC+4, because we're actually going to need it in address calculation, as you can see. Any data signal that you're going to need later for the execution of the instruction, you should propagate down the pipeline registers. So in the first cycle you fetch the instruction; in the next cycle you decode it in the decode stage. This PC+4 gets propagated directly to the next register, because you don't use it here — we're going to use it later, actually, in the next stage, which is the execute stage, for some things like branches. We also read the register file and store the outputs — essentially the data values of the two registers that are read — and we also store the sign-extended immediate; these are going to be needed for the execution of different instructions in the next stage. I could keep doing this: here you calculate the next PC based on PC+4 and the sign-extended immediate, left-shifted, and you store it at the end of the execute stage; the ALU output is also stored in the EX/MEM stage register over here. And look — there's curiously something called B over here. What is that B? This B is really the second register that you have read; we directly propagate it from the input register of the execute stage, the ID/EX register, to the EX/MEM register without changing it, because it's needed for stores: whenever you're storing, you need to take the data from the register file and write it into memory. So you propagate all the data signals that are needed across the pipeline stages. Yes? "Is it not problematic that this mux output goes back into the register file? There might be something else in the background." That's true — we're going to talk about that. If that something else is not reading that register, it's not problematic. So what does this mux, or this wire, do? Let's take a look — you should always be careful with these wires. The goal of this wire is to write data into the register file, and you write the data from this instruction over here: once an instruction reaches the writeback stage, if it's writing to the register file, it writes the data. If it's a load, you write the data that was latched from memory in the previous cycle, while the load was in the previous stage; if it's an ALU operation, then you write the data that is the result of the ALU, as we have seen in the single-cycle data path. So basically you should be writing the data — no question about that. This is not problematic as long as nobody else is reading the data. Does that make sense? As long as there's no other instruction in the previous stages of the pipeline — meaning there's no younger instruction that needs it — then you can write that data without a problem. And we're going to get to this; we're going to cover all of these issues. Okay, so hopefully it's clear. Now we went through almost the entire pipeline, and this is how you construct the data path. You need to ensure that no resource is used by more than one stage. Here, for example, when we're writing the data, this write port is not used by anyone else — only by the instruction in the writeback stage. Now, this becomes problematic if someone else is trying to read that register, or if there's another instruction over here that should have already read that register. Think about it — we're fast-forwarding a little bit, but so far we're kind of assuming an ideal pipeline. Okay, so, control points in the pipeline. This is essentially the pipelined processor that we just designed — the same thing with a slightly different view, with some control points added. This was the data path design; now I've added the control points over here. The control points are essentially the same as what we did with single-cycle: if you remember the single-cycle processor, we had a RegWrite signal and some other signals to make sure the instructions execute properly. We have exactly the same signals; nothing has changed over here. You still need to generate the control signals, and that control logic is essentially also the same — you still have the same control unit, based on the instruction
register. The major difference is that you delay the control signals to the proper pipeline stage. For example, when an instruction is in this stage, it's not writing to a register — the instruction in this stage is actually reading from registers. So the RegWrite control signal should not be coming from the instruction in this stage; it should be coming from the instruction in the writeback stage, because that's the only stage where an instruction can write to a register. Which means that when you generate the control signals — when you decode the instruction — you figure out, oh, this instruction writes to a register, so RegWrite should be set. You place that control signal into this pipeline register, propagate it, and use it when it's really needed. Make sense? You can see that this signal over here is the RegWrite signal, and it's connected only to this pipeline register over here, the writeback-stage pipeline register, which says that the instruction in the writeback stage is supposed to write to the register file — because when the control unit decoded that instruction three cycles ago, it set the signal appropriately, since it's an instruction that writes to a register. So we connect that control signal in that register to the RegWrite input of the register file. Make sense? Similarly, we need to propagate the signals that are needed in the memory stage. The control unit generates some signals that are needed in the memory stage — especially important for memory instructions, but for other instructions also; other instructions should do no harm to memory, let's say. When an instruction reaches the memory stage, it takes the appropriate signals that are needed there and uses them. For example, if it's a store instruction, the memory write signal should be set and the memory read signal should not be set, and as a result we write, to the address coming from this pipeline register — which was calculated in the previous stage, through the ALU — the data that's coming from this register, which was latched directly from the register before it in the previous cycle, having been read from the second read port of the register file. Hopefully this all makes sense: the instructions are moving in the pipeline, and I need to move the control signals appropriately as well. Yes? "What if we have one long part — for example, if reading takes longer than expected?" I'll say patience again, sorry — very good; you're thinking ahead. If the memory takes more than one cycle, this pipeline currently doesn't handle that, but we're going to talk about how to handle it. Let me answer the question. The assumption we make right now is that everything is moving: you keep the pipeline always moving, there are always instructions coming in, and they keep going through without any trouble — the pipeline is flowing, in other words. If something blocks the pipeline, or stalls the pipeline — we're going to call it a stall — you need to stop the pipeline. Meaning: this instruction here is accessing data memory and it takes five cycles, so this instruction cannot move anymore. What do you need to do? You need to make sure nobody behind it moves: this instruction over here doesn't move, this one doesn't move, this one doesn't move. And what happens here? No instruction can move into the next stage, so you insert a no-op — a no-operation; you basically insert a bubble. But we will get to that. That's a very good question, and this is going to be our strategy for handling any kind of blocking or stalling, whether it's because of a long-latency operation or because of a dependence, as we will see soon. Okay, so I've already discussed this control signal; hopefully it's clear. This is actually where we left off yesterday, but I wanted to discuss things a little more so that everything is clearer now. Any questions? Thoughts? Okay, now that we've covered some of the basics, let's go through another pipeline, for a little more of the basics. This is what your book develops — your book is also very nice on pipelining; it's Chapter 7, Section 7.5, if you're interested in reading it. It does essentially the same thing I did: this is the single-cycle processor that your book developed, it breaks it down into five stages — fetch, decode, execute, memory, write back, as you can see — and it adds pipeline registers. And it has this control part; you can see that there's the RegWrite signal over here. Then we ask the question: is this correct? Basically, assuming an instruction is supposed to write back to the register file over here, the data value is available only after here, right? For memory instructions — for an ALU instruction it's really available over here. Writing the data value into the register file here is wrong, because you'd be writing the wrong data value, which is coming from here: the control signal over here is coming from this instruction, whereas the data signal is coming from that instruction. Now, you should never do that. If you're writing some data — assuming it's written at the appropriate stage — you should use the control signal at the appropriate stage. So this is how you fix the problem: you propagate the RegWrite signal over here, and you assert that write signal from the writeback stage. Essentially, the control signal must arrive at the same time as the data signal. And this is all based on the semantics of the instruction, clearly: if you're using a control signal that belongs to some other instruction, it makes no sense, because we've
divided the machine into five stages, with five different instructions in flight: you should use the control signals that you propagated for the right instruction. Okay, I think I've harped on this enough. This is the pipeline control that your book develops — it's essentially the same as the single-cycle processor's control unit, with control delayed to the proper stage — and again, I recommend you take a look at your book. Okay, so we've covered how to design the pipelined data path and control logic. Now the question is: is this sufficient? As several of your colleagues have already hinted, it is not. There are at least two major questions; I'm going to expand on this. First, how do you handle dependent instructions? If an instruction that comes later in the pipeline depends on a previous instruction — for example, you have a load instruction, the load goes into the pipeline here, and an add instruction depends on it — can the add be here while the load is over here? The answer is no, because the load has not generated its result yet, and the add needs that result; the add cannot be computing anything if it didn't get that result yet. That's the idea. So you cannot keep the pipeline moving: you need to stop the pipeline when you detect a dependence between instructions. And, as you have also mentioned, multicycle memory access is a similar issue — you need to stop the pipeline; we're going to talk about that. But there are other issues too, so I'm going to talk a little more about this ideal pipeline. You've seen the slide before: essentially, an ideal pipeline requires identical operations — the same operation repeated; independent operations — there should not be an add depending on a load; and uniformly partitionable suboperations — you can divide everything uniformly into suboperations. Unfortunately, an instruction pipeline violates all of these. You don't have identical operations — you have different instructions, and not all of them need the same stages, yet we're forcing different instructions to go through the same pipeline stages. Now, this is not evil, let's say — we're going to fix it a little when we talk about multicycle execution next week — but it causes external fragmentation, meaning some pipe stages are idle for some instructions. We saw that yesterday: an R-type ALU instruction, an add, goes through the memory stage and does nothing there. We've seen this; that's called external fragmentation. There's also the uniform-suboperations issue; let me cover that next. Different pipeline stages really do not have the same latency — we talked about this yesterday too — but you have to force every stage to be controlled by the same clock. As a result you get some internal fragmentation in the clock, which means some pipe stages are fast but still have to take the same clock cycle time, so you lose some performance there as well. Okay, and now let's get to the independent-operations part, which will consume quite a bit of our time in this lecture and the next ones. Essentially, the huge problem is that instructions are not independent of each other. This is not like the laundry loads: laundry loads are independent of each other — there's no dependence between my laundry and your laundry, or something like that. So basically you need to detect and resolve inter-instruction dependences to ensure the pipeline produces correct results. The pipeline we built is not correct yet; it is correct only under some conditions, where instructions are all independent of each other and memory is one quick cycle — no stalling conditions, basically. So this leads to pipeline stalls, meaning the pipeline is not always moving; there will be conditions under which the pipeline
stops. Okay, this brings us to a lot of exciting issues in pipeline design that have enabled a lot of creativity and creative solutions, and we're going to see a lot of those. Balancing work in the pipeline stages — how many stages, and what is done in each stage — is not something we're going to cover, but it's always an issue and needs to be dealt with. This is the part that I think is really interesting and exciting, and it will let us build up to out-of-order execution. So we want to keep the pipeline correct, moving, and full — and if you think about it, these are different things, but all of them together make the pipeline good — in the presence of events that disrupt the pipeline flow. So what are those events? Data and control dependences — we're going to spend a lot of time on these. Resource contention: if two things need the same resource at the same time, what do you do? Handling long-latency, multicycle operations, like long-latency memory accesses or long-latency instructions — a multiply may take longer than an add, for example; how do you handle that? And then handling exceptional conditions, where a problem happens during instruction execution: you have a divide instruction and you divide by zero — that's an exceptional condition; what do you do? And what happens when an interrupt comes in and interrupts the processor? We're going to talk about that next week. So basically we want to improve pipeline throughput by minimizing stalls, but let's talk about stalls first: what causes pipeline stalls, and how do we handle them? So what's a stall? Essentially, I kind of defined it earlier, but more formally, it's a condition where the pipeline stops moving — or should stop moving, if you want to keep it correct. There are three major causes of stalls. One is resource contention; I'm going to talk about it quickly, but we're not going to dwell on it that much. The second is dependences — we're going to talk a lot about those; these are dependences between instructions: data dependences, where an instruction needs the result of another, and control dependences, where the fact that you need to fetch an instruction depends on having fetched and executed the previous instruction. And then there are the long-latency, multicycle operations. Dependences are also called dependencies, or — less desirably, I think — your book calls them hazards. "Hazard" is a very alarming term; I think it's just a dependence. Essentially, these are there because you have ordering requirements between instructions: if an instruction needs the value produced by another instruction, there's clearly a dependence — a flow dependence — between those two instructions. So there are two types of dependences: data dependences and control dependences. Resource contention is sometimes called resource dependence, and I think that's correct in a sense — you're dependent on the same resource — but it's a very different form of dependence: it's not inherent in the semantics of the program; it's basically a hardware dependence. It's really not fundamental to the program semantics, so we will treat it separately — in fact, it's what I will treat first. So let's talk about handling resource contention. This happens a lot when you design a pipeline: it happens when instructions in two pipeline stages need the same resource, so whenever you're designing the data path — and of course the control logic — you need to be careful about it. How do you actually handle this? First, eliminate the cause of the contention: basically, replicate the resource. We kind of did this with the dryer earlier, so that we could keep the pipeline moving — we actually had two dryers. Duplicate the resource or increase its throughput: use separate instruction and data memories — for example, caches, as we will see — or use multiple ports for memory structures. This is obvious, hopefully. The second solution could be: okay, I don't have enough money, let's say, to replicate my resources, so I'm going to use a single resource, but detect the resource contention and stall one of the contending stages. That's also possible — we're going to add the stalling logic later on, of course, but it's possible to handle it this way. The question is: which stage do you stall? For example, what if you had a single port for both reads and writes of the register file, and an instruction in the writeback stage needs to write to the register file while an instruction in the register-read stage needs to read from it? Clearly you should stall the instruction that needs to read, because you need to keep the pipeline moving: the older instruction is the one that needs to write to the register, and once it writes, it will get out of the pipeline. So prioritize that instruction, make sure it writes to the register file first using the single port that you have, and the other instruction will read from that same port later, in another cycle, for example. So you need to add control logic to detect these things. Okay, now there's one trick that your book — and a lot of books, actually — assumes, which may or may not be realistic depending on the design. I'm going to tell you that trick: with careful design, the register file can be read and written in the same cycle. I will not go through exactly how this is done, but essentially a register file write takes place during the first half of the cycle — you can design the register file to be that way — and a register file read takes place during the second half of the cycle. This way you kind of handle the resource contention that I just talked about, at
least this particular one. So a read or write from or to the register file has only half a clock cycle to complete. Now, if you do this, there is one benefit. Let's take a look at these instructions — they're not really ideal, but for example this instruction writes to register 2 in the first half of the fifth cycle. Now if another instruction needs register 2, it can read it in the second half of the fifth cycle. This is our pipeline: the load instruction spends its first cycle in the fetch stage, its second cycle in the decode stage, its third cycle in the address-generation (ALU) stage, its fourth cycle in the memory stage, and in the fifth cycle it writes to the register file — it does this writing in the first half of that fifth clock cycle. You can see that the instructions are pipelined: if an instruction needs that register, it can read it in the second half of the fifth clock cycle. So this instruction can read from S2 in the second half without any problem, and later instructions can of course read from the register file. Yes — "is it cached, or simply written?" It's simply written. We're assuming the register file is structured so that the value is written and, in the same cycle, the read supplies it to the other instruction that needs it. Basically, you have a register file with, let's say, eight registers; you're writing to register 2. The write to register 2 completes within the first half of the clock cycle, and anything that reads that register in the same clock cycle will get the correct value in the second half. So the assumption is that you need only half of the clock cycle to write to the register. "Couldn't we simply pass it directly?" Okay — patience; I like saying that to you, sorry — but you're thinking ahead. Yes, you could also pass it directly. We're going to see that direct passing — we're going to call it forwarding. Very good. But
this is a trick — this is actually a forwarding trick. I would call it internal register-file-based forwarding. It enables you to do the write in the first half of the cycle and the read in the second half. Okay, let's talk about data dependences now. But before I do, maybe in this figure I'll illustrate something. This instruction is writing to register 2 in the first half of the fifth cycle. If this other instruction were dependent on that, the pipeline could not be moving like this, right? That instruction needs to wait until it can read the value from the register file — or from somewhere else, as we may get to later. Okay, so let's talk about these data dependences. There are three types of data dependences; one of them is real, the other two are fake. The real one is flow dependence, also called true data dependence or read-after-write dependence. This is where an instruction writes to a register and a later instruction reads from that register, so you truly have a data dependence between the producer of the register value and its consumer. Then we have anti dependence and output dependence. These are fake, but they still affect the pipeline, depending on how you do the dependence checking and so on — we still need to be careful about them, and we're going to actually eliminate them when we talk about out-of-order execution. So which ones cause stalls in a pipelined machine? Essentially, for all of them, we need to ensure the semantics of the program stay correct. I'm going to show you the anti and output dependences with a pictorial example soon. Flow dependences always need to be obeyed because they constitute a true dependence on a value — you communicate values between instructions. Anti and output dependences exist due to the limited number of architectural registers. For example, an output dependence — write after write — means an instruction is writing to
register 5, and a later instruction is also writing to register 5. These may have no relationship with each other; they just share a name, which is register 5. You don't have enough architectural registers, so the compiler allocated register 5 to both instructions. That's what I mean by a dependence on a name, not a value. This will become a lot clearer when we talk about register renaming and out-of-order execution later on, and we will later see what we can do about them. So let's take a look at these dependence types. Flow dependence: these are our operations — it can be any operation here, pick your favorite — but the first one does something with register 1 and register 2 and stores the result into register 3. The second operation, which comes after it, reads register 3, which means it truly depends on the previous instruction. This is a true dependence because this register is really communicating a value produced by the first instruction to the second instruction, which needs that value for its computation. This is absolutely necessary. You could change register 3 to register 1000 here, if you have it, but then you have to change this one to register 1000 also, right? Does that make sense? One instruction is producing the value, the other instruction is consuming the value, so you need to supply the value from somewhere — and the realization, soon, will be that we're going to supply this value not necessarily from the register file. Okay, anti dependence looks like this: the first instruction, the older one, is reading from register 1; the younger instruction — which happens to be the very next instruction — is writing into register 1. Write after read: there's a write that happens after a read to the same register. Now, if you think about this, the values stored inside these registers have
nothing to do with each other, right? There is some value in register 1 that's needed by the previous instruction; this later instruction overwrites that value — it doesn't need the old register 1. So if you had more registers — let's say I have my favorite register, register 1000 — if I change the write's destination to register 1000, I don't need to do anything else, and I've eliminated that dependence. That's why this is not a true dependence; it's a false dependence. It's fake. The reason it's there is that we don't have enough registers, and we're going to see this problem over and over. Why don't we have enough registers? Because we cannot have a huge register file — that's one good answer. The other good answer is that if you had thousands of registers, the encoding of the registers in the instruction set would be huge, so our instructions would be huge. There are multiple answers to this. Yes — "even if we don't have enough registers, why is this a problem, if we write to the register after we have read?" Okay, so why is this a problem? Basically, you still need to design the pipeline to make sure you don't get the wrong value. This is one of the reasons we write to the register at the end of the pipeline. "But then we always write after the read?" Well, not necessarily — if you have a pipelined processor and you didn't design it correctly, you might actually — let's say this is the next instruction — maybe this isn't the best example, but it's possible to imagine some case where you could write to this register and the other instruction incorrectly gets the new value. I cannot demonstrate it right now, but yes. Okay, the output dependence is another example of a fake dependence. Here you see this instruction is writing to register 3, there's an instruction that's dependent on register 3, and then this later instruction is also writing to register 3. Again, the realization
is that if you had enough registers, you could rename this one — the bottom one — to register 10000, and nothing else would have to change; there would be no write-after-write dependence. So basically you need to be careful with these anti and output dependences. They still consume some of your resources, but if you don't reorder the writes — if you always write to a register at the end of the pipeline, when the instruction is the oldest — then hopefully you will not run into any problems with anti and output dependences. With flow dependence there's still a problem: you need to supply the data value. Okay, so this is our pipeline. Let's see two instructions flowing through it: a load word and an independent sub. If the sub is independent, there's no problem, right? Everything writes back at the end. So the way to solve the anti and output dependences is to make sure writes happen at the very end of the pipeline and in an in-order manner. So what if sub were dependent on the load word? That's what we're going to deal with right now. And these are some readings — I already mentioned these, but I'm also going to recommend this one because it's going to be useful later, when we talk about interrupt and exception handling, out-of-order execution, superscalar execution — all of which we're going to cover — and it also covers the dependences, their types, and their handling in a nice way. So I'd recommend you take a look at that. Okay, anti and output dependences are easier to handle, as you may have guessed: write to the destination only in the last stage, and in program order. Flow dependences are more interesting and challenging, and there are actually six fundamental ways of handling flow dependences; we're going to cover some of them. The first: you detect the flow dependence and you wait until the instruction producing the value writes it into the register file —
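The three dependence types above (flow, anti, output) can be stated very compactly in code. Here is a minimal sketch (names and the instruction representation are my own, for illustration only) that classifies the dependences between an older and a younger instruction:

```python
def classify_dependences(older, younger):
    """Classify dependences from an older to a younger instruction.

    Each instruction is a dict like {"srcs": {"r1", "r2"}, "dst": "r3"}.
    Returns the set of dependence types present.
    """
    deps = set()
    if older["dst"] in younger["srcs"]:
        deps.add("flow")    # read-after-write: a true value dependence
    if younger["dst"] in older["srcs"]:
        deps.add("anti")    # write-after-read: a name dependence
    if younger["dst"] == older["dst"]:
        deps.add("output")  # write-after-write: a name dependence
    return deps
```

Only "flow" forces the pipeline to supply a value; "anti" and "output" disappear if the younger instruction's destination is renamed to a fresh register, which is exactly the register-renaming idea mentioned in the lecture.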
detect and wait, basically. Second: you detect the flow dependence and forward the data from the producer instruction to the consumer instruction as soon as the producer produces the value — you don't wait until the data is written to the register file. These are two different things, as we will see. Third: you detect that there is a dependence and eliminate it at the software level. Software figures out that two instructions are dependent on each other and separates them a bit so that they don't cause problems in the pipeline. This is actually a very powerful technique, if you can do it. Fourth: you detect the dependence and move it out of the way — that's going to be out-of-order execution; we're going to talk about that. Fifth: there's another approach, predicting the value, which we're not going to cover. You say, "I have a dependence, I don't know what to do, but I'm going to guess — I think my value is going to be zero," or you guess based on some records you've kept in the past, which is what we will do later, in lecture 16 and onward. You guess and execute the instruction, but if you're wrong, you need to do something about it — you need to fix the problem. Yes: "why do you want to move them far apart, why not move them closer together and forward?" Yes — if you can forward, you can also try to move them closer, but moving them far apart eliminates the dependence in the pipeline entirely; you don't need to do anything in the pipeline. Yes — I cannot hear you very well, go ahead. No, not this one; that's a different kind of prediction — we'll talk about branch prediction later on. And then, sixth: do something else, which is one of my favorite solutions. We're going to talk about that, hopefully later today — fine-grained multithreading. Let me give you the basic idea, because all of these are actually creative and nice ideas that have each led to a lot
of different types of research. "Do something else" means: execute from a different thread that you know is independent. The next instruction comes from some other thread, some other execution context, and you know they're independent, so you don't need to do anything. Of course there are upsides and downsides to this; modern GPUs do this, for example, and we will see that in later lectures. Okay, so let's talk about detection, because a lot of these approaches require detection: how do you detect the dependence? But before that, let me show you the problem. This is a five-stage pipeline, and I have these adds here. The first add is writing to register A, and all of these other adds are explicitly reading from register A — I'm showing this as an example, right? We're assuming that the write happens at the end of the writeback stage; we don't have the clever trick where you can write to the register file in the first half of the clock cycle. If you have that trick, you can do a little better. But basically, when this second instruction comes into the pipeline and is decoding, it needs to read register A. If it reads register A at this point, clearly there's a problem: it's going to read the wrong value of register A, because the first instruction writes to register A only at this later point. Similarly, these other instructions will also get the wrong value if you don't do something about it. Only this instruction will get the right value — make sense? — and the later instructions will hopefully get the right value also. So basically you need to do something about these instructions; you cannot just let them flow through the pipeline. Okay, so how do you do that? Let me finish this: you have a writeback — this is instruction I, in the general case. Instruction I is writing to a register here. If
instruction J has a flow dependence on instruction I, you need to separate them enough in the pipeline so that instruction J gets the correct value. So what does that mean? Instruction J needs to wait: you detect that there's a dependence and make instruction J wait. What does that mean? Instruction J stalls — it stays in the decode stage. One cycle is not enough, because instruction I still hasn't finished writing to the register that instruction J needs. Stall one more cycle — not enough. Stall one more cycle — now it's enough. So basically we insert bubbles in the pipeline: instead of this instruction moving through the pipeline, a no-op moves through, and that no-op clearly doesn't do anything. You're basically stalling, and that's the idea: the idea of a stall is to make the dependent instruction wait until its source data value is available. So what does it mean to stall? You stop all upstream stages — upstream means younger instructions; the younger instructions stop — and all of the older instructions get out of the pipeline, until the instruction that's actually causing the stall gets out of the pipeline; then you stop stalling. Okay, this is a good place to take a break, I think, and then we're going to talk about different ways of detecting as well as handling these data dependences, and hopefully we'll get to fine-grained multithreading as well. Okay, everything is good online? Let's get started. So now we have defined the pipeline stall — you know exactly what it means: you make the dependent instruction wait until its source data value is available, right? And to do so, you stop all upstream stages, meaning you stop all younger instructions, including fetch. The program counter doesn't change: you stay at the same instruction, same program counter, in the fetch stage; the same instruction is being decoded; the same instruction
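The "stall one more cycle — not enough — one more — now it's enough" counting above can be captured with simple arithmetic. Here is a sketch (my own simplified model of the five-stage pipeline in the lecture, not a general formula) of how many bubbles a consumer in decode needs, as a function of its distance from the producer:

```python
def bubbles_needed(distance, write_stage=5, read_stage=2, same_cycle_rw=False):
    """Bubbles required before a consumer in decode can safely read.

    distance: number of instruction slots between producer and consumer
              (1 = back-to-back). write_stage/read_stage are the pipeline
              stages where the register write and read occur (5-stage
              pipeline: WB=5, decode=2). same_cycle_rw models the
              write-first-half / read-second-half register-file trick.
    """
    gap = write_stage - read_stage
    if not same_cycle_rw:
        gap += 1  # must read the cycle *after* write-back completes
    return max(0, gap - distance)
```

Back-to-back instructions need three bubbles without the half-cycle trick and two with it, which matches both the three-cycle stall described above and the two NOPs the compiler inserts in the later example.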
is executing — wherever it is, basically. And you drain all the downstream stages so that the stall condition gets removed. You handle a long-latency memory operation — or any long-latency operation — the same way: if an instruction needs to wait in the memory stage, you wait, meaning you stall all upstream stages and drain downstream stages, and whenever the memory returns the result, you move on. Okay, so it's a very general way of handling disruptions in the pipeline. Now let's talk about interlocking, or detection of dependences. Essentially, interlocking is the detection of dependences between instructions in a pipelined processor to guarantee correct execution, and as we will see more and more in later lectures, there are ways of doing this in software and there are ways of doing this in hardware. There's software-based interlocking, meaning software detects the dependence between different instructions and does something about it — reorders code, for example, such that dependent instructions are separated from each other and don't cause a problem in the pipeline. Software ensures that the pipeline is always correct, moving, and full, if possible. It's a difficult task, but as we will see, there are ways of doing it, and if you cannot do anything else, you can insert no-ops — no-operation instructions that don't do anything. Clearly that's not good for performance. Or you can do it hardware-based, meaning hardware detects the dependence between instructions. In modern processors, both are employed. We're going to talk about both, but we're going to look especially into hardware-based interlocking soon. Does anybody know the longer version of the acronym MIPS — the ISA we've been dealing with? We never said what it stands for. Anybody? It's not "millions of instructions per second." No one? I don't expect anyone to know this, but it actually stands for "Microprocessor without Interlocked Pipeline Stages." The people who designed this
instruction set architecture had the goal of designing a really simple pipelined processor, and they designed the ISA based on that assumption. They wanted this microprocessor to not have any interlocked pipeline stages — meaning no hardware interlocking; their goal was to handle as much as possible in software. Now, that didn't work out very well: the latest MIPS processors all do hardware interlocking very heavily, meaning hardware dependence detection, and we will see the reasons — it's better for performance. Okay, so there are two ways of detecting dependences in hardware: one is scoreboarding; the other is combinational dependence check logic. I'll go through these very quickly, but you can think about them on your own. Essentially, scoreboarding means each register in the register file has a valid bit associated with it — we're going to see this later in out-of-order execution too. Whenever you're decoding an instruction, you figure out that this instruction is supposed to write to some register, and it resets that register's valid bit at that point, saying: you cannot trust this register's value, because I'm going to go through the pipeline and at some point I'm going to write to this register. So for each register you have some metadata saying whether, in the decode stage, you can currently trust the value in that register. An instruction in the decode stage checks whether all its source and destination registers are valid. So in the decode stage, whenever you're decoding an instruction, the first thing you do is check whether all your source and destination registers are valid — meaning you can trust the values; no one in the later parts of the pipeline is going to write to these registers. If that's the case, then you can continue: there's no dependence. But you reset your destination register's valid bit. Okay, so this is how you detect
dependences. Basically, if all of your source and destination registers are valid in the decode stage — when you read these valid bits — you don't need to stall; there is no dependence. Otherwise you stall the instruction. And of course, when an instruction actually writes to the register it's supposed to write, you need to make sure you set the valid bit again, correctly, so that later instructions can continue. So this is a very simple way of detecting dependences and stalling instructions that depend on instructions later in the pipeline that are going to write to a register. The advantage: it's very simple — one bit per register. Some people actually call these full/empty bits: is the register full or empty? Empty means somebody is going to write the value and it's not written yet. The disadvantage is that, unfortunately, this requires you to stall for all types of dependences, not only flow dependences — and you can think about why that is the case. There are ways of fixing that, but I'm not going to fix it here; I just want to give the idea of one way of doing it. Okay, so if you do that, what happens in the instruction decode stage? This load instruction — assume it's writing to register 10 — is going to reset the valid bit of register 10. When the subtract comes along, if it's reading from register 10, it'll see that the valid bit is zero, so it will stop — there is a stall condition. In other words, this subtract will stay here, the other instruction being fetched will stay there, the load will move, and a bubble will be inserted here, as we will see. Then the load moves again, and once the load actually finishes the writeback stage, it sets the valid bit back to one. At that point the subtract can move. Okay, so that's one way of
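The scoreboard described above is small enough to sketch directly. This is a minimal illustrative model (class and method names are my own), with one valid — or "full/empty" — bit per register:

```python
class Scoreboard:
    """One valid bit per register: False means a write is still in flight."""

    def __init__(self, num_regs=32):
        self.valid = [True] * num_regs

    def decode(self, srcs, dst):
        """Called in the decode stage.

        Returns True if the instruction may proceed (it then claims its
        destination by clearing the valid bit), False if it must stall.
        Checking dst as well as srcs is why this scheme also stalls on
        anti and output dependences, not just flow dependences.
        """
        if not all(self.valid[r] for r in srcs + [dst]):
            return False
        self.valid[dst] = False
        return True

    def writeback(self, dst):
        """Called when the write completes: the value is trustworthy again."""
        self.valid[dst] = True
```

A load writing register 10 clears `valid[10]` in decode; a dependent subtract then stalls in decode until `writeback(10)` runs, exactly as in the lecture's example.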
doing it. A more aggressive way, which is a little more costly in hardware, is combinational dependence check logic. This is what's done in your book, and in many books. There's a special piece of logic that checks whether any instruction in a later stage is supposed to write to any source register of the instruction being decoded. That's a mouthful, but essentially it says: while you're being decoded, you have some source registers, and you check whether any of those source registers is going to be written by any instruction in a later stage of the pipeline. You can think of it as comparators: for each source register, you have a comparator against the destination register of every instruction in the later pipeline stages — assuming those instructions are actually writing, of course; you need to check that too. So if the answer of that combinational dependence check logic is yes for a given source register — at least one instruction in a later pipeline stage is going to produce the value of that register — you stall the instruction in the pipeline. Otherwise there's no need to stall, because there's no flow dependence. The advantage of this is that there's no need to stall on anti and output dependences — it's very clean; you check only for flow dependences. The disadvantage is that the logic is more complex than a scoreboard, and it becomes more and more complex as your pipeline becomes deeper. We have a very simple five-stage pipeline here; if you make it 15 stages, you need to check all of those downstream stages. And actually, later we're going to see something called superscalar execution — the idea is not to have a single pipeline, but to fetch multiple instructions in a given cycle, replicating the pipeline — so then you need to check even more. But that's a later lecture, so patience on my part. Okay, so in pipeline operation, what happens in this case? Take subtract, for
example. When subtract gets to this stage, there is some combinational dependence check hardware — we will briefly describe that logic later — that checks whether register 10 is being written by any instruction here, here, or here. How can you check that? Basically, you take the destination register IDs of the instructions in those stages and compare them to source register 10. That's an equality checker — we built these in combinational logic, right? You have one equality checker that compares this source to the destination register coming from each later stage, so there needs to be a wire from each stage carrying its destination register ID. That's three equality checkers. Not only that — you should also check whether the instruction that could write to that register is actually writing to it, meaning there's a write-enable signal for the destination register. And you should do the same check for the other source register, so there need to be three more equality checkers. So there are a lot of equality checkers, as you can see, to determine whether another instruction in the pipeline is writing to a source I'm reading at this point. If yes, you stall, because the value in the register file is not correct for you; you need to wait for that other instruction to finish. Okay, so that's detection. Once you detect, you can stall — but what else can you do? The observation is that the dependence between two instructions is detected before the communicated data value becomes available. You can say: okay, there's a load instruction ahead of me that's currently calculating its address, and I know I'm dependent on it, because I know the destination and source registers of both instructions — but I don't know the data value yet, because the load instruction hasn't accessed memory yet; it hasn't even generated its address yet; it's in the execute stage, right?
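The bank of equality checkers just described maps directly to a pair of nested comparisons. Here is a sketch (names are my own; the three later stages would be EX, MEM, and WB in the five-stage pipeline) of the combinational stall decision:

```python
def must_stall(decode_srcs, later_stages):
    """Combinational flow-dependence check for the instruction in decode.

    decode_srcs:  source register IDs of the instruction being decoded.
    later_stages: list of (dst_reg, reg_write_enable) pairs, one per
                  instruction currently in a later pipeline stage.
    Each (src, stage) pair models one equality checker, gated by the
    stage's write-enable signal.
    """
    for src in decode_srcs:
        for dst, write_enable in later_stages:
            if write_enable and dst == src:
                return True   # a later instruction will produce this source
    return False
```

With two sources and three later stages this is six comparators, matching the "three equality checkers, plus three more" count in the lecture — and note that a matching destination with `write_enable` false (e.g. a branch) correctly causes no stall.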
So one option is to stall the dependent instruction right away. The other option is to stall the dependent instruction only when it's really necessary — that is, when you know the data value is truly not available. The load instruction, for example, may have already accessed memory but not yet written to the register file; you can supply that data value early to the instruction that really needs it. This is called data forwarding, or bypassing. It's going to complicate the pipeline even more, but it's also going to increase our performance. And there could be other options, as we will see later, too. Hopefully that makes sense — we're going to cover data forwarding and bypassing a lot today. Okay, so basically the problem we're going to solve: a consumer, or dependent, instruction has to wait in the decode stage until the producer instruction writes its value into the register file. That's the most conservative assumption we've had, right? We don't want to do this; our goal is to not stall the pipeline unnecessarily — we want to keep the pipeline moving, full, and correct. The observation is that the data value needed by the consumer instruction can be supplied directly from a later stage in the pipeline, instead of only from the register file, if and when it's available. So when a later-stage instruction produces the value, provide a bus or wire that supplies that data value to the instruction that needs it. And we know whether an instruction needs it, because we built the data dependence detection logic earlier — we know exactly what we're dependent on, so we can also grab that data value when it becomes available. Does that make sense? Yes: "would it make sense to add a few registers to cache values directly in the pipeline?" I mean, you're trying to optimize it — maybe it's possible, yes; it depends on the pipeline design. Potentially you could actually have
some internal registers to communicate these values, yes, potentially — but let's not go there yet. Okay, so the idea here is to add additional dependence check logic and data forwarding paths, or buses, to supply the producer's value to the consumer right after the value is available. We're not going to add a memory; we're going to supply it directly with wires — if it's a 32-bit value, we supply it over 32 wires to the consumer, right after the value is available. The benefit is that the consumer can now keep moving in the pipeline until the value can be supplied, so there's less stalling: you don't have to wait until the value is written into the register file. This is a problem in a five-stage pipeline, and it becomes a much bigger problem if your pipeline is 20 stages — you don't want to wait in the decode stage when the value is actually available but just not yet written to the register file, because you write to the register file at the end of the pipeline, right? So all of these ideas become much more powerful as your pipeline becomes deeper, and there's a reason pipelines become deeper: to increase clock frequency. Okay, so now we're going to go into a little more detail to implement these data dependence handling concepts. How do we implement stalling? This is the pipeline that we constructed — we actually spent a lot of time today constructing it. How do we stall? Essentially, to stall — let's say we want to stall the fetch stage here — we disable latching the PC: we disable writing into the PC, and we disable writing into this pipeline register. If you disable those, the instruction here stays put: the fetch stage doesn't move, because you don't change the PC, and the instruction here also doesn't move, because we don't write into this register. So
that way you stall these two stages; you ensure that the stalled instruction in the decode stage stays put. Then the question is, of course: what goes into the next stage? You detected that the instruction depends on some later instruction; you stalled here because you don't change this register, and you stalled here because you don't change the PC — so these two stages are stalling. Now, what goes into this next pipeline register in the next cycle? A bubble: insert an invalid instruction, or no-operation, into the stage following the stalled one. This is the instruction that's stalling because it's waiting for the data value to become available in the register file — assume we don't do forwarding right now — so you need to insert a bubble. How do you insert a bubble? There are multiple ways of doing it, but you can basically invalidate or zero out the entire pipeline register so that the control signals are all zeros — it's like a no-operation, basically. Okay, so let's take a look at this data dependence example, and then we're going to build more of this stalling logic. In this example, one instruction writes to register S0, and the next instructions read this register — they all read S0, as you can see, and S0 is the destination of the first. So these are read-after-write dependences. We're going to assume the trick that I mentioned earlier: add can write into S0 in the first half of cycle 5 — we have this optimized register file where one instruction can write a destination register in the first half of a cycle and someone else can read it in the second half. So you can see the dependences: the and instruction reads on cycle 3, obtaining the wrong value — clearly we did something wrong; we didn't stall — and the or reads on cycle 4, again obtaining the wrong value. Only the subtract will get the correct value; subsequent instructions read the correct value of S0 from the register file, including the subtract. So basically, wrong results happen only if the pipeline handles
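The "freeze two stages, squash the next one" recipe above amounts to three control signals. Here is a sketch (signal names are my own, loosely following common five-stage pipeline conventions, not the exact names in the lecture's datapath):

```python
def stall_signals(stall_needed):
    """Control signals for stalling an instruction in the decode stage.

    Freezing the PC and the IF/ID latch holds fetch and decode in place;
    clearing the ID/EX latch zeroes all control signals, which inserts
    a bubble (a no-op) into the execute stage.
    """
    return {
        "pc_write_enable":    not stall_needed,  # PC holds its value
        "if_id_write_enable": not stall_needed,  # fetch latch holds
        "id_ex_clear":        stall_needed,      # bubble into execute
    }
```

In a cycle with no stall, all latches advance normally and no bubble is inserted; asserting the stall for three consecutive cycles reproduces the three-bubble sequence from earlier in the lecture.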
data dependences incorrectly — that's why I don't like calling this a hazard; you need to make sure it doesn't happen. Okay, so this was one option: to say, I'm not going to add dependence checking logic; I'm going to punt to the compiler and tell the compiler, "this is how my pipeline looks — schedule instructions so that this doesn't happen," "this" meaning instructions getting incorrect results. This is one way of doing it. Same set of instructions: the add writes its result to the destination register in the first half of the fifth clock cycle, and the compiler was smart enough to insert two no-ops to make sure the dependent and instruction actually gets the correct value. Basically, you needed to stall for two cycles. One way is actually stalling for two cycles in hardware, which we're going to see; the other way is the compiler detecting this and saying, I'm going to insert two bubbles after this add. So essentially: insert enough independent instructions for the required result to be ready by the time it's needed by a dependent one. This requires a smarter compiler, one that understands not only the data dependences but also the structure of the pipeline and how data gets communicated through the register file. Ideally you don't insert no-ops, of course — inserting a no-op means adding an instruction that is useless. Ideally, what does the compiler do? It reorders instructions such that independent instructions follow each other, and if you're able to do that, you can actually improve performance. There are techniques to do that; we're going to talk about some of them when we cover very long instruction word architectures a few lectures later. But if the compiler cannot find any independent instructions to move or reorder, then it inserts no-ops as a last resort. This is a powerful technique, as you can see, and it
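The compiler-side NOP insertion just described can be sketched as a single pass over the program. This is an illustrative model only (the representation and `latency` parameter are my own; a real scheduler would first try to reorder independent instructions rather than insert NOPs):

```python
def schedule_with_nops(program, latency=2):
    """Insert NOPs so every flow-dependent pair is far enough apart.

    program: list of (srcs, dst) tuples in program order, where srcs is a
    tuple of source register numbers and dst a destination register number.
    latency=2 matches the 5-stage pipeline with the half-cycle register-file
    trick: a consumer must come at least 3 slots after its producer.
    """
    NOP = ((), None)
    out = []
    last_write = {}  # register -> index in `out` of its most recent writer
    for srcs, dst in program:
        for s in srcs:
            if s in last_write:
                gap = len(out) - last_write[s] - 1
                out.extend([NOP] * max(0, latency - gap))  # pad if too close
        out.append((srcs, dst))
        if dst is not None:
            last_write[dst] = len(out) - 1
    return out
```

For the lecture's add/and pair (producer writing S0, consumer reading it back-to-back), this inserts exactly two NOPs between them; independent instructions pass through unchanged.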
The only software that needs to change is the compiler: it needs to know the structure of the pipeline, which is a little more than what we have assumed from compilers so far. But if you have a smart compiler, it could know; you need to communicate the pipeline structure to it, of course.

Okay, so let's talk about data forwarding. What we just saw was essentially detecting and eliminating the dependence so that you never encounter it in hardware. Data forwarding is also called data bypassing, and the idea is to forward the result value to the dependent instruction as soon as the value is available. We've actually seen the basic idea before, though not in the same context: remember we talked about dataflow execution. In dataflow, the principle is that data values are supplied to a dependent instruction as soon as they are available, and an instruction executes when all of its data operands are available. So essentially, data forwarding brings a pipeline closer to dataflow execution principles: you supply the data value to a dependent instruction when it's ready. We're going to bring the pipeline closer and closer to dataflow principles as we cover out-of-order execution as well, so it's going to be very fascinating, I think.

Okay, data forwarding: let's look at the locations in the datapath. This is the same example I showed you: add is writing to register S0, and we're assuming the write happens into the register file in the first half of the fifth clock cycle. And is reading from S0 and needs it here; and really begins its execution here, so the and operation needs that value at the beginning of clock cycle 4. Now let's ask the question: is the value named by S0, the value produced by this add instruction, available at the beginning of clock cycle 4? And the answer is
yes. Let's look at what happened: add got fetched, decoded, and executed, meaning it produced the value at the end of clock cycle 3, and it latched the value into this pipeline register. Now, if you detected that and is dependent on S0, you could take that value, put it on the correct input of the ALU, and discard what you fetched from the register file. And went through the pipeline while add was executing; it fetched S0 from the register file, the wrong, old value of S0, which is not updated yet. But we also realize, because we designed the pipeline, that add actually produces its value here, and we can supply that directly to the input of the ALU at the beginning of the fourth clock cycle. So we provide that value directly over here and add a multiplexer that chooses between the value we read from the register file and the value coming from this other register, which, as we will see, is the pipeline register at the end of the ALU. Then we select the value coming from that pipeline register to get what was produced by add. Makes sense, right? So we add another path, another mux, another control signal.

(Student question about reusing values.) Yes, you can do that, but it's a different concept. This is just about supplying the produced value to the consumer; if a value is not needed later on, there could be ways of detecting that, but that's a very different concept. (Student: can we really connect these?) We're going to see it soon, yes; this is just a conceptual view. Essentially, we supply the latched output of the ALU to the input of the ALU. We're going to see the implementation, but yes, you do physically connect them; you need that wire.

Now let's look at another example over here, this or. This or needs that value
when it's actually in the execute stage, and we know add has already produced this value. Add is right now in the writeback stage; it has not written to the register file yet, but it can supply the value directly from this pipeline register back to the input of the ALU, so this or can execute with the correct value. So that's another forwarding path.

There is one forwarding path from the output register of the ALU stage to the input of the ALU; another forwarding path from the output register of the memory stage to the input of the ALU; and then there's an internal forwarding path that we assumed by a smart register file design: whenever you write to the register file in the first half of the cycle, another instruction that needs the value can read it in the second half. You could have added multiplexers to accomplish that as well. So these are the three forwarding paths in our simple pipeline.

Now let's look at how we implement this. Essentially, this is the implementation, what I just described, realized using wires and multiplexers. This is the ALU; remember that add produces its result, and it's latched into this register, the pipeline register at the end of the ALU stage. We take it and add a wire that goes into either input of the ALU, because a dependent instruction may need it as the top input or the bottom input, and then we need some logic to select between them, of course. So: from the latched output of the ALU to the input of the ALU, that's this example, meaning you forward the value produced by the instruction here to the instruction that's just starting to execute, two instructions back to back. Then there's another forwarding path from here to here as well; we're widening that mux now, because the value can also come from an instruction that has moved over here and has done the computation but has not written
back to the register file yet. We want to capture that value with another path, so we route it here, widen this multiplexer, and change the control logic so that we decide which value to take. And then we can also get the value from the register file, which we assume does internal forwarding (that assumption makes this picture easier), or call it a smarter register file: you write in one half of the clock cycle and read in the other half.

So now you can see this is not that bad, but it did increase our complexity. You can also specify the logic for controlling this mux. Essentially: if the instruction here is reading a register that's written by the instruction over here, you should get the data value from here; otherwise, if it's reading a value written by the instruction that produces its value over here, you should get the data value from over here; otherwise, you get it from the register file. Why? Because you should get the latest definition of that value, the latest version of the register being written. This is the oldest instruction, and this is the youngest instruction that writes to the register you may need: older, younger, youngest, if you will.

So this is essentially the design of the forwarding: we now forward to the execute stage from either the memory stage or the writeback stage. And I think I kind of answered this question earlier: when should we forward from the memory or writeback stage? If that stage will write to a destination register, and that destination register matches the source register we're sourcing from. If both the memory and writeback stages contain matching destination registers, which is possible, valid, and allowed, the memory stage has priority to forward its data, because it contains the more recently executed instruction.
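The priority rule can be written out the way the book's pseudocode does it. Here is a sketch in Python (the signal names follow H&H's style, e.g. the memory- and writeback-stage destination registers, but treat the exact names and the register-zero guard as assumptions):

```python
def forward_a(rs_e, writereg_m, regwrite_m, writereg_w, regwrite_w):
    """Select the source of the execute-stage ALU input A:
    'mem'     -> forward from the memory-stage pipeline register,
    'wb'      -> forward from the writeback-stage pipeline register,
    'regfile' -> use the value read from the register file."""
    if rs_e != 0 and rs_e == writereg_m and regwrite_m:
        return "mem"      # the younger producer wins: memory stage first
    if rs_e != 0 and rs_e == writereg_w and regwrite_w:
        return "wb"
    return "regfile"

# Both older instructions write register 8: the memory stage has priority.
assert forward_a(8, 8, True, 8, True) == "mem"
# Only the instruction now in writeback wrote register 8:
assert forward_a(8, 9, True, 8, True) == "wb"
```

The same check is repeated for source register B, driving its own mux select signal.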
Basically, you give priority to this red arrow, clearly, because it is the younger instruction producing the value.

Okay, so if you want to specify this as logic, your book does it; I will not go through it, but there is no magic after this point. As long as you can describe what you're doing in a logical way, you can put it down as Verilog. You can read it in your book; it's basically pseudocode, and you can write it very easily. But you need to do it for both source registers, source register A and source register B: there are potentially two source registers for a given instruction.

That said, forwarding is not always possible. Depending on the structure of the pipeline and the latencies of instructions, you may not be able to forward data. In our pipeline there is one case where we cannot forward: a dependent instruction that immediately follows a load instruction. Forwarding is usually sufficient to resolve these dependences, but there are cases where it's not possible due to the pipeline design and instruction latencies. For example, take a load instruction producing S0. When does the value become available? After the load actually accesses memory and gets the data, at the end of cycle 4. When does the and instruction get to the execute stage? In the same cycle: it moves to the execute stage at the beginning of cycle 4, so it really needs the data value at the beginning of cycle 4. Essentially, the load's result cannot be forwarded to the execute stage of the next instruction. Well, it can be, but then we break the critical path principle: we could do exactly what this troublesome arc on the slide shows, take the value from here and forward it here, and now we have a very long critical path. As opposed to having one memory access and then latching the result, we would have one long path.
That path is: memory access, get the data, supply it to the ALU, do the ALU operation, and then latch the data. You don't want that forwarding, because it roughly doubles your critical path; essentially, it breaks our critical path design principle. So what happens is you need to stall the and. In this pipeline, this is the only condition where you need a stall, unless you have long-latency operations, which we're ignoring for now. So stalling is necessary for this memory-to-execute dependence; these instructions need to stall when that dependence is detected.

Let's look at the stalling and dependence detection hardware. This is again from your book, where you can see the complete picture; there is some dependence detection logic. So how do you actually stall? That's what I'd like to point out here. Stalling means the instruction in the decode stage should not go to the execute stage: this and instruction is in the decode stage over here, and it should stay there, as you can see, for two cycles, until the value of the load word is available and can be forwarded here. That's a nice, clean forwarding path: you get it directly from the latched output into the input of the ALU, and we already added that forwarding path. This instruction also needs to stall, of course.

In order to stall, you need to make sure the later stages get nops inserted. Your book does it by clearing the pipeline register following the stage where the instruction is stalling. This instruction is stalling over here; it cannot move, but something must move into the next pipeline register, and that something is a nop. So you clear the pipeline register such that no harm is done; it's a bubble. And then you have these enable bits, which I already discussed, that disable
latching for this register, so that this instruction doesn't move, and for this register, the PC, so that this instruction doesn't move either.

So we've already seen this. To put it in words: stalls are supported by adding enable inputs to the fetch and decode pipeline registers and a clear input to the execute pipeline register in the simple pipeline. Alternatively, you can have a valid bit associated with each pipeline register, indicating whether its contents are valid, and take action based on that validity. There are multiple ways of implementing this that I will not detail here, but the concept is simple: you need to insert a bubble. If you're stalling, you cannot move, but something else needs to move into the next pipeline stage, and that's a nop. This is very similar to the compiler inserting a bubble, a nop instruction, except the hardware is inserting it. So when a load-word stall occurs, you keep the values in the decode and fetch stage pipeline registers (StallD and StallF are asserted) and you clear the contents of the execute stage register, producing a bubble. A flush signal is also asserted, but I'm not going to go into that. Your book does a good job covering this if you're interested in learning more, but I've covered essentially everything you need to know.

Okay, let's talk about control dependence now, because this is going to occupy a lot of our time for several lectures, up through out-of-order execution. Data dependences we've handled, hopefully. Memory I think you know how to handle now: if memory takes multiple cycles, you do the same thing, stall everything that comes before and inject bubbles into everything that comes after. Any time the pipeline cannot move, you now know the general technique to handle it. But what about control dependence? There is an elephant in the room here, because control dependence is arguably a bigger problem than everything we have seen so far.
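Going back to the load-use stall for a moment: the book expresses the detection condition in pseudocode, and a Python sketch of it might look like the following (signal and register names are illustrative, in the style of H&H's decode- and execute-stage signals, but treat the exact spelling as an assumption):

```python
def load_use_stall(rs_d, rt_d, rt_e, memtoreg_e):
    """rs_d, rt_d: source registers of the instruction in Decode.
    rt_e:        destination register of the instruction in Execute.
    memtoreg_e:  True if that Execute-stage instruction is a load.
    Returns (StallF, StallD, FlushE): freeze fetch and decode,
    and insert a bubble into the execute stage."""
    lwstall = memtoreg_e and rt_e in (rs_d, rt_d)
    return lwstall, lwstall, lwstall

# Load writing s0 in Execute while an and reads s0 in Decode: stall.
assert load_use_stall("s0", "s1", "s0", True) == (True, True, True)
# Same registers, but the producer is an add: forwarding suffices, no stall.
assert load_use_stall("s0", "s1", "s0", False) == (False, False, False)
```

Note how all three outputs are driven by the same condition: the two enables freeze the front of the pipeline while the clear input creates the bubble.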
So what is a control dependence? A control dependence is really a dependence on the program counter: it is data dependence on the instruction pointer, the PC. This is the question in control dependence: I fetch this instruction in the fetch stage; what should I fetch in the next cycle? Well, you would say the address of the next instruction. And all instructions are control-dependent on previous ones: in a von Neumann machine you have sequential execution, and you increment the program counter by 4. That's why every instruction is control-dependent on the previous ones.

So by fetching the next instruction at PC + 4, we are implicitly making an assumption: the next instruction is the next sequential instruction. Most of the time this is true. If the fetched instruction is a non-control-flow instruction, the next fetch PC is the address of the next sequential instruction; this is easy to determine if we know the size of the fetched instruction, so we do PC = PC + 4 and fetch the next instruction that way. But what if the fetched instruction is a control-flow instruction? How do we determine the next fetch PC? That's a good question. In fact, how do we even know whether or not the fetched instruction is a control-flow instruction? We haven't decoded it yet. We have a pipeline: a fetch stage and a decode stage. We fetch the control-flow instruction, and only at the end of the decode stage do we know that it's a control-flow instruction, yet we're already fetching the next instruction. If this is a taken branch, then we fetched the wrong instruction. That's the problem, and that's why this is so fundamental: because of the sequential control dependence we have, we need to be very careful. So essentially, you can do something
called branch prediction, which we will see much more of. This is a special case of data dependence: dependence on the PC. Remember, BEQ, the conditional branch, is not resolved until the fourth stage of the pipeline, when we actually execute in the ALU, produce the result, latch it, and then change the program counter from the latched output of the ALU. Instructions after the branch are fetched before the branch is resolved, so we have implicitly speculated: we fetched instructions into the pipeline without knowing whether we should really have fetched them.

So here is a simple branch prediction scheme: you always predict that the next sequential instruction is fetched. This is also called always-not-taken prediction. If you are wrong, you flush the instructions on the not-taken path: you figure out whether the branch is taken when the branch executes and the condition is determined, and if the branch is actually taken, you flush those instructions. Flush means clearing those pipeline registers, and we know how to clear registers now. So there is a misprediction penalty, the number of instructions flushed when a branch is incorrectly predicted, and the penalty can be reduced by resolving the branch earlier; this is also called early branch resolution.

Now let's look at that a little. This is our pipeline so far. A conditional branch is resolved over here; it's really updating the PC over here. So the branch is here. Assume the branch is not taken: we fetch the next instruction here, the next, the next; everything is good, the branch is not taken, so we did the right thing. If the branch was taken: when we execute the branch and find it taken, we have already fetched the next instruction at PC + 4, but we should have fetched from the target. So this is wrong, clearly, and this is also wrong, because we fetched PC
+ 4, PC + 8, PC + 12, whereas we should have fetched target, target + 4, target + 8, assuming none of those are branches either.

So that's the problem. This is the branch instruction; it resolves in cycle 4, and we have fetched 24, 28, 2C, consecutive instructions, as you can see. When the branch resolves, we need to change the program counter to 64, because we can calculate the target address: the offset, 40, plus PC + 4, that is 20 + 4 = 24, gives 64. That happens in cycle 5: the branch resolves in cycle 4, the result gets communicated to the PC in cycle 5, and you fetch from the correct PC. In the meantime, you need to get rid of the instructions you fetched into the pipeline; they are wrong. So you flush, meaning you clear those registers. That's it: you just clear those registers and you're done, because those instructions did no harm so far. They did not write to any register, and they did not write to memory, because of the structure of our pipeline; as a result, we can do this. Sounds beautiful, right? So now you've seen an example of a simple branch misprediction.

Now let's try to make this a little better. We want to resolve branches a little earlier, and we can actually do that, at the expense of some things. For a branch, you actually know the register values early. Look at branch-equal: it tests two registers; you compare them, and if they're equal, you set the program counter to the target PC that you calculate. So you can add an equality checker here, right after these registers, and then, if the instruction is a branch and the equality checker says equal, you select as the next program counter the target address: the sign-extended, left-shifted immediate added to PC + 4. You do the target address calculation here, and you also do the branch condition calculation in the decode stage: we move everything earlier so that we don't flush more instructions than needed.
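To tie the cycle numbers in these examples together, here is a small model in Python (my own sketch; it assumes the simple one-instruction-per-stage-per-cycle pipeline of this lecture, with stage 1 = fetch):

```python
def mispredict_flush(fetch_cycle, resolve_stage):
    """For a branch fetched in cycle `fetch_cycle` and resolved at the
    end of pipeline stage `resolve_stage` (stage 1 = fetch), return
    (redirect_cycle, wrong_path_fetches) under always-not-taken."""
    resolve_cycle = fetch_cycle + resolve_stage - 1  # one stage per cycle
    redirect_cycle = resolve_cycle + 1               # new PC visible next cycle
    wrong_path = resolve_stage - 1                   # fetches to flush
    return redirect_cycle, wrong_path

# Branch fetched in cycle 1, resolved in stage 4, as in the example:
# redirect in cycle 5, flush the three wrong-path fetches (24, 28, 2C).
assert mispredict_flush(1, 4) == (5, 3)
# Early resolution in the decode stage: redirect in cycle 3, flush one.
assert mispredict_flush(1, 2) == (3, 1)
```

The second case is exactly the early-branch-resolution payoff discussed next: moving resolution from stage 4 to stage 2 shrinks the flush from three instructions to one.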
And then, of course, if it's a branch, you need to clear this register. That sounds good, and clearly it has a benefit: if you do this, you need to flush only one instruction. You fetch the next instruction, the branch resolves at the end of clock cycle 2, and at the beginning of clock cycle 3 you're at the target address. So you fetch only one useless instruction, and you flush only that one. The question is: is this a good idea?

(Student: wouldn't we increase latency?) Exactly, yes. Let's start with the advantage: you reduce the branch misprediction penalty, so you reduce the cycles per instruction, which is good. But you potentially increase the clock cycle time; in fact, you probably have increased it, meaning a longer clock period and a lower frequency. You also have additional hardware cost: specialized hardware that's likely not used by any other instruction. Who else is going to use it? No one other than the branch. We do it to reduce the latency of branches, but we increase the clock cycle time for every other instruction. Is that a good idea? Well, you need to evaluate it: figure out how much you reduced the average CPI and how much you increased the clock cycle time. If the result is a net positive, maybe it's a good idea, assuming the hardware cost is not too high. But based on what I know and based on this pipeline, it's probably not a good idea here; in some other pipeline it may be a different story.

And that's not the end of it: we forgot data forwarding. How can we resolve this branch in the decode stage? What I just showed you works only if the data values of the registers are available.
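As an aside, the evaluation just described is simple arithmetic: average time per instruction = CPI × clock period. The numbers below are invented purely for illustration (20% branches, a quarter of them mispredicted under always-not-taken, and a guessed 15% increase in clock period from the fatter decode stage):

```python
def time_per_instruction(branch_frac, mispredict_frac,
                         flush_penalty, clock_period_ns, base_cpi=1.0):
    """Average execution time per instruction, in nanoseconds."""
    cpi = base_cpi + branch_frac * mispredict_frac * flush_penalty
    return cpi * clock_period_ns

# Late resolution: 3 flushed instructions per misprediction, 1.0 ns clock.
late = time_per_instruction(0.20, 0.25, 3, 1.00)    # CPI 1.15 -> 1.15 ns
# Early resolution: penalty of 1, but the decode stage slows the clock.
early = time_per_instruction(0.20, 0.25, 1, 1.15)   # CPI 1.05 -> ~1.21 ns
# With these assumed numbers, early resolution is a net loss, matching
# the lecture's conclusion for this particular pipeline.
assert early > late
```

Change the assumed misprediction rate or clock penalty and the verdict can flip, which is exactly why the lecture says you have to evaluate rather than decide by instinct.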
The registers must be available for the branch to correctly evaluate its condition in the decode stage. What if they're not? What if there are instructions over here that are still writing those registers? And there very likely are, because a branch typically depends on instructions that produce those registers. So you need to forward those registers too: whenever some earlier computation produces the value, you need to forward the data to the decode stage. There needs to be a forwarding path from here to here as well. So it adds even more complexity and potentially lengthens your critical path even more. This is from your book: the full stall and forwarding logic. You stall because an instruction depends on a load that has not produced its value yet, or you stall because there is a branch that you mispredicted.

And this is your final pipelined processor from your book. It includes always-not-taken branch prediction, early branch resolution, forwarding, and stall logic. It's complete; it's nice. What it doesn't include is multicycle memory or multicycle operations; that requires a bit more work. So it's still not a realistic, full pipeline: it assumes magic memory with one-cycle access.

Okay, on to doing better: smarter branch prediction. What we had was not so smart; we basically said we're going to implicitly assume that the PC will be incremented by 4, which is called always-not-taken prediction. But you can be a bit smarter: you can guess whether or not the branch will be taken, and existing processors are actually quite smart about this. For example, backward branches are usually taken, because loops iterate many times; if you see a backward branch, you predict it taken. You can keep a history somewhere in the processor, as we will see in later lectures, of whether a branch was previously taken, and improve the guess based on that history. If you've seen a branch taken a million times, you say, okay, it's probably going to be taken again.
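The keep-a-history idea can be sketched as the classic two-bit saturating counter, a preview of the branch prediction lectures (the state encoding below is the standard textbook one; the starting state is my own choice):

```python
class TwoBitPredictor:
    """Two-bit saturating counter for one branch:
    states 0,1 predict not-taken; states 2,3 predict taken.
    Two wrong guesses in a row are needed to flip the prediction,
    so a single loop exit does not retrain the predictor."""
    def __init__(self):
        self.state = 2                      # start weakly taken (a common choice)

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:  # called once the branch resolves
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch: taken nine times, then not taken once at loop exit.
p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:
    if p.predict() == taken:
        correct += 1
    p.update(taken)
assert correct == 9   # only the loop exit is mispredicted
```

A real processor keeps a table of such counters indexed by the branch address, so different branches get their own history.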
You can be even smarter still, as we will see. So accurate branch prediction reduces the fraction of branches that cause a flush, and there are many sophisticated techniques employed in modern processors to achieve it. The reason is hopefully clear: if your prediction is incorrect, you fill the pipeline with junk, meaning useless instructions that you're going to flush. You'd better fill the pipeline with useful instructions, and the only way to do that is to predict the correct instructions to fetch. In fact, existing machines use simple machine learning methods to learn as well; simple, because a hardware implementation of complex machine learning methods is too slow for high-frequency operation. We will see a lot of this in the branch prediction lectures; it'll be a fascinating overview of what people have done in 60 years of branch prediction. If you're impatient, you can watch the earlier versions of those lectures, but hopefully we'll get to it.

Okay, I think I do not have time to cover the pipeline performance example; I will leave those slides for you to study. It's a very simple example, and it's in your book as well. Next week we're going to start with some other issues in pipelining, and then we're going to build up to out-of-order execution. So have a good weekend; I'll see you next week.