Transcript for:
[Lecture 11] Multicycle Processor Design and Pipelining

Okay, I think it's time. It's good to see a lot of people; Fridays are a bit lonely here, so I hope to see more of you on Fridays. Today we're going to continue what we started in the last lecture, which was multicycle processor design. Remember, we're still covering microarchitecture: we have been designing the microarchitecture of a very simple MIPS processor for the last couple of lectures, and we decided that single-cycle processor design is not a good idea because it violates three major fundamental principles, which we will very quickly revisit today just to jog your memory. Today we're going to design the full multicycle processor, well, "full" for a subset of the MIPS ISA, following your book; there's more in your book about it. Then we're going to jump to pipelined processors, which are actually the type of processors that we have today, and we will build further on pipelining as well.

You have an extra-credit assignment due April 1st, if you want to get extra credit. And this is where we are; I've covered most of this already. My goal by the end of this week is to finish pipelining: we're going to cover a bunch of issues in pipelining, and then next week we're going to talk about state maintenance and recovery, precise exceptions, and out-of-order execution. Does that sound good? By the end of next week you will know how a modern microprocessor actually processes instructions in an out-of-order manner. We're just building up to it right now, and we need to build up to it because you first need to see what data and control dependence handling is, what pipelining is, and so on.

These are some of the readings for multicycle microarchitecture. I'm going to follow H&H chapter 7.4, which I think does a good job. The Patt and Patel appendices, especially Appendix C, build a full LC-3b microarchitecture; it's a very good one. I'm going to briefly talk about it, but if you're really
interested, I would suggest looking at it, because it presents the concepts very nicely. And then we're going to talk about pipelining.

Okay, multicycle, just to recall: our goal with multicycle microarchitectures is to let each instruction take only as much time as it really needs, as opposed to the worst-case instruction determining the clock cycle time, as happened with single-cycle microarchitectures. This is the pictorial view: you have architectural state at the beginning, and every clock cycle you do something and update some microarchitectural state to process an instruction, until at the end of the instruction you finish the processing and update the architectural state. So we can use multiple clock cycles to process each instruction, we can use a different number of clock cycles for different instructions, and we can determine the clock cycle time independently of how long it takes to process any one instruction. We said that you could do this by building a state machine, and we're going to build that state machine today, a small version of it. This was the full state machine for the LC-3b that we partially covered, but we're going to build a new one, not for the LC-3b but for MIPS. And this is the LC-3b multicycle implementation; if you read Appendix C you will see a lot of this.

So these are what we aim to accomplish with multicycle design: we want better critical paths, reducing the clock cycle time; we want to optimize the finite state machine for the important things that we want to execute, for example the most common workloads and most common instructions; and we want a balanced design that provides no more capability than is really needed. You have already seen these. But this will come at an expense; as with every idea, there is an upside and a downside. If you're given an idea and
you want to analyze it, always look for the upsides and the downsides. There is nothing without an upside, and nothing without a downside; the upsides and downsides may differ in weight, but they exist. In this case, we're going to need to store intermediate results at the end of each clock cycle, and we're going to pay both the hardware overhead for that and the sequencing overhead, meaning we waste part of every clock cycle, as we saw in timing and verification. We're also going to see that we have limited concurrency, and we're going to solve that problem partially with pipelining. Does that sound good?

So, you've seen all of this, and this is the slide I stopped at last time. Basically, we want to have a closer look at a multicycle microarchitecture. This is actually an idea that was developed by Maurice Wilkes, a pioneer in the field of computer architecture and microarchitecture; he developed the concept of microcoded, microprogrammed machines that we're going to see soon, and a lot of these concepts are based on his descriptions.

The key realization for building a multicycle microarchitecture is to see the instruction processing cycle as a finite state machine that sequences between states and eventually returns to the fetch-instruction state, and you keep doing this; you have a finite state machine that implements your entire ISA. A state is defined by the control signals asserted in that state, and the control signals for the next state are determined in the current state, so that we can parallelize processing and don't need to waste part of the clock cycle determining control signals. Makes sense? We've discussed a lot of this already. So now we're going to implement a basic version. You've actually seen a lot of these concepts, so hopefully a lot of this is familiar.
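The state-machine idea above can be sketched in a few lines of Python. This is a toy model, not the book's exact state names or signal encodings: each state carries the control signals asserted in it plus a next-state function, and processing one instruction means sequencing through states until we return to fetch.

```python
# Toy control FSM: each state is a pair (control signals asserted in it,
# next-state function of the opcode). Names loosely follow the H&H
# multicycle design but are illustrative, not the book's exact encoding.

CONTROL_FSM = {
    "FETCH":    ({"IRWrite": 1, "PCWrite": 1},
                 lambda op: "DECODE"),
    "DECODE":   ({},  # "do no harm": read registers, write nothing
                 lambda op: "MEMADR" if op in ("lw", "sw") else "EXECUTE"),
    "MEMADR":   ({"ALUSrcA": 1},
                 lambda op: "MEMREAD" if op == "lw" else "MEMWRITE"),
    "MEMREAD":  ({"IorD": 1},
                 lambda op: "MEMWB"),
    "MEMWRITE": ({"IorD": 1, "MemWrite": 1},
                 lambda op: "FETCH"),
    "MEMWB":    ({"RegWrite": 1},
                 lambda op: "FETCH"),
    "EXECUTE":  ({"ALUSrcA": 1},
                 lambda op: "ALUWB"),
    "ALUWB":    ({"RegWrite": 1},
                 lambda op: "FETCH"),
}

def run_instruction(opcode):
    """Sequence through the FSM for one instruction; return the states visited."""
    state, visited = "FETCH", []
    while True:
        visited.append(state)
        _signals, next_state = CONTROL_FSM[state]
        state = next_state(opcode)
        if state == "FETCH":       # back to fetch: this instruction is done
            return visited

print(run_instruction("lw"))   # 5 states: FETCH, DECODE, MEMADR, MEMREAD, MEMWB
print(run_instruction("add"))  # 4 states: FETCH, DECODE, EXECUTE, ALUWB
```

Note how different opcodes naturally take different numbers of cycles here, which is exactly the point of the multicycle design.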
The implementation will be similar to what we did for the single cycle: we're going to start from basic instructions, start with the datapath, and keep adding things to it. Essentially we're going to see the instruction processing cycle divided into states; each stage of the instruction processing cycle can take multiple states, and we sequence from state to state to process an instruction. That's what a finite state machine is, and you saw this a long time ago in these lectures. The behavior of the entire machine in a state is completely determined by the control signals asserted in that state, similar to the control signals we had in the single-cycle design, which essentially determine what the datapath elements do to the data and the instruction. We're going to do the same thing: control signals determine everything in a particular state, and the behavior of the entire processor is fully specified by a finite state machine. So in a state, or in a clock cycle, control signals control two things: how the datapath should process the data, as we have seen in the past, and how to generate the control signals for the next clock cycle.

So let's build another example multicycle microarchitecture. I'm going to follow your book, so if you're interested you can look at it; I don't remember if it's section 7.3 or 7.4, maybe someone can verify, and maybe the figure is in both places. I thought it was 7.4 but this says 7.3, so there might be a bug here. Anyway, this is a multicycle machine built in your book that supports a subset of the MIPS instruction set architecture. Oh, actually, this is the single-cycle microarchitecture, sorry; this is the single-cycle design that we developed in your readings. So what is
the issue with the single cycle? We're going to build chapter 7.4 from here. For the single cycle, we already discussed this a lot: single-cycle time is limited by the longest instruction, so you have a low clock frequency. But we also said you have a lot of adders, ALUs, and memories, which gives you high hardware cost, because you have to provide all of the combinational resources needed to process any given instruction in a single clock cycle. If a given instruction needs to access memory twice, you need to provide the resources for that. If a given instruction needs to do multiple additions, which many instructions do (they need to increment the PC by four, they need to do an addition, they need to do something else), you need to add adders all over the place; you cannot reuse the same adder. In a multicycle design we fix some of these problems: you get a higher clock frequency, simpler instructions take only a few clock cycles, and you can reuse expensive hardware across multiple cycles. I'm going to show this when we build the datapath. Of course, we know the downsides also; I'm not going to repeat them.

When you want to design a multicycle microarchitecture, you go through the same steps as for the single cycle. What did we do for the single cycle last week? We designed a datapath. How did we design the datapath? We took each instruction: we said, okay, let's start with R-type instructions and build a datapath for them; then we added I-type instructions, then the load instruction, then the store instruction, then the branch instruction, then a jump, and then we stopped there, because you could keep going forever. You do the same thing here: you design the datapath, you add the control signals, and then you design the control logic. There's no magic; you follow the same steps to construct a multicycle processor, except it's a multicycle processor, not a single-cycle one.
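To see why multicycle is not automatically faster per instruction, here is a quick back-of-the-envelope comparison in Python. The stage latencies are made up purely for illustration (they are not measured numbers):

```python
# Made-up stage latencies in picoseconds, purely illustrative.
stage_ps = {"fetch": 200, "decode": 100, "execute": 120,
            "memory": 200, "writeback": 100}

# Single cycle: the clock period must cover the slowest instruction
# (lw exercises every stage), so every instruction pays the worst case.
T_single = sum(stage_ps.values())        # 720 ps per instruction

# Multicycle: the clock period only has to cover the slowest stage,
# but each instruction now takes several cycles.
T_multi = max(stage_ps.values())         # 200 ps
lw_ps   = 5 * T_multi                    # lw:  5 cycles -> 1000 ps
beq_ps  = 3 * T_multi                    # beq: 3 cycles ->  600 ps
```

With these particular numbers, lw is actually slower on the multicycle machine, because the stages are unbalanced and every cycle pays the sequencing overhead mentioned earlier; the real wins are hardware reuse and the fact that short instructions like beq finish early.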
So what do we optimize? When you construct the datapath, you need to think: what am I going to do in my datapath? Now I can optimize things. As opposed to using two memories, we can use a single memory, and we're going to see that this is much lower cost. As opposed to using three adders (one ALU for arithmetic, which you need, one for PC incrementing, and another for branch address calculation), we're going to use only one ALU for all operations, and this lowers the cost again. And as opposed to having each instruction take one cycle, we want to determine the clock cycle time independently of instruction processing time, so when we design the datapath plus the finite state machine, we're going to divide each instruction into multiple clock cycles. Makes sense?

So let's construct the multicycle datapath. In this case we'll start with the load word instruction, because why not? Last time we started with an add instruction. Why did we start with an add instruction? Why not? There's no real reason to pick one instruction over another, because in the end you're designing a datapath for all instructions, and if you're going to execute all instructions you might as well start with some random instruction. It's also good to think about starting with common-case instructions; maybe "why not" is not the best answer, and a qualified answer is: which instruction is going to be the most important one for me? Loads and adds are pretty common instructions, so you can start with those without any downsides.

Remember the load instruction. It has a destination register and a source register; essentially it computes the address by taking the source (base) register and adding to it a sign-extended immediate, reads the content of the memory location at that address, gets the data, and puts the value into the destination register.
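The load semantics just described can be written down directly. This is a minimal sketch of lw in Python (register and memory modeled as dicts; the helper names are my own, not from the book):

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm - 0x10000 if imm & 0x8000 else imm

def lw(regs, mem, base, rt, imm16):
    """lw rt, imm(base): effective address = R[base] + SignExt(imm)."""
    addr = regs[base] + sign_extend16(imm16)
    regs[rt] = mem[addr]

regs = {1: 0x1000, 2: 0}
mem  = {0x0FFC: 99}
lw(regs, mem, base=1, rt=2, imm16=0xFFFC)   # 0xFFFC sign-extends to -4
print(regs[2])                               # 99, loaded from 0x1000 - 4
```

Note that the offset is signed: 0xFFFC addresses four bytes below the base register, which is why the sign extender in the datapath matters.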
I'm not going through the encoding in detail because we've seen this, but the encoding is shown here; it's an I-type instruction in MIPS. Fetch is really the same for all instructions. For load, step one is fetching the instruction. To construct the datapath you need a program counter; you supply the program counter to the memory that we have seen before. If you remember from earlier lectures, you input the program counter to the address port of the memory, and you get the data from the output port of the memory. This data is really your instruction, and we're going to latch it in an instruction register. We're going to call the write-enable signal of this register IRWrite, so we've added one control signal here. There's another control signal here, which is the write enable of the memory, and we have one memory to begin with. Does this make sense? So this is our first cycle, and the first cycle is defined by the control signals associated with it: IRWrite should be one, and all other control signals should do no harm. We have not constructed all of that yet; we're going to build the control logic later. So that was fetch; that was easy, and clearly this is common to all instructions; all instructions do this.

The second step is to read the register. The way we do it is: you take some bits from the instruction, which identify one of the source registers; this is the base register, if you think about it, that we'll use to calculate the address. That's bits 25 through 21. You supply them to the register file and get the data out of the first read port, because we supplied them to the first port, and you latch it into a register here. So we're adding registers, as you can see, so that at the end of a clock cycle we can store the output of what
happened in that clock cycle; that's the difference from the single cycle. We have not constructed the entire thing; we're doing it step by step. Of course, you also need to determine what to do in each cycle, but I'm assuming these are the things we've decided to do in each cycle for now. You need to assert the control signals to enable this, and there aren't that many here: the register file read is automatically enabled, there's no read enable; you just need to do no harm to the register file at this point, when you're reading a register for the load. And of course I'm not connecting other things that I don't need for now; we're going to use them later.

So what load word does is take the base register, which we latched over here, and concurrently take the bottom 16 bits of the instruction and sign-extend them; that's going to be our sign-extended immediate value that we add to the base register. But we're not ready for that yet; for that we need another cycle. In the next cycle, we take whatever was stored in this register (let's call it the A register, the register file's A-port register), send it to the ALU through a wire, send the sign-extended immediate to the ALU, and set the ALU control bits to do addition, because we compute base plus offset, if you remember. The ALU result comes out and we store it into this register, which is going to be the ALU output register. So that's our third cycle: one cycle fetch, the second cycle register read, the third cycle address generation. We're not done yet; we generated the address, so what do we do now? We're going to do something that we cannot do in a single-cycle microarchitecture: we take that address, loop it back into the address port of the combined instruction/data memory, and select that one in
this cycle, so that we actually access the memory with the address generated by the load instruction in this particular state. Earlier, in the fetch state, we used the program counter to access the memory; here we want to use the address that we calculated. So we need to add a multiplexer for that, and a control signal to control that multiplexer, which basically says: are we doing an instruction access or a data access? Makes sense? Now you can get away with using a single memory, as you can see; that's one of the beauties of multicycle microarchitectures. Of course, you cannot use this memory for both instruction fetch and data access in the same cycle; you can only do an instruction fetch in one cycle and a data access in some other cycle. How do you control which one you're doing? The control signals decide, and the control signals are determined by the control logic, based on which state you're in.

So now we read the memory: this is the address that we input into memory, we get the data after some time, and we latch it into a different register. That data is going to go to the register file in the final state of execution of the load: you take this data, put it into the data input port of the register file, take the destination register identifier, bits 20 through 16 (RT), and assert the RegWrite signal, because in this state we're going to write the data that we latched in the previous cycle into this register ID; that's what we're supposed to do at the end of the execution of the load. And that finishes the execution of the load; that's the entire datapath you need to execute the load. Well, not really: we forgot to increment the PC. So we're almost done with the load; memory access, writing the data register, address calculation, and so on took a number of cycles, as you can see.
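The single-memory trick above comes down to one multiplexer on the address port. Here is a tiny sketch of it in Python, using the IorD name from the H&H design (the addresses and contents are made up):

```python
# Sketch of the unified instruction/data memory with the IorD address mux.
# IorD = 0: instruction fetch (address = PC); IorD = 1: data access (ALUOut).

def memory_address(IorD, PC, ALUOut):
    """Select the memory address for this cycle."""
    return ALUOut if IorD else PC

mem = {0x00400000: 0x8C220004,   # pretend instruction word at the PC
       0x10010004: 42}           # pretend data word at a computed address

PC, ALUOut = 0x00400000, 0x10010004
instr = mem[memory_address(0, PC, ALUOut)]   # fetch cycle:       IorD = 0
data  = mem[memory_address(1, PC, ALUOut)]   # memory-read cycle: IorD = 1
```

Because the two accesses happen in different states, one memory port suffices; the control FSM guarantees IorD is 0 during fetch and 1 during the load's memory cycle.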
Now let's also increment the PC. Your book does a fancy thing here, which is why I follow it: it shows that you can use the same adder, the ALU that you used to calculate the address in a previous cycle, to also increment the PC. How do you do that? You take the program counter and add another multiplexer here, because we're going to input the program counter into the ALU; earlier we input the A register into the ALU, so you select between the program counter and the register. That's the ALU source A multiplexer. There's also an ALU source B multiplexer, which got expanded a bit more than load needs, which is fine because it's going to be needed by other things; source B can come from the B register, which we haven't used yet, but now we also need the constant four, and we already used the sign-extended immediate, so we can select between the sign-extended immediate and four.

To increment the PC: ALU source A selects the PC, which is encoding zero; ALU source B selects four, because we're going to do PC + 4; ALU control is set to addition; and we take the ALU result and put it into the program counter. So we need a PCWrite signal to write-enable the program counter register. To be able to increment the PC, we jump through a lot of hoops, because we didn't want to add another adder. This is a design choice that the designer of this processor made. I wouldn't make this choice, and the LC-3b designers didn't make it: in Appendix C, if you look at it, they have an incrementer for the PC that can operate concurrently with everything else. There's more complexity here because of this design choice, but it's a perfectly valid design choice; in the end, they avoided adding another adder.
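The two ALU-source multiplexers can be sketched as simple select functions. The select encodings below follow my reading of the H&H figure (A: 0 = PC, 1 = register A; B: 00 = register B, 01 = the constant 4, 10 = SignImm, 11 = SignImm shifted left by 2), so treat the exact codes as assumptions:

```python
def alu_src_a(ALUSrcA, PC, A):
    """0 selects PC (for PC+4 and branch target), 1 selects register A."""
    return A if ALUSrcA else PC

def alu_src_b(ALUSrcB, B, SignImm):
    """00: register B, 01: constant 4, 10: SignImm, 11: SignImm << 2."""
    return {0b00: B, 0b01: 4, 0b10: SignImm, 0b11: SignImm << 2}[ALUSrcB]

# PC increment in the fetch state: ALUSrcA = 0 (PC), ALUSrcB = 01 (constant 4).
PC = 0x00400000
new_PC = alu_src_a(0, PC, A=0) + alu_src_b(0b01, B=0, SignImm=0)
print(hex(new_PC))   # 0x400004, i.e. PC + 4
```

The same two muxes later feed the address calculation (A + SignImm) and the branch target (PC + SignImm << 2), which is exactly how one ALU gets reused.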
Makes sense? So it's beautiful: as you can see, you can get away with a single adder, as long as you control it so that you do the right things at the right times. And don't forget to increment the PC.

That was load. We're done with the load, but load also enabled us to fetch an instruction, decode an instruction, and increment the PC, so we don't need to add any of those for the other instructions. Now let's take a look at store. For store, we know how to access memory. Basically, we do everything very similarly to the load: we generate the address, but at the end we need to access another register, the store data register, using bits 20 through 16, and store it somewhere. When we access the base register, we also access this register, get the data value, and write it into memory at the specified address, in the cycle after the address calculation happens, and we need to assert the MemWrite signal. This is very similar to what we did in the single cycle when we moved from load to store, if you remember. We have almost the entire datapath; we just need to read another register, which holds the data to be written by the store to memory, and we need to provide the datapath element and storage for it, this register. Then we take it and write it to memory at the address that is computed the same way the load's address is computed. I'm not going to compute the same address again, but you already have the datapath elements, and essentially this address, the ALU result over here; you just need to assert the MemWrite signal and set the IorD signal accordingly. So store is very similar to load, except you write to memory. Sounds good? Okay, now let's
look at R-type instructions. That was store word; R-type changes the datapath a little bit again. I will not go through the entire thing, because otherwise it would take forever, but basically you need to read from both registers, RS and RT, and you also need to write to RD, the destination register. So what does an R-type instruction do? It reads two registers, these two registers in this particular case; we already have the datapath elements because of store. Then we send the second register to the ALU as well; store didn't need that, since in store the second register goes to memory. Here we're doing, for example, add R1, R2 and store the result in R3, so we get one register here into the ALU and the second register here into the ALU, and in that state you do the calculation and store the result in this register.

Now, in the writeback stage of an R-type instruction, we take the result of the ALU and we need to put it into the register file. Before, we were writing something else into the register file: the load was writing the loaded data into the register file. So you need to add this multiplexer to decide, and this multiplexer is very similar to the MemtoReg multiplexer that we had in the single-cycle design, if you remember: you either take the ALU output and put it into the register file, or you take the memory output. It's essentially the same multiplexer, except it multiplexes between this register and this register: one register stored the data that we read from memory in the load instruction, and the other stored the data that we calculated as the result of the ALU operation. And then we already had this mux in the single-cycle design too, because of the specific encoding in the ISA: in R-type instructions, remember, the destination
register is really bits 15 through 11, as opposed to 20 through 16, so you need to select bits 15 through 11 as your destination register; in I-type instructions it's bits 20 through 16. This is based on the encoding. So it's very similar to what we had; we just need to build the datapath again. Sounds good?

Okay, let's do branch. Branch is also going to be interesting. You need to provide support for two things here: one is target address calculation, and the other is branch condition calculation. We already have most of this. Now, the interesting thing is that your book uses the same ALU to do both the branch condition calculation and the target address calculation. This is another design choice that the Patt and Patel book makes differently: they have a special adder for branch target address calculation, while here everything goes through the ALU.

Let's first figure out how to calculate the branch condition. Remember the BEQ instruction in the MIPS ISA: it checks whether two registers, source 1 and source 2, are equal. You read these two registers, they're stored in A and B, and we already have the datapath elements for that. You check whether they're equal: ALU control is set to subtraction, for example, and then there's a Zero signal coming from the ALU saying whether the result is zero, which means they're equal. This signal is asserted if they're equal, and Branch is asserted when you're executing this instruction; so if they're equal and Branch is asserted, then through this AND gate the PC write is enabled, and you can write to the PC whatever you compute in that cycle. This assumes the ALU can do a lot of operations, as you will see. So the calculation of the branch condition is done this way, through the ALU.

The target address calculation is also done through the ALU in your book. The way you calculate the target address is: you take PC + 4, which was stored over here at some point; so based on the
design of the finite state machine, you already stored PC + 4 in your PC. You take that and input it into the ALU in a different cycle; you need a different cycle because otherwise the ALU would have to do this computation and that computation at the same time, or you'd need two ALUs, or you'd replicate the capabilities of the ALU. You take PC + 4, add to it the sign-extended immediate left-shifted by two, selecting the right inputs with these two multiplexers, and the ALU result gets latched over here; that ALU result is then input into the program counter by selecting the right PC source. Makes sense?

So you can use this datapath, but you need to make sure these actions are done in the appropriate cycles. There needs to be a finite state machine that says: I'm calculating the branch condition this cycle and storing it here; I'm calculating the target address in the ALU in this cycle and storing it here; and then you actually update the PC after all of this is done. There needs to be an orchestrator, which is really the control unit, that goes through the states, state by state, and sets the appropriate control signals to make sure this datapath does what we want. But with this datapath you can actually do it.

And this is the complete multicycle processor in your book; your book goes through it similarly to what I did, and if you're interested you can study it. What we have done is add a bunch of extra registers; these registers did not exist in the single-cycle design. The upside is that we have only one memory and only one ALU and adder. So we've kind of stretched the limits, and with only one ALU it did everything, but of course with only one ALU you may increase the number of states in the finite state machine for different instructions: when the ALU is computing PC + 4 it cannot do anything else, and when it's doing the branch condition calculation it cannot do the target address calculation at the same time.
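The two ALU uses of BEQ can be captured in a few lines. This sketch assumes the standard MIPS semantics (condition via subtraction and the Zero flag; target relative to PC + 4, with the immediate sign-extended and shifted left by two):

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm - 0x10000 if imm & 0x8000 else imm

def beq_next_pc(A, B, PC_plus_4, imm16):
    """Model BEQ: one ALU pass for the condition, one for the target."""
    zero = (A - B) == 0                                # ALUControl = subtract
    target = PC_plus_4 + (sign_extend16(imm16) << 2)   # second ALU pass
    # PC is written with the target only if Branch AND Zero; otherwise
    # the already-incremented PC (PC + 4) stands.
    return target if zero else PC_plus_4

print(hex(beq_next_pc(5, 5, 0x00400004, 0x0003)))  # taken: 0x400004 + 12
print(hex(beq_next_pc(5, 6, 0x00400004, 0x0003)))  # not taken: 0x400004
```

In the real datapath these two computations happen in different cycles on the same ALU, which is exactly why the control FSM needs separate states for them.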
Hopefully that makes sense. Okay, now let's construct the multicycle control logic. We have constructed the datapath and added the control signals; these blue things are the control signals. Now we can construct the control logic. I'll go through it relatively quickly, but essentially we're going to construct the finite state machine. This is going to be our controller: we supply the opcode to it, and there's a finite state machine that determines all of these control signals, including the ALU operation. There's also a funct field, if you remember, in MIPS, that determines what the instruction should do; I'm ignoring it here because we covered it earlier. These are multiplexer select signals and also register write-enable signals.

So let's start with fetch. We've already seen this, but you need to do this after you design the complete datapath, of course, because you need to make sure you set every control signal to an appropriate value. In this case, fetch looks like this: IorD needs to be zero, so that we select the program counter and input it into the memory address port. We don't write to memory, so MemWrite should be zero. Clearly IRWrite should be one, meaning we're going to write to the instruction register. And we're going to update the program counter, so that should be enabled, meaning the PCWrite signal should be one. This is not a branch; when you are fetching there's clearly no branch, so Branch should be set to zero. And PCSource should be set to zero, so that we take the ALU result. So this is the fetch state: you're fetching the instruction and
you're also incrementing the PC, because incrementing the PC happens this way, as you can see: you take the PC, which was here, add four to it, and write the result into the PC. So the control signals needed are those to access memory as well as to increment the PC. What are the increment-the-PC control signals? ALU source A should select the PC, as we discussed earlier; ALU source B should select four, as we discussed earlier; the ALU operation should be add, 010, which is the code for add; and we should take the ALU result and send it to the PC, with PCWrite already enabled. Does that make sense? That's the fetch state. You need to do this for every single state, which clearly we're not going to do because we don't have time, but there's no magic here, as you can see. Your book is not complete, because it doesn't list all of the control signals here, but you can actually figure them out on your own.

So that's fetch: we fetched the instruction, and it's in the instruction register right now. Now we go to the decode state. The decode state is relatively simple: you read the registers, because it does no harm, assuming they're needed, and otherwise you basically don't do anything. In your book, the decode state is kind of empty except for "do no harm", meaning we don't write to any register; we always read from the register file in this design anyway, and those read-side registers don't have write-enable signals.

For the next state, let's assume that the opcode is load word or store word; we jump to this state. Both of them go through a memory address calculation stage, clearly, so we consolidate the finite state machine for both of them into the same state. This is the address calculation stage: we set ALU source A to come from the register file, the base register, and
and ALUSrcB to 10, which selects the sign-extended immediate, and we set the ALU operation to add; the result gets stored here, and there's no write enable signal needed for this register. So you can see we're now actually using the data path by setting the control signals accordingly, and all of the other control signals are set so that we do no harm. For example, we don't update the program counter and we don't update the instruction register — that would be terrible, because we're still processing this instruction; we should not update the instruction register until we're done with it. Some of the signals are X's, as you can see — don't-cares. If you're not updating memory, you don't care what address it sees; if you're not updating the program counter, you don't care what this multiplexer selects, and so on. So it's very similar to what we did in the single cycle design, except now we do it state by state. Okay, so this is the complete load word state machine. I will not go through all of it, but after the address calculation, if the opcode is load word, you go to a memory read state, which has some control signals you can also work out — I'd recommend doing it on your own if you're interested. IorD needs to be 1 so that you actually take the address from the ALU output, just like we built the data path. Then you go to a memory writeback state, meaning you write the data value you got from memory back into the register, again just as we built the data path, and you set the control signals accordingly. This is the complete load word execution: 1, 2, 3, 4, 5 cycles, as you can see — five states in the finite state machine. But we also played some tricks to merge states: we decoded, and then we said if the opcode is load word or store word we jump to this shared state, and then if we're in this state and the opcode is load word we jump to that state. So there needs to be some sequencing logic that determines which state you go to from the current state, and we've already seen that in finite state machine design. Okay, store: after calculating the memory address, if the opcode is store word you go to a memory write state, if you will. Here IorD is set to 1 again, because the address is coming from the ALU result, not from the PC, and we set MemWrite to 1 so that we can write to memory — MemWrite is not set to 1 when we are reading from memory. Store word doesn't need an extra state after that, because a store is done once you write to memory; there's nothing you do with the registers after that point. So store word takes 1, 2, 3, 4 cycles. Now, we could keep doing this for all instructions. If the opcode is R-type, after decoding you go to an execute state, you set the appropriate signals so that the ALU gets the right inputs, you set the ALU operation, and so on, and after the ALU generates the result you write it back to the destination register by going into another state — you set the control signals accordingly, just like we did. So I'm not going to do that again, but you can convince yourselves that this works. Okay, that was R-type, and then there's BEQ. BEQ is a little bit more complicated, but you can also work it out; I'll leave that to you. There's only one extra state that enables BEQ, and there's actually one trick your book plays — let me get the ordering right: it first calculates the target address, and then it determines the branch condition. You can see this from the control signals: one set of control signals is really calculating the target address, and the other set checks whether the branch condition holds — whether the two registers are equal — and if so, you take the target address you've calculated and place it into the program counter. To calculate the target address, it sets ALUSrcA to 0, which selects the PC — the already updated PC, because we updated the PC in the fetch state, if you remember — and ALUSrcB to 11, which is the sign-extended immediate shifted left by two, and the ALU does an addition. Then in the next cycle it determines the branch condition. Okay, a question — I cannot hear you very well, can you speak up? The question is: we fetch the instruction in one state and the next state uses that instruction, so it has to be stored somewhere — where do we store it? We store it in the instruction register, so it's here. Slide 40? Okay, let's go back to slide 40. So we store it in the instruction register — this is the instruction register — and you never update it until you're done with the instruction. That's also why you cannot do this in a single cycle machine: in a single cycle machine there is no instruction register. But yes, the instruction bits need to remain — everything you need to use later needs to remain — and that's why you store them in registers. Okay, great. So that was BEQ. Now, we didn't do I-type instructions, but I-type instructions like ADDI you can also add here; they require control signals again, and you can go through it and figure it out. Basically, this is the construction process. To do a jump, you need to extend the data path a little bit.
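To make the state-by-state construction concrete, here is a small Python sketch of the control FSM we just walked through. The signal names (IorD, IRWrite, ALUSrcA/B, MemWrite, PCSrc, ...) follow the H&H Chapter 7 convention, but take the exact encodings as illustrative — they are my reading of the figures, not a verified table, and signals omitted from a state are 0 or don't-care.

```python
# Sketch of the multicycle control FSM: per-state control signals plus the
# sequencing logic that picks the next state from the current state and opcode.
CONTROL = {
    "FETCH":    {"IorD": 0, "IRWrite": 1, "ALUSrcA": 0, "ALUSrcB": "01",  # PC + 4
                 "ALUOp": "add", "PCSrc": 0, "PCWrite": 1},
    "DECODE":   {},                                  # read registers, do no harm
    "MEMADR":   {"ALUSrcA": 1, "ALUSrcB": "10", "ALUOp": "add"},  # base + imm
    "MEMREAD":  {"IorD": 1},                         # address comes from ALU out
    "MEMWB":    {"RegDst": 0, "MemToReg": 1, "RegWrite": 1},
    "MEMWRITE": {"IorD": 1, "MemWrite": 1},
    "EXECUTE":  {"ALUSrcA": 1, "ALUSrcB": "00", "ALUOp": "funct"},
    "ALUWB":    {"RegDst": 1, "MemToReg": 0, "RegWrite": 1},
    "BRANCH":   {"ALUSrcA": 1, "ALUSrcB": "00", "ALUOp": "sub",   # rs == rt?
                 "PCSrc": 1, "Branch": 1},
}

def next_state(state, opcode):
    """Microsequencing: determine the next state from current state and opcode."""
    if state == "FETCH":
        return "DECODE"
    if state == "DECODE":
        return {"lw": "MEMADR", "sw": "MEMADR", "beq": "BRANCH"}.get(opcode, "EXECUTE")
    if state == "MEMADR":
        return "MEMREAD" if opcode == "lw" else "MEMWRITE"
    if state == "MEMREAD":
        return "MEMWB"
    if state == "EXECUTE":
        return "ALUWB"
    return "FETCH"   # MEMWB, MEMWRITE, ALUWB, BRANCH all return to fetch

def cycles(opcode):
    """Count states visited until the FSM returns to fetch."""
    state, n = "FETCH", 0
    while True:
        n += 1
        state = next_state(state, opcode)
        if state == "FETCH":
            return n

print(cycles("lw"), cycles("sw"), cycles("beq"))  # 5 4 3
```

The cycle counts fall out of the sequencing exactly as in the lecture: load word takes five states, store word four, and BEQ three.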
Your book has this nice description: if you want to do jump, you extend the data path slightly. How does jump calculate the address? If you remember, we take the bottom 26 bits from the instruction register, shift them left by two, and concatenate the result with the top bits of the PC — all of this is done in this logic — and then we update the program counter. So there's now another way of updating the program counter, which means extending the multiplexer and also widening the PCSrc control signal. We did this in the single cycle processor as well: if you want to add instructions, you add data path elements and potentially also new control signals, and then you need to update the control signals in all of your existing states accordingly. So if you forgot an instruction, no worries — you just add the data path elements, add the control signals, and make sure all of those control signals are set correctly in all of your states so that you do the right thing. That's one of the other beauties of this machine: this finite state machine structure lets you modify things easily. Okay, so that's the construction. Maybe I took a bit more time than I really wanted to on this topic, but we have now constructed a multicycle MIPS microarchitecture, which is a similar process to constructing a single cycle microarchitecture, but a little different because now we break things into multiple cycles. This was our single cycle MIPS microarchitecture and the thinking behind it, and this is our multicycle microarchitecture. They look very similar, but now we have a lot of registers to store things — like the instruction register, pointed out by your fellow student. If you want to learn more, chapter 7.4 does a nice job here. Now let me ask you one question. One issue is that this design assumes something about memory — and we assumed this in the single cycle design too — it assumes memory takes a single cycle. In reality, memory doesn't take a single cycle; memory takes multiple cycles. And there are memory access states — the memory read state here, the memory write state, even fetch — where we assume memory takes one cycle. So how do you accommodate multicycle memory? It's actually very easy in this design: if your memory has a signal saying "I'm not ready, I haven't supplied the data yet," then you stay in the same state. You need to be a little careful — you don't want to update the program counter multiple times; there's a danger there — but you stay in the same state until the memory provides a value. So this is a conditional state transition, and memory needs to provide some bit as input to the control logic that determines the next state. You've actually seen this before: when we looked at the LC-3, there's a memory access state in which you wait until memory is ready. If memory is not ready — that's R-bar, the complement of R — you stay in the same state; if it's R, you go to the next state. And there are a lot of memory accesses in a real instruction set architecture — another memory access, another memory access, another memory access — and you do the same thing in all of them. That's a more realistic way of doing it; your book just assumes it's one cycle. Okay, this next part is something I will not go through in detail, but I recommend looking at appendix C of Patt and Patel. What they introduce in that design is slightly different, but the principles are the same — it's multicycle
etc., but it's also a bit more structured. Let me quickly describe it: it's a microprogrammed design. Again, I will not build it or go through it in detail, but I'll give you the key idea. The control signals associated with the current state are called a micro-instruction, and the act of transitioning from one state to another — which requires determining the next state and the micro-instruction for that next state — is called microsequencing. There's a control store, which stores the control signals for every possible state: it's like a small memory that holds the micro-instructions for the entire finite state machine. You can think of it as a program, and the microsequencer determines which control signals will be used in the next clock cycle. In a sense, what we're doing as hardware designers, after we design the data path and the control signals, is microprogramming the machine. You're not programming it at the ISA level; you're programming it at the control signal level. The program is really stored in that control store: your finite state machine describes your program, your control store stores your program, you sequence from micro-instruction to micro-instruction, and each micro-instruction determines the control signals in each state. If you think about it this way, it becomes very interesting — we're programming a machine at the very lowest level. It's called microprogramming, and it enables a lot of interesting things. So this is what the Patt and Patel book shows: you have a microsequencer that determines the address of the next state — you can also think of it as the address of the next micro-instruction. Using that address you access the control store, which gives you the micro-instruction, i.e., the control signals to use in the next state, and that micro-instruction is supplied to the data path so that the data path does whatever is needed in that state. And while the data path is working, you concurrently microsequence to the next micro-instruction. This is a full system they design, including memory and I/O, which is pretty powerful. So essentially the control signals for the current state control processing in the data path as well as generation of the next micro-instruction, and the data path and the microsequencer operate concurrently. I've harped on this multiple times, so I'm not going to ask that question, but this is for your own study — it's beautiful. I will not cover this particular slide; you can find it in appendix C, and you will see that there is a control store — this is actually the program, this is how you program the control signals of a machine. Each entry in the control store is a micro-instruction corresponding to a finite state machine state, and you use the address of the state — the address of the micro-instruction — to get the relevant micro-instruction. Before I conclude, let me say a couple of things about why this makes the data path so powerful, and why I'm telling you this. How do you design a machine? You design the data path, you design the control signals, and then you go fill out this table. For example, this is the finite state machine fragment for the load instruction that someone designed — very similar to what we designed earlier, give or take a couple of things — this is the microsequencer someone designed, and now you go and fill out this table. This is your microprogram; these are your micro-instructions. So this enables an abstraction: earlier we went directly from assembly programming to control signals, and now we introduce another abstraction in between, which is micro-instructions. The hardware designer has this microprogramming abstraction, and this way the designer can translate any desired operation into a sequence of micro-instructions. All the designer needs to provide is three things: the sequence of micro-instructions needed to implement the desired operation; the ability for the control logic to correctly sequence between micro-instructions — that's microsequencing; and any additional data path elements and control signals needed. If your data path and control signals are already powerful enough, you don't need anything new: you can implement new instructions internally as new micro-instruction sequences, and you don't need to tell anyone. And that brings us back to something we looked at earlier — remember the semantic gap picture? You have a high-level language and an ISA with complex instructions, complex data types, and complex addressing modes, and these get translated into micro-instructions in existing processors. Existing processors actually operate this way: a complicated ISA like x86 gets translated — using this hardware translator, the microsequencer I went over quickly — instruction by instruction into sequences of micro-instructions, and any instruction can be translated into many micro-instructions. Let me cover this slide and then we'll take a break. There are other advantages to this. You can take an ISA, translate it to microcode, and implement things in different ways; you can have a minimal data path that emulates the ISA, and you can think of the microcode as a user-invisible ISA, as we've been thinking of it. It also enables easy extensibility of the ISA: you can support a new instruction by changing the microcode — someone creates a new instruction, and you figure out how to include it by changing the microcode itself. You can support more complex instructions as a sequence of simple micro-instructions: for example, string copy can be supported as a sequence of smaller operations even though you don't have a native string copy operation — you change the microcode to do this — and multi-dimensional array updates can be supported similarly. It also enables updates of machine behavior: if you have a buggy implementation — a bug in your processor that gets discovered in the field, say by some smart security person — you can fix that bug. You basically say, "this particular instruction is buggy; I'm going to execute it a different way," by changing the microprogram. This requires changing the microcode, of course — these are called microcode updates — so you need a programmable microcode engine for this. Okay, this is where I'll stop; let's be back when the bell rings, and we'll continue with pipelining and a lot of ideas. If you have questions, let me know. Is there anything good online? Okay. Okay, so now — I covered microprogrammed control relatively quickly, but this is really for your benefit, because I think it's a very nice and clean way of designing a system. You have clean microcode, and you can update it even after the processor is in the field — you can patch the microcode. So now you know what it means when, for example, a major company releases a microcode patch: what they're doing is updating the micro-instructions in that control store table, meaning they're updating how they process instructions using these control signals. And the reason they can do that is that not everything is completely hardwired — there's a layer of translation like this that gives them flexibility to translate from the ISA to micro-instructions, and there's some programmability in this layer, essentially. And what we did, when we were determining the control signals for every state in the finite state machine, was really microprogramming the machine — we didn't call it microprogramming, but that's what it was. You set the control signals; somebody does that, and if there's a bug they fix it and release a microcode patch, you download it, and your processor works more correctly — or with higher performance, in some cases. So the ability to update or patch microcode in the field enables the ability to add new instructions without changing the processor — which is very nice, because of that indirection we discussed — the ability to fix buggy hardware implementations, and the ability to provide mitigations for security concerns, for example. There are many historical examples of this. Early IBM machines had microcode stored in main memory, and they could update it after a reboot; similar things happened elsewhere, and there are very nice papers that talk about it. IBM calls this millicode, and they describe how they do it in their Z series processors. Z series processors are pretty heavy machines — they're really huge, they cost millions of dollars, big institutions buy them and use them for many, many years, and they may need to update the microcode many times over those years. Another interesting thing: in some processors you could actually update the microcode while the processor was running. This is very similar to a regular program, if you think about it — you can change a program while the processor is running, clearly; that's nothing new — but here updating the microcode is really changing how the hardware behaves, and you could do that too. So it was a more user-microprogrammable machine, and there are interesting papers on that as well. And as I already said, systems today use
microcode patches to fix hardware bugs and issues. There's more — if you really want to learn about microprogramming, there's a longer lecture that covers it — but I'm not going to cover it further here. Instead I'm going to ask the question: can we do better? We've designed the multicycle MIPS processor and the multicycle MIPS finite state machine, and we also said that even though your book doesn't handle multicycle memory access, we can handle it very easily. So we can handle a lot of realistic things relatively easily with this multicycle machine. Compared to the single cycle machine, this is like heaven — you're really dealing with real things now. Single cycle is kind of an exercise, but it's actually going to be useful in what I'm going to say soon, which is pipelining. So, can we do better? What limitations do you see with the multicycle design? I've already given you some of this, so I'll go through it relatively quickly. One limitation is hardware cost, of course, and unfortunately we're going to add more and more hardware cost to our machines. Later — maybe in lecture 18 or so — we'll look at paradigms where hardware cost is reduced, but that will come at the expense of lower performance. The other limitation, limited concurrency, is something we can fix by adding more hardware. If you have carefully followed what we have done: we added hardware resources — fewer than a single cycle machine — but we're not using those hardware resources all the time. Some hardware resources are idle during different phases of the instruction processing cycle; depending on which state you're in in that finite state machine, you're using only a small fraction of the resources — and it can be a very small fraction. For example, the fetch logic — the logic you added to the data path for fetching instructions — is idle while an instruction is being decoded or executed, or while it's calculating a memory address; fetch logic is used only when you're fetching instructions. Most of the data path is idle when a memory access is happening: you're just accessing memory and waiting, and everything else in the data path is idle. Even though those resources are there, they're waiting for the memory access to finish so that they can be used. So people asked this question: can we use the idle hardware to improve concurrency? The goal is to get more concurrency — to utilize a bigger fraction of the machine. And what is more concurrency? More concurrency means you do more things at the same time, and this gives you higher throughput — higher instruction throughput, in other words more work completed in one cycle. Okay, I think I've given you the idea here, but essentially the idea of pipelining is very simple: when an instruction is using some resource in its processing phase, process other instructions on the idle resources that instruction doesn't need. For example: when an instruction is being decoded, fetch the next instruction; when an instruction is being executed, decode the next instruction; when an instruction is accessing data memory, execute the next instruction; when an instruction is writing its result into the register file, access data memory for the next instruction. This way you can have many instructions in the machine at the same time. Does that sound good? Even better: the idea is to potentially have, let's say, six instructions in flight — it depends on how many stages you break this down into. So it looks kind of like this: one instruction is writing its result, another is doing ALU computation, another is accessing the register file, others may be fetching, and another instruction is updating data memory, which is not shown
here. All these different colors are different instructions. Of course, we need to be a little more careful, for reasons we're going to study in this lecture and the next one — maybe even the one after that. Does it sound like it will enable better performance? It will, because as opposed to doing one thing at a time — only one red thing — you're doing many different colorful things at once. Okay, so let's go into pipelining. This is actually an idea that's used in factories a lot — I'm going to show you a picture. You pipeline the construction of a part, a car for example: the piece starts from somewhere, somebody does something to it and passes it along to the next person, that person does something to it, makes it bigger, passes it to the next person, and so on — you pipeline the construction of the car. Today robots do a lot of that; robots actually pipeline the construction of a car. Essentially we're going to do something similar, systematically: pipeline the execution of multiple instructions. The analogy is assembly line processing of instructions, as in factories. The idea: divide the instruction processing cycle into distinct stages of processing; ensure there are enough hardware resources to process one instruction in each stage, for whatever that stage does; and process a different instruction in each stage — not the same instruction, a different one — with instructions consecutive in program order processed in consecutive stages. The benefit is that this increases instruction processing throughput. We defined CPI as part of the performance equation, if you remember: execution time equals number of instructions times average cycles per instruction times clock cycle time. How do you improve throughput? Forget about the number of instructions — we cannot do much about that at the microarchitecture level. Clock cycle time is one way of reducing execution time, but if you actually process more than one instruction per cycle, you're really reducing the cycles per instruction on average, because you're overlapping multiple instructions at the same time. So you're really improving IPC, which is the inverse of CPI — IPC is instructions per cycle, CPI is cycles per instruction. Okay, so clearly there's a downside, and we'll start thinking about it — we're going to see a lot of downsides. One is going to be more hardware: we're going to add more hardware to enable this sort of processing, and there will be others, as you will see. Does that sound good? Yes? Okay, I see some heads nodding. Let me give you the benefits pictorially, because a picture is worth a thousand words — maybe — in this case. This was multicycle: I'm assuming four cycles per instruction and four adds, for example. Each add takes four states or four cycles — fetch, decode, execute, writeback — and then you start the next add: fetch, decode, execute, writeback, and then the next. You can see that we serialize the execution of four different adds. When you do pipelining, this is what happens instead: while you're decoding the first add, you're fetching the second; while you're executing the first add, you're decoding the second and fetching the third; while you're writing back the result of the first, you're executing the second, decoding the third, and fetching the fourth. It's kind of beautiful like this — assuming the instructions are not dependent on each other, which is something we will discuss — but you can see that our throughput has increased: we complete four instructions in four cycles, in other words one instruction per cycle. In the steady state there are four instructions concurrently in the machine — steady state means when the pipeline is full with instructions; this is the steady state picture. The key takeaway is that you're completing one instruction every cycle, whereas before you were completing one instruction every four cycles. That's the throughput improvement. It improves your throughput, and it improves your execution time in this case — you can calculate the improvement: here it's 16 cycles, here it's 4, 5, 6, 7 — seven cycles. But it doesn't improve the latency of a single instruction: a single instruction still takes the same amount of time, as you can see right here. Each instruction was taking four clock cycles — assuming the clock cycle is the same in multicycle and pipelined, which may not be the best assumption, but we'll assume for now they're close enough — and each instruction still takes four cycles in the pipelined machine. So we didn't change the latency of the instruction, but we improved the throughput of the machine and the execution time of the entire program by overlapping the execution of different instructions. This is a critical distinction between latency and throughput: the latency of an individual operation may not improve, but the throughput of the overall work your program does may improve by overlapping the latencies of different operations. That's the power of overlapping latencies. Okay, so the question is, of course: is life always this beautiful? This looks beautiful — this is perfect pipelining, kind of, because we can do exactly what I show you here, and this doesn't happen if the instructions are dependent on each other, for example. So we're going to look at when life becomes not as beautiful, and we're going to try to solve those issues. Makes sense?
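The 16-versus-7-cycle comparison above can be sketched as a tiny calculation. This assumes an idealized 4-stage machine (fetch, decode, execute, writeback) with equal clock periods and no dependences between instructions:

```python
# Multicycle vs. pipelined cycle counts for n independent instructions.

def multicycle_cycles(n_instructions, stages=4):
    # One instruction must finish all its stages before the next starts.
    return n_instructions * stages

def pipelined_cycles(n_instructions, stages=4):
    # The first instruction fills the pipeline (stages cycles),
    # then one instruction completes every cycle after that.
    return stages + (n_instructions - 1)

print(multicycle_cycles(4))  # 16 cycles for four adds, serialized
print(pipelined_cycles(4))   # 7 cycles: per-add latency is still 4,
                             # but steady-state throughput is 1 add/cycle
```

Note what changed and what didn't: each instruction still occupies four cycles of latency, but for large n the pipelined machine approaches one completed instruction per cycle, which is exactly the IPC improvement described above.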
Okay, so another book — the Hennessy and Patterson book — has another analogy that I like: the laundry analogy, which you may empathize with. Laundry is a pipelined process. You have a bunch of loads, and each load takes some time: you put it into a washer, then into a dryer, then you fold it, then you put it into the cabinet. Assuming each step takes 30 minutes here, you finish one load in two hours, and then you start the next load. The multicycle machine looks kind of like that — you wait until you finish the entire load before you use the washer again. This doesn't make sense if you want to save time, because you could already be using the washer for the next load. You can read this on your own, but basically: the steps of a single laundry load are sequentially dependent on each other, but there's no dependence between different laundry loads, and different steps do not require the same resources either — the washer and dryer are completely disjoint. Of course, you could have a combined washer-dryer — a single machine that does both washing and drying, and there are machines like that — and those machines are terrible for pipelining, and they're actually not good for latency either. I have one, so I know. It's good to have a separate washer and a separate dryer, frankly — specialization. There are multiple reasons: you cannot make a combined washer-dryer as good at drying as a dryer specialized for just drying. Very similar trade-offs happen in architecture: you can much more easily design an adder specialized for addition than an adder-plus-multiplier that's good at both. Anyway, I digress. So this is how every one of you does laundry — hopefully not — if you have four loads: done this way, you finish one load every 120 minutes, or two hours, based on the parameters I just described. But if you pipeline, which is what reasonable people would probably do, then after you finish washing the first load and start drying it, you concurrently wash the second load. So you do four loads of laundry in parallel in the steady state — which happens to be over here in this case — and this way you can start and finish one load every 30 minutes. You increase your throughput by 4x, as you can see; even though the latency of every single load did not decrease, the latency of the entire four loads decreased significantly. Makes sense, right? It's the same example I gave you — instead of adds, now you have laundry loads. And I've already said all of this: in this particular case you don't need additional resources. You may need slightly more time to switch from one load to another, but that's an additional overhead we're going to see soon. Okay, in practice, that damn dryer always takes long, right? Who here suffers from the dryer — which step is the longest, dryer or washer? Nobody does laundry here anymore; robots do laundry. For me it's the dryer — the dryer is always long. Especially a high performance dryer, which consumes a lot of energy and a lot of time. So with that dryer you do one load every 150 minutes end to end: the dryer alone takes one hour, as you can see. Now, how do you fix this problem? This is essentially a problem you have in real machines too — addition takes long, multiplication takes long, for example. How do you deal with it? What this dryer has done to us is make our life worse for pipelining as well: as opposed to finishing one load every 30 minutes, now I finish one load every 60 minutes, because I have a slow dryer, and the slow dryer is really my bottleneck. If you think about it, I cannot start drying the second load until the first load finishes in the dryer. So your dryer is really the
Okay, so how do you fix this problem? Anybody? Yes, a second dryer. There are multiple fixes, and a second dryer is one of them. I already said this: the slowest step, the dryer, decides the throughput. So buy a second dryer, dryer A and dryer B, and alternate between them for consecutive loads. This is something you do in real hardware too: adder A and adder B, multiplier A and multiplier B. The same concepts from life apply nicely to architecture. That's one solution; the other solution is to make your dryer fast, so you spend a lot of effort to get a fast dryer. Either way your cost increases; your cost essentially increased over here. You basically restore the throughput using two dryers. Okay, this is just fun for you: these are automobile assembly lines, and one of the main things this was tested on was a particular part. This is a Ford assembly line from a long time ago, the early 20th century, before everybody here was born, including myself. They had this component that supplied ignition energy to the engine, before generators became common, and they called it the magneto. It was a pretty complex and innovative component at the time, and they had to build an assembly line to assemble it. I'm not going to go through everything over here, you can read it clearly, but look at the numbers: on average a worker could assemble 35 of them in a 9-hour shift, or roughly one every 15 minutes. So there's a throughput and latency story here: by moving to assembly lines like this, which are essentially pipelines, they were able to reduce the assembly time from 15 minutes to 5, and the required workforce decreased from 29 to 14. With robots you can probably do even better. But I like the magneto for another reason; does anybody know the magneto for some other reason? No? Nobody follows X-Men? Am I still too old? In X-Men there's a Magneto, and it's actually inspired by this magneto, the particular device used in early cars. Okay, anyway, this is more like a regular automobile assembly line today. Okay, a real computer that had pipelining was the IBM Stretch, back in the 1960s, and you can see that computers were a little bit larger at that time; good luck carrying this in your pocket. Also a little bit more expensive, as you can see. But pipelining is real and it has been used for a while. Okay, now let's talk about what it means to have an ideal pipeline and how a modern microprocessor pipeline is not ideal. Essentially our goal is to increase throughput with little increase in cost; we just saw that if you have a slow component, you need to increase cost. Ideally we want to not increase the hardware cost, and we also don't want to increase latency; we're going to talk about that as well. So ideally the same operation should repeat many, many times, on a large number of different inputs; for example, all laundry loads go through the same steps. Ideally the operations should be completely independent of each other, like there being no dependencies between different loads in your laundry; that would be nice. And ideally you should be able to uniformly partition the work into suboperations, dividing the processing into uniform-latency suboperations that do not share resources. In the laundry example we had 30 minutes for each step, which is uniformly partitionable, but ideally we would like it to be even more partitionable; a 5-minute dryer would be nice, right? Okay, so doing laundry pipelines well, and the assembly line pipelines well. What about the instruction processing cycle? We're going to talk about that, but first let's look at this ideal pipelining picture. You have some combinational logic, as in a single-cycle kind of processor, and you store the result in registers; there are input registers and output registers, and they can be the same. Let's assume the combinational logic takes T picoseconds to process an instruction, so the throughput is 1/T. If we divide it into two stages, ideally we would like the throughput to be 2/T, and if we divide it into three, 3/T. Now this is not realistic, because we cannot divide the work that cleanly. When we look more closely, within each cycle there is some amount of useful work and there is also a register, or sequential logic, delay. So when we divide T, even if we can divide it equally into k stages, the register delay stays the same: you still need to latch values into a register at the end of every clock cycle. So your throughput is really dictated by this equation over here, 1/(T/k + S), where S is the sequencing overhead. That's still idealized, assuming you can divide T exactly, but the point is that you cannot get rid of S, the sequential logic delay. As k goes to infinity, T/k goes to almost zero, maybe one gate delay, so your maximum throughput is dictated by the sequencing overhead of the registers: it approaches 1/S. Always keep this in mind: register delay limits throughput, and that's your limiter. Why don't we pipeline down to very small amounts of combinational logic? This is the reason, or one reason: you get dominated by the sequential logic overhead, and most of your clock cycle gets wasted on moving values into and out of registers, as opposed to doing useful work on the instruction. Okay, this picture still assumes a perfect division of work between stages; we're going to get back to that.
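The throughput expression, and the register-cost trade-off that comes with it, are easy to play with numerically. This is my own toy sketch; T, S, G, and R are made-up constants, not numbers from the lecture:

```python
T = 800.0   # total combinational work per instruction, picoseconds (assumed)
S = 50.0    # register latch / sequencing overhead per stage (assumed)
G = 1000.0  # gate cost of the logic, arbitrary units (assumed)
R = 20.0    # cost of one rank of pipeline registers (assumed)

def throughput(k):
    # cycle time = useful work per stage (T/k) + register overhead (S);
    # one instruction completes per cycle in steady state
    return 1.0 / (T / k + S)

def cost(k):
    # idealized: same total logic, one extra rank of registers per stage
    return G + k * R

# deeper pipelines help less and less (the ceiling is 1/S), while the
# register cost grows linearly, so cost per throughput has a sweet spot
ratio = {k: cost(k) / throughput(k) for k in (1, 4, 16, 64)}
```

Under these made-up constants the ratio bottoms out at a moderate depth and then rises again as registers dominate, which is the qualitative point being made.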
You may not be able to divide T by k perfectly, because you have natural granularities of processing, like accessing the register file, accessing memory, or doing something in the ALU; you may not be able to pipeline inside all of them. Okay, cost is also interesting. A non-pipelined version needs gates to implement the logic plus some register cost, G plus R. Perfect pipelining assumes you divide the logic delay by k without increasing the combinational logic itself; we're going to see that that's not quite true, but even if it were, you need more registers, so the register cost gets multiplied by roughly k. Registers increase hardware cost: you need to add pipeline registers at the end of each pipeline stage. This still ignores the fact that you may need to replicate some resources, but the register cost dominates if you pipeline too much. Hopefully this makes sense; it's one reason people don't pipeline too much. If you could pipeline arbitrarily deeply, maybe we could have 50 GHz processors by now; you could potentially get that frequency, but your cycles per instruction would be much worse. Okay, now let's go into a bit more detail and actually pipeline the instruction processing cycle. You've seen this many times, which is why I'm going through it relatively quickly, just to jog your memory. This was the single-cycle microarchitecture that we developed in an earlier lecture, not the multicycle one, so I'm going to start from the single-cycle microarchitecture. As I said, there is a good reason to teach the single-cycle design: one reason is to show you that it's a bad idea, and it's always good to show bad ideas; the other reason is that it's a good educational transition into pipelining, because a lot of what you saw in the single cycle carries over; the control logic and control signals are going to be exactly the same. Now look at the pipelining over here. In a single cycle our throughput is 1/T. This is one way of pipelining: we add registers at certain points. Let's assume somebody divided the stages like this, and ignore these back edges for now. Back edges are always problematic in pipelining and you need to be very careful with them, because there may be another instruction over here, and another instruction over there; different instructions are in different stages, so you cannot just take a value from here and put it into the program counter, because that affects the other instruction that's in an earlier stage. You now need to think of the hardware as catering to different instructions in different stages, and be very careful whenever a wire crosses stage boundaries. Okay, so instruction fetch is 200 picoseconds, decode is 100, then 200, 200, and 100 for write back. Somebody designed this five-stage pipeline, and it's not a perfect design; we're going to build it up later in some way. Is this the correct partitioning? That's always something you can ask; maybe it isn't. Maybe you want a six-stage pipeline, maybe a four-stage pipeline; why not choose different boundaries? This is always a question in pipelining. We're not going to talk much about it, but keep it in mind: when you design a pipelined processor you have to make choices, and there are no easy choices in a sense. But let's analyze the design that we have in this picture. Assume these are the instructions we have: different loads, completely independent of each other. If you don't pipeline, if you somehow design a processor that executes each instruction in a single cycle, you need 800 picoseconds for this load, 800 for this one, and 800 for this one. When you pipeline, every stage needs to take the same amount of time, because you move an instruction from one stage to the next in consecutive cycles; you cannot have different stages with different cycle times, as that breaks the clock. In this case you see that a load actually takes 1,000 picoseconds to complete, because it goes through five pipeline stages and each stage is 200 picoseconds. Even though a stage like decode only needed 100 picoseconds, since the register file takes only 100 picoseconds to access, you still spend a full 200-picosecond clock cycle in that pipeline stage. The next instruction is similar. So what happened here is that our throughput increased: previously we finished one instruction every 800 picoseconds, and now we finish one instruction every 200 picoseconds. But our throughput is not perfect: we pipelined with five stages, so with perfect pipelining we would expect a speedup of 5x, yet we get a speedup of 4x in throughput, because we waste part of the clock cycle; basically half of the cycle is wasted in two stages, the decode and register file access stage and the write-back stage. Makes sense? So in general you don't partition the work perfectly, and there's no easy way to solve this. This is called balancing the pipeline: if your pipeline stages are not perfectly balanced, meaning the work is not perfectly partitioned across stages, you will not get the ideal speedup.
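The 4x-versus-5x arithmetic above can be checked in a few lines; the stage latencies are the ones from the lecture's example:

```python
# Five-stage pipeline with the lecture's stage latencies, in picoseconds.
stage_ps = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

single_cycle_ps = sum(stage_ps.values())         # one instruction per 800 ps
clock_ps = max(stage_ps.values())                # slowest stage sets the clock
pipelined_latency_ps = len(stage_ps) * clock_ps  # 5 stages at 200 ps each
speedup = single_cycle_ps / clock_ps             # throughput speedup, not 5x
```

A single load's latency actually got worse (1,000 ps instead of 800), but one instruction now completes every 200 ps in steady state.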
So this is the first immediate hit to our pipeline: we expected a 5x speedup and we get 4x. It's not that bad; if you pipeline more you may get a little closer, but you will not reach the ideal. We're going to see more and more hits to pipelining, and we're going to try to solve them. Okay, so if you want to enable pipelined processing, you need to add pipeline registers. This is a stripped-down version of the single-cycle processor we saw earlier, meaning you need to do what I show at the bottom over here; this is a pipelined version of the basic single-cycle processor. You fetch the instruction and increment the program counter (we keep this nice incrementing program counter, as opposed to the multicycle design, because it actually helps pipelining), and you store the intermediate results in the instruction fetch / instruction decode pipeline register; we're going to call these pipeline registers. Then the decode and register file access stage does its work and stores its intermediate results over here, then the execute and branch calculation stage does its work and stores its results over here, then the memory access stage does its work and stores its results over here, and finally the write-back stage stores the result in the register file. And then there are also arcs going back the other way, these arrows, basically to change the program counter: the branch finishes execution over here, the branch condition and target are calculated and stored in this register, and in the next cycle, if the branch is supposed to be taken, you change the program counter to the target address. Not all of those paths are shown over here, but you need them to execute the instructions. The key takeaway is that if you want to pipeline the processing, you need pipeline registers to let one instruction be fetched and store its result at the end of the cycle, while another instruction is being decoded and stores its result, while another is being executed, while another is accessing memory, and while another is writing back to the register file. Makes sense? So we added these registers; similar registers also exist in the multicycle design, in a different way, but here you need to think differently, because we are pushing new instructions into the pipeline, whereas in the multicycle design there is still only a single instruction in flight. Okay, so for example these are some of the values held in the pipeline registers: the instruction register, PC+4, the register values you've read, the immediate, the ALU result, one of the source registers. Whatever values you're going to need later, you need to keep in the pipeline registers. For example, PC+4 is needed here. Why? Because we're going to calculate the target address of the branch by adding the sign-extended, left-shifted immediate to it, and that becomes the next PC of the branch. There also needs to be a stored branch condition; this figure is not showing the entire picture, but these are the things you need: the data elements required to process the instruction in the later pipeline stages, propagated across these pipeline registers. We will later see that you need to do the same thing for control, so these registers can become quite large. Makes sense? Okay, and of course you need to make sure no resource is used by more than one stage at a time; while you're reading the registers, you should make sure nobody is updating them. That's going to be an important concern for us later on. So basically each stage is exclusively reserved for the instruction that is being processed in that stage.
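As a concrete picture of what one of these pipeline registers might carry, here is a hypothetical sketch; the class, field names, and values are mine, not the book's, and they simply mirror the elements listed above (PC+4, the registers read in decode, the sign-extended immediate):

```python
from dataclasses import dataclass

@dataclass
class DecodeExecuteReg:
    """Sketch of a decode/execute pipeline register for a MIPS-like pipeline."""
    pc_plus4: int  # kept so the branch target can be formed in a later stage
    rs_val: int    # register values read in the decode stage
    rt_val: int
    imm_sext: int  # sign-extended immediate

reg = DecodeExecuteReg(pc_plus4=0x0040_0004, rs_val=5, rt_val=7, imm_sext=-4)

# branch target = PC+4 + (sign-extended immediate << 2), as described above
branch_target = reg.pc_plus4 + (reg.imm_sext << 2)
```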
Also, all instruction classes must follow the same path and timing through the pipeline stages, because of this design. For example, you fetch the load; at the end of the cycle you store whatever you need, PC+4, the instruction register, maybe some control signals; then the load goes to decode, and in the next cycle you do that processing and store the results in the next pipeline register; then execute, next pipeline register; memory, next pipeline register; and in write back you write the value into the register file. A load is actually a good example because it goes through all five pipeline stages. But if you're doing, say, an add, you don't really need the memory stage; the add still has to go through the memory stage, though, because it's a pipeline and you cannot just bypass a stage, since bypassing adds a lot of complexity. You could actually do it in very advanced designs, but not in classical pipelining. So this has a performance impact: an instruction passes through pipeline stages it doesn't need, which consumes cycles. This is very different from the multicycle design, if you think about it: in the multicycle design an instruction takes as few states in the finite state machine as possible, which is the ideal design choice there, while here every instruction takes the same number of pipeline stages. But it's not as bad as the single-cycle design, whose slowest-case problem is insurmountable, as we've discussed. Okay, so here is a pipelined operation example: you fetch the load and then you fetch the subtract, and they flow through the pipeline like this; I'm showing you two instructions for now, and it's nice. While the load is doing memory access, the subtract is executing; while the load is writing its data back to the register file, the subtract is going through the memory stage doing nothing, because subtract doesn't access memory; then the load leaves the pipeline and the subtract goes to the write-back stage and writes its result. Hopefully this makes sense: you fetch one instruction per cycle, and every cycle each instruction moves from one stage to the next, so that new instructions keep coming into the pipeline. I've shown you every clock cycle over here: clock cycle one, two, three, four, five, six. These are simple concepts, so hopefully this is simple; we're going to get to the more complicated questions, like: is life always this beautiful? Meaning, what if that subtract was dependent on the load? What if the subtract needed the value produced by the load? Then it cannot keep going. We're going to get to that, but let me first give you an operational view of the pipeline, because it's fun to think about. These are different time units: time zero, one, two, three, four, five. Instruction zero does instruction fetch at time zero; at time one there's already some concurrency; at time two, three instructions are in the pipeline, in fetch, decode, and execute, as you can see. So you fill the pipeline, and you keep feeding new instructions into it. Instruction zero is the oldest instruction and finishes after it does write back; instruction one is the second oldest, then the third oldest, and so on. We are assuming a five-stage pipeline here, divided similarly to what I showed you earlier, and the situation where the pipeline is full is called the steady state: all of the stages hold an instruction at once. That's good; that's when you have the highest throughput. This is what we would like: ideally your pipeline is always full and stays full, just like in this picture, but life is not going to be this beautiful.
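The in-flight picture above, where in the steady state instruction i simply sits in stage t minus i at time t, can be sketched like this (my own illustration of the idea, not code from the course):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr, t):
    # instruction i enters fetch at time i and advances one stage per cycle
    s = t - instr
    return STAGES[s] if 0 <= s < len(STAGES) else None

# at time 4 the pipeline becomes full for the first time: the steady state
snapshot_t4 = {i: stage_of(i, 4) for i in range(5)}
```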
So here is the same operation viewed by resource: if you look at the different stages over here, fetch, decode, execute, memory, write back, and at the time axis over here, this is how the instructions flow. Again, this is just for you to think about; views like this are actually used for formalizing and building pipelines, but I'm not going to go through that in this course. You can see that instruction zero flows through different stages at different time units, and this is where your pipeline is full; all of these cycles, T4 through T10, are very useful for performance. Okay, let's talk about control points in a pipeline. This is where the single-cycle design is actually quite useful: you take the single-cycle design, and you have the identical set of control points as the single-cycle data path, except you have some more registers, the pipeline registers. So this is very similar to the single-cycle processor we designed, except for the pipeline registers. It's a five-stage pipeline: fetch; decode and register read; execute; memory access; write back; with four pipeline registers in between the stages. But the control points are exactly the same as in the single-cycle data path, and you can confirm this on your own if you would like. So what do we do with control in a pipeline? For a given instruction we have the same control signals as in the single-cycle processor we designed earlier, but we don't use all of them at the same time. In a single-cycle processor there is a single cycle, so all of the control signals are distributed everywhere at once. Here, control signals are needed in different cycles, depending on the stage: if the instruction is in the fetch stage you use some control signals, and if the instruction is in the decode stage you use some others. Makes sense? So how do you do this? One thing you could do is decode the instruction once, using the same logic as in the single cycle, and buffer the signals until they are consumed. This is roughly what it looks like, in an abstract view. This is the fetch stage; you know the instruction after the fetch stage; some decoding happens in the decode stage, and that's where you generate the control signals. These control signals can be generated for all of the downstream stages: these are the ones I'm going to need in the execute stage, these in the memory stage, and these in the write-back stage. So I generate all of those control signals, as far as possible; of course, some control signals are not easy to generate ahead of time because they depend on the processing itself, like the branch condition, so ignore those. Whatever you can generate based on the opcode, you generate, put into the pipeline register, designate where it's needed, and then use it when it's needed. Control signals that are only needed in the execute stage are not needed in the memory stage, so you don't propagate them into the register between the execute and memory stages; and control signals needed only in the memory stage, and not in the write-back stage, don't need to be propagated into the pipeline register between memory and write back. The control signals you generate are the same as the single-cycle control signals; you just delay when you use them, because of the definition of pipelining: parts of an instruction are processed in different cycles, and different instructions are processed on the same hardware in different places. This is not the only option.
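The decode-once-and-buffer idea can be sketched in a few lines. The signal names follow the usual H&H-style MIPS control signals (ALUSrc, MemWrite, RegWrite), but the table values and the helper below are my own simplification, not the book's exact control logic:

```python
# Per-opcode control signals, grouped by the stage that consumes them.
CONTROL = {
    "lw":  {"EX": {"ALUSrc": 1}, "MEM": {"MemWrite": 0}, "WB": {"RegWrite": 1}},
    "sw":  {"EX": {"ALUSrc": 1}, "MEM": {"MemWrite": 1}, "WB": {"RegWrite": 0}},
    "add": {"EX": {"ALUSrc": 0}, "MEM": {"MemWrite": 0}, "WB": {"RegWrite": 1}},
}

ORDER = ["EX", "MEM", "WB"]

def signals_after(op, stage):
    # contents of the pipeline register following `stage`: signals already
    # consumed are dropped, and only downstream signals travel onward
    keep = ORDER[ORDER.index(stage) + 1:]
    return {s: CONTROL[op][s] for s in keep}
```

For a store, for example, ALUSrc is consumed in execute and does not travel past the EX/MEM register, while MemWrite and RegWrite continue down the pipeline with the instruction.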
You could also do something else: instead of decoding all of the needed control signals ahead of time in the decode stage, carry part of the undecoded instruction, say an instruction field, down the pipeline and decode locally; that is, distribute the decoding to happen locally in the later stages. Clearly there's a benefit to that: you don't need to store all of those bits in the pipeline registers; you essentially trade sequential state, registers, for combinational logic. We're going to assume the first option for the rest of the discussion, but the second option is equally viable. Which is better? I kind of gave away the answer: it depends. So here, this is actually the same single-cycle processor divided into pipeline stages, with the control signals delayed to the point where they're really needed. For example, the execute stage needs ALUSrc: these multiplexers over here need to be controlled, and their control signals are generated in a previous cycle, stored in the pipeline register, and used in the following cycle; these are the control signals needed in the execute stage. In the memory stage, some controls like Branch, whether this is a branch, are needed, because we calculate the branch condition in the previous cycle, like that Zero signal: if what we're processing is a branch and the branch condition is satisfied, meaning Zero is set, then we set PCSrc to one, which means the PC comes from the target address that was computed earlier. So there's no magic: we've already built all of this data path; the data path is already there; we just delay the control signals to the appropriate time. Does that make sense? Okay, this may require some more thinking and going through the figures, but essentially this is your single-cycle processor divided into stages, with the control signals delayed to the appropriate place. So let's also take a look at memory, the MemWrite signal for example. MemWrite is generated if the instruction is a store, and it's stored over here in the pipeline register; as the store instruction moves, the signal moves to the next register. The store instruction calculates its address over here, and that address is used to write a data value that the store also latched in the previous cycle into this register. In the memory stage, the store uses the latched address, the latched data, and the MemWrite signal, which is of course asserted at that point, if you remember this signal, and writes to memory. So while the store instruction moves from fetch to decode to execute to the memory stage, the necessary control signals are propagated along with it. The data path is very similar to what we designed before; it does essentially the same things, except in different cycles. Okay, and this is one more example over here, for the RegWrite signal: essentially RegWrite needs to be propagated so that you write to the register file in the write-back stage, as opposed to earlier stages. Okay, tomorrow we will pick it up from here, go into a little more depth, and then hopefully finish pipelining. I'll see you tomorrow.