Okay, is everything good online? It's very loud. We have a sparse lecture today. Is there some event going on that you're missing, maybe a big party? I guess you'll learn about it later. I have a big deadline, so I know that much, but nothing else.

We're going to cover a topic that's quite exciting, which you started yesterday, a topic that has fascinated people for many decades. It's a tough topic right now because it's very difficult to improve on, but it's very important to improve on, as Muhammad mentioned yesterday and as I'm going to harp on today: predicting the direction of branches so that we can keep the pipeline full without stalling the fetch engine. You studied this yesterday, and today we'll go into some more techniques and hopefully finish branch prediction, even though there are a lot of other things we could talk about. I'll leave you the backup slides if you're interested, and these are some of the readings that I would recommend. Your books don't go into this topic as much as we do; books are usually older, whereas topics move, and you will see that more and more as we go into more interesting topics like GPUs and systolic arrays in these lectures. Those topics are not covered well in your books, unfortunately, even though they are a cornerstone of how we do computing today.

So, we were talking about guessing the next fetch address, and we talked about branch prediction and the importance of this problem, especially what it means to guess the next fetch address. You need to guess that the instruction you fetched is a branch, you need to guess the target address, and you need to guess whether that branch is taken or not taken, assuming it's a conditional branch. If you don't do this well, then, as we studied with the basic example yesterday, even with 90% accuracy you will not utilize your 20-stage, 5-wide pipeline very well: your fetch IPC is only 1.6 as opposed to 5, and you fetch 200% extra instructions. So 90% sounds like a high number; today we're going to cover how to get to 90%, but getting much higher than 90%, especially in difficult workloads where branches are hard to predict, is not easy. Some of the implementations in real industry may look like black magic, because you're trying to guess what's going to happen and you need to use a lot of information. We're going to see some of the information that goes into the prediction mechanisms actually used in microprocessors today.

This was our fetch stage with a branch target buffer. Remember, yesterday we introduced the BTB, which is indexed with the program counter of the instruction you fetch. If it's a branch that was executed previously, you get a hit and you get the target address, so it's really a cache of target addresses. A hit tells you two things: you executed this program counter before and it's a branch, and here is its target address, because you computed it previously. If you miss in the BTB, you don't know; it could be a branch you never executed, or it could be something else. In that case your prediction is PC plus instruction size, meaning you go to the next sequential instruction, because you don't know whether it's a branch or not. Concurrently, you use the program counter to index something called the direction predictor, which we're going to develop today; it tells you whether the branch was taken or not taken the previous time you saw it (we're going to enhance that today). If the branch is predicted taken, you use the target address from the BTB, which you stored based on the previous execution, as your next fetch address. If the direction predictor says the branch should be predicted not taken, you use PC plus instruction size, which is PC + 4 in the MIPS ISA. Hopefully this is clear; it was covered yesterday, so I will not go into it again, and we're going to enhance it a little more.

Yesterday we didn't actually build this direction predictor yet; we're going to build it today. The direction predictor can be compile-time, and we covered some examples of that yesterday. I'm going to finish the compile-time predictors today, and then we're going to go into runtime or dynamic predictors, which are implemented in hardware: the hardware figures out what to do when you fetch an instruction based on what has happened in the past. You could of course combine compile-time and runtime, but we're not going to do that today.

So, this slide was covered yesterday by Muhammad. There are simple static branch prediction techniques that don't require much hardware. For example, always-not-taken: you always predict the branch to be not taken, so you always fetch PC + 4. Unfortunately this has very low accuracy, 30 to 40% for conditional branches, because it turns out most conditional branches are taken. The main reason is loops: whenever you implement a loop, the loop branch is a conditional branch that is taken every time you iterate. You take the branch to the beginning of the loop and you keep taking it until you exit the loop, and since loops usually iterate more than once (twice, three times, five times, a hundred times), most branches tend to be taken. So this scheme is very simple, but the performance is not very good. You can do the opposite and always predict the branch as taken; here you don't need direction prediction, but you do need target prediction. This gets better accuracy, clearly, but 60%, as we have seen earlier, is not enough. Then people play the trick of backward-taken, forward-not-taken: if you see a branch that goes backward in the code, meaning the target address is earlier in the code than the branch itself, you predict it taken, because it may be a loop branch; otherwise you predict it not taken, because it's probably not a loop branch. This turns out to be a little better than either of the other mechanisms. These are all static mechanisms, meaning you don't need direction prediction hardware for any of them.
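To make the backward-taken/forward-not-taken heuristic concrete, here is a minimal sketch in C. The function name and the assumption that the branch PC and its target are both available at prediction time (e.g., from the BTB) are mine, for illustration only:

```c
#include <stdbool.h>
#include <stdint.h>

/* Backward-taken, forward-not-taken (BTFN): a target address below the
 * branch's own PC suggests a loop-closing branch, so predict taken;
 * a forward target is predicted not taken. Purely static: no state. */
static bool btfn_predict_taken(uint64_t branch_pc, uint64_t target_pc)
{
    return target_pc < branch_pc;   /* backward branch: likely a loop */
}
```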
There's another static mechanism that's profile-based, and this was the last thing covered yesterday; I'll go through it relatively quickly. The idea is that the compiler supplies a prediction based on information it gathers at compile time. The compiler looks at the conditional branches and profiles the code using some input set provided by the programmer or the user, and it says: under these input sets, this branch is likely to be taken, this other branch is likely not taken, and so on. It encodes that direction as part of the instruction encoding, as a single hint bit filled in by the compiler. If the compiler thinks the branch is likely to be taken, it sets the bit to one; otherwise it sets it to zero. This serves as a hint to the hardware: the compiler supplied this bit as a prediction for the branch.

This has the advantage of being per-branch: for each branch, for each program counter, you can individually determine the bit, so it can be more accurate than the mechanisms we just looked at, assuming the profile is representative of the actual execution of the code. And that's the catch. The compiler runs the code several times on some input sets, but if the input set you run in real life, when you actually execute the program in the field, is completely different from what the compiler profiled on, then those direction bits may not be good for what you're actually running. This profile representativeness problem is always there with profile-guided compiler optimizations, because the compiler doesn't know what's actually going to run: you compile the code, generate instructions, and put in these hints based on some information, and if that information is quite different from what happens in the field, your input set was not representative. Keep that in mind whenever you see profiling. We discussed the upsides; the downside is that this requires hint bits in the branch instruction format, meaning you need to change your ISA (most ISAs have such a hint bit today), and accuracy depends on the dynamic branch behavior.

Let me show you several examples. Here is one instance of a branch: in its first 10 executions it is taken, and in its next 10 executions it is not taken. The compiler can encode only one bit, and it doesn't matter what it picks; whether it says likely taken or likely not taken, you get 50% accuracy. But this behavior is actually quite predictable: a runtime, dynamic predictor that recognizes the pattern can do much better than 50%. Another example is a branch that alternates between taken and not taken every execution. Again, a single compile-time bit gives you 50% accuracy, whatever the compiler selects, but the pattern is quite predictable: when I look at it, I can say "do the opposite of what happened last time."

Accuracy also depends on the representativeness of the profiling input set. To demonstrate, consider a branch that is taken 18 times and not taken 2 times. If the compiler had a representative input set, it would encode the likely direction as taken and get 90% accuracy, which is not bad. But if the compiler makes a mistake because its input set is not representative, and it encodes the branch as likely not taken, then in real life you get only 10% accuracy. So your accuracy can be terrible if you make a mistake at compile time because the profile was not good enough. That's the downside of profile-based mechanisms.

Okay, let me give you another idea. This is actually a cool idea, I think; it doesn't work in many cases, but it works in some. You can analyze the program, look at the types of branches in it, and decide this hint bit that you supply to the hardware based on that analysis. I call this program-based, or program-analysis-based, prediction. The idea is to use heuristics based on program analysis to determine the statically predicted direction; again, the compiler can do this, or you as a user can. For example, this paper from PLDI 1993 says: if the branch opcode is "branch if less than or equal to zero," predict it not taken, because negative integers are usually used as error values in many programs, and predicting such branches not taken is better than predicting them taken. That's a heuristic: you don't need to do any profiling, you just look at the opcode and set the bit accordingly. Another one is the loop heuristic: if a branch guards the execution of a loop ("if some condition, execute the loop, otherwise don't"), predict the branch in the direction that executes the loop. Why? Because programs are usually written around loops, and if there's a guarding branch that could skip the loop, it's probably resolved so that the loop executes. Again, a heuristic developed by analyzing the code, with no profiling. Or: if you compare a pointer, which points to a memory location, with a floating-point number, predict that they are not going to be equal. Who does that? I don't know, but if someone does, the two values are probably not equal.

So the key idea is that this requires no profiling, which is good because the input-set representativeness problem goes away, but now you depend heavily on heuristics, and those heuristics might not be representative or good, and they may not apply to many branches. It still requires compiler analysis and ISA support, but that's true for the other static methods as well. If you're interested, I recommend reading that 1993 paper; it's very nice, I think, and it shows that at the time they got around a 20% misprediction rate, which was not bad back then.
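As a toy illustration of such heuristics, here is a sketch in C. The BranchInfo fields and the function are hypothetical, simply mirroring the three heuristics above; this is not code from the paper:

```c
#include <stdbool.h>

/* Hypothetical per-branch summary a compiler pass might compute;
 * field and function names are illustrative, not from the 1993 paper. */
typedef struct {
    bool is_blez;            /* branch if <= 0: negatives often signal errors */
    bool guards_loop_body;   /* taken direction executes a loop body          */
    bool eq_ptr_vs_float;    /* tests a pointer for equality with a float     */
} BranchInfo;

/* Return the hint bit the compiler would encode (true = predict taken),
 * applying the three heuristics in some fixed priority order. */
static bool static_hint(const BranchInfo *b)
{
    if (b->guards_loop_body) return true;   /* loops usually do execute      */
    if (b->is_blez)          return false;  /* error paths are rarely taken  */
    if (b->eq_ptr_vs_float)  return false;  /* "equal" is the unlikely case  */
    return true;                            /* default: most branches taken  */
}
```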
Okay, one last static branch prediction method: programmer-based. As the programmer, whenever you're programming, you guess whether a branch is going to be taken or not taken, and you communicate this information, per branch, to the hardware using constructs in the programming language. Many programming languages have such constructs, called pragmas. I'm going to show you on the next slide how you can specify whether a branch is likely taken or likely not taken. If you're a good programmer, maybe you can do that, but again, this is not perfect either: the programmer may not know the input data coming into the program. On the plus side, this requires no profiling or program analysis, and the programmer may know some branches and their behavior better than other analysis techniques, if they have a good hunch of what's going to happen. Unfortunately, it requires even more support: support from the programming language, support from the compiler to take what the language provides and encode it in the ISA, and therefore support from the ISA as well. And it clearly burdens the programmer, who now needs to think about whether each branch is going to be taken or not taken.

Okay, let me get into the notion of pragmas: how does the programmer convey this information? A lot of programming languages provide keywords that enable programmers to convey different types of information to lower levels of the transformation hierarchy, meaning the hardware in this case. You convey this information to the instruction set architecture, the ISA encodes it, and you can actually communicate with the hardware from the program, saying, for example, "I think this is going to be taken." Here's an example: if (likely(x)) is such a pragma. You're testing x; if x is true, you execute the code in the brackets, and you're saying "I think this is going to be taken," meaning the code in the brackets will execute. So you're providing a hint, and the compiler that compiles this pragma sets the hint bit of the branch implementing that if statement to one, meaning taken. Conversely, you can write unlikely(error): if there's an error, do something. Most of the time you don't get errors in programs, hopefully, and if you do get errors, you don't care about performance anyway, so you can mark that branch unlikely.
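In C with GCC or Clang, the standard way to spell these hints is the __builtin_expect builtin; the Linux kernel wraps it in likely()/unlikely() macros much like this sketch:

```c
/* likely()/unlikely() in the GCC/Clang style: __builtin_expect(expr, val)
 * tells the compiler which outcome to optimize for (code layout, and on
 * ISAs with branch hint bits, the hint itself). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(const int *buf, int err)
{
    if (unlikely(err))          /* error path: hinted not taken */
        return -1;
    if (likely(buf != 0))       /* common case: hinted taken    */
        return buf[0];
    return 0;
}
```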
There could be many other hints and optimizations enabled by this sort of pragma; that was one example. Another pragma I'll give you concerns whether a loop can be parallelized: the programmer says "I think this loop is parallelizable, so compiler, please try to parallelize it with your techniques." This exists in many parallel programming languages, OpenMP for example, and this is how it's specified. This description is directly from the OpenMP manual: the parallel directive explicitly instructs the compiler to parallelize the chosen segment of code. Of course, the compiler may or may not be successful, but now the programmer guides the compiler to try; that's another performance optimization, and I'll show a small sketch of the directive below. Yes, a question: why does the compiler not try to parallelize loops anyway? It may be too difficult, and it may lead to undesirable effects such as code bloat: if the loop is not really parallelizable, the compiler may need a lot of tricks that increase the code size, and so on. If it's guided by the programmer, you may avoid some of those undesirable effects. That's a very good question; fundamentally, compilers are not good enough yet, because they don't have a lot of information about dynamic execution and about dependencies.
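Here is a minimal sketch of that OpenMP directive in C (my own toy loop, not from the manual; compile with something like gcc -fopenmp):

```c
#include <stdio.h>

int main(void)
{
    double a[1000], b[1000];
    for (int i = 0; i < 1000; i++)
        b[i] = (double)i;

    /* The programmer asserts the iterations are independent; the
     * compiler/runtime then tries to distribute them across threads. */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * b[i];

    printf("%f\n", a[999]);
    return 0;
}
```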
Okay, so now we are almost done with static branch prediction. All the previous techniques can be combined: you can have some branches predicted based on profile, some based on program analysis, some based on programmer information. How exactly you would do that is an open question; there's work to be done there. But there's a common disadvantage to all three techniques, which I've actually given you already: even though a branch may be quite predictable based on dynamic information, none of these techniques take dynamic information into account, meaning they cannot adapt to changes in branch behavior. They encode one single direction per branch, whereas the branch may change behavior while executing: it may be taken the first million times and not taken the next million times. I gave you an example with 10 and 10, but imagine it's a million and a million. We want techniques that can adapt to such changes.

Of course, if you keep recompiling the code while it's running, meaning you have a dynamic compiler or dynamic binary translator, which we have seen at a high level earlier, you could mitigate some of this: the compiler recompiles the code while it runs and adapts to some of these changes in dynamic branch behavior. Assuming the branch changes behavior at a coarse granularity, say it switches from taken to not taken every million executions, this may be okay. But what if it switches from taken to not taken every other execution? That's very difficult to account for, even with a dynamic compiler that recompiles the code once in a while, and a dynamic compiler also has its own overheads. What is a dynamic compiler? Essentially a compiler that generates and changes code at runtime. There's support for this in modern systems: the Java language has a just-in-time compiler, Microsoft has the Common Language Runtime, and in this class we've seen multiple binary translators that compile code dynamically and adapt to hot code that is executed a lot. NVIDIA's dynamic code optimizer, which we discussed briefly in the past, does this too: it finds which parts of the code are executed a lot and optimizes them. If you know which direction the branches are going, you can change the hint bits when you recompile the code. So there is a way to make static prediction more dynamic, by having a dynamic compiler that keeps recompiling the code, but there are overheads associated with it.

Because of that, we're going to move to runtime techniques. You can also combine runtime techniques with static techniques and choose the best; we're going to talk about hybrid branch predictors later, and you can build a hybrid of static and dynamic as well, but let's not get ahead of ourselves. Runtime techniques can adapt to the behavior of branches dynamically, and we're going to start with the simplest mechanisms and build up to the advanced algorithms used in current systems, like perceptrons and geometric history length predictors; this will become clearer when we discuss them.

What is the idea? I think I've given it to you already: predict branches based on runtime, dynamic information, collected by hardware in most of these cases. The big advantage is that you can predict based on the history of execution of the branches: if the history indicates a pattern, you can do better than static prediction, and you can adapt to dynamic changes in branch behavior. There's no need for static profiling, so the whole input-set representativeness problem goes away, along with relying on the programmer or on heuristics. Of course, there's a disadvantage: this is more complex, because you need additional hardware to do the prediction, and if you want to predict very accurately, the hardware becomes larger and larger. That's where we are today: the prediction hardware in existing systems is actually quite large, and I'll give you an example in later slides.

Now let's jump into the simplest predictor you can design: the last-time predictor. The basic idea is to guess that the branch will take the same direction it took the last time it was executed. It's the same branch, same program counter; you keep one bit for that program counter. You could store this bit in the BTB, or in the direction predictor as I showed earlier. The bit indicates which direction the branch went the last time it executed. You execute the branch, see that it's taken, and record taken. The next time you fetch the same program counter, it's the same branch, so you check the bit: it says taken, meaning the last time I executed this it was taken, so I predict taken. You do what you did, or rather what you should have done, last time.

Look at this execution profile: the branch is taken for the first 10 executions and not taken for the next 10, and assume this repeats, as if executing in a loop. The last-time predictor gets 90% accuracy here, assuming you start out predicting taken. You predict all executions correctly except the first one after each behavior change: the first time the branch flips to not taken, you mispredict, because the last time you executed it, it was taken; and when the behavior changes back, you make one more mistake. Two mistakes out of 20 executions is 90% accuracy, which is much better than 50% with a single static prediction bit. We're going to do better than 90% soon.

In other words, you always mispredict the last iteration and the first iteration of a loop branch. In the last iteration, you keep iterating in the loop, keep taking the branch, and once you exit the loop you need to not take the branch; but because you took it many times, you predict taken, and that's a mistake. When you enter the loop again, the last recorded outcome is not taken, because you exited the loop, but you should predict taken, since you're going to iterate again; that's the second mistake. So the accuracy for a loop branch with n iterations is (n - 2)/n. Clearly, as n goes to infinity, you approach 100% accuracy, so this is very good for loop branches with large iteration counts. But if n is small, you have a problem: with n = 2 you drop to 0% (you can check it with the equation), even though the behavior is again perfectly predictable. The last-time predictor is bad whenever a branch always does the opposite of what it did last time. You could probably catch that with a pattern matcher; I'm not going to go into pattern matchers now, and we'll get to perceptrons and the like later, but we're going to fix this problem partially with the next predictor.

So how do you implement the last-time predictor? There are multiple possible implementations; here is one. We now actually need a direction predictor: you take part of the program counter and use it as an address to index an array, and each entry of this array records which direction the branch at that program counter went last time. So this direction predictor says whether this program counter led to a taken outcome the last time. Alternatively, you don't have to keep the direction predictor separate from the BTB. Remember, the BTB is a memory structure indexed by the program counter that tells you whether this was a branch and gives you a target address; it could also record whether the branch was taken last time. You index into the BTB with part of your program counter, and the entry gives you the target address, a tag, and what happened to that branch last time. We're going to see this structure in detail when we talk about caches, so don't worry if you don't quite understand it yet; think of it as indexing a memory array using an address derived from part of your program counter. You get the taken bit and the target address: if the bit is set, the branch was taken last time, so you use the stored target address as your prediction; otherwise you use PC + 4. Our focus today is really on the algorithms for branch prediction, not the implementation details; you'll have a lot of fun with this sort of structure later, when we get to caches. Either way, you need to update this branch history table, whether it's part of the BTB or a separate direction predictor, after each execution of a branch.
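As a minimal sketch of the separate-direction-predictor implementation in C (the table size and indexing details are illustrative assumptions; real designs index with part of the PC exactly like this, so different branches can alias into the same entry):

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 4096   /* illustrative table size */

/* One bit per entry: was the branch taken the last time we saw it? */
static bool last_time[ENTRIES];

static unsigned index_of(uint64_t pc)
{
    return (pc >> 2) % ENTRIES;   /* drop byte offset of 4-byte instructions */
}

static bool predict_taken(uint64_t pc)
{
    return last_time[index_of(pc)];
}

static void update(uint64_t pc, bool taken)  /* call after the branch resolves */
{
    last_time[index_of(pc)] = taken;
}
```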
If you think about it, for each branch you have a little state machine: one state says "predict taken," because that's what the branch did last time, and the other says "predict not taken." You move between the states based on the actual outcome: if the branch is actually taken, you stay in (or move to) the predict-taken state; if it's actually not taken, you go to the predict-not-taken state. So you can fill out the state diagram, one state machine per branch. Of course, "per branch" in the purest sense: you may not have an entry for every branch in the program, and different branches with different addresses can map to the same entry of this direction predictor table, because you're using only part of the program counter as your address.

Now, there's a problem with this, and people figured it out: the predictor changes its mind very quickly. A single opposite outcome moves it from one state to the other; it doesn't believe in staying in its state. It's like an air conditioner without hysteresis. Say you set the air conditioner to 20 degrees Celsius. If the temperature is a little above 20, it starts cooling; if it's a little below, it starts heating. But what if you're right around 20.00 degrees? Should it start cooling immediately, or wait a little? People figured out that it's a good idea to wait, to leave a margin around 20 degrees. If you change your mind too quickly, you cool down to 19.99999 degrees, then immediately flip to heating, then back to cooling; you'd be switching between cooling and heating all the time. That's exactly the problem we have here, and there's a name for the fix: hysteresis. You don't change your mind quickly; you monitor what's going on, and only if the temperature really moves, say to 20.5 degrees, do you kick in. The same idea applies here: don't change your mind quickly. And this actually made a huge difference, which is why I'm harping on it; it makes a huge difference in the energy consumption of air conditioners too, as you can imagine.

So: the last-time predictor changes its prediction from taken to not taken, or vice versa, too quickly, even though the branch may be mostly taken or mostly not taken. Imagine a branch that's taken, taken, taken, taken, then not taken once, then taken many more times. If you change your mind on the single not-taken outcome, you make two mistakes; if you don't change your mind, you make only one. That's the idea: add hysteresis to the predictor so that the prediction does not change on a single different outcome. Wait, so that you accumulate some more information and more evidence that you should really change your mind.
Okay, so how do you accumulate that evidence? There are multiple ways, but one of the simplest is to use two bits instead of one to track what happened: you have two states each for taken and not taken, instead of one state for each. This was introduced by a seminal paper at ISCA 1981, and afterwards essentially all microprocessors implemented it and got a huge boost in branch prediction accuracy. It's called two-bit-counter-based prediction: each branch is associated with a two-bit counter, and the extra bit provides hysteresis, so a strong prediction does not change with a single different outcome. It's also called bimodal prediction; there are multiple names for it.

The two-bit counter implementation looks like this. With two bits you have four states: two of them encode that the branch should be predicted taken, and the other two encode predicted not taken. There's a maximum value you saturate at and a minimum value you saturate at; this is called saturating arithmetic, meaning if you're at the maximum value and want to reinforce it, you stay in that state. In branch prediction terms, the top state is called strongly taken: "I strongly believe this branch is going to be taken." You can think of the counter as a belief. If you see the branch actually taken, you stay in strongly taken. If the branch is actually not taken when you execute it, your belief is modified a little and you go to the weakly taken state: "I still think this branch is going to be taken; I've seen it taken enough that I won't change my mind on one opposite outcome, but I believe it a bit less." So one opposite outcome takes you to the weakly taken state, and you still predict taken. If the branch is actually not taken again, then you change your mind. In other words, you change your prediction only after seeing two opposite outcomes to what you believe based on past information. The not-taken side works the same way; the states are balanced, as you can see.

That's the hysteresis. Of course, a version of the same problem still exists: you can flip-flop between the weakly taken and weakly not taken states if the branch alternates just right, so the hysteresis isn't perfect, but you eliminate a lot of the downside. If you add more bits, say a three-bit counter, you can reduce that problem further, but as you add bits it also takes longer to train the predictor, because you need to see more outcomes. We're going to see an example with three bits later on, from real processors. That's the basic idea: if you want more hysteresis, add more bits; then you change your mind even less frequently, which may or may not be good, depending on the branch behavior.
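A minimal sketch of the two-bit scheme in C; the weakly-taken initialization follows the slides, while the table size and indexing are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 4096   /* illustrative table size */

/* 2-bit saturating counters: 0 strongly not taken, 1 weakly not taken,
 * 2 weakly taken, 3 strongly taken. */
static uint8_t ctr[ENTRIES];

static void init(void)
{
    for (int i = 0; i < ENTRIES; i++)
        ctr[i] = 2;                        /* start weakly taken */
}

static bool predict_taken(uint64_t pc)
{
    return ctr[(pc >> 2) % ENTRIES] >= 2;  /* i.e., the counter's top bit */
}

static void update(uint64_t pc, bool taken)
{
    uint8_t *c = &ctr[(pc >> 2) % ENTRIES];
    if (taken) { if (*c < 3) (*c)++; }     /* saturate at strongly taken     */
    else       { if (*c > 0) (*c)--; }     /* saturate at strongly not taken */
}
```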
So each branch is associated with a two-bit counter, which can be stored in the BTB or in the separate direction predictor we showed earlier, and the extra bit provides hysteresis: a strong prediction does not change with one single different outcome. Now let's analyze it. Take a loop branch with n = 20 iterations: you execute the branch 20 times, 19 of them taken and the last one not taken. The two-bit counter gets all of them right except the single not-taken outcome: 95% accuracy. And for the alternating branch, assuming the counter is initialized to weakly taken (or strongly taken, for that matter), you get 50% accuracy, so you recover from 0% to 50%. For loops with few iterations, you make one fewer mistake, which can matter a lot when n is small; when n is very large, none of this matters much. So we have better prediction accuracy, but there's a trade-off: more hardware cost. The counter can be part of the BTB entry, but it's still more hardware.

Is this good enough? It turns out that, at least in the 1980s, many programs got 85 to 90% accuracy with this predictor; that's why it was implemented immediately in many processors. Today, this sort of prediction doesn't work as well: modern workloads are much more branch-intensive and much more complicated. As you provide more performance in hardware, people add more complexity at the software level, so it becomes a chasing game: you provide more performance, and people need more performance because their software got more complex. Branch prediction is an example of that; branches are becoming less predictable compared to 40 years ago. But okay, let's assume we get 85 to 90%: is that good enough? We go back to what we did yesterday, assuming a 20-stage pipeline with a fetch width of 5. At 80% accuracy you get a fetch IPC of 1 as opposed to 5 (doesn't sound good), and you fetch 400% extra instructions: not good for energy, not good for performance. Assume 85%: still not good. Assume 90%: still not good. So we want to get as close as possible to 100%, especially as the pipeline depth N and the width W increase, and both are higher in modern processors.

Can we do better? We're going to talk about an idea called two-level branch prediction, which again revolutionized how things are done, about ten years after that 1981 paper; this is a seminal paper from 1991. The idea is to exploit another level of history, or predictability. Last-time and two-bit-counter predictors exploit last-time predictability. But starting with this paper, people realized that a branch's outcome can be correlated with other branches' outcomes, meaning that executing some other branch, at some other program counter, can give you an indication of where this branch will go. It's actually pretty cool, as we'll see with some code examples. This is called global branch correlation, and you can build predictors that take advantage of it. People also realized that a branch's outcome can be correlated with past outcomes of the same branch, beyond the outcome of just the last execution: you look back in its history a little more. That's called local branch correlation. Let's start with global branch correlation, because global is a totally different type of correlation, whereas you could say local is similar to last-time with a little more history added.

So what is global branch correlation? The basic idea is that recently executed branch outcomes along the execution path are correlated with the outcome of the next branch. Let me give you an example. Look at these two branches: the first tests some condition and does something based on it; the second tests that same condition AND some other condition. If you executed the first branch, that gives you an idea of where the second might go: if the first branch is not taken, the second is also not taken; if the first branch is taken, you don't know the second, but at least you have some information. Over here there are two more branches: one tests x < 1, the other tests x > 1. If the first branch is taken, the second is definitely not taken. The same holds here: if the first branch is taken, then a is set to 2, and the second branch, which tests a, is definitely not taken. This is why global branch correlation works. Of course, we're not going to track all of this reasoning explicitly; we're going to summarize where prior branches went using a bit vector, where each bit indicates the direction of a different previous branch, and based on that bit vector we'll index into a table that tells us whether, given this past history, the current branch should be predicted taken or not taken. One more example with three branches: if the first two branches, testing y and z, are both taken, you know the outcome of the third branch testing x, and when either of them is not taken you again learn something about it; you can study the example. So there's a lot of information available in the program, but summarizing it in a hardware-friendly way is not easy. The key contribution behind global branch correlation is really two things: first, realizing that the correlation exists; second, summarizing the information in a hardware-friendly way that's easy to implement. If you have an idea that needs to go into hardware, you should really do both.
Let me give you a fun story. There was a benchmark, eqntott (equation to truth table), part of an electronic design automation suite that people used to design processors at the time, and it was one of the SPEC benchmarks used to measure and compare the performance of different microprocessors. Most microprocessors failed at predicting a particular branch in it: the third branch here, which tests whether aa is not equal to bb. Once people started implementing global branch predictors, this branch became very easy to predict. Look: if B1 is taken (it tests whether aa equals 2), then aa is set to 0; if B2 is taken, then bb is set to 0. So if both prior branches went taken, taken, then aa equals bb, and the branch testing aa != bb is definitely not taken, so you predict not taken. Interestingly, this benchmark was removed from the suite after global branch predictors came out, because it wasn't interesting anymore: everybody got it right. There's a lesson there. You develop benchmarks to measure performance, but once the processor becomes good enough at a benchmark, that benchmark stops being interesting. This is another way of saying what I said earlier: processor performance improves to adapt to workloads, but old workloads are not that interesting after some point, because the processor has been optimized for them; new workloads challenge the processor much more. That's why it's always good to be forward-looking as an architect. What matters in the end is the new workloads, not the old ones, unless the old workloads have something fundamental that doesn't change. In this case, the benchmark was "broken," let's say.

So how do you capture this global branch correlation? As I said, you need to summarize the information about global branch outcomes. We summarize it in a bit vector we'll call the global history register (GHR): a global taken/not-taken history of branches, where each branch outcome is encoded as a 0 (not taken) or 1 (taken). You then make a prediction based on the outcome of the branch the last time the same global history value was encountered. I say the history of "all" branches, but we limit it to some number, say 16. So the GHR is 16 bits, each identifying the direction of one of the last 16 branches executed, and you use this bit vector as an address to index into a table that recorded the outcome seen for that GHR value in the past: the last time I saw this pattern of outcomes, was the branch taken or not taken? Note that we're not using the program counter to ask what happened to this branch last time; we're using the taken/not-taken pattern of the last 16 executed branches as an index into what happened when that same history was seen before. If the branch was taken the last time I saw this history, I predict taken; if it was not taken, I predict not taken. That's the idea. And of course, instead of a single bit per table entry, you may still use two-bit counters for hysteresis; many works do exactly that.

This is the global history branch predictor. It's also called two-level because it uses two levels of history: the global history register, plus the history recorded at that global history register value, meaning what happened to the branch when you last saw that GHR value. Here's the picture. The first level is a global branch history register of m bits: the taken/not-taken directions of the last m branches. Whenever you execute a branch, you shift its outcome into this register and shift the oldest branch's outcome out, so the register always reflects the last m branches you executed. You use it to index into a table of saturating counters, one per history value: with 16 bits of history, the table has 2^16 entries, and each entry encodes the direction the branch took the last time the same history was seen. Each entry could be a single bit, like the last-time predictor, or multiple bits, like the two-bit counters we've seen, for hysteresis purposes. You're going to have exercises on this in your homeworks and maybe even exam questions; it's going to be fun, and it's not that difficult, I think. The "Combining Branch Predictors" paper that I recommended explains nicely why and how this works.

Let's walk through an easy prediction example. Say a branch tests i == 100, and assume you've executed this code multiple times; this for-loop branch is usually taken, since you're going to iterate, say, 100 times. Now look at the global history when you reach the branch testing i: the last few branches executed were the inner branches testing a, and their outcomes were taken, taken, taken, not taken, so the history looks like 1110, and the outcome that followed this history previously was taken. So whenever you see history 1110, you predict taken for the i branch. Then you shift in the taken outcome for i: the oldest bit drops out and the history becomes 1101. You see that history when you fetch the next branch, the first branch of the for loop, which should be taken, because you're just starting the loop again; so whenever you see history 1101 in this code structure, the prediction should be taken. You shift in that outcome, the history becomes 1011, and the next branch should again be predicted taken. So I've essentially simulated part of it, and you can see that you get almost perfect accuracy on this code with this structure.
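Here is a minimal sketch of the two-level global predictor with a 16-bit GHR and 2-bit counters, as just described. This is a simulator-style sketch, not hardware; the update ordering and names are my assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 16
#define PHT_SIZE  (1u << HIST_BITS)

static uint16_t ghr;             /* global history: last 16 branch outcomes */
static uint8_t  pht[PHT_SIZE];   /* one 2-bit counter per history pattern   */

static bool predict_taken(void)
{
    return pht[ghr] >= 2;        /* first level: GHR; second level: counter */
}

static void update(bool taken)
{
    uint8_t *c = &pht[ghr];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* shift the newest outcome in; the oldest falls off the 16-bit register */
    ghr = (uint16_t)((ghr << 1) | (taken ? 1u : 0u));
}
```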
Let me finish with the Pentium Pro branch predictor, and then we'll take a break. We discussed yesterday, or earlier, that the Pentium Pro was the first modern processor to successfully implement the out-of-order execution paradigm; it was an out-of-order superscalar processor. This is what made Intel extremely successful; it was one of the best money-making processors in the market, and it improved performance significantly. The reason is not just out-of-order execution: they also implemented this two-level global branch predictor. It was published in 1991 and implemented by Intel about four years later, almost immediately. The Pentium Pro's version had a 4-bit global history register (not 16 bits, only 4) and multiple pattern history tables of two-bit counters, where the lower-order bits of the branch address determine which pattern history table to use. We may talk about that later, but it's the same idea. As a result of out-of-order execution, superscalar execution, two-level branch prediction, and precise exceptions using the reorder buffer, everything we've discussed so far, with this branch predictor completing the full picture, it was the widely commercially successful out-of-order machine, and it greatly improved performance over the previous Pentium, which was an in-order superscalar processor without this branch predictor; it had a two-bit counter branch predictor. And this is a picture of the Pentium Pro.

Let me summarize before the break. Earlier, we had a program-counter-based direction predictor. Now we change the direction prediction: we don't use the program counter anymore; we use the global branch history to index into the direction predictor. This direction predictor now has to be separate from the BTB, because it's indexed completely differently from the BTB. Let me animate it again: this is what happens with a one-level, last-time-style predictor, and this is what happens when you use global branch history. Okay, this is a good place to stop. Let's come back when the bell rings; we'll try to improve this global branch predictor's accuracy, and then we'll talk about local branch prediction.

Okay, let's pick up where we left off. Is everything good online? Just to jog your memory: we covered the one-level predictor, where you use the program counter to make the direction decision, and then we enhanced it with another level of history: we use the global branch history, plus the history recorded at that global branch history value. Now, you can improve this a little further. There's a paper that came out two years after that seminal paper that introduced two-level prediction; it remains a technical report, never formally published anywhere, but it had a huge influence on the field (the description I showed you earlier is from that paper, for example). It introduced the idea of adding more context information to the prediction: don't just use the global history to make a prediction, but also use information about other things, the obvious one being the program counter of the branch. Basically, take into account which branch is being predicted, in addition to the global history, and this makes a big difference. The result is called the gshare predictor: you use the global history register hashed with the branch PC, where hashing means XORing, for example. You can do a bitwise XOR of the branch address (the program counter) and the global history register, and then use the result to index into the direction predictor, which is called the pattern history table or branch history table. That's the idea.
Why does this make sense? Because you use more context information for the prediction: not just the history register or the branch address, but both, which gives you a better picture of what's going on in the program. You could actually use other things as well; I'm not going to talk about that, but state-of-the-art predictors still rely on this kind of context information. It also gives you something else that's very interesting: if you use only one of the two, you don't utilize your table very well, because the values that one source takes are not well distributed across the table's address space. Say each is 16 bits and the table has 2^16 entries; your branch history patterns alone may not cover the entire 2^16 space. The XOR provides a kind of randomization. It's not just random, it really takes context and useful information into account, but it also distributes the information across different entries in the table, so you get better utilization of the table and can better distinguish cases. Of course, everything has downsides as well as upsides. You get better information for making the prediction and better distribution of information in the memory you've dedicated to your branches, but you've increased the access latency, because there's now an XOR gate in front of the access to the memory that stores your two-bit counters. That's not too bad, and we're going to make it much worse later: we'll see sophisticated branch predictors whose access latency is far higher. In any case, gshare had a big impact on improving prediction accuracy.

So this is what the gshare branch predictor looks like: you take the program counter (the address of the current branch) and the global branch history (which directions earlier branches went), XOR them, and index into the direction predictor. If it says taken, you use the target address from the BTB as your next fetch address. Makes sense, hopefully.
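A minimal sketch of gshare, under the same assumptions as the earlier sketches (illustrative sizes; the XOR of PC and GHR is the one thing the scheme itself specifies):

```c
#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 16
#define PHT_SIZE  (1u << HIST_BITS)

static uint16_t ghr;
static uint8_t  pht[PHT_SIZE];

/* gshare index: hash (XOR) the branch address with the global history, so
 * the same history reached via different branches uses different entries. */
static uint32_t gshare_index(uint64_t pc)
{
    return ((uint32_t)(pc >> 2) ^ ghr) & (PHT_SIZE - 1);
}

static bool predict_taken(uint64_t pc)
{
    return pht[gshare_index(pc)] >= 2;
}

static void update(uint64_t pc, bool taken)
{
    uint8_t *c = &pht[gshare_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    ghr = (uint16_t)((ghr << 1) | (taken ? 1u : 0u));
}
```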
branch actually with this sort of mechanism so basically the key is really identifying uh these uh numbers over here so let me give you how uh local history works so this is again from the paper uh this is a loop closing or loop ending branches history it's always doing 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 as you can see right taken taken taken not taken taken taken taken not taken because the first three times it's taken the last time it's not taken loop iterates uh four times essentially to predict a loop branch perfectly we want to identify the last iteration of the loop essentially so by having a separate pattern history table entry two bit counters essentially uh for each local history we can distinguish different iterations of a loop this works for short loops right so you need to basically keep track of the history of the branch at that program counter because some other branch may not do this right some other loop branch may be doing 1 1 0 1 1 1 0 1 11 1 1 1 0 etc so the reason this works is if you look at these different uh points in uh whenever you do 11 1 0 whenever you execute this branch multiple times you see different histories right so if the branch if the past four branches past four times a branch is executed was 1 0 1 1 you should predict it as taken because you're still in the loop iterating if it's 11 1 it's taken it's 11 1 0 it's taken again only if the last four executions of the branch was not taken taken taken taken the next time you should predict that as not taken right and by having the local history index into different locations you can identify different loop iterations in which this branch is executed that's the idea it's not the easiest way to identify this sort of loop later as we will see people introduce specialized loop predictors they basically try to identify branches that go backward and that iterate like this and they basically have a very specialized predictor for that we will see that but this is one way of handling the local history Okay so let's take a look at how do we capture this essentially you do the same thing you have a per branch you have a history register this history register encodes which direction this particular branch went last n times so it's a per branch history register as opposed to a global history register that collects information from all different branches you associate the predicted outcome of a branch with a taken not taken history of the same branch and you make a prediction based on the outcome of the branch the last time the same local branch history was encountered so it's called the local history or local branch predictor again it uses two levels of history one level per branch history register the second level is the history at that history register value what happened last time I saw this history for this branch did I take the branch did I not take the branch and you can add historicis on top of that okay again the picture looks looks like this it is opposed to having a single global history register now we have many local history registers one per each branch i don't know why this wants to okay maybe it's going to come up again essentially the first level is a set of local history registers each of them is m bits you select the history register based on the program counter of the branch because we want to have per branch information the second level you use the history register you selected and index into a pattern history table this is a table of saturating counters for each history entry essentially what I said 
Okay, so let's look at how we capture this. You have a per-branch history register that encodes which directions this particular branch went the last n times — a per-branch history register, as opposed to a global history register that collects information from all branches. You associate the predicted outcome with the taken/not-taken history of the same branch, and you make a prediction based on what the branch did the last time the same local history was encountered. That's why it's called a local-history, or local, branch predictor. Again, it uses two levels of history: the first level is the per-branch history register; the second level is the behavior observed at that history value — what happened the last time I saw this history for this branch, did I take it or not? And you can add hysteresis on top of that.

The picture looks like this: as opposed to a single global history register, we now have many local history registers, one per branch. The first level is a set of local history registers, each m bits; you select a history register based on the program counter of the branch, because we want per-branch information. At the second level, you use the selected history register to index into a pattern history table — a table of saturating counters, one for each history value — which captures what I said a moment ago: the direction the branch took the last time the same history was seen. Hopefully that lets you identify the ending iteration of the loop, assuming it's a loop branch; there are other cases where this is useful, but loops are the easiest to think about.

So the structure is very similar, and the key is really encoding this information in a clean way so that the hardware doesn't become complex. These history registers encode taken and not taken with a single bit, so you have a bit vector recording what happened to the branch the last n times you executed it, and you index into a table with it — everything stays simple if you think about it. Of course the structures become much larger if n is large: if you wanted a history of 1,024 bits, you would need a 2^1024-entry pattern history table, which is not realistic. That's why the histories were initially 4 bits; we will see 10, 12, and so on, and today they might be 16. There are other methods, as we will see, that make the histories effectively longer — but not this one.

So this is what a two-level local-history branch predictor looks like. You take the program counter of the current instruction, index into an array of local history registers, and pick one based on the PC; it specifies which directions earlier instances of this branch went. You then use that history to index into a pattern history table, which tells you what happened the last time you saw this history for that branch, and you decide taken or not taken based on the one, two, or three bits stored there, depending on how much hysteresis you use.

You can see this is more complicated than before. Let me go back to the earlier picture: with gshare we just took the program counter and the global branch history register, XORed them together, and indexed one table. With local history it takes two steps: you index one table to get the local history register, and then you use that to index another table to get the direction. So this is a bit more expensive than global history, but it does make a difference, as we will see. It is also implemented in modern processors: they usually have a global predictor, a local predictor, some sort of loop predictor, and some other mechanisms, as we'll see soon.
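Here is a minimal sketch of the two serialized lookups, again with illustrative sizes rather than any specific design's parameters:

```python
# Minimal two-level local-history predictor sketch (sizes are illustrative assumptions).
LOCAL_HIST_BITS = 10                 # bits of per-branch history
NUM_HIST_REGS = 1024                 # first level: table of local history registers
local_hist = [0] * NUM_HIST_REGS
pht = [1] * (1 << LOCAL_HIST_BITS)   # second level: 2-bit counters, one per history value

def predict(pc):
    hist = local_hist[pc % NUM_HIST_REGS]   # level 1: pick this branch's history
    return pht[hist] >= 2                   # level 2: counter for that history value

def update(pc, taken):
    i = pc % NUM_HIST_REGS
    hist = local_hist[i]
    pht[hist] = min(3, pht[hist] + 1) if taken else max(0, pht[hist] - 1)
    mask = (1 << LOCAL_HIST_BITS) - 1
    local_hist[i] = ((hist << 1) | int(taken)) & mask  # shift outcome into local history
```

The serialized table accesses are exactly the extra cost relative to gshare: you cannot start the second lookup until the first one returns.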
The paper that introduced the two-level prediction idea actually generalized the concept. I won't go into a lot of detail, but it is really a family of mechanisms. At one end is the purely global predictor I showed you earlier: the history register is global, and the pattern history table of two-bit counters is also global. The global predictor is the simplest — both structures are shared across all branches, so all branches contribute to them. At the other end you have something unimplementable, meaning extremely expensive: per-branch history registers, where every branch has its own history register, and on top of that every branch has its own pattern history table of counters. For every branch you would have separate copies of everything, and that is clearly far too expensive: there can be thousands of branches in a program — in some programs, millions — and you're not going to have hardware to accommodate that. What's realistic is somewhere in between, and there are actually different ways of thinking about that in-between; if you're interested, look at the paper. This paper had a lot of impact on processor design — it was selected as one of the test-of-time award winners at the MICRO conference, and if you're interested you can read the authors' retrospective on the topic. The second author is my PhD adviser.

Okay, so the question is, of course: can we do better? I've gotten you to the 1990s now, and doing better is what has always enabled higher- and higher-performance processors. The answer is yes, of course — if you put enough creative people on a problem, they will come up with good solutions, even for very difficult problems. So let me give you some other ideas. One is to realize that different branches have different types of predictability. Some branches are more predictable using local history — you've seen that with loop branches, short iterating loops. Some branches are more predictable using global history, when global correlation leads to a better prediction; clearly there are cases where global correlation matters much more than local correlation. For others, a two-bit counter is enough — maybe even one bit, because the branch always does what it did last time. An error-checking branch, for example, tests for an error; if it's taken, you're going to quit the program because something went wrong, so it's almost always not taken — and when it is taken, the misprediction hardly matters. A single bit is enough, and that could even be encoded using static profile information.

So the observation is that there is a lot of heterogeneity in the predictability behavior of branches. Don't use one-size-fits-all hardware to cater to everything; when you have heterogeneity, you should have heterogeneous hardware to cater to the different behaviors. There is no one-size-fits-all branch prediction algorithm for all branches, so exploit this heterogeneity in predictability by designing hybrid, or heterogeneous, branch predictors — which is essentially what we have in all processors today. The idea is to use more than one type of predictor and select the best one. Clearly this adds complexity: you design multiple predictors, and you then also need a selector — a meta-predictor — to decide which predictor to use for a given branch. For example, you can have a hybrid of two-bit counters and a global predictor; I'm going to show you a more sophisticated example soon. This, too, was introduced in the technical report I mentioned, which was quite influential. The main advantage is better accuracy, because hopefully you can use the predictor that is best for each branch at any given point in time.
There is also another benefit, which we didn't discuss earlier. If you have a long history, it takes time to warm up the predictor: you need to observe many outcomes at each history value before the prediction bits are trained. Until then, the predictor may make poor predictions, so you may want to fall back on a simpler predictor that warms up quickly. Think about it: with a 16-bit history, if you're exercising many of those history values, you need to see a lot of outcomes to warm up the predictor. This warm-up benefit is really an accuracy benefit too: while the bigger predictor is warming up, the simpler predictor keeps your accuracy higher.

There are clear disadvantages to hybrid branch predictors. You need a predictor to select which predictor to use — so now you're predicting which predictor is better, which is another level of prediction if you think about it, and how you design that becomes an interesting question in itself. You also get longer access latency, because you have multiple structures, plus more hardware and more complexity. Whenever you have heterogeneity, that's usually the price.

This idea was implemented, and it's in all existing high-performance processors. Here is a beautiful paper that I recommend: the Alpha 21264 microprocessor. It was one of the fastest processors of its time — Digital Equipment Corporation was competing with Intel, and this was their flagship processor, faster than Intel's at the time. You can see that their minimum branch penalty was seven cycles, so it's a shorter pipeline, fetching four instructions per cycle, though the typical branch penalty is longer; and the target addresses are stored in the instruction cache, which we may get to later.

This is what the predictor looks like. Let me start with the global side: there is a global predictor with 12 bits of global history — much larger than what Intel implemented in the Pentium Pro — indexing 2^12 entries of two-bit counters; it supplies one prediction using that history. Then there is the local history table: 1,024 entries of 10-bit local histories; based on the program counter you pick one of them, and that's the local history for that branch. You use it to index the local predictor table, which has 1,024 entries of three bits each — based on hysteresis considerations and so on, they decided three bits was the better choice there. So you make two predictions, one local and one global, and depending on which predictor you believe is more accurate, you select the final prediction. How do they decide? They have what is called a choice predictor: it is indexed with the global history, and based on which predictor was better the last time that global history was seen, it selects the predictor to use. This may not be the best heuristic — you can imagine better ones — but it's what they found to work for their system, and it's a reasonable predictor.
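A minimal sketch of this tournament structure, following the sizes quoted above but simplifying everything else (initialization, indexing details, and update order here are assumptions, not the Alpha's exact behavior):

```python
# Alpha-21264-style tournament sketch: global + local predictors plus a chooser.
ghr = 0
global_pht = [1] * 4096          # 2^12 two-bit counters indexed by 12-bit global history
local_hist = [0] * 1024          # 1,024 ten-bit local histories indexed by PC
local_pht = [3] * 1024           # 2^10 three-bit counters indexed by local history
choice = [1] * 4096              # chooser: which predictor was better for this history

def predict(pc):
    g = global_pht[ghr & 0xFFF] >= 2
    l = local_pht[local_hist[pc % 1024]] >= 4
    return g if choice[ghr & 0xFFF] >= 2 else l   # chooser picks global or local

def update(pc, taken):
    global ghr
    gi, li = ghr & 0xFFF, local_hist[pc % 1024]
    g_correct = (global_pht[gi] >= 2) == taken
    l_correct = (local_pht[li] >= 4) == taken
    if g_correct != l_correct:   # train chooser only when the two components disagree
        choice[gi] = min(3, choice[gi] + 1) if g_correct else max(0, choice[gi] - 1)
    global_pht[gi] = min(3, global_pht[gi] + 1) if taken else max(0, global_pht[gi] - 1)
    local_pht[li] = min(7, local_pht[li] + 1) if taken else max(0, local_pht[li] - 1)
    local_hist[pc % 1024] = ((li << 1) | int(taken)) & 0x3FF
    ghr = ((ghr << 1) | int(taken)) & 0xFFF
```

The chooser is trained only when the two component predictors disagree, since otherwise there is nothing to learn about which one is better.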
That was a reasonable predictor for its time, in the late 1990s and early 2000s, and you can see that it's more complicated than everything we've seen so far: we started with no bits at all, now we have a lot of bits, and it's going to be more and more bits.

There's another idea that is very powerful, which is basically: get rid of the things you don't want to spend resources on. This is called filtering. Some branches are easy to predict — don't spend your valuable resources on them; filter them out. Many branches are biased in one direction, for example 99% taken. These branches pollute the branch prediction structures: they make the prediction of other branches harder by causing interference in the prediction tables and history registers. There might be some other branch that is really important and really hard to predict — maybe it's 50% taken — and you're using part of your resources to update the counters and the global history for a branch that's almost 100% taken. So the idea is: don't do that. Detect such biased branches somehow — either through the compiler, which is always a bit dangerous because the compiler may not have perfect information, or in hardware, by monitoring branches and figuring out which ones are easy to predict. You could even add another predictor that predicts which branches are easy to predict. Then filter them out: handle them with a simpler predictor and don't let them pollute the sophisticated structures. This is called branch classification — another influential paper that talks about it.

Okay, so are we done with branch prediction? I'm going to go into more sophisticated ideas. The McFarling technical report I mentioned showed 90 to 97% prediction accuracy on average — again, on old workloads. Today that number is still low for some workloads. Look at graph workloads, for example: in large-scale graph analytics, a lot of branches are very difficult to predict, because the pattern in which you traverse the graph depends on the input — on what you're searching for. Think of a social network: you try to find, say, who is close to you at a given time; you traverse this graph, and the pattern may not be predictable at all. At the time, of course, there was no such thing as a social network. The same paper shows that difficult workloads still suffer a lot — a compiler was one of the difficult workloads then, and it is still a difficult workload today. The maximum IPC with the hybrid branch predictor shown in that paper was 9, assuming all execution resources are unlimited; with perfect branch prediction you could get to 35 instructions per cycle under the same assumption. So branch prediction still limits performance significantly on some workloads — even then, and today it's much worse.

So people kept introducing new ideas, and I'll go through a few of them. First, the loop branch predictor, which I mentioned briefly already. This is a predictor that detects loop branches and says: if this branch has a predictable iteration count, I can filter it out of the other predictors. You detect a loop branch — a backward branch — and record that the last time you saw it, it was taken, say, four times. The next time you see the branch, you start counting again, and when you reach that limit, you predict not taken, because the loop is not going to iterate again. This assumes the same loop has a predictable iteration count, and it works well for loops with small, predictable iteration counts — but there are enough of those that it was implemented in many processors, starting with the Intel Pentium M. I'll show a picture of this, but I won't dwell on it much.
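As a minimal sketch of that counting idea (real designs add tags, confidence, and bounded table sizes, all omitted here):

```python
# Minimal loop-predictor sketch: remember a loop branch's trip count and predict
# not-taken exactly on the final iteration. (Illustrative; table is unbounded here.)
loop_table = {}   # pc -> {"limit": takens per loop instance, "count": takens so far}

def predict(pc):
    e = loop_table.get(pc)
    if e is None or e["limit"] is None:
        return True                      # no learned trip count yet: default to taken
    return e["count"] < e["limit"]       # predict not-taken once the count is reached

def update(pc, taken):
    e = loop_table.setdefault(pc, {"limit": None, "count": 0})
    if taken:
        e["count"] += 1
    else:
        e["limit"] = e["count"]          # loop exited: remember how many takens occurred
        e["count"] = 0                   # restart counting for the next loop instance
```

On the first encounter the trip count is unknown, so the sketch just guesses taken; after one complete loop instance, it predicts the exit perfectly as long as the trip count stays the same.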
We're going to study perceptron branch predictors especially, because this is one of the most successful applications of machine learning to hardware to date. The idea is to use what is called a perceptron — how many of you know perceptrons? Okay. A perceptron is a single-layer network: today we have deep neural networks, and the perceptron is the predecessor of all of them. It's not multi-layer — there are multi-layer perceptrons too, but what we're going to see is single-layer. The idea is to learn the direction correlations between individual branches — or, more accurately, between the history bits and the current branch you're predicting — and to assign weights to those correlations using simple machine learning. It was introduced in another seminal paper, in 2001. We've talked about 1981, 1991, 2001 — we're getting closer to today.

Then people also introduced different history lengths. People understood that a single history length doesn't work well: for some branches you only need to look at nearby history, while for others you need to go a thousand branches back. How do you accommodate such different history lengths? This paper introduced a geometric history length branch predictor, with multiple tables where each table caters to a different history length. Both of these ideas are also implemented in existing processors.

This is the Intel Pentium M. I already described the loop predictor, at least at a high level; it's essentially what I just described, and it is implemented in many processors. And to show you the scale: according to Intel, the branch prediction logic in an existing processor is huge — this whole part of the die is dedicated to branch prediction structures.

Let me take a very quick aside here. We've been talking about conditional branch prediction, but there are also indirect branch predictors in existing processors. When you jump to a register value, the branch is not conditional — you always take it — so the condition is not the problem; the target is, because the target can be different on every execution of the branch. So how do you predict the target? There are target predictors in existing processors for exactly this.
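As a tiny illustrative sketch of that indirect-target idea (not any particular processor's mechanism): the table stores a full target address rather than a taken/not-taken counter, and hashing in history lets the same jump predict different targets in different contexts.

```python
# Tiny indirect-target predictor sketch: PC hashed with global history selects
# a stored target, so one jump can predict different targets in different contexts.
target_table = {}                     # index -> last observed target address

def predict_target(pc, ghr):
    return target_table.get((pc ^ ghr) & 0xFFF)    # None means no prediction available

def update_target(pc, ghr, actual_target):
    target_table[(pc ^ ghr) & 0xFFF] = actual_target
```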
Okay, let me jump into the perceptron, since many of you don't know it. It was developed in the 1950s as a simplified model of a biological neuron — Rosenblatt has a beautiful paper about it. This was when people were first thinking about neural networks: how do I model the behavior of a neuron, and how do I make it useful in computation? How do I design an artificial neuron? People were not even talking about networks at the time.

It is also a simple binary classifier: people thought of neurons as binary classifiers, meaning you take an input vector X and map it to a one or a zero. The input is a vector X — you can think of its elements as bits — and the perceptron learns the linear function, if one exists, of how each element of the vector affects the output, stored in an internal weight vector. It checks how each element of the input is correlated with the output by storing and learning those internal weights. It's very similar to a neural network, except it's a single layer — how many of you are familiar with neural networks? Probably more, yeah. You'll learn this at some point, but this is the simplest you can get; it's not really a network yet — well, a one-layer network. The output is determined by multiplying the weight vector with the input vector and checking whether the result is greater than zero; you can optionally add a bias, which basically says there is a bias toward zero or one.

Let's take an example before the branch prediction context. You could use this, say, to detect a language: the input vector could be the words you see; the weight vector could capture how strongly each word indicates, say, German; and if the dot product of the input vector and the weight vector is greater than zero, you have good confidence that the text is probably German. You're classifying inputs — in this case, whether this set of words belongs to language X or not.

In the branch prediction context, the bits of vector X are the branch history register bits. This could be a global history register; it could be local too, but global is easier to implement, as we discussed earlier. The output is the prediction for the current branch, and the idea is to learn internal weights that capture how each branch history register bit correlates with the outcome of the branch.

Let me give you the pictorial view. You use a perceptron to learn the correlations between branch history register bits and the branch outcome — you are essentially learning a target boolean function of n inputs. There are internal weights that you adjust; the inputs, the x values, are each a bit in the branch history register; and there is a bias weight, which lets you capture that a branch is biased one way or the other — let's ignore that for now. You multiply the x vector with the weights and check whether y is greater than zero. Each branch is associated with a perceptron; a perceptron contains a set of weights w_i, where each weight corresponds to a bit in the global history register and represents how much that bit is correlated with the direction of the branch.
So these weights store the correlations. A positive correlation means the bit has a large positive weight: if that bit is one and it has a large positive weight, the prediction is likely to be one — unless some other bit is negatively correlated, which is why you do the complete calculation. A negative correlation means a large negative weight. And you keep learning these weights as the program executes; there is a training function that adjusts them, which I'll mention but won't go into — you can read about it.

How do you make the prediction? You express the global history register bits as +1 or −1 for taken and not taken — the multiplication needs +1 and −1, because multiplying by zero gives you no information. You take the dot product of the global history register and the weights, which is what's shown here. If the output is greater than zero, that means the positive correlations dominate the negative ones, so you predict taken. Makes sense, hopefully. And as with any neural network or classifier, you need to train it with some training function, which I won't go into — you can read the paper; there are different ways of training it, but the perceptron is simple, so the training function is also relatively simple. Still, it requires multiplications, as you can see; so far we had done no multiplications in branch prediction, and now we're doing multiplications. The prediction function requires multiplications too. And there is the bias weight, which captures the bias of the branch independent of the history.

Putting it together: using the branch address, you index into a table of perceptrons and select the perceptron associated with this branch; you multiply it with the history register, compute y, and check whether it is greater than zero — that gives you the prediction. When the branch outcome becomes available, you take the selected perceptron and adjust its weights. You could replace the perceptron with a multi-layer network of the kind used for other things today, but that would make things even more complex — and this is already complex.
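Here is a minimal perceptron-predictor sketch following that description; the table size, history length, and training threshold are illustrative assumptions (the original paper derives the threshold from the history length):

```python
# Minimal perceptron branch predictor sketch (sizes and threshold are assumed values).
HIST_LEN = 16
NUM_PERCEPTRONS = 512
THRESHOLD = 30                          # training threshold (assumption)
weights = [[0] * (HIST_LEN + 1) for _ in range(NUM_PERCEPTRONS)]  # +1 for bias weight
ghr = [1] * HIST_LEN                    # history bits as +1 (taken) / -1 (not taken)

def output(pc):
    w = weights[pc % NUM_PERCEPTRONS]   # select this branch's perceptron
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], ghr))  # bias + dot product

def predict(pc):
    return output(pc) >= 0              # positive correlations dominate -> taken

def update(pc, taken):
    y = output(pc)
    t = 1 if taken else -1
    w = weights[pc % NUM_PERCEPTRONS]
    if (y >= 0) != taken or abs(y) <= THRESHOLD:   # train on mispredict or low margin
        w[0] += t                                   # bias weight moves toward outcome
        for i in range(HIST_LEN):
            w[i + 1] += t * ghr[i]      # reinforce weights whose bit agreed with outcome
    ghr.pop(0)
    ghr.append(t)                       # shift the new outcome into the history
```

The update rule only trains on a misprediction or when the output magnitude is below the threshold, which keeps the weights from growing without bound.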
So I've shown you the dot product of the GHR and the perceptron weights, compared against zero — and this actually works. This is one of the things I implemented during my PhD, because I was working on what we'll discuss later, memory latency tolerance, and branch prediction was very important to me. I implemented it and immediately saw a 20% reduction in branch mispredictions. It's fascinating — many things don't work when you first implement them, but this is one of those things that just works.

The advantage is more sophisticated learning and better accuracy: now you can distinguish which bits are actually correlated with the outcome of the branch, which is exactly what the weights do. It also enables long branch history lengths, if you can afford the multipliers, because your tables are no longer indexed by the history: the perceptron itself can be wide — it could have 1,024 weights — so the multiplications grow, but the table can stay relatively small. So it gets you better accuracy that way as well.

The disadvantage, clearly, is complexity: you now need a multiplier — essentially an adder tree — to compute the perceptron output. There are also some learning limitations, which can be alleviated with other learning methods: it turns out a perceptron cannot learn linearly inseparable functions. If you don't understand that term, don't worry — it's a bit of learning theory. Essentially, it cannot learn an XOR-type correlation between two history bits and the branch outcome. You can overcome this by making the learning model more complex — more complexity solves some of these things.

So this is a successful example of machine learning in processor design, and it is implemented in some processors; I'll give you a couple of examples. One is a Samsung processor that had a lot of innovative ideas. Later, Samsung decided not to have a processor team anymore, so if you look at this paper, none of the authors is affiliated with Samsung — there is no Samsung name on a paper about the Samsung processor, which is funny to me. These people all worked on the processor and then moved to very different companies afterwards. They implemented a perceptron predictor in that Samsung core, and they also implemented one of our ideas for indirect branch prediction — virtual program counter (VPC) prediction — which you can read about if you're interested; I'm not going to talk about it. It is also implemented in processors that do exist today, namely AMD's: AMD implemented a hashed perceptron — one way of implementing a perceptron — and they report significant branch prediction accuracy from it. These pictures, from one of the websites that analyzes AMD chips, show how the perceptron operates; they essentially illustrate what I said earlier, they're kind of nice, and they're reasonably accurate.

It turns out this AMD processor also implements something called TAGE. So beyond hybrid predictors, they have multi-level branch predictors: a predictor that is relatively quick — the perceptron — makes a decision, and at the same time they start accessing another predictor that is slower but more accurate. If the slower, more accurate predictor says the quick predictor was wrong, they flush the pipeline immediately and refetch. Note that this is still a prediction — the branch has not executed yet; they flush the pipeline based on a more accurate predictor's prediction, not on the branch outcome. You can see how complicated our fetch engines and processors are today: levels upon levels of prediction. Okay, now let's talk about that TAGE predictor that AMD considers more accurate but slower. Any questions? I'm going through a lot of ideas, as you can see — people have come up with a lot of creative ones.
So let me talk about TAGE. This is very interesting, because it turns out to be a little more accurate than the perceptron — maybe we just haven't figured out how to use perceptrons fully yet. It is implemented in many processors, and the idea is to use multiple history lengths; so far we've been using a single one. The observation is very fundamental: different branches require different history lengths for good prediction accuracy. The idea is to have multiple pattern history tables, indexed with global history registers of different history lengths, and to intelligently allocate pattern history table entries to different branches.

It's a bit more complicated, as you can see. You have a base predictor that uses only program counter information. You have another predictor that uses the program counter hashed (XORed) with a short history length; you index into its table, get a prediction, and also match a tag, to confirm that the entry really belongs to this branch. These are relatively small tables, so you need to be accurate in matching that you actually have the right entry. Then you have a table with a somewhat longer history, and so on; in the original paper these history lengths form a geometric series, so that you can go from very short history lengths to very long ones quickly — that's the point of the geometric series. You make many predictions in parallel and then decide which one to use. So it is somewhat complicated: you have many tables — it could be 10 or 20 tables if you want. This was introduced in the geometric history length predictor paper. There are a lot of policies here — which predictor to use, which predictor to update, and so on — that I won't go into; there are a lot of heuristics you need to get right. But if you design this carefully, it can be very powerful and very accurate.
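A heavily simplified, TAGE-flavored sketch of the lookup is below: tagged tables with increasing history lengths, where the longest-history table whose tag matches provides the prediction. The allocation and update policies — where most of a real design's heuristics live — are omitted, and all sizes, hashes, and the fallback are assumptions.

```python
# TAGE-flavored lookup sketch: longest matching tagged history wins.
# (Update/allocation policies, usefulness counters, etc. are deliberately omitted.)
HIST_LENGTHS = [0, 4, 8, 16, 32]          # roughly geometric series (assumed)
TABLE_BITS = 10
tables = [dict() for _ in HIST_LENGTHS]   # index -> (tag, 2-bit counter); dict for brevity
ghr = 0

def fold(hist, bits, out_bits):
    # Fold the low `bits` of a long history into a short index/tag by XORing chunks.
    h, x = hist & ((1 << bits) - 1), 0
    while h:
        x ^= h & ((1 << out_bits) - 1)
        h >>= out_bits
    return x

def predict(pc):
    pred = True                           # fallback if nothing matches (assumption)
    for t, hlen in zip(tables, HIST_LENGTHS):
        idx = (pc ^ fold(ghr, hlen, TABLE_BITS)) & ((1 << TABLE_BITS) - 1)
        tag = fold(ghr, hlen, 8) ^ (pc & 0xFF)
        entry = t.get(idx)
        if entry and entry[0] == tag:     # tag hit: longer history overrides shorter
            pred = entry[1] >= 2
    return pred
```

The loop walks from shortest to longest history, letting longer-history matches override shorter ones — the "longest matching history wins" rule.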
The key idea is really to use different history lengths for the branches that work well with different history lengths; the rest is implementation — but implementation matters a lot too, because you can see there is a multiplexer on the critical path here. This is also implemented in real processors — this picture is again from that website. The advantages: you can choose the best history length to predict each branch, which gives better accuracy, and it enables long branch history lengths — in a sense more efficiently than a perceptron, because you can just add tables and use a geometric (or some other) series of history lengths, so you can be very frugal with hardware. The disadvantage is that the design complexity is not low and getting it right is not easy: there are a lot of heuristics, and you need to choose good hash functions and table sizes to maximize accuracy and minimize latency — and even minimizing latency may not be easy, as we saw with the AMD picture earlier. But this is a very successful and relatively recent idea — around 2006 — that is used in many modern processor designs. AMD Zen 2, for example, had the perceptron as its level-one predictor and TAGE as level two, and a lot of Intel designs also use TAGE, I believe. There's a nice picture of this from WikiChip, which analyzed the AMD processor, but I'll skip it.

There has also been a lot of analysis of how global branch predictors work. These folks, as early as the 1990s, analyzed which past branches you should really use to predict the current branch. It turns out that to predict a given branch, looking at just a few past branches is enough — but isolating those few is the hard part, because they differ for every branch: for some branches those few branches are very far away in execution, for some they are very close, and for others they are distributed almost randomly. So figuring out which branches to take the taken/not-taken information from is not easy. And again, these are old programs; new workloads are quite different. But it's good to do this kind of analysis for better understanding — I think the next big revolution in branch prediction will come from better understanding.

The question is always: can we do better? This is mostly the state of the art of what we have in branch prediction today. It's an ugly picture — a beautiful paper, but an ugly picture — and it conveys the idea. This is TAGE combined with a statistical corrector and a loop predictor: a very hybrid branch predictor. You use the PC and global history in the tagged geometric history length predictor and get a prediction plus a confidence level in that prediction; then you apply some statistical correction methods, which I'm not going to go into, and that becomes part of your prediction. Another prediction comes from other information — global history, local history, and something called skeleton history, which I also won't go into; you can read the paper if you're really interested. Then there's a loop predictor, and the final prediction comes from deciding which of these predictors works better. A lot goes into this — almost everything we discussed is in there, and more. In particular, I'm going to talk about confidence, which is a very important concept. This is from 2014, so now we're in the 2010s, and it's very close to the state of the art in modern processors today.

People are experimenting with more sophisticated ideas, like convolutional neural networks — you can look this up if you're interested. They target exactly what I said earlier: identifying the correlation between a branch's outcome and a large, noisy global history. Ideally you would like to identify which past branches affect this branch, and they use convolutional neural networks over the global history to expose hidden correlations for hard-to-predict branches. I won't go into detail, but this is the same kind of convolutional network used in vision, for example. You use it only for the very hard-to-predict branches, and you predict everything else with the huge heterogeneous monster predictor — and "everything else" is also hard to predict, if you think about it. Even then, there are some branches the monster cannot predict, and this even more complicated thing can actually help with those.
They show good results in this work, which I won't go through. It is not implemented yet — it's very expensive — but there could be a path to implementation in the future.

Okay, let me give you another idea, which is actually very heavily used, for many reasons: branch confidence estimation. The idea is very similar to the meta-predictor we saw: in a hybrid branch predictor you decide which predictor is likely more accurate and choose it; here, you estimate whether the prediction itself is likely to be correct — how confident you are in the prediction. Why would you want this? It's very useful for deciding how to speculate: which predictor to use, or whether to keep fetching on this path at all. Maybe you conclude, "I'm very likely on the wrong path right now, and I don't know how to correct it, so let me stop fetching and at least save energy." Or you could use other methods of handling branches, which we touched on yesterday and won't have time to discuss today — see the backup slides if you're interested. So it's good for decision-making, and it was introduced in a seminal paper — another paper that won a test-of-time award from the MICRO conference, because the idea is implemented in essentially all modern processors.

So how do you estimate confidence? There may be many ways. In this particular paper, they kept a record of correct/incorrect outcomes for the past n instances of the branch, and based on those correct/incorrect patterns, they guess whether the current prediction is likely to be correct or incorrect. (I also notice a misspelling on this slide — amazing that it survived so many years. Whenever you go back and read an older paper, you realize you could have written it better; misspellings are just harder to spot. Let me remove that.)

The example is very similar, if you think about it, to the selector in the Alpha 21264 predictor — these were concurrent works; the Alpha 21264 was being designed when this paper was written. You take the global branch history, XOR it with the branch address, and index a table of confidence counters, which essentially says: based on where I am, I am very confident in this prediction — or not. Again, you can use a lot of information for this purpose. This is useful for deciding which predictor to use, and for deciding whether to keep fetching.

That brings us to another paper from the 1990s that introduced the concept of pipeline gating. The idea is very simple: count how many low-confidence branches you have fetched. For every branch you fetch, you make a prediction and keep fetching; at some point you say, "the probability that I am on the correct path is pretty low, and I don't know which branch to correct because I have so many low-confidence branches in flight, so I'm going to stop fetching — at least I'll save energy." That's the idea, and it can be a good one, depending on the probabilities of correct prediction for the branches you fetch. So that's another big idea for you.
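A minimal sketch combining the two ideas is below. The counter scheme and the gating threshold are illustrative assumptions — the paper's estimator uses patterns of past correct/incorrect outcomes rather than the simple resetting counter used here:

```python
# Confidence-estimation + pipeline-gating sketch (illustrative assumptions throughout:
# a resetting counter of consecutive correct predictions, and a fixed gating threshold).
HIGH_CONF = 15                       # counter value treated as "high confidence"
conf_table = [0] * 4096              # indexed by global history XOR branch address
low_conf_in_flight = 0               # low-confidence branches currently in the pipeline
GATE_THRESHOLD = 3                   # assumed: gate fetch beyond 3 low-confidence branches

def conf_index(pc, ghr):
    return (pc ^ ghr) & 4095

def on_predict(pc, ghr):
    """Returns (keep_fetching, was_low_conf); pass was_low_conf to on_resolve later."""
    global low_conf_in_flight
    low = conf_table[conf_index(pc, ghr)] < HIGH_CONF
    if low:
        low_conf_in_flight += 1
    return low_conf_in_flight <= GATE_THRESHOLD, low

def on_resolve(pc, ghr, was_correct, was_low_conf):
    global low_conf_in_flight
    i = conf_index(pc, ghr)
    # resetting counter: count consecutive correct predictions at this index
    conf_table[i] = min(HIGH_CONF, conf_table[i] + 1) if was_correct else 0
    if was_low_conf:
        low_conf_in_flight -= 1
```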
We're not done with branch prediction yet — people are actually still organizing contests. If you want to participate, it may be difficult at this point in your career, but I've given you a fairly concise summary of where we are in branch prediction, so you could participate and maybe even win if you're interested. The deadline is May 2nd — I would prioritize EDCA, but this is fun, and this sort of contest enables a lot of new ideas in branch prediction, data prefetching, and so on.

I think this brings me to other ways of handling branches, which I'm not going to cover, but you can look at delayed branching in the backup slides, and we have other lectures on better branch handling if you're interested. These are all backup slides; I'm going to put them up, and you can look at different ways of handling branches there. So don't think that branch prediction is the only way — but it really is the dominant way, the one that has been successful, because with prediction you don't need to change anything in software. Okay, have a great weekend — I'll see you next week. [Applause]