Transcript for:
[Lecture 28] Understanding VLIW and Computer Architecture

okay now we're going to start some other paradigms. Actually, we've been talking about paradigms all along, but we're going to talk about a few more before we get to memory. It's going to be exciting, because these are paradigms that have had a lot of influence, as you will see, and that are still influencing the systems we design. We're going to start with VLIW, which is an interesting philosophy. But before that, let me remind you what we have covered in the last month or so. We've covered a lot, as you can see: microarchitecture, pipelining, precise exceptions, out-of-order execution, superscalar execution, branch prediction. Hopefully you're all familiar with these concepts by now; they're all very high-impact concepts, and they're all employed in modern processors. But today we're going to slowly switch gears to VLIW and then talk about more specialized architectures. Does that sound good? Okay, everybody comfortable with these concepts? Sounds like fun? Yeah, I think it's fascinating, but there's more to be fascinated about.

This is what we will cover today, tomorrow, and next week. Today I intend to cover VLIW and systolic arrays; we won't have time for decoupled access-execute, which will have to wait for a separate lecture. Tomorrow we'll do SIMD processing, and next week we'll cover GPUs and then start memory, which is really the biggest bottleneck we're facing.

So, what is VLIW? VLIW stands for very long instruction word architectures. Essentially it's a different ISA, instruction set architecture. It builds upon the von Neumann model, but it slightly modifies it such that an instruction is defined in a different way. You can think of this as a variant of the von Neumann model in which an instruction is defined to encode multiple operations.

If you remember the superscalar execution concept we discussed last week: in superscalar execution, the hardware was responsible for fetching multiple instructions, checking the dependences between them, and then decoding, executing, and finishing multiple instructions. It's basically a multiple-instruction fetch, decode, execute paradigm, but the hardware is responsible for doing all of this. What the compiler generates is sequential code, as we would expect in a von Neumann architecture, essentially what we've been used to, except the hardware extracts parallelism out of it every cycle. Everybody's familiar with superscalar, right?

In VLIW we do essentially the same thing, except the responsibility belongs to the software, meaning the compiler. The compiler is responsible for identifying independent instructions and packing them into a larger bundle. It guarantees that the instructions it puts together are independent of each other, so they can be fetched, decoded, executed, and finished concurrently. That's the idea. So these paradigms differ in whose responsibility it is to find the dependence, or independence, between instructions. Superscalar says: the compiler doesn't need to do anything, I'm going to find all the dependences in hardware. VLIW says: the hardware doesn't need to do anything, the compiler is going to find all the dependences. That's the basic idea, and it actually makes a huge difference in the success of these two approaches: superscalar is very widely employed everywhere, VLIW not so much, although it has influenced a lot of things, as we will discuss, especially in compiler optimizations.
So it's a different angle that VLIW brought into our collective knowledge of how to design computers; it's really about how you design the software stack, even though the computers underneath may not be executing instructions in a VLIW way exactly.

Basically, it's the responsibility of the software to find independent instructions and pack them together. Now the hardware becomes very simple: it fetches and executes the instructions in the bundle concurrently, and it doesn't need to check dependences between those instructions. In fact, VLIW goes another step: it's not just vertical, across what you fetch in one cycle, it's also horizontal, across what you fetch in multiple cycles. VLIW says the hardware should not do any dependence checking at all. If you fetched another instruction in a previous cycle, it's the compiler's job to make sure that the later instructions can go into the pipeline without any dependence checking in hardware. That's the pure VLIW concept. Pure VLIW is very hard to achieve, of course, because it's very difficult to find that many independent instructions; I'm going to give you a flavor of it today and also discuss the difficulties. But basically, there is no need for hardware dependence checking between concurrently fetched instructions in the VLIW model, and in the real, pure VLIW model there is no need for dependence checking at all. The hardware gets exposed: there's pipeline hardware, multiple-instruction fetch vertically every cycle, and horizontally a pipelined machine, and it's the compiler's job to decide what gets executed in which cycle and in which slot of the pipeline.

So the mindset is: simple hardware, complex compiler. This is very different from the mindset we looked at in the last several lectures, where we made the hardware more and more complex; out-of-order execution is actually very complex. You can think of VLIW as putting the responsibility of finding all of the independent instructions, which out-of-order and superscalar do in hardware, onto the compiler. VLIW is the equivalent of out-of-order superscalar execution where the software schedules the instructions, not the hardware. Does that sound good? That's really the most important thing you need to know about a VLIW machine; everything else is how you make it work, and making it work is very difficult.

Pictorially, it looks like this. This is the VLIW concept: you have a program counter, and when you fetch from that program counter you don't get a single instruction like we used to, you don't just get that one add; you actually get four instructions in this particular case. The compiler packed them together in memory, and the compiler ensured that these instructions have no dependence on each other, so they can be executed completely concurrently without the hardware doing anything other than executing them. In fact, the compiler also knows the structure of the hardware: it knows, for example, that this particular processing element can execute loads, so the load instruction is aligned such that when you fetch and decode it, the decoder directly feeds the memory unit. If there were no memory unit in that position, the compiler would not put the load in that slot; if there were a memory unit over there instead, the compiler would put the load in that slot.
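To make the packing idea concrete, here is a minimal sketch of what such a bundler could look like. None of this is from the lecture: the two-slot machine (one ALU slot, one memory slot), the instruction format, and the greedy in-order packing policy are all assumptions for illustration; a real VLIW compiler would also reorder instructions over a much larger window.

```python
# Toy VLIW bundler (illustrative sketch, not a real compiler pass).
# An operation is (opcode, dest, sources); a bundle has one slot per functional unit.
SLOTS = ("ALU", "MEM")                                   # assumed pipeline layout
UNIT_FOR = {"add": "ALU", "mul": "ALU", "ld": "MEM", "st": "MEM"}
NOP = ("nop", None, ())

def independent(op, bundle):
    """True if op has no RAW/WAR/WAW dependence on any op already in the bundle."""
    for other in bundle.values():
        if op[1] is not None and op[1] == other[1]:      # write-after-write
            return False
        if other[1] in op[2]:                            # read-after-write
            return False
        if op[1] in other[2]:                            # write-after-read
            return False
    return True

def pack(ops):
    """Greedily fill bundles in program order; pad unused slots with NOPs."""
    bundles, current = [], {}
    for op in ops:
        slot = UNIT_FOR[op[0]]
        if slot in current or not independent(op, current):
            bundles.append(tuple(current.get(s, NOP) for s in SLOTS))
            current = {}
        current[slot] = op
    if current:
        bundles.append(tuple(current.get(s, NOP) for s in SLOTS))
    return bundles

program = [
    ("ld",  "r1", ("r10",)),       # r1 <- MEM[r10]
    ("add", "r2", ("r3", "r4")),   # independent of the load -> same bundle
    ("add", "r5", ("r1", "r2")),   # depends on both -> forces a new bundle
]
for i, bundle in enumerate(pack(program)):
    print("bundle", i, ":", bundle)
```

Running this prints a first bundle holding the load and the independent add side by side, and a second bundle where the dependent add sits next to a NOP in the memory slot: the compiler finds the independence and aligns the operations to the slots, the hardware just executes, and the cost of unfilled slots (NOP padding) already shows up, which we will come back to later.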
Make sense? So the philosophy is: the compiler knows the structure of the pipeline and the hardware, and it bundles the instructions together such that the hardware doesn't need to do, well, almost anything. No dependence checking, for sure, but not even routing the instructions to the correct processing element where they can be executed. That's the philosophy, and it was first expressed explicitly in this paper. It's a beautiful paper; if you're interested, you can read it. We don't have any assigned readings on it, and your books don't cover this concept, which I think is sad, but I'll give you some hope that this concept may arise again in the future, maybe ten years down the road, because of developments in machine learning. Compilers may not be strong enough today to enable this, but ten years down the road, if they're equipped with much stronger machine learning techniques, this discovery of parallelism could be much more powerful. You can use that as a prediction and see if I'm correct ten years down the road. Conceptually, as you can see, it's beautiful: simple hardware, complex compiler.

Let's go into a little more depth. A very long instruction word consists of multiple independent instructions packed together by the compiler. The packed instructions can be logically unrelated, and this is the key distinguishing factor of VLIW compared to what we will see tomorrow. Tomorrow we'll see single-instruction multiple-data architectures and vector processors, which GPUs build on; they also fetch multiple operations per cycle from the same program counter, except those operations are completely related: they all do the same thing. Here, if you look at this picture, the instructions are completely different. One is an add, one is a load, one could be a square root if you have that instruction; they don't have to have any relationship to each other at all, except of course they need to be executed by the program in that vicinity. In the SIMD or vector architectures we'll see tomorrow, this picture would be add, add, add, add: all of them doing adds, but operating on different data. So that's something to foreshadow. The main idea of VLIW, which we've discussed, is that the compiler finds the independent instructions and statically schedules them, packs or bundles them, into a single VLIW instruction. Remember we discussed static scheduling versus dynamic scheduling: this is completely static, meaning the hardware doesn't need to find independent instructions like we did with out-of-order or superscalar execution.

Traditional characteristics of VLIW: you fetch multiple instructions, you execute them, and you have multiple functional units; this is very similar to superscalar, as we discussed. Different from superscalar, all instructions in a bundle that you fetch together are executed in lock step, meaning all of them start and complete at the same time. This is to make sure that the compiler can schedule instructions based on the pipeline structure: you cannot proceed just because one instruction completes earlier, and the compiler needs to take that into account. This makes scheduling easier at the compiler level, but it also makes performance worse, as we will see. And finally, as I already mentioned, instructions in a bundle are statically aligned so they can be directly supplied to their corresponding functional units; the compiler knows the structure of the processor, so it can do this easily.

Okay, let's take a look at a VLIW example in the ideal case. Here I'm assuming two-wide bundles. This is actually the same example we looked at for the in-order processor and then the superscalar processor, and now it has become VLIW. This is bundle one: the compiler figured out the instructions are independent and put them together. This is bundle two, and this is bundle three. The compiler did all of this, and the hardware is beautiful: you have a two-wide fetch engine and there is no dependence checking. It's not shown here, but all of these bundles can be executed in a pipelined manner, and the ideal IPC is clearly two: there are two instructions in a VLIW word, so you get two instructions per cycle if the compiler was able to fill all of those slots. And as I said, the compiler needs to do this vertically every cycle and horizontally across the pipeline stages. We can have a more sophisticated example; you will see one in your homework, and it's going to be fun.

Okay, let's talk about some of these characteristics a little more. Lock step means, in other words, all-or-none execution: if any operation of a VLIW instruction stalls, all concurrent operations stall. This makes scheduling easier, and in a true VLIW machine the compiler handles all dependence-related stalls; the hardware does not perform any dependence checking. The issue is really this: if you know the latency of every instruction perfectly at the compiler level, maybe you can do this, because at least you can figure out how long each instruction takes, and if all the instructions in a bundle take, say, three cycles, or close enough, you put them together. The problem happens when you don't know the latency of an operation, and there are cases where we don't. We discussed the multiply operation, for example: the latency of a multiply unit can depend on its inputs. If one of the inputs is zero and you detect that, the multiply can take one cycle; you quickly set the output to zero. If neither of the inputs is zero or one, you may actually need eight cycles to do the multiplication, if it's a complex multiplication. So there is a dependence: the latency of the instruction depends on the input data. Does the compiler know the input data? Probably not, at least not the exact input data. We've suffered from this problem in the past when we talked about branch prediction: can the compiler do branch prediction, can the compiler fill the delay slot? Similarly, the compiler doesn't know the exact input data, because it is compiling the program and the input data gets supplied dynamically while the program is running. As a result, the compiler may really not know whether one of the inputs of a multiply will be zero or one or something else, so it essentially doesn't know how long the multiply instruction will take. Now, that's not true for all instructions; an add may always take two cycles, for example, fine. But if there are enough multiplies in your program, static scheduling becomes difficult. Should the compiler assume a one-cycle multiply or an eight-cycle multiply? I don't know.
What's the answer? Well, you do profiling, right? That's what we do with compilers; we discussed this for static branch prediction. The compiler does profiling, uses some input sets, and says: 99% of the time the input to this multiply is not zero or one, so I'm going to assume this multiply takes eight cycles. If the compiler is correct 99% of the time, that's great; 99% of the time the code executes nicely. But what do you do in the remaining 1% of the time? Well, maybe it's okay: in that case the multiply takes only one cycle, and it's okay because there still needs to be some hardware support, and that's what I'm getting at. If an instruction doesn't finish on time, there needs to be some hardware support to make sure that all the instructions in the bundle still proceed together. So this dream of not having any hardware support to avoid stalls, or even to detect stalls, is essentially just that, a dream; it doesn't correspond to reality. There needs to be some hardware support to stall the bundle if the compiler didn't predict the latency correctly, for example.

And this becomes even worse. We were talking about a multiply, and a multiply is not that bad; let's talk about a load. Once we go to memory, all hell breaks loose, and that's true in modern machines. How long does a load instruction take? We're going to see the memory hierarchy soon, but the compiler may assume the load is going to hit in the L1 cache, in which case it may take one cycle. If it doesn't hit in the L1 cache but hits in the L2 cache, it may take 10 cycles; if it doesn't hit in the L2 but hits in the L3 cache, it may take 80 cycles; and if it doesn't hit in any of the caches, it has to go to memory and may take 500 cycles. Now we have a problem at hand: which one is going to happen? This is where all hell breaks loose from the compiler's perspective. Okay, the compiler says: I'm going to profile the program. Fine, you profile the program and try to figure out which load hits in which cache. But this is very much input-set dependent; it clearly depends on what you feed the program. Can you get everything right? Maybe, if you're really lucky. Most of the time you don't get all of those latencies correct, so your profile is not representative of what happens in real life, and the code schedule breaks. This is the real, fundamental problem with these statically scheduled machines: memory operations. We don't know how long a memory operation will take, so it's very difficult to say whether the next dependent instruction can be scheduled three cycles later, four cycles later, 80 cycles later, or 500 cycles later. And the variation is huge, as you can see: it's not one versus eight like the multiply, it's more like one to 500, or maybe 1,000. In some cases the data may not even be in memory; you may need to get it from the SSD, in which case it's 10,000 or 20,000 cycles. So this is the problem with static scheduling: variable-latency operations, in other words memory operations, cause a great problem for static scheduling. The compiler doesn't know what to do. So what do you do? The compiler assumes something, and if the assumption is incorrect, you stall the machine. Who stalls the machine? Not the compiler, but the hardware, because only the hardware knows at run time where the load's data actually is, or isn't.
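As a toy illustration of this dilemma, here is a sketch of lock-step bundle execution where the compiler scheduled for a cache hit but the load actually misses. The latencies and the timing model are made-up assumptions for illustration, not numbers from the lecture.

```python
# Toy lock-step VLIW timing model: every operation in a bundle starts together,
# and the next bundle cannot issue until the slowest operation of the current
# bundle has finished.  All latencies are illustrative assumptions.

def total_cycles(bundles, latency):
    """bundles: list of bundles (lists of op names); latency: op name -> cycles."""
    cycles = 0
    for bundle in bundles:
        cycles += max(latency[op] for op in bundle)   # lock step: wait for the slowest op
    return cycles

bundles = [["load", "add"],     # bundle 1: a load and an independent add
           ["mul",  "add"]]     # bundle 2: uses bundle 1's results

assumed = {"load": 1,   "add": 1, "mul": 8}   # compiler's profile: load hits in L1
actual  = {"load": 500, "add": 1, "mul": 8}   # run time: load misses all the way to DRAM

print("schedule the compiler planned for:", total_cycles(bundles, assumed), "cycles")  # 9
print("what actually happens:           ", total_cycles(bundles, actual), "cycles")    # 508
# The hardware must stall the whole bundle (and everything behind it) for the
# extra ~499 cycles, even though the add next to the load finished immediately.
```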
So that's basically why this dream is a dream: pure VLIW is almost impossible to build in real hardware. You have to have some stall signal saying, this operation is taking longer than expected, so stall the entire bundle, stall the pipeline. Does that make sense? I think it's good to understand these trade-offs between hardware and software, and this is, in my opinion, the real reason, or at least one of the biggest reasons, why VLIW has not been as successful: it doesn't have tolerance to latency, or to variable latency, I should say, if I want to be more accurate. It has tolerance to latency, clearly, if it knows absolutely every single latency in the machine correctly; then it can come up with a schedule, using the compiler techniques we're going to discuss. But all of those compiler techniques break down when you don't know the latencies, or when you have variable latencies. That's why I spent a lot of time on this last part of the slide: what about variable-latency operations, memory stalls? This is really the downside. In an out-of-order machine this is very simple, simple in the sense that it doesn't affect performance as much; you don't have a dilemma. What happens in an out-of-order machine? A load takes 500 cycles; fine, keep fetching instructions and buffering them in the reservation stations. Some of them will execute, some of them may not, and the model handles it without any problem. The hardware cost may increase, of course, if you really don't want to stall even after 500 or 1,000 or 2,000 cycles, but the model itself doesn't have a fundamental issue about what to do: you stall the load and its dependent instructions, and the independent work can execute. So I would argue this is really the biggest reason why out-of-order superscalar execution has been very successful, whereas VLIW has not.

Okay, so let's talk about this philosophy and its principles a little, and then I'll also give you some compiler background. You can see that this is a paper written by Josh Fisher, who was the inventor of VLIW, and you can see the title: a smart compiler and a dumb machine. That's really the philosophy here. Now, this philosophy is actually very similar to what we have seen earlier with simple instructions: RISC, reduced instruction set computers. That philosophy is very similar to VLIW; VLIW extends it to multiple instructions per cycle, just like superscalar extends a scalar processor to multiple instructions per cycle. RISC was developed in the 1970s; John Cocke was the IBM engineer who developed the early philosophy, he actually won the Turing Award for this work, and it was first employed in the IBM 801, back when IBM was at the leading edge of building real systems. The philosophy was that the compiler does the hard work of translating high-level language code into simple instructions; instructions should be as simple as possible, with no sophisticated instructions like matrix-vector multiplication. And John Cocke took this to the extreme: he said the compiler should be controlling the control signals.
So the instructions should essentially be the control signals, right? Which is a very different way of thinking: you expose every single control signal to the compiler. This actually led to a lot of development in compilers, because the compiler needs to reorder simple instructions for high performance, and that's basically what the philosophy of VLIW is about as well. The hardware does little translation and decoding, so it's very simple, and the VLIW compiler does the hard work, on the vertical side too. RISC says: do this at the instruction level and the pipeline level, horizontally. VLIW says: also do it vertically, within the same cycle. And vertically plus horizontally is clearly much harder than just horizontally; you need to do the hard work of finding which instructions can be executed in parallel, and actually, even worse than that, which instructions can be executed when, at what time, and together with which other instructions. As a result, the hardware stays as simple as possible and executes each instruction bundle in lock step.

So why simple? Why are people pushing for this? Clearly, simple hardware has some benefits. People believe, and most of the time this is actually true, that it's easier to design simple hardware, that it's lower power in terms of total design power, and that it can run at a higher frequency (although you could also make your hardware higher frequency with heavy pipelining, at a cost). Getting all of these at the same time is very difficult if the hardware needs to do a lot of dependence checking and so on, but you can get all of them at the same time in a VLIW machine or a RISC machine. It comes at a cost, though: a performance cost. Notice I did not say simple provides performance; performance is now the job of the compiler, and the question is whether the compiler can actually get that performance.

Power is very interesting here. I said low power: simple hardware is low power at any given point in time, it doesn't consume a lot of power. But if it's also low performance, that means the program will run longer. And if the program runs much longer, then even with a low-power processor you may be expending much more energy. If you recall your physics (how many people have taken physics? hopefully I'll see lots of hands): what is energy? Energy is instantaneous power times time; well, more accurately, it's the integral of the power-over-time curve. At every point in time you consume some power, and you integrate that power over time; that's the energy you consume. So you may have a very low-power processor, but if it's extremely low performance, the time is much longer than with a high-power processor. Maybe visualizing this quickly would be useful, even though switching to the document camera lengthens the context-switch penalty, as you can see. It's a simple example, but I think it deserves some visualization, because in my opinion it's very fundamental.
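In symbols, and with concrete numbers that are made-up assumptions for illustration rather than figures from the lecture:

```latex
% Energy is the integral of power over time; for roughly constant power, E ~ P * T.
\[
  E \;=\; \int_{0}^{T} P(t)\,\mathrm{d}t \;\approx\; P_{\mathrm{avg}} \cdot T
\]
% Illustrative (assumed) numbers, not from the lecture:
\[
  E_{\text{low-power}} \approx 1\,\mathrm{W} \times 10\,\mathrm{s} = 10\,\mathrm{J},
  \qquad
  E_{\text{high-power}} \approx 4\,\mathrm{W} \times 2\,\mathrm{s} = 8\,\mathrm{J}
\]
```

So with these numbers the faster, higher-power processor finishes first and ends up spending less total energy, which is exactly the point of the board sketch that follows.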
So on the board: for a low-power processor, the execution curve, instantaneous power (say, in watts) versus time, stays low, and the program ends late, over here. What is the energy you spend? It's the integral of that curve, power times time; that's the energy on the low-power processor (excuse my terrible handwriting; at least I'm saying it out loud as well). A high-power processor on the same axes may look like this: the power is high, but the program ends much earlier. What's the energy? You integrate this curve, and this energy may actually be smaller than the low-power processor's energy. So in the end, if you have a much faster processor that consumes much higher power, it may actually be better for energy. What's the implication? Energy is what determines your battery life in the end, because your battery has limited energy capacity: the high-power processor that finishes your program much faster draws some energy and then goes to sleep, consuming essentially nothing afterwards, whereas the slow, low-power processor ends up consuming more energy overall. Now, what's the downside of the high-power processor? Supplying that power. Say you have a maximum power, call it the total design power, for the high-power processor, and a lower maximum for the low-power processor. This one is clearly low and this one is clearly high, so for the high-power processor your power supply needs to be much more capable, and that has implications on cost, power delivery, and efficiency. So there are a lot of implications here, but hopefully this shows you that low power does not necessarily mean low energy, and low power also does not necessarily mean low cost: even though the high-power processor may cost more, with the low-power processor you may need a much bigger battery to get the same battery lifetime, for example. This was a side note I had to make because of what is coming soon related to VLIW, but it's more general than VLIW; this sort of trade-off shows up everywhere. Any questions? Feel free to ask while I switch back from the document camera.

Okay, so we had a bit of an aside because of this. Simple doesn't mean low energy; that was the whole point of what I just discussed. Simple also doesn't mean low cost, and simple definitely doesn't mean high performance. It does mean low power, in the sense of total design power.

Okay, so you can read this paper; it's beautiful, actually, very nicely written for a paper from the 1980s, and I cannot say that about every paper written in the 1980s. I'm not going to talk about everything in it here; we're going to distinguish VLIW from SIMD more later, but I'll give you a little bit of history. People actually tried to build these VLIW machines. For example, Josh Fisher himself founded the Multiflow company; they tried to build 28-wide machines, meaning you fetch 28 instructions per cycle, there are 28 instructions in a bundle. They built these machines; unfortunately, they did not become commercially successful. Bob Rau is another person, who was at HP Labs.
Before he went to HP Labs, he founded Cydrome and developed the Cydra processors; they also built VLIW machines. Again, these were startup companies; they tried hard, but unfortunately they were not very successful commercially. Transmeta's Crusoe, which I'm going to talk about briefly later as well: these folks were trying to do something very interesting; they were trying to compete with Intel at the time. They took x86 binaries, x86 code, because that's where the software base is, and translated it internally to a VLIW engine, so that you could extract a lot of parallelism without the hardware cost, with simple processors. They tried very hard; unfortunately, they were also not very successful, at least in the general-purpose market.

Where VLIW has been very successful is where you can actually extract this parallelism relatively easily: in digital signal processing, for example, and in embedded processors. You can see that these are the areas where VLIW was quite successful. Also, the total power limits of these devices are relatively low, whereas in a general-purpose computing device, if people care about performance, they may not care as much about power; I want the highest performance, right? But in many of these embedded processors the power budgets are low, and there is a lot of opportunity to extract parallelism out of the code using compiler techniques. So the most successful VLIW machines were these; even AMD and ATI GPUs were initially VLIW, for example, though I'm not sure whether they still are.

Intel is a very interesting story here. When x86 processors were moving from 32 bits to 64 bits, Intel had a lot of internal discussion: do we just extend the x86 ISA, or do we come up with a new instruction set architecture? Intel decided: we're going to come up with a completely new instruction set architecture, we're going to more or less ignore the existing software base, and we're going to move to this IA-64 architecture, 64-bit and completely redesigned. It's not fully VLIW, but it's based on VLIW principles. It breaks some fundamental VLIW principles, in the sense that there can be dependences between the instructions in a bundle, but the compiler tells the hardware which instructions are dependent, so that the hardware's job is easier. It's a mixed model: the compiler doesn't fully guarantee independence, but when it cannot guarantee independence it tells the hardware, these instructions are dependent, or may be dependent, on each other, so be careful. You can see it's a different model; it's called Explicitly Parallel Instruction Computing (EPIC) in Intel's marketing terminology. Basically, there are a few bits in the instruction format that specify explicitly which instructions in the bundle are dependent on which other ones.

In the end, this was not successful. Intel, I believe, is not developing these processors anymore, after a lot of effort went into them, in terms of both software (compilers) and hardware. Even a dominant company like Intel was not able to change the huge software base that was out there for x86. Why? Because there was a competitor, I would say, and that competitor was AMD. What AMD did in this transition was basically to say: we're not going to change the ISA, we're going to develop x86-64. If you've heard of x86-64, it's really AMD's instruction set architecture, which extends x86 with small extensions; it basically just adds 64-bit instructions instead of changing the execution model and the ISA completely.
And Intel had to adopt x86-64 in the end, because otherwise they would lose: there is so much x86 software out there, and nobody, well, I shouldn't say nobody, but a significant fraction of the people who write x86 code or already run x86 code, wants to rewrite their code completely. So there's a lot that's interesting here; in my opinion it's a very interesting story. People have tried VLIW, as you can see.

So let's talk about the trade-offs. The big advantage of VLIW: there's no need for dynamic scheduling hardware, so the hardware is simple. There's no need for dependence checking within a VLIW instruction, so the hardware is simple; no renaming either. And there's no need for instruction alignment and redistribution to different functional units after fetch, so the hardware is simple. Basically, the hardware is simple; that's the big advantage. The disadvantage is lower performance. The compiler needs to find N independent operations per cycle, and also across cycles, don't forget, and if it cannot, what does it do? It inserts NOPs. Once you start inserting NOPs, you lose parallelism: a NOP is a no-operation, it doesn't do anything; it basically says, I could not fill this instruction slot. So you lose parallelism, and the code size also increases, because you need to encode those NOPs somewhere. Actually, a lot of work has looked at how to encode NOPs efficiently in VLIW, which is interesting in itself. Another issue: whenever you change the microarchitecture, the execution width, the instruction latencies, the functional units, you really need to recompile the code. This doesn't happen with superscalar out-of-order execution: somebody comes up with a better out-of-order processor that increases the fetch width and the execution width, and the software doesn't need to care; hopefully you get higher performance without anybody changing any software anywhere in the world. Here, if you really want higher performance, you need to recompile the code, and if people did not take into account that the hardware would change, and the ISA was strict about it, your code may not even run on the new hardware, so you have a real problem. And then, lock-step execution causes independent operations to stall: no instruction can make progress until the longest-latency instruction completes, and this, in my opinion, is the worst part. You can maybe handle the first problem, you can handle the second for sure, but the third is very difficult to handle, as we discussed: variable-latency operations.

So, to summarize: VLIW simplifies the hardware, but it requires very complex compiler techniques, and as a result this compiler-only approach has several downsides that reduce performance. The most important ones we discussed: no tolerance for variable or long-latency operations, because of the lock-step execution; too many NOPs, because the compiler may not discover enough parallelism; and the static schedule being intimately tied to the microarchitecture, so code optimized for one generation performs poorly on the next unless you recompile it. The first one you cannot really alleviate; the second you can try to alleviate with more sophisticated techniques, somewhat successfully, depending on the code; the third is also a nuisance that you basically have to live with.
So all three of these are bad, but for the first one there is no real solution; for the third there is no solution either, though maybe you could be smarter about it.

Okay, so why are we talking about this? Because it has been extremely influential, and I'm going to give you some examples. A lot of compiler optimizations were developed for VLIW, and the optimizing compilers you use today apply those VLIW optimizations to superscalar compilation, because superscalar processors also benefit from them: if independent instructions are discovered and placed together, you don't need to stall. If you feed such code into a superscalar processor, it can process it more easily; it still does dependence checking, but that hardware is kind of wasted if the compiler already did a good job. So this line of work enabled a lot of code optimizations, as we will see soon. VLIW is also very successful when parallelism is easy for the compiler to find, especially in simple, static codes: embedded markets (the automotive market used to be like this, for example, though it's changing), digital signal processors, and early GPUs. Modern GPUs are general purpose, and once you go more general purpose you start executing a lot of code and it becomes irregular. But if you confine your application space such that the code is relatively regular and predictable, VLIW has been extremely successful; that's why it did well in the embedded market.

So let's talk about these compiler optimizations; I'll give you an idea of what the compilers do. These are basic blocks, and this figure is from the paper I mentioned; remember the basic blocks you looked at when we did basic block reordering. Here is some code, then there's a branch with a taken path and a not-taken path, then some other code, then another branch; you can see this is a complex control-flow graph. What VLIW compilers do is find the frequently executed path through this graph and form a trace: they merge the code along that path, assuming it's going to execute together (that's the assumption), and they optimize it very heavily, reordering instructions, finding independent instructions, and so on, assuming that's the only code you're going to execute. Now you can optimize a lot, because you can have hundreds or even thousands of instructions in such a region, whereas a single basic block is maybe five or ten instructions. Of course, when the assumption is wrong, it's bad: you need to patch up the control flow. When you leave the trace, you need to re-execute or redirect execution, basically undoing what was done based on the assumptions. This is called fix-up code, and that's the fix-up code they show in this paper. Your code size increases significantly, but the hope is that if your program executes this path almost always, then you almost always execute very highly optimized code. This is also the fundamental principle behind just-in-time compilers; dynamic compilers all do this. They figure out the frequently executed paths of the code, whether it's Rosetta or whatever dynamic compiler you may be using; they all figure out the frequently executed path, assume it's going to execute, and fix up the code at run time if that assumption turns out to be wrong. If the assumption is correct, which means your profiling input set was representative, that's great: you execute very heavily optimized code.
We're going to see a less high-level example very soon, and I'd recommend this paper because it talks about this. If you think about it, we've covered some of these concepts earlier: a concept that we covered in a very recent lecture was basic block reordering. Do you remember? We were trying to do branch prediction, and the simplest branch prediction mechanism is "next fetch address is PC + 4", basically always predicting not-taken, and we said you can reorder the basic blocks to make that prediction more successful. We're building on that idea here. Likely-taken branch instructions are a problem: they hurt the accuracy of always-not-taken branch prediction, and we tried to solve that. But branch instructions also make static code reordering and scheduling difficult, so we want to get rid of branches, or at least of the taken branches that send you somewhere else. You really want straight-line code, if you will: every instruction executed is the next sequential instruction. This is good for branch prediction, and it's good for the compiler's code optimization as well. So now we're combining multiple concepts.

What was basic block reordering? If you remember, the idea was to convert a likely-taken branch into a likely-not-taken one by reordering the code, that is, by reordering the basic blocks after profiling. And just to remind you what a basic block is: it's code with a single entry point; whenever you enter it, you execute everything inside and then you exit, to one successor block or another (there could be a branch at the end). So you're guaranteed a single entry and a single exit, and such blocks are very nice: with a single-entry, single-exit block that you never enter or exit in the middle, you can optimize the code beautifully, because there are no branches inside. You know that once you enter, you're not going to leave, unless you get an exception or an interrupt; this is where we tie back to the exceptions-and-interrupts lecture. If that block is 10,000 instructions, that's great: the compiler can look at all 10,000 instructions and decide how to schedule the entire block. Beautiful. And this is obviously good for branch prediction too, because you can always predict not-taken.

So here is a control-flow graph; I actually drew this on the document camera earlier. One branch is taken 99% of the time one way and not taken 1% of the time the other way. How do you lay out this code? If you lay it out so that, by changing the sense of the branch, it is not taken 99% of the time, you execute A, B, D as straight-line code. Code layout one leads to the fewest branch mispredictions; this is what I showed in the earlier lecture on the document camera. Code layout two is bad, assuming your profiling is correct. So code layout one makes the branch not-taken 99% of the time; that gives you straight-line code, which is clearly good for branch prediction and also good for code optimization, because the compiler can assume this is the code that executes.

Now we're going to take one more step. We're laying out code such that the frequently executed path is a sequential path, so why not expand the scope? Here the scope is only a single basic block.
So let's expand the scope: why don't we combine the frequently executed blocks so that they form a bigger block? This is called a superblock, perhaps unsurprisingly, and it's used in your compilers; in fact, existing compilers use hyperblocks and other extensions as well. You combine frequently executed basic blocks to form a single-entry, multiple-exit larger block that is likely to execute as straight-line code; I'm going to give you an example on the next slide. The big benefit is that you reduce branch mispredictions, because you're almost always executing straight-line code, and you enable aggressive compiler optimizations and code reordering within the superblock; I'll give a very simple example soon. The downside (there's always a downside) is that you increase the code size. Why? Because you need to add fix-up code, as we will see. It also requires recompilation. And the killer is that it's profile-dependent: if the compiler doesn't have good enough information about dynamic execution, you optimize one path, but at run time you execute some other path, and once you execute that other path, it's probably worse than unoptimized code, because you're executing something totally irregular; the compiler has actually added more irregularity to those paths. Of course, you can try to fix this by optimizing multiple paths, but then things get much more complicated. So now you can see the difficulty of static scheduling, and you can perhaps better appreciate why out-of-order execution has been successful: it avoids all of this. But out-of-order execution also benefits from these optimizations when they work; that's the important thing.

So let's take a look at superblock formation on this more complicated control-flow graph. There's an entry into the loop and an exit from the loop, and you can see this is a loop. There's a branch that's taken 90% of the time one way and 10% of the time the other way; this other branch is taken zero times one way and 90 times the other way (these are absolute counts); this branch over here, an unconditional branch, executes 10 times, and these others are unconditional branches too. So you have execution frequencies on the paths, and the loop branch is taken 99% of the time. If you look at this, you can see the frequently executed path through the loop; that's what you would ideally like to optimize. To form a superblock, you essentially combine these blocks. Now, this is not a superblock yet, because it's not single-entry, multiple-exit: there are multiple entry points (you can enter from here and from here), and we're going to eliminate that. What we have so far is a trace; this is what the original VLIW compilers formed and optimized, and you can get somewhere using traces.
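As a rough sketch of that trace-selection step, a compiler might simply follow the dominant successor edge from a seed block. The block names, edge structure, and profile counts below are invented to loosely mirror the lecture's figure, and real trace and superblock formation uses many more heuristics.

```python
# Toy hot-trace selection: starting from a seed block, greedily follow the most
# frequently taken successor edge according to profile counts.
# CFG shape and counts are made-up assumptions for illustration.

successors = {                      # successors[block] = {next_block: profile_count}
    "A": {"B": 90, "C": 10},
    "B": {"D": 90, "E": 0},
    "C": {"D": 10},
    "D": {"A": 99, "EXIT": 1},      # loop back-edge taken 99% of the time
}

def hot_trace(seed, min_ratio=0.8):
    """Follow the dominant successor until no edge is dominant or we would loop."""
    trace, block = [seed], seed
    while block in successors:
        counts = successors[block]
        nxt, cnt = max(counts.items(), key=lambda kv: kv[1])
        total = sum(counts.values())
        if total == 0 or cnt / total < min_ratio or nxt in trace:
            break                   # stop at cold edges and at back-edges
        trace.append(nxt)
        block = nxt
    return trace

print(hot_trace("A"))               # ['A', 'B', 'D'] -- the frequently executed path
```

The blocks on the selected trace are then treated as one big straight-line region; turning the trace into a proper superblock is the tail-duplication step described next.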
Now, how do you make it a superblock? You add a little bit more. Basically, you have to make sure that you never enter the block anywhere other than at the top. Take this exit to D: zero is the profile count, but in real execution it may not be zero; 10 is the profile count, but in real execution it may not be 10. That doesn't matter; you need to make sure the code still executes correctly in all cases. So if you exit to D, you should not re-enter the block in the middle, because re-entering hinders code optimizations: you no longer have straight-line code, and we really want straight-line code all the way through. So what you do is duplicate the tail: if you exited the superblock, you always stay out of the superblock, and we place another copy of the F block over there before going into Z. If you exited the superblock at C, you always stay out of it, so there's a duplicate copy of that same tail block. That's called tail duplication, and it's one example of duplicating basic blocks to ensure there are no side entrances into the superblock, so that the compiler can nicely optimize it. So this is really the superblock: code with a single entry point and multiple exit points, as you can see. Again, we could go into a lot more detail here, but then you'd need to take a compiler course, which is also fun, in my opinion, but this is not a compiler course. The whole point is that you can optimize the code once you make it straight-line like this, but if, while the code is executing, you exit out of the straight-line code, you execute some other code that has to be added by the compiler.

So let's take a look at one potential optimization with a very simple superblock. This is your original code; I eliminated the branches because it's nicer to look at without them. This is operation A, this is operation B, this is operation C: three different basic blocks. There's a branch here: 99% of the time it goes one way, 1% of the time it goes the other way, and then there's another, unconditional branch that comes back. Basically: if some condition holds, do this; otherwise just fall through the straight-line code. That's what this branch does. After superblock formation, this is what happens: you combine the 99% path, these two instructions, which you can now reorder nicely if you want, and once you exit, you have the other code, but you cannot re-enter. We want to eliminate that edge, so we duplicate this tail instruction; that's the tail duplication. So we now have four instructions in this very, very simple example, the simplest superblock example I could come up with that still shows something really interesting. And the really interesting thing happens next. What do you see here? This multiply and this multiply are doing the same thing, right? On the frequently executed path, you don't need to execute the second multiply; you can basically reduce it to a move. This is called common subexpression elimination: the common subexpression is R2 * 3. The compiler figures this out and says: this multiply can just take the value already in register R1 and copy it into R3. Now it's beautiful: even with this very simple code, we've eliminated a multiplication that might take eight cycles, and a mov is a simple operation, it doesn't even do a calculation. That's the beauty of compiler optimizations, even in this tiny example. But the trade-off is there too: you used to have three instructions, and now you have four.
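Here is a minimal sketch of that common-subexpression elimination pass over a straight-line superblock. The instruction encoding and register names are my own assumptions, and the pass is simplified: it assumes no source register is overwritten between the two occurrences, which holds in this tiny example.

```python
# Toy common-subexpression elimination over a straight-line superblock.
# Each instruction is (opcode, dest, operands); operands are registers or constants.

def eliminate_common_subexpressions(block):
    seen = {}                                 # (opcode, operands) -> register holding the result
    out = []
    for op, dest, srcs in block:
        key = (op, srcs)
        if key in seen:
            out.append(("mov", dest, (seen[key],)))   # reuse the earlier result: just a copy
        else:
            seen[key] = dest
            out.append((op, dest, srcs))
    return out

superblock = [
    ("mul", "r1", ("r2", "3")),   # operation A: r1 = r2 * 3
    ("add", "r4", ("r1", "r5")),  # some hot-path work
    ("mul", "r3", ("r2", "3")),   # operation C: computes r2 * 3 again
    ("add", "r6", ("r3", "r7")),
]
for ins in eliminate_common_subexpressions(superblock):
    print(ins)
# The second multiply becomes ('mov', 'r3', ('r1',)): a potentially eight-cycle
# multiply on the hot path turns into a simple register copy.
```

The duplicated tail outside the superblock keeps its original multiply, since R1 is not guaranteed to hold R2 * 3 on that path; that is exactly why the side entrance had to be removed before this optimization became legal.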
Okay, if you're interested, there's a lot more here that we don't have time to cover; people have developed many compiler optimizations along these lines, and they've been very useful for out-of-order and superscalar execution as well. There's more that we will touch on tomorrow (memory banks, for example), so it gets hairy at the compiler level. You can do all of this, but whenever you need to reorder code, one particular problem shows up: how do you reorder a store over a load? Say you have a load instruction that comes earlier in the original code and a store instruction that comes later: can you move the store earlier, for example? This becomes a huge issue, because the compiler doesn't know the address of the store or of the load if they use addressing modes that depend on dynamic execution, and most loads and stores use addressing modes that depend on register values. Compilers need to be extremely sophisticated to handle this, and it's another reason why even code reordering is not so easy. There's a lot more on static instruction scheduling that we're not going to cover; I'll refer you to these readings. But you can see that this example has stood the test of time; I think it's the best example of the benefit of forming larger blocks, and if you come up with a better one, please let me know.

Okay, let me conclude with a few things on ISA translation, and then we'll take a break. Remember, we've also discussed translating from one ISA to another, internal ISA to get a better trade-off: you can have a programmer-visible, virtual ISA that is translated to an implementation ISA, complex instructions to simple instructions, or, in other words, a scalar ISA to a VLIW ISA, and that's exactly what people have tried to do. For example, Intel's and AMD's x86 implementations translate x86 instructions into simple micro-operations, CISC to RISC, let's say, and Transmeta's x86 implementations translated x86 instructions into secret VLIW operations, in software. They did all of it in software: the techniques I just discussed, and more sophisticated ones, were employed in what they called the Code Morphing Software, and there were a lot of trade-offs. This is what it looked like: on the outside, the operating system and the applications were x86, plus a BIOS; internally, everything went through this Code Morphing Software, which is essentially a dynamic binary translator that translated everything into a proprietary VLIW ISA. They actually built these processors; unfortunately, they don't build them anymore. This is also an example of the semantic gap discussion; I won't go into it in detail again, but basically they changed the semantic-gap trade-off by having a small semantic gap between the high-level language and x86, and then using a software translator and dynamic optimizer that did a lot of optimization, similar to what we discussed, to turn that code into VLIW code, so that the hardware could stay simple. You don't have to change the huge amount of software already written in the world; that's good. You put a humongous effort into the software translator, and the hardware stays simple; that's the good part. In the end, both the software translator and the hardware can perform operation reordering. As I said, unfortunately, they were not successful. What has been successful is this: Apple has done a great job, in my opinion, of doing something very similar for a more complicated target. They translate x86 to ARM, and if you look at the structure of that binary translation (which is not widely advertised), it looks very similar, except the source is x86 and the target is not a VLIW ISA but Apple's ARM ISA. NVIDIA's Denver did something similar, and you can read about it; we've talked about this before and I don't have time to go into it here, but I will mention this one thing: DCO is their dynamic code optimization software.
NVIDIA says that co-designing a hardware processor with a dynamic code optimization software system creates both additional validation exposure and benefits: in addition to what we've discussed so far, the DCO system can be upgraded in the field to address functionality, performance, or security issues. They don't say much more than that, but it's true: you can upgrade your software without changing the hardware underneath. Okay, there's a lot more to cover and we don't have time, but if you're really interested in compilers, this hopefully gives you a glimpse of it. How many people here have taken anything about compilers? It's too early, of course, I know. Well, now you've been exposed to it a little bit. Okay, this is where we'll take a 10-minute break; we'll be back at 15:18 to cover another big concept.