Well, we still have a couple of minutes, so we'll wait. I'm not sure how many more people will show up; it's a Friday and it's cold, so this is the best time to be in a lecture, I think, since it's cold outside. Is it all working? Okay, that's good, cool. But this doesn't seem to be the updated slides; I had the date corrected and it still says 21 November. Oh really, you didn't get it? Okay, let me send it to you. You got it? Okay, yeah, let's update it, because I updated some things beyond the 21 November version. The other lecture room is better because this one is too much; next year maybe we can look for that one, although most people prefer this one. Okay, I got it. Okay, cool, that was quick, thanks. This looks more like it. Everything good? Okay, let's get started.

So today we're going to continue the lecture we started yesterday on bottleneck acceleration, but we're also going to jump into multiprocessors, and let's see if we can cover memory ordering, which is a fundamental issue in multiprocessors. Does that sound good? Okay.

So, just to jog your mind: we covered heterogeneity yesterday. We're going to cover more of that, but our focus will be a bit on bottleneck acceleration. These are some of the readings we were covering, and we were talking about asymmetric multicores; these are some papers we covered. We started with accelerated critical sections. We actually started by accelerating serial bottlenecks and putting them on a large core, which made things faster, and then accelerated critical sections kind of blindly put critical sections on a large core, which also made things faster. Then we kept asking the question: can we generalize this to all types of synchronization bottlenecks in multithreaded applications? And we came up with a solution, bottleneck identification and scheduling, and that made things faster for different types of bottlenecks, as we discussed yesterday; that was the paper. We also asked at the very end yesterday: can we make better acceleration decisions? I mentioned very quickly that you could actually take utility into account. I will not go into detail here, but I will show you something that we briefly discussed yesterday.

So we were talking about bottlenecks, right? This is a parallel program. A parallel program may look like this: in between barriers you may have different critical sections that look like this, and the red parts are the critical sections. This is one type of program; you may not have barriers also, of course. But here the critical path goes through the critical sections, as you can see, and you can accelerate them, and you can accelerate them better with bottleneck identification and scheduling, as we've discussed. Then there's another bottleneck, which is really the barrier over here: the last thread reaching the barrier causes a problem, and if you improve just that last thread a little bit, if you identify it too late, your improvements may not be very high. We kind of discussed this yesterday. So these are essentially different bottlenecks. We've also discussed this case: here you can see that the bottlenecks are combined, you have some critical sections and then a barrier at the end, so the last thread can be accelerated and the critical sections can be accelerated together with it using bottleneck identification and scheduling. We discussed all of this yesterday; both of them can be accelerated. Now, we also touched on the next idea yesterday but didn't go into it a lot, so I will illustrate it a little bit.
There is actually more potential if you look into this sort of application. Maybe get rid of the critical sections: you have a bulk-synchronous parallel program, like we discussed yesterday, and if you know how much work, how much progress, is expected from each thread, you can mark that progress and identify a potential bottleneck much earlier. This is called lagging threads. It may be difficult to do; it really depends on whether you can correctly identify how much progress has been made by a given thread. For example, if you can somehow magically identify that processor 2 has made only 30% progress in its task, whereas the other processors have made more progress, maybe you predict that the thread executing on processor 2 is actually your bottleneck, and you can somehow accelerate it. Well, you know how to accelerate, right? You can ship it to the large core, for example. This is something that is discussed in the papers, and I'm not going to talk about it in detail, but you can also identify lagging threads, essentially, so that you can optimize the execution up to the barrier much more nicely, before something really becomes a limiting bottleneck. Does that make sense? Basically, it's progress determination.

Okay, so the question of course becomes: there may be many of these things happening at the same time in a program, so do you accelerate the bottleneck or do you accelerate the lagging thread? In a program that looks like this you can have both: you may have bottlenecks in critical sections, and you may also identify a lagging thread. In this case they're probably the same, but in other cases they may not be exactly the same. So that's one question, and things become more complicated. Things become even more complicated if you have multiple applications: which application do you accelerate? So far we have not talked about multiple applications in these bottleneck-acceleration lectures, even though we talked about them a lot earlier when we talked about resource sharing. Now we're going to kind of combine both, although I'm not going to go into a lot of detail here. Basically, you may have two applications that look like this: one of them may have a critical section as a bottleneck, another may have a lagging thread or a barrier as a bottleneck. Then you need to make a better decision than what we have done before; actually, we didn't have a mechanism for making such a decision before. Who uses the large core becomes interesting. If there are multiple large cores, of course, maybe you send two things to two different large cores, but as the number of applications increases, the number of applications demanding the large core may become larger than the number of large cores you have.

So in the end, better acceleration decisions probably need to consider both the criticality of a code segment, meaning how important that code segment is, and how much speedup you can get by accelerating that code segment. Both of these are important: if the code segment is not critical for performance, accelerating it is not going to buy you much, and if the code segment would not get higher performance when shipped to a large core, then maybe it is not worth accelerating that much, right? So this paper takes a methodical approach for both bottlenecks, the different bottlenecks we've examined, plus lagging threads, for any running application. I will give you the basic idea, and I like this approach.
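To make the lagging-thread idea a bit more concrete, here is a minimal sketch in C of the kind of progress comparison a monitor might do. Everything in it, the progress counters, the assigned-work numbers, and the "far behind the average" threshold, is hypothetical and purely illustrative; the actual papers estimate progress and make the acceleration decision differently, and in hardware.

```c
#include <stdio.h>

#define NUM_THREADS 4

/* Hypothetical per-thread progress counters (e.g., iterations completed)
 * and the total work assigned to each thread. Obtaining these reliably
 * is exactly the hard part discussed above. */
static const long work_done[NUM_THREADS]     = {  900,  300,  870,  910 };
static const long work_assigned[NUM_THREADS] = { 1000, 1000, 1000, 1000 };

int main(void)
{
    /* Compute the average progress fraction, then flag threads that are
     * clearly behind it as lagging-thread candidates for acceleration. */
    double avg = 0.0;
    for (int t = 0; t < NUM_THREADS; t++)
        avg += (double)work_done[t] / work_assigned[t];
    avg /= NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        double progress = (double)work_done[t] / work_assigned[t];
        if (progress < 0.5 * avg)   /* hypothetical "far behind" threshold */
            printf("thread %d is lagging (%.0f%% done, average %.0f%%)\n",
                   t, 100.0 * progress, 100.0 * avg);
    }
    return 0;
}
```

In practice the difficulty is entirely in the inputs to this comparison, that is, in knowing how much work was assigned and how much progress was really made.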
It's actually a very methodical approach, and I believe this sort of approach needs to be applied to other things. In later works we applied it to hybrid memory management: basically, we developed a utility-based hybrid memory management technique that considers the utility of moving a page from slow memory to fast memory, in terms of how much benefit you would get. We're going to do something similar over here. Basically, our goal is to identify performance-limiting bottlenecks or lagging threads from any running application and accelerate them on the large cores of an asymmetric multicore processor, although of course you could have different methods for acceleration, as we discussed yesterday. This paper introduces a utility of acceleration metric that combines the speedup and the criticality of each code segment. I'm not going to tell you exactly how this is computed, otherwise we would spend a long lecture on a single paper, which I don't want to do today because we have a lot of other fundamental things to cover. But the basic idea is that it breaks things down in an easier-to-compute way, because the utility of acceleration itself may not be directly computable, but pieces of it may be. Essentially, we define the utility of accelerating a code segment of length t in an application of length T as a combination of three things, which may look funny at first, but if you really understand what's going on: this one is the local speedup you would get by accelerating that segment on a large core, this one is the fraction of execution time spent in the segment, the importance of the segment for the application, and this one is the global criticality of the segment, which is a more global measure of what's going on in the machine. There are different ways of estimating each of these, which I'm not going to go into, but I like this methodical approach.

If you do that, there are basically two things this paper does: it identifies bottlenecks and selects the set of highest-utility bottlenecks, and it identifies lagging threads using a different mechanism. Bottleneck identification is done exactly like we did in bottleneck identification and scheduling; lagging threads add a little more sophistication, because you need to really understand how much work is assigned to each thread and how much progress each thread has made on that amount of work, which may not be completely accurate. Once you have the set of highest-utility bottlenecks and highest-utility lagging threads, there's a coordination mechanism that decides which one to actually send to the large core, assuming you have a single large core. If you have multiple large cores the problem becomes a little different, but you can still find the highest-utility bottlenecks and lagging threads and then decide how many cores to use for them. Does that make sense? I wanted to cover this yesterday; I said maybe I shouldn't, but I had to, because this is really the state of the art in identifying bottlenecks, and it's a more methodical approach to understanding the benefit you would gain from acceleration.
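Written out, the decomposition I just described looks roughly like this, with t the length of the code segment, T the length of the application, and Δt and ΔT the reductions in those lengths due to acceleration. The labels follow how I just described the three factors; take the exact notation with a grain of salt and see the paper for the precise definitions and for how each factor is estimated.

```latex
\[
U \;=\; \frac{\Delta T}{T}
  \;=\; \underbrace{\frac{\Delta t}{t}}_{\text{local speedup term}}
  \;\times\;
  \underbrace{\frac{t}{T}}_{\text{fraction of time in the segment}}
  \;\times\;
  \underbrace{\frac{\Delta T}{\Delta t}}_{\text{global criticality}}
\]
```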
The results are actually reasonably good. These are the comparison points: there is a multi-application-aware bottleneck identification and scheduling, that's the red bar, which is what you saw yesterday plus some multi-application awareness, and there's another work that does performance impact estimation, from around the same time. Compared to both of those approaches you get higher benefit in most applications. This is an S-curve, as you can see; these are about 55 workloads here. Usually, whenever you evaluate a mechanism and compare it to some baseline (in this case I actually don't remember what the baseline is, but I believe it's ACMP; it may be ACS, I'm not sure, don't quote me on it, it's in the paper, probably it's ACMP), you see that some applications actually lose performance and some applications gain performance, and this is usually the curve you get. It's called an S-curve because the curve looks like an S. You can see that overall, doing utility-based acceleration provides better performance, but not in all cases, so there's still room for improvement. Makes sense? Okay.

So I think this sort of utility-based metric is interesting and important to use. People have applied this sort of utility to cache partitioning; in fact, that paper won the test of time award at MICRO this year. It's called utility-based cache partitioning. In past lectures, when we talked about caches, we would talk about it; we probably don't have lectures on caches this time, but if you're interested you can look at that also. Later works also try to understand the utility of different decisions. If you can somehow figure out the utility, the performance benefit, of making a decision and compare it to the performance benefit of making other decisions, you can make much more principled decisions. I think the question is how easy it is. This is clearly much more complicated than thread waiting cycles, right? Thread waiting cycles were relatively easy to compute once you have the mechanisms in place to keep track of them; in this case you need to keep track of a whole lot more. And that's true for any kind of utility-based metric: if you look at utility-based cache partitioning, there's hardware you need to add that's more significant than keeping track of how much reuse a cache line has had, for example; and if you look at utility-based hybrid memory management, which is another paper, from 2017, that also adds a lot more hardware. Okay, so the question is of course always: can we do better? But before that, any questions on this? Okay.

Now we're going to tackle another issue, which is something that had been bothering us while we were doing a lot of this work, and the bothering part is really the private data locality that we discussed yesterday. What we do when we accelerate a code section is keep the shared data and shared locks in the same cache, because you accelerate at least the important code sections on the large core. This is good: it improves shared data locality. But it comes at the expense of private data locality, because now you need to transfer the things that are produced locally by a thread and then consumed by the critical section or bottleneck on the large core, from the small core to the large core. So that's the issue here, and this paper develops a mechanism for handling it nicely, I would say, which I think is actually a more general mechanism that we should think about.
Because if you can keep the shared data in a single place and send the operations that work on the shared data to that place, you largely solve the shared data locality problem; now your bottleneck really becomes the private data, and I will show you some examples of that. I think there should be more improvement in this direction also; this is actually a very general problem, so I'm going to generalize it even more. This applies, I think, also to single-threaded applications, multithreaded applications, any application in the end.

So I'm going to introduce something that we call the staged execution model. The goal in parallelism, essentially, is to speed up a program by dividing it into pieces. The idea is usually that you split the program code into segments; you may call them many different things, threads, tasks, dynamically formed tasks, critical sections, etc., but we're going to call them segments in this particular case, to be general. Your goal is to run each segment on the core best suited to run it, and we've kind of done that in earlier works. One way of doing this is to have each core be assigned a work queue storing the segments to be run. This doesn't have to be the case, but we're going to kind of assume it; you don't have to assume it in the end. The benefits you get, as we have seen in earlier works: you can accelerate segments or critical paths using specialized or heterogeneous units, which is what we've seen in the last lecture. You clearly exploit inter-segment parallelism even if you don't ship anything to a large core; the staged execution model, segmenting a program, buys you parallelism across different cores, which is classical multithreading. And if you partition your program nicely, such that segments that operate on the same data are scheduled onto the same core, you can improve the locality of within-segment data; that's what we get by sending critical sections that operate on the same shared data to a large core. So you can get a lot of benefits, which is nice. If you look at these benefits, they are very fundamental concepts: going from bottom to top, you get locality, you get parallelism, and you get heterogeneity, or specialization.

Now, there are examples of this; clearly, we've discussed a lot of examples. Multithreading is a very general example of it, but also accelerated critical sections, bottleneck identification and scheduling, the pipeline parallelism and producer-consumer models that we've discussed, and other task parallelism models. These are old slides, but people have been using interesting task parallelism models, and they're still in use, to parallelize workloads. Special-purpose cores and functional units can actually be thought of this way also, because if you think about a single program, a specialized functional unit is really executing a segment of the program (okay, maybe "core" should be generalized also) on a functional unit that's specialized for its execution. So if you think about it, this model can be broken down into even smaller pieces than cores. We're not going to do that here, but I think it's good to think about it.
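As a toy illustration of this "each core has a work queue of segments" organization, here is a minimal sketch in C. The segment_t structure, the per-core queues, and the way work is enqueued are all invented for illustration, and the "cores" simply drain their queues sequentially here; none of this is taken from any specific paper's implementation.

```c
#include <stdio.h>

#define NUM_CORES 3
#define QUEUE_CAP 8

/* A "segment" here is just a piece of code plus the data it operates on. */
typedef struct {
    void (*run)(int arg);   /* code of the segment */
    int arg;                /* data (or a handle to it) */
} segment_t;

/* One work queue per core: segments of the same type go to the same core,
 * so the data that segment type touches tends to stay in that core's cache. */
static segment_t queue[NUM_CORES][QUEUE_CAP];
static int       qlen[NUM_CORES];

static void stage0(int a) { printf("core 0: S0(%d)\n", a); }
static void stage1(int a) { printf("core 1: S1(%d)\n", a); }
static void stage2(int a) { printf("core 2: S2(%d)\n", a); }

static void enqueue(int core, void (*f)(int), int arg)
{
    if (qlen[core] < QUEUE_CAP)
        queue[core][qlen[core]++] = (segment_t){ f, arg };
}

int main(void)
{
    /* Each loop iteration is split into three segments;
     * each segment type is bound to one core. */
    for (int i = 0; i < 4; i++) {
        enqueue(0, stage0, i);
        enqueue(1, stage1, i);
        enqueue(2, stage2, i);
    }

    /* Stand-in for the cores draining their queues (in reality, in parallel). */
    for (int c = 0; c < NUM_CORES; c++)
        for (int i = 0; i < qlen[c]; i++)
            queue[c][i].run(queue[c][i].arg);
    return 0;
}
```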
Okay, so let's take a look at the staged execution model. Assume you start with a single program; I'm just showing you the loads and stores. You split, or someone splits, the code into segments: it could be the programmer, it could be the compiler, it could be the hardware. Assume the programmer has done it. Now you have segments S0, S1 and S2, and they communicate with each other. Maybe segment zero operates on one type of data, you can think of it that way, some similar arrays, and then it communicates something to segment one, which communicates something to segment two. Then you assign instances of segment zero to one core, instances of segment one to another core, and instances of segment two to yet another core, and you can hopefully exploit inter-segment parallelism. This is one way of spawning segments: when you execute segment zero and get to the end, maybe you spawn segment one on core one and segment two on core two. It's going to look very similar to ACS very soon, right? Pipeline parallelism is also kind of like this; maybe you don't necessarily have a spawn instruction.

Okay, so let's take a look at two examples. Accelerated critical sections: the idea is to ship critical sections to the large core of an asymmetric multicore processor. You can think of segment zero as the non-critical section and segment one as the critical section; any critical section is segment one, any non-critical section is segment zero, and you can further classify different critical sections, of course, but this is the general idea. The benefits you get are clearly faster execution of critical sections, reduced serialization, and improved lock and shared data locality, as we have discussed. Producer-consumer pipeline parallelism we've also seen, but to a lesser extent perhaps, and we've seen the idea multiple times before: you split a loop iteration into multiple pipeline stages, where one stage consumes data produced by the previous stage, and each stage runs on a different core. Usually a stage has a set of data that it operates on, so it's really exploiting locality; in this case segment N is really stage N, and you can have millions of segments potentially, assuming you can do the parallelization nicely. The benefit is really stage-level parallelism now, and you also get better locality, because hopefully each stage operates mostly on its own data. That doesn't have to be the case, of course; you may not get the locality benefit, but you can still get good parallelism this way, so you get faster execution.

The problem with all of these models is the locality of inter-segment data, which we have discussed; I'm going to show it a little more rigorously. Essentially, if these stages execute on different cores, when stage one requires data that is produced by a prior stage, it gets a cache miss, and this cache miss, as we have discussed in earlier lectures, is a communication miss, or a coherence miss, as we will see when we cover coherence later on. You don't get rid of these cache misses by making caches bigger; there's no way to manage the cache better. You need to do something else about it, you need to be more proactive. The reason it's a cache miss is that another stage is producing that data, and when this stage starts executing the load, it will detect a cache miss and send a coherence request to the cache that houses the latest value of that block. Makes sense, right? We're going to see coherence methods later on, so keep that in mind. Okay, so as stage two executes its load of Z, it will also get a cache miss, because this is again a communication miss: the data is produced by a prior stage, in this case nicely by stage one, but it doesn't have to be.
one but it doesn't have to be it could be produced by some other St also right maybe stage minus one it's not in the slide okay basically uh in in acced critical sections as an example critical section incur a cach miss when it touches data that's produced in the non-critical section right thread private data we've seen that in producer consumer pipeline parallelism we never explicitly said that but a stage incurs a cach miss when it touches data produced by the previous stage and performance of stage execution is limited by inter segment cache misses in the end and we have some anal of this in the paper take this with a grain of salt these are with small core results essentially If you eliminate all inter segment cash misses you can get significant performance we're going to get very close to that uh performance but this problem actually becomes bigger as you scale up the number of cores as you make things more finer grain that's why I say don't get stuck on these small numbers over here 10 to 20% becomes much bigger as as the number of cores increases and as you want to paralyze even finer grain okay so let me Define some some terminology I'm going to give you a mechanism to do this but this is just one mechanism there could be other mechanisms that people develop and again I will also ask you to think about we're going to develop a hardware software codesign method but I believe actually this can be done purely in Hardware in a nice way as well uh maybe purely in software that's very tough I think but purely in Hardware is easier and maybe there are other mechanisms so let's define some terminology so I'm going to Define inter segment data as cache block we're going to still operate on cach block grity since operate on cash block gr lat today but maybe if you go into processing in memory we should not be thinking of cash blocks right it's good to uh uh let's say distruct some of these Concepts uh as we go into different environments uh but we are still processor Centric in this uh it's inter segment data is a cache block that's written by one segment and produced by the next segment so clearly cache blocks that house uh address Y and address C are inter segment data and ideally we would like to identify them and do something about them such that uh the processors that need them don't get a cach m that's the idea okay we also identify Define a generator instruction as the last instruction to write to an inter segment cache Block in a segment so clearly why is an why is part of an inter segment cash block and its generator instruction in this segment is the store instruction right and Z is also an inter segment cache block and it's generated instruction is the store instruction now there could be multiple generator instructions which makes things more interesting and complicated because of control flowing programs right you may have if andl structures and if part of the program can produce one data with one store instruction and else can produce another data with another store instruction clearly when you're executing the program only one of will be executed So based on the branch you've taken right but there could be multiple generator Instructions make sense and that's going to be part of the inaccuracy of the mechanism that we're going to discuss soon okay yes yeah the instruction so uh that's fine uh basically the definition of the generator instruction is it's the last instruction to write to a particular cache block not the last instruction to execute so in this case it happens that 
Yeah, true, that's right, exactly. We're going to identify these generator instructions, and you can identify them through dataflow analysis in a compiler. Okay, yes, maybe this figure should be nicer so that the generator instructions are not also the last instructions, but okay.

So basically, this work makes a key observation: the set of generator instructions is stable over execution time and across input sets. The generator instructions that generate these values are producer instructions, you can also think of them that way, and there are consuming instructions in other segments. If you can somehow identify these generator instructions (we're going to identify them in software, actually, but you could do this in hardware also, I think), then hardware can record the cache blocks produced by the generator instructions and proactively send those cache blocks to the next segment's core before initiating the next segment. That's the idea here, and that's the idea discussed in this paper. Again, I would like to blur the hardware-software boundary here. Earlier I said maybe you cannot do this in software; in today's architectures, yes, it's difficult to do in software, but you could potentially think about recording the cache blocks produced by generator instructions in software, assuming you have a hardware buffer that's visible to the software, and you could have a push instruction that pushes from that buffer to some other core that's executing the next segment that needs the data. So you could imagine a programming model, assisted by hardware, that could enable this purely in software, assuming that hardware is already in place; in current hardware you don't have that, unfortunately. You could also think about it this way: the fundamental principle is really passing the data to another place. It's kind of like passing a message, in a sense; well, maybe that's not a good analogy. You're really pushing the data to some other place, as opposed to pulling the data. Existing models of execution are usually based on pulling the data: the program basically says load, store, load, store. A store is actually kind of pushing the data, but not to another processor, usually to memory. If you had mechanisms for communication that push the data from one processor to another processor, then you could enable a lot of this in software as well, which is very interesting.

Okay, I kind of got ahead of myself, but let's take a look at the mechanism; I think I gave you the basic idea, so it's going to be very easy from this point. A compiler or profiler (and hopefully not the programmer, but again, the programmer can if they want to) identifies these generator instructions and inserts marshal instructions. We're going to talk about what those are; marshal instructions basically push the data to the other core that is going to need it. You generate a binary containing generator prefixes and marshal instructions, and the hardware, when it sees these instructions and prefixes, records the addresses produced by generator instructions, and when it sees a marshal instruction, it marshals the recorded blocks to the next core. Okay, so there need to be destinations, etc.; there are some interesting issues here that I may gloss over, but you can ask questions. Okay, let's take a look at how the compiler can do this.
Actually, it doesn't even need to be profiling, but we did profiling in this particular case; you can figure out exactly which instructions are generator instructions in the binary by doing static analysis, assuming you can disambiguate addresses, which may not always be possible depending on the programming language and the memory access semantics you have. Okay, so in this case, for example, if you profile the program, you will find that this load of Y touches data that is produced by a prior store, and clearly this is inter-segment data. You can identify inter-segment data easily, and you can mark this last store as a generator instruction, not the earlier store; in this particular case you could mark both of them, but then you might be sending some useless data unnecessarily. So, assuming you've identified the generator instructions, you insert marshal instructions. A marshal instruction needs to specify when to send: the existence of the marshal instruction basically says this is the time to send the data, so you put it in an appropriate place. It also specifies where to send the data, in this case core 1. This needs to be some virtualized core ID; it cannot be physical, otherwise programmers would really be doing physical assignment onto the cores, so there are issues like this that need to be taken into account. But you produce a marshal instruction, and it doesn't have to be the last instruction either; it could be somewhere in the middle. In fact, it's better if you can schedule it much earlier, because if it's the last instruction, the segment is ending, the next segment is probably going to start very soon, and you're not going to overlap a lot of latency.

Okay, so that's the idea, and clearly this spans the whole stack. In this particular implementation, the profiler or the compiler needs to identify generators and insert marshal instructions; the ISA needs to change also, since you need a generator prefix and you need to add these push instructions (I like calling them push instructions, but they're marshal instructions); and, as I said, the library and hardware need to change such that you can bind the next segment's ID to a physical core, so the programmer doesn't need to know which core a particular segment is executing on. There's some hardware that I will not go into in a lot of detail, but we're going to use this marshal buffer, which stores the physical addresses of the cache blocks to be marshaled; you're going to see an example of it. We see that you don't need a whole lot of entries for this, but in some cases you do: there are communication patterns in programs where you may need more than 16 entries, and that's where we're not going to get a lot of performance, although if you increase the size you get a lot more performance. Clearly you need to be able to execute the new instructions, and you also need the ability to push data into another core's cache, a capability that is not exposed to programmers much in many ISAs today.
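To make the hardware side a little more concrete, here is a small functional sketch in C of what a marshal buffer could do. The 16-entry size matches the number I mentioned, but the function names, the way the destination core is passed, and the printf standing in for "push the block into the other core's cache" are my own stand-ins for illustration, not the actual microarchitecture described in the paper.

```c
#include <stdio.h>

#define MARSHAL_BUF_ENTRIES 16   /* size evaluated; some workloads want more */
#define NUM_CORES           4

typedef unsigned long paddr_t;

/* Per-core marshal buffer: physical addresses of cache blocks written by
 * generator instructions since the current segment started. */
static paddr_t marshal_buf[NUM_CORES][MARSHAL_BUF_ENTRIES];
static int     marshal_len[NUM_CORES];

/* Conceptually executed when a store carrying the generator prefix retires. */
static void record_generator_store(int core, paddr_t block_addr)
{
    if (marshal_len[core] < MARSHAL_BUF_ENTRIES)
        marshal_buf[core][marshal_len[core]++] = block_addr;
    /* If the buffer is full the address is simply dropped: correctness is
     * unaffected, the consumer just takes a normal cache miss. */
}

/* Conceptually executed when a marshal instruction (or CSCALL, in the ACS
 * case) is seen: push every recorded block to the destination core's cache. */
static void marshal(int src_core, int dst_core)
{
    for (int i = 0; i < marshal_len[src_core]; i++)
        printf("push block 0x%lx from core %d into core %d's cache\n",
               marshal_buf[src_core][i], src_core, dst_core);
    marshal_len[src_core] = 0;
}

int main(void)
{
    record_generator_store(0, 0x1000);  /* generator store to y's block      */
    record_generator_store(0, 0x2040);  /* generator store to another block  */
    marshal(0, 1);                      /* marshal: send them to core 1      */
    return 0;
}
```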
Okay, so there are advantages and disadvantages. One advantage is that you hopefully get timely data transfer, so you can think of this as a prefetching mechanism, except the prefetching is not done by the core that needs the data; it's really done by the core that produces the data, a producer-induced prefetch in a sense. You can think of it that way, but I still like calling it a producer-induced push of data to somewhere else. Basically, you're pushing the data to a core before that core needs it. Another advantage is that you can marshal any arbitrary sequence of cache lines. We actually looked at ways of doing this with pattern identification; for example, if you know which cache blocks are going to be touched, you can record them and replay them, kind of like a memorizing prefetcher. But in this case we're really executing the generator instruction and then sending the data, so it's more arbitrary, more general, and it actually has low hardware cost: essentially, there's no need for hardware to find the generator instructions. As I said, though, that has a downside too, since now you need to modify the software; it may actually be better to have a completely hardware approach, because then you don't need to modify a lot of other things. I don't think it's that nice that I keep adding hardware for everything, but there are realistic reasons why you may want to just modify the hardware: you don't modify anything else in the system.

Okay, so there are clearly disadvantages, like profiling and ISA support, and you're not always accurate, because what we do is profiling-based identification; even without profiling-based identification it may not be accurate, depending on the path taken, and the generator set is really conservative in our case. If you push some data that's not needed by the next core, you essentially cause pollution in that core's caches, and you also waste bandwidth on the interconnect. Clearly this is not good for performance or energy, but we find that it's not a huge problem, because the number of inter-segment blocks that are communicated is usually small.

Yes? Yeah, offline; profiling is offline, basically, yes, that's right. This particular approach is offline; you could of course have online profiling as well, but that's not what we examined. I think online software profiling is probably a bit heavy-handed, but online hardware profiling, if you do the profiling in hardware, is probably easier to do, and it's probably better for performance because you don't have the overhead of profiling things in software. Yes, exactly, something like that, essentially. Yeah, those are all interesting things, and just-in-time compilation could also be employed, but if you keep adding more and more fine-grained work to just-in-time compilers it may get expensive; that's why we went with static profiling here. Of course you always have the downsides of profiling, as we discussed in earlier lectures: is the input set representative, and so on.

Okay, so let me give you an example of how this works, very quickly. These are the binaries generated for the two segments here, non-critical and critical; this is an example with ACS. We add a marshal buffer to the small cores. When the non-critical segment executing on a small core gets to a generator instruction, it puts the address of the cache block written by that instruction into the marshal buffer, and there could be multiple of these in the marshal buffer. When it gets to the CSCALL instruction, which is the critical-section call for a particular critical section, there is special hardware that, as this instruction executes, goes through the marshal buffer, takes each address, accesses the small core's cache, takes the data, and sends the address and data to the large core's cache.
Whether that is the L2 or the L1 depends on the analysis you've done, well, on which one you want to push the data into; ideally the L1, actually, but we're showing L2s here. Okay. And hopefully, when the critical section starts executing on the large core, it gets a cache hit when it does the load of Y. It all depends on the distance, of course, but you start the access earlier, and the critical section may not start executing right away, because it may be delayed, especially if it's a contended critical section: there will be other instances of the critical section executing on the large core. So the timeliness of this prefetch actually becomes better because of contention, and ideally you want to send the contended critical sections to the large core anyway, so what you're trying to do with bottleneck acceleration actually goes along nicely with what you're trying to do by pushing the data.

Okay, so we evaluated this using workloads similar to ACS, and it improves the performance of ACS significantly: on some workloads more, on some workloads it doesn't matter, maybe the problem is not big there, but clearly on the workloads where it matters you get the performance benefits, and the ideal is the blue one; we get very close to ideal, so it's an almost perfect mechanism, which is nice. Yes? Yes, but you can identify that, right? Maybe when you get a marshal instruction... so, one part that I omitted is that when you get a marshal instruction you need to do the translation from the virtual core ID to the physical core ID, and there you can easily identify whether you're on the same physical core. Yeah, that's a good point; I think you want to avoid extra pushes into your own cache. Okay.

So let's take a look at an example with pipeline parallelism also, because this is actually more beneficial with pipeline parallelism: there tends to be more communication in pipeline-parallel workloads, because the workload itself consists of these pipelines, whereas critical sections may not be the major thing even in critical-section-intensive workloads. Here we add marshal buffers to the small cores, and again, when segment zero gets to a generator instruction, it stores the address in the buffer, and when it gets to a marshal instruction, it accesses its own cache, gets the data associated with address Y, and marshals the data to the next core, into the corresponding core's cache. Hopefully, again, when the next segment gets to the instruction that loads Y, it gets a cache hit, or at least reduced access latency. We evaluated this with the pipeline-parallel workloads, and you get more benefit here, and because there is more communication, a 16-entry buffer doesn't work as well for some workloads; you still get very close to ideal, as you can see, but you can improve this further by adding a bigger marshal buffer for the workloads that need it, because more communication happens across segments, which is interesting. There's more analysis in the paper that I will not get into, but you can see that the coverage is pretty high here, you can cover almost all misses; accuracy is relatively low, because of the profiling, so you could do better there; and timeliness is usually quite good, meaning whenever you want to access data that is inter-segment data, you essentially get a cache hit.
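For reference, the three metrics I just mentioned can be written down roughly as follows; these are the usual prefetcher-style definitions, so check the paper for the exact forms used there.

```latex
\[
\text{coverage} = \frac{\text{inter-segment misses eliminated by marshaling}}{\text{inter-segment misses without marshaling}},\qquad
\text{accuracy} = \frac{\text{marshaled blocks actually used by the receiving segment}}{\text{all marshaled blocks}},\qquad
\text{timeliness} = \frac{\text{used marshaled blocks that arrive before they are accessed}}{\text{used marshaled blocks}}
\]
```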
Mediocre accuracy could be bad for performance, but we see that that's usually not the case here; it is bad for energy, of course, because whenever you're not accurate, you're transferring something and wasting energy on the interconnect. Okay, any questions? I think there's more work to be done in this area; don't be fooled by the close-to-ideal results, because there are many different types of applications that communicate, and I think integrating this in a general-purpose way is actually very important. Okay, yes? That's right, yes, exactly; usually it's... well, actually, it could be that the data is still in flight or not in flight yet. We didn't break it down that much, at least in this particular picture; it basically means that whenever you issue the access, the data is not in the cache. It could be mostly done, that's right, exactly. Yeah, and there's also scope to improve accuracy, and there are different workloads: we examined this set of workloads, but other workloads may have much more complex communication patterns, and those exist, I know that very well. Yes, in this case I think it's a ring-based network, a bidirectional ring, so the answer is yes, but not as much as what you would support with some other network, as we will see in the interconnection networks section. Okay.

Okay, so we actually have interesting scaling results in the paper: the performance improvement of data marshaling increases with more cores, higher interconnect latency, and larger private L2 caches. The interesting thing is that all of these are happening, meaning this is an idea that could actually be much better in the future. Whenever you analyze an idea, I think it's good to look at the trends: clearly we're adding more cores to systems today, we have higher interconnect latencies today, and we have larger private L2 caches, although with memory-centric computing we want to get rid of especially the last two as much as possible; more cores are still useful, I think, because you want processing near memory. Still, these trends are a bit hard to change, and the main reason this matters is that with all of these trends, inter-segment data misses become a larger bottleneck: if you have more cores, you parallelize your program more and you get more communication; if you have higher latency, you get longer stalls due to communication; and, as I said earlier, as you get a larger L2 cache, these misses are the type of misses for which larger caches or better cache management are useless, because they are not capacity, conflict, or cold misses. Well, you could consider it a cold miss from the perspective of the core that executes the segment, that executes the load; yes, that's true, but from the global perspective of application execution it's not a cold miss, because somebody else is generating the data that causes the miss. So basically it's a communication miss, and I think these misses are going to be more and more important in the future; that's one of the reasons I spent some time discussing this. It's a harder area to do research in, but I think it's a very important area.

In the paper we also discuss other applications that I'm not going to talk about. This could be generalized to other staged execution models, like the task parallelism models that people use in the field; I'm going to briefly talk about that when we talk about multiprocessors, I think.
There are also special-purpose remote functional units, and there are other approaches in the literature that are interesting that I'm not going to talk about. I think if you reduce the cost of this sort of communication, including with this approach, it could be an enabler for more aggressive staged execution, meaning more aggressive parallelization. This is kind of what people really wanted to do when they initially started with multiprocessing: you really want to parallelize your program as much as possible, one instruction on one core, or one functional unit; what's the difference, really? Nvidia used to call their execution lanes cores, and I used to joke with them that I don't think they're cores, so GPUs used to have these many, many "cores"; but maybe from that perspective a functional unit can be considered a core also. It's good to think about that. Basically, the point is that you lower the cost of data migration or communication, and this is really an important overhead in remote execution of code segments. You can always think about parallelism as remote execution: what does parallelism mean? You're really executing some piece of code somewhere else, and that's remote; it's good to think about that. And if you make remote execution of finer-grained tasks more affordable, more feasible, by reducing the overhead it causes, you get much finer-grained parallelization in multicores, or whatever you're computing on. I think a lot of these ideas could be applicable, for example, to models where you do parallelism in memory as well, because in the end, at some point, you're going to communicate: local data hopefully is not a problem if you've done a good job partitioning your data in memory, but at some point you need to communicate across different, let's say, processors or execution units in memory, and your bottleneck will be this. It's good to think about that.

Okay, so I don't want to go through this in detail, I think we've discussed all of it, but basically what's even more interesting about this sort of approach is that you could enable new models of very fine-grained remote execution, as I kind of mentioned. Okay, and that's the paper; if you're interested you can read it, and I think it's going to be part of the homework as an optional reading. Any questions? Otherwise I'm going to switch to other things in heterogeneity. Okay.

So let me switch to other things. Clearly, heterogeneity is something real today: people have heterogeneity in both microarchitecture as well as architecture, that is, the ISA, in existing systems in a widespread manner. But I will mention that there are other uses of asymmetry. We focused a lot on performance, scalability, and also the energy efficiency that comes with all of that, but even if you may not get performance, there are good reasons to use asymmetry, and one is energy efficiency. There are works that discuss this, and clearly these works are important. Basically, the idea is to implement multiple types of cores on a chip, monitor the characteristics of a running thread, for example by periodically sampling what kind of energy you're spending and what kind of performance you're getting, and dynamically pick the core that provides the best energy/performance trade-off that you think will hold in the next phase, in a given phase of the program. The best core, of course, depends on the optimization metric.
Whether you're optimizing for performance, or energy, or energy-delay, etc., you could optimize for different metrics, and clearly that's the beauty of heterogeneity, as we discussed. This is a nice approach, clearly. When this was proposed, it was proposed by folks who were actually working on Alpha processors, and they demonstrated it with these different Alpha cores, called EV4, EV5, EV6, and EV8, which was never built, because I guess they were bought and then somebody didn't want to build those processors; actually they were bought by Intel, and Intel didn't want to continue that line, but maybe they should have reconsidered in hindsight, who knows. Okay, but basically these cores had already been built: EV4, EV5, EV6. EV6 is the Alpha 21264, I think, and EV4 is the Alpha 21064; this one is an out-of-order processor, and this one is a heavy out-of-order processor. These are the relative sizes of the cores, scaled to 0.1 micron, which is still pretty large today; you can see that EV4 is very small and EV8 is 80 times bigger, so there's something to be said about area as well. And these are what's reported in the paper in terms of peak power, average power, and performance across some workloads. I think these numbers are a bit more realistic compared to the Intel paper I mentioned yesterday, although I'm going to mention that paper too because it has value independently. You can see that these folks say EV4 has this peak power and performance normalized to that; EV5 has a much higher peak power but performance is 30% higher; and EV8 has huge peak power, as you can see, comparatively. I don't know if these numbers are correct, actually, because EV8 was never implemented, so take them with a grain of salt, but performance is also roughly double, let's say. Anyway, basically you have a variety of cores with a variety of performance, energy, and power characteristics.

Now, one way of utilizing these cores, assuming you've implemented all of them on your chip (since you already designed them in the past, why not put all of them on your chip? That's one way of easing the design of heterogeneous cores), is to monitor what kind of performance you get on the different cores for different phases of an application. This is one application, I don't remember which one, but you can read the paper; it shows what kind of instructions-per-second performance you're getting. Clearly there are phases: there are some phases where EV8 gives much higher performance than EV6, EV5, and EV4, and there are some phases where their performance is similar. This could be because you don't have a lot of instruction-level parallelism to exploit on EV8 in one particular phase, whereas in other phases there's a lot of instruction-level parallelism to exploit, etc. The idea is basically to determine the phases where you would want to schedule the code on a different core, based on the metric you're optimizing, and then schedule it on that core. So their dynamic mechanism, if you look over here, if you want to optimize for energy, chooses some of these cores: it chooses EV4 only very infrequently, it chooses EV6 most of the time, as you can see, and sometimes it chooses EV8; and this is if you want to optimize for energy-delay product. Okay, I'll let you think about it; that's the paper.
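As a caricature of the dynamic selection step, here is a tiny sketch in C: estimate (or sample) time and energy for the current phase on each core type and pick the one that minimizes your metric. The numbers in the arrays are made up for illustration, not the paper's measurements, and the real mechanism samples cores periodically at run time rather than reading constants.

```c
#include <stdio.h>

#define NUM_CORE_TYPES 4

/* Hypothetical per-phase samples: time and energy per interval on each core
 * type, in arbitrary units (illustrative values only). */
static const char  *name[NUM_CORE_TYPES]   = { "EV4", "EV5", "EV6", "EV8" };
static const double time_s[NUM_CORE_TYPES] = { 4.0, 3.1, 1.8, 1.5 };
static const double energy[NUM_CORE_TYPES] = { 1.0, 1.6, 4.5, 9.0 };

int main(void)
{
    /* Pick the core that minimizes energy x delay for this phase; plain
     * energy or energy x delay^2 would be other possible metrics. */
    int best = 0;
    for (int c = 1; c < NUM_CORE_TYPES; c++) {
        double ed      = energy[c]    * time_s[c];
        double best_ed = energy[best] * time_s[best];
        if (ed < best_ed)
            best = c;
    }
    printf("schedule the next phase on %s\n", name[best]);
    return 0;
}
```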
Clearly there are advantages and disadvantages to this. You get more flexibility in the energy/performance trade-off, and you can ship computation to the core best suited for it in terms of energy, assuming you're taking energy into account directly. Clearly there are disadvantages, and this is true for a lot of heterogeneity: if you make an incorrect prediction and end up on the wrong core, you get reduced performance or increased energy. This is more of an issue in this case, because you're not really picking a critical section of a program; you're really picking a whole program, you may be migrating a program from one place to another and really wasting performance, so if you make the wrong decision, that's not good. Clearly, making the wrong decision for critical sections is also not good, but there you're not migrating the whole program; with whole-program migration you also need to warm up your caches on the other core. There's the overhead of context switching, of course, which is part of whole-program migration. So these disadvantages are relatively similar to the disadvantages we've seen; I think incorrect predictions are actually very dangerous here. You need to design multiple cores, but if you already have them, maybe that's good, or at least you could argue that it reduces your design effort. And you need phase monitoring and matching algorithms. I think these are important to develop independently of whether or not you do this: identifying the phase of a program and deciding what kind of hardware to use for a given phase is really an important problem that's still not completely solved. This could be done at a coarse granularity, like here: you identify "okay, this is my phase, and in this phase I use a large core". But you could also do this phase-driven configurability at smaller granularities: "this is my phase, and in this phase this prefetcher configuration will work nicely"; you don't have to migrate things to a large core. So there's a lot of benefit to doing good phase prediction: which characteristics should be monitored is important, and once the characteristics are known, how do you pick the core, in this particular case. Okay, any questions?

Yes, that's also true. I think in this case they chopped the execution time into intervals, basically; that's one way of doing it, but of course if you have more information, if you can identify the changes in the behavior of the code, potentially with software support, you can do better. In fact, that may be even better because you may make more informed decisions. Exactly, yeah. And this doesn't consider utility, for example; I think you mentioned something like that, and a more utility-based metric could be better. So there's room to improve, as you can see. Actually, later works kind of looked at utility: that performance impact estimation work I didn't talk about, but which we compared against in the utility-based acceleration work, tried to understand the performance impact of moving from one core to another with some metrics; it was not exactly utility, but it was getting closer to utility.

Okay, so we've talked about asymmetric and symmetric cores; let's start wrapping up a little bit. Clearly, if you have asymmetric cores, you can provide better performance when thread parallelism is limited, and you can be more energy efficient at the same time. The disadvantages, of course, and I think we've kind of seen this earlier, are more complexity: you need to design more than one type of core.
If you already have the cores, that's great, but what you already have may not be a good match either: maybe you don't have the right core for extremely high parallelism, as we have seen earlier. EV4, actually, in this particular case (let me go back) is a core that's still a bit expensive, let's say: it's not out-of-order execution, it's in-order execution, but it has branch prediction, etc., so it's still expensive, and maybe you want very, very cheap cores. On the other hand, maybe the design effort of cheap cores is smaller; we could argue that as well. Scheduling also becomes more complicated, as we've already seen: what computation should be scheduled on the large core versus a small core, and who should decide this, hardware or software? Managing locality and load balancing can become difficult if threads move between cores. This is actually interesting and we have not discussed it as much, but especially if threads keep hopping between cores, you have another problem, because your data working set and instruction working set should also be hopping between cores, meaning whenever you get scheduled on another core you may get a lot of cache misses to begin with; that's locality. There's load balancing also: the programmer may have tried hard to balance the load assuming the code executes on particular cores, and now underneath you're moving things, so how does that load balancing change? Load balancing is a tough problem, as we will see a little bit more. And cores may have different demands on shared resources, which is also important to think about: how do you provide access for different cores to shared resources like the memory controller, caches, etc.?

Okay, with that, let me talk a little bit more about how to achieve asymmetry. We discussed this a little bit yesterday: you can have static asymmetry and you can have dynamic asymmetry. Static means that the type and power of the cores are fixed at design time; dynamic means these change dynamically. With static asymmetry, people have looked at two main approaches for designing faster cores: you can have hardcoded, statically higher frequencies for some cores, or you can build a more complex, more powerful core with an entirely different microarchitecture, including potentially a different frequency. Then there's some natural asymmetry, as we discussed yesterday: even across supposedly identical cores there is variation within a chip, so some cores may naturally be faster and some naturally slower depending on the process variation, and if you can identify that, maybe you run some cores at higher frequency and voltage and other cores at lower frequency and voltage; that would still be static. Dynamic is of course more powerful in general. One way of achieving dynamic asymmetry is by boosting frequency; I'm going to talk about that a little bit. That's an easy way of achieving it, which is why everybody did it early on, and now people are doing microarchitectural asymmetry also, so both of these are done (well, combining them dynamically is not done, but boosting the frequency dynamically is done, and static asymmetry is done today). There's also another way of dynamically achieving asymmetry, which is combining small cores to enable a more complex, powerful core. I think that's a very interesting approach that perhaps needs to be re-examined, and I'm going to mention some papers that I briefly mentioned yesterday; I'll just put them on a slide and not go into them much today.
Maybe there's a third, fourth, or fifth approach that I don't know about. So, the second approach: combining small cores to enable a more powerful, complex core. There are actually works in this area which I believe are interesting; I'll just leave them here and not talk about them. I believe MorphCore is the state-of-the-art approach (clearly it's the latest compared to all of these) and it is a more implementable way of doing it. Some of the others are very interesting ideas but much harder to implement; Core Fusion, I think, is a very interesting idea but much harder to implement.

Okay, let me talk about dynamic asymmetry via frequency boosting, because this actually opens up interesting issues. As I said, this is already employed in today's systems: for example, when there is a single thread running, the frequency of the core running that thread is boosted significantly. It's called Turbo Boost at Intel; I don't know what it's called at AMD, but it's similar. Essentially you do a turbo boost and increase the frequency of that core, and the reason you can do that is that you now put all the power into that core, and the other cores are not running, so they're essentially not consuming power. Before I go into that: there are ways of doing this statically. Due to process variations, cores might have different frequencies, and statically you can simply hardwire or design the cores to have different frequencies; then you can migrate an application from a low-frequency core to a high-frequency core whenever you want high frequency. You would of course have some migration cost, but it's possible to do. Dynamically, this is already done via dynamic voltage and frequency scaling, and it's nicely described in this paper by Intel.

So let me go over this a bit, because the idea in this paper is also interesting. Their goal was to minimize the execution time of parallel programs while keeping power within a fixed budget, and their observation is similar to what we discussed earlier: you want the best scalar performance when you have a single thread, you want the best throughput performance when you have many threads, and there could be some intermediate points as well. The idea is that you normally have a fixed power budget, and the lever you have to play with is the energy consumed per instruction (EPI): by changing the energy consumed per instruction while the power stays constant, you change how many instructions are executed per second. Does that make sense? That's the idea here. Now, this may not be intuitively mappable to asymmetry, but it actually maps to asymmetry easily: if you're in a high-throughput phase, reduce the energy per instruction and execute things on many small cores that require low energy, which naturally enables a high IPS, instructions per second; if you're in a low-throughput phase, increase the energy you spend on each instruction (you're going to execute a smaller number of instructions per second), meaning you execute on a large core, for example, or on multiple large cores. So, for a fixed power budget, run sequential phases on a high-EPI processor and run parallel phases on multiple low-EPI processors, and now you have an asymmetric multicore again.
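The arithmetic behind this is just the power identity: under a fixed power budget P, power is energy per instruction times instructions per second, so the throughput you can sustain is inversely proportional to the EPI you choose to spend.

```latex
\[
P \;=\; \text{EPI} \times \text{IPS}
\qquad\Longrightarrow\qquad
\text{IPS} \;=\; \frac{P}{\text{EPI}}
\]
```

So many low-EPI small cores maximize aggregate IPS in parallel phases, while a single high-EPI core spends the same budget on fewer but faster instructions in sequential phases.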
Okay, so that's one way of achieving it. Clearly you can do this energy-per-instruction throttling using DVFS as well, not only with an asymmetric multicore; in fact, in this work they didn't really talk about asymmetric multicore that much, they achieved it through DVFS. But I like this approach because it applies generally, as we will see later on: the key, if you think about asymmetric multicore, is really modulating the energy per instruction to match the IPS that you want. And there are other ways of modulating the energy per instruction. One is DVFS, dynamic voltage and frequency scaling: in phases of low thread-level parallelism you run a few cores at high supply voltage and frequency (in the extreme case, when only one thread exists, you run one core at the highest supply voltage and frequency possible within the power and thermal budget), and in phases of high thread-level parallelism you run many cores at low supply voltage and low frequency. So there are other ways of modulating things, which is also interesting. This is not a comprehensive list, but this other, earlier paper, also from Intel, discusses several ways of modulating the EPI. One is voltage and frequency scaling, clearly: you may raise or lower voltage and frequency, and in this case they lower voltage and frequency to get a lower EPI. According to them, the EPI range you get from this is about one to four; take all of these numbers with a grain of salt. Then there is also the time it takes to change the energy per instruction, meaning the time to reach the state where you're running faster or slower; according to them, at the time, it was around 100 microseconds, because you need to ramp up the voltage. This has changed over time, but there's always a cost to throttling the EPI. So that's voltage and frequency scaling. You could also use asymmetric cores, and according to them the EPI range is higher, so you can get a higher performance range as well as a higher energy range; migrating a 256-kilobyte L2 cache takes 10 microseconds or so, so there's more switching effort here. Again, take these numbers with a grain of salt, and we've already discussed this option. Then you could have a variable-size core, meaning a single core where, when you don't have enough instruction-level parallelism, for example, you reduce the sizes of the core's structures. They're not talking about caches here, but about changing the sizes of internal structures: you can change the issue width, you can change the size of your instruction window. I don't think they go as far as making it in-order; it stays within out-of-order. There's some cost to this as well. And then there's speculation control, which is reducing the amount of speculation; that also changes your EPI. Basically you say, I don't need a lot of performance here, so I'm going to reduce the speculation the processor does: when I get to a branch, I simply stop fetching.
That's one way of reducing speculation, and it's clearly one of the fastest ways of changing the EPI. So there are different ways, and if you're interested I recommend looking at that paper. Some of these techniques are actually implemented in modern processors; I think all of them are implemented in today's processors, but take the numbers with a grain of salt. Okay, so we are specifically talking about frequency boosting, and frequency boosting is clearly implemented; it was one of the first things implemented. It is very simple to implement: there is no need to design a new core. The other big benefit of boosting frequency is that parallel throughput doesn't degrade when TLP is high: when you have a lot of thread-level parallelism, you still run every core, just at lower frequency, so you still have high parallel throughput; you don't need to dedicate the area of many small cores to a large core, for example. And it preserves the locality of the boosted thread, which is nice. Of course, this has disadvantages too, and one of the biggest is that improving frequency helps when you're compute-bound, where it helps exploit instruction-level parallelism, but it usually doesn't help memory-bound workloads. If the current phase of your program is memory-bound, it stays memory-bound: you may generate more requests more quickly to the memory system and get a bit more memory-level parallelism, but the effect of that in the grand scheme of things is much smaller; you're better off doing something else for memory-bound phases. Another issue is that it doesn't reduce cycles per instruction, which is another way of saying you didn't design a different core. Both of these points say that frequency is only part of the performance equation. If you remember the performance equation, the execution time of a single thread is the number of instructions, times the cycles spent per instruction, times the time spent per clock cycle. Clearly we're not changing the number of instructions here, because the binary is the same; we're improving the clock period, so hopefully things get faster; but cycles per instruction fundamentally doesn't change. Actually, it may even increase, because at a higher frequency you may need to do something else in your pipeline. So fundamentally you're not improving cycles per instruction, whereas if you design a different core, hopefully it has the same or maybe better frequency, but maybe also better cycles per instruction.
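Here is the performance equation as a tiny calculation, just to make the point about which factor frequency boosting actually touches. The instruction count, CPI values, and frequencies are made up for illustration.

```python
# Execution time = instruction count x CPI x clock period. A frequency boost only
# changes the last factor; the numbers here are purely illustrative.

def exec_time(instructions: float, cpi: float, freq_hz: float) -> float:
    return instructions * cpi / freq_hz

base   = exec_time(1e9, cpi=1.0, freq_hz=3.0e9)
boost  = exec_time(1e9, cpi=1.1, freq_hz=4.0e9)   # CPI may get slightly worse at high f
bigger = exec_time(1e9, cpi=0.7, freq_hz=3.0e9)   # a different core can attack CPI instead

print(base, boost, bigger)   # boosting helps, but so does reducing CPI directly
```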
Okay, this is also interesting to think about: changing frequency and voltage can potentially take longer than switching to a large core. The debate is still out here, but it's good to think about the trade-off. Clearly there's time you spend changing frequency and voltage; ramping up the voltage especially takes time, and if your frequency range is very large you need to ramp the voltage from a very small value to a very large value. There are a lot of issues there: it takes time, and it takes time to do it reliably, because you're actually inducing a lot of noise in the system, and that noise may lead to power-delivery issues, whereas switching to a large core doesn't require changing the frequency. So there are interesting issues, as you can see, even within this trade-off. Any questions? Oh wow, I spent more time than expected, so I'm not going to cover the remaining slides in detail, but I'll go through them; we've already covered a bunch of these. These are more of a set of vision slides that talk about asymmetry everywhere. I'll let you study them, but maybe I'll cover some of the points very quickly, not all of them. These are slides from around 2010; I presented them at a workshop where the question was, very generally, what is the future of computing, and one of my answers was really asymmetry everywhere: each component should be designed asymmetric, to fit different computation, access, and communication patterns, for different power, performance, and reliability characteristics. There's nothing here you haven't heard so far, but asymmetry, I think, also subsumes configurability and partitionability. And then you design the runtime system to automatically choose the best-fit components for each phase: for a given phase, you dynamically stitch together the best-fit chip for that phase while satisfying the performance SLA with minimal energy. That could be one goal; I like thinking of it this way. I'll go through these quickly. For example, phase 1 may want a high-power, high-performance core, a lean cache hierarchy optimized for some access pattern, maybe a lean interconnect, and some particular type of memory, and if you have heterogeneous components you can dynamically identify the phase and match the computation to them. Phase 2 may require a very small core but a bigger memory hierarchy. Phase 3 may require a lot of cores but a very lean memory hierarchy, maybe a streaming phase with large capacity requirements, and so on. If you could design the hardware to do that, it would be nice; and when you go back to phase 1, you re-instantiate that configuration. You can also morph the software components to match the asymmetric hardware components: you may have multiple versions of the software for different resource characteristics, which I think is also very interesting. Essentially you want flexibility in software to match the flexibility in hardware, so version 1 may run here, version 2 may run there with many small cores, and version 3 may be heterogeneous as well. There are many interesting research and design questions here; again, I won't go through them in detail, and some of them are covered, in a very limited way, by what we have discussed, but I'll give you some simple examples. Serial and parallel cores, which we've kind of discussed. You could also have a single core that has both an out-of-order capability and a VLIW capability and dynamically figure out which one to use (VLIW is very long instruction word, where a compiler schedules the parallelism), or out-of-order and in-order, switching between them depending on the phase; there are works in this area. You can design the memory hierarchy differently for streaming
versus random access. We've kind of seen examples of this with prefetchers and so on, but I think there needs to be more here. If you remember stream buffers, for example, they were getting rid of the caches in earlier designs; it's good to reexamine some of these things. Today we have these huge caches and people don't reexamine some of these fundamental decisions, in my opinion: huge caches are easy to build, and if you have area, you put a cache there and forget about all of its costs, whereas if you're streaming, those caches are useless and they actually hurt performance, and we don't have a good way of streaming data easily into the processor today. We're going to talk about interconnects: you can have latency-optimized versus bandwidth-optimized interconnects. We've talked about memory controllers, and you can partition the memory bandwidth or interconnect bandwidth differently across latency-sensitive and bandwidth-sensitive workloads; Thread Cluster Memory scheduling kind of does that. We've also talked about memory scheduling policies that cater differently to compute-intensive versus memory-intensive workloads, and we've talked about hybrid memories for different purposes. We also talked about reliable DRAM versus less reliable DRAM, and about heterogeneous-latency DRAM and heterogeneous-refresh-rate DRAM. So we have a lot more today; I'll leave it here, but this is a vision that I have. I don't think it's easy to achieve in the short term; it's a much longer-term research vision, but I believe it can yield a lot of benefits going into the future. The works we've discussed today are kind of scratching the surface, and the good news is that some of this is already in real products: asymmetry in frequency scaling, core size, speculation control, and finally asymmetric cores are in products. Who knows, maybe in the future we'll have a lot more, let's say, fluid hardware and fluid software. Okay, any questions? Otherwise this is really the right time to take a break; let's take a break until 14:45, 17 minutes, and then we'll come back with lecture 20b.

Okay, shall we get started? Is it all good online? Okay. We've been talking about multiprocessors, and my lecture is also titled multiprocessors, so we're going to take a step back and give a different, broader treatment of multiprocessors; hopefully so far you've seen an interesting collection of ideas and also problems in multiprocessing. These are some readings. I put Amdahl's paper as required; you'll read it, hopefully. How many people have already read Amdahl's seminal paper? No one? No one took DDCA here, I guess. If you had taken DDCA, there was an extra-credit assignment: you would get 1% of the course grade for completing a review of it. It's an important paper to read, and this is the full title, as you can see: validity of the single processor approach to achieving large scale computing capabilities. We've discussed that. This is another paper that we will discuss, probably not today given how fast we're going, probably next week; that's about sequential consistency. And there's another paper on cache coherence, the MESI cache coherence protocol, that is also here. These are some fundamental papers, and there are a bunch more that we will discuss as we go along, but I want to give a broader overview of multiprocessing, and this starts with a taxonomy of
computing in general. Basically, how do you design a computer? That is the question to ask, and this paper by Mike Flynn from 1966 classifies computers based on how they operate. There are two dimensions along which he looks at the problem: one is instructions and the other is data. You can have a single instruction stream or multiple instruction streams, and single data or multiple data, and then you have a two-by-two matrix. One quadrant of this matrix is single instruction, single data (SISD) processors: a single instruction operates on a single data element. Another way of putting it is that you have instructions operating on scalar data elements; it's a scalar processor, and it's good to think about it that way. It's not necessarily literally a single data element: an instruction may have multiple operands, and that doesn't violate this. The point is that each operand the instruction operates on is a single piece of data; an operand doesn't specify a million pieces of data, just a single scalar value. So this is essentially a single sequential thread as we know it today. SIMD means you still have a single instruction, but it operates on multiple data elements. In this case a single instruction specifies a lot of work, and there's a specific shape to that work: you do the same operation on many data elements. This is essentially array processing or vector processing, as we have seen in DDCA, but you also have SIMD instructions in existing ISAs like x86. This is a way of designing the computer, and you can combine these ways as well; we're going to briefly discuss that. This is highly parallel, clearly: the main benefit of single instruction, multiple data over single instruction, single data is that you unleash a lot of parallelism with a single instruction. It could be a million-element vector. If you haven't taken DDCA, I'd recommend watching the lecture on SIMD processors; existing GPUs are SIMD processors today. We may have a lecture on them later, or maybe not, depending on how things go. Essentially the benefit is that you unleash a lot of work with a single instruction. A different way of looking at it is that you reduce the overhead of control. What does this mean? With single instruction, single data, every instruction specifies a small amount of work, so the amount of control each instruction goes through is a big part of the overhead: the real work is executing the computation on the data, but you need to fetch the instruction, decode it, rename it; a lot of processing needs to be done on that instruction so that you can do some work on two pieces of data. That's a lot of overhead. If you look at the energy spent in an SISD processor, most of it is really spent on this overhead, not on the computation itself. And I'm not even talking about memory here; forget about memory, this is just the computation you try to do and what is required to enable it: you fetch the instruction, you decode the instruction. With SIMD, and this is one of the major benefits of GPUs, or vector processors in general, assuming you have the parallelism in your program, you fetch and decode the instruction only once and you unleash a million operations, so the efficiency is much better: the overhead of an instruction is really amortized over many data elements.
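Here is a toy energy model for that amortization argument: the front-end overhead (fetch, decode, rename, and so on) is paid once per instruction, while the useful computation is paid once per data element. The two energy numbers are invented purely for illustration.

```python
# Toy model: per-instruction overhead amortized over the number of data elements
# one instruction operates on. Energy values are invented for illustration only.

E_FRONTEND_PJ = 500.0   # paid once per instruction (fetch, decode, rename, ...)
E_ALU_OP_PJ   = 5.0     # paid once per data element actually computed on

def energy_per_useful_op(vector_length: int) -> float:
    return E_ALU_OP_PJ + E_FRONTEND_PJ / vector_length

for vlen in (1, 4, 64, 1024):
    print(f"VLEN={vlen:5d}: {energy_per_useful_op(vlen):8.2f} pJ per useful op")
# VLEN=1 (scalar): ~505 pJ, almost all overhead; VLEN=1024: ~5.5 pJ, mostly computation.
```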
Okay, and then there are a couple of other things on the other dimension. Multiple instruction, single data: now we have multiple instruction streams (I like to think of it that way) operating on a single data element. This is a bit of an awkward one; the closest form is actually something we have examined in DDCA as well, the systolic array processor, which is a way of building machine-learning accelerators today. Essentially you input data into an array of processors, an array of functional units, you can also think of it that way, and the data gets processed in one functional unit and transformed into data-prime, which gets input into the next functional unit, which does something to it and transforms it into data-prime-prime, and then the next functional unit gets it, transforms it, and outputs it. (I don't know what's going on; there's some connectivity issue.) So a single data element is input once to multiple instruction streams, in a systolic fashion, and it gets transformed at every operation. I think that's the closest form to an MISD processor that I can think of. And then there's MIMD, which is really the subject of what we're going to talk about today: multiple instruction streams operating on multiple data elements. Essentially you can think of this as instruction streams that are separate from each other and executing concurrently; it could be a multithreaded processor, it could be a multiprocessor. And clearly today's systems are a combination of almost all of these: you have MIMD processing in a multiprocessor, and each core in that multiprocessor is an SISD-type processor which has SIMD extensions; if you have an accelerator that is a machine-learning accelerator, that may be MISD; the GPU is clearly a SIMD-type processor, but it also has some accelerators internally that may look MISD. So basically we use all of these paradigms today, which is good. If you want to brush up on these different paradigms, we cover essentially all of them in DDCA: SISD, SIMD, MISD. Now we're going to talk about MIMD-type parallelism, but parallelism is more general: clearly all of these processing types exploit parallelism in some way to improve things. Parallelism, in its most basic definition, is really doing multiple things at a time. Everything can be parallel; it's not just a property of computers. From the perspective of computing, the things could be instructions, operations, or tasks, and the question is: why do we do it? Why do we do things in parallel? Anybody? Faster, exactly. The main, or original, goal is really improving performance, execution time or task throughput, depending on how you define performance, and we're going to talk about Amdahl's law; we talked about it already. But that's not the only reason; there are other goals. Anybody? If you don't want faster, for example, you could want something else, and parallelism can still help you. Energy efficiency; why? Yes, maybe it is more efficient. So basically energy, or power: you want to reduce power
consumption. And if you have, let's say, N units at frequency F, that consumes more power than 4N units at frequency F/4; you can think about that. Am I right, or do you have another reason to do this in parallel? You can shut it down, exactly, that's one way of thinking about it, but that may be slower; maybe there's no time to shut things down. The main reason here, and this is assuming you can perfectly parallelize your program, by the way (there's an assumption here: you can perfectly parallelize your program, or you have completely independent things), is the following. The baseline is that you run N units at frequency F, or let's make N equal to one, one unit at frequency F, and instead you have four units at frequency F/4; assume they take the same time because you reduced the frequency by four. This is still more power-efficient. Why? Or maybe you think it's not power-efficient; then say why. Why are 4N units at F/4 more power-efficient than N units at F? Yes, why? You could keep asking why forever, right? No, not just "physics"; well, of course it's physics. Yes, okay, that's right. Maybe I'll write the power equation, but I don't know how to deal with these tools; something is moving, I should have used the other one, there should be a way of blanking this. Too much overhead; as you can see, switching to a different medium always takes overhead. Basically, as you have said, the basic reason is that dynamic power is equal to C·V²·f (well, I guess I wrote a capital F). If you reduce the frequency, clearly this term reduces. And if you reduce the frequency, you can also reduce the voltage, because you don't need to power your circuits at as high a voltage, and now that term reduces too, so we have a quadratic, really a cubic, effect: once you reduce your frequency, the power can actually reduce cubically. Makes sense? Now, you could also argue (I didn't say this, I only looked at frequency) that if you design a simpler core, you could potentially reduce the capacitance as well, who knows, because there could be much less loading: a heavy out-of-order core is much harder to design, and there's a lot of loading on the buses, for example, compared to a simpler in-order core where there's a lot less loading. So basically you would definitely gain on the frequency term, gain more on the voltage term, and probably gain on the capacitance term, and your dynamic power is much lower. And this is just dynamic power; there's also static power, but even if you ignore static power, you've already gained cubically. So if you have 4N units at F/4 and N units at F, you can do the calculation: this is clearly going to be lower power, because you get a cubic reduction per unit, roughly a factor of 64, or a factor of 16 overall once you account for having four times as many units. Okay. Static power is also something to think about, and static power is usually related to temperature and to area. Again, I don't have a clean equation for this, but it's correlated with area, I will say, because it's really leakage: if your area is larger, your leakage is also much larger, and if you actually design a smaller core, which was not what I said over there (that was only frequency), but if you also design a smaller core, now you have lower leakage overall. It's actually related to time as well, how long you execute, but here the time was hopefully the same.
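Here's the back-of-the-envelope version of that comparison, using dynamic power proportional to C·V²·f and optimistically assuming the voltage can be scaled down in proportion to the frequency and that the work parallelizes perfectly. All scaling factors are assumptions made for illustration.

```python
# Dynamic power ~ C * V^2 * f. Rough comparison of N cores at frequency f versus
# 4N cores at f/4, assuming (optimistically) that voltage scales with frequency and
# the work parallelizes perfectly. Everything is normalized; values are illustrative.

def dynamic_power(n_cores: float, capacitance: float, voltage: float, freq: float) -> float:
    return n_cores * capacitance * voltage**2 * freq

C, V, F, N = 1.0, 1.0, 1.0, 1.0                # normalized baseline
baseline = dynamic_power(N, C, V, F)

# 4N slower cores: f/4, and voltage scaled down by ~1/4 as well.
scaled = dynamic_power(4 * N, C, V / 4, F / 4)

print(baseline, scaled)   # 1.0 vs 0.0625: ~64x lower power per core, ~16x lower in total
```

Per core the reduction is about 64x, and with four times as many cores the total is about 16x, which are the numbers mentioned above.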
Yes, that's right, and that's why I said "assuming you can perfectly parallelize things"; I also said you don't move data. If you can perfectly parallelize a program, fine, but once you become imperfect, meaning you have more data movement, you have more synchronization, you have some load imbalance, you have resource sharing, then of course your power savings will not be as large; in fact, sometimes you may even increase the power, because your performance will not increase. Makes sense? That's why the assumption of perfect parallelization is really important. Okay, how do we go back; it should be a simple button; oh, you need to share; okay, oh no, my second point is already there, okay, I'll give it to you. So we've covered the power-consumption and energy part. The second reason is improving cost efficiency, scalability, and reducing complexity; I've kind of put them all together. If you have parallelism (again, assume perfect parallelization), it's harder to design a single unit that performs as well as N simpler units. With perfect parallelization I'd rather go with N simpler units; hopefully that's kind of obvious: I design a simpler thing and basically replicate it a million times, and that sounds good. So that was another goal. What about a third goal, or a fourth or fifth, if you have one? Yes? Okay, but you're saying performance and energy efficiency; yes, I agree, but that's not adding one more goal, you're actually covering all of them: they're simpler, which gives better power and also better performance. I agree, but I'm looking for some other reason to do parallel computation. No other reason? There must be something, because there's some space left on the slide; it's not a trick question in that sense, and we're going to fill that space. Why would you ever want to execute the same thing on two different units? Yes, exactly; maybe I gave you too many hints. Basically, to improve robustness, dependability: if you have multiple processing units, you can do the same thing on more than one of them and then compare the results; that's redundant execution in space. I'm assuming no one else has another reason, so I don't need to modify my slide, because there's not enough space on it, as you can see. I'm joking; if you find some fundamental reason, we'll find a way of modifying the slide. This is kind of like page limits in articles: they're designed to be fixed and they don't consider what's really important. We would add something important to the slide and the slide would become bigger and longer, and the article would become longer; it's fundamental to add something important, but the page limit goes against that fundamental thing. Okay, yes? Not necessarily linearly, but yes, certainly. Wait, well, I don't want to get deep into Dennard scaling, but that's not a bad question. I don't know if anybody can see this; anyway, I don't want to switch back and forth too much, but basically you can see it: C·V²·f. Dennard scaling, which is a technology-scaling law, let's say, that used to hold in the past, basically said that when you reduce the size of a circuit, you can also reduce the voltage associated with it. That's basically it, and it doesn't say anything about linearity and so on. In the original paper there are some calculations,
but basically the point is that you can reduce the voltage along with the feature size, and at some point it became very difficult to reduce the voltage, so it's not happening anymore. So it's really about the size of the circuit and how you can scale the voltage along with it; clearly there is a relationship between voltage and frequency, but I would say it's more complicated than just linear. Does that make sense? Okay, there's an online question, so maybe they can write something. Okay. So, there are different types of parallelism and different ways of exploiting them, and we've covered some already. Instruction-level parallelism: essentially, different instructions within a stream can be executed in parallel, and we've covered a lot of this in Digital Design and Computer Architecture: pipelining, out-of-order execution, speculative execution, VLIW, dataflow; they're all exploiting instruction-level parallelism. Data parallelism: in this case, different pieces of data can be operated on in parallel, and SIMD is an example of it; systolic arrays and streaming processors are also examples of it. And then there's task-level parallelism, in which different tasks or threads can be executed in parallel. This is really multithreading and multiprocessing, and we're going to look at it in detail. So what is task-level parallelism? If you want task-level parallelism, you want tasks, clearly, and somebody needs to create those tasks. The more difficult problem is the first one below; the easier problem is the second one, where you already have the tasks, because they are different programs, different jobs, for example, and you run them together on different processors. But let's take a look at the first one first: you want to partition a single problem into multiple related tasks or threads. You could do this explicitly, using parallel programming. This is easy when the tasks are natural in the problem, where you can easily divide things into threads, and difficult when the natural task boundaries are unclear; loops, for example, may be parallelizable, but it's good to think about when. Or you could do this transparently, implicitly; this is also called thread-level speculation, although you can partition a single thread speculatively or non-speculatively, I should probably add. A compiler can try to do this; a programmer can also try, well, no, if a programmer does it then it's not really done transparently; I meant to say hardware: hardware can actually do this transparently. In a sense, if you could, for example, ship a part of the program to a large core transparently, you're really parallelizing things, or to a small core, maybe. The easier case is when you have independent tasks that are already there: for example, tasks from different users, tasks from different simulations (you do the same simulation for a million different workloads, say, because you're designing a processor and you want to evaluate the different effects), or cloud-computing workloads: cloud computing systems take different workloads from different people that have nothing to do with each other; they are clearly different tasks, and they can be executed in parallel on the same computer or across different computers. But this doesn't improve the performance of a single task, of course, which is the harder problem, so we're going to focus more on improving the performance of a single task in this lecture.
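For the explicit-partitioning case, here's a minimal sketch of splitting one problem into tasks and handing them to a thread pool. The chunking scheme and pool size are arbitrary choices for illustration, and the small serial reduction at the end is a reminder that some non-parallelizable work usually remains.

```python
# Minimal explicit task-level parallelism: partition one problem (summing an array)
# into independent chunks and run them on a thread pool. Choices are illustrative.

from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, n_tasks=4):
    chunk = (len(data) + n_tasks - 1) // n_tasks
    pieces = [data[i * chunk:(i + 1) * chunk] for i in range(n_tasks)]
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        partials = list(pool.map(sum, pieces))   # the parallel phase
    return sum(partials)                         # a small serial reduction remains

print(parallel_sum(list(range(1_000_000))))
```

Note that equal-sized chunks only balance the load if every element costs the same amount of work, which is exactly the load-balancing issue that comes up again shortly.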
Okay, let's talk about some fundamentals. There are two types of multiprocessors, and both exist today: loosely coupled versus tightly coupled. The main distinction between them is whether the different processors share a global memory address space or not. In loosely coupled multiprocessors there is no shared global memory address space; in tightly coupled multiprocessors there is a shared global memory address space, meaning that when you run one program it can reference an address that can also be referenced by some other program. Okay, that's interesting. A loosely coupled multiprocessor is also called a multicomputer, a network-based multiprocessor in a sense: if machines are all on the same network, they can also be thought of as a loosely coupled multiprocessor, and you could partition a program across my laptop and your laptop and your laptop, for example. These actually are loosely coupled multiprocessors; maybe they're not explicitly designed that way, but they are, and I could partition my program across your laptops, assuming I have the permissions to do so. The tightly coupled case is more traditional multiprocessing, so we're going to focus more on that; it's also called symmetric multiprocessing. I don't like that name that much, because what does "symmetric" assume? Essentially, existing multi-core processors and multithreaded processors are built this way. Loosely coupled multiprocessors are usually programmed via message passing: you send messages, with explicit send and receive calls for communication. This doesn't have to be the case, because the programming model is actually separate from whether or not the hardware has a global memory address space, but let me talk about this first and then clarify. Essentially you send explicit messages to communicate data between the processors, because in a sense there's no other way: you cannot just say "load from location X". Or rather, you could say that, but there has to be some layer of software that translates it into messages. That's the interesting thing about programming: the programming model can be anything and the hardware can be anything; it all depends on what's in between. You could program a loosely coupled multiprocessor assuming a global shared memory address space, as long as you have a virtualization layer, a software layer, that translates what you have written into something that can actually execute on that hardware, which is messages. I think it's important to understand that; that's why I say "usually". Of course, if you program with a global-memory-address-space programming model on hardware whose communication model has no global memory address space, there is going to be some overhead to translate all of that. Okay. In tightly coupled multiprocessors the programming model is similar to uniprocessors, again in general: you assume loads and stores, and different tasks communicate with each other using loads and stores, so you can think of it as a multitasking uniprocessor, except that operations on shared data require synchronization. Again, there's a difference between the programming model and the hardware execution model: you could have a tightly coupled multiprocessor, a multi-core processor, and program it using message passing. That direction is actually a bit easier to achieve, I think, than the other way around: if you have message-passing-based programs, you just need to translate the sends and receives into loads and stores.
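To illustrate the two programming models side by side (the models, not any particular hardware), here's a small sketch: a multiprocessing.Queue stands in for explicit send/receive, and a shared Value with a Lock stands in for loads and stores to a shared address space with synchronization.

```python
# The same one-value communication expressed in the two programming models.
# This sketches the models only; it says nothing about the underlying hardware.

import multiprocessing as mp

def producer_msg(q):            # message passing: explicit send
    q.put(42)

def producer_shm(val, lock):    # shared memory: an ordinary store, guarded by a lock
    with lock:
        val.value = 42

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=producer_msg, args=(q,)); p.start()
    print("received:", q.get()); p.join()

    val, lock = mp.Value("i", 0), mp.Lock()
    p = mp.Process(target=producer_shm, args=(val, lock)); p.start(); p.join()
    print("loaded:", val.value)
```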
I believe that direction is easier. Okay, so in this slide I covered the different types of multiprocessors and also the distinction between the programming model and the hardware execution model: the way these multiprocessors really differ is in the hardware execution model, that is, whether or not you have a shared global memory address space to communicate through, and usually they also differ in programming model, but not always. Okay, so there are many design issues; we're going to focus a lot on tightly coupled multiprocessors. Shared-memory synchronization: how to handle synchronization, what kind of synchronization primitives you provide; we've already talked about some of these. If this were a more software-oriented course, a parallel programming course, you could talk a lot about how to optimize locks, how to do better atomic operations, how to do better barriers, how to do better parallel programming in general, but we don't have time for that in this course; that's the subject of a full parallel programming course, which usually assumes shared-memory synchronization, that is, a tightly coupled MP. We're not going to talk about that; we're going to talk about some of the implications. Cache coherence becomes important: you need to ensure correct operation in the presence of private caches that may keep the same memory address cached; we're going to talk a lot about that next week. Memory consistency: the ordering of memory operations is important; what should the programmer expect the hardware to provide in terms of ordering? We're going to talk about that, probably not today given that we're going slowly. We've talked about shared resource management, which is actually an issue in, I think, all types of processors, but tightly coupled multiprocessors also suffer from it. And then we're going to talk about interconnects, which is also an issue across the board. But shared-memory synchronization, cache coherence, and memory consistency are especially the issues imposed by a shared-memory model, particularly a shared-memory hardware model; if you use a shared-memory programming model even when your hardware doesn't support shared memory (that is, it's not tightly coupled), I think you can run into some of these issues as well, but we're not going to look at that situation. Okay. There are also programming issues in tightly coupled multiprocessors, and we've covered some of them. How to partition a single task into multiple tasks such that the load is balanced: I gave you an example yesterday, partitioning a book and counting the characters in it; even that task is not so easy in terms of load balance. Partitioning is easy; getting the load balance right is not. Synchronization: how to synchronize efficiently between tasks, how to communicate between tasks; again, we're going to cover the basic mechanisms that enable better synchronization, but not how to do synchronization better in software. We actually talked yesterday about mechanisms that enable this in a fast way without making the programmer go crazy, bottleneck acceleration, and today as well.
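As a reminder of what shared-memory synchronization looks like from the programmer's side, here's a minimal critical-section sketch; the counter size and thread count are arbitrary. The lock is what makes the read-modify-write on shared data correct: remove it and the result is no longer guaranteed on a real multiprocessor (and even in Python the increment is not atomic).

```python
# Minimal critical-section example: several threads update one shared counter.
# The lock serializes the read-modify-write; without it the updates can race.

import threading

counter = 0
lock = threading.Lock()

def worker(n_increments):
    global counter
    for _ in range(n_increments):
        with lock:               # critical section: only one thread at a time
            counter += 1         # shared-data update (load, add, store)

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                   # 400000, every time, because of the lock
```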
But there's more: if you are a programmer and you want to optimize your program, you can actually do more. Contention avoidance and management: we'll talk about that, and there's also shared resource contention, which we already talked about. How do you maximize parallelism, and how do you ensure correct operation while optimizing for performance? That's actually interesting, and it's kind of how we motivated asymmetric multicore yesterday: we don't want the programmer to reduce the size of the critical section at the expense of correctness, and if you already have a not-so-bad critical section, we're going to accelerate it. Okay, so let me make an aside here, which is hardware-based multithreading. Multithreading is another approach: basically, you can execute multiple threads concurrently, but these multiple threads don't need to execute on separate processors; they can execute on the same processor. There are different approaches to hardware-based multithreading: coarse-grained, fine-grained, and simultaneous, and all of these are implemented in today's processors, except that processors that implement fine-grained multithreading generally don't implement simultaneous multithreading and vice versa; coarse-grained is always there. Coarse-grained basically means you have a single processor and you time-multiplex multiple threads, multiple tasks, onto it. It can be based on time quantums, and that's what happens in today's systems: you switch to another thread after some milliseconds, and the operating system has a schedule for that. Or you switch to another thread based on some event happening: this thread yields, for example; you can explicitly yield the processor, which notifies the operating system that the processor has become schedulable, and the operating system schedules another thread onto it. Or it can be done at a finer granularity, event-based, which is also called switch-on-event multithreading: for example, you can switch on an L3 miss, so when an L3 miss happens in one thread, another thread can be scheduled. Now, there's an interesting distinction here. Coarse-grained clearly has different granularities, as you have seen: one is large time quantums, the other is events. Large time quantums and coarse-grained events don't require multiple hardware thread contexts in the same processor. But if you start switching on something like an L3 miss, it takes a lot of time to change the context in software, because that requires an operating-system call, and the operating system needs to remove the existing thread, write its context somewhere in memory, and then bring in some other thread, read its context from memory, and put it into the hardware structures for execution, including the register file, the program counter, the address-translation structures, and so on. So if your switching granularity gets finer and finer, then you want to keep the thread contexts in the hardware as well: there is some coarse-grained multithreading that doesn't require thread contexts in hardware, and there is some coarse-grained multithreading that starts requiring thread contexts in hardware; you need to have multiple sets of registers, essentially, maybe even multiple sets of TLBs (usually that's not done, but it's possible to do). Okay, so that's coarse-grained.
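Here is a toy model of the two thread-selection policies: switch-on-event (coarse-grained) versus the cycle-by-cycle round-robin of fine-grained multithreading, which comes up next. Threads are just lists of instruction labels and the "event" is a long-latency miss; this is purely illustrative, not a model of any real core.

```python
# Toy thread-selection policies. Threads are lists of instruction labels; the policies
# only decide which hardware context supplies the next instruction. Illustrative only.

def coarse_grain(threads, is_long_miss):
    """Switch-on-event: run one thread until it issues a long-latency miss, then switch."""
    schedule, current, cursors = [], 0, [0] * len(threads)
    while any(c < len(t) for c, t in zip(cursors, threads)):
        if cursors[current] >= len(threads[current]):
            current = (current + 1) % len(threads)      # this thread is done, move on
            continue
        instr = threads[current][cursors[current]]
        schedule.append((current, instr))
        cursors[current] += 1
        if is_long_miss(instr):
            current = (current + 1) % len(threads)      # switch on the event
    return schedule

def fine_grain(threads):
    """Cycle-by-cycle: round-robin over threads that still have instructions."""
    schedule, cursors, current = [], [0] * len(threads), 0
    while any(c < len(t) for c, t in zip(cursors, threads)):
        if cursors[current] < len(threads[current]):
            schedule.append((current, threads[current][cursors[current]]))
            cursors[current] += 1
        current = (current + 1) % len(threads)
    return schedule

t0 = ["add", "load_miss", "add"]
t1 = ["mul", "mul", "mul"]
print(coarse_grain([t0, t1], is_long_miss=lambda i: i == "load_miss"))
print(fine_grain([t0, t1]))
```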
Once you start going fine-grained, as we have discussed in DDCA and also in the last lecture (the Sun Niagara approach was like this), you switch to a different thread cycle by cycle, every cycle. Clearly you cannot do this in software at all. You can do switch-on-L3-miss in software, except it will be costly, and you also need some mechanism to signal the event to the software; clearly you need to change the hardware, but for fine-grained you really have to change the hardware and add thread contexts to it. Simultaneous multithreading is, I guess, somewhat similar to fine-grained: you have hardware thread contexts, but you don't switch to another thread every cycle; you just keep executing from different threads concurrently. You may fetch from different threads every other cycle, but instructions from different threads are everywhere in the pipeline, whereas fine-grained means that only one instruction from a given thread can be in the entire pipeline at a time, so different threads occupy different parts of the pipeline; with simultaneous multithreading everything can be mixed in the pipeline. The difference is that fine-grained multithreading is a lot easier to implement, in the sense that within a thread you don't need any dependency checking, whereas with simultaneous multithreading you do need dependency checking within a thread, because you cannot guarantee that multiple instructions from the same thread are not present in the pipeline at the same time. So clearly today's processors have simultaneous multithreading; Intel calls it hyper-threading, which means nothing, I think. I guess it's "hyper" compared to coarse-grained multithreading, but then fine-grained is also hyper. Fine-grained doesn't have a great name either, in my opinion; I would call it cycle-by-cycle multithreading, maybe, but that's the way things are named. Okay, so this is important, and clearly all of the issues we're going to discuss arise in multithreading as well. We used to cover more multithreading; well, we still cover fine-grained multithreading, but I had more multithreading lectures in an earlier class that I'm not going to cover here. There are actually very nice, interesting topics there that we don't have time for; if you're interested, you can look them up. Any questions? Okay, otherwise we're going to cover something that I think is extremely important when we discuss parallel processing, and that's going to be the limits of parallel speedup. We're going to have some fun; I'm going to ask you to think a little bit. So this is my polynomial: the a_i are constants, x is the variable, x is an input essentially, and the terms are a4 times x to the power of four, a3 times x cubed, a2 times x squared, a1 times x, and a0 times x to the power of zero, to be complete, let's say. Assume you are given the input x and each a_i, assume each operation takes one cycle, there is no communication cost, and each operation can be executed on a different processor. These are the assumptions we have had earlier: perfect parallel processing, no overhead in instruction fetch, forget about that, there are no instructions really, we just schedule operations onto different processors, functional units. Then there are two questions. The first: how fast is this polynomial evaluation with a single processor? Assume no pipelining or concurrent execution of instructions, just one instruction in the processor at a time, one cycle each.
And there's no distinction between instruction types: an addition takes one cycle and a multiplication also takes one cycle. Okay, and then the other question is: how fast is this with three processors? Do I go to the next slide? Maybe not; if you had a photographic memory and took a picture of the next slide, then maybe you'd get it right. Okay, I'll give you a couple of minutes. If you want, you can rewrite the expression; basically I want to compute this polynomial, and as long as you get the result right... well, I'm not going to say I don't care how you do it, because you might get lucky in getting the result right, but I want you to actually compute this polynomial. Yes, you have a question? Or no, okay, it got answered. That is not given; only x and the a_i are given. Yes, I'll let you work through it; this is fun. Let's see if we can achieve consensus. (Let me see first; yeah, maybe we can switch; let's see how this works; we can overlap the latency of switching. Don't go to the next slide. Cool, so this is it.) Okay, I'm not going to show anything yet. Has anybody completed it? Not yet; we can wait a couple more minutes. No using ChatGPT; I should have said that earlier. I wonder if it would get it right, actually; I'm curious whether it has studied my old lectures, maybe it would. Okay, go ahead. What? Say it again: two what? I couldn't understand that part. Two RISC-V cores? Yeah, that would be tightly coupled in that case, assuming those two RISC-V cores are part of a multiprocessor that actually has the same global address space. But it doesn't have to be tightly coupled: it could be a RISC-V core on Mohammad's laptop and another RISC-V core on my laptop; now if you partition the program across those, they don't share a global memory address space. So it really depends on where those RISC-V cores are; it's not a function of RISC-V, clearly, it's really a function of whether the parallel processors share a global memory address space. Okay, anybody, any answers? There's one person who has an answer; there's another person; three people; four; are there more? Don't tell the answers yet. Five. Okay, do we wait one or two more minutes, or do we start collecting answers? I'll wait one more minute. Okay, let's start collecting answers. Anybody want to volunteer? So the first question was how long it takes to execute this polynomial, which you don't see, on a single processor, not core. Twelve cycles. Eight cycles. What, none? Eleven. Okay: 12, 8, 11, no matches so far. Nine. Fourteen? Did you just make that up? No? Fourteen; I don't know how you can get 14 on this one, but fine. Anybody else? Eleven. Ten. Eleven. Anybody else? Clearly you won't be an outlier with these, because there is no outlier. Nobody else wants to volunteer? There's no embarrassment, don't worry; there's only one correct answer, actually, so maybe none of these is correct. I think who said 11, four people? Are you sure? Okay, I didn't count, so it doesn't matter anyway. This is how you would get 11, I think; well, you could get it a different way, perhaps. How do we do this; actually, I don't think I need this. I was going to construct all of it, but we don't have time, so maybe we go back to the slides. So 11 was the majority, right? The majority is always correct; that's how democracy works. Fortunately, in science we don't have to be democratic. All right, thank you.
Sorry, I thought I was going to write it out, but maybe I don't need to. So 11 is the correct answer according to the majority, and this is how you get 11, I assume: you multiply x by itself to get x squared, then multiply one more time to get x cubed, and one more time to get x to the fourth; that's three multiplications. Then you compute a1 times x, a2 times x squared, a3 times x cubed, and a4 times x to the fourth; that's four more multiplications. Then you start adding: a4x^4 plus a3x^3, then add a2x^2, then a1x, then a0; that's four additions. There's no operation for the a0 term itself, so basically you count the operations: 11 operations, 11 cycles. Sounds good. Now, what about the three-processor version? That was the other question: how many cycles does it take? We're going to get to it. (I think "the majority is always right", so you have no voice to speak, in one interpretation of democracy, I should say; let's not get into politics, but that's one interpretation of democracy which is quite flawed, in my opinion. Anyway.) Okay, three processors: I want to collect those answers. Nobody did it? You just did it for one processor? Yes: six. Six, okay. Five. Okay, no, this is not an auction, you know [Laughter]; well, I don't know, yeah, that's right, the lowest wins in this case, you're right. Anybody else? 6, 6, 5. So more people did the single-thread version, which also shows something: it's easier to do the single-threaded version, or maybe you forgot there were two questions, I don't know. Should I give you more time? Okay, I'll give you one minute; so far six wins, and I want some people to tip the balance. I want an answer from the person who got 14 for the first one as well. Okay, no, it's fine, we haven't ended things yet. But yes, I think you pointed to another flaw of democracy: you cannot change your vote after you voted [Laughter]. Doesn't matter anyway. Okay, can I collect more answers right now? Yes: five. Another five. Six, six, five. Five, or maybe you're between six and five, five and a half? That doesn't work with the rules, though. Anybody else? Nobody else computed it during this time; that shows you the difficulty of parallel programming. Going once; yes, you have "maybe six", I like this approximate answer, maybe six. Okay, so the majority says six. Is that true? The majority won in the single-threaded case; that was 11, and hopefully no one has an issue with 11 now. I have no idea how someone would get 14, but we can figure that out later; how did you get 14? Oh, I see. But if you look at this: no communication cost, each operation can be executed on a different processor, each operation takes one cycle; it doesn't say anything about fan-out. You assumed something about fan-out; well, you were wrong as a result. Okay, I get five with three processors, and this is how I would do it. In the first cycle, one processor computes x squared, another computes a3 times x, and the third computes a1 times x. In the second cycle, you compute a4 times x squared, you compute a3x times x squared, which is a3x^3, and you compute a1x plus a0. In the third cycle, you compute a2 plus a4x^2, and you add a3x^3 to a1x plus a0. In the fourth cycle, you multiply a2 plus a4x^2 by x squared, which gives a2x^2 plus a4x^4. And in the fifth cycle you add that to what you already have, a1x plus a0 plus a3x^3, and you're done. Did you get it this way, those of you who got five? Yeah, okay, good. So let's compute the speedup.
The three-processor time T3 is five cycles and the single-processor time is 11 cycles, so the speedup you get with three processors is 11 divided by 5, which is 2.2: not a linear speedup with three processors, but not terrible, I guess. Optimistic? What do you mean? Yeah, sure, but is it fair? That doesn't answer my question; my question is: is this a fair comparison? This is where the minority wins: it's absolutely not fair, because the single-processor algorithm, even though the majority suggested that 11 is the right answer, is not the right algorithm. An informed minority... maybe we'll get to that. At least I, but probably you as well, would revisit the single-processor algorithm with what you learned in high school. How many people learned Horner's method in high school? You did; there you go. How many people did? I'm curious, actually, seriously: how many people learned it and remember it, not necessarily to the point of using it, but to the point that you actually learned something like Horner's method? Okay. And how many people are pretty sure they did not learn it? You're pretty sure, okay, you didn't learn it, and you're pretty sure, okay. It's good to know; it may depend on the high school, it may depend on the time, etc. Basically, this guy Horner wrote a paper in 1819 and talked about how to solve this sort of polynomial equation, or numerical equations generally, nicely. I won't get into the full Horner's method, but essentially what you can do is this: you first compute a4 times x plus a3, and then you progressively factor x out. By doing so you reduce some of the computations you do; in a sense you're doing common-subexpression elimination, you can think of it that way. And if you do this, you get eight. You got eight, right? Okay, and nobody else got eight, and you did it this way, essentially. This is a much nicer version of the algorithm, as you can see, on a single processor. So now your speedup is not 11 divided by 5, it's 8 divided by 5, and you get a 1.6x speedup, so maybe the multiprocessor is not looking that good, and we also made a lot of idealistic assumptions about this machine. This is a really important thing to think about, so let's talk about it. There are implications here; a lot of interesting things happened. One of the interesting things was clearly that fewer people answered the question for the three-processor case, which kind of indicates (I mean, it was not a fully controlled experiment, so I cannot say for sure) the difficulty of the problem. And the second is that most people didn't get it right, which also indicates that there is some difficulty here, and this is true for a lot of algorithms; think about more complex programs, where it becomes much, much worse than this. So, basically, the key point, before I go into superlinear speedup, is this: the comparison was not fair. The 11-divided-by-5 comparison inflated the benefits of the parallel technique, and the reason it was unfair was that it was not using the best algorithm for the single processor. We were actually using the best algorithm for the multiprocessor; this was the best way of doing it with three processors (I don't think you can do it in four cycles, good luck), but we were not doing the best thing for the single processor, and this is an important fallacy to avoid.
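The two single-processor algorithms from this example, written out; the operation counts in the comments match the slide numbers (11 for the naive version, 8 for Horner), and the coefficient values are arbitrary.

```python
# Naive evaluation versus Horner's method for a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0.

def naive_eval(a0, a1, a2, a3, a4, x):
    x2 = x * x                     # 1 multiply
    x3 = x2 * x                    # 2
    x4 = x3 * x                    # 3
    return a0 + a1 * x + a2 * x2 + a3 * x3 + a4 * x4    # +4 multiplies +4 adds = 11 ops

def horner_eval(a0, a1, a2, a3, a4, x):
    return (((a4 * x + a3) * x + a2) * x + a1) * x + a0  # 4 multiplies + 4 adds = 8 ops

assert naive_eval(5, 4, 3, 2, 1, x=2) == horner_eval(5, 4, 3, 2, 1, x=2)

# Speedup of the 5-cycle, 3-processor schedule over each single-processor baseline:
print(11 / 5, 8 / 5)   # 2.2 against the naive version, only 1.6 against Horner
```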
Yes, it was pretty fair for the multiprocessor: no communication cost, one cycle per operation, fine. Yes, that's right, but that's a different question. The question is whether the comparison is fair or not, and I will ask the exact same question about whatever your favorite embarrassingly parallel program is; the same question applies: are you doing a fair comparison? Okay, so essentially the comparison is not fair, and if you don't do fair comparisons, then you may actually build something that is not necessary or not good, and someone can do much better than you simply by optimizing your software. So there is a lot of interesting stuff to think about here. Okay, one way of thinking about this: we've seen this kind of graph before, a speedup graph with the number of processors on the x-axis and the parallel speedup compared to the single-threaded version on the y-axis, and typically the curve you get bends downward. That's not the best curve, unfortunately: typically you get sublinear speedup, for reasons we discussed last time. Sometimes people get superlinear speedup, and usually that's due to two main things. Whenever you see superlinear speedup you should be extremely wary; whenever you see linear speedup, or close to linear, you should be wary as well, but at least that's possible. Superlinear speedup, I think, is not possible unless you're doing something wrong, like an unfair comparison: comparing the best parallel algorithm to a wimpy serial algorithm. That's unfair, and we've seen companies do this in the past (I will not name them here) to sell their products; it's amazing the lengths to which companies will go to sell their products, and this is probably not even the worst example. But then there is actually a real case where you can get superlinear speedup; anybody think of it? Yes, okay, can you be more precise? Yes, essentially cache and memory effects. Basically, this comparison assumes you're not adding anything other than processors, but when you add a processor, if you're also adding more cache or more memory, then you may get fewer misses in the caches and memory, so you don't go to memory as much anymore, because you've parallelized your problem not just from a computation perspective but also from a cache perspective. Now, instead of having one megabyte of cache on chip, you have ten megabytes of cache on chip, and you never go to memory, whereas with a single processor you only have one megabyte. You could also call this an unfair comparison; maybe you should be comparing against a single processor with a ten-megabyte cache, but there may be good reasons why that's not possible if you're comparing real systems, for example. So this is real: with cache and memory effects you can actually get superlinear speedups; hopefully you're not getting them because of unfair comparisons, though. Okay, so hopefully this was a useful exercise, even though it took some time, and whatever we experienced here is, in real life, actually much more difficult. Okay, so I will also mention some metrics here. There are some traditional metrics that we use to evaluate parallel processing with multiprocessors, and these metrics assume all P processors are tied up for the parallel computation for some amount of time. Utilization is how much of the processing capability is used: it's really the number of operations in the parallel version divided by how many processors you're tying up and for how long. Redundancy is how much extra work is done with parallel processing.
Okay, hopefully this was a useful exercise, even though it took some time, and whatever we experienced here is actually much more difficult in real life. So I will also mention some metrics here. There are some traditional metrics that we use to evaluate multiprocessors, and these metrics assume all P processors are tied up for the parallel computation for some amount of time. Utilization is how much of the processing capability is used: the number of operations in the parallel version divided by how many processors you are tying up, for how long. Redundancy is how much extra work is done with parallel processing, and there is almost always extra work in parallel processing, because you need to divide the tasks, maybe replicate the inputs, and so on; it is the number of operations in the parallel version divided by the number of operations in the best single-processor algorithm. And efficiency is basically the time with one processor divided by how many processors you are tying up, for how long; efficiency is also utilization divided by redundancy. I will not go through all of this in detail, but it is good to think about it, especially pictorially.

So, utilization: in the particular example we have seen, we had 10 operations in the parallel version, it took five time units, and we were tying up three processors, which was a given, for those five time units, assuming you tie up all of them because you do not know in advance how long each one will take. The picture also shows the load imbalance across processors: an X means you are utilizing that processor at that point in time. So our utilization is really 2/3, that is, 10 over 15, as you can see. The unused slots could be allocated to a different process or a different thread, but you need to know that those processors are done, and how do you know they are done? This is the load imbalance problem we have seen earlier. So even this simple example gives us a lot of problems.

Redundancy is the number of operations with P processors, 10, over the best number of operations with one processor, which we have seen is 8. Redundancy is always greater than or equal to one: whenever you do parallel processing, you increase the amount of work you do. In this case it is not bad, but it is still 25% more work. Efficiency is basically how much resource you used compared to how much resource you could have gotten away with. Excuse my terrible writing here, I had to correct it: we could really get away with tying up one processor for 8 time units, but we are tying up three processors for five time units each, so our efficiency is close to a half, let's say. Not so great. Once you start looking at these metrics, they show that we are wasting a lot of resources in this particular problem, under all of these assumptions.

Okay, we have covered some of this before, so I will go through it quickly. We have seen Amdahl's Law, and Amdahl's Law is really about the serial bottleneck. Just to show you some pictures again with my handwriting: there is a parallelizable fraction and there is a non-parallelizable part, and you can derive Amdahl's Law from this. The time with P processors is the parallelizable fraction times the time with one processor divided by P, assuming perfect parallelization, plus the non-parallelizable fraction times the time with one processor. The equation I showed you earlier follows from the speedup: the speedup with P processors is T1 divided by TP, and once you simplify, you get essentially the same thing I showed before, with f and alpha being the same quantity. And if you take the limit as P goes to infinity, you get 1 over (1 minus alpha), and that is the bottleneck for parallel speedup.
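Written out as formulas, with the numbers from the example above (10 operations on three processors for five time units, versus the best serial algorithm's 8 operations) and with alpha denoting the parallelizable fraction (the same quantity as the f on the slides), the metrics and the Amdahl's Law derivation look like this:

```latex
\begin{align*}
\text{Utilization } U &= \frac{\text{ops in parallel version}}{P \cdot T_P}
                        = \frac{10}{3 \cdot 5} = \frac{2}{3}\\[4pt]
\text{Redundancy } R  &= \frac{\text{ops in parallel version}}
                               {\text{ops in best serial version}}
                        = \frac{10}{8} = 1.25\\[4pt]
\text{Efficiency } E  &= \frac{T_1}{P \cdot T_P} = \frac{8}{3 \cdot 5}
                        = \frac{8}{15} \approx 0.53 = \frac{U}{R}\\[8pt]
T_P &= \alpha\,\frac{T_1}{P} + (1-\alpha)\,T_1\\[4pt]
\text{Speedup}(P) &= \frac{T_1}{T_P}
                   = \frac{1}{\dfrac{\alpha}{P} + (1-\alpha)},
\qquad
\lim_{P\to\infty}\text{Speedup}(P) = \frac{1}{1-\alpha}
\end{align*}
```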
Now, if you start drawing this nicely, with P on one axis and the speedup on the other, these are the speedup curves that I drew, clearly by hand; you could draw them with some other program. The key point is that adding more and more processors gives less and less benefit if alpha, the parallelizable fraction, is less than one. You need to be really close to 100% parallelizability if you want to keep seeing more and more benefit. Here is another way of looking at it: this axis is the parallelizable fraction, this axis is the speedup, and the benefit is small until you get close to one. And here is a better version: this is the parallel fraction f, and these are real numbers computed in Excel, with Excel's terrible baseline graph. You can see that the speedups only become large when you get really close to one, and the differences between different numbers of cores also only become interesting when you are very close to one. So you really want embarrassingly parallel applications, as you said.

We have also discussed that the parallel portion is itself not perfectly parallel, and these results do not take synchronization, load imbalance, or resource sharing into account. It is good to think about this. This is exactly why Amdahl said we should be focusing on the single processor, and that was 1967; there is still good reason to focus on making that single processor extremely fast and efficient, and on scheduling the code that requires that speed and efficiency onto that single processor.

Okay, we have already seen a lot of this, so I will go through it relatively quickly. We saw the sequential bottleneck in the last lecture. The main reason for it is non-parallelizable operations on data, for example non-parallelizable loops where you have a loop-carried dependency: you do not get this beautiful parallelization, you need some serialization. There are other causes too: a single thread prepares data and spawns the parallel tasks, and that part is usually very sequential, as we discussed yesterday. Here is an example from a paper you are reading, so I will not go through it again in detail, but here, for instance, there is a critical section, there is another critical section, there is a parallel part, and these two parts are serial. This is roughly what the execution time looks like: the serial portions look like this, the parallel parts look like this, and the critical sections are the limiters. Again, this is from the paper we discussed yesterday.

We have also discussed bottlenecks in the parallel portion; I will not go through this in detail either, this is just an overview. Essentially, synchronization operations that manipulate shared data cannot be parallelized, so you need locks, mutual exclusion, barrier synchronization. I also said yesterday that this is really communication: whenever tasks may need values from each other, you need to synchronize them, and you can think of that as communication or synchronization. And we have already discussed that it is called thread serialization when shared data is contended. We also talked about load imbalance, which comes from imperfect parallelization or from microarchitectural effects; parallel tasks may have different lengths. And we have talked about resource contention a lot in these lectures: parallel tasks can share hardware resources and delay each other. Okay, I think we have seen a lot of this, so I am going to skip most of it and talk about something else at the very end; the rest is just for your review, we have actually covered it.
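As a minimal sketch of these bottleneck types, here is a short C example. It uses OpenMP purely for illustration (the lecture does not tie the discussion to any particular API), and the loops and variable names are made up.

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double shared_sum = 0.0;

    /* 1) Embarrassingly parallel: each iteration touches only its own
     *    element, so iterations can run on different processors.        */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    /* 2) Loop-carried dependency: b[i] needs b[i-1], so iteration i
     *    cannot start before iteration i-1 finishes; this loop is a
     *    sequential bottleneck and cannot be naively parallelized.      */
    b[0] = a[0];
    for (int i = 1; i < N; i++)
        b[i] = b[i - 1] + a[i];

    /* 3) Critical section: updates to shared data must be mutually
     *    exclusive, so contended threads serialize here (thread
     *    serialization).                                                */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp critical
        shared_sum += a[i];
    }

    printf("%f %f\n", b[N - 1], shared_sum);
    return 0;
}
```

The third loop is written the slow way on purpose, to show how a contended critical section serializes threads; in practice you would use an OpenMP reduction so that each thread accumulates privately and results are combined only at the end.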
So basically you have seen all of these difficulties: the serial bottleneck, synchronization bottlenecks, resource contention, and so on. Essentially, if your program is embarrassingly parallel, you have little difficulty in parallel programming, because the parallelism is very natural. For example, if you are updating pixels on a screen, that is completely parallel: unless you are manipulating pixels together, you update them completely in parallel. Physical simulation has a lot of parallelism, too: if you are modeling things where that part of the room has nothing to do with this part of the room, there is a lot of parallelism, as long as you can divide your work so that things that do not interact with each other at a fine grain are separated onto different processors.

The difficulty is really in the problems where this embarrassing parallelism does not exist. First of all, you need to get parallel programs to work correctly, even if they are embarrassingly parallel. And if they are not easily parallelizable, you need to optimize performance in the presence of bottlenecks. As a result, much of parallel computer architecture is really about two things: designing machines that overcome the sequential and parallel bottlenecks to achieve higher performance and efficiency, that is one part, and the other part is making the programmer's job easier in writing correct and high-performance parallel programs.

So what we are going to explore in the next lectures: we have already looked at the first part, at least some of the techniques used to overcome the sequential and parallel bottlenecks, and we will maybe see more of that, but now we are going to focus a bit more on making the programmer's job easier in writing correct and high-performance parallel programs. The first part clearly helps with that too, because the programmer does not need to put in as much effort, but there are some fundamental things you need to provide in hardware so that you can actually synchronize across different processors.

There are some readings we have covered that I am not going to talk about. Let me see how much time we have; we don't have time. Okay, any questions? I am going to leave these slides in, and I am not going to cover them in the next lecture, but they basically look at task assignment to processors, which we briefly discussed. You can look at them on your own; they are not that hard, they just cover some interesting trade-offs. If there are no questions, we can end early today and you can enjoy the winter. Okay, I'll see you next week.