Okay, let's get started. Good afternoon everyone, welcome to lecture 22 of Computer Architecture. Today we're going to start a new topic: interconnects. It's not a new topic in general, but it's new for this course this semester. We have actually seen the importance of interconnects in many different lectures, and you know we had this long discussion about whether the bottleneck is the memory or also the interconnect, and we believe that both memory and interconnect are actually the bottleneck in our systems, since memory also involves a lot of interconnect: how you connect banks in your memory module, all of this involves the same kind of questions.

So this is a very brief overview of our last few lectures. We discussed multiprocessing fundamentals, and yesterday we also saw memory ordering, for which consistency is another word, and cache coherence. This is a recap of yesterday's lecture about two cache coherence methods. We discussed the snoopy bus: essentially, caches are connected to a shared bus, they snoop on the bus, that is, they observe it, and they update some metadata related to the coherence protocol depending on the things happening on the bus. The effectiveness of a snoopy bus really depends on whether we can have a bus at all. Having a bus for a small multiprocessor is possible, say four cores, eight cores, maybe 16, but at some point the effectiveness of the bus degrades, because when you add more nodes to the bus you need to reduce its frequency, and then your communication latency becomes higher and higher, and at some point you run into scalability issues. There are also ways to build a virtual bus on top of a network: you can have a network and then create a virtual bus on it, but that needs a lot of effort to make it work, because it causes a lot of races, and handling those broadcast packets in your network is not really easy. That's why people also looked into directory-based techniques, where you have a directory, which adds a step to the critical path: when you want to access your memory, you first need to consult the directory, get a response, and based on that you can do whatever is needed. You can of course distribute your directory across different nodes, which is good, so the directory can be scaled in your network, but it has some issues as well. Both of these, the snoopy bus and the directory, also motivate why we need interconnects and why this topic is quite important.

So we will start with the concept of interconnects in general, which is also called an interconnection network: you shouldn't think of interconnects as existing only on the chip; an interconnect can be on a board, or in a supercomputer where you have several switches and these switches together make up an interconnect. We have also seen Venice in the SSD lecture: Venice uses a mesh network there, but that mesh network was not on a chip; it was on the SSD board, which is essentially a system-on-board design. So today we are mostly discussing interconnects in general, and we're going to learn a lot about that.
But next week we're going to focus more on on-chip networks: what are the potential limitations, let's say, that we face when we want to design interconnects on the chip. These are the points I already mentioned about the snoopy cache and the directory, and that brings us to the interconnection network lecture.

Before I start, I'd also like to make a quick announcement, because I got a lot of questions about the exam. The exam is happening on December 19, as you are already aware; I think that's a Thursday, the Thursday before Christmas. We will have the exam in the lecture time, from 1 p.m. to 4 p.m., in our usual lecture room, ML F39. We also reserved ML F38 just in case we don't have enough space, but essentially it's just the normal lecture slot: you just come and do the exam. We're also going to schedule a problem-solving or discussion session where we solve some exam questions and answer your questions before then; we will announce it soon, hopefully. I think I've answered the most important questions about the exam, but feel free to ask if you have more. As for grading: some of our TAs are working on it, some are a bit delayed, so the final grades for the labs, homeworks, and reviews will be out in early February, essentially because some of the deadlines are at the end of January, so we will spend some time in early February grading that material. For the homework questions we will do our best, and at the same time we also release solutions, so you can check the solutions and learn the answers from those, but we will also try to release your homework grades as soon as possible; I cannot promise, though. The fall semester is always quite hard due to the many deadlines, as you will also see later when you start a PhD or a Master's. Okay, any other questions? Good.

These are some very important readings for this lecture. The first two are very important papers. We will discuss the first one, "A Case for Bufferless Routing," I think next week; Professor Mutlu will discuss it. The other one is the paper on exploiting packet latency slack in on-chip networks. These are important readings that you should check. We also suggest some further readings as recommended; you're going to see them in Homework 5, essentially.

Okay, let's start with some basics. The first question is: where is an interconnect used? The answer is simple: we use interconnects to connect and communicate between components, and these components can be anything. We have many examples: when you want processors to communicate with processors; when you want processors to communicate with memories, like memory banks or cache banks; or caches with caches, like in a non-uniform cache access (NUCA) design, where you put a lot of cache banks in a network and build a bigger cache out of them; or when you want to connect I/O devices; or, for example, CPUs, GPUs, cache banks, and other accelerators. Essentially, whenever you want to connect components that each do something, you need an interconnect, and the interconnect enables communication, as you can see. These are two examples of that: an interconnect can, for example, connect components in both a shared-memory system and a message-passing system.
In a shared-memory system, you have storage modules, or let's say memory modules, that are shared across your processors, and these processors use the switching network, by which I mean the interconnect, to communicate, or rather to access the storage modules. But an interconnect can also connect different nodes where each node is a complete processor, that is, a processor plus storage, let's say a complete computer, and then these computers are connected through the network. With this structure, in your parallel programming you need to use the message-passing paradigm; if you know about MPI, the Message Passing Interface, it makes more sense for such a structure. For the shared-memory paradigm we usually use the first structure, where you have a shared memory; with CXL you can actually have an even bigger shared memory, as I guess you already know. With that, you can use the shared-memory paradigm to program. Either way, the interconnect is essential for both kinds of systems; we need it.

So why is it important? Just to summarize again: it affects the scalability and cost of the system, that is, how effectively you can connect all these nodes; this matters for the scalability of your system and also for its cost. It also affects performance and energy efficiency: what is the latency to access something over the network, and can your network sustain enough throughput? For example, you might have a network that you can access with low latency, but that network only works well when there are not many requests in flight at the same time; that means that if you have a lot of requests, your network saturates very soon, and then latencies become quite high. So, essentially, the design of the network interconnect affects the performance of the system a lot. You can make a similar argument for energy efficiency: how many hops, or let's say how many steps, do you need to take to reach your destination node, and at every step, how much power are you consuming? All of this affects the energy efficiency of the system.

It also affects reliability and security. For example, there are some networks in which there is only one path from a given node to a destination node; if something happens to that path and you get some faulty nodes or faulty links, then that destination is unreachable, those two nodes are not connected anymore, and that causes reliability issues. So people have worked a lot in this direction, to provide path diversity and to make sure the routing algorithm also benefits from that path diversity, in order to avoid these reliability issues. At the same time, an interconnect can also cause security issues: the interconnect is a shared fabric across different cores, and whenever you access a cache, for example, your message to that cache bank can be observed by other cores, because you have a shared medium. People have actually worked a lot on using the interconnect as a side channel, such that they can leak information. Designing an interconnect also involves a lot of parameters and goals. Here, in this table, you can see that there are many parameters,
like the number of processor ports, how many memory ports you are considering, what the peak bandwidth and the average bandwidth should be, the message latency, whether your interconnect offers quality of service or not, that is, whether it should satisfy certain service guarantees, and what we do about reliability. For example, this interconnect claims no message loss; that was a goal for this interconnect, to make sure it is reliable. And what is the availability? So there are a lot of parameters and goals for processor-memory interconnection networks that you need to consider when you design your interconnection network. I should say that interconnection networks are a very dense topic with a lot of detailed material, so if you want to learn about it in depth, I would recommend this book, "Principles and Practices of Interconnection Networks" by Bill Dally and Brian Towles. I actually read this book; it is a beautiful book, and I recommend you check it out if you have the time and interest. Some universities allocate one entire course only to interconnection networks. That's actually true for every topic we discuss in this course: we wanted to share a lot of exciting topics with you, but if you really want to go deep into any of them, you really need a full course, and interconnection networks is one of those topics.

Another example of why interconnects are important is the clock distribution network. We have a problem with distributing the clock: I guess you all know about the clock signal, and that we need it to synchronize across our modules. The problem is that the clock signal arrives non-uniformly late at different parts of the chip. We know the clock will arrive at a given point late, because there is a distance between the clock source and the destination, but the problem is that it is non-uniformly late: some components receive the clock earlier and some later, and that can cause potential timing issues. You know that we call this clock skew in digital design and computer architecture; you may know it from those courses. The solution is to design the interconnect to equalize the arrival time of the clock signal at all parts of the chip. People have developed different clock networks; this is one of the famous ones, which we call the H-tree, and the goal of the H-tree is to make sure that the clock arrives at all locations at roughly the same time. If you know the clock skew topic, you know that clock skew effectively increases both the setup time and the hold time; I hope you know what setup time and hold time are. Okay, good, so I don't need to explain, right? Anyone want me to explain? Don't be shy. Good, great. So yes, clock skew effectively increases both t_setup and t_hold, so it's very important to make sure that every component in the system observes the same clock, that is, they all observe the same peaks and lows in the clock signal.
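As a side note, here is a minimal sketch of the standard static-timing constraints that shows why skew hurts both checks; the symbols ($t_{cq}$, the logic delays, $t_{skew}$) are the usual textbook quantities, not values from the slide:

```latex
% Setup check: worst-case skew eats into the cycle-time budget.
T_{clk} \;\ge\; t_{cq} + t_{logic,max} + t_{setup} + t_{skew}
% Hold check: skew raises the minimum-delay requirement.
t_{cq} + t_{logic,min} \;\ge\; t_{hold} + t_{skew}
```

In both inequalities, $t_{skew}$ appears on the "cost" side, which is the sense in which skew effectively increases both the setup and the hold requirement.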
You can check these papers if you want to learn more, but essentially people have developed many different clock networks. And this is an interesting diagram for the Alpha processor: the spatial distribution of clock skew, where you can see, across the vertical and horizontal axes of the chip, what the clock skew is. So even when people put a lot of effort into distributing the clock, we still have this problem. Essentially, the interconnect is a big problem, honestly, for everything you want to communicate; it's not only the clock, this is just one example. If you want to learn more, you can check the Timing and Verification lecture from our Digital Design and Computer Architecture course.

But now I want to jump into interconnection network basics, and I will start with some boring stuff: terminology. Interconnection networks unfortunately involve a lot of terminology, which I'm going to go through in a few slides, but hopefully we will get to more exciting stuff as well. Any questions? I'm not sure, actually; yeah, it's good to check. Okay.

The first term is topology. The topology specifies the way switches are wired: essentially, the graph of switches, the way you connect the switches, or routers let's say, defines your topology. Topology affects the routing algorithm, that is, how you route your messages from a source to a destination. It affects reliability: as I said, some topologies provide only one connection from one node to another, and those topologies fundamentally do not provide good reliability, because if that link or connection breaks, you cannot reach that destination node anymore, which is not good. Topology also defines your throughput, how much throughput your topology can sustain, depending on how many wires, how many links, or how many paths you can provide. Topology can also affect your latency: in some topologies, sending a packet from one node to another takes a lot of hops; essentially, your topology has a long diameter (we're going to get to that term later), so your distance can be very long depending on the positions. Other topologies try to reduce those long distances and can provide better latency. Another point, which is very important and affects everything else, is how easily you can build that topology. If you do some research on topologies that people have proposed for interconnection networks, you will find a lot of exciting ideas, many exciting graph structures, with a lot of good arguments behind them, but in the end, ease of building is a very important metric. That's mostly why, in network-on-chip designs, when we want an interconnect on the chip, you don't really go that fancy: you mostly use topologies like a mesh, or rings, or hierarchical rings, or hierarchical buses, or things like that. There are many other topologies that can be used in supercomputers: there, your interconnect goes through switches that are not on a chip, so you can think of connecting things in many more ways; you have more possibilities. But on the chip, you always need to care about buildability, which is very important.
The routing algorithm determines how a message gets from the source to the destination. In your topology, there are some paths along which you can get from the source to the destination; the routing algorithm essentially tells you which of all these paths to take. There are many variants of routing. Some routing algorithms offer only one solution: every time you ask, the routing algorithm says that from this source to that destination, this is your route; there are no other routes, no other paths. But there are also routing algorithms that are more adaptive, as we're going to see, and they can offer many alternative routes. As I said, routing can be static or adaptive; people also say deterministic and adaptive, so there are several different terms here as well.

Another important term is buffering and flow control: what do we store within the routers and links? Essentially, flow control answers this: when you want to send a message from a source to a destination, you know the path, because you know the routing; but how do you actually move your packet, your message, along that path, in chunks or not, and so on? That is the subject of flow control. For example, when you go from one node to the next, do you make a connection or get an acknowledgement from the next router? In the terminology we call it the downstream router: you send a packet or message from the upstream router to the downstream router. This needs some communication: there should be, for example, a buffer in the downstream router, so that the downstream router can buffer your data if there is blocking. That being said, there are also network designs that are bufferless; one very exciting work that we're going to cover next week makes the network have no buffers at all, and that involves a lot of exciting questions in flow control and routing that you're going to see later. Flow control also determines whether you operate on entire packets or on parts of packets, etc.; you're going to see more when we get to that part of the lecture. And how do we throttle during oversubscription? Sometimes the network is overloaded and you want to reduce the load, you want to throttle; so how do you do that? All of these are topics of buffering and flow control, and this is tightly coupled with the routing strategy, as you can imagine. Any questions? Okay, good.

Another important term is the network interface, which is a module that connects endpoints, which can be processors or caches, to the network. Usually a network is considered to be a network of routers, and each router should be connected to a cache, or essentially to a processing element or node, which can be a cache, processor, GPU, accelerator, whatever. To connect a router to the processor, there is a module named the network interface, and that network interface connects that router to that processor. This structure makes the most sense when you really want to decouple your network from the processors and caches, but there are also designs where the network is tightly coupled with your processor and cache design.
In such designs, you don't really need a network interface, or the network interface can be integrated into your processor or router; it does not need to be a separate chip or separate structure. You also have the link, which is a bundle of wires that carries a signal. Essentially, when you connect two routers, or a processor to a router, you call the whole bundle of wires a link, and links can consist of several wires: 8 wires, 128 wires, 256, for example. There are many variants, depending on your cost and on the way you design your router and your network.

Another important topic is the switch and the router. These two terms are sometimes used interchangeably, even though, in interconnection network terminology, routers are generally more complex than switches. Switches are usually quite simple: you can consider them to be just a switch, and somehow you need to reconfigure the way it connects inputs to outputs. A router, on the other hand, involves a lot of decision making; you can think of a router as being more intelligent than a switch. There are parts of a router that do arbitration and make decisions, trying to allocate resources to the requesters in a way that optimizes bandwidth. Switches are overall less complex, but that being said, some papers just use the two terms interchangeably. As you can see, the job is to connect a fixed number of input channels to a fixed number of output channels. A channel is a single logical connection between routers or switches; you can think of a channel as being very similar to a link, but the difference is that a channel is a logical term: when you discuss topology, which is a graph, you talk about channels, e.g., you say that this topology provides eight channels. But how many wires actually build that channel, or what the bandwidth of that channel is, those details belong to the link; when people want to give more detailed information, they usually use the term link. Questions? Are these terms surprising, or do they make sense so far? Yes? Okay.

Yet more terminology. We have the node, which is a router or switch within a network. We have the message, which is the unit of transfer for the network's clients: when a processor or a cache wants to send some data to a destination node, we call that a message, and you might actually want to send many messages; so one message is the unit of transfer for the network's clients. A packet is the unit of transfer for the network itself: what you want to communicate to the destination might involve many messages, but when you send each message over the network, you need to send it packet by packet, because your network is designed to handle packets. That's also the case in computer networks, I guess; how many of you have taken a networking course? These terms are quite similar there as well. And a flit is a flow-control digit: once we get to flow control, you're going to see that when we send a packet from one node to another, we usually send it flit by flit. Okay, and these are hopefully all the terminologies for now.
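To make the message/packet/flit hierarchy concrete, here is a minimal sketch; the sizes (a 64-byte message, a 4-byte header, 16-byte flits) are made-up assumptions for illustration, not values from the lecture:

```python
# Hypothetical sizes: a 64-byte message plus a 4-byte header crosses a
# link that carries 16 bytes per flit -> one head flit, body flits, one tail.
FLIT_BYTES = 16

def packetize(message_bytes, header_bytes=4):
    total = message_bytes + header_bytes
    nflits = -(-total // FLIT_BYTES)  # ceiling division
    if nflits == 1:
        return ["head/tail"]
    return ["head"] + ["body"] * (nflits - 2) + ["tail"]

print(packetize(64))  # ['head', 'body', 'body', 'body', 'tail']
```

The flow control we will see later operates on exactly these flits, while the routing decision is made once per packet (the head flit carries the destination).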
Depending on the way you design your network, you can categorize networks as direct or indirect. In direct networks, the endpoints sit inside the network; when they sit outside the network, we call it an indirect network. This is an example of an indirect network: these are your endpoints, like processors; you have 16 processors or caches, for example, and here you again have 16, and these can be the same or different nodes, so you can also think of 16 processors here connecting to 16 cache banks, let's say. This is your network, and you can see that the endpoints are outside of the network; that's why we call it indirect. Each of these boxes is a router or switch, and we call them radix-2 because they have two inputs and two outputs; sometimes people also use the term arity, so these routers are 2-ary. So this is indirect, and this is an example of a mesh network, which is a direct network: you can see that the endpoints are actually inside the network. Questions? Good.

Okay, now let's dive into topology, and I'll start with more terminology. In the area of topology, we have regular and irregular topologies. Regular topologies are those where your topology is essentially a regular graph: from every point, when you look at the topology, you see the same thing. A ring is one of them; a mesh is also considered a regular network, even though a mesh is not completely regular because of the corner and edge nodes (these nodes have only two connections, and this node has three connections), but people usually still consider a mesh regular. Irregular is anything beyond that. At some point I will also use the blackboard to show you more, but I don't think you need further examples of regular versus irregular, right? Makes sense.

Routing distance is the number of links or hops along a route: whenever you send a packet from a source to a destination, there is a certain number of hops or links you need to take, and that is the routing distance. The diameter is the maximum routing distance within the network, so the diameter is defined by the topology. Let's see here: in this indirect network, your diameter is four, right? For example, to go from this node to this node, you take all these steps, one hop, another hop, another hop, and another hop, so you need four hops to reach the destination. What about the diameter of the mesh network? The diameter is from this corner node to the opposite corner node: you go along X and then along Y until you reach the destination, at least that's one way, so you can see that the diameter of a mesh network is actually quite high. The average distance is the average number of hops across all valid routes. This can be defined uniformly: you assume your traffic pattern is uniform random, that is, every source sends messages to every destination with equal probability; if you consider that kind of traffic pattern, you can calculate the average distance. But people also use the average distance for other traffic patterns.
Sometimes you might use a network which does not really have a good diameter, meaning the diameter is long, but the traffic pattern you usually run is more or less nearest-neighbor: most of the time you only communicate with your neighbors, and you rarely go to the far-away nodes. In the end, your average distance for that kind of traffic would be much lower. Depending on your usage, that can also mislead people: for example, someone says the average distance they calculated for this network is, say, three; in a mesh network, if most of the time you just communicate with your neighbors, your average distance would be around one, two, or three, something like that, but if your traffic pattern is uniform, your average distance would be something around, in this example I guess, eight or so. So, as I said, people sometimes calculate the average distance analytically, with mathematics, and with mathematics they assume completely random traffic; but sometimes people measure the average distance, and here "measure" is the right word, because they measure it using simulations, and in simulation they assume certain traffic patterns, and the traffic pattern affects the average distance. Yes? [Student question] Yeah, that's true. Of course, you can also count injection: you need to inject your packet into the network, and then you need to eject it, and that's always plus two. It depends on the definition; you always need to state the exact definition so you know what you are reporting, but usually people don't report the injection and ejection. That's true; here it is three hops, for example.

Okay, another very important term is bisection bandwidth, and people use bisection bandwidth to compare topologies with each other. Bisection bandwidth is often used to describe network performance: you cut the network in half, and the bisection bandwidth is the minimum number of channels spanning the two halves, times the bandwidth of each channel. In a mesh network, if you want to calculate the bisection bandwidth, you need to find a way to cut the network in half and then see how one half is connected to the other: do we have many channels or not? You can make a nice analogy with road traffic here: there are some cities where, from one part of the city to another, there are not many channels; for example in Istanbul, I know there is essentially only one such channel, and there is always a traffic jam on that channel, right? That's exactly it. But that being said, it really depends on whether your traffic actually travels long distances: if you are mostly communicating with neighbors, sometimes you don't want to spend a lot of energy or cost to provide a very good bisection bandwidth. It really depends on how you want to design your network; you need to optimize for your workloads, because if you try to design a network that works well for everything, which I don't think is possible, a network that works well for every possible scenario, you are going to end up with a network that is quite costly. You always need to optimize your network for your goals and for the workloads you have in mind. Yeah, I remember that Istanbul traffic, because I have been in that traffic; that's why. But it's not only Istanbul that has that issue, I think many cities do. In Tehran, the city where I did my bachelor's, master's, and PhD, we also have that issue: whenever you want to travel from the west of the city to the east, it's just a headache. I remember every morning I had to travel from the southwest to the northeast (Nik is laughing because she knows what I'm talking about), from the southwest of the city to the northeast, and that could easily take two hours; two hours to reach your destination, and another two hours to go back. I guess all cities have such issues. Probably the planners thought people would work more or less locally, so that you have neighbor commutes, but that's not happening, because resources are not that balanced, I would say.
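Since diameter, average distance, and bisection count came up together, here is a minimal sketch that computes them by brute force for a small 2D mesh; the 8x8 size and the one-channel-per-link assumption are mine, just for illustration:

```python
# Diameter, uniform-random average distance, and bisection channel count
# for a k x k mesh, computed by BFS from every node.
from collections import deque

def neighbors(x, y, k):
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < k and 0 <= y + dy < k:
            yield x + dx, y + dy

def hops_from(src, k):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in neighbors(*u, k):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

k = 8
dists = [d for x in range(k) for y in range(k)
         for d in hops_from((x, y), k).values() if d > 0]
print("diameter:", max(dists))                  # 2*(k-1) = 14 for k = 8
print("average distance:", sum(dists) / len(dists))
print("bisection channels:", k)                 # k links cross a vertical cut
```

With nearest-neighbor traffic you would average over a different set of source/destination pairs, which is exactly why the measured average distance depends on the traffic pattern.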
Okay, another very important property, which can really affect your performance, is whether your network is blocking or non-blocking. If connecting any permutation of sources and destinations is possible, the network is non-blocking; otherwise, the network is blocking. So for every pair of source and destination there is a connection, right? If you want to make all these connections at the same time and your network can still sustain that, then your network is non-blocking, which is very hard to achieve; most of the time our networks have some blocking aspects. Depending on your luck and on the current traffic status, you might not be able to reach your destination immediately. That doesn't mean the destination is unreachable: you can reach it, but you need to wait, because there are some blocking links or routers in your network. There is also a nice trade-off that people have proposed: rearrangeably non-blocking networks, where the network can be rearranged in order to make connections. These are also considered reconfigurable network architectures: you can reconfigure the design of your network, you can reconfigure your topology, such that you make the connections needed to avoid blocking. Okay, any questions?

Now let's see some example topologies. We're going to discuss several of these; for some I will provide details, for some I will just skip ahead, but hopefully you will get some insight into why topology is important. We will start with the simplest topology, the bus. There are some metrics to evaluate an interconnect topology: the cost of the network, e.g., how many routers or how many wires we need to spend to build our interconnect; the latency of the interconnect, e.g., in hops or in nanoseconds; and how the interconnect behaves under contention. Of course many other metrics exist that we should think about, like the energy consumption of the network, the bandwidth of the network, and the overall system performance. I would like to emphasize overall system performance: in interconnection network and network-on-chip research, people have worked a lot on improving only the network performance, which is good, and I think that kind of research is needed, but at the same time we should also think about the effect of that network performance on the total system performance.
Sometimes, when you use a network, for example in a GPU: GPUs are designed to tolerate latency, essentially by making it possible to switch across threads, context switching very fast, so they can tolerate latency. If your average network latency is around, let's say, 200 cycles, and you make some performance improvement and bring those 200 cycles down to, I don't know, 175 or 150 cycles, which is actually a good improvement, then if you only report the performance of the network, you can say: I reduced the average latency from 200 to 150. But the moment you look at the performance of the system, you run your GPU application, and your network is beautiful because the latency is now 150, yet the overall improvement in your GPU application is something around 1%, or maybe less, because those 50 cycles could be tolerated anyway using the latency tolerance you have inside your GPU. If your network latency is around, let's say, 1,000 cycles and you bring it down to 300 or 400, then you can observe some real performance improvement. So it's always important to look at, or at least analyze, the effect on overall system performance. In the early research in this area, people actually overlooked this, or simply didn't focus on it, but later on people started to report system-level performance metrics as well in interconnection network work.
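A toy model of this argument is below; all the numbers are made up, including the 200-cycle hiding capacity. It is just to illustrate the reasoning, not a real GPU model:

```python
# Toy model: assume warp switching can hide up to `hideable` cycles of
# network latency; only the excess shows up in application run time.
def exposed_latency(net_latency, hideable=200):
    return max(0, net_latency - hideable)

for lat in (1000, 400, 200, 150):
    print(f"network latency {lat:4d} -> exposed {exposed_latency(lat)}")
# 1000 -> 800, 400 -> 200, 200 -> 0, 150 -> 0:
# cutting 200 cycles down to 150 buys the application almost nothing.
```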
This is the bus, which you can imagine like this: here we are connecting eight nodes to the bus, all nodes connected to a single link, and this can simply be processors and memories, e.g., we have four memory banks and then processor-and-cache nodes, all connected. It's simple and cost-effective for a small number of nodes. It's cost-effective only for a small number of nodes because, if you have many nodes, you cannot keep the bus cost-effective: as you lengthen the bus, its capacitance and resistance get larger and larger, and to sustain a good frequency you need to add a lot of repeaters and other circuitry so that the bus keeps good signal integrity, and that makes it not really cost-effective anymore. It's easy to implement, and it's also very good for coherence, as you know, e.g., for snooping caches, because it's a single point of serialization, so everyone can observe what's happening on the bus. But, as you can imagine, it's not scalable to a large number of nodes, because of limited bandwidth and electrical loading, which in the end reduce the frequency. It's also not very good under contention: it saturates very soon. When you have only a few nodes, the bus needs to arbitrate across only those few nodes, say four; but once you have 64 nodes, the bus needs to service 64 nodes, and at each point in time, assuming your bus is not hierarchical, so you have only one single bus, there can be only one driver: no more than one node can drive the bus to write on it. Many other nodes can read it, which is why a bus is very good for broadcasting and multicasting, but only one node can drive it. So essentially, when you want to write to the bus, to drive it, you need to consult the arbitration unit, and you can see that when you have many nodes, your chance of getting permission to access the bus gets lower and lower, and in the end your performance is low. I think the trade-off for the bus is quite clear, right? Or does anyone want to name another upside or downside of the bus? I believe there are more, but these are just some examples. Okay, good.

At the other extreme, we have the point-to-point network, where every node is connected to every other node with direct and isolated links. Here, for example, node zero is connected to all other nodes with isolated links, meaning that node zero can reach every other node with only one hop and without any contention; that's the point-to-point connection. The same holds for node one, node two, and the other nodes. Of course, this has the lowest contention, and it has potentially the lowest latency. Why only "potentially"? Any thoughts? Sorry? Yes, but that link is not contended, right? You can consider that there is an isolated link from node four to node zero; there can be two unidirectional channels or links. But there is a more important reason. Any other thoughts? Yes, exactly, that's one of the very good reasons, and it brings me back to the point I made about "easy to implement, easy to build." Once you want to build this point-to-point connection on your chip, it's essentially not possible; I mean, it is possible, but some of these wires cannot easily be routed to their destinations because of wire routing congestion, so some wires become longer and longer, and a long wire has higher latency, so in the end you might not actually have the best latency. I would say it provides the best latency when your network is highly contended, because then your link latency is not really what bothers you. We're going to learn about the performance metrics of interconnects; there are many different parameters that affect the latency, or the performance, of an interconnect. But essentially, when your network is not loaded, not contended much, your latency is bounded by your link latencies; when your network is highly contended, with many requests in flight, your latency is bounded by the queueing latency, by how long you are queued. So since this network works the same at zero load and at high load, at some point it will outperform many others, but at low loads, we don't know. [Student question about the bus] In a bus, yes, there is an arbiter somewhere that does the arbitration: every node sends a request to that arbitration unit, and then we have these arbitration cycles where the arbiter arbitrates across the active requests; there can be many different policies, round-robin or others. Okay. But you can see that point-to-point has the highest cost, because you have O(n) connections or ports per node, and overall O(n²) links, so it's very high cost and not scalable.

Another nice trade-off is the crossbar: the crossbar tries to achieve the same performance as the point-to-point connection but with much less overhead, and that's why the crossbar has actually been used in many real processors.
So, in a crossbar, every node is connected to every other node, but with a shared link for each destination; that's the main difference. In point-to-point, every node can connect to node zero over a different, separate link; in a crossbar, only one node can connect to node zero at a given time, because there is a shared bus for it. Essentially, you have multiple buses in a crossbar, and that makes the difference. This enables concurrent transfers to non-conflicting destinations: as long as your accesses go to different destinations, you have no conflict; if your accesses go to the same destination, yes, you have a conflict. But in reality, in that case you would have had a conflict anyway: the network might not cause it itself, but once all of that data arrives at the destination, the destination needs to process it, and the destination might not have enough bandwidth to process all those outstanding requests at the same time. So in the end, we sometimes don't really need concurrent transfers to conflicting destinations, because the destination node is often not capable of processing all those requests simultaneously anyway. Makes sense?

So this could be cost-effective for a small number of nodes, and it has actually been used in some real chips. It's low latency and high throughput; of course it's expensive, but not as expensive as the point-to-point connection. It's not scalable, because it's O(n²) cost again: as n increases, the cost grows quadratically. And arbitration becomes difficult as n increases, because now you have shared buses and you need to arbitrate, which involves some difficulty. This is used in core-to-cache-bank networks in IBM Power5 and in Sun processors. It's in GPUs too; I'm not completely sure about the exact network, but many research papers have assumed that a kind of crossbar connects the GPU cores to the L2 cache banks. So as long as the number of cores is not that large, and likewise the number of cache banks, a crossbar can be implemented in real chips. This is another view of a crossbar design: you have these tri-state buffers, and you can control which of them drives the bus while the others let the bus float. And this is the network structure in the Sun processor, which you can check if you're interested; essentially they use this kind of crossbar to connect cores to caches.

There is also another trade-off in designing crossbars. This design is actually bufferless, right? There are no buffers; you need to arbitrate, and at each given time one of the sources can drive the bus. But you can also imagine having some flow control and some buffers: the sources send messages into those buffers, and the buffer then puts the data on the bus. This can potentially ensure that the arbitration latency is overlapped with sending your request: while you are sending your request, the arbitration can happen in parallel, and at some point this queue drains and you put the data on the bus. This is, of course, simpler for arbitration scheduling and gives efficient support for variable-size packets; however, it comes at the price of buffers: you need n² buffers, essentially. Makes sense?
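Here is a minimal sketch of the arbitration idea; I use a fixed priority for brevity (real crossbars typically use round-robin or more elaborate allocators, and this is my own illustration, not the Sun or IBM design):

```python
# One arbitration cycle of a crossbar: requests to distinct outputs are
# granted concurrently; conflicting requests to one output are serialized.
def arbitrate(requests):
    """requests: dict mapping source port -> requested destination port."""
    granted, busy_outputs = {}, set()
    for src in sorted(requests):       # fixed priority: lower port wins
        dst = requests[src]
        if dst not in busy_outputs:    # is this output still free this cycle?
            granted[src] = dst
            busy_outputs.add(dst)
    return granted

print(arbitrate({0: 3, 1: 3, 2: 1}))   # {0: 3, 2: 1} -> source 1 must wait
```

Sources 0 and 2 proceed in the same cycle because their destinations differ; source 1 loses the arbitration for output 3 and retries later, which is exactly the "non-conflicting destinations" behavior described above.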
Okay, now the question is: we know the crossbar is good, but can we get lower cost than a crossbar, with similar performance, and still have low contention compared to a bus? The answer is yes: people have proposed multi-stage networks. Multi-stage networks fall into the category of indirect networks, as we discussed. This, for example, is one multi-stage network, the Omega network. The idea is an indirect network with multiple stages of switches between the terminals, the nodes. We have some switches here, and each switch in this example has two inputs and two outputs, but it could also be four inputs and four outputs, or four inputs and two outputs; there are many variants. The cost of this network is O(n log n), where n is the number of processors; the log here is log base 2 of n, because the routers are 2-ary. If the routers were 4-ary, for example, it would be log base 4 of n. That's the cost, and the latency is O(log n) again, which makes sense. There are many variations of these multi-stage networks: Omega, Butterfly, and so on; you don't need to know their exact structures, these are just examples.

An interesting thing about the Omega network is that it is a blocking network, and here you can see the definition of blocking with this example. Assume a packet is being sent from this node to this node; you can see there is only one path, and that's part of the issue: there is only one path, you have no path diversity. Now node 001 also wants to send a packet to a completely different node, so you don't have a conflicting destination, right? But unfortunately, because of a shared path (you need to reach that node through this shared channel), this green flow needs to wait, and that causes conflicts, which is not good. People have tried hard to propose multi-stage networks that mitigate conflicts as much as possible, or make their likelihood almost zero; of course, there may still be cases where you stall, but hopefully those cases are not that probable under normal usage. So, we already discussed this. Now, if you think about it, in a crossbar each of these multiplexers is like a router with n inputs and one output, an n-ary router with a single output. In point-to-point you have n inputs and n outputs everywhere; here you have n inputs and one output, and that's the difference in the crossbar. What makes the multi-stage network simpler is that its routers are, for example here, two-input two-output, and that's essentially what we have here. Any questions?
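One nice property of the Omega network, which follows from its perfect-shuffle wiring, is destination-tag self-routing: each stage looks at one bit of the destination address to pick its output port. Here is a minimal sketch, my own illustration for an 8-node network with 2x2 switches:

```python
# Destination-tag routing in an n-node Omega network (n a power of two):
# each stage does a perfect shuffle (rotate the label left) and then
# replaces the low bit with the next destination bit, MSB first.
def omega_route(src, dst, n=8):
    stages = n.bit_length() - 1            # log2(n) stages of 2x2 switches
    label, path = src, [src]
    for s in range(stages):
        bit = (dst >> (stages - 1 - s)) & 1
        label = ((label << 1) | bit) & (n - 1)
        path.append(label)
    return path                             # ends exactly at dst

print(omega_route(0b001, 0b110))            # [1, 3, 7, 6]
```

Conflicts arise exactly when two such paths need the same intermediate label at the same stage, which is the blocking case shown on the slide.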
So, how do you implement multi-stage networks? There are different ways. One way: you send your packets and they traverse the network hop by hop, and you somehow need to route and buffer them until they get to the destination. But you can also think of circuit switching, which we're going to learn about near the end of this lecture, in flow control: you essentially create a circuit from the source to the destination. You reserve a path, you create a circuit from source to destination, and once your circuit is established, you use that circuit to send your messages. We'll get to circuit switching very soon, at the end of this lecture, and this is packet switching, which I already mentioned.

I don't want to go into the details of circuit and packet switching here, because we're going to cover them today in flow control, but essentially, circuit switching sets up a full path before transmission: establish the route, then send the data. That's exactly the approach we also use in our Venice work for the SSD: we reserve a path, establish the transfer from the flash controller to the flash chip, and then we send the data. The problem is that no one else can use those links while the circuit is set up. Packet switching, in contrast, performs routing per packet: each router routes each packet individually, possibly over different paths, and if a link is free, any packet can use it; you don't reserve links in packet switching. Circuit switching gives faster arbitration, and you don't necessarily need buffering; you might need some buffering while you establish the path, but after that you don't need buffers. However, setting up and tearing down the path takes time. That's why circuit switching is very good when you want to transfer a lot of data from a source to a destination: much prior work uses circuit switching for streaming accesses, for example, because with a streaming access you know that, once you build the circuit, this source is going to send data to that destination for a while, so it makes sense to establish the circuit, and its overhead can be amortized easily. But when your communication is not that frequent, or not that large, it's better to use packet switching. Packet switching is potentially slower (again, potentially; it depends on the case), and now you need to handle contention, via buffering or other techniques that you're going to learn about later; however, you don't pay the setup and teardown time: when you want to send your packet, you don't wait; if your source router is free, you just hand the packet to it and say, okay, handle this packet for me until it gets to the destination. But for circuit switching, you need to wait for that path to be set up first.
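A back-of-the-envelope way to see when each wins, under simplified assumptions of my choosing (store-and-forward packet switching, H hops, per-hop router delay $t_r$, channel bandwidth $b$, message length $L$, circuit setup time $t_{setup}$):

```latex
T_{circuit} \;\approx\; t_{setup} + \frac{L}{b}
\qquad\qquad
T_{packet} \;\approx\; H \cdot \left( t_r + \frac{L}{b} \right)
```

For large L, the setup term amortizes and the circuit streams at the full channel rate; for short, infrequent messages, $t_{setup}$ dominates and packet switching wins, which matches the streaming-access argument above.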
Let me also quickly discuss switching versus topology. The circuit-versus-packet-switching choice is independent of topology; it's a higher-level protocol for how a message gets sent to a destination. For any topology, in a mesh for example, you can have both packet switching and circuit switching. But some topologies are more amenable to circuit versus packet switching, and that's why people design different networks for them. This is an example that you have seen, I guess, in the heterogeneity lecture: the Tilera processor. Its topology is a 2D mesh, but there are actually five different networks, and four of them are packet-switched, used for different purposes. These are also related to the coherence protocol; we didn't discuss this much and we're not going to discuss it today, but essentially, to make your coherence protocol work on the network, you sometimes really need a separate network for responses, for acknowledgements, and a separate network for data. People nowadays use virtual channels to give that illusion of different networks, but in some designs they really built completely separate networks. And they also have one circuit-switched network, which is used for streaming data.

This is another example of a multi-stage network; I'm going to show just a few examples and then we can take a break. This is the Delta network: a single path from source to destination, and each stage has different routers. It was proposed to replace costly crossbars as the processor-memory interconnect, and you can read the description of the Delta network in this paper. This is another one, the Omega network, which we already saw, and which was used in the NYU Ultracomputer. A very interesting thing about this machine is that they used the idea of doing computation in the network: while your data is being transferred through the routers or switches, you can also do computation on that data. They implemented a fetch-and-add operation, which returns M and replaces M with M+I, so as your data traverses the network, you are also doing some computation, very similar to near-data processing (see the sketch after this review). Butterfly is another famous example of a multi-stage network; it's equivalent to the Omega network, and indirect. Conflicts can cause saturation: this network can still have conflicts, but randomization of route selection can help; you can randomize the routes you take from source to destination to reduce conflicts.

Let's quickly review these topologies: crossbar, multi-stage logarithmic, and mesh. We haven't seen much of the mesh yet (we're going to see more), but you have some idea about it. Crossbar and multi-stage are indirect; mesh is direct. The crossbar is non-blocking; the multi-stage logarithmic network is blocking; the mesh is also blocking. The cost is O(n²) for the crossbar, O(n log n) for the multi-stage network, and much less, O(n), for the mesh. Latency is O(1) for the crossbar, O(log n) for the multi-stage network, because you need to traverse the stages, and O(√n) for the mesh. Any questions?
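Here is the promised sketch of the fetch-and-add semantics; this shows only the operation's meaning in plain Python, while the Ultracomputer actually performs (and even combines) these inside the network switches, which this sketch does not model:

```python
# fetch_and_add(M, I): return the old value of M and replace M with M + I.
def fetch_and_add(memory, addr, inc):
    old = memory[addr]        # value returned to the requesting processor
    memory[addr] = old + inc  # update applied on the way through the network
    return old

mem = {0x10: 5}
print(fetch_and_add(mem, 0x10, 3))  # -> 5
print(mem[0x10])                    # -> 8
```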
Yes, a question: why is the crossbar considered non-blocking? So, the crossbar is non-blocking because, as long as you don't have conflicting destinations, every connection can be made. That's the distinction: sometimes your network is blocking, meaning that once you establish a path from one source to a destination, there can be other pairs that cannot use their paths. And if you do have a conflicting destination, you probably need to wait anyway, because of the destination itself: in the example of the flash chip, say you want to send a read request to the flash chip; the flash chip can service only one read request at a time, so either way you need to wait. People do think about providing non-blocking behavior in that stronger sense too (as I said, the point-to-point connection is completely non-blocking), but usually a network is considered non-blocking if it can provide all possible communications as long as there are no conflicting destinations. So the crossbar is considered non-blocking because, as long as you don't have conflicting destinations, you can establish any pairing of sources and destinations. That's a very good question. Okay, now we're going to see some of the direct topologies, but before that, I think it's good to take a break. Let's take an 11-minute break and continue at 2:35.

[break]

Okay, let's get started again. Now let's take a look at some topologies that are considered direct topologies. We have already seen the bus, but here is another topology which is quite useful: the ring. With a ring topology you have this wrap-around link, and that provides some path diversity. Assuming your ring is bidirectional (a ring can also be unidirectional, so that you cannot go both ways), between every pair of source and destination there are two paths. So you have some path diversity, which is good for performance and also for reliability: if one of these paths is faulty, your ring becomes an array, but your network is still connected. Of course, it cannot tolerate more faults than that; with two faulty nodes or links the network becomes disconnected. A ring is still cheap, O(n) cost, but the latency is still high, O(n); even though the latency is lower than an array's, in big-O terms it is still O(n). It's also not easy to scale, and the bisection bandwidth remains constant: for an array the bisection bandwidth is one (there is only one connection), and for a ring it is two, and these stay constant as you increase the number of nodes, which is bad. As you make your network larger and larger, you would at least hope the bisection bandwidth scales somehow; if it stays constant, at some point it becomes your bottleneck. Still, rings have been used in some processors and many other commercial systems.

This is a unidirectional ring, where you can send packets in only one direction, here counter-clockwise. Again, a simple topology and implementation, with reasonable performance if n and the performance requirements are still moderately low; the cost is O(n), as we discussed, and the average hop count is n/2. But of course you can also have bidirectional rings, which reduce latency further and improve scalability, at the price of a slightly more complex injection policy: you need to select which ring to inject a packet into. You can have two different unidirectional rings that are connected, and then decide into which ring you put your packet, and you can think of having even more rings in your system.
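A minimal sketch of that injection decision, as my own illustration: on a bidirectional ring, inject in whichever direction gives the shorter hop count.

```python
# On a bidirectional ring of n nodes, pick the direction with fewer hops.
def ring_injection(src, dst, n):
    cw = (dst - src) % n                  # hops going clockwise
    ccw = n - cw                          # hops going counter-clockwise
    return ("CW", cw) if cw <= ccw else ("CCW", ccw)

print(ring_injection(1, 6, 8))   # ('CCW', 3): the wrap-around is shorter
print(ring_injection(1, 3, 8))   # ('CW', 2)
```

With this policy and uniform traffic, the average hop count drops from about n/2 (unidirectional) to about n/4.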
One example is one of Intel's processors; with hierarchical rings you essentially have several rings and then connect those rings with another ring. You can imagine many such topologies, and they are very useful and have been used in practice. If you want to learn more about hierarchical rings, you can check this paper; it's a beautiful paper from which you can learn a lot about the trade-offs. As I said, hierarchical rings have been used in real systems; this is a work from Huawei that essentially uses hierarchical rings.

Okay, now let's take a look at the mesh. We always talk about the mesh, but we never actually presented it formally, so here it is: each node is connected to four neighbors, at least for most nodes — as you can see, the nodes on the edges don't have four neighbors. When we talk about a mesh we always use the directions north, east, south, and west; up is considered north, as you can imagine, then west, east, and south. Cost is O(n) and average latency is O(sqrt(N)), as we also discussed. The mesh is easy to lay out on chip because it is regular and has equal-length links, and that is really important: as you add nodes to the network, you don't need to lengthen the wires. That makes the mesh a very suitable candidate for many of these interconnects. We also get a lot of path diversity: there are many ways to get from one node to another. For example, if you want to send a packet from this node to this diagonally adjacent node, you have two paths: you can go this way or that way. If you want to send a packet to this farther node, you have more paths: you can go XYXY, or YXYX, or many other permutations. All of these are shortest paths, but you can also take what is called a non-minimal path, and that is actually useful: I will explain non-minimal paths when I talk about the trade-off between path length and blocking. A non-minimal path increases the length of the path, but at the same time it can be used to avoid conflicts, and with that you reduce queuing time; once queuing is the major bottleneck, you can mitigate it with non-minimal paths. We also used non-minimal paths in our Venice work, as you probably remember. The mesh has been used in processors like the Tilera and in many on-chip networks.
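To put a number on that path diversity, here is a small sketch — an illustration of the standard counting argument, not something from the slides: the number of minimal paths between two nodes dx hops apart in X and dy in Y is the binomial coefficient C(dx+dy, dx), i.e., all interleavings of the X hops and Y hops.

```python
from math import comb

# Count the distinct shortest paths between two mesh nodes:
# every minimal path is some ordering of dx X-hops and dy Y-hops.
def minimal_paths(src, dst):
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return comb(dx + dy, dx)

print(minimal_paths((0, 0), (1, 1)))  # 2 paths (the diagonal-neighbor example)
print(minimal_paths((0, 0), (2, 2)))  # 6 paths — diversity grows quickly
```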
Now the torus. I should also mention that in mathematics there is a way to multiply two graphs — the Cartesian graph product, if I'm not mistaken. When you take the product of two arrays you get a mesh, and when you take the product of two rings you get a torus; it's interesting how this field is supported by graph theory. So the torus is a kind of mesh, but — and here let me fix the terminology I confused earlier — the mesh is considered a regular network, yet it is not symmetric: the mesh is not symmetric at the edges, and performance is very sensitive to the placement of tasks at the edge versus the middle. The torus avoids this problem, as you can see, and we also get higher path diversity and bisection bandwidth than the mesh. Of course, this comes at higher cost, it is harder to lay out on chip, and you get unequal links, because the wraparound links can be longer. People try to avoid that by interleaving the nodes to make inter-node latencies constant: with careful placement — this figure is for a ring, I believe — all links become equal length, but at the price that every link is now longer than in the straightforward layout, where the links inside the mesh are short and only the wraparounds are long. So it is a folding trick: all links are balanced, but they are all somewhat longer.

Another topology is the tree. This can actually be considered an indirect topology, because usually the processors and caches are connected to the leaves of the tree. It is a hierarchical topology with latency O(log n), which is good for local traffic; but if you need to communicate over longer distances, you have to go up through the root, and that can add a lot of latency. You know that for distributing the clock network we use an H-tree. Trees are also cheap and easy to lay out, and many works have used them, but the root can become a bottleneck. That's why people proposed the fat tree: as you go to the upper levels, the bandwidth of the links becomes higher — the links get fatter and more powerful — because you have more load as you approach the root. This is the example of the CM-5 fat tree: the data network uses a fat-tree topology, each interior node connects four children to two or more parents (here they consider only two parents), and the processing nodes form the leaves of the tree. As more processing nodes are added, the number of levels of the tree increases, allowing the total bandwidth across the network to grow proportionally to the number of processors. That shows how the network scales: when you add processors, you increase the number of tree levels, the network gets bigger, and hopefully it also gets more powerful as you add resources, so it can handle the congestion. Any questions?

Okay, now we get to one of the fanciest topologies: the hypercube. The hypercube has been used a lot in supercomputers; I don't think it is used for on-chip networks, because it is not easy to fabricate a hypercube on a 2D chip, but in a supercomputer you can do it. A hypercube is an n-dimensional cube, or n-cube. A 0-D hypercube has dimension zero and is just one node; a 1-D hypercube is essentially an array with only two nodes; with four nodes you can make a 2-D hypercube; this is the 3-D hypercube, this is the 4-D one, and you can imagine how to go to 5-D.
The good thing about the hypercube is that latency is quite low: O(log N). Why is that? Yes — three hops in 3-D, exactly. But how do you generalize to n dimensions; why is it n? The number of nodes is always 2^n: when we call the hypercube an n-cube, it means it connects 2^n nodes. So how many bits do you need to address the nodes? With 2^n nodes you need n bits to represent every node, and the longest path is from the all-zeros node to the all-ones node: in every hop you fix one of those bits from zero to one, so you need n hops, which is log2 of 2^n. Since we are talking about an n-cube, the latency is O(log N). The radix — the number of links connected to each router — is also O(log N), and the total number of links is O(N log N). This figure shows how a hypercube is connected; it is the 4-D case, I think. We actually used a question like this in one of the exams — but we explained it there, so you didn't need this lecture to understand it; we provided enough explanation in the exam text. And yes, that's right: when we say an n-D hypercube, it means it has 2^n nodes, so the N here, the number of processors, is really 2^n — good point. So the hypercube has low latency but is hard to lay out in 2D. This is the Cosmic Cube, used in one of the supercomputers back then, and you can see how they connect 64 nodes: they essentially need a six-dimensional hypercube. Any questions?

Okay, let's now also learn more about routing, again starting with some terminology. Routing mechanics basically involves arithmetic: simple arithmetic to determine routes in regular topologies. What is the arithmetic that tells you how to reach the destination? You need to address your source and destination nodes somehow, and there should be arithmetic that tells you how to get from one to the other. In a mesh it is easy: you address each node with an (X, Y) pair, and the arithmetic is simply the difference in X and the difference in Y; then you know how many hops to take in the X dimension and how many in the Y dimension. For the hypercube it is also easy: the address is a string of bits, the difference between source and destination is the set of differing bits, and you take hops to fix those differences one by one. But this can be very hard for some topologies: people propose a topology and then spend a lot of time designing a routing algorithm for it. So it is not always as easy as XY routing — dimension-order routing, which is easy for the mesh and the torus.
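Here is a minimal sketch of that routing arithmetic for both topologies; the coordinate and address encodings are my own assumptions for illustration:

```python
# "Routing arithmetic" for two regular topologies.
# Mesh: compare (x, y) coordinates; hypercube: compare address bits.
def mesh_hops(src, dst):
    # src, dst are (x, y) pairs; hop count under dimension-order routing
    (sx, sy), (dx, dy) = src, dst
    return abs(dx - sx) + abs(dy - sy)

def hypercube_hops(src, dst):
    # src, dst are integer node IDs; each differing bit costs one hop
    return bin(src ^ dst).count("1")

print(mesh_hops((0, 0), (2, 3)))      # 5 hops in a 2D mesh
print(hypercube_hops(0b000, 0b111))   # 3 hops: fix one bit per hop
```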
Source routing is another piece of terminology here: the source specifies the output port for each switch on the route, so you do the routing once, at the source router. You want to send a packet from this source to that destination; instead of running the routing algorithm in a distributed way at every hop, you compute the whole path at the source and record all the output ports the packet needs to take, for example in the packet's header. The packet then knows: at the second router take port two, at the next one port three, then four, then eight, and so on. This is simple and keeps your routers and switches simple, because you don't do the arithmetic at every hop, only at the source; but it makes your header large. Another way is table-lookup-based routing: every node has a lookup table, and you just access the table and it tells you which output port to take. This makes the header smaller but can make the switches more complex. So these are the different ways to compute a route: come up with arithmetic, compute the route at the source, or use table lookup.

In general, you can categorize routing algorithms into three types. Deterministic routing always chooses the same path for a communicating source-destination pair: there is no adaptivity, always exactly one path. Oblivious routing can choose different paths, but without considering network state: it is not intelligent, in the sense that it doesn't exploit the path diversity carefully; it just says, I have path diversity and I will pick randomly. Adaptive routing also chooses different paths, but it adapts to the state of the network. You can see that as we go from deterministic to adaptive, the intelligence of the routing decision increases. To adapt, you need some notion of congestion: adaptive routing always needs a way to learn about congestion in the network, and people have designed many mechanisms. Sometimes you only consider local congestion at your neighbor: you check the downstream router and ask how many free buffers it has — if it has enough, you conclude this path is free, or at least less congested. But that might not be the best strategy: sometimes the next hop is free but the hop after it is congested. That's why people try to combine local and global congestion, and sometimes they even design a separate network just to communicate congestion information — one more reason interconnection networks are not easy to implement. There is also the question of whether you take minimal or non-minimal paths, as one of your colleagues mentioned: adaptive routing can also benefit from non-minimal paths, as we discussed.
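A hedged sketch of the simplest local-congestion heuristic just described — the `free_buffers` dictionary stands in for whatever credit or occupancy signals a real router would have:

```python
import random

# Among the productive output ports, ask each downstream router how many
# free buffer slots it has and pick the least congested one.
def pick_output(productive_ports, free_buffers):
    best = max(free_buffers[p] for p in productive_ports)
    candidates = [p for p in productive_ports if free_buffers[p] == best]
    return random.choice(candidates)  # break ties randomly

print(pick_output(["E", "N"], {"E": 1, "N": 3, "S": 2, "W": 0}))  # -> "N"
```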
Okay, now let's look at deterministic routing in more detail. All packets between the same source and destination pair take the same path. The classic example is dimension-order routing in a mesh: you first traverse dimension X and then dimension Y — that's XY routing. In general, the mesh we showed is two-dimensional, but a mesh can also be 3-D or 4-D, so in general we call this dimension-order routing: with a three- or four-dimensional mesh you don't talk about X and Y; you resolve the differences in the dimensions one by one, in a fixed order — first fix the difference in dimension one, then dimension two, then three, then four. Of course you can also go the other way: YX routing is also dimension-order routing. The important thing is that your routing must not mix XY and YX at the same time; it has to commit to one of them. Is that clear? This is simple, and it gives deadlock freedom — I'm going to discuss deadlock soon; your routing algorithm should avoid deadlock, as we will see. However, it can lead to high contention, because you only ever use one path, and it does not exploit path diversity: even though the mesh or torus provides very good path diversity, this routing algorithm does not benefit from it.

So, deadlock. I think you know the concept in general: deadlock can happen whenever there are resources and requests for them, and you need to make sure it doesn't happen. Deadlock in an interconnection network means there is no forward progress, and it is caused by circular dependencies on resources: each packet waits for a buffer occupied by another packet downstream. You can see it in this example: packet one wants to take this path, but the buffer is not free because there is a packet there that also wants to move; that packet cannot move forward because another packet occupies the buffer it needs; that buffer is not free because of yet another packet; and finally packet four wants the buffer packet one holds. You have a circular dependency, and that causes deadlock. Some topologies are more susceptible to this than others. The mesh is not that susceptible: as long as your routing algorithm does not create cycles, you can make sure there is no deadlock, as we will see — XY, for example, is a deadlock-free routing algorithm. But in a topology like the torus, the topology itself contains cycles, so even XY routing can deadlock, and people have worked on avoiding deadlock there.

To handle deadlock, you can take different approaches. One is deadlock avoidance: you don't allow deadlock to begin with, so you avoid cycles in routing, meaning you use a routing algorithm that can never reach a deadlocked situation. Dimension-order routing cannot build a circular dependency in a 2D mesh, for example — though, as I said, it can in a torus; you can think about why as a homework exercise. The reason is that XY restricts the turns each packet can take.
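As a concrete instance, here is a minimal per-hop XY routing function; the direction names are just for readability — a real router would use port indices:

```python
# Per-hop XY (dimension-order) routing in a 2D mesh, the deterministic
# algorithm discussed above: resolve X completely, then Y.
def xy_route(cur, dst):
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return "E" if dx > cx else "W"   # still a difference in X: fix it first
    if cy != dy:
        return "N" if dy > cy else "S"   # X done: now fix Y
    return "EJECT"                        # arrived at destination

print(xy_route((1, 1), (3, 0)))  # 'E' -- X dimension first
```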
Maybe I'll actually use the blackboard here, because there is this turn model, which is quite interesting. But first, another approach: you can avoid deadlock by adding more buffering to create escape paths. For example, I said that in a torus or a ring you have this wraparound link that closes a cycle; if the wraparound always uses a different set of buffers — you add buffers, and whenever a packet takes the wraparound it moves to another buffer — then that traffic does not go through the same channel, and you effectively have two networks. These are virtual channels: not virtual buffers, virtual channels, because you are adding channels that are not physical, and they can avoid deadlock. Another way is to let deadlock happen, but detect it at some point and break it. This can actually be a good approach, because in many cases deadlock does not happen often — the likelihood of reaching a deadlocked situation is low — so people say: I'm not going to pay for deadlock avoidance; I'll let packets be transferred however they can, but I'll have a technique to detect whether a deadlock has formed, and if so, break it. Duato's algorithm is one of these; I think I have some slides on that. Any questions here? Yes — is deadlock more of an issue in packet switching? Yes, deadlock happens more in packet switching, because you need space in the downstream router in order to forward, and the downstream router's resources may be busy with packets that themselves want to forward to yet another router; if that chain of requirements becomes circular, at some point no one can make forward progress. In circuit switching, once you establish a path, you don't have deadlock; but the way you establish the path can itself deadlock, because to set up the circuit you send a header flit or something similar, and that setup can deadlock, so you have to be careful. Overall, though, people consider circuit switching deadlock-free in steady state: once the path is made, your data will reach the destination; only before establishing the path can you get into deadlock trouble. Very good question.

Okay, do I have time for the blackboard? I think so — I want to discuss this turn model technique. Let me stop sharing and mute the projector. Is it visible? From that side too? I hope so. In a mesh network — take this direction as north — for every packet you can classify the turns it can take as clockwise or counterclockwise. The clockwise turns are north-to-east, east-to-south, south-to-west, and west-to-north; the counterclockwise turns are north-to-west, west-to-south, south-to-east, and east-to-north. Now, if you let your routing take all of these turns, you can get circular dependencies, and one way to break them is to restrict some of the turns. So let's start with XY routing. With XY routing, we know the algorithm is deterministic,
and you always travel X first and then Y. Can you tell me which turns we are omitting from the clockwise set and which from the counterclockwise set with XY routing? Yes — exactly. And can you explain why? Right: in XY routing, from the clockwise set you only keep east-to-south and west-to-north, and from the counterclockwise set you only keep west-to-south and east-to-north; every turn from the Y dimension back into the X dimension is prohibited. But here we are omitting a lot: we allow only four turns out of eight, which is why XY is deterministic. We can instead make it partially adaptive — keep some path diversity — while still not allowing deadlock, and that is the idea of the turn model theory. For example, you could omit just one turn from each cycle, but you have to be careful — I'll show a picture when I project again — because combinations of the remaining turns can also form a cycle: if you allow three turns here and three turns there, combinations of them can still create circular dependencies, as we will see. So you cannot just randomly remove one turn from each set. One of the famous routing algorithms that is deadlock-free and considered partially adaptive is west-first. West-first says: if you need to go west, you must go west first, and only afterwards can you go north or south; if you are going east, you can be adaptive. In other words, in the west-first algorithm we omit the turns that turn into the west: north-to-west (from the counterclockwise cycle) and south-to-west (from the clockwise cycle) are not allowed. This is proven to be deadlock-free, and the performance of west-first is better than XY. The issue is that west-first is not balanced: traffic going west is deterministic, while traffic going east is adaptive. People have come up with many variants, like odd-even routing, which provide better adaptivity; you can look them up if you are interested. Any questions? Wasn't that interesting? I really like this topic — it was one of the first research topics I worked on, back in 2011; we worked on an extended turn model, extending this theory to larger settings.

Okay, hopefully I can share again — all good? Great. So I already explained the turn model: the idea is that we analyze the directions in which packets can turn in the network, determine the cycles that such turns can form, and then prohibit just enough turns to break all possible cycles. This is what the original ISCA paper proposes. Here are some examples: this one allows every turn; this one is XY routing, which we already discussed; and this one shows that if you cut turns randomly you can still get deadlock — even though we removed one turn from each cycle here, the remaining turns can combine into a cycle. And yes, to the question: it was proposed for n-dimensional mesh networks and tori, but mostly meshes — not only 2D meshes; it works for 3D and 4D as well.
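And here is a minimal sketch of the west-first rule itself, under the same illustrative coordinate conventions as before — west is deterministic, everything else returns the set of allowed (adaptive) directions:

```python
# West-first routing in a 2D mesh: if the destination lies to the west, go
# west first (deterministic); otherwise any productive direction is allowed
# (partially adaptive). Turns into the west are never taken later, which is
# what breaks the cycles and keeps the algorithm deadlock-free.
def west_first_options(cur, dst):
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx:
        return ["W"]                      # must finish all westward hops first
    options = []
    if dx > cx: options.append("E")       # eastward traffic can be adaptive
    if dy > cy: options.append("N")
    if dy < cy: options.append("S")
    return options or ["EJECT"]

print(west_first_options((2, 2), (0, 3)))  # ['W'] -- west first
print(west_first_options((2, 2), (4, 3)))  # ['E', 'N'] -- adaptive choice
```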
Okay, so oblivious routing; one well-known version is Valiant's algorithm. The goal is to balance the network load, but without imposing a lot of overhead, and the idea is quite interesting: you randomly choose an intermediate destination. You have a source and a destination; you randomly select an intermediate node, send your packet to that intermediate node — which may eject it or hold it in buffers outside the main network — and then the intermediate node sends the packet on to the real destination. With Valiant routing, each of the two phases can use deterministic routing — source to intermediate node deterministically, and intermediate node to destination deterministically — yet overall you still balance the load, because the intermediate node is random. So it randomizes and balances the network load, and it is non-minimal: depending on which intermediate node is selected, the path can become non-minimal. There are also optimizations: apply this only under high load, when conflicts and blocking are the major latency bottleneck, and restrict the intermediate node to be close by, for example within the same quadrant. These are some papers from Valiant that you can check.
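A small sketch of the two-phase idea, assuming XY routing for both phases on a square mesh (`xy_path` is a helper written just for this illustration):

```python
import random

# Valiant's algorithm on a 2D mesh: pick a random intermediate node, route
# deterministically source -> intermediate, then intermediate -> destination.
# Load gets randomized at the cost of non-minimal paths.
def xy_path(src, dst):
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def valiant_path(src, dst, mesh_dim):
    mid = (random.randrange(mesh_dim), random.randrange(mesh_dim))
    return xy_path(src, mid) + xy_path(mid, dst)[1:]

print(valiant_path((0, 0), (3, 3), mesh_dim=4))
```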
Now, adaptive routing. As we said, it tries to adapt: the route is chosen based on the status of the network. It can be minimal adaptive: the router uses network state — for example, downstream buffer occupancy — to pick which productive output port to send a packet to; many other metrics work too, like how busy the crossbar in the downstream router is. It can also be non-minimal, fully adaptive: you allow misrouting packets to non-productive output ports based on network state. One topic you will see next week, when we design a bufferless NoC, is exactly this: when your network is bufferless, packets cannot wait — there are no buffers to wait in — so how do you handle conflicts? There is this idea of routing called hot-potato: it is just like a hot potato — you receive it and pass it to someone else; you don't care much to whom, you just cannot hold it. You worked a bit on hot-potato, bufferless NoCs in your master's, right? Yes — I remember we discussed it. So hot-potato is the routing style used in bufferless designs, because you cannot wait: the packet is a hot potato, you must pass it on, hopefully toward the destination, but sometimes you misroute it.

Minimal adaptive routing should be aware of local and global congestion, but minimality restricts the achievable link utilization. As you can imagine, non-minimal adaptive routing can achieve better network utilization and load balance, but now you need to guarantee livelock freedom as well. Adaptive routing in general has the deadlock problem we discussed: either avoid deadlock, or let it happen and detect it — and the effectiveness of the latter depends on how accurately you can detect deadlock; sometimes your detector is conservative, sometimes not. With non-minimal fully adaptive routing, in addition to deadlock you may also get livelock: packets just keep moving through the network without ever reaching the destination. This can easily happen with hot-potato routing — hopefully at some point your potato is no longer that hot and you can eject it; I'm half kidding, but you can see the analogy: when a packet finally arrives at the destination, it gets ejected.

Adaptive routing is also very good for reliability, because you can avoid faulty links and routers: the idea is that we can route around faults. Your network may be faulty — sometimes a router, sometimes a link — so there should be a technique to identify the faulty components, and then an adaptive routing algorithm gives you the possibility to bypass them and route around the faults to reach the destination. With deterministic routing this cannot really happen, because the route is fixed: you would need to change the routing tables to disable the faulty routes, assuming the faulty link or router has been detected. You can check this paper from NOCS 2015 about using adaptive routing to tolerate faults.

Okay, before starting buffering and flow control, I'd like to give another break: five minutes, and then we start again at 3:25.

Okay, so let's quickly discuss buffering and flow control in networks-on-chip. We already covered circuit switching and packet switching a little: both have pros and cons, and that's why people have tried to use both at once in heterogeneous designs, since each is better for some specific traffic patterns; this is an example we discussed. Now let's look at a packet-switched network in more detail. We have this packet format: a header, where the routing and control information is stored, and a payload, which carries the data — the non-hardware-specific information. For example, for a write request to a cache, the payload can be the data of the cache block you want to write. The packet can be further divided — framing, protocol stacks, splitting into flits — and you also add an error code so the destination can check that the data arrived without errors; this error code usually goes in the tail of the packet. So that's the packet format: people usually say header, body (the payload can be several body flits), and tail, and the tail can itself carry body data; there are many ways to implement it.

As a related point, we have contention: two packets try to use the same link at the same time. You can see the packet from the north wants to go west, and the packet from the south also wants to go west — two packets competing for the same link in the same cycle. So what do we do? We can buffer one of them, or we can drop one of them —
which usually doesn't happen in on-chip networks. In some networks it can happen — packets are dropped and the communication must be redone — but normally in on-chip or interconnection networks we don't drop packets: we buffer, or we misroute one using deflection, like the hot-potato idea we already discussed. Is there a trade-off between these options? Of course there is. Buffering is good but expensive: if contention is infrequent, you may prefer to misroute, or even to drop — though when you drop, you must inform the source so it knows the packet was dropped and resends the message. If contention is quite probable, you probably want buffers.

So there are several flow control methods: circuit switching; bufferless, packet- or flit-based, which you will learn a lot about next week; store-and-forward, a form of packet-based switching; virtual cut-through, another variant of packet-based switching; and wormhole, yet another packet-based scheme, with the difference that flow control operates at flit granularity — a flit being a piece of a packet: header flit, body flit, tail flit.

Let's quickly revisit circuit switching. The resource allocation granularity is large: the idea is to pre-allocate resources across multiple switches for a given flow. You send a probe — call it a header, a setup request — to establish the path for pre-allocation. After that there is no need for buffering: once your path is established, you don't buffer. That said, in practice you still add buffers, because you don't want one long combinational path all the way to the destination — a fully combinational circuit is hard to build and would lower your frequency — so people insert registers along the way to break up the combinational path. But you don't need buffers in the sense of scheduling buffers or arbitrating for buffer access; those registers are just pipeline storage. The benefits of circuit switching are that there is no contention once the path is established, and that it can handle arbitrary message sizes, which is again very good. In packet switching we are bound by the packet size: usually there is a fixed or maximum packet size, because you design your resources around it, and if your message is bigger you must split it into several packets. With circuit switching you don't worry about that: you establish a path and send your message, whether it is 2 megabytes or 1 kilobyte, straight from source to destination. However, it has potentially lower link utilization — two flows cannot share a link — and of course there is handshake overhead to set up the circuit. You can see that circuit switching comes from an analogy with the telephone network of the past. Does anyone know how the old telephone network was built? There were a lot of switches, and from the source you controlled all the switches in between so as to obtain a path from source to destination; once
you established the path, the destination phone would ring, you could talk, and there was an actual circuit from source to destination. It is not like that nowadays, of course, but in the past it was, and circuit switching is very similar to that.

Now, bufferless deflection routing — I'll give a very brief overview here. The key idea is that packets are never buffered in the network: when two packets contend for the same link, one of them is deflected. This is the destination, these two packets arrive here, and one of them is deflected. New traffic can be injected whenever there is a free output link. In bufferless deflection routing, the input buffers are eliminated, and packets are buffered only in pipeline latches and on the network links. Yes — what if two packets contend? Yes, you can deflect in that case too; it needs some support, but it is possible. So instead of all those input buffers, you get rid of them, and you need deflection routing logic so that packets still reach the destination without buffers; you can see the details in this paper, and we will discuss it more next week. The issues in bufferless deflection routing are livelock — which is not specific to bufferless designs; any routing algorithm that allows non-minimal routes can run into it — some added router complexity for handling these cases, and performance and congestion at high loads: once your network is congested, you have to deflect a lot of packets, and that causes problems. Check the paper for more detail. We actually improved this bufferless NoC in several papers, and at some point we came up with a hybrid solution: you keep the bufferless design but add just a little buffering, which fixes a lot of the issues. You don't need to be extreme — one extreme is many buffers, the other is no buffers at all; sometimes something in between, just a few buffers, provides very good performance under high load. You can check the papers for more information; this one also discusses bufferless routing in hierarchical rings, and we also wrote a book chapter on bufferless and minimally-buffered deflection routing that you can read as a summary.
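Here is a hedged sketch of the deflection step in a single bufferless router; the oldest-first priority and the field names are assumptions made for illustration, not the exact policy of any particular design:

```python
# Deflection in a bufferless router: every incoming packet must leave this
# cycle, so rank packets (here: oldest-first) and give each its productive
# port if still free, otherwise any remaining free port (a deflection).
def deflection_route(packets, ports=("N", "E", "S", "W")):
    free = set(ports)
    assignment = {}
    for pkt in sorted(packets, key=lambda p: -p["age"]):   # oldest first
        want = pkt["preferred"]
        out = want if want in free else next(iter(free))   # deflect if taken
        assignment[pkt["id"]] = out
        free.discard(out)
    return assignment

pkts = [{"id": 1, "age": 9, "preferred": "E"},
        {"id": 2, "age": 4, "preferred": "E"}]
print(deflection_route(pkts))  # packet 1 gets E, packet 2 is deflected
```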
Okay, another flow control scheme is store-and-forward. This is packet-based flow control: you send your packet completely to the next node, that node stores the entire packet, and only then forwards it — you will see it in this animation in a moment. This leads to high per-packet latency, because you do not pipeline the packet's transmission: the packet must arrive completely at the next node before it can be sent onward. It also requires buffering for the entire packet at each node, which can be expensive: if your packets are large, you need enough buffers per packet per node to hold a whole packet. Here is the example: you send the packet completely to this intermediate node, it stores it, and then it forwards it — store, then forward.

So the question is: can we do better? Of course we can — we can try to pipeline it, and that is cut-through flow control, another form of packet-based flow control. We can start forwarding as soon as the header is received and the resources — buffer and channel — are allocated. You send the header; when it arrives at the intermediate node, that node computes the routing and allocates the channel and buffers; once that is done, the header moves on, and with luck you get a nice pipeline, which heavily reduces latency. People have devised a lot of optimizations for cut-through using speculative and lookahead techniques: lookahead routing, for example, where while the header is being transmitted to this node, we already compute the route for the next hop, one step ahead; and speculative channel allocation, where you speculatively assume you will get the allocation slot. These can give much better best-case latency, though sometimes the speculation fails and you have to wait.

The issue with cut-through flow control is that when the output port is blocked, the whole rest of the packet must be buffered, so cut-through still needs enough buffer slots to hold the entire packet — a buffer large enough for the largest packet. That was a real problem, especially in the past, when building large buffers was not as easy as it is today. So people came up with another flow control scheme, wormhole, which was quite important and useful back then. Nowadays we actually use a combination: wormhole switching and virtual cut-through are more or less the same in practice, because we usually have enough buffers — unless you want to go bufferless, which is another story. In wormhole, the difference is that flow control is based on flits: flits are sent across the fabric in wormhole fashion, like a worm — the body follows the head, the tail follows the body, all pipelined. If the head is blocked, the rest of the packet stops, and the packet stalls in place across all the routers it currently spans. That is why wormhole with very few buffers is not great: once the head blocks, the packet ends up occupying many routers, which is bad. The good part is that wormhole removes the requirement of buffers sized for the maximum packet — there can be some very large packets. So people say: let's do wormhole switching, but also provide a reasonable amount of buffering — buffers large enough for average-size packets — and when the occasional very large packet shows up, we don't need physical buffers for all of it. Wormhole is very good at breaking that restriction.
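A back-of-the-envelope comparison of the two latency models just discussed, assuming one flit per link per cycle and a fixed per-hop router delay `t_r` (illustrative numbers only):

```python
# Store-and-forward vs wormhole / virtual cut-through latency for a packet
# of L flits over H hops.
def store_and_forward_latency(H, L, t_r=1):
    return H * (t_r + L)        # the whole packet is received before forwarding

def wormhole_latency(H, L, t_r=1):
    return H * t_r + L          # per-hop cost paid once; the body is pipelined

H, L = 8, 16
print(store_and_forward_latency(H, L))  # 136 cycles
print(wormhole_latency(H, L))           # 24 cycles: nearly independent of H for large L
```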
So now the question is: how do the body and tail flits know where to go? They simply follow the head, which means we need to store that state in the router. And latency becomes almost independent of distance for long messages. Why? Because you build a pipeline: you spend some time for the header flit to reach the destination — the header has to take all the hops one by one — but once the header arrives, you receive another flit every cycle. That's why, especially for long, large packets, the hop count is not really an important component of the packet's latency: it only determines the latency of the header flit; the other flits are pipelined and arrive one per cycle. Is that clear? Good.

Of course, wormhole has advantages over store-and-forward flow control: lower latency and more efficient buffer utilization. But it has a limitation: it suffers from head-of-line blocking. If the head flit cannot move due to contention, the flits behind it cannot proceed either, even though the links they want may be idle. Here is an example — let me remember; yes: you have two input queues. The flit at the head of this queue wants to go to output 1, and the head of the other queue also wants output 1, so the two must be arbitrated; this one gets the grant and goes to output 1, and the other has to wait. Now, the next flit in the waiting queue wants to go to output 2, but it cannot proceed — that is head-of-line blocking: because the queue is a FIFO, since the first flit is blocked, the second is blocked too. You can also see head-of-line blocking in this animation.

How did people handle this? They came up with virtual channels. The idea of virtual-channel flow control is to multiplex multiple channels over one physical channel and divide the input buffer into multiple buffers sharing that single physical channel — this is virtual-channel flow control (people often discuss it together with virtual cut-through). So these are your buffers: each input buffer is split into several parallel buffers, and we call those virtual channels. Previously you had one FIFO buffer per channel; now you break that FIFO into several parallel virtual channels. Of course, you now need to arbitrate among all these virtual channels, which makes your router more complex, but it can eliminate head-of-line blocking to a large extent, because you can allocate virtual channels to different flows. Here is the example: the blue flow wants to go down but cannot because of blocking; the red flow, since we put it on a different virtual channel, can proceed. That is the benefit of virtual channels.
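A toy demonstration of why splitting one FIFO into virtual channels removes head-of-line blocking — the queue contents and output names are made up for this example:

```python
from collections import deque

# With one FIFO per input, a blocked head flit stalls everything behind it;
# with the same storage split into two VCs, the second flow still progresses.
single_fifo = deque([("blue", "out1"), ("red", "out2")])
blocked_outputs = {"out1"}

# One FIFO: red is stuck behind blue even though out2 is idle.
head_flow, head_out = single_fifo[0]
print("single FIFO: head proceeds?", head_out not in blocked_outputs)  # False

# Two VCs sharing the physical channel: each head is considered separately.
vcs = {"vc0": deque([("blue", "out1")]), "vc1": deque([("red", "out2")])}
for name, q in vcs.items():
    flow, out = q[0]
    print(name, flow, "proceeds?", out not in blocked_outputs)  # red can go
```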
And this is a modern virtual-channel-based router: every input channel is demultiplexed across the virtual channels, and then a multiplexer selects one of the virtual channels to go through the crossbar — a virtual-channel-based router for a 2D mesh, essentially.

Virtual channels are good for avoiding head-of-line blocking, but they have other uses too. One of them is deadlock avoidance, which I already mentioned: you can enforce switching to a different set of virtual channels on certain turns, which breaks the cyclic dependency on resources. For example, when you take the wraparound link in a torus or ring, you switch to another virtual channel. Or you can use diminishing sets of channels: at the first hop all virtual channels are available; after one hop, one virtual channel is no longer available to you; after the second hop, another one; and so on, reducing the set of usable virtual channels as you move through the network. If you want an analogy: you are no longer drawing a circle, you are drawing a spiral that ends at the destination — it is very similar to that.

You can also allocate some virtual channels as escape channels, which was proposed in Duato's algorithm, if I'm not mistaken: you let all packets route fully adaptively over all paths, but once you detect a deadlock, you have some escape virtual channels, and those escape channels use deterministic routing — you put the packet into an escape VC and say: this packet now travels on the escape VCs and will be routed to the destination deterministically. Duato is one of the founders of this deadlock-freedom theory, I would say, and this is one of the algorithms he proposed for providing adaptive routing without a lot of cost.

Another very important topic is protocol-level deadlock. We discussed cache coherence protocols: your network may be deadlock-free, but once you run a cache coherence protocol on top of it, you can deadlock at the protocol level. I don't want to go into detail — that would take a lot of time — but people use different virtual channels for the different message classes; that's exactly why Tilera has separate networks for requests and replies. You can likewise use different sets of virtual channels for acknowledgements and for coherence requests; people have, for example, built solutions using three virtual channels to make certain coherence protocols work.

You can also use virtual channels for prioritization of traffic classes: some virtual channels get higher priority, and if one flow is more important, you make sure it uses only those prioritized virtual channels — meaning that your arbitration and allocation policies favor those channels and give them more slots. Okay, so this slide reviews flow control: store-and-forward, then cut-through and wormhole, and the head-of-line blocking problem that we avoid with virtual channels.

Another issue here is that we need to communicate buffer availability between routers.
We can have different techniques for that. One of them is credit-based flow control: the upstream router knows how many buffers are free downstream, because the downstream router passes credits back upstream. This requires significant upstream signaling, which is especially costly for small flits — you must keep communicating between upstream and downstream so the upstream always knows how many buffers are available. Another scheme is on/off flow control: the downstream router drives an on/off signal to the upstream router that simply says, back off, don't send any more. And there is also ack/nack flow control: the upstream router optimistically sends data downstream, and a buffer cannot be deallocated until an acknowledgement is received, so the downstream router must send an ack or nack back upstream.

Here is a credit-based example with two nodes on a timeline. There are some flits in node 2; these flits depart from that router, and after some time node 2 realizes it has freed buffers, so node 2 communicates the new credit to node 1. This takes some time — the credit delay — and then node 1 processes the credit and concludes: node 2 now has enough credit, so I can send my flit. After node 1 sends that flit, node 1 in turn realizes it has freed space, so it sends a credit to its own upstream router, and so on. The whole cycle is called the credit round-trip delay: the time between when a buffer empties and when the next flit can occupy that buffer entry. With few buffers this causes significant throughput degradation, so it is important to size buffers to tolerate the credit turnaround — that's why I said that even with wormhole switching, people try to have enough buffers so that credit-based flow control runs smoothly with good performance.

The other option was on/off flow control: the downstream router drives an on/off signal to the upstream router. Node 1 is sending flits; at some point node 2 reaches the off threshold, so node 2 sends the off signal to node 1, and node 1 stops. Node 2 has to send the signal proactively, because it takes time to arrive: node 2 predicts that it will soon run out of capacity to store flits and sends the off signal early; in the meantime node 1 keeps sending until it observes the off signal and stops. Then node 2 drains some flits, and after a while it reaches the other threshold, the on threshold, sends the on signal, and node 1 processes it and resumes sending. Is that clear? Okay, good.
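Here is a minimal sketch of the upstream side of credit-based flow control as just described; the class and method names are invented for illustration:

```python
# One credit counter per output, initialized to the downstream buffer depth:
# sending a flit costs a credit; a returned credit (which arrives one credit
# round-trip later, after the downstream buffer drains) restores one.
class CreditCounter:
    def __init__(self, downstream_buffers):
        self.credits = downstream_buffers

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        assert self.can_send(), "must stall: downstream is full"
        self.credits -= 1

    def credit_returned(self):
        self.credits += 1

cc = CreditCounter(downstream_buffers=2)
cc.send_flit(); cc.send_flit()
print(cc.can_send())   # False -> stall until a credit comes back
cc.credit_returned()
print(cc.can_send())   # True
```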
I think we have enough time to cover the last part, and then we will conclude for today. Let's quickly discuss interconnection network performance. There is a very interesting curve that shows up in any latency-versus-load system: on the x-axis you have the injection rate into the network — the amount of load, how many packets you are injecting — and on the y-axis the latency, meaning the average network latency or average packet latency.

There are many parameters that define this curve. We have a minimum latency given by the topology: every topology has a minimum latency, as we discussed — the diameter, the average distance, and so on all depend on the topology. Then we have a minimum latency given by the routing algorithm: your routing algorithm can add inefficiencies on top of the topology. Topology plus routing plus flow control together define a latency we call the zero-load latency: if there are no conflicts and no blocking, this is the minimum latency of your network. On the other side, you have a throughput given by the topology: each topology has a maximum throughput it can theoretically sustain; your routing algorithm may not be intelligent enough and can reduce that achievable throughput, and the flow control can reduce it further. The maximum you can actually reach is the saturation throughput: the injection rate at which the latency goes asymptotic. This curve is quite exciting — we always look at it when analyzing the performance of a network or interconnection network.

For the ideal latency you can check this formula, and it actually makes sense: it is due solely to the wire delay between source and destination, T_ideal = D/v + L/b, where D is the Manhattan distance — the distance between two points measured along axes at right angles, named after the Manhattan street grid — v is the propagation velocity, i.e., how fast information travels on the wire, L is the packet size, and b is the channel bandwidth. But the actual latency is different, because practical wiring means long wires are segmented by inserting routers, and we have to go through all those routers. So in addition to the ideal terms, you must also account for the number of hops H times the per-router latency T_router, plus T_c, the latency due to contention: T = H * T_router + D/v + L/b + T_c. Note that we pay the H * T_router term only once for the whole packet, because with wormhole or virtual cut-through we pipeline the flits.
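Restating the two latency expressions in one place — this is a reconstruction from the spoken description above, with D the Manhattan distance, v the propagation velocity, L the packet length, b the channel bandwidth, H the hop count, and T_c the contention term:

```latex
T_{\text{ideal}} = \frac{D}{v} + \frac{L}{b}
\qquad\qquad
T_{\text{actual}} = H \cdot T_{\text{router}} + \frac{D}{v} + \frac{L}{b} + T_c
```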
If you run the experiment, you will see the ideal latency as the lower curve and the actual latency as this higher one, because of the routers you pass through even at zero load; then, as you increase the load, the latency grows higher and higher, and at some point you reach the saturation point. The x-axis here is injected load, and people use different metrics for it: one is the number of packets per node per cycle — how many packets you inject into the network — and another is injected load as a fraction of capacity: if we know the total capacity of the network and call it one, this shows what fraction of that capacity we are using. Here you can see we saturate at around 70% of capacity, which is quite good; there are networks that, for some specific traffic patterns, saturate very early, at 30% of capacity or even earlier.

Here are some examples of load-latency curves with different traffic patterns. These are synthetic traffic patterns that people have come up with; they use these different synthetic patterns to roughly model the actual traffic we see in real workloads, though some of them are quite theoretical and may not be that meaningful in real examples. You can see that the behavior depends on the traffic pattern, the load, and the topology. Here we actually consider different topologies: a 256-node mesh, and CMesh, the concentrated mesh — it is a mesh, but four nodes share one router, so four processors are connected to one router if the concentration factor C is four, as in this example, and those concentrated nodes are connected in a mesh; we call it CMesh, here with C = 4, though it could be C = 8, C = 16, and so on. So with different topologies and different traffic patterns we get different saturation points; you can check all of this in this paper, which presents many of these topologies and routing algorithms along with their load-latency curves.

And once you have a better network design — this is MECS, which is actually quite an effective topology, I would say, because it is similar to CMesh with the difference that you add some express channels: channels that reduce the diameter of the network. The difference is that these express channels are not as expensive as in the flattened butterfly, where they are entirely separate physical channels; here you reuse the channels — you can see they are segmented across these rows. You can check the paper for the details. With a better design of the network, the routing algorithm, and the flow control all together, you can hopefully get to a scalable NoC that can serve a system with many cores and provide a scalable — and ideally guaranteed — service.

So, these are some good readings, and these are some performance metrics for networks. For packet latency, people usually report the average and maximum, and you can also report the tail latency. There is also round-trip latency: how long the whole exchange takes when you have a request and a reply. You can also report the saturation throughput, which is important especially when you want to see how your network behaves under high-contention loads. At the same time, there are performance metrics at the application level, like execution time — I already explained why it is important to report execution time — or IPC, system performance, and sometimes job throughput, all of which are affected by interference among threads and applications. Any questions? Okay, I think that concludes today's lecture; we are exactly at four, so if there are no burning questions, have a nice weekend and see you all next week.