Transcript for:
LLM Inference Efficiency Techniques

So I'll be taking over for the next three papers. Essentially, what we're discussing today is how we speed up inference and the techniques surrounding it, nothing really tied to a particular model, and that will be a common trend across the papers I'll be going through. As a starter, at least for the three papers I'm covering: in the last weeks, when we looked at improving LLM efficiency, we were mostly looking at how to speed up the training side, not really at how to deploy these models at scale. In the last lecture we touched on what a good architecture for deploying these models at scale looks like, but that falls under the first three categories; what we'll be looking at today is the last four. We are primarily looking at reducing latency and improving throughput, and that's the common theme across all the papers. Some of them target one of those, some try to improve both, and we'll be covering speculative decoding, vLLM, and related techniques as the solutions for doing this.

For the first paper: we want to efficiently handle the LLM requests that are coming in. LLM applications have been on the rise recently. Let's say you have a Copilot instance running; every time you make a change to the code, there's an API call going to the Copilot server and it responds to you. So when there are millions of requests coming in, how do we handle them efficiently and make use of the infrastructure we have at hand?

Let's look at the regular inference procedure. We already had a look at QKV attention, but let's see what actually happens during the inference stage. Say we have an input token coming in; we convert it into an embedding, say x, and multiply it with the weight matrices we have learned, W_Q, W_K, and W_V, the query, key, and value weights, to compute q, k, and v. Each weight is a matrix, and x is the embedding vector of the incoming token. What we are primarily interested in, at least for the first paper, is K and V. We compute attention as softmax(q K^T / sqrt(d)) multiplied with the value matrix V. If we do this computation the naive way, we end up recomputing every element of the q K^T matrix. To avoid recomputing every element as new query tokens arrive for a sequence, we implement something called the KV cache.

How does the KV cache work? Say I have the sentence "I am Noble", and we need to run this computation for each incoming token. As we do that, we keep building a KV cache: for each token we compute its contribution to the K matrix and the V matrix and store it in memory, a particular column of the K matrix and a particular row of the V matrix. As we do this over time, these matrices keep growing in size.
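Here is a minimal single-head sketch of that incremental decoding loop with a KV cache, in PyTorch. The dimension, the random weight matrices, and the random token embeddings are all stand-ins for illustration; the point is only that each step appends one new key and value to the cache and attends the new query against everything cached so far.

```python
import torch

torch.manual_seed(0)
d = 64                                                   # embedding dimension (illustrative)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))    # stand-ins for the learned weights

k_cache = torch.empty(0, d)                              # one row per past token
v_cache = torch.empty(0, d)

def decode_step(x):
    """x: (d,) embedding of the newest token; returns the attention output for it."""
    global k_cache, v_cache
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # Only the new token's K and V are computed; everything older is reused from the cache.
    k_cache = torch.cat([k_cache, k[None, :]])
    v_cache = torch.cat([v_cache, v[None, :]])
    scores = k_cache @ q / d ** 0.5                      # new query against all cached keys
    return torch.softmax(scores, dim=0) @ v_cache

for token_embedding in torch.randn(3, d):                # e.g. "I", "am", "Noble"
    out = decode_step(token_embedding)
print(out.shape)                                         # torch.Size([64])
```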
So instead of recomputing this entire matrix, or redoing that computation, what we can do is keep a K cache for all the previous tokens: all the columns I have computed so far are saved in memory, and when a new token comes in I just add its column to the cache, and that is the only new computation I do. The same goes for the value matrix: everything that has come so far is saved into the V cache, and the new computation goes into a new row of the V cache. As you can see, for long sequences our K and V matrices keep growing. In the naive implementation we need contiguous chunks of memory for this, because of the attention operations that run over it, and the first paper looks at how we can break this down and manage memory better.

So, as a high-level overview, what does the KV cache contain? Let's say this is my prompt, and this is the current snippet: the model is generating the response and has currently stopped at "I am". At this particular snapshot we have four tokens, but suppose this is a model with a 2048-token context length. (If you have worked with the OpenAI API or any inference API, you can also specify a max_tokens parameter for how much you want to generate, so there is an issue of variable sequence length to keep in mind, but let's assume a 2048-token sequence length.) Essentially, I will be reserving memory for those 2048 tokens, shown by the reserved slots. The model can also choose to terminate early: if you gave me a request and I'm done with my response, I can emit the EOS token and I'm done with what I needed to do. So this is just a snapshot of what happens into the future, and this particular response ends at that EOS token. Towards the end, the green region is another prompt that has come in and that the model is serving. This is just a snippet of memory, and it illustrates the problem the authors are trying to solve.

As you can see, there are two issues, which they call internal fragmentation and external fragmentation. Because of the reserved tokens, a big chunk of memory gets allocated as soon as the request comes in, and because we do not know the output length, that is, how much memory the model is going to consume, we also end up with padding before the next request's tokens are stored. So we have external fragmentation because of the differing sequence lengths, and internal fragmentation because you don't know where the response will terminate. What the authors find is that with this naive allocation of memory, roughly 60% of it remains unused. That is the problem they are trying to solve.

Code-wise, if you want to turn the cache off and on, you can use the use_cache flag in Hugging Face Transformers, and if you try that on Colab with, say, a T4, it makes a performance difference of around 5x. So we do need the KV cache, but we need to do it efficiently, and the naive approach of reserving an entire memory block for the full sequence length doesn't work. We cannot keep pre-allocating this memory; there should be some better way to handle it.
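As an aside on the use_cache flag just mentioned, here is a small timing sketch, assuming Hugging Face Transformers with gpt2 as a stand-in model; the exact speedup depends on the hardware, model, and generation length, so treat the 5x figure above as anecdotal.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                   # stand-in; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tok("def fibonacci(n):", return_tensors="pt")

def timed_generate(use_cache):
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=128, do_sample=False,
                       use_cache=use_cache)           # False recomputes attention from scratch each step
    return time.perf_counter() - start

print("with KV cache:   ", timed_generate(True))
print("without KV cache:", timed_generate(False))
```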
And that is what the authors propose in this paper. Again, just to look at the scale of things: before the presentation I had a look at how much Code Llama consumes. We have two matrices to store per layer, K and V, and since for inference we usually run fp16, that's two bytes per value, and then we have the number of layers. If you look at the biggest Code Llama, the 70-billion-parameter model, that is approximately 140 GB just for the model weights, but what I'm trying to look at is how big the KV cache would be. The number of layers for the 70B model is 80, and its embedding dimension is 8,192, unlike the smaller models, which have much smaller embedding dimensions. Multiplying all of this, we have 2 (for K and V) times two bytes times 80 layers times 8,192, times the context length, and batch size also comes into the equation, but for inference the batch size essentially means how many requests you are processing at once rather than the training batch size. Even if we keep that at one, serving a single request at a time, it comes out to around 10 GB. So the KV cache, assuming full utilization of the context length, would be around 10 GB per request. And this is a model we actually deploy today; Code Llama is, I wouldn't say the best, but a good open-source model for some tasks.

Q: Is that 10 GB per batch, or in total? A: It is per request, in a sense. Q: Is that reused, or do you just keep adding to it? A: Once you're done you can obviously free it, but that is roughly the maximum a single request can consume.

And this becomes an even bigger issue if you consider the new Gemini models that just came out, because they boast something like a 10-million-token context length. You can see that we cannot allocate that much memory contiguously; we should only be allocating what is required at this point in time, so that we can process multiple requests as they arrive. If you blow past the available memory, it will simply crash and you cannot do anything about it. As far as the authors are concerned, they look at a 13-billion-parameter model, where the KV cache consumes around 30% of the available memory; I was just trying to make the point that it is an even bigger issue with current models if you do it naively. So we cannot have contiguous spans of memory allocated at such huge sizes, and memory sharing is something that cannot be done in the naive implementation; why we need that, we will come to in a later slide.

This sounds like something we study in OS, memory fragmentation and sharing, and that is essentially what the authors do. You can make a direct analogy: incoming requests play the role of processes; they define logical KV blocks, which are essentially the same idea as virtual memory; they define physical KV blocks, which are the representation in actual physical memory; and, similar to a page table, they also define a block table.
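Just to pin down the memory numbers we are trying to tame before getting into the block design, here is that back-of-the-envelope KV-cache calculation as a few lines of Python. The 4,096-token context is my assumption to reproduce the roughly 10 GB figure quoted above, and the calculation ignores grouped-query attention, which shrinks the real cache on the 70B model considerably.

```python
# KV cache size = 2 (K and V) * bytes per value * layers * hidden dim * context * batch
bytes_fp16  = 2
n_layers    = 80        # Code Llama / Llama 2 70B
hidden_dim  = 8192
context_len = 4096      # assumed fully-utilized context window (my assumption)
batch_size  = 1         # one in-flight request

kv_bytes = 2 * bytes_fp16 * n_layers * hidden_dim * context_len * batch_size
print(f"{kv_bytes / 1e9:.1f} GB per request")   # ~10.7 GB
```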
So what do the authors propose? Instead of allocating the entire contiguous block whose size we just computed, they break it down into smaller blocks, of size four in this particular example, although based on experiments we'll look at later it's better to use 16 or 32. If we have blocks of size 4 (abstracting away some implementation details, such as the fact that K and V are both stored for each of these KV blocks), then in the logical table everything is contiguous, but in the physical table you map, via the block table, to non-contiguous chunks of physical memory. As the model generates a new token, it has a block reserved, so it adds the token to that block while there is available memory. But say I want to add another token and I no longer have space in the two blocks I have reserved; then I request another block, that block gets allocated to me, and by the time this particular request completes, I have used two more blocks from that group of four. The block table is just what enables that mapping and keeps track of things as you need to expand. Obviously, if you scale this to multiple requests, you do the same thing: each request gets its own block table.

Something to note here is that this is a particular kind of example I picked: you have the same prompt but two different responses being generated, which is a kind of decoding you can do, parallel sampling. For example, if you want OpenAI to give you two choices of response (if you have interacted with the ChatGPT interface, with GPT-4 or even GPT-3.5, it will sometimes show you two responses and ask which one you think is better, and they use that to improve their model), that is actively used, and if you want to do that, the naive layout does not seem like a good way to go about it, because you are again wasting memory on the parts these requests have in common.

Before I jump into that: with this scheme we still have internal fragmentation, but it is limited, bounded by the block size, because that is all you reserved; any internal fragmentation that does happen is because you are not consuming the whole block. And there is no external fragmentation by this logic, because you only request more memory as it is required.

Now, how do we share KV blocks? Say we are doing parallel sampling and we want two responses for this particular prompt. What the authors propose is something like reference counts: for both of these sequences, the shared blocks carry a reference count of two, meaning two requests currently depend on them. Let's say at this point I choose to diverge, to have two different tracks of responses. What happens is that the reference count drops to one, and there is an operation called copy-on-write, which copies the block you are currently working on, and only that block gets duplicated; the "today we are learning" part still has a reference count of two, and then you continue in the normal flow.
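Here is a toy sketch of that bookkeeping: a block table per sequence, physical blocks handed out on demand, reference counts for sharing, and copy-on-write when a shared block diverges. This is an illustration of the idea under my own simplifications, not vLLM's actual data structures, and the real system would also copy the K/V contents when it copies a block.

```python
BLOCK_SIZE = 4                       # tokens per KV block (small for illustration)

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))   # ids of unused physical blocks
        self.ref_count = {}                            # physical block id -> #sequences using it

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block_table):
        """Share all blocks of a parent sequence with a child (parallel sampling)."""
        for b in block_table:
            self.ref_count[b] += 1
        return list(block_table)

    def append_slot(self, block_table, seq_len):
        """Make room for one more token; copy-on-write if the last block is shared."""
        if seq_len % BLOCK_SIZE == 0:                  # last block is full: grab a fresh one
            block_table.append(self.allocate())
        elif self.ref_count[block_table[-1]] > 1:      # last block shared: copy it first
            self.ref_count[block_table[-1]] -= 1
            block_table[-1] = self.allocate()          # (real system also copies the K/V data)
        return block_table

mgr = BlockManager(num_physical_blocks=8)
parent = [mgr.allocate()]                 # prompt partially fills one block
child = mgr.fork(parent)                  # second sampled response shares the prompt's block
mgr.append_slot(child, seq_len=2)         # diverging token triggers copy-on-write
print(parent, child, mgr.ref_count)       # e.g. [7] [6] {7: 1, 6: 1}
```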
Another case the authors demonstrate they can support is beam search, which is handled through the fork, append, and free operations that are part of their implementation. What exactly is beam search? If you are doing something like machine translation, say I am translating English to German, then as you proceed through the sequence the model may decide that a particular interpretation it had in mind is not really the best one and that it should have gone a different way. Beam search lets you maintain several candidate beams, and the issue compared to parallel sampling is that beams can be deleted and new beams can appear: beam one gets deleted, and the new beam one is the earlier beam zero plus a branch of it. This, again, is supported by this model of memory sharing.

Another thing that is required: there was a reason we had contiguous memory, namely to support the attention computations efficiently. So the authors also introduce PagedAttention, where they reimplement the kernels to handle the separate blocks. Essentially it is just a rewrite of the original equation: you compute it per block, where the K and V values you touch belong to that particular block, and then you assimilate the per-block results back into the overall attention output. You can see the attention computation happening for each block as I move through them, and at the end you have iterated over all the blocks and computed the attention output.

So what were the results? Orca is not publicly available, so the authors reimplement Orca and compare their serving technique against it, and they also compare against FasterTransformer, I believe in the next graph. As you can see, the vLLM implementation, which is what we have discussed so far, supports a much higher batch size; they are serving on the order of 30 requests in parallel.

Q: How do they implement this, at the Python level? A: No, it has to be at the CUDA level, because you are messing with HBM here. There are ways to define the block-mapping logic in the kernel. Q: Do they also use something else for the block table? A: I don't think they discuss that.

I'll be moving on to the next paper. Q: Sorry, I had a quick question. This mechanism is built on top of the attention mechanism, and the idea is that we can append subsequent tokens and keep the computation from earlier. But as you stack Transformer layers, the input to the next layer is the output from the previous layer, so the x from earlier positions is not necessarily the same x as you go deeper into the model, even as you keep generating. What happens to performance if we enforce a prior that a block is going to stay the same? A: The KV cache per se is not what we are modifying; the KV cache existed long before this work. We are essentially just breaking the KV cache down from contiguous blocks of memory into smaller pieces. Q: And all of this relies on the fact that it is not bidirectional, it is causal? A: Yes, because of the way generation works, and also because these models are decoder-only. That is why you can do this.
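To connect this back to the PagedAttention idea before moving on: here is a rough sketch of attention for one query computed over non-contiguous KV blocks through a block table. The real kernel does the softmax block by block with running rescaling, whereas this sketch gathers the per-block scores and takes one softmax at the end; the result is numerically the same, and all shapes and block-table contents are made up for illustration.

```python
import torch

torch.manual_seed(0)
d, block_size = 64, 4
# Physical KV storage: blocks live at arbitrary, non-contiguous indices.
physical_k = torch.randn(16, block_size, d)
physical_v = torch.randn(16, block_size, d)
block_table = [9, 2, 14]           # logical blocks 0..2 of one request
n_tokens = 10                      # last block is only partially filled

def paged_attention(q):
    scores, values = [], []
    for logical_idx, phys in enumerate(block_table):
        filled = min(block_size, n_tokens - logical_idx * block_size)  # valid tokens in block
        k = physical_k[phys, :filled]              # (filled, d)
        v = physical_v[phys, :filled]
        scores.append(k @ q / d ** 0.5)
        values.append(v)
    weights = torch.softmax(torch.cat(scores), dim=0)   # softmax over all blocks jointly
    return weights @ torch.cat(values)

q = torch.randn(d)
out = paged_attention(q)

# Sanity check against the contiguous ("naive") layout.
k_flat = torch.cat([physical_k[b] for b in block_table])[:n_tokens]
v_flat = torch.cat([physical_v[b] for b in block_table])[:n_tokens]
ref = torch.softmax(k_flat @ q / d ** 0.5, dim=0) @ v_flat
print(torch.allclose(out, ref, atol=1e-5))   # True
```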
Moving on to the next one. We already looked at FlashAttention, and the only reason I'm bringing it back up is that the next paper derives from some of the results in FlashAttention and tries to bring what FlashAttention did for training into the inference setting. The reason FlashAttention was able to improve things was, first, that they were able to break the softmax computation down into smaller chunks, and second, the key insight they brought: we are not really compute-bound, we are memory-bound, and the memory operations are what slow us down. The graph from the FlashAttention paper showed that instead of writing from HBM into SRAM and back again and again, if you create a fused kernel that does all of these operations together, you can significantly speed up the attention computation; and also that the matrix multiplication was not what consumed most of the time, if you look at the timing breakdown.

So what we are looking at now is Flash-Decoding. The animation is technically just showing FlashAttention and how it works, but the concern the authors are trying to address is that if you are doing plain inference and you don't have a large batch size, much of your GPU can remain unused. The way FlashAttention was implemented, you break K and V into blocks, but the parallelism comes from the batch and the blocks of queries, and you don't split the work across the key/value sequence, so a single long decode cannot fill the GPU. The modification the authors propose is simply to split the keys and values into smaller chunks and process them in parallel. In FlashAttention you already had scalar statistics computed alongside the usual computation, the alpha and beta rescaling factors in a sense, so that the partial results can be combined at the final step; the same thing happens here across splits, so you keep track of these values for each split and combine them in the last stage.

At a high level, that is the only change in Flash-Decoding: introduce splits so that you process everything in smaller chunks, keep track of one extra scalar per split, the log-sum-exp, and use it to reduce the partial results back into the value you would have obtained anyway. The reason for doing this is increasing context lengths: you can have a single request coming in that is huge, like the Gemini model I mentioned with its 10-million-token context. If you have such huge prompt lengths coming in, FlashAttention by itself does not do well, for the reason stated: you are not utilizing the GPU as efficiently as you could. And you can see that from the graphs: they try naive PyTorch primitives (this post comes from the PyTorch team itself), and those actually work better than FlashAttention in some of the cases, while with Flash-Decoding they are able to preserve performance even at significantly longer context lengths, just by breaking down this computation. A significant amount of the speedup still comes from the fused kernels, so they are leveraging something like the 50x speedup that comes from using FlashAttention as opposed to the naive attention implementation that shuttles intermediate results back and forth to memory. And it scales better for larger context lengths, which is the main point here.
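Here is a minimal sketch of that split-and-reduce idea in plain PyTorch: attention for one query is computed independently over chunks of the cached keys and values, each chunk also returns its log-sum-exp, and a final reduction reweights the partial outputs so the result matches full-sequence attention exactly. The sizes are arbitrary, and the real implementation runs the splits as parallel kernel launches rather than a Python loop.

```python
import torch

torch.manual_seed(0)
d, seq_len, n_splits = 64, 4096, 8
q = torch.randn(d)                      # single decode-time query
K = torch.randn(seq_len, d)             # cached keys / values for a long prompt
V = torch.randn(seq_len, d)

def partial_attention(k_chunk, v_chunk):
    scores = k_chunk @ q / d ** 0.5
    lse = torch.logsumexp(scores, dim=0)              # scalar kept for the final reduction
    return torch.softmax(scores, dim=0) @ v_chunk, lse

# Step 1: independent partial results per split (this is what fills the GPU).
partials = [partial_attention(k, v)
            for k, v in zip(K.chunk(n_splits), V.chunk(n_splits))]

# Step 2: reduction; each partial output is weighted by its share of the global softmax mass.
outs = torch.stack([o for o, _ in partials])          # (n_splits, d)
lses = torch.stack([l for _, l in partials])          # (n_splits,)
combined = (torch.softmax(lses, dim=0)[:, None] * outs).sum(dim=0)

reference = torch.softmax(K @ q / d ** 0.5, dim=0) @ V
print(torch.allclose(combined, reference, atol=1e-5))  # True
```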
Q: What do you mean when you say the GPU is underutilized? A: In FlashAttention, you can see this happening as a sequential operation over the entire run of KV blocks, which is fine for training, but when you are doing inference and you don't have enough batch size to fill your GPU, the GPU essentially ends up working through this sequentially, even though you have broken it into smaller operations. Q: So basically the for loop that goes through it? A: Yes, and Flash-Decoding breaks that into multiple chunks that can be processed at once, so even if you only have a single query coming in, you can handle it better. While training you can easily build a batch of 128, but there is a latency aspect here: you don't want to sit there batching up 20 incoming queries before responding to the first one. Q: Then in this case you have to change the algorithm a bit, right? Before, as you did your computation, you could put each result straight into the matrix, but now you're computing everything separately and aggregating it, so there is some reduction step? A: Yes, there is a final reduction step, and that scalar we track, the weight of each individual component, is what allows us to do that reduction. Q: So this is basically roughly the same as FlashAttention, just a different way of scheduling it? A: Yes; this is not a full paper, it's just a blog post. Q: And this is for processing very long inputs? A: The intention is to perform better on long context lengths and small batch sizes during inference. Q: But does it also do better if you're going to generate a long output, or is it specialized for processing a long input? A: It should work for both, but in the discussion they only evaluate long prompts, not long responses; I would assume the same thing applies to responses, since you're building the sequence up incrementally anyway. Q: Right, same question; I was wondering whether this only applies to long prompts or also to long responses, and it seems like it should be the same.

The last paper I'll be diving into is a continuation of what was presented earlier on speculative decoding. Essentially, what you are trying to do with speculative decoding is this: instead of every decoding step giving you exactly one token, you want scenarios where some decoding steps give you a few more tokens than just a single one, so that in effect you speed up your inference. To build up to this point: the Llama tokenizer has 32,000 tokens, so if you randomly guess what your next token should be, there is a 1-in-32,000 chance that you get it right. But in language, not all tokens are equally likely, and you can use that distribution to pick something better.
Another idea, which other papers have looked at, is to use the prompt itself to speed up inference: there will often be tokens from the prompt that repeat within your response, so you can speed things up based on that. That is something authors have explored, along with using n-grams as a lookup table, which we will come back to in this approach as well. Much of the work on speculative decoding has gone into using a smaller draft model, or helper model, but it hasn't been widely adopted because of the issues that were brought up earlier and the complexity of having to manage, and train, two different models. There was a follow-up paper called online speculative decoding, which essentially tries to improve your helper model over time so it performs better, and there is also Medusa, which looks at adding multiple extra decoding heads to do this in a way that is more easily adoptable; Medusa is a more applied approach than speculative decoding with draft models.

With that background we can look at lookahead decoding. This is just a graphic they give on their website showing how much faster inference is compared to the normal decoding process. The question we are trying to answer is: how do we take better guesses? We are essentially trying to improve the token acceptance rate. If your draft model spits out five tokens and all of them get discarded because the main model says "this is not what I want", that's wasted compute, so we want to improve the acceptance rate, and one of the techniques for that is Jacobi iteration.

The Jacobi method is something we would have gone over in school, if you remember it from linear algebra. Essentially, the authors observe that the core problem you are trying to solve is predicting the m tokens of the response sequence: you have an input x, a response y with m tokens, and you are decoding it autoregressively. Can we do that faster, can we compute all of the tokens in one go, or at least a specific set of them in one go? To do that, they rewrite autoregressive decoding: at each step you are asking which token should be output given this particular prefix, and you do that for every position as you build out the response. They define a function for this, roughly f(y_i, y_{1:i-1}, x) = y_i - argmax_y p(y | y_{1:i-1}, x), and setting f = 0 for every position i turns decoding into a system of nonlinear equations you are trying to solve. The Jacobi method, if you do remember your linear algebra, is applied to a system of linear equations, and that is actually an issue in the write-up of this work, because they link to the Wikipedia page for the ordinary Jacobi method, the one you study in school, which applies only to linear equations. As a refresher: once you rewrite the system so that each equation is expressed in terms of one variable, you take an initial guess for each variable and then, similar to Newton-Raphson or other iterative methods, you iteratively solve the system of equations.
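Here is a greedy-decoding sketch of that fixed-point view, assuming a Hugging Face causal LM (gpt2 as a stand-in): start from an arbitrary guess for the next m tokens, run prompt plus guess through the model in one forward pass, replace every guessed position with the model's own prediction for it, and repeat until nothing changes. This is the plain Jacobi iteration, without the n-gram pool that lookahead decoding adds on top.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                        # stand-in causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def jacobi_decode(prompt, m=8, max_iters=32):
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
    guess = torch.randint(0, model.config.vocab_size, (m,))   # arbitrary initial guess
    for _ in range(max_iters):
        logits = model(torch.cat([prompt_ids, guess])[None, :]).logits[0]
        # Guess position i is updated from the logits at the token right before it.
        p = len(prompt_ids)
        new_guess = logits[p - 1 : p - 1 + m].argmax(-1)
        if torch.equal(new_guess, guess):                  # fixed point: all m tokens are final
            break
        guess = new_guess
    return tok.decode(guess)

print(jacobi_decode("The capital of France is"))
```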
Once the method converges, you have all the tokens you need. But this is for linear systems, and the only reference I could find for doing this with nonlinear equations was one particular book, which essentially says that you can apply the linear method to the nonlinear system and it works the same way. One property of this computation is that with every step you solve at least one of the variables, so with m steps you can solve the entire system. That by itself is the same as decoding autoregressively: if you have to take m steps, it is the same as going sequentially, except that here you are solving a system of equations to get there. That is the difference with Jacobi iteration, and this is the algorithm from that book for parallelizing the nonlinear Jacobi iteration.

To see this visually, here is what happens if we do Jacobi iteration. You have your prompt, and then you take a guess at the other tokens; this is your initial guess in the Jacobi method, and you are essentially filling in random values for what you think the next tokens should be. The point again is that the memory operations are what take the time, so you want to use the GPU as much as possible; computation is not the concern in this particular case, and we are specifically looking at settings where you want to improve latency. There is an increased amount of computation going through the model, but that is not what you should be looking at here, because the concern is latency.

Now, there is no draft model in this case; it is your main model predicting everything. Your main model says what the next token after the prompt (say "Alan Turing") should be, and you have parallelized the operation by appending the few tokens you guessed, so you are asking the model to compute those positions in parallel as well: the model says if the previous token had been "who", the next one would be this; if it had been that, the next one would be something else; and so on for however many tokens you guessed forward (how many is a parameter we will set later). If you let it continue, you use the newly generated tokens to update the randomly selected initial guess. Now we are one step further: in the previous step we only got one accepted token and the rest of the computation was wasted, so there was no merit to it yet, except that we updated the initial guess; similar to the Jacobi iteration, we solved one variable, but the remaining variables are still essentially random and we haven't converged. The model then looks at all of these positions again and computes what the next token would be if the preceding guess were correct. We repeat the same step and continue, and in this particular example this step fails as well: you are essentially just rewriting the guess, and at every step only one new token gets accepted. But when you hit a scenario where the next token the main model predicts is the same as the one you had speculated, multiple tokens can be accepted at once, which was the objective we set out with: having at least some steps where more than one token gets accepted, with the rest being wasted computation.
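The acceptance rule being used here can be written out in a few lines. In the parallel forward pass, the model produces its own prediction for every guessed position; a prediction is trustworthy only while every guess it conditioned on matched, so you accept the longest matching prefix plus the model's next token at the first mismatch. The token strings below are made up purely for illustration.

```python
def accept_tokens(guessed, model_predicted):
    """guessed[i]: the speculated token at position i.
    model_predicted[i]: the model's prediction for position i from the same parallel
    forward pass (model_predicted[0] comes from the verified prefix, so it is always valid;
    model_predicted[i + 1] was conditioned on guessed[:i + 1])."""
    accepted = [model_predicted[0]]
    for i, g in enumerate(guessed):
        if g != model_predicted[i]:          # the guess this depended on was wrong: stop here
            break
        if i + 1 < len(model_predicted):
            accepted.append(model_predicted[i + 1])
    return accepted

# Three speculated tokens match, so one decoding step yields four tokens instead of one.
print(accept_tokens(guessed=["be", "or", "not", "xyz"],
                    model_predicted=["be", "or", "not", "to", "be"]))
# ['be', 'or', 'not', 'to']
```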
So in the walkthrough we kept reiterating, but on the last step we accepted two tokens. That is naively doing Jacobi iteration. What the authors propose is a bit more involved: instead of throwing away all of that wasted computation, what if you saved the trajectory the model was taking? That is, you save the n-grams: whenever the main model computes what the next token should be, you save that token together with the one it followed. In this case it is a 2-gram, which makes it very similar to speculative decoding; we will be dealing with larger n, but just for visualization you have 2-grams being created. And what you can do is, before discarding everything, have a look at the 2-grams you have stored and use them as your bet for what the next tokens are, and use that to continue forward in the iteration. What happens when there are conflicting candidates is on the next slide, but at least in this visualization there is only one case that matches.

Q: So basically this is like a cache? A: Yes, you have a cache, so you roughly know what the target model is likely to produce, and because we don't have a draft model, everything comes from the target model's own distribution. There are failure cases where the same 2-gram can match in two places; when that happens you need a verification step. And the actual implementation is more involved than this, because a 2-gram makes it essentially speculative decoding; the actual case is an n-gram. In this particular figure N is set to four, so you end up with 4-grams. The numbers in the figure are not token values; they just indicate the position of a token relative to the initial token in the prediction process. A valid n-gram in this case would be, say, the tokens at positions 1, 2, 3 plus whatever happens at the current step; we are on the last step of the Jacobi iteration, where the verification step happens, which is why the fourth token is not shown in the diagram yet, but essentially an n-gram is being created, and you also have to predict the next token based on what has happened.

There are two parameters in this whole process. The window size W controls how far ahead we look, which determines how many new tokens we generate at each step; here it is five, so W = 5. The n-gram size N controls how much history you retain; here we are constructing 4-grams, so N = 4. And if you have scenarios where there are multiple matching n-grams to choose from, you need to do verification, so there is an additional parameter G for how many candidates you verify; in all the experiments the authors set G equal to W, the window size.
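Here is a toy sketch of that n-gram cache, under my own simplifications: every window of N consecutive tokens seen along the lookahead trajectory is stored in a pool keyed by its first token, and at decoding time up to G candidate continuations whose key matches the current last token are pulled out and handed to the verification branch, which accepts the longest one the model agrees with (using the same acceptance rule as above). The token strings and parameter values are illustrative.

```python
from collections import defaultdict

N = 4          # n-gram size: 1 key token plus N - 1 speculated continuation tokens
G = 2          # maximum number of candidate n-grams sent to the verification branch

ngram_pool = defaultdict(list)            # first token -> list of (N-1)-token continuations

def collect_ngrams(trajectory):
    """Cache every window of N consecutive tokens seen along the lookahead branch."""
    for i in range(len(trajectory) - N + 1):
        key, continuation = trajectory[i], tuple(trajectory[i + 1 : i + N])
        if continuation not in ngram_pool[key]:
            ngram_pool[key].append(continuation)

def candidate_guesses(last_token):
    """Up to G speculated continuations that start from the current last token."""
    return ngram_pool[last_token][:G]

collect_ngrams(["we", "are", "learning", "parallel", "decoding", "today"])
collect_ngrams(["we", "are", "learning", "fast"])
print(candidate_guesses("we"))
# [('are', 'learning', 'parallel'), ('are', 'learning', 'fast')]
# Each candidate is then verified in parallel; the longest accepted one advances
# the sequence by several tokens in a single step.
```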
What this last figure shows is the attention mask. All it is saying is that, just as in causal attention, a token only attends to the tokens before it, so token 1 can only see itself and token 0, and the same holds at every step. The verification branch cannot look at anything in the lookahead branch, which is why that whole region is masked out, and within the verification branch each token again attends only to its previous tokens and the starting token, because you are verifying: you only have token 0 at this step, and everything else comes through the iterative decomposition. So this part is the verification branch, this part is the lookahead branch, and they are combined into one parallel step so it can be done efficiently. This is just a comparison of what the attention mask looks like next to Medusa: in Medusa you have multiple heads, so if the first head says the next token could be, say, "it" or "I" (taking only the top two), the next head continues from each of those and proposes, say, three tokens, and for "I" it again generates three tokens; that is what the attention mask looks like there, just to contrast it with this one.

I'm mostly wrapping up. The scaling law they discuss, based on the results in this paper, is that if you have a sufficiently large N and you exponentially increase the window size, you get a linear trend in step compression: you can linearly reduce the number of decoding steps required. That is a result they find experimentally. And what impact does this have? They applied it to the LLaMA-2-Chat models and observed a speedup of around 1.5x, with latency reductions on HumanEval as well as the math problem dataset, GSM8K. So it does reduce latency, which was the primary target of this whole exercise; wasted computation, again, is not a concern here, since you are just trying to utilize as much of the GPU as possible.

So, for my part: we saw that LLM inference is bottlenecked by memory constraints, and we saw how to build a more efficient KV cache, for which the authors also had to implement a kernel for blockwise attention, or PagedAttention as they call it. The next paper looked at Flash-Decoding, which brings the enhancements of FlashAttention v1 and v2 into the inference step. And the last paper, lookahead decoding, tries to improve speculative decoding while taking a different approach rather than using draft models, and it is aimed at latency-critical applications, since there is a lot of wasted computation happening there.