This video is sponsored by KiwiCo; more on them later.

In January 2025, the Chinese company DeepSeek shocked the world with the release of R1, a highly competitive language model that requires only a fraction of the compute of other leading models. Perhaps even more shocking is that, unlike most of its American counterparts, DeepSeek has publicly released the R1 model weights, inference code, and extensive technical reports, publishing an average of one report per month in 2024 and detailing many of the innovations that culminated in the release of R1 in early 2025.

Back in June of 2024, the DeepSeek team introduced a technique that they call multi-head latent attention. Unlike many DeepSeek innovations that occur at the margins of the stack, multi-head latent attention strikes at the core of the Transformer itself, the compute architecture that virtually all large language models share. This modification reduces the size of an important bottleneck called the key-value cache by a factor of 57, allowing the model to generate text more than six times faster than a traditional Transformer in DeepSeek's implementation. But how exactly was the DeepSeek team able to squeeze such a significant improvement out of such a broadly used architecture?

Like other modern language models, when you give DeepSeek a prompt, the model generates its response one fragment, known as a token, at a time. Mathematically, this autoregressive approach means that each new token the model generates is a function of all the tokens that came before it.

The interactions between tokens in large language models are handled by a mechanism called attention. Attention works by computing matrices called attention patterns. These are the 144 attention patterns computed by the GPT-2 Small model when given the example input text "the American flag is red, white and". This model uses 12 separate attention mechanisms, called attention heads, per layer and has 12 layers, making for 144 total attention patterns. DeepSeek R1 has 128 attention heads per layer and 61 layers, making for 7,808 total patterns. In both models, the size of the attention pattern is equal to the number of tokens passed into the model. Our example input, "the American flag is red, white and", maps to nine tokens, so all of our attention patterns are 9-by-9 matrices.

Attention patterns are used by attention heads to move information between token positions in the model's residual stream. For example, this first attention pattern in the third layer of GPT-2 has a high value mapping from the input token "American" to the output token "flag", meaning this attention head is likely applying the modifier "American" to the noun "flag", creating a unified representation for the concept "American flag". This eighth attention pattern in the 11th layer has high values mapping the words "flag", "red", and "white" to the output position of the final token "and": this attention head has pulled out words in our input that are relevant for predicting the correct next token, "blue", which GPT-2 Small does correctly predict.
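To make that autoregressive, one-token-at-a-time loop concrete, here's a minimal sketch in Python. The `model_forward` function is a hypothetical stand-in for a full Transformer forward pass, not DeepSeek's or GPT-2's actual API; the point is just that every new token is computed from all of the tokens that came before it.

```python
import numpy as np

def generate(model_forward, prompt_tokens, num_new_tokens):
    """Greedy autoregressive decoding with a hypothetical model_forward(tokens) -> logits."""
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        logits = model_forward(tokens)            # forward pass over the whole sequence so far
        next_token = int(np.argmax(logits[-1]))   # pick the most likely next token
        tokens.append(next_token)                 # it becomes part of the input for the next step
    return tokens
```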
Let's dig a bit deeper into exactly how the standard attention mechanism works in models like GPT-2 and build up a few equations, so we can make sense of how the DeepSeek team made such a powerful improvement.

To compute a given attention pattern, we take the input matrix X. This could be the input to any layer of our model; it has one row for each input token and a number of columns corresponding to the embedding dimension of the model, which is the length of the vector used to represent each token. GPT-2 Small's embedding dimension is 768, and DeepSeek R1's embedding dimension is 7,168. To compute a given attention pattern, we multiply our input matrix X by two separate sets of learned weights, WQ and WK. In GPT-2 these matrices are of dimension 768 by 64, and they produce two new matrices, Q and K, each of dimension 9 by 64. The rows of our Q matrix are known as queries, and the rows of our K matrix are known as keys.

The core idea of attention is that we now search for pairs of tokens that have similar queries and keys, allowing the model to learn various relationships between tokens. For example, a token like "flag" could query for words that modify its meaning, while words like "American" can produce keys in certain attention heads that flag them as modifiers. This modifier query and modifier key should produce similar vectors. Mathematically, to find similar keys and queries, we can take the dot product of the key and query for each possible pair of our nine tokens; similar key and query vectors will generate high dot products. We can compute all of these dot products at once by transposing our key matrix and multiplying by our query matrix, resulting in a new 9-by-9 matrix where each entry corresponds to the dot product of a single key and query. To compute our attention pattern, we apply a masking operation, effectively zeroing out the upper-right portion of our matrix. This step is mostly important in training, as it prevents the model from cheating on its task of next-token prediction by just looking at the next token. Finally, we normalize our result by dividing by the square root of the key dimension (64 here) and applying a softmax operation, which forces each of the rows of our matrix to sum to one.

Now that we've computed our attention pattern, we need to actually use it to process our data, and this involves a couple more matrix multiplies. We first compute what's known as a value matrix by multiplying our input by a third weight matrix, WV. This computation is identical to the way we computed our key and query matrices, just with a different set of learned weights. We then multiply our attention pattern matrix by our value matrix. This effectively takes a weighted sum of our values, following our attention pattern. One way to think about this step is as processing our inputs using a neural network where the weights are controlled by the data itself.

Finally, the attention block in each layer has multiple heads. Each head performs the same computations but with different learned weights, resulting in a different set of queries, keys, attention patterns, and values for each head. The idea here is that various attention heads can specialize in various tasks, like searching for adjectives or searching for other instances of the same token. To compute the final output of the attention block, we stack the results from each head together and multiply by a final learned weight matrix, WO, giving us the final matrix output of our attention block.

The attention block is a key part of modern language models, but it requires a significant amount of computation. Since the height and width of our attention pattern are equal to the number of input tokens, the number of entries in this matrix scales as the number of input tokens squared. This is potentially a huge computational problem for large models: OpenAI's ChatGPT models now offer maximum context lengths of over 100,000 tokens. For reference, this is about the length of the first Harry Potter book, so computing each attention pattern for ChatGPT's maximum allowed input size is equivalent to arranging the entire text of the book as a single row and column and then computing dot products for every possible pair of tokens from the entire text.
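Here's a minimal NumPy sketch of the single-head computation just described, using GPT-2 Small's sizes from the video (nine input tokens, a 768-dimensional embedding, 64-dimensional keys and queries). The random weights are placeholders standing in for learned parameters; real GPT-2 weights would of course be loaded, not sampled.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n_tokens, d_model, d_head = 9, 768, 64              # GPT-2 Small sizes from the video
rng = np.random.default_rng(0)

X  = rng.standard_normal((n_tokens, d_model))       # input: one row per token
WQ = rng.standard_normal((d_model, d_head)) * 0.02  # placeholder learned weights
WK = rng.standard_normal((d_model, d_head)) * 0.02
WV = rng.standard_normal((d_model, d_head)) * 0.02

Q, K, V = X @ WQ, X @ WK, X @ WV                    # queries, keys, values: each 9 x 64

scores = Q @ K.T / np.sqrt(d_head)                  # dot product of every query with every key
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[mask] = -np.inf                              # causal mask: no peeking at later tokens
A = softmax(scores, axis=-1)                        # 9 x 9 attention pattern, each row sums to 1
head_output = A @ V                                 # weighted sum of values, one row per token
```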
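And here is the multi-head step sketched the same way: each head gets its own weights and its own attention pattern, and the per-head results are stacked and projected back down with an output matrix WO. Again, the weights are random placeholders, and the per-head computation simply repeats the single-head sketch above.

```python
import numpy as np

n_tokens, d_model, d_head, n_heads = 9, 768, 64, 12        # GPT-2 Small sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((n_tokens, d_model))                # stand-in layer input

WQs = rng.standard_normal((n_heads, d_model, d_head)) * 0.02   # one set of weights per head
WKs = rng.standard_normal((n_heads, d_model, d_head)) * 0.02
WVs = rng.standard_normal((n_heads, d_model, d_head)) * 0.02
WO  = rng.standard_normal((n_heads * d_head, d_model)) * 0.02  # final output projection

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)

head_outputs = []
for h in range(n_heads):                                    # each head: its own Q, K, V, pattern
    Q, K, V = X @ WQs[h], X @ WKs[h], X @ WVs[h]
    scores = Q @ K.T / np.sqrt(d_head)
    scores[mask] = -np.inf                                   # causal mask
    head_outputs.append(softmax(scores, axis=-1) @ V)        # 9 x 64 output per head

block_output = np.concatenate(head_outputs, axis=-1) @ WO    # stack heads, project back to 768
```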
Fortunately, there's a huge computational shortcut that we can take. As large language models generate new text a single token at a time, the attention patterns themselves don't actually change that much. In our American flag example, let's say the model generates a new token for the word "blue"; our phrase is now "the American flag is red, white and blue". To see what the model says next, we pass this new 10-token input back into the model to get the 11th token, and so on. Our new 10-token input results in key, query, and value matrices, each of dimension 10 by 64. But importantly, since our weight matrices apply the same identical operation to each token, the first nine rows of our key, query, and value matrices are unchanged from our original nine-token input.

Transposing our keys and multiplying by our queries to compute our new attention pattern, note that the first nine rows of Q and the first nine columns of K-transpose are unchanged. This means that the upper-left 9-by-9 block of our attention pattern will also be unchanged, and we only need to compute a new final row and column to arrive at our new 10-by-10 attention pattern. Further, since we mask out the upper-right corner of our attention pattern, we actually only need to compute the new bottom row. The bottom row of our attention pattern results from multiplying the final row of our query matrix by each column of our transposed key matrix, so to compute this final attention pattern row, we need to know all of our keys but only the final, new row of our query matrix. Since we already computed nine out of ten of our keys on the previous call of the model, it's much more computationally efficient to store these keys in memory and just access them when the new 10-token input comes along. The same logic applies to our value matrix: we need our full value matrix to compute our new outputs, but the first nine rows are unchanged, so we can just cache them in memory. Note that there's no need to cache the queries, since we only need the new final row of our queries to update our attention pattern.

This idea is called key-value or KV caching, and it's a critical part of large language model infrastructure. Instead of compute growing quadratically as the square of the number of input tokens, key-value caching means that the compute required by the model's attention blocks scales linearly with the number of input tokens.

Now, this compute shortcut does come at a cost: specifically, increased memory usage. Our system must now store the keys and values for the full history of the model session, for all attention heads across all layers, in memory. Given a model with L layers, n_h attention heads per layer, a dimension of d_h for our key and value vectors, and n input tokens, we must store 2 × n × d_h × n_h × L unique numbers in our KV cache. Assuming 16-bit floating point numbers, the DeepSeek R1 architecture, and a context length of 100,000 tokens, we end up needing to retrieve 4 megabytes per token in the model's context window, resulting in a huge 400 gigabytes of memory reads for each new token we compute.
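To sanity-check those figures, here's the arithmetic in code. The layer and head counts are the ones quoted above; the per-head key/value dimension of 128 is an assumption on my part (it's the value that reproduces the 4-megabyte figure), and fp16 storage is assumed.

```python
bytes_per_number = 2          # assuming fp16 storage
n_layers = 61                 # DeepSeek R1 layers (from the video)
n_heads  = 128                # attention heads per layer (from the video)
d_head   = 128                # assumed per-head key/value dimension

# Standard multi-head attention: one key and one value vector per head, per layer, per token.
kv_bytes_per_token = 2 * d_head * n_heads * n_layers * bytes_per_number
print(kv_bytes_per_token / 1e6)            # ~4.0 MB of KV cache per token

# With a 100,000-token context, that's the memory traffic for every new token generated.
print(kv_bytes_per_token * 100_000 / 1e9)  # ~400 GB of reads per generated token
```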
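And here's a toy, single-head sketch of the caching scheme itself: the keys and values for all earlier tokens sit in memory, and each new token only requires its own query plus the new bottom row of the attention pattern. This is just an illustration of the idea described above, not DeepSeek's actual inference code.

```python
import numpy as np

class CachedAttentionHead:
    """One attention head with a key-value cache (illustrative toy, not real inference code)."""
    def __init__(self, WQ, WK, WV):
        self.WQ, self.WK, self.WV = WQ, WK, WV
        self.K_cache = np.zeros((0, WK.shape[1]))    # grows by one row per generated token
        self.V_cache = np.zeros((0, WV.shape[1]))

    def step(self, x_new):
        """x_new: embedding of the single newest token, shape (d_model,)."""
        q_new = x_new @ self.WQ                                     # only the newest query is needed
        self.K_cache = np.vstack([self.K_cache, x_new @ self.WK])   # old keys are reused as-is
        self.V_cache = np.vstack([self.V_cache, x_new @ self.WV])   # old values are reused as-is
        scores = q_new @ self.K_cache.T / np.sqrt(self.WQ.shape[1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                                    # the new bottom row of the pattern
        return weights @ self.V_cache                               # output for the newest position only

# Usage: feed tokens one at a time; per-step compute stays flat instead of growing quadratically.
rng = np.random.default_rng(0)
d_model, d_head = 768, 64
head = CachedAttentionHead(*(rng.standard_normal((d_model, d_head)) * 0.02 for _ in range(3)))
for _ in range(10):
    out = head.step(rng.standard_normal(d_model))
```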
DeepSeek's solution to this problem is really clever, and it was great to be able to tinker with their inference code to really get my head around it. There's nothing quite like hands-on experimenting like this for developing understanding, which is why I was more than happy to partner again with this video's sponsor, KiwiCo.

KiwiCo offers hands-on project kits that make learning genuinely fun for kids of all ages. My daughter is obsessed with colors and rainbows right now and loves the Color Discovery crate. These spinners are such a fun way for us to explore color mixing together. It's amazing to see how the crates progress and build on each other: last year she was developing fine motor skills as part of the Panda club, and now, in the Sprouts club, she's creating and experimenting. When she turns six in a few years, she can join the KiwiCo Labs club, where she'll get to work on more complex science and engineering projects, like this remote-controlled car. I would have absolutely loved this crate as a kid. My son is working on learning the names of colors; so far, everything is blue. This block puzzle is such a fun, interactive way for him to explore different colors at his age. When my kids quickly get bored of or break many of their toys, we find ourselves continually coming back to their KiwiCo crates. The build quality is really great, and the thoughtfulness and multi-purpose design built into each crate really keeps them engaged. If you want your family to experience the awesomeness of KiwiCo, use my code Welch Labs to receive 50% off your first crate for kids three and older, or 20% off your first Panda crate for kids under three. Big thanks to KiwiCo for sponsoring this video. Now, back to DeepSeek's solution to the KV cache problem.

Untenably large KV caches are not a new problem. One popular solution is to reuse key and value matrices across multiple attention heads. In multi-query attention blocks, instead of having unique key and value matrices for each attention head, we share a single key and value matrix across all heads. This reduces the required size of our KV cache by a significant factor: the number of heads per layer, which is 128 for the DeepSeek R1 architecture. However, this modification does impact model performance, as forcing all attention heads to use the same keys and values allows for less specialization. A less destructive version of this idea is grouped-query attention, where instead of forcing all attention heads in a given layer to share the same key and value matrices, we create multiple groups of attention heads that share the same key and value matrices. Meta's Llama 3 models use grouped-query attention, with groups of eight attention heads sharing the same key and value matrices, reducing the size of the KV cache by a factor of eight. Grouped-query attention reduces KV cache size, but it still takes a performance hit relative to full multi-head attention.

Now, what's really remarkable about DeepSeek's approach, called multi-head latent attention, is that they were able to reduce the needed KV cache size by a factor of 57 while actually improving performance. The key insight is a novel application of a very common idea in machine learning: a latent space. What if the model could learn to efficiently compress its own keys and values? Multi-head latent attention effectively adds an extra step between each attention head's input and the key and value matrices. The idea is to project our input into a compressed latent space that, like multi-query attention, is shared across all attention heads in a given block. However, unlike multi-query attention, where each head shares the same exact keys and values, in multi-head latent attention the compressed latent space is projected back up to keys and values using another set of learned weights, WUK and WUV, where the weights are unique to each attention head. This gives multi-head latent attention more flexibility than multi-query attention or grouped-query attention.
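Here's a rough NumPy sketch of that structure as I understand it from the description above: a single shared down-projection produces a small latent matrix, which is all that would need to be cached, and each head has its own up-projection matrices, WUK and WUV, to recover full-size keys and values. The specific sizes below (a 512-dimensional latent, 128-dimensional heads) are illustrative rather than an exact copy of the published R1 configuration, and the real implementation has additional details omitted here.

```python
import numpy as np

n_tokens, d_model = 9, 7168                  # DeepSeek R1's embedding dimension, from the video
d_latent, d_head, n_heads = 512, 128, 128    # illustrative sizes, not the exact R1 config
rng = np.random.default_rng(0)

X = rng.standard_normal((n_tokens, d_model))             # layer input, one row per token

# One shared down-projection: this small latent matrix is the only thing that needs caching.
W_DKV  = rng.standard_normal((d_model, d_latent)) * 0.02
latent = X @ W_DKV                                        # n_tokens x 512

# Per-head up-projections back to full-size keys and values (weights unique to each head).
W_UK = rng.standard_normal((n_heads, d_latent, d_head)) * 0.02
W_UV = rng.standard_normal((n_heads, d_latent, d_head)) * 0.02

K_per_head = [latent @ W_UK[h] for h in range(n_heads)]   # each head gets its own keys...
V_per_head = [latent @ W_UV[h] for h in range(n_heads)]   # ...and its own values

print(latent.shape)          # (9, 512): what's cached, independent of the number of heads
print(K_per_head[0].shape)   # (9, 128): per-head keys recovered from the shared latent
```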
Now, at face value, since we've introduced a new matrix multiply, it appears that we've just traded some memory bandwidth for additional compute, and after all, the entire point of KV caching was to reduce the high compute needs of attention blocks. However, as the DeepSeek team points out, with some clever linear algebra we can rearrange our query computation to absorb the WUK weights and rearrange our final output computation to absorb the WUV weights: since a key only ever appears inside a dot product with a query, the product of WQ and the transpose of WUK can be folded into a single matrix, and similarly WUV can be folded into WO. Since all of these weights are fixed at training time, we only have to compute the absorbed weights once and can avoid any additional compute during inference. So when a new token comes along, we simultaneously compute its query vector and the query's projection into the latent cache space in one step, and then compute our attention pattern directly from the latent key-value cache matrix. It's a really elegant solution.

With multi-head latent attention, the size of the needed KV cache no longer has any dependence on the number of attention heads per layer; it just depends on the size of the shared KV cache matrix. For DeepSeek R1, this is equal to the number of input tokens by 576. If implemented with traditional attention blocks, R1 would require 4 megabytes of KV cache per token; grouped-query attention with a group size of eight would cut this down to 500 kilobytes per token; and multi-head latent attention reduces the needed cache to only 70 kilobytes per token, a 57x reduction. What we're left with is a true improvement to the Transformer architecture, enabling DeepSeek R1 to generate tokens more than six times faster than a vanilla Transformer while actually improving algorithmic performance. Multi-head latent attention allows attention heads to share key and value information in a more optimal way, where the model itself learns how to compress and share this information between attention heads.

The Transformer architecture is one of the most significant breakthroughs in modern AI history, and DeepSeek appears to have just made it work significantly better. It's amazing to see the path that DeepSeek carved through their 2024 papers, systematically making substantial improvements to models that required hundreds of millions of dollars in R&D and infrastructure costs. The stakes have never been higher for neural networks, and it will be fascinating to see what new set of ideas unlocks the next level of capabilities as we build more and more intelligent systems.

If you enjoyed the graphics in this video, I think you'll really like the poster version. The poster includes a walkthrough of multi-head latent attention with detailed captions, and I've rearranged the flow a bit to work better as a poster. On the bottom, I've included a detailed comparison between various forms of attention, including the required sizes of the KV caches, and a 3D model of each attention block. The matrix images in the video and poster are actually from the real DeepSeek model; I'm mostly showing the weights from the first layer of DeepSeek V3. The poster looks great in a simple frame that you can pick up on Amazon, and it's a great way to see how MLA works, and just a nice way to decorate your walls. I'm offering free shipping on the poster for a limited time at welchlabs.com, or you can pick it up as a limited-edition bundle with a signed copy of my Imaginary Numbers book. Finally, a big thank you to everyone who's purchased from the Welch Labs store; your purchases go a long way toward helping me make more great videos.