State Space Models and Mamba: A New Architecture for Language Models

Hello, my name is Luis Serrano, and this is Serrano.Academy. This video is about state space models and Mamba. State space models are a new type of architecture that is revolutionizing large language models. They were introduced in this paper by Gu and Dao, and they have shown great performance and great efficiency. It is a very interesting architecture because it combines the notions of recurrent neural networks and convolutional neural networks in order to generate language. So in this video I'm going to show you what a state space model is with a simple example, and then we're going to move on to generating language. Are you ready? Let's begin.

[Music]

So let's start by learning what a state space model is, and we're going to do it with an example, which is a car. In this case it's a race car. There are certain things we can do to the car, and there are certain things we can measure from the car. So what can we do? Well, we can do maintenance. We can measure the vehicle health on a given day, and we can also measure its performance. Some examples of maintenance: we can top off the fluids, or we can do general maintenance. For the vehicle health, we can look at, for example, the gas level, the oil level, the tire condition, and the motor condition. If you don't know a lot about cars, don't worry, I don't know a lot about cars either; I'm making up some variables here. And for the performance, we could measure many things, but we're only going to worry about the speed.

We're going to call these variables u, h, and y. We're going to look at u as the input, because on a given day we decide what to do to the car. We're going to look at h as the general state of the vehicle, something that changes every day. And we're going to look at y as the output, which is how we measure the performance. We're going to do this sequentially, so we're going to go from day t to day t+1. Now, there are some functions, or maps, between u, h, and y, in the following way. Notice that if we don't do anything to
the car, it gets a little older every day: its gas goes a little lower, its oil goes a little lower, the motor gets a little older, the tires get a little older. This is all described by the function A. These functions are all going to be matrices, but for now I'm going to call them functions. So A describes the wear and tear of the car from one day to the next. Now, on any given day we decide what maintenance we're going to give to the car: we could change the fluids, we could repair things, or we could do nothing. That affects the state of the car, and how it affects it is given by the function B. Now, the performance, or the speed, depends on how well the car is doing, so it comes from the state of the car. That is going to be described by the function C. And then there's another function, D, which goes from the maintenance to the performance, because if we do some maintenance, that's also going to affect the speed of the car.

We can reorganize the diagram like this: on the left we have the state of the car at day t-1, on the right we have the state at day t, on top we have the maintenance, and on the bottom we have the speed. Now, for the language models we're going to use, we're not going to worry so much about D. Actually, we are going to forget about it. And why can we forget about it? Well, we are just going to assume that B and C take care of D. In other words, the repairs affect the state of the car, and the state of the car affects the speed, so in some way the repairs affect the speed of the car indirectly, and that takes care of D. What we're going to do is look at the model over many days in a row: day t-2, t-1, t, t+1, t+2, etc. That's the sequence we're going to be working with. But let's focus on this one, and for a moment let's still have D; just keep in mind that when I say D, this is the one we can forget about.

The state at day t-1 is going to be called h_{t-1}, the state at
day t is going to be called h_t, the input, or the repairs, at day t is going to be called x_t, and the output, the speed, is going to be called y_t. Now let me describe some equations. Let's look only at this part. So what is h_t? Well, h_t comes from two places. It comes from A applied to h_{t-1}. You can think of it as the function A applied to the vector h_{t-1}, but it's actually going to be a matrix, so this is matrix multiplication: a matrix times a vector. And then we add the matrix B times the vector x_t. So this pretty much says that the state at day t is A times the state at day t-1, plus B times the repairs I do on day t:

h_t = A h_{t-1} + B x_t

Now we can look at the other half of the diagram. So what is y_t? Remember, y_t is the speed, and the speed comes from the state h_t, which is multiplied by the matrix C, and that's added to the repairs x_t times the matrix D:

y_t = C h_t + D x_t

So these are the two equations of a state space model. And as I said, we can forget about D and have a simpler state space model, like this:

h_t = A h_{t-1} + B x_t
y_t = C h_t

Now let me show you a small numerical example, just to give an idea of how these matrices work. Remember that the vehicle health, or the state, is going to be given by the amount of gas, the amount of oil, the quality of the tires, and the quality of the motor. The transition is going to be given by a matrix where these four things appear as rows and also as columns. Now, let's say that from one day to another the car loses 10% of its gas, so the gasoline becomes 0.9 of what it was yesterday. That means that in this matrix we're going to have a row like this. Now let's say that 5% of the oil gets lost, so the oil today is 0.95 times the oil yesterday, which gives us this row over here. Now let's say that the tires deteriorate to 0.8 of what they were yesterday. They deteriorate pretty fast because it's a race car, and therefore we have this row over here. And finally, let's say that the motor deteriorates at a rate of 15% per day, so the motor state today is 0.85 of what it was yesterday, which gives
us this row over here. But some more stuff happens; this is not just a diagonal matrix. Let's say that the car uses less gas if the tires are pretty good. So if we've got good tires, then we add a little bit to the gas, 0.05 of the tire quality, and therefore we have a 0.05 here in the matrix. And let's say that if you have good oil in the car, then that's good for the motor, so we add 0.05 of the amount of oil to the state of the motor today, and that gives us a 0.05 here in the matrix. Now, as I said, these numbers are not realistic; I'm just coming up with numbers. If you are a big fan of race cars, you may think, "Wow, these numbers make no sense." I'm sure they don't, but we're just working with an example here. This is called the state transition matrix. It's the matrix that tells me what the state is tomorrow based on the state today.

Now let's look at another matrix, which is the control matrix. The control matrix is the one that tells me how much the input influences the state today. So let's see how adding fluids and repairing the car would help the vehicle state. Let's say that adding fluids adds one unit to the gas and one unit to the oil, and repairing the car adds one unit to the quality of the tires and one unit to the quality of the motor. So our matrix is going to look like this: on the rows we have the input, or the maintenance, and on the columns we have the four variables representing the state of the car, and the matrix is this one. If that's not clear, you can also see the equations like this, where the first is 1 * gas + 1 * oil + 0 * tires + 0 * motor, and the second equation can be seen like this. And now you simply take the transpose of the matrix given by those coefficients.

Now let's look at the observation matrix. This is the one that turns the vehicle state into the performance, or the output, which in this case is the speed. This one is simpler: it's simply a vector, where the rows are given by the speed and the columns are given by the variables representing
the vehicle state. Let's say it's this one over here, given by this equation: 0.4 * the gasoline + 0.2 * the oil + 0.3 * the tires + 0.5 * the motor is equal to the speed. That is the observation matrix. And finally we have the direct action matrix, the one that we're not going to worry about very much because we'll remove it, but let's do it anyway. Let's say the equation is this one: 0.1 * the fluids + 0.2 * the repairs adds to the speed. Therefore our matrix has the inputs, the fluids and the repairs, as columns, and the speed as rows, and it's given by this small matrix over here. So those are examples of the four matrices.

Now, if you want to go further with this example, feel free to follow along; if not, feel free to skip ahead, because this may be a few too many numbers. Let's say that the state of the car at day t-1 is (1, 1, 1, 1). That is the vector that describes the four variables of the state of the car. And let's say the input is (0.4, 0.3), so we put in 0.4 of fluids and 0.3 of repairs. Now let's see what the speed is going to be for these variables. First we have to work out what the state at day t is. That's given by this vector, h_{t-1}, times the matrix A, which gives us this vector over here, and that's the state at day t. Well, that's not exactly the state at day t; that is only part of it, because the other part comes from the input times the matrix B, which gives us this vector over here. The sum of these two vectors is the state at day t, so that one goes over here. That is h_t. Now let's calculate the speed. For that we look at the state at day t and multiply it by the matrix C, which is a vector. This gives us this number over here. It looks like a number, but it's really a 1-by-1 matrix. And if we're considering the matrix D, then the input also adds to the speed in the following way: it's the input times the matrix D, which gives us this number over here, which is also a 1-by-1 matrix. When we add these two, we get the speed. That's y_t, the output.
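
The worked example above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the decimal values for C, D, and the input are a reading of the spoken numbers (taken here as 0.4, 0.2, 0.3, 0.5 for C; 0.1, 0.2 for D; and (0.4, 0.3) for the input), the code uses the column-vector convention h_t = A h_{t-1} + B x_t, and the helper names matvec and ssm_step are made up for this sketch. The figures it produces need not match the ones shown on screen in the video.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def ssm_step(A, B, C, D, h_prev, x):
    """One step of a state space model:
    h_t = A h_{t-1} + B x_t,   y_t = C h_t + D x_t."""
    h = [a + b for a, b in zip(matvec(A, h_prev), matvec(B, x))]
    y = [c + d for c, d in zip(matvec(C, h), matvec(D, x))]
    return h, y


# State order: gas, oil, tires, motor. Input order: fluids, repairs.
# Each row of A says how one state variable today depends on yesterday's state.
A = [
    [0.90, 0.00, 0.05, 0.00],  # gas: keeps 90%, plus 0.05 of tire quality
    [0.00, 0.95, 0.00, 0.00],  # oil: keeps 95%
    [0.00, 0.00, 0.80, 0.00],  # tires: keep 80%
    [0.00, 0.05, 0.00, 0.85],  # motor: keeps 85%, plus 0.05 of oil
]
B = [
    [1, 0],  # fluids add to gas
    [1, 0],  # fluids add to oil
    [0, 1],  # repairs add to tires
    [0, 1],  # repairs add to motor
]
C = [[0.4, 0.2, 0.3, 0.5]]  # assumed reading of the spoken coefficients
D = [[0.1, 0.2]]            # assumed reading of the spoken coefficients

h_prev = [1, 1, 1, 1]       # state at day t-1
x_t = [0.4, 0.3]            # assumed input: 0.4 fluids, 0.3 repairs

h_t, y_t = ssm_step(A, B, C, D, h_prev, x_t)
print("state at day t:", [round(v, 4) for v in h_t])
print("speed at day t:", round(y_t[0], 4))
```

Setting D to [[0, 0]] gives the simplified model y_t = C h_t that the rest of the video uses.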
And so the output goes here. So that's a small example of a state space model. As I said before, the state space model is sequential, because we have something for every day. Here I removed the D, but what happens in general is that every day we have an input given by x_t, a state given by h_t that tells us the state of the model, and an output y_t. Now, this looks a lot like a recurrent neural network. If you want to know more about recurrent neural networks, I invite you to check out this video on my channel, which uses a food example to explain RNNs. So, as a little summary, this is the state space model we're going to use, and the equations coming out of it are h_t = A h_{t-1} + B x_t and y_t = C h_t. Next I'm going to tell you how to apply this to language.

[Music]

So now that we know what a state space model is, let's look at how it applies to language generation. What does a large language model do? Well, if it's given a bunch of text, it finds the next word that would fit there, and it generates a long piece of text one word at a time. So let's focus on how it's going to generate the next word in this sentence. The sentence is: "Last week we saw a TV documentary about wild animals. We saw giraffes with long ___." Now let's try to come up with the next word. Just like before, we're going to have a bunch of variables. One variable is the context; imagine something big describing all the things we're talking about in detail. Another variable is the last word, which in this case is "long". And the output that we want is the next word. Let's say the next word is "neck", because we're going to say "we saw giraffes with long neck". Now, how does the model find the next word? Well, it has a list of all the possible words, and it assigns a probability to each one. The words that are likely to come next have high probabilities. For example, "legs", because we could see giraffes with long legs, has a probability of 0.2, but "neck" is the one
that wins, because it has a probability of 0.7. So this is how the model picks the next word. This model behaves a lot like the car model, because the context reminds us of the state of the car. The context is like the state of what we're saying; it's some big vector that describes what we are talking about. The last word is the input, so it's like the repairs we do to the car. And the next word is the output, which is like the performance of the car. So we're going to make a state space model with these three variables.

This is how it works. Here is the context, which is what we're talking about. The computer will store this in a long vector, but let's say that conceptually it stores it somewhere and has an idea of what we're talking about. It knows that we are watching TV, it knows that on the TV there's a program about wild animals, about nature, etc. So this is the state after word t-1. Now, the function A, or the matrix A, is going to tell us what the state is going to be at word t, without knowing word t. The state is going to be similar, but some things will be forgotten and some things will be remembered more. For example, the matrix A will say: "Well, you're still watching TV, so I'm still going to keep the TV. Maybe you said lions a while ago, but you haven't mentioned lions in a while, so I'm going to forget them a bit. You talked about trees a while ago, but I haven't heard trees in a while, so I'm going to forget them. And you just mentioned giraffes, so I'm going to remember this one a little more." So again, the matrix A just goes from the state at word t-1 to the state at word t without knowing word t. We still don't have the full state at t, because we need to know what the last word is. The word is "long", and this goes here. B is the matrix that's going to tell us how much the last word, "long", affects the state. So "long" is like the repairs we did on the car. Conceptually, the model says: "Well, you're talking about something long. It could
be the neck of the giraffe or the legs of the giraffe, so I'm going to start looking at long things here." And then finally we need the output, which is the next word. The output comes from two places, C and D. Let's still have D, even though D is not used here; let's just have it for fun. Basically, the functions C and D help the model figure out that we're talking about a long thing on a giraffe, which is probably the neck. So the output is the word "neck", and it goes over here. That is the next word; in the car, it was the performance of the car, the speed. As I said, the state is a long vector, and the words, both input and output, are also vectors, and the maps between vectors are given by matrices. So A, B, C, and D are matrices.

Of course, I'm sure you see the parallel with the car. In the car we had the state at day t and day t-1, and the state was the health of the car, the input was the repairs, and the output was the speed. In this model, the state is the context, the input is the last word we said, and the output is the next word we say. Now, this model gives us the exact same equations as we had with the car. So h_t, the state at word t, is A h_{t-1}, the state at word t-1, plus B x_t, which is the last word. And the output y_t, the next word, is given by C h_t, the state at t, plus D x_t, which is the input, or the last word we said. This is where these variables live in the diagram, so the diagram looks like this. And as I said, we don't really use D, so let's forget about it.

As I mentioned, h, x, and y are long vectors. So what do these vectors look like? Well, x_t is a vector whose length is the number of words that we have. In reality they're tokens, but let's think of them as words. If we want to encode the word "long", then we simply have a really long vector which is very sparse: it has mostly zeros, except for a one at the position of the word "long". So that is the vector x_t. y_t is a little different, but it has
the same length, because the length is the number of words that we have. Except this vector is not so sparse; it actually holds a bunch of probabilities. By probabilities I mean a bunch of non-negative numbers that add up to one. A y that strongly suggests we use the word "neck" is this one, because the probability of "neck" is 0.7. Sometimes it's going to tell us to use the word "legs", which has a probability of 0.2, and it could also use all the other words, but they have really, really small probabilities. This is what makes the model say different things each time. You don't want these models to always answer the exact same thing, so the model outputs a bunch of probabilities and then samples the next word from those probabilities.

Now, the vector h_t, the one for the context, is more complicated, because it doesn't have one entry per word. Instead, it's more of an abstract vector. I like to imagine it in my head as a vector that somewhere stores the topic that we're talking about, let's say zoology; the tense that we're using, let's say present; the tone that we're using, let's say a serious tone; and many other things. It could be many things that the computer knows and we don't. As a matter of fact, this vector is very hard to read, but if you want to imagine a really long vector where the computer encodes exactly what we're talking about, then that's the vector h_t. This vector has a bunch of numbers: basically, it has high numbers for the features that appear in the context and low numbers for the features that don't appear in the context. If you like visual matrix equations, this is how the first equation looks, h_t = A h_{t-1} + B x_t, and this is how the second equation looks, y_t = C h_t + D x_t. So this is how the equations of a state space model appear. Of course, I mentioned that we removed D, but we can have it here for fun.

Now, Mamba does something a little bit more special. Here is our state space model, and here is a more specific case where
the input words are in green, the output words are in red, and the context is in the middle. What Mamba does is something similar to what we did with the attention mechanism. The attention mechanism decides to put more emphasis on some words that are more similar, as opposed to others that are not so similar. Mamba pays more attention to some words than to others. So instead of giving every word the same attention, it says: "Well, some words are more important, for example giraffes, animals, long, documentary, and some, like about, we, and so, are not so important." That's pretty much what Mamba does. I'm not going to get into much detail here, but if you have any questions, feel free to put them in the comments, and maybe we'll do more content on Mamba.

Now, I mentioned that SSMs are related to RNNs, recurrent neural networks, and also to CNNs, which are convolutional neural networks, but I haven't explained the CNN part, so let me get into a bit more detail here. Convolutional neural networks are very good for analyzing images. For example, look at this image, where some pixels have value one and others have value zero. What a convolutional neural network does is summarize the image over here, in this other grid, and it does this many times. That's why these networks have a lot of layers. I'm going to show you only one layer. Our convolutional neural network has a filter. I'm going to pick a pretty simple filter here, one with the number 1/9 repeated nine times. As it goes through the image, it's going to do the following: it takes a bit of the image of the heart and the filter, and it combines them by multiplying them entrywise and then adding up the results. Because everything here is 1/9, it's really just taking the average of all the numbers in the window. So here it goes, and we can move the filter over the image as follows and fill in the summary on the left. That is called a convolution. And then we can also do the same thing
with another filter, and continue doing this many times. A convolutional neural network does this many times until it gets to a really good summary of the image. If you want to know more about convolutional networks, check out this video that I have on my channel.

So you can use this for images, but you can also use it for sequences, so convolutions are used in language too. How are they used on sequences? Well, with a one-dimensional convolution. If you have a sequence like this one, we're going to summarize it over here, and we're going to pick a filter, just like before. The previous filter was a bunch of 1/9s. This one's going to have these three numbers: a quarter, a half, and a quarter. What we're going to do is multiply the zero by the one quarter, and we get zero. Then we're going to slide the filter to the right one step at a time. So next we do 0 * 1/2 + 1 * 1/4, which is 1/4. Then we move it one more time to the right, and we get 0 * 1/4 + 1 * 1/2 + 1 * 1/4, which is 3/4. We can continue doing this, and then we get the numbers underneath. So this is what is called a convolution. It's a mathematical operation that we summarize like this: the convolution operator, which is symbolized by that star, gives the vector on the right. In other words, the convolution of u and v is w.

Now, where does this appear? Well, it's not so obvious that it appears in a state space model, but let's do a little bit of math. Remember that x_0 becomes h_0 thanks to the matrix B, and here there's no A, because this is the first state. So what are the equations? Well, h_0 is B x_0, because we have no A. And now what's the output y_0? Well, y_0 is C h_0. So now let's replace h_0 with B x_0, and we get y_0 = C B x_0. That's the first equation; keep that one in mind. Now let's go one more step. Now we have the matrix A that goes from h_0 to h_1, and our equation says h_1 = A h_0 + B x_1. For the output we have that y_1 is equal to C h_1. Now, if you replace h_1 here, and h_0 with B x_0, then you get the
following: y_1 = C A B x_0 + C B x_1. That required a little bit of work, but I encourage you to pause the video and actually work it out yourself. If we continue, the next equation, which I also encourage you to work out, is going to be the following: y_2 = C A^2 B x_0 + C A B x_1 + C B x_2. And I think we start seeing a pattern, right? If we do this k times, we get y_k = C A^k B x_0 + C A^(k-1) B x_1 + ..., all the way to the end, where the exponent of A comes down and the subscript of x goes up, until we get C A B x_{k-1} + C A^0 B x_k, and that last term is just C B x_k.

However, if these equations look hard to obtain, I have a shortcut. The shortcut is simply following paths. So let's see what y_0 would be. We look at all the paths in this directed graph, following the arrows, that take us to y_0. There's only this one over here, and then we multiply C, B, and x_0 in the backwards direction to get that y_0 is C B x_0. Now let's look at y_1. There are two paths that lead to y_1: this one over here, which gives us the term C A B x_0, and this one over here, which gives us the term C B x_1. The sum of these two is y_1. Now how much is y_2? Well, there are three paths that take us there. There's this path over here, which gives us the term C A A B x_0, which is C A^2 B x_0. Then there's this path over here, which gives us the term C A B x_1. And finally there's this straight path over here, which gives us the term C B x_2. So the sum of these three is y_2. You can continue drawing this diagram, and you get that for y_k there are k+1 paths leading to it, and the terms are C A^k B x_0 all the way to C B x_k.

Those equations look like a lot of work, because we have to raise matrices to a lot of powers, multiply them, and add them. But with a convolution this is done very easily, so check this out. One of my vectors is going to be the vector from x_0 all the way to x_k, and my other vector is going to be this vector over here, made of the coefficients C B, C A B, C A^2 B, and so on up to C A^k B. I'm going to apply the convolution. So first I
have C B x_0. Then I move one to the right and I get C A B x_0 + C B x_1. Then I move one to the right and I get the equation for y_2: C A^2 B x_0 + C A B x_1 + C B x_2. And so on and so forth, so here I get the equation for y_{k-1}, and here I get the equation for y_k. Now, why did I want to do it this way instead of just doing the operations? Well, because convolutions are really fast. There are a lot of tricks that can be used to calculate these convolutions in a very fast and effective way, so it is very beneficial for state space models that they can be written as a convolution.

So that's all. Thank you very much. I have some acknowledgements to make. In my journey of learning SSMs, I checked out this channel from Letitia, AI Coffee Break with Letitia. I highly encourage you to take a look at it; she's got wonderful videos on SSMs, LLMs, and a lot of cutting-edge AI stuff. I also read this great article by Maarten Grootendorst called "A Visual Guide to Mamba and State Space Models". It has a very detailed description of SSMs. By the way, Maarten and Jay Alammar just published this amazing book called "Hands-On Large Language Models". You should definitely check it out.

So that's it. Thank you so much for your attention. As usual, if you liked this video, please give it a like, share it with your friends, subscribe to the channel if you want to see more of this content, and leave a comment. I love reading the comments, and as a matter of fact, this video came out because somebody suggested SSMs in a comment. You can also check out my page, serrano.academy, with all the information, or tweet at me at serrano.academy, or check out my book, "Grokking Machine Learning". The details are in the comments, with a 40% discount code if you are interested in buying it. So thank you very much, and see you in the next video.
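
The equivalence between the step-by-step recurrence and the convolution described in the video can be checked numerically. The following is a minimal sketch in plain Python, assuming a single input number and a single output number per step (so B is a column and C is a row), no D matrix, and a zero state before the first input, which gives h_0 = B x_0 as in the derivation. The function names and the small example matrices are made up for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def ssm_recurrent(A, B, C, xs):
    """Run h_k = A h_{k-1} + B x_k, y_k = C h_k one step at a time."""
    h = [0.0] * len(B)  # zero state before the first input
    ys = []
    for x in xs:
        h = [a + b * x for a, b in zip(matvec(A, h), B)]
        ys.append(dot(C, h))
    return ys


def ssm_kernel(A, B, C, length):
    """Build the kernel (CB, CAB, CA^2B, ...) with `length` entries."""
    kernel = []
    v = list(B)  # holds A^j B, starting with j = 0
    for _ in range(length):
        kernel.append(dot(C, v))
        v = matvec(A, v)
    return kernel


def causal_conv(kernel, xs):
    """y_k = sum over j <= k of kernel[k - j] * x_j."""
    return [
        sum(kernel[k - j] * xs[j] for j in range(k + 1))
        for k in range(len(xs))
    ]


# Arbitrary small example: 2-dimensional state, scalar input and output.
A = [[0.5, 0.1],
     [0.0, 0.8]]
B = [1.0, 2.0]
C = [1.0, -1.0]
xs = [1.0, 0.0, 2.0, -1.0]

ys_recurrent = ssm_recurrent(A, B, C, xs)
ys_conv = causal_conv(ssm_kernel(A, B, C, len(xs)), xs)

print("recurrent:  ", [round(y, 6) for y in ys_recurrent])
print("convolution:", [round(y, 6) for y in ys_conv])
```

The two lists should agree up to floating-point rounding. The double loop in causal_conv is the naive way to convolve; the speed advantage mentioned in the video comes from computing the kernel once and using a fast convolution routine, which this sketch does not attempt.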

hello my name is Louis Sano and this is Sano Academy and this video is about State space models and Mamba State space models are a new type of architecture that is revolutionizing large language models they were introduced in this paper by Goo and Dao and they have shown great performance and great efficiency it is a very interesting architecture because it combines the Notions of recurrent neural networks and convolutional neural networks in order to generate language so in this video I&#39;m going to show you what a state space model is with a simple example and then we&#39;re going to go to generating language are you ready let&#39;s [Music] begin so let&#39;s start by learning what a state based model is and we&#39;re going to do it with an example which is a car in this case it&#39;s a race car so there are certain things we can do to the car and there are certain things we can measure from the car so what can we do well we can do maintenance for example we can measure the vehicle Health at a given day and we can also measure its performance so some examples of maintenance for example we can top off the fluid we can do general maintenance the vehicle Health we can look at for example the gas level the oil level the tire condition and the motor condition uh if you don&#39;t know a lot about cars don&#39;t worry I don&#39;t know a lot about cars I&#39;m making up some variables here and for the performance let&#39;s say we can measure many things but we&#39;re going to worry about the speed and we&#39;re going to call these variables u h and y and we&#39;re going to look at you as the input because at a given day we decide what to do to the car we&#39;re going to look at at H as the general state of the vehicle something that changes every day and we&#39;re going to look at y as the output so how we measure the performance and we&#39;re going to do this sequentially so we&#39;re going to go from day T to day t + one now there are some functions or Maps between 
you h and y in the following way notice that if we don&#39;t do anything to the car it gets a little older every day its gas goes a little lower it&#39;s oil goes a little lower the motor gets a little older the tires get a little older this is all described by the function a these functions are all going to be matrices but for now I&#39;m going to call them functions so a describes the wear and tear of the car from one day to the next one now at any given day we decide what maintenance we&#39;re going to give to the car we could change the fluids we could repair things or we could do nothing and that affects the state of the car so how it affects it it&#39;s given by the function B and now the performance or the speed is going to be defined on how well the car is doing so it comes from the state of the car so that is going to be described by the function C and then there&#39;s another function D which goes from the maintenance to the performance because if we do some maintenance that&#39;s also going to affect the speed of the car so that function is called D we can reorganize the diagram like this so in the left we have the state of the car at day T minus one on the right we have the state at day t on top we have the maintenance on the bottom we have the speed now for the language models we&#39;re going to use we&#39;re not going to worry so much about D actually we are going to to forget about it and why can we forget about it well we are just going to assume that B and C take care of D in other words that the reparations affect the state of the car and the state of the car affects the speed and so in some way the reparations affect the speed of the car indirectly and that takes care of D and what we&#39;re going to do is look at the model in many days in a row so day T minus 2 t minus1 t t+1 t plus2 Etc and that&#39;s the sequence we&#39;re going to be working with but let&#39;s focus on this one and for a moment let&#39;s still have D but just keep in mind 
that when I say d this is one that we can forget about the state at dat T minus1 is going to be called HT minus one the state at dat T is going to be called HT the input or the reparations at dat T are going to be called XT and the output of the speed is going to be called YT now let me describe some equations that happen let&#39;s look only at this part so what is HT well HT comes from two places from a up applied to HT minus1 you can think of it as the function a applied to the vector HT minus1 but it&#39;s actually going to be a matrix so this is matrix multiplication a matrix times a vector and then we add the Matrix B times the vector XT so this pretty much says the state at the date T is a * the state at the date T minus 1 plus b * the reparations I do at the date and now we can look at the other half of the diagram so what is YT YT is the speed remember and the speed comes from the state HT which is multiplied by The Matrix C and that&#39;s added to the reparations XT times The Matrix D so these are the two equations of a state space model and as I said we can forget about D and have a simpler State space model like this now let me show you a small numerical example just to have an idea of how these matrices work so remember that the vehicle help or the state is going to be given by the amount of gas the amount of oil the quality of the tires and the motor this is going to be given by a matrix where these four things appear as rows and also appear as columns now let&#39;s say that from one day to another the car loses 10% of the gas so the gasoline becomes .9 of what it was yesterday that means in this Matrix we&#39;re going to have a row like this now let&#39;s say that the oil also gets lost 5% of the oil gets lost so we get that the oil today is 095 time the oil yesterday which gives us this row over here now let&#39;s say that the tires deteriorate and it becomes 8 of yesterday so they deteriorate pretty fast cuz it&#39;s a race car and therefore we have 
this row over here and finally let&#39;s say that the motor deteriorates a rate of 15% per day so the Motor State today is 85 of what it was yesterday which gives us this row over here but some more stuff happens this is not just a diagonal matrix right let&#39;s say that the car uses less gas if the tires are pretty good so if we got good tires then we add a little bit to the gas that&#39;s a 0.05 therefore we have a 0.05 here in The Matrix and let&#39;s say that if if you have good oil in the car then that&#39;s good for the motor so we add over here 05 of the amount of oil to the state of the motor today and that gives us a 0.05 here in The Matrix now as I said these numbers are not realistic I&#39;m just coming up with numbers if you are a big fan of race cars you may think wow these numbers make no sense I&#39;m sure they don&#39;t but we&#39;re just working with an example here so this is called the state transition Matrix is the Matrix that tells me how is the state Tomorrow based on the state today now let&#39;s look at another Matrix which is the control Matrix so the control Matrix is the one that tells me how much does the input influence the state of today so let&#39;s see how adding fluids and repairing the car would help the vehicle State let&#39;s say that adding fluids adds one unit to the gas and one unit to the oil and repairing the car adds one unit to the quality of the tires and one unit to the quality of the motor so our Matrix is going to be like this on the rows we have the input or the maintenance and on the columns we have the four variables representing the state of the car and the Matrix is this one and if that&#39;s not clear you can also see the equations like this where it&#39;s 1 * gas plus 1 * oil plus 0 * Tire Plus 0 * motor and the second equation can be seen like this and now you simply have the transpose of the Matrix given by those coefficients and now let&#39;s look at the observation Matrix this is the one that turns the 
vehicle state into the performance, or the output, which in this case is the speed. This one is simpler: it's simply a vector where the row is given by the speed and the columns are given by the variables representing the vehicle state. Let's say it's this one over here, given by this equation: 0.4 times the gas plus 0.2 times the oil plus 0.3 times the tires plus 0.5 times the motor is equal to the speed. That is the observation matrix. Finally we have the direct action matrix, the one that we're not going to worry about very much because we'll remove it, but let's do it anyway. Let's say the equation is this one: 0.1 times the fluids plus 0.2 times the repairs adds to the speed. Therefore our matrix has as columns the inputs, the fluids and the repairs, and as its row the speed, and it's given by this small matrix over here. So those are examples of the four matrices. Now, if you want to go further with this example, feel free to follow along; if not, feel free to skip ahead, because this may be a few too many numbers. Let's say that the state of the car at day t-1 is (1, 1, 1, 1); that is the vector that describes the four variables of the state of the car. And let's say the input is (0.4, 0.3), so we put in 0.4 of fluids and 0.3 of repairs. Now let's see what the speed is going to be for these variables. First we have to find the state at day t. We take the state vector at day t-1 and multiply it by the matrix A, which gives us this vector over here. That's not exactly the state at day t, it is only part of it, because the other part comes from the input times the matrix B, which gives us this vector over here. Then the sum of these two vectors is the state at day t, so that one goes over here; that is h_t. Now let's calculate the speed. For that we look at the state at day t and we multiply it by the matrix C, which is a vector. This gives us this number over here. It looks like a number, but it's really a 1 by
1 matrix. And if we're considering the matrix D, then the input also adds to the speed in the following way: it's the input times the matrix D, which gives us this number over here, which is also a 1 by 1 matrix. When we add these two we get the speed; that's y_t, the output, and so the output goes here. So that's a small example of a state space model. As I said before, the state space model is sequential, because we have something for every day. Here I removed D, but what happens in general is that every day we have an input x_t, a state h_t that tells us the state of the model, and an output y_t. Now, this looks a lot like a recurrent neural network, and if you want to know more about recurrent neural networks, I invite you to check out this video on my channel, which uses a food example to explain RNNs. So as a little summary, this is the state space model we're going to use, and the equations coming out of it are h_t = A h_{t-1} + B x_t and y_t = C h_t. Next I'm going to tell you how to apply this to language. [Music] So now that we know what a state space model is, let's look at how it applies to language generation. What does a large language model do? If it's given a bunch of text, it finds the next word that would fit there, and it generates a long piece of text one word at a time. So let's focus on how it's going to generate the next word in this sentence. The sentence is: "Last week we saw a TV documentary about wild animals. We saw giraffes with long ___." Now let's try to come up with the next word. Just like before, we're going to have a bunch of variables. One variable is the context; imagine something big describing all the things we're talking about in detail. Another variable is the last word, which in this case is "long". And the output that we want is the next word. Let's say the next word is "neck", because
we're going to say "we saw giraffes with long neck". Now how does the model find the next word? It has a list of all the possible words, and to each one it assigns a probability. The words that are likely to come next have high probabilities. For example, "legs", because we could see giraffes with long legs, has a probability of 0.2, but "neck" is the one that wins because it has a probability of 0.7. So this is how the model picks the next word. This model behaves a lot like the car model. The context reminds us of the state of the car: the context is like the state of what we're saying, some big vector that describes what we are talking about. The last word is the input, so it's like the repairs we do to the car. And the next word is the output, which is like the performance of the car. So we're going to make a state space model with these three variables, and this is how it works. Here is the context, which is what we're talking about. The computer will store this in a long vector, but let's say that conceptually it stores it somewhere and it has an idea of what we're talking about: it knows we are watching TV, it knows that on the TV there's a program about wild animals, about nature, etc. So this is the state after word t-1. Now the function A, or the matrix A, is going to tell us what the state is going to be at word t without knowing word t. The state is going to be similar, but some things will be forgotten and some things will be remembered more. For example, the matrix A will say: well, you're still watching TV, so I'm still going to keep the TV. Maybe you said lions a while ago, but you haven't mentioned lions in a while, so I'm going to forget them a bit. You talked about trees a while ago, but I haven't heard trees in a while, so I'm going to forget them. And you just mentioned giraffes, so I'm going to remember this one a little more. So again, the matrix A just goes from the state
at word t-1 to the state at word t without knowing word t. So we still don't have the full state at t, because we need to know what the last word is. The word is "long", and it goes here, and B is the matrix that tells us how much the last word, "long", affects the state. So "long" is like the repairs we did on the car. Conceptually, here the model says: well, you're talking about something long, it could be the neck of the giraffe or the legs of the giraffe, so I'm going to start looking at long things. Then finally we need the output, which is the next word. The output comes from two places, C and D; D is not used here, but let's keep it for fun. Basically the functions C and D help the model figure out that we're talking about a long thing on a giraffe, which is probably the neck. So the output is the word "neck", and it goes over here. That is the next word; in the car it was the performance of the car, the speed. As I said, the state is a long vector, and the words, both input and output, are also vectors, and the maps between vectors are given by matrices, so A, B, C, and D are matrices. I'm sure you see the parallel with the car. In the car we had the state at day t and day t-1, and the state was the health of the car, the input was the repairs, and the output was the speed. In this model the state is the context, the input is the last word we said, and the output is the next word we say. So we're going to have this model over here. This model gives us the exact same equations as we had with the car. h_t, the state at word t, is A h_{t-1}, the state at word t-1 transformed by A, plus B x_t, where x_t is the last word. And the output y_t, the next word, is given by C h_t, where h_t is the state at t, plus D x_t, where x_t is the input, or the last word we said. This is where these variables live in the diagram, so the diagram looks like this. And as I said, we don't really use D,
so let's forget about it. As I mentioned, h, x, and y are long vectors, so what do these vectors look like? Well, x_t is a vector whose length is the number of words that we have. In reality they're tokens, but let's think of them as words. If we want to encode the word "long", then we simply have a really long vector which is very sparse: it has mostly zeros, except a one at the position of the word "long". That is the vector x_t. y_t is a little different, but it has the same length, because the length is the number of words that we have. Except this vector is not so sparse; it actually holds a bunch of probabilities, and by probabilities I mean a bunch of non-negative numbers that add up to one. A y_t that strongly suggests we use the word "neck" is this one, because the probability of "neck" is 0.7. Sometimes it's going to tell us to use the word "legs", which has a probability of 0.2, and you could also use any of the other words, but they have really, really small probabilities. This is what lets the model say different things each time; you don't want these models to answer the exact same thing every time, so the model outputs a bunch of probabilities and then samples from those probabilities for the next word. Now the vector h_t, the one for the context, is more complicated, because it doesn't have an entry for every word. Instead it's more of an abstract vector. I like to imagine it in my head as a vector that somewhere stores the topic that we're talking about, let's say zoology; the tense that we're using, let's say present; the tone that we're using, let's say a serious tone; and many other things. It could be many things that the computer knows and we don't. As a matter of fact, this vector is very hard to read, but if you want to imagine a really long vector where the computer encodes exactly what we're talking about, then that's the vector h_t. This vector has a bunch of numbers, so it basically has high numbers
for the features that appear in the context and low numbers for the features that don't appear in the context. If you like visual matrix equations, this is how the first equation looks: h_t = A h_{t-1} + B x_t. And this is how the second equation looks: y_t = C h_t + D x_t. So this is how the equations of a state space model appear. Of course, I mentioned we removed D, but we can have it here for fun. Now, Mamba does something a little more special. Here is our state space model, and here is a more specific case where the input words are in green, the output words are in red, and the context is in the middle. What Mamba does is something similar to what we did with the attention mechanism. The attention mechanism decides to put more emphasis on some words that are more similar, as opposed to others that are not so similar. Mamba puts more attention on some words than on others. So instead of giving every word the same attention, it says: well, some words are more important, for example "giraffes", "animals", "long", "documentary", and some, like "about" and "we" and "so", are not so important. That's pretty much what Mamba does. I'm not going to get into much detail here, but if you have any questions, feel free to put them in the comments and maybe we'll do more content on Mamba. Now, I mentioned that SSMs are related to RNNs, recurrent neural networks, and also to CNNs, which are convolutional neural networks, but I haven't covered the CNN part, so let me get into a bit more detail here. Convolutional neural networks are very good for analyzing images. For example, let's look at this image, where some pixels have value one and others have value zero. What a convolutional neural network does is summarize the image over here in this other grid, and it does this many times; that's why they have a lot of layers. I'm going to show you only one layer. Our convolutional neural network has a filter. I'm going to pick a pretty simple
filter here, one with the number 1/9 repeated nine times. As it goes through the image it does the following: it takes a bit of the image of the heart and the filter, and it combines them by multiplying them entrywise and then adding the results. Because everything here is 1/9, it's really just taking the average of all the numbers in the window. So here it goes, and we can move the filter over the image as follows and fill in the summary on the left. That is called a convolution. Then we can do the same thing with another filter, and a convolutional neural network does this many times until it gets to a really good summary of the image. If you want to know more about convolutional neural networks, check out this video that I have on my channel. So you can use convolutions for images, but you can also use them for sequences, so they're used in language. How are they used on sequences? With a one-dimensional convolution. If you have a sequence like this one, we're going to summarize it over here, and we're going to pick a filter just like before. The previous filter was a bunch of 1/9s; this one is going to have these three numbers: a quarter, a half, and a quarter. What we're going to do is multiply the 0 by 1/4, and we get 0. Then we slide the filter to the right one step at a time. Next we do 0 times 1/2 plus 1 times 1/4, which is 1/4. Then we move it one more time to the right and we get 0 times 1/4 plus 1 times 1/2 plus 1 times 1/4, which is 3/4. We can continue doing this, and we get the numbers underneath. This is what is called a convolution, and it's a mathematical operation that we summarize like this: the convolution operator, which is symbolized by that star, gives the vector on the right. In other words, the convolution of u and v is w. Now where does this appear? Well,
it's not so obvious that it appears in a state space model, but let's do a little bit of math. Remember that x_0 becomes h_0 thanks to the matrix B, and here there's no A because this is the first state. So what are the equations? Well, h_0 = B x_0, because we have no A. And what's the output y_0? Well, y_0 = C h_0, so if we replace h_0 by B x_0 we get y_0 = C B x_0. That's the first equation; keep that one in mind. Now let's go one more step. Now we have the matrix A that goes from h_0 to h_1, and our equation says h_1 = A h_0 + B x_1. For the output we have y_1 = C h_1. If you replace h_1 and h_0 here, you get the following: y_1 = C A B x_0 + C B x_1. That required a little bit of work, but I encourage you to pause the video and actually work it out yourself. If we keep going, the next equation, which I also encourage you to work out, is the following: y_2 = C A^2 B x_0 + C A B x_1 + C B x_2. I think we start seeing a pattern, right? If we do this k times we get y_k = C A^k B x_0 + C A^{k-1} B x_1 + ..., all the way to the end, where the exponent of A comes down and the subscript of x goes up, until we get C A B x_{k-1} + C A^0 B x_k, and that last term is C B x_k. However, if these equations look hard to obtain, I have a shortcut. The shortcut is simply following paths. Let's see what y_0 would be. We look at all the paths in this directed graph, following the arrows, that take us to y_0. It's only this one over here, and we multiply C, B, and x_0, reading in the backwards direction, to get y_0 = C B x_0. Now let's look at y_1. There are two paths that lead to y_1: this one over here, which gives us the term C A B x_0, and this one over here, which gives us the term C B x_1. The sum of these two is y_1. Now how much is y_2? There are three paths that take us there. There's this path over here, which gives us the term C A A B x_0, which is C A^2 B x_0. Then there's this path
over here, which gives us the term C A B x_1, and finally there's this straight path over here, which gives us the term C B x_2. So the sum of these three is y_2. You can continue drawing this diagram, and you get that for y_k there are k + 1 paths leading to it, and the terms are C A^k B x_0 all the way to C B x_k. Those equations look like a lot of work, because we have to raise matrices to a lot of powers and multiply them and add them, but with a convolution this is done very easily. Check this out. One of my vectors is going to be the vector x_0 all the way to x_k, and my other vector is going to be this vector over here, made of the terms C B, C A B, C A^2 B, and so on up to C A^k B. I'm going to apply the convolution. First I have C B x_0. Then I move one to the right and I get C A B x_0 + C B x_1. Then I move one to the right and I get the equation for y_2, C A^2 B x_0 + C A B x_1 + C B x_2, and so on and so forth. So here I get the equation for y_{k-1}, and here I get the equation for y_k. Now why did I want to do it this way instead of just doing the operations? Because convolutions are really fast. There are a lot of tricks that can be used to calculate these convolutions in a very fast and effective way, so it is very beneficial for state space models that they can be written as a convolution. So that's all. Thank you very much. I have some acknowledgements to make. In my journey of learning SSMs I checked out the channel AI Coffee Break with Letitia. I highly encourage you to take a look at it; she's got wonderful videos on SSMs, LLMs, and a lot of cutting-edge AI stuff. I also read a great article by Maarten Grootendorst called "A Visual Guide to Mamba and State Space Models", which has a very detailed description of SSMs. By the way, Maarten and Jay Alammar just published an amazing book called Hands-On Large Language Models; you should definitely check it out. So that's it. Thank you so much for your attention. As usual, if you liked this video, please give it a like, share it
amongst your friends, subscribe to the channel if you want to see more of this content, and leave a comment. I love reading the comments, and as a matter of fact this video came about because somebody suggested SSMs in a comment. You can also check out my page, serrano.academy, with all the information, or tweet at me @SerranoAcademy, or check out my book, Grokking Machine Learning. The details are in the comments, with a 40% discount code if you are interested in buying it. So thank you very much, and see you in the next video.
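The race car example from the video can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the decimal points dropped by the captions are read as C = (0.4, 0.2, 0.3, 0.5), D = (0.1, 0.2) and input (0.4, 0.3); vectors are treated as columns, so B is written as a 4 by 2 matrix (the transpose of the table with the inputs on the rows); and the helper names matvec and ssm_step are made up for this sketch.

```python
def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def ssm_step(A, B, C, D, h_prev, x):
    """One step of h_t = A h_{t-1} + B x_t and y_t = C h_t + D x_t."""
    Ah = matvec(A, h_prev)                    # what is left of yesterday's state
    Bx = matvec(B, x)                         # what today's maintenance adds
    h = [a + b for a, b in zip(Ah, Bx)]       # new state h_t
    y = matvec(C, h)[0] + matvec(D, x)[0]     # both products are 1 by 1
    return h, y


# State order: gas, oil, tires, motor. Input order: fluids, repairs.
A = [
    [0.90, 0.00, 0.05, 0.00],  # gas: keeps 90%, plus 0.05 of tire quality
    [0.00, 0.95, 0.00, 0.00],  # oil: keeps 95%
    [0.00, 0.00, 0.80, 0.00],  # tires: keep 80%
    [0.00, 0.05, 0.00, 0.85],  # motor: keeps 85%, plus 0.05 of the oil
]
B = [
    [1.0, 0.0],  # fluids add one unit to gas
    [1.0, 0.0],  # fluids add one unit to oil
    [0.0, 1.0],  # repairs add one unit to tires
    [0.0, 1.0],  # repairs add one unit to motor
]
C = [[0.4, 0.2, 0.3, 0.5]]  # assumed reading of the observation matrix
D = [[0.1, 0.2]]            # assumed reading of the direct action matrix

h_prev = [1.0, 1.0, 1.0, 1.0]  # state at day t-1
x_t = [0.4, 0.3]               # 0.4 of fluids, 0.3 of repairs

h_t, y_t = ssm_step(A, B, C, D, h_prev, x_t)
print("state at day t:", h_t)  # roughly (1.35, 1.35, 1.1, 1.2)
print("speed at day t:", y_t)  # roughly 1.84
```

Setting D to [[0.0, 0.0]] gives the simpler model without the direct action term, in which case the speed is just C h_t.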

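The claim that the state space model can be written as a convolution can be checked numerically. The sketch below, in plain Python, runs the recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t on a short scalar input sequence, builds the vector (C B, C A B, C A^2 B, ...), and confirms that sliding it over the inputs gives the same outputs. The matrices, the inputs, and the function names are invented for illustration, and the convolution here is the naive sum, not one of the fast methods (such as FFT-based ones) that make the convolutional form attractive in practice.

```python
def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ssm_recurrent(A, B, C, xs):
    """Outputs of the recurrence, starting from a zero state so h_0 = B x_0."""
    h = [0.0] * len(B)
    ys = []
    for x in xs:
        Ah = matvec(A, h)
        h = [a + b * x for a, b in zip(Ah, B)]  # h_t = A h_{t-1} + B x_t
        ys.append(dot(C, h))                    # y_t = C h_t
    return ys


def ssm_kernel(A, B, C, length):
    """The vector (C B, C A B, C A^2 B, ...) with `length` entries."""
    kernel = []
    v = list(B)           # holds A^j B
    for _ in range(length):
        kernel.append(dot(C, v))
        v = matvec(A, v)
    return kernel


def causal_conv(kernel, xs):
    """y_k = sum over j <= k of kernel[k - j] * xs[j]."""
    ys = []
    for k in range(len(xs)):
        total = 0.0
        for j in range(k + 1):
            if k - j < len(kernel):
                total += kernel[k - j] * xs[j]
        ys.append(total)
    return ys


# Hypothetical two-dimensional state with scalar input and output.
A = [[0.9, 0.1], [0.0, 0.8]]
B = [1.0, 0.5]
C = [0.3, 0.7]
xs = [1.0, 2.0, 0.0, -1.0]

ys_rec = ssm_recurrent(A, B, C, xs)
ys_conv = causal_conv(ssm_kernel(A, B, C, len(xs)), xs)
print(all(abs(a - b) < 1e-9 for a, b in zip(ys_rec, ys_conv)))  # True

# The one-dimensional filter from the video on a sequence starting 0, 1, 1.
print(causal_conv([0.25, 0.5, 0.25], [0.0, 1.0, 1.0]))  # [0.0, 0.25, 0.75]
```

The kernel depends only on A, B and C, so it can be computed once and reused for every input sequence of that length.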