Transcript for:
Deep Sequence Models and S4

…a map that maps a sequence to a sequence. The figure in the top right illustrates the way you think about these in deep networks: you normally have a layer that takes in a tensor of a certain shape — batch size by sequence dimension — and maps it to something of the same shape, and then you can stack these layers into a network. But we're going to ignore all of those other dimensions and just think of it as a one-dimensional sequence.

So to illustrate what I mean by a deep sequence model: you take a sequence model, which we think of as a black-box sequence-to-sequence map such as a convolution or attention, and you stick it into a standard deep architecture with linear layers, normalization layers, residual connections, and so on. The difference between a lot of these neural networks is really just what that core inner layer is: if it's a 1D convolution, it's a CNN; if you use a self-attention layer, it's essentially a Transformer. So what I call deep SSMs — deep state space models — are deep neural networks built around a state space model, which is a simple sequence-to-sequence map, and you incorporate it into a deep neural network the same way you would when building a Transformer or a CNN. That part is not really our focus; we're going to look at what this inner black box is.

To add a little more context for people who might know a bit about state space models: they're a really old concept that has been around since the 1960s, usually as a statistical model. In the classical sense they are a probabilistic model: you pose a model on your data distribution, assume a state space model can describe it, and you get these simple linear models that you can do statistical inference on. For us, though, we're thinking of them in a deep learning sense, and this is different: instead of treating them as a probabilistic model of the data, we think of them as a deterministic feature transformation, just like you view a convolution as a feature extractor — it's just transforming one sequence into another sequence, and you treat it as a function. The difficult parts for deep learning are not about statistical inference; they're about how you compute these things efficiently, how you do backpropagation efficiently, and how you define the model in a way that lets it extract meaningful features. So these are called state space models, but they're a different flavor from what appeared previously. The model that started this line of work is the one called S4, which is what this talk is about.

Okay — I'll see how much of this I can get through, but in the first two parts of this talk I'll first cover these SSMs, or what I call deep SSMs — how they operate in deep learning and why some of their properties are nice — and then I'll talk a little more about how S4 relates to these more general SSMs and what its special properties are. For this first part, let's just talk about what a state space model is. It's actually a really simple model that's defined by two equations.
I'll unpack this in a second, but basically it's defined through a differential equation, and overall it creates a map from a function u — all of these variables are functions of time, so u is a function of time — and by passing it through the model you get an output y that's also a function of time. In symbols, the two equations are x'(t) = A x(t) + B u(t) and y(t) = C x(t) + D u(t). These models actually first appeared in controls, and they're usually illustrated with the control diagram up top, which isn't really relevant for our purposes but gives some context; as I said, the first instance was the Kalman filter from 1960, which is still very commonly used. For us in deep learning, they're going to be nice because they have a bunch of really useful properties for sequence modeling: you can view them as continuous models, as recurrent models, and as convolutional models, and we'll go through these one by one.

Okay, the first one. Let me illustrate how this model works as a sequence-to-sequence map. Again, we're thinking of it as a parameterized function from a 1D sequence to a 1D sequence; to be more explicit, the parameters of this function are those A, B, C, and D matrices, and it takes an input and gives you an output. But the way I defined it, it's actually a differential equation that operates on continuous functions, so these state space models really map functions to functions rather than sequences to sequences. More explicitly, you can look at each equation one at a time. For the first equation: here at the bottom, the black line is the input function u(t); you pass it through the differential equation defined by the parameters A and B, and it maps it up to a higher-dimensional function. Here N is the state dimension — the order of the intermediate state — and if you just pass the input through the differential equation, you get a deterministic output x, which is higher dimensional. You can think of B as a column vector, so it blows up the dimension, and A as an N-by-N square matrix; this simple linear differential equation gives you the state x, which is a vector — the blue lines here are x, and at any given time x(t) is a vector of four numbers. The second equation is much simpler: it's a map from the state x to the output y, projecting back down to one dimension — C is a row vector that takes a linear combination of your state features and projects them back out. And that's the whole definition of what these two equations do: a very simple map from a 1D function to a 1D function.

Any questions? [Audience question about the roles of the variables.] Yes, that's right: u is the input function, x is a latent function, and y is the output function, and this is just a map from u to y, defining a deterministic transformation. Actually, I kind of regret using this notation — it comes from controls, and originally we used it for that connection, but it sometimes confuses people, because as I mentioned our usage is actually quite different from the controls usage, and in deep learning you normally call the inputs x and the outputs y. At some point I might consider redoing the notation, but anyway.
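To make the shapes concrete, here is a minimal sketch (my own illustration, not code from the talk) of a small state space model with state dimension N = 4, using scipy.signal.lsim to integrate the linear ODE and carry out the map from u(t) to y(t); all the specific numbers and variable names are made up for illustration.

```python
import numpy as np
from scipy.signal import lsim

# State space model: x'(t) = A x(t) + B u(t),  y(t) = C x(t) + D u(t)
N = 4                                               # state dimension ("order" of the SSM)
rng = np.random.default_rng(0)
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))  # (N, N); kept roughly stable for the demo
B = rng.standard_normal((N, 1))                     # column vector: lifts the 1-D input to N dims
C = rng.standard_normal((1, N))                     # row vector: projects the state back to 1-D
D = np.zeros((1, 1))                                # skip term, often ignored

# A 1-D input "function" u(t), sampled on a time grid for the solver.
t = np.linspace(0.0, 10.0, 1000)
u = np.sin(2 * np.pi * 0.5 * t)

# lsim integrates the ODE and returns the output y(t) and the latent state x(t).
_, y, x = lsim((A, B, C, D), U=u, T=t)
print(y.shape, x.shape)                             # (1000,) and (1000, 4)
```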
Okay, so the first property I mentioned, the continuous representation, is really just saying that this map operates on functions and not just sequences. That has some concrete benefits, because functions are in some sense more general than sequences: you can always discretize a function to get back a sequence. Concretely, it lets you do things like operate on irregularly sampled data, and other settings where you want to define an underlying continuous process. Another example is that some types of data really are based on an underlying function or signal — lots of perception data, like images and audio; there's an underlying audio waveform, which is continuous — and so there's this fuzzy notion of inductive bias: these models just have a better inductive bias for data that is more continuous in nature.

The second view is about recurrence. The motivation is that this is a nice representation that operates on functions, but in the real world your data always arrives as sequences — you get time series that have been sampled. So how do you make this model operate on actual sequences, which was our original goal? You discretize the differential equation to go from continuous time to discrete time. There are a lot of standard discretization techniques, and you can mostly treat them as black-box formulas, but just to briefly illustrate the idea: you numerically approximate the differential equation and convert it into a discrete form. The simplest example here is the Euler (first-order) discretization: you can think of simulating your differential equation by moving it forward, say, one second at a time, and to get from the current step to the next step you use a first-order approximation which says the next step is the current one incremented by the step size times the derivative. If you expand that out — plug the differential equation in for the derivative — you get an exact formula for the next element of the sequence as a function of the current one. This is recorded, so I won't spend too much time here; the equations are very simple if you stare at them for a second.
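Here is a minimal sketch of that discretization step (my own illustration): the Euler rule the talk walks through gives A-bar = I + Δ·A and B-bar = Δ·B, and I've also included a bilinear rule for comparison, since, if I remember correctly, actual S4 implementations tend to use bilinear or zero-order-hold rather than Euler.

```python
import numpy as np

def discretize_euler(A, B, delta):
    """Euler (first-order) discretization of x'(t) = A x(t) + B u(t).

    x_{k+1} = x_k + delta * (A x_k + B u_k)
            = (I + delta*A) x_k + (delta*B) u_k
    so A_bar = I + delta*A and B_bar = delta*B.
    """
    N = A.shape[0]
    return np.eye(N) + delta * A, delta * B

def discretize_bilinear(A, B, delta):
    """Bilinear (Tustin) discretization, generally better behaved than Euler."""
    N = A.shape[0]
    inv = np.linalg.inv(np.eye(N) - (delta / 2) * A)
    return inv @ (np.eye(N) + (delta / 2) * A), inv @ (delta * B)
```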
The upshot is that you can turn this continuous-time model into a discrete-time recurrent update: instead of a differential equation, it becomes a recurrence. The second line here just says that your next hidden state is a linear combination of your previous state and the incoming input, and then you project it back out — x_k = A-bar x_{k-1} + B-bar u_k and y_k = C x_k. These two equations are basically the essence of an RNN: recurrent neural networks are defined by updating the previous hidden state with the current input and then projecting back out to the output. So you can view this as a very simple linear RNN, where the recurrent dynamics are given by the A-bar matrix, plus a couple of other parameters. One thing to emphasize is that the underlying parameters of the model are A, B, C, D, and this extra parameter Delta, which represents the step size; the first step in the computation graph is to apply those fixed discretization formulas to produce the discrete matrices A-bar and B-bar, and those are the matrices that define the recurrence — they're what you actually use to step the model. I had some more intuition for the discretization, which I'll skip.

So what's the point of recurrence? First of all, it gives us our first way to actually compute the model: when you receive one input at a time, you just update your state and generate the next state and the next output. More importantly, RNNs have a lot of nice properties that other types of sequence models don't have, because they're recurrent and they have this finite-size state. We all know about language modeling, for example, where models are trained autoregressively: they predict the next token, and when you want to generate something with a language model, you take the output, feed it back in as input, pass it through the network again, and so on. That autoregressive computation is recurrent, and RNNs are much faster at this kind of generation because their state size is constant, so every single time step takes a constant amount of time. Here's an illustration: say we've seen some amount of the input, we receive one more input, and we want to generate one more output. That output is actually a function of the entire history of the input — not just of the current time step — but despite that, and despite the input growing over time, every output you generate takes a constant amount of time. That's one way to see why recurrence is powerful, and it's not true at all for things like Transformers: a Transformer slows down the longer the sequence gets.

So that's the second representation: first, how you take the continuous model and make it discrete, so it's now an actual sequence-to-sequence map, which is what we wanted; and second, how you compute it in an online or autoregressive fashion. The main downside of recurrence is that if you actually know your entire input sequence in advance, this is slow in practice on modern hardware, because it's completely sequential — you have to unroll it one step at a time — and modern accelerators like GPUs and TPUs need parallelism to be efficient. That's probably one of the main reasons RNNs didn't scale as well computationally and lost popularity a couple of years ago.
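And here is a sketch of the recurrent mode just described (again my own illustration, assuming A_bar and B_bar came out of a discretization routine like the one above): a fixed-size state is carried forward, so each new output costs the same amount of work no matter how long the history already is.

```python
import numpy as np

def ssm_step(A_bar, B_bar, C, x, u_k):
    """One recurrent step: x_k = A_bar @ x_{k-1} + B_bar * u_k,  y_k = C @ x_k."""
    x = A_bar @ x + B_bar[:, 0] * u_k    # B_bar has shape (N, 1); u_k is a scalar
    y = C[0] @ x                         # C has shape (1, N); y_k is a scalar
    return x, y

def ssm_recurrent(A_bar, B_bar, C, u):
    """Run a whole 1-D sequence through the recurrence (the 'RNN mode')."""
    x = np.zeros(A_bar.shape[0])         # fixed-size hidden state
    ys = []
    for u_k in u:                        # strictly sequential: one step at a time
        x, y = ssm_step(A_bar, B_bar, C, x, u_k)
        ys.append(y)
    return np.array(ys)
```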
The last representation is about how to overcome this and be efficient in parallel when you have all the data in advance. The main idea: here is the recurrent form again — I've dropped the D term, because it's very easy to compute, so we can ignore it — and you can unroll this recurrence in closed form. Your first state x_0 is B-bar times the first input u_0; then you multiply the whole thing by A-bar and add the next input, by iterating the boxed equation; to compute x_2 you multiply by A-bar again and add the next input; and you can just keep iterating. The second equation is just a linear projection of the state, so the formulas are again very simple. The takeaway is that for every element of the output sequence — this is a map from the sequence u to the sequence y — you have a closed-form formula in terms of the input. Without unrolling the recurrence on the machine, you can unroll it analytically in equations, so you know exactly what the output is as a function of the input.

And if you look at this, you'll notice that it's actually a convolution. The way to think about it: I had my SSM parameters A, B, C; I discretize them to A-bar and B-bar (and C-bar is just C); and then, just by computing this formula, they define a vector called K-bar, which has the same length as the sequence. K-bar is what we call the SSM convolution kernel — its entries are (C B-bar, C A-bar B-bar, C A-bar² B-bar, ...) — because the output y of the SSM, which is a map from u to y, is just a single convolution of the input with this particular kernel. I think this is actually one of the easier ways to think about the model, or the cleanest way to relate it to classical deep learning architectures, because you can think of a deep SSM as basically the same thing as a CNN — a variant of a CNN where the convolution kernels are parameterized differently. Instead of defining a convolution kernel as a finite window where every element is a separate parameter, these are sometimes called implicit convolution kernels or implicit CNNs: it's still just a CNN, but the kernel is defined implicitly through this formula for K-bar as a function of the actual underlying parameters A, B, C.

One motivation for that is that this convolution kernel is actually infinitely long, which relates to some of the SSM properties we'll see: one thing the S4 model was originally known for was handling really long sequences, and all of the views shed some light on that. As an RNN, it has unbounded context, because there's a state that just keeps getting updated through time; as a CNN you can also see it, because the convolution kernel is infinitely long, so the model is in some sense attending to everything in the past. But if you want to parameterize an infinitely long convolution kernel, that's clearly impossible unless it's given by a formula based on some compressed parameters — which is another way to motivate this implicit convolution. Any questions about this, or about the RNN view?
Okay — here's another illustration of it. The SSM is a map from the input u to a state x to an output y, and the way it works as a convolution is that instead of computing those three things one step at a time, you can completely skip computing the state and write the entire output as a convolution of the input with this particular kernel, illustrated in green here. It's just a standard convolution: you take the (infinitely long) kernel in green, slide it along the sequence, and take dot products to generate every output. So this is another way to compute the same thing, and the benefit is that convolutions are a very well-studied object: if you have the entire input and the entire kernel, you can compute this whole map very fast using standard techniques like the fast Fourier transform. That overcomes the original issue we had with recurrence — the sequentiality — so the whole thing can be computed in parallel over the entire sequence length, quite fast on GPUs.

[Audience question about the infinite kernel.] Oh, great question. The kernel is implicitly infinitely long, as I mentioned, but if you have a finite sequence of length L, convolving it with an infinitely long kernel is the same as truncating the kernel to the length of the sequence, because the rest of the kernel never touches any part of the input. So in practice, you take the length of your input sequence — capital L here — define the convolution kernel truncated to that length, and compute just that kernel, using some special techniques.

For the basic model I'm showing here — a particular flavor called a time-invariant SSM, which is what has this dual view of recurrence and convolution — the point to emphasize is that the two views compute the exact same model, so the trade-offs between them are purely computational. For this original model, the trade-offs are basically: during training you generally want the convolution view, because during training you normally see an entire batch of full sequences and you want to do the computation in parallel, so you compute the kernel and then use the convolution to compute the whole map. For example, in language modeling, even though the objective is autoregressive, during training you actually see every element of the sequence at the same time — some people call this teacher forcing — so you always want the convolution view there. The recurrent view you generally only want in settings where you need to be autoregressive or online: we'll see an application later to autoregressive generation, and language modeling at inference time is another example where you'd want the RNN view. Settings like reinforcement learning can be similar — your actor takes an action, only then gets a new observation, and only then can take the next action — so anywhere you don't see the sequence ahead of time, you need the recurrence; otherwise the convolution is generally faster. One caveat is that there's active research on making the recurrent view faster, because it's actually more flexible in some ways, but that's ongoing work. Did that answer the question? Okay — so that was all about the convolutional view, which is about parallelizable computation.
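Here is a sketch of the convolutional mode (my own illustration): materialize the kernel K-bar = (C·B-bar, C·A-bar·B-bar, C·A-bar²·B-bar, ...), truncated to the sequence length L, and then compute the causal convolution y = K-bar * u with FFTs. Computing the kernel by explicitly powering A-bar, as done here, is exactly the naive, expensive step that the S4 algorithm discussed later is designed to avoid.

```python
import numpy as np

def ssm_kernel_naive(A_bar, B_bar, C, L):
    """K[l] = C @ A_bar^l @ B_bar for l = 0..L-1 (naive: repeatedly apply A_bar)."""
    K = np.empty(L)
    v = B_bar[:, 0]                      # holds A_bar^l @ B_bar, starting at l = 0
    for l in range(L):
        K[l] = C[0] @ v
        v = A_bar @ v
    return K

def causal_conv_fft(K, u):
    """y[k] = sum_j K[j] * u[k-j], computed in parallel with FFTs."""
    L = len(u)
    n = 2 * L                            # zero-pad so the circular convolution acts like a linear one
    y = np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(u, n), n)
    return y[:L]
```

Up to floating-point error, feeding the same A_bar, B_bar, C, and u through this and through the recurrent sketch above gives the same outputs — that's the dual-view point made in the talk.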
There's an online question: what is your interpretation of the latent space for these models? Good question — that's actually going to be the subject of the next part, about the original S4 model and how to define these so that the latent state has a very particular meaning. More generally, though, you can define SSMs without any of that extra structure — just these basic SSMs — and I don't know that there's a single interpretation to look for; there are lots of possibilities. As with RNNs in general, there usually isn't a clean interpretation of the state, but because this is a linear RNN and you can impose structure on it, you can come up with special cases that have nicer interpretations. For example, with a diagonal SSM, which we'll see a little later, you can view the latent state as defining dynamics that smooth out your history at varying rates. Another interesting one is that you can actually view a local CNN as an SSM where the A matrix is a shift matrix, and the state is just a buffer of your recent history. So basically, every different SSM will have a different interpretation, and the original S4 model, which we'll see next, has one very particular interpretation.

So let's talk about S4 now. You can view it as a particular case of a state space model, and the definition is actually very simple: we take the basic SSM we just saw and use special formulas for the A and B matrices, which are called HiPPO matrices — and that's pretty much the whole definition; it's just an SSM with special matrices. But to compute the model, you're going to need some very bespoke algorithms to make it efficient, so we'll go through both of these. First: what are these matrices and what do they do?

HiPPO is basically a particular definition of the first SSM equation — it specifies fixed formulas for A and B. Just to recap how to think about it: HiPPO is a map from the input u to a state x, and it was motivated by the recurrent view. The motivation is essentially: if you're applying this equation, receiving new inputs and updating your state, can you come up with a state that has a very meaningful interpretation — and in particular, can you design a state that actually captures the input's history, in a way that allows for long-range memory? The original motivation was to address long-range dependencies in RNNs, which famously struggle with them, so HiPPO asked, from an RNN perspective, how can we do this in a principled way? The idea was to design a state that captures the input history, which should let you handle longer dependencies. More specifically, here's the setup: we have an input u that we observe one time step at a time — u_1, u_2, u_3, which are just scalar numbers — and every time we see an input, we update our state x, which is a fixed-size vector of dimension N.
And the question we ask is: at any given time step t, we have a state x(t), which is a vector — can that state be used to reconstruct the entire history of the inputs we've seen so far? I think this is a very natural question, and we call it online function reconstruction, because you're receiving an input function and trying to reconstruct it online. The answer is, of course, yes, you can do this, and that's where the formulas came from. The equation shown here is the thing that answers the question. We work in continuous time, so the question becomes: given an input function u that you're observing online, continuously, what dynamics of the state will let it evolve in a way that continually maintains a memory of the input? The answer is exactly this equation — a first-order linear differential equation with particular formulas for the matrices. I'll illustrate it in just a second, but for terminology: this differential equation is what we call the HiPPO operator — it's an operator because you think of it as a function-to-function map that takes the input u(t) and maps it to the higher-dimensional state x(t), which is an online compression of the history — and the matrices A and B here are called HiPPO matrices, with A being the more important one.
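For reference, here is a sketch of one standard HiPPO construction — the LegS variant that S4 builds on — written from my memory of the HiPPO and S4 papers rather than taken from the talk, so treat the exact signs and indexing as something to double-check against those papers.

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS (A, B), commonly used to initialize S4 (from memory; verify vs. the papers).

    A[n, k] = -sqrt((2n+1)(2k+1))  if n > k
              -(n + 1)             if n == k
               0                   if n < k
    B[n]    =  sqrt(2n + 1)
    """
    n = np.arange(N)
    A = -np.sqrt(np.outer(2 * n + 1, 2 * n + 1))  # candidate off-diagonal entries
    A = np.tril(A, k=-1) + np.diag(-(n + 1.0))    # keep the n > k part, set the diagonal
    B = np.sqrt(2 * n + 1.0).reshape(N, 1)
    return A, B

A, B = hippo_legs(64)   # e.g. the N = 64 state used in the reconstruction illustration
```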
Sorry — I actually had some more slides that illustrate the process in more detail, but let me skip those and just show what this looks like. Here I have the same input function u I've been using — the black line — and we pass it through HiPPO, which gives this thing x in blue; you can think of it as being generated recurrently, online. Now, to illustrate the function-reconstruction aspect, look at the final time step: I've consumed an input sequence of length ten thousand, and in this particular illustration I've used a hidden state of size 64 — that capital N is 64. I'm only drawing four of the 64 state channels to keep the plot readable, but the state is a vector of length 64. At the end, having consumed the whole input and updated my state, I'm left with a vector of 64 numbers, and by a linear projection of those 64 numbers — there's a formula for it — I can project back out ten thousand numbers, which is the red line. Obviously, with a finite-size state, we can't remember the entire signal exactly, so the question is what kind of approximation you can get, and these particular formulas give you an exponential approximation to the input: you can see the red line is very accurate for the recent past and degrades as you go further back.

More formally, the framework involves a prior, or hyperparameter, that we call a measure — illustrated by the green line — which answers: if I give you a reconstruction and I want it to be a good approximation, how do I measure the quality of that approximation? You need to define a weight function that says how much you care about various parts of the input. Here the green line is an exponential measure, which says I care about approximating the recent part more than the distant part — and that's why the reconstruction looks the way it does. The more interesting part is that this was all motivated by doing it online: if I'm observing the input one step at a time, I'm constantly updating my state — the blue lines — and at every single point in time, that state can be used to give me a reconstruction. You can see how the reconstructions evolve through time: at the beginning they're very accurate, and as you move on you remember the recent history well while gradually losing a little bit about the past, though you still maintain some information about it. Any questions about this? And this also answers the question in the chat about what the state means: with S4, which is based on HiPPO, the state is a compression of the entire past of the input, and this was motivated by long-range dependencies — once you project the state back out, you can hopefully leverage this information from the past.

Okay, so that's the definition of S4: it's just plugging these formulas into the SSM. Now, the issue is that actually computing this SSM is quite non-trivial — I skipped over some details earlier, so let's see why it's hard. Look at the SSM again: it's a parameterized map from a 1D sequence to a 1D sequence. I'm using the function view here, but assume it's been discretized to sequences of length L. This map consumes an input with L numbers — the input sequence — and outputs L numbers, so ideally it should take on the order of L time, or not too much more; things like normal RNNs scale linearly in L. But remember that the SSM maps the input through this latent state x, which is higher dimensional: in the previous HiPPO illustration I had N = 64, and in general you'll want something on the order of 100. If you think about it, that means just materializing the latent state in this computation requires 100 × L numbers to be stored in memory, and computation-wise it's also at least 100 × L.
So more generally, you basically need on the order of N·L space and time just to materialize the state, which is a lot more than the bound we want, which is just about L. Now, I mentioned there's a separate, convolutional view that lets you bypass the state, but the hardness of the problem remains: in that case the convolution itself is fast, but computing the convolution kernel is very expensive. So there's no free lunch here — all those nice properties of SSMs come with a cost, which is the computational cost of actually materializing these representations. In practice, the very first, early versions of S4 tried to do this fairly naively; it was more of a proof of concept for the model, but computationally it was incredibly expensive and not really feasible for deep learning.

[Audience question about why this doesn't happen with normal RNNs.] Right — the reason this doesn't happen with normal RNNs is that they're defined a little differently. The main thing is that these SSMs have a hidden state that's much bigger than the input: normally in an RNN you have an input of size, say, a thousand, you mix things, and your hidden state also has size a thousand; but here, if your input has size a thousand, your state will be a hundred times a thousand, so it's massive. That's one of the main differences between how these are defined and how RNNs are defined. It's very reasonable to call these SSMs RNNs — a theme we've been over is that they are RNNs and they are CNNs, and various people will insist they're RNNs or insist they're CNNs, and that's true — but they also have these extra properties, so it's useful to distinguish this intersection as SSMs.

This is one reason I think these models were never really used in deep learning, even though they've been around in other scientific fields for a really long time: one reason is the different perspective, statistical versus deep learning, and another is that they're really impractical to compute in deep learning. Now, the whole insight of S4 — the Structured State Space sequence model — is that to overcome this, you need to impose structure on your matrices, structure on your SSMs, and then you can compute this whole map faster. I mentioned the time and space complexity was N × L, where N is the blow-up factor, the state dimension; you can reduce this down to linear, which is a really massive saving, and that's what made it practical to actually use. The algorithm is pretty complicated, and there are simplifications that came later, so hopefully no one really has to look at it — but the idea is that we're using fixed formulas for this SSM, those formulas are highly structured in certain ways, you have to figure out the right structure to view them with, and with that specific structure there's a way to compute this really efficiently using a fairly involved algorithm. That was actually the main point of the original S4 paper: how do you compute these things reasonably?
There's been a lot of work following it that simplifies things a lot, and those versions are the ones I think are quite practical and a lot easier to use — they're what's more commonly used in practice nowadays — so I'll briefly mention some of them. Essentially all of them are based on diagonal matrices instead. I didn't actually define the structure used in S4, but you can transform that matrix into something close to a diagonal matrix — not quite diagonal, for various reasons — and later on, people found ways to simplify it further into an actual diagonal matrix. The basic version of this is called S4D, which is just the diagonal version of S4, and the very complicated algorithm from before, which was for computing the convolution kernel, can be simplified down to just a couple of lines of code. That's what this box shows: I mentioned that to use the CNN mode you have to compute the convolution kernel, which can be non-trivial, but for diagonal matrices it's quite simple and you can do it in two lines of code.

Technically, one of the things that made this hard was: can you define these diagonal matrices while keeping the nice properties? In early versions of all this, a natural question was whether you need HiPPO at all — what happens if you just use, say, a random state space model — and it turned out to be quite hard to find random ones that worked, so it seemed like the HiPPO theory really was quite important; if you just use a random SSM, or a random diagonal SSM, it doesn't work that well. One of the insights needed to make the first diagonal version work was the discovery — which is actually very surprising — that there's a diagonal approximation to this matrix with essentially the same dynamics. I don't have time to illustrate this in detail, but you can view the convolution kernel as a set of basis functions — this is the one for HiPPO, or S4 — and by using a particular diagonal approximation, you get these fuzzy approximations of the same kernels. Again, this is a genuinely surprising fact; even now it still surprises me. And then computing the whole thing is very easy — this is an entire self-contained code sample for initializing the model and computing its entire forward pass, and it's all available online.

Here's another way to view it: as an RNN, you have this diagonal A matrix, and if you think of an RNN multiplying by that A matrix, when it's diagonal, every single channel or feature operates completely independently, because multiplying by a diagonal matrix is just an element-wise scaling. So if you drew an RNN cell for this — people like drawing RNN cells and such — every single channel would be operating independently, which is one reason it's so simple. As a convolutional model there are other interpretations that are also pretty simple. Any questions about this?
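To give a flavor of why the diagonal case collapses to "a couple of lines", here is a hedged sketch (my own, following the general shape of the published S4D kernel computation rather than its exact code): with a diagonal A-bar, the kernel entries K[l] = Σ_n C[n]·A-bar[n]^l·B-bar[n] form a Vandermonde-style product, which vectorizes into two or three numpy lines.

```python
import numpy as np

def s4d_kernel(A_bar_diag, B_bar, C, L):
    """Convolution kernel of a *diagonal* discrete SSM.

    A_bar_diag, B_bar, C: length-N (possibly complex) vectors;
    K[l] = sum_n C[n] * A_bar_diag[n]**l * B_bar[n].
    """
    vandermonde = A_bar_diag[None, :] ** np.arange(L)[:, None]   # (L, N)
    K = vandermonde @ (B_bar * C)                                # (L,)
    return K.real if np.iscomplexobj(K) else K

# Toy usage with made-up stable (|a| < 1) eigenvalues, not HiPPO-derived ones:
N, L = 4, 16
A_bar_diag = np.exp(-0.1 * (1 + np.arange(N)))
B_bar = np.ones(N)
C = np.random.default_rng(0).standard_normal(N)
K = s4d_kernel(A_bar_diag, B_bar, C, L)
```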
Okay — I think in the meantime we can look at a couple of experiments for fun. These are actually some of the older experiments — there's been a bunch of follow-up work and a lot of other interesting stuff since — but hopefully I've kept the good ones. One thing to emphasize, as I mentioned, is that S4 was originally known for modeling long sequences, which was one of the motivations via HiPPO, but I actually think one of its main strengths is not necessarily long sequences but sequences that are continuous, which I alluded to at the very beginning. You could call it signal data, and it appears a lot: audio waveforms, images and videos, any sort of time series — there's lots of this type of data in the world.

Here's an example. This was a classic time series benchmark where the inputs were bio-signals — EKG, EEG, and PPG, I think — and given one of these input time series, the goal was to predict the patient's average heart rate or respiratory rate, a couple of these biometrics. This particular dataset was challenging because the inputs were length four thousand, and for a long time people had a hard time scaling past even a couple hundred or a thousand. This sort of data also has very different characteristics from data such as text: if you visualize the signals, you can see very strong periodicity, which is a common feature of signals, when you zoom out like this; and if you zoom all the way in, the signals are very smooth, because they're sampled at a high rate from an underlying continuous signal. So this is exactly the type of data these SSMs are really good at, and there were a couple of papers that ran a bunch of benchmarks on it before this. (I think my computer might run out of power soon — it's okay.)

So this is a table of results for the regression on these time series — you're just trying to get as close as possible to the patient's heart rate and so on — and there are lots of baselines here, including more classical ML approaches like boosting methods and random forests, plus a bunch of deep learning sequence models such as RNNs and CNNs, and S4 is just much better than all the baselines. One thing I think is interesting to point out, coming back to the question of inductive bias: people throw Transformers at everything nowadays, as if they were a magic pill, and I think they are very, very good at modeling some types of data, but they don't have the right inductive bias for other types of data like this. If you just throw a Transformer at this sort of very smooth, very long data, it really struggles. There are other similar situations too — there are lots of places where you basically need a CNN encoder or something like that before you can feed the data to a Transformer, because it won't work on raw continuous signals like these.

Okay, now we can see some of the properties at work more specifically. Here's another signal example: speech classification. Again, audio waveforms are continuous and sampled at an extremely high rate — 16,000 Hz or even higher — and this is a dataset where you get one-second speech clips, so sequences of about 16,000 samples; you pass one through a deep network and just try to classify which word was spoken in the clip.
First of all, because the sequence is so long — among other problems — lots of models don't work well here. A vanilla Transformer won't even run, because the sequence is too long; the approximate Transformer variants don't work; RNNs don't work; and other neural-ODE-based methods don't work either. CNNs do work here, and because SSMs are CNNs, they also work very well. But what I think is much more interesting in this setting is the following question. Audio is sampled from an underlying continuous waveform, and if you sample it at different rates, it's still the same underlying signal — it's still the same word being spoken — but if you feed in this blue signal sampled at one rate versus this orange signal sampled at a different rate, they can look very, very different to the model. In fact, normal local CNNs completely break if you train them at one resolution and test them at another, because the convolution kernel is very attuned to the exact sampling: a convolution kernel of width three covers three adjacent samples, and those look very different at the two resolutions. So under this zero-shot frequency change, as we called it, the CNN baseline that worked well before doesn't work at all, unsurprisingly. But because SSMs are a continuous model, you get a concrete empirical benefit here: at test time, without ever training at the new resolution, you can just change the discretization step size. Remember that Delta parameter I mentioned — it represents the spacing between samples — so if you train at one rate and test at a different resolution, you just scale Delta appropriately (for example, test at half the sampling rate and you double Delta), and it works out of the box. That was the point of the continuous view.
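Concretely, the test-time change amounts to nothing more than rescaling the step size and re-running the (cheap) discretization with the same learned parameters — a tiny sketch of the bookkeeping, with made-up rates:

```python
# Zero-shot resolution change: only the step size changes; the learned A, B, C do not.
train_rate_hz = 16_000
test_rate_hz = 8_000
delta_train = 1.0 / train_rate_hz                          # spacing of training samples
delta_test = delta_train * (train_rate_hz / test_rate_hz)  # = 1.0 / test_rate_hz
# Re-discretize (A, B) with delta_test -- e.g. with the discretize_* sketches earlier --
# and recompute the kernel; the same model now matches the new sampling rate.
```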
Now, what about the recurrent view? I've gone over how it's very good in autoregressive settings. An example: you might have heard of WaveNet, which was the first model able to do raw speech generation — the first real speech-generation model of that kind — and these were autoregressive models, where you generate one sample at a time. There are a lot of challenges: you can see that audio signals look very different at different resolutions — very smooth when you zoom in, very periodic when you zoom out — so it's hard to model, and it's also hard to sample from quickly. You can drop in an SSM and it gets the unbounded context we talked about, so you can have a huge context but still sample very efficiently, because it's an RNN with a finite state size. WaveNet, even though it's from around 2016, is still actually the best autoregressive speech model, based on dilated convolutions and so on — autoregressive speech sampling isn't that popular nowadays because there are other approaches, but if you were to do autoregression, WaveNet is still the best. We tried training it on this particular speech dataset — one-second clips of people saying English digits, like zero, one, two, three — and WaveNet just didn't really work at all.

I think there might be a demo here, but I'm not sure if my volume is on — let me try this. So here's a WaveNet sample... it's basically just garbage. And here are the S4 samples, trained the same way — you can hear the digits — so it's night and day. This is one example that actually leverages all of the properties. The way you do it is: during training you take the sequence and do autoregressive prediction of the next sample, but across all time steps at the same time, so you want to be parallel and you use the convolutional mode to train; then at inference time you switch to the recurrent mode, which generates autoregressively much faster. The continuous view is fuzzier here, but it's behind why the model actually works: the RNN and CNN modes are about computation, but why does it model this well — why does it sound so much better than WaveNet? That can be attributed to the inductive bias: having the right bias for modeling these kinds of continuous signals, and being able to get lots of context into the model. But I usually use this example to illustrate the recurrent view. This is a model called SaShiMi, which basically uses S4 as a black box inside a different deep neural network architecture. One thing I always emphasize is that SSMs and S4 are not themselves a deep neural network — I mentioned this at the beginning, but to reiterate, I think of them as a primitive building block, a simple linear sequence-to-sequence transformation, and you can incorporate them into deep neural networks in lots of ways, just as you can incorporate a convolution into a network in many ways. SaShiMi is one particular architecture, and you can also use S4 in other settings — it doesn't have to be autoregressive; you can use it in diffusion models too, where replacing the architecture in a diffusion model with something based on S4 also improves that model.

Okay, I think that's pretty much everything, but I'll just mention the thing S4 was originally known for, which was modeling long-range dependencies. Attention famously struggles with this, which is part of why lots of attention variants have come out that are supposed to be better. There was a benchmark called Long Range Arena with a suite of six tasks that were pretty hard, with sequences from about a thousand to sixteen thousand, and all the Transformer variants performed about the same, which was not very good. When S4 came out, it improved the state of the art there by over 30 points — this is actually an outdated number by now — so it was originally known for being able to do this long-range modeling, as well as for solving some very difficult tasks on which no Transformer can even do better than random guessing. There was also a benchmark of the speed of these models against each other: because of the special fast algorithm I mentioned, S4 can also be much faster than Transformers, especially as the sequence gets long. Okay, I think that's all the time I had.
There's some other stuff here comparing SSMs against other models, but it's not that important. Okay, here's the summary slide. State space models are an old model, but their classical, statistical use is different from how we use them in deep learning, where we think of them as having lots of computational benefits; and they generalize, or are closely related to, lots of classical sequence models like RNNs and CNNs. S4 was the first deep, or structured, SSM, and it was based on the HiPPO theory of online function reconstruction, or online memorization. In practice, I recommend people use the follow-up diagonal variants, which are empirically about as effective but much simpler to implement. If you email me I can send you some resources and blog posts; I'm also currently finishing up my thesis — well, I procrastinated on it for a year — which will also serve as a general tutorial on SSMs. Yeah, that's everything — any more questions?

[Host] While you're coming up with your questions, there's already one question online, so I'll read it out. Cardiff asked: since S4 has close connections to vanilla RNNs, how do you overcome the gradient exploding or vanishing issues that arise over long sequences? [Answer] Great question. There are a number of perspectives on this — you can look at it mechanically, or at a higher level. The higher-level answer is that when you use something like HiPPO, which was designed for long-range memory, it's constructed in a way that by definition avoids vanishing gradients: HiPPO was designed to remember all of this history, and essentially that forces it not to have vanishing gradients — its dynamics don't decay through time. That may not be very satisfying, so you can also look at it very mechanically: one reason this is better than a normal RNN is that it's linear, and when it's linear you get much more precise control over the dynamics of the model. In particular, if you look at the A matrix — the A-bar matrix you're powering up — if you can control its eigenvalues, then you know exactly how fast your signal vanishes through time, so you can constrain it to make sure it captures dependencies for exactly as long as you want. When you plug in the HiPPO matrix, it does that by default, via the other interpretations, but even without HiPPO there are many ways to make sure it doesn't vanish.
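To make the eigenvalue point concrete, here is a tiny sketch (my own illustration): in a linear recurrence, an input from k steps ago contributes through A-bar^k, so for a diagonal entry (eigenvalue) lam its memory decays like |lam|^k — values close to 1 keep long-range information, values well below 1 forget quickly.

```python
import numpy as np

# For a scalar linear recurrence x_k = lam * x_{k-1} + u_k, an input from k steps ago
# contributes lam**k to the current state, so |lam| directly sets the memory length.
k = np.arange(0, 1001, 250)
for lam in (0.5, 0.99, 0.999):
    print(f"lam={lam}:", np.round(lam ** k, 4))
# lam=0.5 forgets almost immediately; lam=0.999 still retains signal after 1000 steps.
```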
[Host] Cool, thanks. Okay, another question: is the kernel generated online? [Answer] Yes — this is a good question too. Some people point out that if you have a fixed model and you want to compute it, you don't need to continually recompute the convolution kernel — you only need to compute it once. If your A, B, C parameters are fixed, you can compute the kernel once, cache it, and then during the forward pass all you need to do is the convolution itself, which is cheap. But that only works after training, when your model's parameters are frozen. During the training forward pass, the parameters are constantly changing as you do gradient descent, so you need to recompute the convolution kernel on every single mini-batch. That's why it's so important to be able to compute the kernel really fast, and why all those algorithms were developed.

[Question] What's your perspective on applying S4 to graph-structured data — for example, when there's connectivity structure between the channels? [Answer] That's something I haven't worked on myself, but there has been some work on it. There was somebody in the Stanford medical department working on EEG data, where you have, say, twenty channels, and those channels correspond to electrode locations on the head, so there's a spatial graph structure between them; they found an extension combining this with GNNs that improved the model. I don't know the details, but you can combine this with other techniques, like GNNs, to improve things. Personally, I think that in practice — and especially in deep learning, where there's honestly a lot of mystery about how and why things work — it usually works pretty well without baking in that structure once you start scaling your models; that's the lesson people keep learning: just make the models bigger and you don't really need to bake in those structures. One thing I maybe didn't mention is what happens in the deep architecture when you have multi-channel data. I defined S4 as a 1D-to-1D map, so if your input has multiple channels, you basically define an independent SSM per channel and they operate completely independently; after that, you need to mix the channels, using linear layers, MLPs, or perhaps GNNs or other structures. So that part is flexible and orthogonal to the main SSM part. [Host] So it's essentially one S4 layer along the time dimension, and then, say, channel-mixing layers that combine the signals? [Answer] Yes, exactly — it's those two things, and then you just make it deeper, and it usually works.
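Here is a shape-level sketch of that pattern (my own, with made-up layer names): each of the H channels gets its own independent length-wise convolution kernel, standing in for its own SSM, and a position-wise linear layer then mixes information across channels.

```python
import numpy as np

def per_channel_ssm(u, kernels):
    """u: (L, H) multi-channel input; kernels: (L, H), one independent kernel per channel."""
    L = u.shape[0]
    n = 2 * L
    U = np.fft.rfft(u, n, axis=0)
    K = np.fft.rfft(kernels, n, axis=0)
    return np.fft.irfft(U * K, n, axis=0)[:L]      # causal conv applied channel-by-channel

def channel_mixing(y, W, b):
    """Position-wise linear layer: mixes the H channels at every time step."""
    return y @ W + b                               # (L, H) @ (H, H) + (H,)

# Toy usage of one such block (random stand-in kernels; real ones come from SSM parameters):
L, H = 128, 8
rng = np.random.default_rng(0)
u = rng.standard_normal((L, H))
kernels = 0.1 * rng.standard_normal((L, H))
y = channel_mixing(per_channel_ssm(u, kernels), rng.standard_normal((H, H)) / np.sqrt(H), np.zeros(H))
```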
[Host reads the next question] In what contexts are these deep models more promising than other RNNs and Transformers? [Answer] I think it depends on the exact setting, but I do think online learning is one of the most interesting application areas for this. In general, recurrent models have a number of benefits over other models whenever you're in a stateful setting where you need to update your state as you go — RL, robotics, partially observed environments — recurrent models just have a natural advantage there, and in fact that's one of the places where they've lasted: they're still in use and not everyone has switched to Transformers. And within recurrent models, I think of SSMs and S4 as essentially the best version of an RNN you can have, both for computational reasons and for modeling reasons — the vanishing-gradients problem, the question of how to get long context — and empirically they pretty much always perform better than LSTMs and the like.

I think there might be more questions — yeah, I can look at the Zoom now... I think we've gone through all the online questions. Okay, great — so, one more question from the room. [Question about irregularly sampled inputs] Yes, that's something you can do, but you have to use the model entirely in its recurrent form. One way to look at it is that the convolutional view of SSMs is tied to the fact that they're time-invariant: if you're sampling uniformly, you get the convolutional view, but in any other setting it breaks — which is why I said at one point that the recurrent view is more flexible, or more powerful. So to handle irregular sampling, you use the recurrent view: say your inputs are non-uniformly spaced but you know the gap between each pair — then at every update of the recurrence you just plug a different value of Delta into the discretization. You have these A, B, C parameters, and at every step you take a different Delta, plug it into the discretization formula, which gives you an A-bar that's now different per time step, and you unroll the recurrence for that step using that A-bar. That naturally models the irregularly sampled case. [Follow-up] But this basically introduces different kinds of discretization error, right — compared to the continuous-time view of the model? [Answer] Yes, that's true. Any time you have a discrete sequence, there's going to be a gap to the purely continuous view, which would be fully accurate; there can be a bit of discretization error, and it can perhaps compound over time. I don't think that's been studied very closely — how it compounds — but there are some small-scale empirical experiments suggesting you can handle this irregularly sampled setting quite well. There have also been some follow-ups: I saw one last year that combined HiPPO with neural ODEs to handle irregular sampling very well. So all these ideas are out there, and there are lots of ways to extend these models. Okay — I'm happy to stay later and answer questions, but I guess we can officially finish. [Host] I think we should finish right now, so let's give the speaker another round of applause.