Transcript for:
Understanding Long Short-Term Memory Networks

Hi everybody, and welcome to another video in the deep learning for audio series. This time we're going to introduce a super exciting type of recurrent network called a long short-term memory network, or LSTM. But before getting into that, let's remember what we did last time and why we need LSTMs. In the last video we looked into simple RNNs, and we saw that they're really good for time-series types of data, but we also found out that they have a few issues: mainly, they don't really have a long-term memory, so the network can't use context from far back in the past, because it can't learn patterns with long dependencies. This is a huge issue, as we saw last time, because a lot of audio and music data depends on long patterns, so we need to find a way to overcome it. Introducing long short-term memory networks, or LSTMs.

LSTMs are a special type of recurrent neural network, and the idea is that they have a memory cell that enables them to learn longer-term patterns. Now, don't get too excited here: LSTMs perform really well and they've helped us a lot, for example in music generation and a bunch of other tasks that have to do with audio, but the point is that they detect patterns up to around a hundred steps long, and they start to struggle when we have hundreds, let alone thousands, of steps.

So let's get started and understand how these LSTM networks work. To do that, we want to build a comparison between simple RNNs and LSTMs. Here we have a nice diagram with a simple RNN unrolled. This is not my diagram; I'm using these super cool graphics that come from an article that's very important in the community, a blog post called "Understanding LSTM Networks". I've linked the article below in the description section of this video, and I urge you to go and check it out: it's really, really good for understanding LSTMs, and it's complementary to what I'm presenting here.

Now, getting back to the good stuff. Here we have this diagram with an unrolled recurrent layer, but with a simple RNN. If you remember from last time, the memory cell itself here is a simple dense layer with a tanh (hyperbolic tangent) activation function. Both the input at time t, so x_t, and the state vector from the previous time step contribute: they are concatenated together and fed into the tanh, and then we get an output. The output is twofold: we have the output itself, and then we have the new current state, the state vector. But the output and the state vector are the same thing.

Now let's take a look at LSTMs, and here we go, it's a little bit different. The whole architecture is basically the same, so we can unroll an LSTM in the same way we do with simple RNN units. What really changes is this guy here, which is the cell itself. There are a bunch of different things that go on in this cell that the simple RNN cell doesn't do at all, and it's all of these things that enable an LSTM to learn longer-term patterns.

Cool, okay. So now let's take a look at an LSTM cell from a very high-level point of view. You may have noticed this already, but an LSTM cell contains a simple RNN cell, that is, a tanh dense layer, and we'll see this in a while.
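To make the recap concrete, here is the simple RNN step as a minimal NumPy sketch; the function name, the weight shapes, and the tiny example at the bottom are made up purely for illustration and are not taken from the video.

```python
import numpy as np

def simple_rnn_step(x_t, h_prev, W, b):
    # Concatenate the previous hidden state and the current input: [h_{t-1}, x_t]
    concat = np.concatenate([h_prev, x_t])
    # One dense layer with a tanh activation; the result is both the output
    # of the cell and its new hidden state
    h_t = np.tanh(W @ concat + b)
    return h_t

# Hypothetical sizes: 3 input features, 4 hidden units
rng = np.random.default_rng(42)
W = rng.standard_normal((4, 4 + 3))
b = np.zeros(4)
h_t = simple_rnn_step(rng.standard_normal(3), np.zeros(4), W, b)
```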
The idea is that you can think of the LSTM as a kind of augmentation of a simple RNN cell, where we have this RNN cell plus a bunch of other components. One of the most important components is a second state vector that we call the cell state, and this is the one that holds information about long-term memory, so it stores patterns that are longer term. Then we have a bunch of gates: the forget gate, the input gate and the output gate. All of these gates, which are basically sigmoid dense layers, act as filters: they filter the information and decide what to forget, what to input and what to output. Don't be scared by all of this complexity, because we're going to break these things down one by one.

And here we have it, in all of its beauty: the LSTM cell. There are a bunch of things here, so let's start from the simple ones, the things that more or less we already know. There's x_t, which is just the input, so this is a data point. Remember, we're analyzing the behavior of this cell at each time step, so x_t is just a point in a sequence, a sample. Then we have the output over here, which is h_t, and it's just the output of this cell, which is connected with the hidden state because it's basically the same thing: the output and the hidden state are the same thing. And then we have this secondary state vector called the cell state over here, which we call C_t, the cell state at time step t. Good.

Now let's move on. You might have noticed this already, but this is our simple RNN cell, which is already inside the LSTM, and as you can see we have this tanh, and this is a dense layer. All of these units here with the sigma and tanh labels, the yellow boxes, represent dense layers with an activation function, which in this case is a tanh and in the other cases is a sigmoid function. Going back to the simple RNN cell: we have a tanh dense layer, and the inputs are x_t, so basically the data, as well as h_{t-1}, which is the state vector, or hidden state, from the previous time step. Good.

Okay, so let's take a look at this hidden state. We can think of the hidden state as the short-term memory of the LSTM: it keeps information about the stuff that's happening right now, in the current step. Then we also have this secondary state vector, the cell state, and the cell state is the one responsible for the long-term memory, so for storing longer-term dependencies and patterns. As you can see, the cell state flows quite nicely through the LSTM, and it's only updated at two points: here, where we have this multiplication, which is an element-wise multiplication (we'll see this in a while), and here, where we have this element-wise sum between a few matrices that we'll also see. All of this to say that for the cell state we have very few computations, just two, and when you have few computations the result is that you can stabilize the gradients, so you are better placed to avoid vanishing gradients, which is the main issue when training RNNs.
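As a quick preview of how this difference looks in code, here is a Keras-style sketch (assuming TensorFlow/Keras, which this video doesn't specify; the layer sizes and the input shape of 13 features per frame are made-up examples). Swapping the simple RNN for an LSTM is a one-layer change, because the gates and the cell state all live inside the layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A simple RNN: one tanh dense "memory cell" applied at every time step
simple_rnn_model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 13)),   # (time steps, features per step)
    layers.SimpleRNN(64),
    layers.Dense(10, activation="softmax"),
])

# Same model with an LSTM layer; the forget/input/output gates and the
# cell state are handled internally by the layer
lstm_model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 13)),
    layers.LSTM(64),
    layers.Dense(10, activation="softmax"),
])
```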
Cool, okay. So let's take a look at these two updates a little more specifically. The first one decides what to forget: when we get to that point, the cell state gets updated and we decide what to forget from this long-term memory state vector. The second one decides what new information to add to the cell state. If you think about it, the cell state tries to keep track of the most important information: it drops the things that are less important and it adds the things that are very important. So when we train the network, we basically train the LSTM to be very effective at understanding which patterns to forget, because they're not that important all in all, and which patterns to remember, because they're super important for our task. Good.

Okay, so now we're going to look into the forget component of the LSTM, as I call it. As we said, this component is responsible for forgetting stuff from the cell state. I'm going to drop some math here, but don't be scared, because it's quite intuitive, and if you've followed the series so far it's not really that different from the stuff I covered when we started looking into computation in neural networks. If you don't remember that, I have a video for it; it should be over here, so just click it and check it out.

Okay, back to the important stuff. We have this f_t, which is the forget matrix, and it's the result of the forget gate, the forget gate being this guy here, this sigmoid dense layer. What happens here is very simple: we concatenate the input at the current time step with the state vector, the hidden vector, from the previous time step, and then we apply a sigmoid function to that. Here b_f is just a bias term, and W_f is the weight matrix for this dense layer, the forget layer. Good. Once we do this, we get a matrix that acts as the filter for what we should forget, and we'll get to that in a second, but basically when we've calculated that we are at this point here: we've just gone through the forget gate.

Now, we're using a sigmoid function, and if you remember, the sigmoid squashes the output between zero and one, which is great for a filter, because every value of this f_t matrix is going to be between zero and one. The indexes with values closer to zero are the indexes we're going to forget in the cell state, whereas for indexes with values closer to one we're going to keep those values. So zero means forget, one means remember. But how do we actually forget or remember stuff? That's where this element-wise multiplication kicks in. What we do is take the cell state at t-1, so the cell state from the previous time step, and perform an element-wise multiplication with the f_t matrix. Obviously, to perform an element-wise multiplication you need the two matrices to have the same dimensions, so we can just multiply element by element. The result is this C_t^f, which is admittedly a heavy notation, and it's the cell state from the previous time step where we've decided what to forget at this time step.
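Written out as code, the forget step just described looks roughly like this NumPy sketch; the names mirror the notation in the video (W_f and b_f for the forget layer's weights and bias), but the function itself is mine, added for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forget_step(x_t, h_prev, c_prev, W_f, b_f):
    # Forget gate: f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f)
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ concat + b_f)   # every entry lands between 0 and 1
    # Element-wise multiplication filters the previous cell state:
    # entries near 0 are forgotten, entries near 1 are kept
    c_t_f = f_t * c_prev
    return f_t, c_t_f
```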
Okay, this could feel a little bit abstract, so I'm going to give you an example. We have our nice equation over here, and let's say we have a cell state that's given by three values: [1, 2, 4]. This is the cell state at t-1. Then we've calculated the forget gate and we have this value for f_t, which is [1, 0, 1]. Now let's try to get to C_t^f. How should we do that? Well, it's super simple: we take C_{t-1} and we element-wise multiply it with f_t, so we have [1, 2, 4] times [1, 0, 1], and if you do the multiplication index by index, 1 by 1 is 1, 2 by 0 is 0, 4 by 1 is 4, so the result is [1, 0, 4]. Here we've decided what to retain and what to forget. Take a look at the first item and the third item, so index 0 and index 2 in this list for C_{t-1}: we're keeping that information in, and that's because f_t is equal to 1 for those two indexes, so the filter we're using is telling us, "yes, I want to keep that information because I believe it's important." What about the poor second index in C_t^f? Well, unfortunately we're dropping that, as you can see here, because it becomes 0, and the reason is very easy to understand: in the forget matrix we have a 0 for the corresponding index, which acts as a filter that drops that value. So now we have an understanding of how this forget mechanism works on the cell state.

Okay, so now let's move on to the next step. We said that forget, input and output are the main components of an LSTM, so now let's work on the input. The input part is made of two pieces: we have our simple RNN module, which is this tanh dense layer, and then we have the input gate, which is this sigmoid dense layer over here. For the time being, let's process the input gate and get a matrix out of it; this is going to act as a filter on the simple RNN component. Let's see how this works. For i_t, which is the result of the input gate, we again get a matrix that has the same dimensionality as what comes out of the tanh layer over here, and we get it by applying a sigmoid function to the concatenation of h_{t-1} and x_t, multiplied by this weight matrix W_i, and obviously, as always, we have a bias term here, but let's not bother about that. Cool. So with this i_t we have this matrix that comes out at this point.

Now let's see what happens at the other point over here. We're basically going to build a C_t prime, which is the candidate for the new cell state, and it's a function, obviously, of the hidden state at the previous time step as well as the input data at the current time step; the two are again concatenated together, but this time we're using a tanh, a hyperbolic tangent, as the non-linearity. And this is where we are at this point: this value, C_t prime, is over here, just after the tanh layer. Good.

Okay, so now we calculate the element-wise multiplication between C_t prime and i_t. What we're doing here is taking the new candidate cell state, the information that we want to pass into the new cell state, which is C_t prime, and we're modulating it, filtering it, with i_t.
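The input side can be sketched the same way; again this is an illustrative NumPy version of the equations just described, with W_i, b_i for the input gate and W_c, b_c for the tanh (candidate) layer, not code from the video.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def input_step(x_t, h_prev, W_i, b_i, W_c, b_c):
    concat = np.concatenate([h_prev, x_t])
    # Input gate: i_t = sigmoid(W_i . [h_{t-1}, x_t] + b_i)
    i_t = sigmoid(W_i @ concat + b_i)
    # Candidate cell state (the simple-RNN part): C_t' = tanh(W_c . [h_{t-1}, x_t] + b_c)
    c_candidate = np.tanh(W_c @ concat + b_c)
    # Keep only the parts of the candidate that the input gate considers relevant
    c_t_i = i_t * c_candidate
    return c_t_i
```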
This works the same way as with the forget gate: these two matrices have the same dimensionality, we multiply them, and i_t decides what's important to keep in the input, so what's relevant and what's just garbage we shouldn't care about at all. The result is this C_t^i, which is basically the input contribution to the cell state at time t, so the new stuff that we want to add to the cell state.

Okay, so the next step is to arrive at C_t, which is this guy here, the cell state at the current time step. To do that, at this point we perform an element-wise sum: we sum C_t^f and C_t^i. Let's just remember all of these different elements: C_t^f told us what to forget from the previous cell state, and now we want to take that matrix and add the new stuff to it, which is this C_t^i over here that came out of this purple square, and we're adding them up over here. So the first part tells us what to forget in the long-term memory, and the second element, C_t^i, tells us what's important to add as new information, and the result is C_t, the cell state. Whoa, this is some cool stuff, guys.

Okay, so now we only need to understand the last component of the LSTM, which is the output, and I would say it's really, really important. Once again we have another gate, which is this sigmoid layer here, and we calculate an output filter, a matrix we call o_t, which is calculated over here once we've applied the sigmoid function, once again, to the concatenation of the hidden state at the previous time step with the input at the current time step. Good.

Now what remains is to arrive at h_t, which is the hidden state for the current time step as well as the output that will eventually feed into the dense layer over here. Let's see how we get to h_t. It's quite straightforward, because once again this is something that happens at this point over here: h_t is given by the element-wise multiplication between the output filter o_t and the cell state passed through a tanh. Now, this tanh over here is not a dense layer; it's just the function itself. You may be wondering, why are we using that? Can't we just use C_t, the cell state, directly? Well, the great thing about tanh, once again, is that it squeezes the values between -1 and 1, so the values are constrained and can't explode at that point, which is great. Good, so h_t is given by these things. Once we've done all of this, we are at this point here, just after this multiplication operator over there, and at that point, as I said, we take h_t and we use it for two things: one, we use it as the hidden state for the current time step, and two, we also output it over here, and this h_t is going to be fed into the dense layer to hopefully arrive at a good prediction.
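Putting all of the pieces together, one full LSTM time step looks roughly like the following NumPy sketch. The sizes in the little test at the bottom are made up, and this is a plain restatement of the equations above rather than the implementation used later in the series.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_cand = np.tanh(W_c @ z + b_c)        # candidate cell state C_t'
    c_t = f_t * c_prev + i_t * c_cand      # the only two cell-state updates
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # hidden state = output of the cell
    return h_t, c_t

# Tiny smoke test with random weights (hypothetical sizes: 3 inputs, 4 units)
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
make_W = lambda: rng.standard_normal((n_hidden, n_hidden + n_in))
make_b = lambda: np.zeros(n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.standard_normal(n_in), h, c,
                 make_W(), make_b(), make_W(), make_b(),
                 make_W(), make_b(), make_W(), make_b())
```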
This was quite intense, but this is the LSTM. So now you know about long short-term memory networks and cell states, but you should also understand that the one we've seen is kind of the basic form of it. There are a bunch of different variants out there, and one of the most important ones, I would say, is the gated recurrent unit, or GRU, and here you have the diagram for it. If you want to learn more about GRUs, I have linked an article which is really good, and you should go check it out, because I'm not going to get into GRUs right now; I think LSTMs are already quite a lot, and quite interesting, on their own. But the main point is that a GRU is a variation of a basic LSTM: you still have a bunch of gates, and more or less the principles used in GRUs are similar to the ones in LSTMs. Good.

So that's great: we've basically gone through all of the LSTM theory, and now you should have a very good and deep understanding of how LSTMs work, of this idea of retaining long-term memory as well as short-term memory, and of using this long-term state vector to make better predictions, because we have better context from the past. Now, what are we going to do next? Well, it's time for us to move from theory to implementation, and the first step is going to be to pre-process some data and get it ready for use with RNNs. That's going to be the topic of the next video.

I hope you've enjoyed this video. If that's the case, just subscribe if you want more videos like this, and remember to hit the notification bell. If you have any questions, as always, please leave them in the comment section below; I'll try to answer as many as I can, and I guess I'll see you next time. Cheers!