Transcript for:
Understanding Sequence Modeling in Neural Networks

Okay, so maybe those at the top can take their seats and we can get started. My name is Ava and this is lecture 2 of 6.S191. Thank you, John.

Thank you, everyone. It should be a good time; we have a lot packed in. So today, in this portion of the class, we're going to talk about problems that we call sequence modeling problems. In the first lecture, with Alexander, we built up the essentials of what deep learning is, what neural networks are, what a feed-forward model is, and how we train a neural network from scratch using gradient descent. And so now we're going to turn our attention to a class of problems that involve sequential data, or sequential processing of data.

And we're going to talk about how we can now build neural networks that are well-suited to tackle these types of problems. And we're going to do that step-by-step, starting from the intuition and building up our concepts and our knowledge from there, starting back right where we left off with perceptrons and feed-forward models. So to do that, I'd like to first motivate what we even mean when we talk about something like sequence modeling or sequential data. So let's start with a super simple example. Let's say we have this image of a ball, and it's moving somewhere in this 2D space, and your task is to predict where this ball is going to travel to next.

Now, if I give you no prior information about the history of the ball, its motion, how it's moving, and so on, your guess about its next position is probably going to be nothing but a random guess. However, if I now give you, in addition to the current position of the ball, information about where that ball was in the past, the problem becomes much easier. It's more constrained.

And we can come up with a pretty good prediction of where this ball is most likely to travel to next. I love this example because while it's a visual of a ball moving in a 2D space, this gets at the heart of what we mean when we're talking about sequential data or sequential modeling. And the truth is that beyond this, sequential data is really all around us.

My voice, as I'm speaking to you, the audio waveform, is sequential data that could be split up into chunks, into sequences of sound waves, and processed as such. Similarly, language, as we express and communicate it in written form, in text, is very naturally modeled as a sequence of either characters, individual letters in the alphabet, or words, chunks that we could break the text up into, thinking about these chunks one by one in sequence. Beyond that, it's everywhere, right? Everything from medical readings like EKGs, to financial markets and stock prices and how they change and evolve over time, to biological sequences like DNA or protein sequences that represent an encoding of life, and far, far beyond.

So I think it goes without saying that this is a very rich and very diverse class of data and problems that we can work with here. So when we think now about how we can build up from this to answer specific neural network and deep learning modeling questions, we can go back to the problem Alexander introduced in the first lecture, where we have a simple task, binary classification: am I going to pass this class?

We have some single input, and we're trying to generate a single output, a classification based on that. With sequence modeling, we can now handle data that are sequences, meaning that we can have words in sentences in a large body of text, and we may want to reason about those sequences of words. For example, by taking in a sentence and asking, OK, is there a positive emotion, a positive sentiment, associated with that sentence, or is it something different? We can also think about how we can generate sequences based on other forms of data.

Let's say we have an image and we want to caption it with language. This also can be thought of as a sequence modeling problem, where now, given a single input, we're trying to produce a sequential output. And finally, we can also consider tasks where we have sequence in, sequence out.

Let's say you want to translate speech or text between two different languages. This is very naturally thought of as a many-to-many, or translation-type, problem that's ubiquitous in a lot of natural language translation frameworks. And so here, again, I'm emphasizing the diversity and richness of the types of problems that we can consider when we think about sequences.

So let's get to the heart, from a modeling perspective and from a neural network perspective, of how we can start to build models that can handle these types of problems. And this is something that I personally had a really hard time wrapping my head around initially, when I got started with machine learning: how do we take something where we're mapping input to output and build off that to think about sequences and deal with the temporal nature of sequence modeling problems?

I think it really helps to, again, start from the fundamentals and build up intuition, which is a consistent theme throughout this course. So that's exactly what we're going to do. We're going to go step by step and hopefully walk away with an understanding of models for this type of problem. Okay, so this is the exact same diagram that Alexander just showed, right? The perceptron.

We defined it where we have a set of inputs, x1 through xn, and our perceptron, our single neuron, is operating on those to produce an output by taking its weights, computing a linear combination, applying a non-linear activation function, and then generating the output. We also saw how we can stack perceptrons on top of each other to create what we call a layer, where now we can take an input, compute on it with this layer of neurons, and then generate an output as a result. Here, though, we still don't have a real notion of sequence or of time. What I'm showing you is just a static single input and single output. We can think about collapsing the neurons in this layer down to a simpler diagram, right, where I've just taken those neurons and

simplified them into this green block. And in this input-output mapping, we can think of it as an input at a particular time step, just one time step t. And our neural network is trying to learn a mapping between input and output at that time step. Okay, now, I've been saying that sequence data is data over time.

What if we just took this very same model and applied it over and over again to all the individual time steps in a data point? What would happen then? All I've done here is I've taken that same diagram and flipped it 90 degrees; it's now vertical, where we have an input vector of numbers, our neural network is computing on it, and we're generating an output. Let's say we have some sequential data, and we don't just have a single time step anymore.

We have multiple individual time steps. We start from X0, our first time step in our sequence, and what we could do is we could now take that same model and apply it stepwise, step-by-step, to the other slices, the other time steps in the sequence. What could be a potential issue here that could arise from treating our sequential data in this kind of isolated step-by-step view?

Yeah, so I heard some comments back that, inherently, right, there's this dependence in the sequence, but this diagram is completely missing that, right? There's no link between time step 0, time step 1, time step 2. Indeed, in this setting, we're just treating the time steps in isolation, but I think we can all hopefully appreciate that we want the output at a later time step to depend on the inputs and the observations we saw prior, right? So by treating these in isolation, we're completely missing out on this inherent structure of the data and the patterns that we're trying to learn.

So the key idea here is: what if we can now build our neural network to explicitly model that relation, that time-step-to-time-step relation? And one idea is: let's just take this model and link the computation between the time steps together. And we can do this mathematically by introducing a variable that we call H.

And H of T stands for this notion of a state of the neural network. What that means is that the state is actually learned and computed by the neurons in this layer, and is then passed on and propagated from time step to time step, iteratively and sequentially updated.

And so what you can see here now as we're starting to build out this modeling diagram is we're able to now produce a relationship where the output at a time step t now depends on both the input at that time step as well as the state from the prior time step that was just passed forward. And so this is a really powerful idea, right? Again, this is an abstraction that we can capture in the neural network, this notion of state, capturing something about the sequence, and we're iteratively updating it as we make observations in this sequence data.

And so this idea of passing the state forward through time is the basis of what we call a recurrent cell, or neurons with recurrence. And what that means is that the function and the computation of the neuron is a function of both the current input and this past memory of previous time steps, and that's reflected in this variable, the state. And on the right-hand side of the slide, what you're seeing is basically that same neural network model, unrolled or unwrapped across these individual time steps.

But importantly, it's just one model that still has this relation back to itself. So this is kind of the mind-warpy part where you think about how do we unroll and visualize and reason about this operating over these individual time steps or having this recurrence relation with respect to itself. So this is the core idea, this notion of recurrence of a neural network architecture that we call RNNs, recurrent neural networks. And RNNs are really one of the foundational frameworks for sequence modeling problems. So we're going to go through and build up a little more details and a little more of the math behind RNNs now that we have this intuition about the state update and about the recurrence relation.

Okay, so our next step, all we're going to do is just formalize this thinking a little bit more. The key idea that we talked about is that we have the state H of T, and it's updated at each time step as we're processing the sequence. That update is captured in what we call this recurrence relation.
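Written out, that recurrence relation takes the standard form

h_t = f_W(x_t, h_{t-1})

where h_t is the state at time step t, x_t is the input at that time step, and f_W is a function parameterized by a set of weights W that is shared across every time step. This is the generic form; the specific choice of f_W is what we'll fill in shortly.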

And this is a standard neural network operation, just like we saw in lecture one, right? All we're doing is maintaining this cell state variable, H of T, and learning a set of weights, W. The update to H of T is a function, parameterized by those weights W, of both the input at a particular time step and the information that was passed on from the prior time step in this state variable. And what is really important to keep in mind is that, for a particular RNN layer, we have the same set of weight parameters, which are just being updated as the model is being learned.

Same function, same set of weights. The difference is just that we're processing the data time step by time step. We can also think of this from another angle, in terms of how we can actually implement an RNN, right? We can begin by thinking about initializing the hidden state

and initializing an input sentence, broken up into individual words, that we want this RNN to process. To make updates to the hidden state of that RNN, all we're going to do is basically iterate through each of the individual words, the individual time steps in the sentence, and update the hidden state and generate an output prediction as a function of the current word and the hidden state. And then at the very end, we can take that updated hidden state and generate the next-word prediction for what word comes next at the end of the sentence. And so this is the idea of how the RNN includes both the state update and, finally, an output that we can generate per time step.
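As a rough, runnable sketch of that loop (the tiny vocabulary, the dimensions, and the use of a built-in cell here are illustrative stand-ins, not the code from the slide):

```python
import tensorflow as tf

# Toy setup: a tiny vocabulary, an embedding for the words, a recurrent cell,
# and a dense layer that maps the output to a next-word prediction.
vocab = ["I", "love", "recurrent", "neural", "networks"]
embed = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=8)
cell = tf.keras.layers.SimpleRNNCell(units=16)
to_vocab = tf.keras.layers.Dense(len(vocab), activation="softmax")

hidden_state = [tf.zeros([1, 16])]            # initialize the hidden state
sentence = ["I", "love", "recurrent", "neural"]

for word in sentence:
    x = embed(tf.constant([[vocab.index(word)]]))[:, 0, :]  # embed the current word
    # the cell consumes the current input and the previous hidden state,
    # and returns an output plus the updated hidden state
    output, hidden_state = cell(x, hidden_state)

next_word_probs = to_vocab(output)  # untrained here, so the prediction is meaningless
```

The point is just the shape of the computation: one shared cell, applied word by word, with the hidden state threaded through each step.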

And so to walk through this component, right, we have this input vector, X of T. We can use a mathematical description based on the nonlinear activation function and a set of neural network weights to update the hidden state H of T. And while this may seem complicated, right, it's really very similar to what we saw prior. All we're doing is learning matrices of weights: an individual matrix for updating the hidden state, and then one for transforming the input.

We're multiplying those by their inputs, adding them together, applying a non-linearity, and then using this to update the actual state variable H of T. Finally, we can output an actual prediction at that time step as a function of that updated internal state, H of T. Right, so the RNN has updated its state; we apply another weight matrix and then generate an output prediction according to that.
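In the standard formulation (bias terms omitted here, as in the slides), those two operations are

h_t = tanh(W_hh h_{t-1} + W_xh x_t),    y_hat_t = W_hy h_t

with one weight matrix, W_hh, for the hidden-to-hidden update, one, W_xh, for transforming the input, and one, W_hy, for producing the output prediction. The exact transpose conventions vary between texts, but this is the computation being described.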

Question? "Can you choose different nonlinear functions instead of the tanh, and if so, how do you have intuition on which ones to choose?" Yes, absolutely.

So the question is, how do we choose the activation function, besides tanh? You can indeed choose different activation functions. We'll get to how we build that intuition a little bit later in the lecture. And we'll also see that there are examples of slightly more complicated versions of RNNs that actually have

multiple different activation functions within one layer of the RNN. So this is another strategy that can be used. So this is the idea now of updating the internal state and generating this output prediction.

And as we kind of started to see, right, we can either depict this using this loop function or by basically unrolling the state of the RNN over the individual time steps, which can be a little more intuitive. The idea here is that you have an input at a particular time step, and you can visualize how that input and output prediction occur at these individual time steps in your sequence. Making the weight matrices explicit, we can see that this ultimately leads to both updates to the hidden state and predictions of the output, and,

furthermore, re-emphasizes the fact that it's the same weight matrices, right, for the input-to-hidden-state transformation and the hidden-state-to-output transformation, that are effectively being reused and updated across these time steps. Now, this gives us a sense of how we can actually go forward through the RNN to compute predictions. To actually learn the weights of this RNN, we have to compute a loss and use the technique of backpropagation to learn how to adjust our weights based on how we've computed the loss. And because we now have this way of computing things time step by time step, what we can simply do is take the individual losses from the individual time steps, sum them all together, and get a total value of the loss across the whole sequence. One question.

How does this progression differ from setting the bias? A bias is, you know, something that comes in separately from the x at that particular time. Is this different from serving as a bias?

Yes, yes. So what I'm talking about here is specifically how the learned weights are updated as a function of, you know, learning the model, and how the weight matrix itself is applied to, let's say, the input and transforms the input.

In this visualization and the equations we showed, we kind of abstracted away the bias term. But the important thing to keep in mind is that the matrix multiplication is the learned weight matrix multiplied against the input or the hidden state. Okay, so similarly, right, here is now a little bit more detail on the inner workings of how we can implement an RNN layer from scratch using code in TensorFlow.

So as we introduced, right, the RNN itself is a layer, a neural network layer. And what we start by doing is first initializing those three weight matrices that are key to the RNN computation, right? And that's what's done in this first block of code, where we're seeing that initialization. We also initialize the hidden state.

The next thing that we have to do to build up an RNN from scratch is to define how we actually make a prediction, a forward pass, a call to the model. And what that amounts to is taking that hidden state update equation and translating it into Python code that reflects the application of the weight matrices, the application of the non-linearity, and then computing the output as a transformation of that. And finally, at each time step, both that updated hidden state and the predicted output can be returned by the call function of the RNN. This gives you a sense of the inner workings and computation translated to code.
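As a minimal sketch of what that from-scratch layer might look like (the class name, dimensions, and initializers here are illustrative, not the exact code on the slide):

```python
import tensorflow as tf

class MyRNNCell(tf.keras.layers.Layer):
    def __init__(self, rnn_units, input_dim, output_dim):
        super().__init__()
        # the three weight matrices: input-to-hidden, hidden-to-hidden, hidden-to-output
        self.W_xh = self.add_weight(shape=(input_dim, rnn_units), initializer="glorot_uniform")
        self.W_hh = self.add_weight(shape=(rnn_units, rnn_units), initializer="glorot_uniform")
        self.W_hy = self.add_weight(shape=(rnn_units, output_dim), initializer="glorot_uniform")
        # initialize the hidden state to zeros
        self.h = tf.zeros([1, rnn_units])

    def call(self, x):
        # update the hidden state: non-linearity applied to a combination of
        # the transformed input and the transformed previous state
        self.h = tf.math.tanh(tf.matmul(x, self.W_xh) + tf.matmul(self.h, self.W_hh))
        # compute the output as a transformation of the updated hidden state
        y = tf.matmul(self.h, self.W_hy)
        # return both the per-time-step prediction and the updated hidden state
        return y, self.h
```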

But in the end, right, TensorFlow and machine learning frameworks abstract a lot of this away, such that you can just define the dimensionality of the RNN that you want to implement and use built-in functions and built-in layers to define it in code. So again, right, this flexibility that we get from thinking about sequences allows us to think about different types of problems and different settings in which sequence modeling becomes important. We can again look at a setting where we're processing these individual time steps across the sequence and generating just one output at the very end of the sequence, right? Maybe that's a classification of the emotion associated with a particular sentence.
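To make that concrete with the built-in layers, a many-to-one model like that sentiment example might look like this (a sketch; the vocabulary size, dimensions, and two output classes are illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),  # map word indices to vectors
    tf.keras.layers.SimpleRNN(units=64),                        # built-in recurrent layer; returns only the final state
    tf.keras.layers.Dense(2, activation="softmax")              # e.g. positive vs. negative sentiment
])
```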

We can also think about taking a single input and now generating... outputs at individual time steps. And finally, doing the translation of sequence input to sequence output.

And you'll get hands-on practice implementing and developing a neural network for this type of problem in today's lab, in the first lab of the course. So building up from this, right, we've talked about kind of how an RNN works and what's the underlying framework. But ultimately, when we think about sequence modeling problems, we can also think of, you know, what are the unique aspects that we need a neural network to actually effectively capture to be able to handle these data well.

We can all appreciate that sequences are not all the same length, right? A sentence may have five words; it may have a hundred words. We want the flexibility in our model to be able to handle both cases. We need to be able to maintain a sense of memory to be able to track these dependencies that occur in the sequence, right?

Things that appear very early on may have an importance later on, and so we want our model to be able to reflect that and pick up on that. The sequence inherently has order. We need to preserve that, and we need to learn a conserved set of...

parameters that are used across the sequence and updated. And RNNs give us the ability to do all of these things. They're better at some aspects of it than others, and we'll get into a little bit of why that is.

But the important thing to keep in mind, as we go through the rest of the lecture, is: what is it that we're actually trying to get our neural network to do in practice, in terms of the capabilities it has? So let's now get into more specifics about a very typical sequence modeling problem that you're going to encounter. And that's the following: given a stretch of words, we want to be able to predict the next word that comes following that stretch of words. So let's make this very concrete, right?

Suppose we have this sentence: "This morning I took my cat for a walk." Our task could be just as follows.

Given the first words in this sentence, we want to predict the word that follows: "walk." Before we get to how we can actually do this, before we think about building our RNN, the very first thing we need to do is have a way to actually represent this text, represent this language, to the neural network.

Remember again, right, neural networks are just numerical operators, right? Their underlying computation is just math implemented in code. They don't really have a notion of what a word is.

We need a way to represent that numerically so that the network can compute on it and understand it. They can't interpret words. What they can interpret and operate on is numerical inputs.

So there's this big question in this field of sequence modeling and natural language of how we actually encode language in a way that is understandable and makes sense for a neural network to operate on numerically. This gets at this idea of what we call an embedding.

And what that means is we want to be able to transform input in some different type of modality like language into a numerical vector of a particular size that we can then give as input to our neural network model and operate on. And so with language, there are different ways that we can now think about how we can build this embedding. One very simple way is, let's say we have a vast vocabulary, a set of words, all the different and unique words in English, for example. We can then take those different and unique words and just map them to a number, an index, such that each distinct word in this vocabulary has a distinct index.

Then we construct these vectors whose length is the number of words in our vocabulary, and just indicate with a binary 1 or 0 whether or not that vector represents that word or some other word. And this is the idea of what we call a one-hot embedding or a one-hot encoding. And it's a very simple but very powerful way to represent

language in a numerical form such that we can operate on it with a neural network. Another option is to actually do something a little fancier and try to learn a numerical vector that maps words or other components of our language to some sort of space, where the idea is that things that are related to each other in language should be numerically similar, close to each other in this space, and things that are very different should be numerically dissimilar and far away in this space. And this, too, is a very, very powerful concept: learning an embedding and then taking those learned vectors forward to a downstream neural network.
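As a small sketch of those two options in code (the toy vocabulary and the embedding dimension here are illustrative assumptions):

```python
import tensorflow as tf

# A toy word-to-index mapping over a tiny vocabulary.
vocab = {"this": 0, "morning": 1, "i": 2, "took": 3, "my": 4, "cat": 5, "for": 6, "a": 7, "walk": 8}

# Option 1: one-hot encoding -- a sparse binary vector with one slot per vocabulary word.
cat_one_hot = tf.one_hot(vocab["cat"], depth=len(vocab))

# Option 2: a learned embedding -- a dense vector per word whose values are trained
# so that related words end up close together in the embedding space.
embedding_layer = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=16)
cat_embedding = embedding_layer(tf.constant([vocab["cat"]]))
```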

So this solves a big problem about how we actually encode language. The next thing in terms of how we tackle this sequence modeling problem is we need a way to be able to handle these sequences of differing length. Right. Sentence of four words, sentence of six words. The network needs to be able to handle that.

The issue that comes with the ability to handle these variable sequence lengths is that now, as your sequences get longer and longer, your network needs to have the ability to capture information from early on in the sequence, process it, and incorporate it into the output maybe later on in the sequence. And this is the idea of a long-term dependency, this idea of memory in the network. And this is another very fundamental problem in sequence modeling that you'll encounter in practice. The other aspect that we're going to touch on briefly is, again, the intuition behind order. The whole point of sequence is that things that appear in a particular, defined order capture something meaningful.

And so even if we have the same set of words, if we flip around the order, the network's representation and modeling of that should be different and capture that dependence on order. All this is to say that this example of natural language, taking the question of next-word prediction, highlights why this is a challenging problem for a neural network to learn and try to model, and, fundamentally, how we can keep that in the back of our minds as we're actually trying to implement and test and build these algorithms and models in practice.

One quick question, yes. "When learning an embedding, how do you know what dimension of space you're supposed to use to map things close together?" This is a fantastic question, about

how large we set that embedding space, right? You can envision, right, as the number of distinct things in your vocabulary increases, you may first think, okay, maybe a larger space is actually useful. But it's not true that strictly increasing the dimensionality of that embedding space leads to a better embedding.

And the reason for that is it gets sparser the bigger you go. And effectively, then, what you're doing is you're just making a lookup table that's more or less closer to a one-hot encoding. So you're kind of defeating the purpose of learning that embedding in the first place.

The idea is to strike a balance: a small but large enough dimensionality for that embedding space, such that you have enough capacity to map all the diversity and richness in the data, but it's small enough that it's efficient, and that embedding is actually giving you an efficient bottleneck and representation. And that's kind of a design choice; there are works that show what an effective embedding space for language is, let's say. But that's the balance that we keep in mind.

I'm going to keep going for the sake of time, and then we'll have time for questions at the end. Okay, so that gives us, you know, RNNs, how they work, where we are at with these sequence modeling problems. Now we're going to dive in a little bit to how we actually train the RNN using that same algorithm of backpropagation that Alexander introduced. If you recall, in a standard feedforward network, right, the operation is as follows.

We take our inputs and we compute on them in the forward pass to generate an output. And when we backprop, when we try to update the weights based on the loss, what we do is go backwards and backpropagate the gradients through the network, back towards the input, to try to adjust these parameters and minimize the loss.

And the whole concept is that we have our loss objective, and you're just trying to shift the parameters of the model, the weights of the model, to minimize that objective. With RNNs now, there's a wrinkle, right? Because we now have this loss that's computed time step by time step as we are doing this sequential computation, and then added up at the very end to get a total loss.

What that means is that now, when we make our backward pass to learn via backpropagation, we have to backpropagate the gradients per time step and then, finally, across all the time steps, from the end all the way back to the beginning of the sequence. And this is the idea of backpropagation through time, because the errors are additionally backpropagated along this time axis, back to the beginning of the data sequence. Now, you can maybe see why this can get a little bit hairy, right? If we take a closer look at how this computation actually works, what backprop through time means is that, as we're going stepwise, time step by time step, we have to do this repeated computation of weight matrix times weight matrix times weight matrix, and so on. And the reason this can be very problematic is that if those values are very large, and you multiply or take derivatives with respect to those values in a repeated fashion, you can get gradients that grow excessively large, grow uncontrollably, and explode, such that the network learning is not really tractable.

And so one thing that's done in practice is to effectively try to cut these back, scale them down to try to learn effectively. You can also have the opposite problem, where if you start out and your values are very, very small, and you have these repeated matrix multiplications, your values can shrink very quickly to become diminishingly small. And this is also quite bad.
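That "cutting back" on the exploding side is usually done with gradient clipping. Here is a minimal, self-contained sketch of what it looks like in a training step; the toy model, the random data, and the clipping threshold of 1.0 are all illustrative assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(units=16),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()

x = tf.random.normal([8, 10, 4])   # a batch of 8 sequences, 10 time steps, 4 features each
y = tf.random.normal([8, 1])

with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x))

grads = tape.gradient(loss, model.trainable_variables)
# clip the global norm of the gradients so repeated backprop through time cannot blow them up
grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```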

And there are strategies we can employ in practice to try to mitigate this vanishing-gradient problem as well. The reason why this notion of diminishing or vanishing gradients is a very real problem for actually learning an effective model is that we're shooting ourselves in the foot in terms of our ability to model long-term dependencies. And why that is: as you grow your sequence length, the idea is that you're going to need a larger memory capacity to be able to better track these longer-term dependencies. But if your sequence is very long and you have long-term dependencies, but your gradients are vanishing, vanishing, vanishing, you're losing all ability, as you go out in time, to actually learn something useful and keep track of those dependencies within the model.

And what that means is that the network's capacity to model that dependency is reduced or destroyed. So we need real strategies to try to mitigate this in the RNN framework, because of this inherent sequential processing of the data. In practice, going back to one of the earlier questions about how we select the activation functions, one very common thing that's done in RNNs is to choose the activation functions wisely, to try to mitigate this shrinking-gradient problem a little bit by having activation functions whose derivative is either 0 or 1, namely the ReLU activation function.
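In the built-in layers, this is just an argument; for example (a sketch):

```python
import tensorflow as tf

# ReLU's derivative is either 0 or 1, which helps keep the repeated gradients
# in backprop through time from shrinking as quickly as with saturating activations.
rnn_relu = tf.keras.layers.SimpleRNN(units=64, activation="relu")
```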

Another strategy is to try to initialize the weights, those actual first values of the weight matrices, smartly, to get them to a good starting point, such that once we start making updates, maybe we're less likely to run into this vanishing gradient problem as we do those repeated matrix multiplications. The final idea, and the most robust one in practice, is to build a more robust neural network layer, a more robust recurrent cell itself. And this is the concept of what we call gating, which is effectively introducing additional computations within that recurrent cell to be able to selectively keep, or selectively remove or forget, some aspects of the information that's being input into the recurrent unit.

We're not going to go into detail about how this notion of gating works mathematically, for the sake of time and focus. But the important thing that I want to convey is that there's a very common architecture called the LSTM, or long short-term memory network, that employs this notion of gating to be more robust than just a standard RNN in being able to track these long-term dependencies.

The core idea to take away from that, and from this idea of gating, is, again, that we're thinking about how information is updated numerically within the recurrent unit. What LSTMs do is very similar to how the RNN functions on its own: they have a variable, a cell state, that's maintained. The difference is that this cell state is updated using some additional layers of computation, to selectively forget some information and selectively keep some information. And this is the intuition behind how these different components within an LSTM actually interact with each other, to give basically a more intelligent update to the cell state that will then better preserve the core information that's necessary.

The other thing I'll note about this is that this operation of forgetting or keeping, I'm speaking about it in a very high-level and abstract way. But what I want you to keep in mind as well is that this is all learned as a function of actual weight matrices that are defined as part of these neural network units, right? All of this is our way of abstracting and reasoning about the mathematical operations at the core of a network or a model like this. Okay.
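In practice, using the gated cell is a drop-in change in code; something like the following, with the number of units as an illustrative choice:

```python
import tensorflow as tf

# The forget/keep gating computations described above all live inside this layer;
# return_sequences=True gives an output at every time step rather than only the last.
lstm_layer = tf.keras.layers.LSTM(units=64, return_sequences=True)
```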

So to close out our discussion on RNNs, we're going to just touch very briefly on some of the applications where we've seen them employed and are commonly used. One being music generation. And this is what you're actually going to get hands-on practice with in the software labs, building a recurrent neural network from scratch and using it to generate new songs. And so this example that I'll play is actually a demo from a few years ago of a music piece generated by a recurrent neural network-based architecture that was trained on classical music and then asked to produce a portion of a piece that was famously unfinished by the composer Franz Schubert, who died before he could complete.

this famous unfinished symphony. And so this was the output of the neural network that was asked to now compose two new movements based on the prior true movements. Let's see if...

It goes on, but you can appreciate the quality of that. And I would also like to briefly highlight that on Thursday, we're going to have an awesome guest lecture that's going to take this idea of music generation to a whole new level. So stay tuned.

I'll just give a teaser and a preview for that. More to come. We also introduce, again, this problem of sequence classification, something like assigning a sentiment to an input sentence.

And again, we can think of this as a classification problem where we reason and operate over the sequence data, but we're ultimately trying to produce a probability associated with that sequence, whether a sentence is positive or negative, for example. So this gives you, right, two flavors, music generation, a sequence-to-sequence generation task, and also classification, that we can think about using recurrent models. But, you know, we've talked about these design criteria for what we actually want any neural network model to do when handling sequential data. It doesn't mean the answer has to be an RNN.

In fact, RNNs have some really fundamental limitations, because of the very fact that they're operating in this time-step-by-time-step manner. The first is that, to encode really, really long sequences, the memory capacity is effectively bottlenecking our ability to do that.

And what that means is that information in very long sequences can be lost by imposing a bottleneck in the size of that hidden state that the RNN is actually trying to learn. Furthermore, because we have to look at each slice in that sequence one by one, it can be really computationally slow and intensive to do this when things get longer and longer.

And as we talked about with respect to long-term dependencies and vanishing gradients, the memory capacity of a standard RNN is simply not that large, for being able to track sequence data effectively. So let's break these problems down a little further, right?

Thinking back to our high-level goal of sequence modeling, we want to take our input, broken down time step by time step, basically learn some neural network features based on that, and use those features to generate a series of outputs. RNNs say, okay, we're going to do this by linking the information time step to time step via the state update, via this idea of recurrence. But as we saw, right, there are these core limitations to that iterative computation, that iterative update.

Indeed, though, if we think about what we really want: we no longer want to be constrained to thinking about things time step by time step. So long as we have a continuous stream of information, we want our model to be able to handle it. We want the computation to be efficient. We want to have this long memory capacity to handle those dependencies and that rich information. So what if we eliminated this need to process the information sequentially, time step by time step, and did away with recurrence entirely?

How could we learn a neural network in this setting? A naive approach that we could take is to say, OK, well, you know, we have sequence data, but what if we mush it all together, smash it together and concatenate it into a single vector, feed it into the model, calculate some features, and then use those to generate an output?

Well, this may seem like a good first try, but while we've eliminated the recurrence, we've also completely eliminated the notion of sequence in the data. We've restricted our scalability, because we've said, okay, we are going to put everything together into a single input. We've eliminated order, and, as a result of that, we've lost this memory capacity. The core idea that came about five years ago, when thinking about how we can build a more effective architecture for sequence modeling problems, was: rather than thinking about things time step by time step in isolation, let's take a sequence for what it is and learn a neural network model that can tell us what parts of that sequence are actually the important parts.

What is conveying important information that the network should be capturing and learning? And this is the core idea of attention, which is a very, very powerful mechanism for modern neural networks that are used in sequential processing tasks. So, as a prelude to what is to come, and also to a couple of lectures down the line: raise your hand if you've heard of GPT or ChatGPT or BERT. I'm sure everyone in this room has. Hopefully everyone.

Who knows what that T stands for? Transformer, right. The transformer is a type of neural network architecture that is built on attention as its foundational mechanism, right?

So in the remainder of this lecture, you're going to get a sense of how attention works, what it's doing, and why it's such a powerful building block for these big architectures like transformers. And I think attention is a beautiful concept. It's really elegant and intuitive, and hopefully we can convey that in what follows.

Okay, so the core nugget, the core intuition, is this idea of: let's attend to and extract the most important parts of an input. And what we'll specifically be focusing on is what we call self-attention: attending to parts of the input itself. Let's look at this image of the hero Iron Man, right? How can we figure out what's important in this image?

A super naive way, super naive, would be to just scan pixel by pixel and look at each one, right? And then be able to say, okay, this is important, this is not, and so on. But our brains are immediately able to look at this and pick out: yes, Iron Man is important. We can focus in and attend to that. That's the intuition.

Identifying which parts of an input to attend to, and pulling out the associated feature that has this high attention score. This is really similar to how we think about searching: searching across a database, or searching across an input, to pull out those important parts. So let's say now you have a search problem. You came to this class with the question: how can I learn more about neural networks, deep learning, AI? One thing you may do, in addition to coming to this class, is to go to the internet, go to YouTube, and say, let's try to find something that's going to help me in this search.

Right. So now we're searching across a giant database. How can we find and attend to what's the relevant video in helping us with our search problem? Well, you start by supplying an ask, a query, deep learning. And now that query has to be compared to what we have in our database, titles of different videos that exist.

Let's call these keys. And now our operation is to take that query, and our brains are matching what my query is closest to, right? Is it this first video of beautiful, elegant sea turtles in coral reefs?

How similar is my query to that? Not similar. Is it similar to this second key, the 2020 lecture, Introduction to Deep Learning? Yes.

Is it similar to this last key? No. So we're computing this effective attention mask, this metric of how similar our query is to each of these keys.

Now we want to be able to actually pull the relevant information from that, extract some value from that match. And this is the return of the value that has the highest attention. This is a metaphor, right, an analogy with this problem of search, but it conveys the three key components of the attention mechanism and how it actually operates mathematically.

So let's break that down, and let's now go back to our sequence modeling problem of looking at a sentence in natural language and trying to model that. Our goal with a neural network that employs self-attention is to look at this input and identify and attend to the features that are most important. What we do first, right, is: we're not going to handle this sequence time step by time step, but we still need a way to process and preserve information about the position and the order. What is done in self-attention and in transformers is an operation that we call a position-aware encoding, or a positional encoder.

We're not going to go into the details of this mathematically, but the idea is that we can learn an embedding that preserves information about the relative positions of the components of the sequence. And there are neat and elegant math solutions that allow us to do this very effectively. But all you need to know for the purpose of this class is that we take our input and do this computation that gives us a position-aware embedding. Now, we take that positional embedding and compute from it the query, the key, and the value for our search operation. Remember, our task here is to pull out what in that input is relevant to our search.

How we do this is the message of this class overall: by learning neural network layers. In attention and in transformers, that same positional embedding is replicated three times and passed through three separate neural network linear layers that are used to compute the query, the key, and the value, these three sets of matrices. These are the three principal components of that search operation that I introduced with the YouTube analogy.

Query, key, and value. Now, we do that same exact task of computing the similarity to be able to compute this attention score. What is done is we take the query matrix, the key matrix, and define a way to compute how similar they are.

Remember, right, these are numerical matrices in some space. And so, intuitively, you can think of them as vectors in a high-dimensional space. When two vectors are in that same space, we can measure mathematically how close they are to each other by computing the dot product of those two vectors, which is closely related to the cosine similarity.

And that reflects how similar these query and key matrices are in this space. That gives us a way, once we apply some scaling, to actually compute a metric of this attention weighting.

And so now, thinking about what this operation is actually doing: remember, the whole point of the query and key computation is to find the features and the components of the input that are important; this is self-attention. And so, if we take, let's say, the words in a sentence and visualize them, we can compute this self-attention score, this attention weighting, that now allows us to interpret the relative relationships of those words in that sentence, with respect to how they relate to each other. And that's all by virtue of the fact that we're again learning this operation directly over the input and attending to parts of it itself. We can then basically squish that similarity to be between 0 and 1 using an operation known as a softmax. And this gives us concrete weights, these attention scores.
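Written compactly, with Q and K the query and key matrices and d_k their dimensionality, those attention weights take the standard scaled dot-product form

attention weights = softmax(Q K^T / sqrt(d_k))

which is exactly the dot-product similarity, scaled, then squashed to be between 0 and 1.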

The final step in the whole self-attention pipeline is now to take that attention weighting, take our value matrix, multiply them together, and actually extract features from this operation. And so it's a really elegant idea: taking the input itself and using these three interacting components of query, key, and value to not only identify what's important, but actually extract relevant features based on those attention scores. Let's put it all together, step by step.

The overall goal here: identify and attend to the most important features. We take our positional encoding, which captures some notion of order and position. We extract the query, key, and value matrices using learned linear layers. We compute a metric of similarity via the dot product, we scale it and apply a softmax to put it between 0 and 1, constituting our attention weights, and finally we take those weights, multiply them with the value matrix, and use this to extract features of the input itself that have these high attention scores.
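Here is a minimal sketch of that whole pipeline in code; the class name, the use of Dense layers for the three linear projections, and the dimensions are illustrative assumptions rather than the exact implementation used in practice:

```python
import tensorflow as tf

class SelfAttentionHead(tf.keras.layers.Layer):
    def __init__(self, head_dim):
        super().__init__()
        # three learned linear layers produce the query, key, and value matrices
        # from the same position-aware input embedding
        self.query_layer = tf.keras.layers.Dense(head_dim)
        self.key_layer = tf.keras.layers.Dense(head_dim)
        self.value_layer = tf.keras.layers.Dense(head_dim)

    def call(self, x):
        # x: position-aware embeddings of shape (batch, sequence_length, embed_dim)
        q = self.query_layer(x)
        k = self.key_layer(x)
        v = self.value_layer(x)
        # similarity between queries and keys via the dot product, scaled by sqrt(head_dim)
        scale = tf.math.sqrt(tf.cast(tf.shape(k)[-1], tf.float32))
        scores = tf.matmul(q, k, transpose_b=True) / scale
        # softmax squashes the scores into attention weights between 0 and 1
        weights = tf.nn.softmax(scores, axis=-1)
        # weight the values by the attention weights to extract the attended features
        return tf.matmul(weights, v)

# Example: a batch of 2 sequences, 5 tokens each, 16-dimensional embeddings.
head = SelfAttentionHead(head_dim=8)
features = head(tf.random.normal([2, 5, 16]))   # shape (2, 5, 8)
```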

All this put together forms what we call a single self-attention head. And the beauty of this is now you have a hierarchy, and you can put multiple attention heads together to design a larger neural network like a transformer. And the idea here is that this attention mechanism is really the foundational building block of the transformer architecture.

And the reason this architecture is so powerful is that we can parallelize and stack these attention heads together, to be able to attend to different features, different components of the input that are important. So we may have, let's say, one attention mask that's attending to Iron Man in the image, and that's the output of the first attention head. But we could have other attention heads that are picking up on other relevant features and other components of this complex space. So...

Hopefully, right, you've got an understanding of the inner workings of this mechanism, its intuition, and the elegance of this operation. And we're now seeing attention, as the basis of the transformer architecture, applied to many, many different domains and many different settings. Perhaps the most prominent and most notable is in natural language, with models like GPT, which is the basis of a tool like ChatGPT. And so we'll actually get hands-on experience building and fine-tuning large language models in the final lab of the course, and also go more into the details of these architectures later on as well.

It doesn't just stop there though, right? Because of this natural notion of what is language, what is sequence, the idea of attention and of a transformer extends far beyond just human text and written language, right? We can model sequences in biology like DNA or protein sequences using these same principles and these same structures of architectures to now reason about biology in a very complex way to do things like accurately predict the three-dimensional shape of a protein based solely on sequence information.

Finally, right, transformers and the notion of attention have been applied to things that are not intuitively sequence data or language at all, even in tasks like computer vision, with architectures known as vision transformers that are again employing the same notion of self-attention. So, to close up and summarize, right, this was a very whirlwind sprint through what sequence modeling is and how RNNs are a good first starting point for sequence modeling tasks, using this notion of time-step processing and recurrence. We can train them using backprop through time.

We can deploy RNNs in other types of sequence models for a variety of tasks, music generation and beyond. We saw how we can go beyond recurrence to actually learn self-attention mechanisms to model sequences without having to go time step by time step. And finally, how self-attention can form the basis for very powerful, very large architectures like large language models.

So that concludes the lecture portion for today. Thank you so much for your attention and for bearing with us through this sprint and this boot camp of a week.

And with that, I'll close and say that we now have open time to talk amongst each other, talk with the TAs and with the instructors about your questions, get started with the labs, get started implementing. Thank you so much.