Transcript for:
CS25 Transformers United: Introductory Lecture

Hey everyone, welcome to the first, introductory lecture for CS25: Transformers United. CS25 is a class that the three of us created and taught at Stanford in the fall of 2021. The subject of the class is not, as the picture might suggest, robots that can transform into cars; it's about deep learning models, and specifically the particular kind of deep learning models that have revolutionized multiple fields, starting from natural language processing to things like computer vision and reinforcement learning, to name a few. We have an exciting set of videos lined up for you, with some truly fantastic speakers who came and gave talks about how they are applying transformers in their own research, and we hope you will enjoy and learn from these talks. This video is purely an introductory lecture to talk a little bit about transformers.

Before we get started, I'd like to introduce the instructors. My name is Advay; I am a software engineer at a company called Applied Intuition. Before this I was a master's student in CS at Stanford, and I am one of the co-instructors for CS25. Chaitanya, Div, if the two of you could introduce yourselves?

Hi everyone, I am a PhD student at Stanford. Before this I was pursuing a master's here, researching a lot in generative modeling, reinforcement learning, and robotics. Nice to meet you.

Yeah, that was Div, since he didn't say his name. Chaitanya, if you want to introduce yourself?

Yeah, hi everyone, my name is Chaitanya, and I'm currently working as an ML engineer at a startup called Moveworks. Before that I was a master's student at Stanford specializing in NLP and was a member of the prize-winning Stanford team for the Alexa Prize challenge.

Right, awesome. So, moving on to the rest of this talk. Essentially, what we hope you will learn from watching these videos, and what we hope the people who took our class in the fall of 2021 learned, is three things. One, we hope you will gain an understanding of how transformers work. Secondly, we hope that by the end of these talks you will understand how transformers are being applied beyond just natural language processing. And thirdly, we hope that some of these talks will spark new ideas within you, and hopefully lead to new directions of research, new kinds of innovation, and things of that sort. To begin, we're going to talk a little bit about transformers and introduce some of the context behind them, and for that I'd like to hand it off to Div.

So hi everyone, welcome to our transformers seminar. I will start with an overview of the attention timeline and how it came to be. The key idea behind transformers is the self-attention mechanism, which was developed in 2017; this all started with one paper called "Attention Is All You Need." Before 2017 we had what you could call the prehistoric era, with older models like RNNs, LSTMs, and simpler attention mechanisms. Eventually the growth of transformers exploded into other fields, and they have become prominent in essentially all of machine learning; I'll go on and show how they have been used.

So in the prehistoric era there were RNNs, and different models like sequence-to-sequence LSTMs and GRUs. They were good at encoding some sort of memory, but they did not work for encoding long sequences and were very bad at encoding context. Here is an example: take a sentence like "I grew up in France ... so I speak fluent ___."
You would want to fill in the blank with "French" based on the context, but an LSTM model might not capture that and could make a big mistake here. Similarly, we can show a correlation map: if we have a pronoun like "it," we want it to correlate with one of the nouns we have seen so far, such as "animal," but again, older models were really not good at this kind of context encoding.

Where we are currently is on the verge of takeoff. We have begun to realize the potential of transformers in different fields. We have started using them to solve long-sequence problems like protein folding, where the AlphaFold model from DeepMind gets around 95 percent accuracy on these challenges; in offline RL; for few-shot and zero-shot generalization; and for text and image generation and content generation. Here is an example from OpenAI where you can give a text prompt and have an AI generate a fictional image for you. There's also a talk on this you can watch on YouTube, which basically says that LSTMs are dead and long live transformers.

So what's the future? We can enable a lot more applications for transformers; they can be applied to any form of sequence modeling, so we could use them for video understanding, for finance, and a lot more; basically, imagine all sorts of generative modeling problems. Nevertheless, there are a lot of missing ingredients. Like the human brain, we need some sort of external memory unit, which for us is the hippocampus, and there are some early works here; one nice work you might want to check out is called Neural Turing Machines. Similarly, the current attention mechanisms are computationally expensive, scaling quadratically with sequence length, which we will discuss later, and we want to make them more linear. The third problem is that we want to align our current language models with how the human brain works and with human values, and that is also a big issue.

Okay, so now I will dive deeper into attention mechanisms and show how they came to be. Initially there were very simple mechanisms, where attention was inspired by the process of importance weighting: putting attention on different parts of an image, similar to how a human might focus more on the foreground, say a dog, than on the rest of the background. In the case of soft attention, you learn a soft attention weight for each pixel, which can be any value between zero and one. The problem is that this is a very expensive computation, because, as shown in the figure on the left, we calculate this attention over the whole image. What you can do instead is calculate a hard, zero-or-one attention map, where we directly put a one wherever the dog is and a zero wherever the background is. This is less computationally expensive, but the problem is that it's non-differentiable, which makes things harder to train.

Going forward, there are also different varieties of basic attention mechanisms that were proposed before self-attention. The first variety is global attention models: for each hidden-layer output, you learn an attention weight a_t, and this is element-wise multiplied with the current output to calculate the final output y_t.
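As a rough sketch of what that looks like in code, the following NumPy snippet computes dot-product similarities of a query against every hidden state, turns them into weights with a softmax, and returns the weighted sum over the whole sequence. The names, shapes, and dot-product similarity here are illustrative assumptions, not code from the lecture or any specific paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(hidden_states, query):
    """Toy global attention.

    hidden_states: (seq_len, d) outputs h_1..h_T
    query:         (d,) current state
    Returns a context vector: a weighted sum of all hidden states,
    where the weights a_t are computed over the whole sequence.
    """
    scores = hidden_states @ query          # similarity of the query to each h_t
    attn_weights = softmax(scores)          # a_t in [0, 1], summing to 1
    context = attn_weights @ hidden_states  # weighted sum over all positions
    return context, attn_weights

# toy usage: 6 positions, dimension 4
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))
q = rng.normal(size=4)
ctx, w = global_attention(H, q)
print(w.round(3), ctx.round(3))
```

A local variant, as described next, would compute the same quantities over a small window of positions rather than the full sequence.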
Similarly, there are local attention models, where instead of calculating attention over the whole sequence length, you only calculate it over a small window, and then you weight the output by the attention over that window to get the final output. Moving on, I will pass it to Chaitanya to discuss self-attention mechanisms and transformers.

Yeah, thank you, Div, for covering a brief overview of how the primitive versions of attention work. Just before we talk about self-attention, a bit of trivia: this term was first introduced by a paper from Lin et al., which provided a framework for a self-attentive mechanism for sentence embeddings. Now, moving on to the main crux of the transformers paper: the self-attention block. Self-attention is the main building block that makes the transformer model work so well and makes it so powerful.

To think of it more easily, we can break down self-attention as a search-and-retrieval problem. The problem is: given a query q, we need to find the set of keys k that are most similar to q and return the corresponding values v. These three vectors can be drawn from the same source; for example, we can have q, k, and v all equal to a single vector x, where x is the output of a previous layer. In transformers, these vectors are obtained by applying different linear transformations to x, so as to enable the model to capture more complex interactions between the tokens at different places in the sentence. Attention is then computed as a weighted summation: the similarities between the query and key vectors are used to weight the corresponding values. In the transformers paper, the scaled dot product was used as the similarity function between queries and keys.

Another important aspect of the transformers paper was the introduction of multi-head self-attention. What multi-head self-attention means is that at every layer, self-attention is performed multiple times, which enables the model to learn multiple representation subspaces. In a way, you can think of each head as having the power to look at different things and learn different semantics: for example, one head might learn to predict the part of speech of the tokens, another might learn the syntactic structure of the sentence, and so on, covering the different things needed to understand what the sentence means.

To better understand how self-attention works and what the different computations are, there is a short video. As you can see, there are three incoming tokens: input one, input two, input three. We apply linear transformations to get the key and value vectors for each input, and then, once a query q comes in, we calculate its similarity with respect to the key vectors, multiply those scores with the value vectors, and add them all up to get the output. The same computation is then performed on all the tokens, and we get the output of the self-attention layer; as you can see here, the final output of the self-attention layer is in dark green at the top of the screen. For the final token we again perform the same computation: the queries are multiplied by the keys to get the similarity scores, those scores weight the value vectors, and we sum them up to get the self-attention output.
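Here is a minimal NumPy sketch of that computation: single-head scaled dot-product self-attention, with random matrices standing in for the learned query, key, and value projections. The crude two-head loop at the end simply concatenates head outputs and omits the output projection a real transformer applies, so treat it as an illustration of the idea rather than a faithful implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.

    X:          (seq_len, d_model) token representations from the previous layer
    Wq, Wk, Wv: (d_model, d_k) projection matrices (random placeholders here;
                in a real model they are learned).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv      # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted sum of values for each token

# toy usage: 3 tokens, d_model = 8, d_k = 4, two heads concatenated
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 8))
heads = []
for _ in range(2):                         # crude multi-head: independent projections per head
    Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
    heads.append(self_attention(X, Wq, Wk, Wv))
multi_head = np.concatenate(heads, axis=-1)  # (3, 8); a real model adds an output projection
print(multi_head.shape)
```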
Apart from self-attention, there are some other necessary ingredients that make transformers so powerful. One important aspect is the presence of positional representations, the positional embedding layer. One reason RNNs worked well is that they process information in sequential order, so there is a notion of ordering, which is very important for understanding language: we read a piece of text from left to right in most languages, and right to left in some. That notion of ordering is lost in self-attention, because every word attends to every other word; that's why the paper introduced a separate embedding layer for positional representations.

The second important aspect is having non-linearities. If you think about all the computation happening in self-attention, it is all linear, because it is all matrix multiplication. But as we know, deep learning models work well when they can learn more complex mappings between input and output, which can be attained by a simple MLP.

The third important component of the transformer is masking. Masking is what allows the decoder to be trained in parallel: since every word can attend to every other word, in the decoder part of the transformer, which we'll talk about later, you don't want the decoder to look into the future, because that would amount to data leakage. Masking hides that future information, so the decoder only uses what the model has processed so far.
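Here is a small NumPy sketch of how such a causal mask is commonly implemented; the additive minus-infinity convention is an assumption of this sketch, one common way of doing it, not necessarily how the lecture's slides present it. Positions above the diagonal are blocked, so after the softmax each token puts zero weight on anything to its right.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    """Additive mask: 0 where attention is allowed, -inf where it is not.
    Position i may only attend to positions j <= i (no looking into the future)."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# toy attention scores for 4 tokens: unmasked, every token "sees" every other token
rng = np.random.default_rng(2)
scores = rng.normal(size=(4, 4))
weights = softmax(scores + causal_mask(4), axis=-1)
print(weights.round(2))
# the strict upper triangle is exactly 0: position i puts no weight on positions j > i
```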
So now, on to the encoder-decoder architecture of the transformer. Yeah, thanks, Chaitanya, for talking about self-attention. Self-attention is one of the key ingredients that allows transformers to work so well, but at a very high level, the model proposed in the Vaswani et al. paper of 2017 was like previous language models in the sense that it had an encoder-decoder architecture. What that means is, say you're working on a translation problem and want to translate English to French: you would read in the entire input English sentence and encode it, and that's the encoder part of the network; then you would generate, token by token, the corresponding French translation, and the decoder is the part of the network responsible for generating those tokens.

You can think of these encoder and decoder blocks as something like Lego: they have sub-components that make them up. In particular, the encoder block has these main sub-components. The first is a self-attention layer, which Chaitanya talked about earlier. After that you need a feed-forward layer, because the self-attention layer only performs linear operations, so you need something that can capture the non-linearities. You also have a layer norm, and lastly there are residual connections around these sub-layers.

The decoder is very similar to the encoder, but there is one difference: it has an extra layer, because the decoder doesn't just do multi-head attention over the output of its own previous layers. For context, each self-attention layer in the encoder blocks does multi-head attention over the previous layers of the encoder. The decoder does that too, in the sense that it also looks at the previous layers of the decoder, but it additionally looks at the output of the encoder, so it needs a multi-head attention layer over the encoder's output. And lastly, there is masking as well: because every token can look at every other token, you want to make sure that in the decoder you're not looking into the future, so if you're at position three, for instance, you shouldn't be able to look at positions four and five. Those are all the components that led to the creation of the model in the Vaswani et al. paper.

Let's talk a little bit about the advantages and drawbacks of this model. The two main advantages, which are huge and are why transformers have done such a good job of revolutionizing many fields within deep learning, are as follows. First, there is a constant path length between any two positions in a sequence, because every token in the sequence looks at every other token. This basically solves the long-sequence problem Div talked about earlier: if you're trying to predict a token that depends on a word far back in the sentence, you no longer lose that context, because the distance between them is only one in terms of path length. Second, because of the nature of the computation, transformer models lend themselves really well to parallelization, and thanks to the advances we've had with GPUs, if you take a transformer with n parameters and a model that isn't a transformer, say an LSTM also with n parameters, training the transformer is going to be much faster because of the parallelization it leverages.

The main disadvantage is that self-attention takes quadratic time, because every token looks at every other token, and O(n^2), as you might know, does not scale. There has actually been a lot of work on tackling this, and we've linked to some of it here: Big Bird, Linformer, and Reformer are all approaches that try to make this linear or quasi-linear. We also highly recommend going through Jay Alammar's blog post, "The Illustrated Transformer," which provides great visualizations and explains everything we just talked about in great detail.
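To tie the encoder sub-components together, here is a minimal NumPy sketch of one encoder block's forward pass: single-head self-attention, a residual connection and layer norm, a ReLU feed-forward network, and another residual connection and layer norm. The post-norm ordering follows the original paper, but everything else (the shapes, the random weights, the single head, the omitted dropout, output projection, and learned layer-norm parameters) is a simplification assumed for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance
    (the learned scale and bias are omitted for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(X, params):
    """One encoder block: self-attention -> add & norm -> feed-forward -> add & norm."""
    Wq, Wk, Wv, W1, b1, W2, b2 = params
    # single-head self-attention (a real block uses multiple heads plus an output projection)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V
    X = layer_norm(X + attn)                    # residual connection + layer norm
    # position-wise feed-forward network supplies the non-linearity
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2  # two-layer ReLU MLP
    return layer_norm(X + ffn)                  # residual connection + layer norm

# toy usage: 5 tokens, d_model = 8, feed-forward hidden size = 16
rng = np.random.default_rng(3)
d, h = 8, 16
params = (rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)),
          rng.normal(size=(d, h)), np.zeros(h), rng.normal(size=(h, d)), np.zeros(d))
out = encoder_block(rng.normal(size=(5, d)), params)
print(out.shape)  # (5, 8)
```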
And with that, I'd like to pass it on to Chaitanya for applications of transformers.

Yeah, so now moving on to some of the recent work, some of the work that very shortly followed the transformers paper. One of the models that came out was the GPT architecture, released by OpenAI; the latest model OpenAI has in the GPT series is GPT-3. It consists of only the decoder blocks of the transformer and is trained on a traditional language modeling task, which is predicting the next token given the last t tokens the model has seen. For any downstream task, you can then train a classification layer on the last hidden state, with any number of labels, and since the model is generative in nature, you can also use the pre-trained network for generative tasks such as summarization and natural language generation.

Another important aspect of GPT-3, and a reason it gained popularity, is its ability to perform what the authors called in-context learning. This is the ability of the model to learn a task in a few-shot setting and complete it without performing any gradient updates. For example, say the model is shown a bunch of addition examples; if you then pass in a new input and leave it at the equals sign, the model tries to predict the next token, which very often comes out to be the sum of the numbers shown. Other examples are spell-correction or translation tasks. This is the ability that made GPT-3 so talked about in the NLP world, and many applications have since been built using GPT-3, one of them being the VS Code Copilot, which tries to generate a piece of code given a docstring or a natural-language description.

Another major model based on the transformer architecture is BERT, which is an acronym for Bidirectional Encoder Representations from Transformers. It consists of only the encoder blocks of the transformer, unlike GPT-3, which has only the decoder blocks. Because of this change a problem arises: since BERT has only encoder blocks, it sees the entire piece of text, so it cannot be pre-trained on a standard left-to-right language modeling task because of data leakage from the future. What the authors came up with was a clever idea: a novel task called masked language modeling, which involves replacing certain words with a placeholder token and having the model predict those words given the entire context. Apart from this token-level task, the authors also added a second objective called next-sentence prediction, a sentence-level task in which, given two chunks of text, the model predicts whether the second sentence follows the first or not. After pre-training, the model can be fine-tuned for any downstream task with an additional classification layer, just as with GPT.

These are two models that have been very popular and have made their way into a lot of applications, but the landscape has changed quite a lot since we taught this class: there are models with different pre-training techniques, like ELECTRA and DeBERTa, and there are also models that do well in other modalities, which we will be talking about in other lectures in this series as well. So yeah, that's all for this lecture, and thank you for tuning in.

Yeah, I just want to end by saying thank you all for watching. We have a really exciting set of videos with truly amazing speakers, and we hope you are able to derive value from them. Thanks a lot. Thank you. Thank you, everyone.