[Music] Translation... it's done with a Transformer! StatQuest! Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about Transformer neural networks, and they're going to be clearly explained. Transformers are more fun when you build them in the cloud with Lightning. BAM!

Right now, people are going bonkers about something called ChatGPT. For example, our friend StatSquatch might type something into ChatGPT like "Write an awesome song in the style of StatQuest." Translation... it's done with a Transformer! Anyway, there's a lot to be said about how ChatGPT works, but fundamentally it is based on something called a Transformer, so in this StatQuest we're going to show you how a Transformer works, one step at a time. Specifically, we're going to focus on how a Transformer neural network can translate a simple English sentence, "Let's go," into Spanish. ¡Vamos!

Now, since a Transformer is a type of neural network, and neural networks usually only have numbers for input values, the first thing we need to do is find a way to turn the input and output words into numbers. There are a lot of ways to convert words into numbers, but for neural networks, one of the most commonly used methods is called word embedding. The main idea of word embedding is to use a relatively simple neural network that has one input for every word and symbol in the vocabulary that you want to use. In this case, we have a super simple vocabulary that allows us to input short phrases like "let's go" and "to go," and we have an input for the symbol EOS, which stands for end of sentence, or end of sequence. Because the vocabulary can be a mix of words, word fragments, and symbols, we call each input a token. The inputs are then connected to something called an activation function, and in this example we have two activation functions, and each connection multiplies the input value by something called a weight.

"Hey Josh, where do these numbers come from?" Great question, Squatch, and we'll answer it in just a bit. For now, let's just see how we convert the word "let's" into numbers. First, we put a 1 into the input for "let's" and put 0s into all of the other inputs. Now we multiply the inputs by their weights on the connections to the activation functions. For example, the input for "let's" is 1, so we multiply 1.87 by 1 to get 1.87 going to the activation function on the left, and we multiply 0.09 by 1 to get 0.09 going to the activation function on the right. In contrast, because the input value for the word "to" is 0, we multiply -1.45 by 0 to get 0 going to the activation function on the left, and we multiply 1.50 by 0 to get 0 going to the activation function on the right. In other words, when an input value is 0, it only sends 0s to the activation functions, and that means "to," "go," and the EOS symbol all just send 0s to the activation functions, and only the weight values for "let's" end up at the activation functions, because its input value is 1.
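To make that concrete, here's a minimal sketch of the embedding step in Python. The weights for "let's," "to," and "go" are the ones quoted in this example; the weights for the EOS token aren't given here, so the ones below are made up for illustration.

```python
# A minimal sketch of the word-embedding step: a tiny vocabulary of four tokens and
# two embedding values (one per activation function) per token.
import numpy as np

vocab = ["let's", "to", "go", "<EOS>"]

# One row of weights per token, one column per activation function.
embedding_weights = np.array([
    [ 1.87,  0.09],   # weights on the connections from the "let's" input
    [-1.45,  1.50],   # weights on the connections from the "to" input
    [-0.78,  0.27],   # weights on the connections from the "go" input
    [ 0.50, -0.70],   # hypothetical weights for the <EOS> input (made up)
])

def embed(token):
    """Put a 1 in the token's input and 0 everywhere else, multiply by the weights,
    and sum what arrives at each activation function (identity: output = input)."""
    one_hot = np.array([1.0 if t == token else 0.0 for t in vocab])
    return one_hot @ embedding_weights

print(embed("let's"))  # [1.87 0.09]
print(embed("go"))     # [-0.78  0.27]
```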
So, in this case, 1.87 goes to the activation function on the left and 0.09 goes to the activation function on the right. In this example, the activation functions themselves are just identity functions, meaning the output values are the same as the input values. In other words, if the input value, or x-axis coordinate, for the activation function on the left is 1.87, then the output value, the y-axis coordinate, will also be 1.87. Likewise, because the input to the activation function on the right is 0.09, the output is also 0.09. Thus, these output values, 1.87 and 0.09, are the numbers that represent the word "let's." BAM!

Likewise, if we want to convert the word "go" into numbers, we set the input value for "go" to 1 and all of the other inputs to 0, and we end up with -0.78 and 0.27 as the numbers that represent the word "go." And that is how we use word embedding to convert our input phrase, "let's go," into numbers. BAM! Note: there is a lot more to say about word embedding, so if you're interested, check out the Quest.

Also, before we move on, I want to point out two things. First, we reuse the same word embedding network for each input word or symbol. In other words, the weights in the network for "let's" are the exact same as the weights in the network for "go." This means that regardless of how long the input sentence is, we just copy and use the exact same word embedding network for each word or symbol, and this gives us flexibility to handle input sentences of different lengths. The second thing I want to mention is that all of these weights, and all of the other weights we're going to talk about in this Quest, are determined using something called backpropagation. To get a sense of what backpropagation does, let's imagine we had this data and we wanted to fit a line to it. Backpropagation would start with a line that has a random value for the y-axis intercept and a random value for the slope, and then, using an iterative process, backpropagation would change the y-axis intercept and slope one step at a time until it found the optimal values. Likewise, in the context of neural networks, each weight starts out as a random number, but when we train the Transformer with English phrases and known Spanish translations, backpropagation optimizes these values one step at a time and results in these final weights. Also, just to be clear, the process of optimizing the weights is also called training. BAM! Note: there is a lot more to be said about training and backpropagation, so if you're interested, check out the Quests.
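To make the line-fitting analogy a little more concrete, here's a minimal sketch of that kind of step-at-a-time optimization: plain gradient descent on a slope and an intercept. The data points and learning rate are made up for illustration; real backpropagation applies the same idea to every weight in the Transformer.

```python
# A minimal sketch of iterative optimization: fit a line y = slope * x + intercept
# by nudging both parameters a small step at a time. The data and learning rate
# are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])                        # roughly y = 2x

slope, intercept = np.random.randn(), np.random.randn()   # start with random values
learning_rate = 0.05

for step in range(2000):
    predicted = slope * x + intercept
    error = predicted - y
    # Gradients of the mean squared error with respect to each parameter.
    slope_gradient = 2 * np.mean(error * x)
    intercept_gradient = 2 * np.mean(error)
    slope -= learning_rate * slope_gradient
    intercept -= learning_rate * intercept_gradient

print(round(slope, 2), round(intercept, 2))               # close to 2.0 and 0.0
```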
Now, because the word embedding networks are taking up the whole screen, let's shrink them down and put them in the corner. Okay, now that we know how to convert words into numbers, let's talk about word order. For example, if Norm said, "Squatch eats pizza," then Squatch might say, "Yum!" In contrast, if Norm said, "Pizza eats Squatch," then Squatch might say, "Yikes!" These two phrases, "Squatch eats pizza" and "Pizza eats Squatch," use the exact same words but have very different meanings, so keeping track of word order is super important. So let's talk about positional encoding, which is a technique that Transformers use to keep track of word order.

We'll start by showing how to add positional encoding to the first phrase, "Squatch eats pizza." Note: there are a bunch of ways to do positional encoding, but we're just going to talk about one popular method. That said, the first thing we do is convert the words "Squatch eats pizza" into numbers using word embedding. In this example, we've got a new vocabulary and we're creating four word embedding values per word; however, in practice, people often create hundreds or even thousands of embedding values per word. Now we add a set of numbers that correspond to word order to the embedding values for each word. "Hey Josh, where do the numbers that correspond to word order come from?" In this case, the numbers that represent the word order come from a sequence of alternating sine and cosine squiggles. Each squiggle gives us specific position values for each word's embeddings. For example, the y-axis values on the green squiggle give us the position encoding values for the first embedding for each word. Specifically, for the first word, which has an x-axis coordinate all the way to the left of the green squiggle, the position value for the first embedding is the y-axis coordinate, 0. The position value for the second embedding comes from the orange squiggle, and the y-axis coordinate on the orange squiggle that corresponds to the first word is 1. Likewise, the blue squiggle, which is more spread out than the first two squiggles, gives us the position value for the third embedding, which for the first word is 0. Lastly, the red squiggle gives us the position value for the fourth embedding, which for the first word is 1. Thus, the position values for the first word come from the corresponding y-axis coordinates on the squiggles. Now, to get the position values for the second word, we simply use the y-axis coordinates on the squiggles that correspond to the x-axis coordinate for the second word. Lastly, to get the position values for the third word, we use the y-axis coordinates on the squiggles that correspond to the x-axis coordinate for the third word.

Note: because the sine and cosine squiggles are repetitive, it's possible that two words might get the same position, or y-axis, values. For example, the second and third words both got -0.9 for the first position value. However, because the squiggles get wider for larger embedding positions (the more embedding values we have, the wider the squiggles get), even with a repeat value here and there, we end up with a unique sequence of position values for each word. Thus, each input word ends up with a unique sequence of position values. Now all we have to do is add the position values to the embedding values, and we end up with the word embeddings plus positional encoding for the whole sentence, "Squatch eats pizza." Yum!

Note: if we reverse the order of the input words to be "Pizza eats Squatch," then the embeddings for the first and third words get swapped, but the positional values for the first, second, and third words stay the same, and when we add the positional values to the embeddings, we end up with new positional encodings for the first and third words, while the second word, since it didn't move, stays the same. Thus, positional encoding allows a Transformer to keep track of word order. BAM!

Now let's go back to our simple example, where we are just trying to translate the English sentence "let's go," and add position values to the word embeddings. The first embedding for the first word, "let's," gets 0 and the second embedding gets 1, and the first embedding for the second word, "go," gets -0.9 and the second embedding gets 0.4. Now we just do the math to get the positional encoding for both words. BAM! Because we're going to need all the space we can get, let's consolidate the math in the diagram and let the sine, cosine, and plus symbols represent the positional encoding.
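Here's a minimal sketch of one way to generate those sine and cosine squiggles in code, using the formula from the original Transformer manuscript. The exact numbers won't match the squiggles drawn in this example, which use their own scaling, but the idea is the same: alternating sine and cosine waves that get wider as the embedding index grows.

```python
# A minimal sketch of sinusoidal positional encoding added to word embeddings.
import numpy as np

def positional_encoding(num_positions, num_embedding_values):
    position = np.arange(num_positions)[:, np.newaxis]       # 0, 1, 2, ...
    index = np.arange(num_embedding_values)[np.newaxis, :]   # which embedding value
    # Wider squiggles (longer wavelengths) for larger embedding indices.
    angle = position / np.power(10000.0, (2 * (index // 2)) / num_embedding_values)
    return np.where(index % 2 == 0, np.sin(angle), np.cos(angle))

word_embeddings = np.array([[ 1.87, 0.09],    # "let's"
                            [-0.78, 0.27]])   # "go"

# Add the position values to the embedding values.
print(word_embeddings + positional_encoding(2, 2))
```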
Now that we know how to keep track of each word's position, let's talk about how a Transformer keeps track of the relationships among words. For example, if the input sentence was "The pizza came out of the oven and it tasted good," then the word "it" could refer to pizza or, potentially, to the word oven. "Josh, I've heard of good-tasting pizza, but never a good-tasting oven." I know, Squatch, and that's why it's important that the Transformer correctly associates the word "it" with pizza. The good news is that Transformers have something called self-attention, which is a mechanism to correctly associate the word "it" with the word pizza. In general terms, self-attention works by seeing how similar each word is to all of the words in the sentence, including itself. For example, self-attention calculates the similarity between the first word, "the," and all of the words in the sentence, including itself, and self-attention calculates these similarities for every word in the sentence. Once the similarities are calculated, they are used to determine how the Transformer encodes each word. For example, if you looked at a lot of sentences about pizza, and the word "it" was more commonly associated with pizza than with oven, then the similarity score for pizza will cause it to have a larger impact on how the word "it" is encoded by the Transformer. BAM!

Now that we know the main ideas of how self-attention works, let's look at the details. So, let's go back to our simple example, where we had just added positional encoding to the words "let's" and "go." The first thing we do is multiply the position-encoded values for the word "let's" by a pair of weights, and we add those products together to get -1.0. Then we do the same thing with a different pair of weights to get 3.7. We do this twice because we started out with two position-encoded values that represent the word "let's," and, after doing the math two times, we still have two values representing the word "let's." "Josh, I don't get it. If we want two values to represent let's, why don't we just use the two values we started with?" That's a great question, Squatch, and we'll answer it in a little bit. "Grrr." Anyway, for now, just know that we have these two new values to represent the word "let's," and, in Transformer terminology, we call them query values.

Now that we have query values for the word "let's," let's use them to calculate the similarity between itself and the word "go." We do this by creating two new values, just like we did for the query, to represent the word "let's," and two new values to represent the word "go." Both sets of new values are called key values, and we use them to calculate similarities with the query for "let's." One way to calculate similarities between the query and the keys is to calculate something called a dot product. For example, in order to calculate the dot product similarity between the query and key for "let's," we simply multiply each pair of numbers together and then add the products to get 11.7. Likewise, we can calculate the dot product similarity between the query for "let's" and the key for "go" by multiplying the pairs of numbers together and adding the products to get -2.6. The relatively large similarity value for "let's" relative to itself, 11.7, compared to the relatively small value for "let's" relative to the word "go," -2.6, tells us that "let's" is much more similar to itself than it is to the word "go." That said, if you remember the example where the word "it" could relate to pizza or oven, the word "it" should have a relatively large similarity value with respect to the word pizza, since "it" refers to the pizza and not the oven. Note: there's a lot to be said about calculating similarities in this context, and about the dot product, so if you're interested, check out the Quests.
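Here's a minimal sketch of how query and key values get created and compared with dot products. The position-encoded inputs combine the embeddings and position values from this example, but the query and key weights below are made up, so the similarity scores won't be the 11.7 and -2.6 quoted above; the calculation, however, is the same.

```python
# A minimal sketch of queries, keys, and dot-product similarities for "let's" and "go".
import numpy as np

position_encoded = np.array([[ 1.87, 1.09],    # "let's": embedding + position values
                             [-1.68, 0.67]])   # "go":    embedding + position values

query_weights = np.array([[ 0.5, -0.3],        # hypothetical query weights
                          [ 0.8,  0.6]])
key_weights   = np.array([[ 0.2,  0.9],        # hypothetical key weights
                          [-0.4,  0.1]])

queries = position_encoded @ query_weights     # one query per word
keys    = position_encoded @ key_weights       # one key per word

# Dot-product similarity between the query for "let's" and each key.
similarities = queries[0] @ keys.T
print(similarities)                            # [similarity to "let's", similarity to "go"]
```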
Anyway, since "let's" is much more similar to itself than it is to the word "go," we want "let's" to have more influence on its encoding than the word "go," and we do this by first running the similarity scores through something called a softmax function. The main idea of a softmax function is that it preserves the order of the input values, from low to high, and translates them into numbers between 0 and 1 that add up to 1. So we can think of the output of the softmax function as a way to determine what percentage of each input word we should use to encode the word "let's." In this case, because "let's" is so much more similar to itself than it is to the word "go," we'll use 100% of the word "let's" to encode "let's" and 0% of the word "go" to encode "let's." Note: there's a lot more to be said about the softmax function, so if you're interested, check out the Quest. Anyway, because we want 100% of the word "let's" to encode "let's," we create two more values, which we'll cleverly call values, to represent the word "let's," and scale the values that represent "let's" by 1.0. Then we create two values to represent the word "go" and scale those values by 0.0. Lastly, we add the scaled values together, and these sums, which combine separate encodings for both input words, "let's" and "go," relative to their similarity to "let's," are the self-attention values for "let's." BAM!

Now that we have self-attention values for the word "let's," it's time to calculate them for the word "go." The good news is that we don't need to recalculate the keys and values. Instead, all we need to do is create the query that represents the word "go" and do the math: first calculate the similarity scores between the new query and the keys, then run the similarity scores through a softmax, then scale the values, and then add them together, and we end up with the self-attention values for "go."

Note: before we move on, I want to point out a few details about self-attention. First, the weights that we use to calculate the self-attention queries are the exact same for "let's" and "go." In other words, this example uses one set of weights for calculating self-attention queries, regardless of how many words are in the input. Likewise, we reuse the sets of weights for calculating self-attention keys and values for each input word. This means that no matter how many words are input into the Transformer, we just reuse the same sets of weights for self-attention queries, keys, and values. The other thing I want to point out is that we can calculate the queries, keys, and values for each word at the same time. In other words, we don't have to calculate the query, key, and value for the first word before moving on to the second word, and because we can do all of the computation at the same time, Transformers can take advantage of parallel computing and run fast. Now that we understand the details of how self-attention works, let's shrink it down so we can keep building our Transformer. BAM!
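Here's a minimal sketch of that whole self-attention calculation done for every word at once, in matrix form, which is exactly the kind of math that can run in parallel. The position-encoded inputs again come from this example, and all of the weights are made up for illustration.

```python
# A minimal sketch of one self-attention cell applied to every word at the same time.
import numpy as np

def softmax(scores):
    exponentiated = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exponentiated / exponentiated.sum(axis=-1, keepdims=True)

position_encoded = np.array([[ 1.87, 1.09],    # "let's"
                             [-1.68, 0.67]])   # "go"

query_weights = np.array([[0.5, -0.3], [ 0.8, 0.6]])   # hypothetical
key_weights   = np.array([[0.2,  0.9], [-0.4, 0.1]])   # hypothetical
value_weights = np.array([[1.1,  0.4], [-0.2, 0.7]])   # hypothetical

queries = position_encoded @ query_weights
keys    = position_encoded @ key_weights
values  = position_encoded @ value_weights

similarities = queries @ keys.T            # every word compared to every word
percentages  = softmax(similarities)       # what percentage of each word to use
self_attention = percentages @ values      # scale the values and add them up

print(self_attention)                      # one row of self-attention values per word
```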
"Josh, you forgot something! If we want two values to represent let's, why don't we just use the two position-encoded values we started with?" First, the new self-attention values for each word contain input from all of the other words, and this helps give each word context and can help establish how each word in the input is related to the others. Also, if we think of this unit, with its weights for calculating queries, keys, and values, as a self-attention cell, then, in order to correctly establish how words are related in complicated sentences and paragraphs, we can create a stack of self-attention cells, each with its own sets of weights that we apply to the position-encoded values for each word, to capture different relationships among the words. In the manuscript that first describes Transformers, they stacked eight self-attention cells, and they called this multi-head attention. Why eight, instead of 12 or 16? I have no idea. BAM!

Okay, going back to our simple example with only one self-attention cell, there's one more thing we need to do to encode the input: we take the position-encoded values and add them to the self-attention values. These bypasses are called residual connections, and they make it easier to train complex neural networks by allowing the self-attention layer to establish relationships among the input words without having to also preserve the word embedding and positional encoding information. BAM! And that's all we need to do to encode the input for this simple Transformer. DOUBLE BAM!

Note: this simple Transformer only contains the parts required for encoding the input: word embedding, positional encoding, self-attention, and residual connections. These four features allow the Transformer to encode words into numbers, encode the positions of the words, encode the relationships among the words, and train relatively easily and quickly in parallel. That said, there are lots of extra things we can add to a Transformer, and we'll talk about those at the end of this Quest. BAM!
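Here's a minimal sketch that puts those four encoder pieces together: word embedding, sinusoidal positional encoding, one self-attention cell, and the residual connection. The weights are random placeholders, and the positional encoding uses the standard formula rather than the exact squiggles from this example.

```python
# A minimal sketch of the whole simple encoder: embedding + positional encoding +
# one self-attention cell + a residual connection. All weights are placeholders.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def positional_encoding(num_positions, d):
    pos = np.arange(num_positions)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def encode(token_ids, embedding_weights, Wq, Wk, Wv):
    embedded = embedding_weights[token_ids]                          # word embedding
    position_encoded = embedded + positional_encoding(len(token_ids), embedded.shape[1])
    q, k, v = position_encoded @ Wq, position_encoded @ Wk, position_encoded @ Wv
    self_attention = softmax(q @ k.T) @ v                            # self-attention cell
    return self_attention + position_encoded                         # residual connection

rng = np.random.default_rng(0)
vocab = ["let's", "to", "go", "<EOS>"]
E, Wq, Wk, Wv = (rng.normal(size=(4, 2)), rng.normal(size=(2, 2)),
                 rng.normal(size=(2, 2)), rng.normal(size=(2, 2)))
print(encode([vocab.index("let's"), vocab.index("go")], E, Wq, Wk, Wv))
```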
So, now that we've encoded the English input phrase "let's go," it's time to decode it into Spanish. In other words, the first part of a Transformer is called an encoder, and now it's time to create the second part, a decoder. The decoder, just like the encoder, starts with word embedding; however, this time we create embedding values for the output vocabulary, which consists of the Spanish words ir, vamos, and y, plus the EOS (end of sequence) token. Now, because we just finished encoding the English sentence "let's go," the decoder starts with the embedding values for the EOS token. In this case, we're using the EOS token to start the decoding because that is a common way to initialize the process of decoding the encoded input sentence; however, sometimes you'll see people use SOS, for start of sentence or start of sequence, to initialize the process. "Josh, starting with SOS makes more sense to me." Then you can do it that way, Squatch; I'm just saying a lot of people start with EOS. Anyway, we plug in 1 for EOS and 0 for everything else, do the math, and we end up with 2.70 and -1.34 as the numbers that represent the EOS token. BAM! Now let's shrink the word embedding down to make more space so that we can add the positional encoding. Note: these are the exact same sine and cosine squiggles that we used when we encoded the input, and since the EOS token is in the first position, with two embeddings, we just add those two position values and we get 2.70 and -0.34 as the position-plus-word-embedding values representing the EOS token. BAM! Now let's consolidate the math in the diagram.

Before we move on to the next step, let's review a key concept from when we encoded the input. One key concept from earlier was that we created a single unit to process an input word and then we just copied that unit for each word in the input, and if we had more words, we'd just make more copies of the same unit. By creating a single unit that can be copied for each input word, the Transformer can do all of the computation for each word in the input at the same time. For example, we can calculate the word embeddings on different processors at the same time, then add the positional encoding at the same time, then calculate the queries, keys, and values at the same time, and, once that is done, calculate the self-attention values at the same time, and, lastly, calculate the residual connections at the same time. Doing all of the computations at the same time, rather than doing them sequentially for each word, means we can process a lot of words relatively quickly on a chip with a lot of computing cores, like a GPU (graphics processing unit), or on multiple chips in the cloud. Likewise, when we decode and translate the input, we want a single unit that we can copy for each translated word, for the same reason: we want to do the math quickly. So, even though we're only processing the EOS token so far, we add a self-attention layer so that, ultimately, we can keep track of related words in the output. Now that we have the query, key, and value numbers for the EOS token, we calculate its self-attention values just like before, and the self-attention values for the EOS token are -2.8 and -2.3. Note: the sets of weights we use to calculate the decoder's self-attention queries, keys, and values are different from the sets we used in the encoder. Now let's consolidate the math and add residual connections, just like before. BAM!

So far we've talked about how self-attention helps the Transformer keep track of how words are related within a sentence; however, since we're translating a sentence, we also need to keep track of the relationships between the input sentence and the output. For example, if the input sentence was "Don't eat the delicious looking and smelling pizza," then, when translating, it's super important to keep track of the very first word, "don't." If the translation focuses on other parts of the sentence and omits the "don't," then we'll end up with "Eat the delicious looking and smelling pizza," and these two sentences have completely opposite meanings. So it's super important for the decoder to keep track of the significant words in the input, and the main idea of encoder-decoder attention is to allow the decoder to do exactly that.

Now that we know the main idea behind encoder-decoder attention, here are the details. First, to give us a little more room, let's consolidate the math and the diagrams. Now, just like we did for self-attention, we create two new values to represent the query for the EOS token in the decoder. Then we create keys for each word in the encoder, and we calculate the similarities between the EOS token in the decoder and each word in the encoder by calculating the dot products, just like before. Then we run the similarities through a softmax function, and this tells us to use 100% of the first input word and 0% of the second when the decoder determines what should be the first translated word. Now that we know what percentage of each input word to use, we calculate values for each input word, scale those values by the softmax percentages, and then add the pairs of scaled values together to get the encoder-decoder attention values. BAM! Now, to make room for the next step, let's consolidate the encoder-decoder attention in our diagram. Note: the sets of weights that we use to calculate the queries, keys, and values for encoder-decoder attention are different from the sets of weights we use for self-attention; however, just like for self-attention, the sets of weights are copied and reused for each word. This allows the Transformer to be flexible with the length of the inputs and outputs, and we can also stack encoder-decoder attention, just like we can stack self-attention, to keep track of words in complicated phrases. BAM!
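Here's a minimal sketch of encoder-decoder attention: the query comes from the token in the decoder, while the keys and values come from the encoder's output for each input word. Only the decoder values, -2.8 and -2.3, come from the example above; the encoder outputs and all of the weights are made up for illustration.

```python
# A minimal sketch of encoder-decoder attention for the <EOS> token in the decoder.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

encoder_output = np.array([[1.2, -0.5],     # encoding of "let's" (made up)
                           [0.3,  0.9]])    # encoding of "go"    (made up)
decoder_token  = np.array([[-2.8, -2.3]])   # <EOS> after decoder self-attention

Wq = np.array([[0.4,  0.7], [-0.2, 0.5]])   # hypothetical encoder-decoder attention weights
Wk = np.array([[0.9, -0.1], [ 0.3, 0.8]])
Wv = np.array([[0.6,  0.2], [-0.5, 1.0]])

query  = decoder_token @ Wq                 # query comes from the decoder
keys   = encoder_output @ Wk                # keys come from the encoder
values = encoder_output @ Wv                # values come from the encoder

percentages = softmax(query @ keys.T)       # how much of each input word to use
encoder_decoder_attention = percentages @ values
print(encoder_decoder_attention)
```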
Now we add another set of residual connections that allow the encoder-decoder attention to focus on the relationships between the output words and the input words, without having to also preserve the self-attention or the word and position encoding that happened earlier. Then we consolidate the math and the diagram. Lastly, we need a way to take these two values that represent the EOS token in the decoder and select one of the four output tokens: ir, vamos, y, or EOS. So we run these two values through a fully connected layer that has one input for each value that represents the current token, so in this case we have two inputs, and one output for each token in the output vocabulary, which in this case means four outputs. Note: a fully connected layer is just a simple neural network with weights (numbers we multiply the inputs by) and biases (numbers we add to the sums of the products). When we do the math, we get four output values, which we run through a final softmax function to select the first output word, vamos. BAM! Note: vamos is the Spanish translation for "let's go."

TRIPLE BAM? No, not yet! So far the translation is correct, but the decoder doesn't stop until it outputs an EOS token. So let's consolidate our diagrams and plug the translated word vamos into a copy of the decoder's embedding layer and do the math. First, we get the word embeddings for vamos, then we add the positional encoding. Now we calculate the self-attention values for vamos, using the exact same weights that we used for the EOS token, then we add the residual connections and calculate the encoder-decoder attention, using the same sets of weights that we used for the EOS token, and then we add more residual connections. Lastly, we run the values that represent vamos through the same fully connected layer and softmax that we used for the EOS token, and the second output from the decoder is EOS, so we are done decoding. TRIPLE BAM!!!
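Here's a minimal sketch of that last step at each decoding position: a fully connected layer that maps the two values representing the current token to one score per output token, followed by a softmax that picks the next word. The input values, weights, and biases are all made up for illustration.

```python
# A minimal sketch of the decoder's output layer: fully connected layer + final softmax.
import numpy as np

output_vocab = ["ir", "vamos", "y", "<EOS>"]

current_token = np.array([1.5, -0.4])             # made-up values for the current token

weights = np.array([[ 0.2, 1.3, -0.7, 0.1],       # hypothetical: 2 inputs x 4 outputs
                    [-0.9, 0.4,  0.6, 0.8]])
biases  = np.array([ 0.1, -0.2,  0.0, 0.3])       # hypothetical: one bias per output

scores = current_token @ weights + biases               # fully connected layer
probabilities = np.exp(scores) / np.exp(scores).sum()   # final softmax
print(output_vocab[int(np.argmax(probabilities))])      # the token with the highest probability
```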
At long last, we've shown how a Transformer can encode a simple input phrase, "let's go," and decode that encoding into the translated phrase, "vamos." In summary, Transformers use word embedding to convert words into numbers, positional encoding to keep track of word order, self-attention to keep track of word relationships within the input and output phrases, encoder-decoder attention to keep track of the relationships between the input and output phrases, to make sure that important words in the input are not lost in the translation, and residual connections to allow each subunit, like self-attention, to focus on solving just one part of the problem.

Now that we understand the main ideas of how Transformers work, let's talk about a few extra things we can add to them. In this example, we kept things super simple; however, if we had larger vocabularies (the original Transformer had 37,000 tokens) and longer input and output phrases, then, in order to get the model to work, we'd have to normalize the values after every step. For example, the original Transformer normalizes the values after positional encoding and after self-attention, in both the encoder and the decoder. Also, when we calculated attention values, we used the dot product to calculate the similarities, but you can use whatever similarity function you want. In the original Transformer manuscript, they calculated the similarities with a dot product divided by the square root of the number of embedding values per token. Just like normalizing the values after each step, they found that scaling the dot product helped encode and decode long and complicated phrases. Lastly, to give a Transformer more weights and biases to fit complicated data, you can add additional neural networks, with hidden layers, to both the encoder and the decoder. BAM!

Now it's time for some Shameless Self-Promotion! If you want to review statistics and machine learning offline, check out the StatQuest PDF study guides and my book, The StatQuest Illustrated Guide to Machine Learning, at statquest.org. There's something for everyone. Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs or a t-shirt or a hoodie, or just donate; the links are in the description below. Alright, until next time, Quest on!