The transformer architecture powers most of
the impressive recent breakthroughs in AI. The transformer is behind systems like ChatGPT,
Vision Transformers, Image Generators, AlphaFold 2 for predicting protein folding and many
others. So, if you are interested in learning about the transformer, this is the right video
for you. We already made a video explaining the transformer, but it was one of our first
videos, and I can do it so much better now. Also, back then we did not spend enough time explaining
self-attention, which we will do better this time. So here we go, with the remastered explanation
of the transformer architecture! Transformers can work with ANY kind of data,
and by that, I mean text, images, speech, and so on,
as long as we represent the data as a set of vectors. However, it is not always straightforward
to do this, as for example text does not naturally come as a sequence of vectors. That means,
before we can look at the inner workings of the transformer, we need to understand how
to represent inputs as vectors. So, let’s look at two examples: text and images.
For text, we perform so-called tokenization, where we take a sequence of words and decompose
it with a tokenizer into subwords from a predefined vocabulary, for example by splitting at
whitespace and breaking compound words down into their components. If you want to know
more about tokenization, check out our video on this. Then each of these subwords gets assigned a unique
vector. The vectors could be initialized randomly or, even better, with word embeddings! Word
embeddings are built on the idea that distances between embeddings represent word similarity
(a word is defined by the company it keeps), so words that are semantically similar
are initialized with vectors that lie close together in the high-dimensional vector space. You can easily download such word embeddings,
as they are precomputed: by counting how often words appear next to other words in text,
a neural network learns to assign two words similar embeddings if they tend to have the same
neighbors. You can learn more about word embeddings in our previous video.
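If you prefer to see this in code, here is a minimal sketch of the idea in PyTorch; the tiny vocabulary, the whitespace-only tokenizer and the embedding size are all made up for illustration, not taken from any real model:

import torch
import torch.nn as nn

# A toy vocabulary and an (overly simple) whitespace tokenizer.
# Real tokenizers also split words into subwords, e.g. "transformers" -> "transform" + "##ers".
vocab = {"[CLS]": 0, "attention": 1, "works": 2, "well": 3, "[MASK]": 4}

def tokenize(text):
    # Map each whitespace-separated word to its id in the vocabulary.
    return [vocab[word] for word in text.lower().split()]

embedding_dim = 8                                     # made-up size; real models use hundreds of dimensions
embedding = nn.Embedding(len(vocab), embedding_dim)   # one learnable vector per vocabulary entry

token_ids = torch.tensor(tokenize("attention works well"))
vectors = embedding(token_ids)                        # shape: (3, 8) -- one vector per token
print(vectors.shape)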
Now that we know how to represent text, let's think about how to represent images. Images are more naturally represented as vectors,
or at least matrices, which can be flattened into high-dimensional vectors: an image is composed of three matrices,
one each for the red, green and blue channels, where each entry tells us the light intensity of
that color at the corresponding pixel. One could take the rows of each matrix and
write them one after the other to get vectors. But this would result in a lot of vectors
and transformers are much slower with many vectors (as will become clearer later in this
video). So what people do instead
is to divide the image into patches and apply to each patch the same linear neural network
layer, which is trained together with the transformer to find the right weights that sensibly map
each p-by-p patch to a d-by-1 matrix, which is a d-dimensional vector.
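Here is a rough sketch of this patching idea, again in PyTorch; the patch size, image size and dimension d are made-up values, and we flatten all three color channels of each patch:

import torch
import torch.nn as nn

p, d = 4, 8                      # made-up patch size and embedding dimension
image = torch.rand(3, 32, 32)    # a random RGB image: 3 channels, 32 x 32 pixels

# Cut the image into non-overlapping p x p patches and flatten each patch
# into one vector of length 3 * p * p (all three color channels of the patch).
patches = image.unfold(1, p, p).unfold(2, p, p)                   # (3, 8, 8, p, p)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * p * p)   # (64, 48)

# The same linear layer, trained together with the transformer,
# maps every flattened patch to a d-dimensional vector.
patch_embed = nn.Linear(3 * p * p, d)
patch_vectors = patch_embed(patches)   # shape: (64, d) -- one vector per patch
print(patch_vectors.shape)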
To summarize, the prerequisite of transformers is that, whatever the input is, we must first decide on a way to represent it with
vectors. All neural networks, including the transformer,
process these vector representations into better and better representations with each
layer, until the solution for the task is obvious (or linearly separable, if we want
to use jargon). But compared to other neural networks, the
transformer does this processing in a specific way, as follows. Let's suppose we have
an input sequence, here of text. The task is, for example, to predict which token comes
next, or whether the sentence expresses a positive or a negative sentiment, or any other
classification task we can think of. We take our input sequence, represent it as
vectors using word embeddings. One transformer layer takes in this sequence, updates the
vectors, and outputs as many vectors as it received in the input, preserving the dimensionality
of the vectors. But to do something meaningful with this transformer,
we need to add special tokens, for example a classification token at the end of the sequence.
This special token goes through the transformer in the same way as the other tokens.
But it's special because, to its output representation, we usually append a linear classification
layer that picks, from a list of words called the vocabulary, which token comes
next. And if we are instead classifying the whole sequence, it assigns probabilities to the classes. Note that this is a simple classification
layer; mathematically, it is just a matrix multiplication that happens here, which geometrically
corresponds to drawing a separation line in the high-dimensional space the word vectors
live in. In other words, by this point the solution should already be obvious, as prepared by the
transformer, such that we can tell fitting classes from unfitting classes just by drawing
a line. During training, the transformer processes
the input and gives output vectors; we run the classification layer on the special token
and get the assigned class. We compare the assigned prediction to the
expected one from the dataset, compute the loss value, backpropagate it,
and update the internal parameters of the classification layer and the transformer layers
to values that minimize the loss and thus give better classification results next time.
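As a minimal sketch of that training step, assuming a placeholder transformer, a two-class sentiment task and the special token sitting at the last position:

import torch
import torch.nn as nn

d, num_classes = 8, 2                      # made-up sizes: embedding dimension, two sentiment classes
transformer = nn.Identity()                # placeholder for the actual stack of transformer layers
classifier = nn.Linear(d, num_classes)     # the linear classification layer on top of the special token

optimizer = torch.optim.Adam(list(transformer.parameters()) + list(classifier.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

input_vectors = torch.rand(1, 5, d)        # a batch with one sequence of 5 token embeddings
label = torch.tensor([1])                  # the expected class from the dataset

output_vectors = transformer(input_vectors)      # same number of vectors, same dimensionality
logits = classifier(output_vectors[:, -1, :])    # classify from the special token (assumed to be last here)
loss = loss_fn(logits, label)                    # compare prediction to the expected class

loss.backward()        # backpropagate the loss value
optimizer.step()       # update classifier and transformer parameters to reduce the loss
optimizer.zero_grad()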
Okay, but what happens in this mysterious box we call “transformer”?
Well, it is composed of multiple transformer layers. One transformer layer contains two things:
One of them is not that exciting: it is just the same feed-forward network, also called the MLP
sublayer, acting on every input token. Such an MLP sublayer takes the input representation and
applies a dense layer with a GeLU activation that doubles the dimension. Then another dense layer
scales the dimension back down again. And it is the same MLP layer, with the exact same weights,
that we apply to each input token embedding.
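In code, such an MLP sublayer could look roughly like this; we follow the video and double the dimension, although many real implementations expand by a factor of four and apply the GeLU only after the first dense layer:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPSublayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.up = nn.Linear(d, 2 * d)     # dense layer that doubles the dimension
        self.down = nn.Linear(2 * d, d)   # dense layer that scales the dimension back down

    def forward(self, x):                 # x: (sequence_length, d)
        return self.down(F.gelu(self.up(x)))

mlp = MLPSublayer(d=8)
tokens = torch.rand(5, 8)                 # 5 token embeddings of dimension 8
out = mlp(tokens)                         # the exact same weights are applied to every token
print(out.shape)                          # torch.Size([5, 8])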
Okay, let's see what we have: a bunch of MLP layers processing each token independently of the others. This is suboptimal.
See this word representation here? It does not even know that there are other words next
to it. And it is even worse for the [CLS] token,
which should aggregate and summarize the sentence information if we are to use it for classification,
but so far has no connection to the sentence tokens at all! While the transformer layer saves a lot of
compute time because all these MLP layers compute their output in parallel, we need
a way to communicate information in the context of the sequence, so that the word “works”
is informed of the existence and semantics of its neighbour “attention”, for example. Luckily, this is what the self-attention sublayer
is for: to let information flow within the context of the sequence, from one embedding
to its neighbors. In a nutshell, the attention layer computes how much of each neighbour's
representation we need to mix in to form a new token representation, which
is the output of the self-attention layer. By the way, we will be using attention and
self-attention here synonymously. But if you are wondering what the difference between
them is: self-attention is when we compute importances of the elements of a sequence
to the elements in the same sequence. Attention is more general because we compute
the importance of the elements of one sequence to the elements in another sequence. For example,
you can see here the self-attention of “it” on the left, and the attention of “ihn”
on the right. “Ihn” is an element of a different sequence than the one above it. Now, how does the attention layer compute
these importances exactly? Well, it is a bit complicated in the sense
that it is a pile of linear algebra that uses the loss function to adapt the entries of
weight matrices during training to make them work well at inference time. But then, neural networks
are never anything other than huge piles of linear algebra,
so strap yourself into your chair, because we will try to explain the attention computation
as clearly as possible. Self-attention does the following:
It takes the input vectors and applies three different linear transformations to produce the query,
key and value vectors. This means that, for the queries, it multiplies
the query matrix with the input vector, and this results in a query vector. This query matrix
is randomly initialized before training, and gradient descent adapts its values during
backpropagation to make them the right ones, the ones that reduce the loss on the training data.
And the same query matrix applies to all inputs, to get query vectors for all of them. As for
the keys, we simply have another matrix, called the key matrix, initialized differently
from the query matrix, that also multiplies with the input vector to produce a key vector.
And to produce the value vectors, we multiply a value matrix with the input. So, in summary, we have three different matrices,
all initialized randomly, that linearly transform the input in different ways.
Now, what does self-attention do next with these different vectors it has just produced?
Let's suppose we are calculating the attention of the input token “works” to all other
tokens in the sequence, including itself. It works the same way for the other tokens too.
First, we compute the scalar product between the query vector of the token of interest
and the key vectors of all tokens. Then we divide by the square root of the dimension
of the key vectors, so the square root of 3 in our example. Then we apply the softmax over all these values. We can interpret these
softmax scores as measuring how important each token in the input is for the token “works”.
So, the token “Attention” is 13% important for “works”, “works” is 78% important
for itself, and the [CLS] token is 7% important. Now it gets interesting: to get the final
representation of “works”, we take the sum over all value vectors, each weighted (multiplied)
by its softmax score. So this is what we meant before by saying
that attention combines the representations of the input (the value vectors), weighted
by the importance scores.
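Putting the whole single-head computation together, a minimal sketch could look like this; the dimensions are made up, and the randomly initialized matrices would of course be trained in a real model:

import math
import torch

torch.manual_seed(0)
n, d, d_k = 3, 8, 3            # 3 tokens ("Attention", "works", "[CLS]"), embedding size 8, key/query size 3
X = torch.rand(n, d)           # the input token embeddings

W_q = torch.rand(d, d_k)       # query matrix, randomly initialized, adapted by gradient descent
W_k = torch.rand(d, d_k)       # key matrix
W_v = torch.rand(d, d_k)       # value matrix

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # one query, key and value vector per token

scores = (Q @ K.T) / math.sqrt(d_k)          # scalar products, divided by sqrt of the key dimension (sqrt(3))
weights = torch.softmax(scores, dim=-1)      # each row: how important every token is for that token
new_X = weights @ V                          # weighted sum of the value vectors = new representations

# Note: here the new vectors have dimension d_k; real models typically project back to d.
print(weights[1])                            # the importance scores for the token "works" (row 1)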
Empirically, it turns out that one set of attention scores per layer is not enough to capture the complexity of the relationships in our data. Think of it this way: the attention importance
scores define a graph that tells us, for each token, how important every other token is
to it. But one graph is not enough to model all existing
relationships. In the same way that you can define your social network graph based
on who your friends are, you can also think of other types of connections, like
with whom of these people you work, or with whom you share a city.
There are multiple relationships and importances to be modelled for a given set of tokens. Therefore, the idea of multi-head self-attention
is to let the network learn 3, or 8, or 12 attention patterns, instead of just one. So
we do not use just one set of query, key and value matrices, but three (or more) of them, and each set is called
an “attention head”. As we initialise the key, query and value matrices of each head randomly,
they start training from different values, produce different query, key and value
vectors, and usually end up capturing different patterns in your data. One
head might focus on one pattern, such as coreference resolution, and another one on identifying
the subject of sentences. If you wonder how many attention heads you need, the answer
is that you are free to choose. It is a hyperparameter.
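As a rough sketch of multi-head self-attention, reusing the single-head computation from before; three heads and the final output matrix that mixes the concatenated heads are common choices, not requirements:

import math
import torch

def attention_head(X, W_q, W_k, W_v):
    # One head: its own query, key and value matrices, its own attention pattern.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = torch.softmax((Q @ K.T) / math.sqrt(K.shape[-1]), dim=-1)
    return weights @ V

n, d, num_heads = 5, 12, 3
d_head = d // num_heads                      # each head works in a smaller subspace
X = torch.rand(n, d)

heads = []
for _ in range(num_heads):
    # Each head gets its own randomly initialized matrices, so it can learn a different pattern.
    W_q, W_k, W_v = (torch.rand(d, d_head) for _ in range(3))
    heads.append(attention_head(X, W_q, W_k, W_v))

W_o = torch.rand(d, d)                        # output matrix that mixes the concatenated heads
out = torch.cat(heads, dim=-1) @ W_o          # back to shape (n, d)
print(out.shape)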
The more, the better, but often you cannot use very many, as you quickly run out of GPU memory. Especially because attention scales quadratically
in time and memory. So if you process a sequence that is twice as long,
you will need four times as much time and four times as much memory.
It is an active area of research to approximate attention with other operations that scale
linearly instead of quadratically, or to replace it altogether with other operations that do
the job of mixing information between tokens. If you are interested in this topic, please
watch our previous videos on this. But in a nutshell, it’s fake news that attention
is all you need. You can replace it with other token mixing procedures too. Now, let’s recap what we have so far and
what we still need for a full transformer. We have our input embeddings, they go through
the self-attention layer, which gives us representations that are informed about the other embeddings
in the sequence. Then they go through the MLP layer all in parallel. But so far, this transformer layer behaves
like our input sequence weren’t a sequence, but a set.
If we were to reorder the tokens, the transformer would not change its outputs. The result of
the attention would still be the same, because the weighted sum there does not care about order; please check
for yourself to be convinced. And the feed-forward network acts independently
of all the other tokens anyway. It is not great that, so far, the transformer gives us
the same output regardless of the order of the input. Because images, text and sound are data
where order matters, we need a way to tell the transformer layer that this is the first
token in the sequence, this is the second, and so on.
And this is what positional embeddings do. They are vectors that uniquely identify each
position, which we add to the input embeddings. They work like house numbers to identify the
specific position of each house in a street address. How do we come up with the values
for the positional embeddings? Well, either with certain fixed rules, or we can simply learn these
vectors as well during the training process of the transformer. If you want more details
about positional embeddings and the numerous ways to implement them, you can watch one
of our previous videos on this.
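Here is a minimal sketch of the learned variant, where the position vectors are simply additional trainable parameters that we add to the token embeddings; the maximum length and the dimension are made up:

import torch
import torch.nn as nn

max_len, d = 16, 8
pos_embedding = nn.Embedding(max_len, d)             # one learnable vector per position: the "house numbers"

token_vectors = torch.rand(5, d)                     # 5 token embeddings
positions = torch.arange(5)                          # 0, 1, 2, 3, 4
inputs = token_vectors + pos_embedding(positions)    # now each vector also knows where it sits in the sequence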
Okay, now we have this figured out, but there is one more thing missing before the architecture is complete. The missing ingredient is the residual connections,
which after the self-attention layer add the input of the self-attention layer, to its
output. A normalization operation (layer normalization) then brings the values back to a standard scale,
because otherwise, after each residual connection, with each layer, the values would get larger
and larger and larger… And the same thing, adding the input back to the output, happens
around the MLP layer, here in green. The intuition behind residual connections is to make the
learning job easier for each layer. To arrive at the solution, the network needs to transform
the inputs. But since it is allowed to keep the input through the residual connections,
each layer does not have to learn the whole transformation, but just the difference it
needs to add to arrive at the output.
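Putting the pieces together, one transformer layer with residual connections and normalization could be sketched like this; we follow the post-norm ordering described here, although many modern models normalize before each sublayer instead:

import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d, num_heads=4):
        super().__init__()
        self.attention = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):                          # x: (batch, sequence_length, d)
        attn_out, _ = self.attention(x, x, x)      # self-attention: queries, keys, values all come from x
        x = self.norm1(x + attn_out)               # residual connection around attention, then normalize
        x = self.norm2(x + self.mlp(x))            # residual connection around the MLP, then normalize
        return x

layers = nn.Sequential(*[TransformerLayer(d=8) for _ in range(6)])   # stack several layers
out = layers(torch.rand(1, 5, 8))                                    # shapes are preserved throughout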
And residual connections become even more important because, with deep neural networks, we usually do not use just one transformer
layer, but append another transformer layer to the output of the previous one, and another
layer, and so on. How many? It is a hyperparameter and of course we are limited by the amount
of memory our GPUs have. The more the better, because with many layers, the transformer
gets more attempts to break down the problem and arrive at the solution, which is easier
than getting to the solution in one go with just one layer. And residual connections help
when training such a long stack of layers, because during backpropagation, gradient signals
can get lost propagating from the end to the beginning, very much like a whisper
in the telephone game, as it is called in the US. Now, this was most of what you need to know
about transformer basics: you now know the principles by which they predict the
next word, like GPT, or classify the whole sequence.
Another training procedure we left for the end is the so-called masked language modelling
procedure used for transformers of the BERT family. There, we have a classifier token
that we use to classify whether two sentences belong together or not, but there is more:
15% of the tokens in the sequence are chosen at random, masked out, and replaced with a special
[MASK] token. The training objective of BERT is then to
adapt its weights such that a linear mask classification head can choose, from the vocabulary,
the word that was masked out in the input. This masked language modelling procedure is
great for training classification transformers, that is, transformer encoders. Predicting the next
word is something for GPT-like models, that is, transformer decoders.
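A small sketch of just the masking step; the 15% rate is the one BERT uses, everything else is simplified:

import torch

vocab_size, mask_id = 1000, 4                     # made-up vocabulary size and [MASK] token id
token_ids = torch.randint(5, vocab_size, (12,))   # a sequence of 12 token ids (starting at 5 to keep specials free)

mask = torch.rand(12) < 0.15                                   # choose roughly 15% of the positions at random
labels = torch.where(mask, token_ids, torch.tensor(-100))      # remember the original words only where masked
inputs = torch.where(mask, torch.tensor(mask_id), token_ids)   # replace the chosen tokens with [MASK]

# The model then has to predict `labels` at the masked positions from `inputs`;
# -100 is the value the PyTorch cross-entropy loss ignores by default.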
If you are wondering what the difference between Transformers and recurrent neural networks (RNNs) is, let's look at it in a simplified
view. While in Transformers, we use attention to communicate information in parallel from
each input token to every other token, RNNs process the first token, and use that
output as input together with the second token, to process the second token. Then the output
of the second token goes into the processing of the third token, and so on. And you see
the problem: we need to wait for the second token to finish processing before we can
start computing the third token. This means that RNNs train more slowly than Transformers.
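To make the contrast concrete, here is a toy sketch of that sequential dependency; the RNN cell is just a stand-in:

import torch
import torch.nn as nn

d = 8
rnn_cell = nn.RNNCell(d, d)              # processes one token at a time
tokens = torch.rand(5, d)                # 5 token embeddings

hidden = torch.zeros(1, d)
for token in tokens:                     # each step must wait for the previous one to finish
    hidden = rnn_cell(token.unsqueeze(0), hidden)

# In a transformer, by contrast, all 5 tokens attend to each other in one parallel operation.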
So, when Transformers revolutionized NLP, it was because their architecture allowed
them to read essentially the entire internet, since they could process tokens in parallel, while with
RNNs, nobody got to train on the whole internet because it took too much time. We hope you liked this little introduction
to the transformer architecture, and that you can now impress your friends and family with how much you
know about how ChatGPT works internally. There are countless other great resources on this topic,
such as The Illustrated Transformer blog post by Jay Alammar and the transformer series
by Luis Serrano. Also, I hope my Patreon supporters who voted for the transformer
explanation as the topic for the next video will be happy that I finally managed to finish this
video. I really thank them for their patience. If you liked this video, do not forget to
like and subscribe. We hope to see you next time. Okay, bye!