Transcript for:
Understanding Vision Transformers in Image Classification

We have previously studied the Transformer model and BERT. In this lecture, I will introduce the Vision Transformer (ViT), the new state of the art for image classification. Its accuracy surpasses that of the best convolutional neural networks.

What is image classification? For example, we show an image to a model, and the model is supposed to infer that the image contains a dog. We feed the image into a neural network, and the neural network outputs a vector p, which indicates the result of classification. Each element of p corresponds to a class. If a dataset has eight classes, then p is eight-dimensional. For example, the figure shows the eight elements of vector p. Each element is the confidence of a class. The elements are between 0 and 1, and they add up to 1. The dog class has the biggest confidence, 0.4, which means the neural network has 40% confidence that the image contains a dog.

When it comes to image classification, the first model you think of is probably a CNN. ResNet is the best among CNN models and was previously the best solution to image classification. The Vision Transformer, or ViT, is the new state of the art. ViT was posted on arXiv in October 2020 and officially published in 2021. On all the public datasets, ViT beats the best ResNet by a small margin, provided that ViT has been pre-trained on a sufficiently large dataset. The bigger the pre-training dataset, the greater the advantage of ViT over ResNet.

The Transformer was originally developed in 2017 for natural language processing. ViT is a successful application of the Transformer to computer vision, but the model itself is not novel: ViT is essentially the encoder network of the Transformer.

I will use this example to explain the Vision Transformer. ViT requires partitioning the input image into patches of the same shape. For example, we partition the image into nine patches of the same shape. In this example, the nine patches do not overlap, but it is also fine to split the image into overlapping patches; it is up to the user. When splitting the image, we use a sliding window that moves several pixels each time. The stride is how many pixels the sliding window moves each time. A smaller stride results in a larger number of patches. To split the image, the user needs to specify two arguments. One is the patch size; for example, the Vision Transformer paper uses 16-by-16 patches. The other argument is the stride. The paper splits the image into non-overlapping patches, that is, the stride is 16 by 16, the same as the patch size. Of course you can use a smaller stride. You can even use a 1-by-1 stride if you want, but this results in far too many patches, which means very heavy computation.

Suppose the image is split into nine patches. Every patch is a small color image with RGB channels; a patch is an order-3 tensor. The next step is to vectorize the patches. Vectorization means reshaping a tensor into a vector. The nine tensors are reshaped into nine vectors. If the patches are d1-by-d2-by-d3 tensors, then the vectors are (d1 × d2 × d3)-dimensional.

Suppose an image is split into n patches, which are reshaped into n vectors, x1 to xn. Apply a dense layer: the output z1 is equal to W times x1 plus b. No nonlinear activation function is applied, so the dense layer is a linear function. Here, matrix W and vector b are parameters to be learned from the training data. Apply the same linear function to vector x2 to obtain vector z2, which is equal to W times x2 plus b. The parameters W and b are the same as before.
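Here is a minimal sketch, assuming PyTorch, of the patch splitting, vectorization, and shared dense layer described above. The class name PatchEmbedding and the default sizes (224-by-224 images, 16-by-16 patches, 768-dimensional z vectors) are illustrative assumptions, not taken from the paper's code.

```python
# Sketch only: patch splitting + vectorization + one shared linear layer (W x + b).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2
        # The same W and b are applied to every vectorized patch.
        self.proj = nn.Linear(in_channels * patch_size * patch_size, embed_dim)

    def forward(self, images):                      # images: (batch, 3, H, W)
        b, c, h, w = images.shape
        p = self.patch_size
        # Non-overlapping p-by-p patches (stride = patch size), then vectorize each patch.
        patches = images.unfold(2, p, p).unfold(3, p, p)                 # (b, c, h/p, w/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)  # (b, n, c*p*p)
        return self.proj(patches)                   # (b, n, embed_dim): the vectors z_1 .. z_n
```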
As mentioned, the dense layers share parameters. Apply the same dense layer to all the vectors x1 to xn; the outputs are z1 to zn. The dense layers have the same parameter matrix W and parameter vector b.

In addition, we need to add positional encoding to vectors z1 to zn. The input image is split into n patches, and each patch has a position, which is an integer between 1 and n. Positional encoding maps an integer to a vector whose shape is the same as z. We add the positional encoding vectors to the z vectors. In this way, a z vector captures both the content and the position of a patch. The Vision Transformer paper empirically demonstrated the benefit of using positional encoding: without positional encoding, the accuracy drops by about 3%. The paper tried several positional encoding methods, and they lead to almost the same accuracy, so it is okay to use any kind of positional encoding.

I'd like to explain why positional encoding is necessary. Look at this image. We partition the image into nine patches. Here is a copy of the image. Look at the image on the right: we exchange the positions of some patches, so the two images are now different. However, swapping the z vectors will not affect the final output of the Transformer. If the z vectors do not contain the positional encoding, then the two images on the left and right will look the same from the Transformer's perspective. This is unreasonable because the images on the left and right are different; we want the Transformer to know that the two images are different. So we assign positional information to the patches and add positional encoding to the z vectors. In this way, if two patches are swapped, their positional encodings change, and therefore the output of the Transformer will be different.

Let's come back to the neural network we were building. Vectors x1 to xn are the vectorizations of the n patches. Let vectors z1 to zn be the result of the linear transformation and positional encoding. They are the representations of the n patches; they capture both the content and the positions of the patches. Aside from the n patches, we use the CLS token for classification. An embedding layer takes the CLS token as input and outputs vector z0, which has the same shape as the other z vectors. We use the CLS token because the output of the Transformer at this position will be used for classification. I elaborated on the CLS token in the earlier BERT lecture.

Feed the sequence z0 to zn into a multi-head self-attention layer. If you are unfamiliar with it, go back to my lectures on multi-head self-attention. The outputs of this self-attention layer are a sequence of n + 1 vectors. Then apply a dense layer; the outputs of the dense layer are also a sequence of n + 1 vectors. Then add another multi-head self-attention layer and another dense layer. You can stack many self-attention layers and dense layers if you want. Aside from these layers, the Transformer also uses skip connections and normalization. They are standard tricks for improving performance, and I will not elaborate on the details. The multi-head self-attention layers and dense layers constitute the Transformer encoder network, whose outputs are a sequence of n + 1 vectors; in the figure, we denote this stack of layers simply as the encoder network. Vectors c0 to cn are the output of the Transformer. To perform the classification task, we do not need vectors c1 to cn; simply ignore them. What we need is vector c0. It is the feature vector extracted from the image, and the classification is based on c0.
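Here is a minimal sketch, assuming PyTorch, of the CLS token, a learnable positional encoding, and the encoder stack described above. It builds on the PatchEmbedding sketch from earlier; PyTorch's built-in TransformerEncoderLayer differs from the paper's encoder in small details such as where layer normalization is placed, so treat this as an illustration rather than the paper's implementation.

```python
# Sketch only: CLS token + positional encoding + Transformer encoder, keeping c_0.
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    def __init__(self, num_patches, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        # Learnable z_0 for the CLS token and one positional vector per position 0..n.
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, z):                            # z: (batch, n, embed_dim) from PatchEmbedding
        b = z.shape[0]
        cls = self.cls_token.expand(b, -1, -1)       # prepend z_0
        z = torch.cat([cls, z], dim=1) + self.pos_embed  # add positional encoding to z_0 .. z_n
        c = self.encoder(z)                          # outputs c_0 .. c_n
        return c[:, 0]                               # keep only c_0 for classification
```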
Feed c0 into a softmax classifier. The classifier outputs vector p, whose dimension equals the number of classes. If the dataset has eight classes, then p is eight-dimensional. This example shows the elements of vector p; they indicate the result of classification. During training, we compute the cross entropy between vector p and the ground truth, then compute the gradient of the cross-entropy loss with respect to the model parameters, and perform gradient descent to update the parameters.

We have finished building the Vision Transformer model. The next step is to train the model on image data. First, randomly initialize the model. Then train the model on Dataset A, which should be a large-scale dataset. This step is called pre-training, and the result is a pre-trained model. Next, train the model on Dataset B, which is typically smaller than Dataset A. This step is called fine-tuning. At this point, we have finished training the model. Dataset B is the target dataset. For example, if the task is image classification on ImageNet, then Dataset B contains the 1.3 million training images of ImageNet. Finally, evaluate the model on the test set of Dataset B. We get the test accuracy, and this number indicates how good the model is.

The Vision Transformer paper mainly uses three datasets. The small ImageNet is the smallest among the three; it has 1.3 million images and 1,000 classes. The big ImageNet is larger; it has 14 million images and 21,000 classes. The small ImageNet is a subset of the big ImageNet. JFT is the biggest among the three; it has 300 million images and 18,000 classes. Unfortunately, JFT is Google's private data and is not publicly available.

The paper evaluates the models in this way: pre-train the models on Dataset A, fine-tune the models on Dataset B, and evaluate the models on the test set of Dataset B. Dataset B is the target dataset, and the test accuracy on Dataset B is the evaluation metric.

First, they use the small ImageNet for pre-training. Then they use various target datasets, including the small ImageNet, CIFAR-10, CIFAR-100, and other small datasets, for fine-tuning and evaluation. On all the target datasets, the Transformer is a little worse than ResNet. This means that without a big dataset for pre-training, the Transformer does not perform well. Then they use the big ImageNet for pre-training and several smaller datasets for fine-tuning and evaluation. On all the target datasets, the Transformer is comparable to ResNet. The big ImageNet has 14 million images, but it is not big enough for the Transformer. They also use JFT for pre-training. This time, the Transformer is consistently better than ResNet on all the target datasets, by around 1%.

The experiments indicate that the Transformer requires very large data for pre-training. The bigger the pre-training dataset, the greater the advantage of the Transformer over ResNet. If the dataset for pre-training has fewer than 100 million images, the Transformer is worse than ResNet; when the dataset has over 100 million images, the Transformer is better. Judging from the results, even 300 million images are not enough: if the number of images could grow further, the accuracy of the Transformer could be even better. In contrast, for ResNet, 100 million or 300 million images do not make a difference; the accuracy of ResNet does not improve as the number of samples grows from 100 million to 300 million. In sum, the Vision Transformer requires huge data for pre-training.
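Here is a minimal training-step sketch, assuming PyTorch and the PatchEmbedding and ViTEncoder sketches above. The optimizer, learning rate, and eight-class setup are placeholders for illustration, not the paper's training recipe.

```python
# Sketch only: softmax classifier on c_0, cross-entropy loss, gradient descent update.
import torch
import torch.nn as nn

num_classes = 8                                   # e.g. an 8-class dataset, so p is 8-dimensional
patch_embed = PatchEmbedding()
encoder = ViTEncoder(num_patches=patch_embed.num_patches)
classifier = nn.Linear(768, num_classes)          # softmax classifier on top of c_0
params = (list(patch_embed.parameters()) + list(encoder.parameters())
          + list(classifier.parameters()))
optimizer = torch.optim.Adam(params, lr=3e-4)

def train_step(images, labels):
    c0 = encoder(patch_embed(images))             # feature vector c_0
    logits = classifier(c0)                       # softmax is folded into the loss below
    loss = nn.functional.cross_entropy(logits, labels)  # cross entropy with the ground truth
    optimizer.zero_grad()
    loss.backward()                               # gradients w.r.t. all model parameters
    optimizer.step()                              # gradient descent update
    return loss.item()
```

The same step would be run first on Dataset A for pre-training and then on Dataset B for fine-tuning, with the classifier head re-initialized to match the number of classes in Dataset B.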
The Transformer is advantageous over CNNs only if the dataset for pre-training is sufficiently large. The Transformer has an insatiable appetite for data; even 300 million images are not enough. Thank you for watching this lecture. The link to my slides can be found below the video.