Transcript for:
Understanding Vision Transformers in Computer Vision

Hello, I am Rowel Atienza, a professor at the University of the Philippines. Today, I'm going to talk about a recent breakthrough in the field of computer vision called the Vision Transformer, or simply ViT. This is the outline of my talk. First, let us discuss why the Vision Transformer matters. Then we introduce the concept of attention. Afterward, we will look into the model architecture of the Vision Transformer. We will also highlight its applications. Finally, we will discuss the limitations and future directions of vision transformers.

This is the Vision Transformer. Our goal is to analyze it block by block until we fully understand its functionality. In this talk, we will keep revisiting this figure.

Why does the Vision Transformer matter? The transformer is a high-capacity network architecture, meaning it can approximate complex functions. While most high-performing models in natural language processing are based on transformers, in computer vision the CNN has been the dominant network. Last year, a group of researchers from Google figured out how to make transformers work on vision. The Vision Transformer demonstrated that it can outperform CNN-based models in recognition, detection, segmentation, and other downstream tasks. Furthermore, since the transformer is a general-purpose architecture, it can process different data formats such as text, audio, image, and video. Therefore, multimodal deep learning is one of the big areas where we expect transformers to become useful. Note that multimodal learning is important because our world is inherently multimodal.

Before we discuss the concept of attention and the Vision Transformer, let us take a look at the state-of-the-art vision models. These are ResNet, RegNet, and EfficientNet. All of these are based on CNNs. There are other top-performing models that are not shown here.

The key foundation of transformers is attention. Let us have a look at the meaning of attention. Suppose we have a photo of a bird, and let us take two patches from the bird. What can we say about these two patches? Both patches are relevant in giving meaning to the concept of a bird. Both give visual clues to the presence of a bird in the image. In other words, we say there is high attention between these two patches. What if we take a patch from the background? This patch has no relevance to either of the first two patches. It does not give meaning to the concept of a bird. We can say that this patch has low attention with either of the first two patches. Roughly speaking, attention plays the same role in the human visual system. Given the same image, we can quickly work out which areas belong to the bird and which ones belong to the background. Attention between any two pixels in the bird area is high, while attention between any background pixel and any bird pixel is low. Due to resource limitations, the Vision Transformer computes attention between patches instead of between pixels.

How do we express attention mathematically? We can imagine that each patch is converted into a high-dimensional feature vector. The attention can then be computed using the dot product operator. For example, the dot product between the feature vectors of the two bird patches is much higher than the dot product between the background patch vector and either bird patch vector. Note that the dot product, or in this case attention, is also a measure of the projection of one vector onto the other.
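To make this dot-product view of attention concrete, here is a small PyTorch sketch of my own (not from the speaker's slides); the three feature vectors are hypothetical values invented purely for illustration:

```python
import torch

# Hypothetical feature vectors for two bird patches and one background patch.
# The numbers are invented just to illustrate the dot-product idea.
bird_patch_a = torch.tensor([0.9, 0.8, 0.1])
bird_patch_b = torch.tensor([0.8, 0.9, 0.2])
background = torch.tensor([-0.7, 0.1, 0.9])

# Dot product as a rough attention score: related patches score high.
print(torch.dot(bird_patch_a, bird_patch_b))  # high: both patches describe the bird
print(torch.dot(background, bird_patch_b))    # low: background is unrelated to the bird
```

In a real ViT the vectors are high-dimensional patch embeddings, but the same dot-product comparison is at the heart of the attention computation.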
Let us put attention in the NLP perspective. In this sentence, there is high attention between the word brown and the word fox, because brown is the adjective describing fox. Meanwhile, there is low attention between brown and dog because they have nothing to do with each other.

Let us build the Vision Transformer model step by step. Again, this is the Vision Transformer. Let us focus on the input image pre-processing stage. The process is called image-to-patches: since a transformer can only process a sequence of tensors, given an image, we split it into patches. For simplicity, we use 9 patches in this example. In ViT, we use 196 patches for ImageNet data. After the input image has been converted into patches, we convert the patches into feature vectors using a process called linear projection. Linear projection happens at the input stage of the Vision Transformer. All patches are arranged in sequence from top left to bottom right. All patches go through a linear projection layer to produce vectors z1 to z9. The mathematical operations involved in linear projection are as follows. First, each patch is reshaped into a 1D vector. Then, we multiply our flattened patch by a weight matrix W to produce the z vector. This process can be done for all patches in one step. We can easily spot that the linear projection can be done by a dense or linear layer without the bias term. Alternatively, linear projection can be done more efficiently using strided convolution. For the i-th patch, the first kernel produces the first element of the z vector. The last element is the output of the last filter. Altogether, the patch is linearly projected to a z vector. As the filters convolve with the image, moving with a stride equal to the patch size, the succeeding patches are also linearly projected. We can imagine the convolution and linear projection happening this way: the first patch is processed, then the second, then the third, and everything else follows until the last patch.

Let us move on to the next step, which is adding position embedding. Why do we need position embedding? Convolution is translation and scale equivariant. With reference to the image on the right, if the bird is translated and scaled, the corresponding output of the convolution is also translated and scaled accordingly. Note that this equivariance property is very important in object detection: as shown in this example, the bounding box must also translate and scale accordingly. Meanwhile, the pooling operation is translation and scale invariant, meaning that if the bird is translated and scaled, the output of pooling remains the same, or is not affected. This invariance property is important in recognition. When combined, CNNs achieve both equivariance and invariance. The problem is that transformers have no notion of equivariance. To resolve this issue, we introduce position embedding. Let us take a look at the situation when there is no position embedding. The Vision Transformer will classify the sequence of patches as a bird; there is nothing wrong with that. However, if we shuffle the sequence, the transformer will still classify it as a bird. From the human point of view, this is hardly a bird. Position embedding is as simple as adding a unique position code to the linear projection of each patch, so the Vision Transformer knows the arrangement of the sequence during training. This is the only inductive bias in the Vision Transformer; everything else is learned. We have different algorithms for position embedding. The original transformer used a sinusoidal function. In most common open-source implementations of ViT, a learnable embedding is used.
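As a rough sketch of the input stage described above (my own simplified PyTorch version, not the speaker's slide code), the strided convolution performs the linear projection, and the learnable position embedding and the class token introduced next are plain nn.Parameter tensors:

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Image-to-patches plus linear projection, with a learnable class token
    and a learnable position embedding. Sizes follow ViT-Base defaults."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2      # 196 patches for a 224x224 image
        # Strided convolution is equivalent to flattening each patch and
        # multiplying it by a weight matrix W (a dense layer without bias).
        self.proj = nn.Conv2d(in_channels, dim,
                              kernel_size=patch_size, stride=patch_size, bias=False)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # class embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # position embedding

    def forward(self, x):                      # x: (batch, 3, 224, 224)
        z = self.proj(x)                       # (batch, dim, 14, 14)
        z = z.flatten(2).transpose(1, 2)       # (batch, 196, dim), top left to bottom right
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        z = torch.cat([cls, z], dim=1)         # prepend the class token: (batch, 197, dim)
        return z + self.pos_embed              # add the position embedding
```

In practice the embeddings are usually initialized with small random values rather than zeros; zeros are used here only to keep the sketch short.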
In PyTorch, a learnable position embedding is simply a learnable parameter added to the patch vectors, as in the sketch above. Recently, rotary embedding was introduced in NLP, resulting in a significant improvement in performance. There are other position embedding methods used in the literature. Lastly, the Vision Transformer introduced a learnable class embedding or token. ViT also assigns a position embedding to this class embedding. In PyTorch, the learnable class embedding is likewise a learnable parameter, as shown in the same sketch. The final-layer feature vector corresponding to this learnable class embedding is used by the MLP head for classification. Later, we will revisit this output vector.

Let us move on. The most important part of the Vision Transformer is the encoder. Let us take a look at how the encoder module is built. Assuming we already have the patch feature vectors, this figure shows a simplified encoder module. Step by step, we will introduce additional elements into the module. The two key blocks are the self-attention and the MLP. The MLP has two layers with GELU activation. For the time being, let us focus on building the self-attention block, or simply attention. Attention is mathematically defined as a function of three variables: query, key, and value. We can interpret the query as the features we are interested in, and the key as features that may be relevant to the query. The dot product of query and key is scaled by a normalizing factor, which is the square root of the dimension d of the z vector. The normalized dot product is then converted into probabilities by the softmax function. Finally, we use these probabilities to weight the value vectors computed from the original input features. How do we compute query, key, and value from the input features? We concatenate all z vectors into a matrix. For the query, we multiply it by a weight matrix WQ. The same goes for key and value. Note that all of these operations can be done by a linear or dense layer.

Let us take a look at the improvements made to the encoder module. The two improvements are layer normalization and skip connections. This is the improved encoder module with layer normalization and skip connections. Before the attention and MLP blocks, the feature vectors are first layer-normalized. A skip connection is then applied after the attention and MLP blocks. Why use layer normalization? Like other normalization methods, it reduces training time and stabilizes training. What about the skip connection? It improves the performance by as much as 4% by propagating representations across layers. The last improvement in the encoder module is multi-head self-attention. Instead of just one attention block, we use multiple attention blocks, as shown by the green boxes in the encoder module. All attention blocks have identical network structure. Why does it matter? Generally, different learned representations from multiple heads can improve the overall network performance.

With the enhancements now complete, note that a transformer encoder is made of multiple encoder modules with identical network structure. This is done by stacking multiple encoder modules together. In this illustration, we have L encoder modules or layers.
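To make the encoder module concrete, here is a minimal PyTorch sketch of one pre-norm encoder block (my own simplified version, not the speaker's slide code); the query, key, and value projections and the scaled dot-product softmax are handled inside nn.MultiheadAttention:

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    """One ViT encoder module: LayerNorm -> multi-head self-attention -> skip,
    then LayerNorm -> two-layer MLP with GELU -> skip."""
    def __init__(self, dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # The WQ, WK, WV projections and the per-head softmax(QK^T / sqrt(d)) V
        # computation live inside this layer.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, dim),
        )

    def forward(self, z):                                   # z: (batch, tokens, dim)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # skip connection
        z = z + self.mlp(self.norm2(z))                     # skip connection
        return z
```

Stacking L such blocks on top of the patch embedding, and reading the class token of the last layer into an MLP head, gives the overall ViT.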
The last stage is the prediction head. The prediction head is very straightforward: it is just an MLP attached to the last-layer class feature vector. Let us now put all the modules together. First, we have the input, then we perform linear projection. The resulting vectors go up through the encoder modules. Finally, the feature corresponding to the class embedding goes through an MLP head for classification. This is the overall Vision Transformer.

Let us take a look at the different Vision Transformer versions. Like other popular models, ViT is available in different model sizes depending on the number of encoder layers, the hidden size D, the MLP size, and the number of heads. In the original Vision Transformer paper, Base is the simplest, with 12 encoder modules or layers, a hidden size of 768, an MLP size of 3,072, and 12 heads. This Base version has 86 million parameters, which is substantial. The Large and Huge versions are even bigger. Succeeding papers proposed smaller versions. The Small version has 12 layers, a hidden size of 384, an MLP size of 1,536, and 6 heads. The resulting number of parameters is 21 million, which is comparable to ResNet-50. The Tiny version is the most compact at just 5 million parameters.

For some time, researchers thought that transformers in vision might not work as effectively as in natural language processing tasks. It turned out that for the Vision Transformer to overcome its lack of inductive bias, it should be trained on a very large dataset. Let us see some details of Vision Transformer training. In a similar spirit as in NLP, the overall idea is to first pre-train the Vision Transformer on a very large dataset, then fine-tune it on a downstream task. An example of a very large dataset is JFT-300M, which is made of 303 million high-resolution images with 18,000 classes. ImageNet-21K is an example of a large dataset, made of 14 million images with 21,000 classes. It appears that by training on a very large dataset, ViT learns the inductive bias that is inherent in CNNs. This graph shows that the performance of ViT on ImageNet-1K recognition after fine-tuning improves with the size of the dataset used in pre-training. From the leftmost, if we pre-train ViT on ImageNet-1K alone, with just 1.2 million images, the performance is poor compared to the Big Transfer (BiT) ResNet. However, as we increase the dataset size, as shown in the middle and on the right, the performance improves significantly. The performance of the Vision Transformer is maximized when it is pre-trained on JFT-300M. This table summarizes the performance of ViT on different pre-training datasets: ImageNet-1K, ImageNet-21K, and JFT-300M. To put this in perspective, the performance of ViT-Base on ImageNet-1K classification is comparable to the performance of EfficientNet-B7, a state-of-the-art vision model. The accuracy of ViT-Large is comparable to the accuracy of EfficientNetV2-XL, a more recent state-of-the-art vision model. However, EfficientNetV2-XL is smaller by 30% in terms of the number of parameters. Parameter efficiency is one aspect of future work for vision transformers.

Immediately after ViT was released, people realized its potential and some inherent problems. In particular, training on a very large dataset that is not publicly available, JFT-300M, is a major issue. To overcome this problem, the Data-efficient Image Transformer, or DeiT, proposes a distinct knowledge distillation algorithm. DeiT uses a RegNetY of comparable size to ViT-Base as the teacher to train a student ViT using only the relatively small and publicly available ImageNet-1K. It turned out that the performance of DeiT is significantly better than ViT trained on the same dataset. For example, DeiT-Base is 4.5% higher in accuracy compared to ViT-Base. Note that architecture-wise, both networks are the same. As such, publicly available pre-trained models of ViT often use the DeiT weights.
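As a rough illustration of the knowledge distillation idea behind DeiT (a simplified hard-label sketch of my own; the actual DeiT recipe also adds a dedicated distillation token), the student ViT is trained to match both the ground-truth labels and the hard labels predicted by the CNN teacher:

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """Simplified DeiT-style hard distillation loss.

    The student is penalized for disagreeing with the ground truth and with the
    hard (argmax) labels of a pre-trained CNN teacher such as a RegNetY.
    """
    teacher_labels = teacher_logits.argmax(dim=-1)                  # hard teacher labels
    loss_true = F.cross_entropy(student_logits, targets)            # supervised term
    loss_teacher = F.cross_entropy(student_logits, teacher_labels)  # distillation term
    return (1 - alpha) * loss_true + alpha * loss_teacher

# Usage sketch: teacher and student are assumed to be already-built models.
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = hard_distillation_loss(student(images), teacher_logits, labels)
```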
After ViT was introduced, new state-of-the-art performances on different vision problems have been achieved. The new models take advantage of the strong local and global attention of vision transformers. In semantic segmentation, SETR reformulated the problem from the sequence-to-sequence perspective as an alternative to the dominant encoder-decoder FCN model design. A vision transformer is used to turn an image into a sequence, and then a CNN decoder network is used to predict the segmentation masks. It achieved state-of-the-art performance on ADE20K: on mIoU, SETR scored 50.3, while the baseline PSPNet only achieved 45. After SETR, SegFormer was introduced. Using a lightweight all-MLP decoder, the intermediate features of the transformer encoder are fused to predict the segmentation masks. SegFormer was able to outperform SETR. For example, on Cityscapes mIoU, SegFormer scored 84.4, while SETR scored only 82.2.

By extending the prediction layer of the Vision Transformer, we created ViTSTR. ViTSTR is a fast and efficient scene text recognition model. In ViTSTR, we use a pre-trained ViT and replace the head with a sequence of character predictions. The performance of ViTSTR is optimal in terms of accuracy versus number of parameters, accuracy versus FLOPS, and accuracy versus inference speed.

TransUNet employs a hybrid CNN-transformer architecture to take advantage of both the detailed high-resolution spatial information from CNN features and the global context encoded by a vision transformer. It achieves state-of-the-art performance on a CT scan dataset versus another state-of-the-art model which is also based on ViT: on the DSC metric, TransUNet achieved 77.5 while R50-ViT achieved 71.3. The Dice Similarity Coefficient, or DSC, is a validation metric used to evaluate the performance of segmentation models on medical images.

ViT is not perfect; it has certain limitations and points of improvement to be addressed in future research. First is the quadratic cost of computing attention. Second, if you recall, the number of parameters required by ViT is still high. The Base version, for example, has 86 million parameters, making it unsuitable for resource-constrained computing environments. Promising ideas being pursued in this space are combining MobileNetV3 and ViT, hybrid architectures like LeViT, and the use of depthwise convolution in multi-head self-attention as in ResT.

The broader implication of ViT is that transformers, or self-attention networks, appear to be a more general-purpose network that can process text, audio, image, and video. However, there is still some heavy engineering involved in the processing of input and output, as demonstrated in the classification, detection, and segmentation networks. Perceiver IO builds on the concept of self-attention and was able to demonstrate that it can be trained regardless of the structure of the input or output. In other words, Perceiver IO only assumes that input and output data are represented as tensors.

We do not have to build a ViT from scratch. There are many good open-source implementations of ViT, for example, Ross Wightman's timm module. In PyTorch, creating a ViT-Base is just one line of code. Usually, we take a pre-trained ViT model and fine-tune it on our target dataset (a short sketch of this workflow follows below). There is also Phil Wang's vit-pytorch project. Thank you for listening. I have a GitHub profile with open-source implementations of my papers and lecture notes.
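As a closing illustration of that workflow, here is a minimal fine-tuning sketch using the timm library; the model name, number of classes, and the dummy data loader are assumptions made for the example:

```python
import timm
import torch
from torch import nn

# One line to create a pre-trained ViT-Base (model name from timm's model zoo);
# num_classes replaces the classification head for the target dataset.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Dummy data stands in for a real dataset loader (assumption for the sketch).
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,)))]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)   # fine-tune on the target labels
    loss.backward()
    optimizer.step()
```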