Transformers for Computer Vision: Comparison
Please note this is a comparison between Version 1 by SONAIN JAMIL and Version 2 by Camila Xu.

Vision transformers (ViTs) are designed for tasks related to vision, including image recognition. Originally, transformers were used to process natural language (NLP). As a special type of transformer, vision transformers (ViTs) can be used for various computer vision (CV) applications.

  • vision transformers
  • computer vision
  • deep learning
  • image coding

1. Introduction

Vision transformers (ViTs) are designed for tasks related to vision, including image recognition [1]. Originally, transformers were used to process natural language (NLP). Bidirectional encoder representations from transformers (BERT) [2] and generative pretrained transformer 3 (GPT-3) [3] were the pioneers of transformer models for natural language processing. In contrast, classical image processing systems use convolutional neural networks (CNNs) for different computer vision (CV) tasks. The most common CNN models are AlexNet [4][5][4,5], ResNet [6], VGG [7], GoogleNet [8], Xception [9], Inception [10][11][10,11], DenseNet [12], and EfficientNet [13].
To track attention links between two input tokens, transformers are used. With an increasing number of tokens, the cost rises inexorably. The pixel is the most basic unit of measurement in photography, but calculating every pixel relationship in a normal image would be time-consuming; memory-intensive [14]. ViTs, however, take several steps to do this, as described below:
  • ViTs divide the full image into a grid of small image patches.
  • ViTs apply linear projection to embed each patch.
  • Then, each embedded patch becomes a token, and the resulting sequence of embedded patches is passed to the transformer encoder (TE).
  • Then, TE encodes the input patches, and the output is given to the multilayer perceptron (MLP) head, with the output of the MLP head being the input class.
Figure 1 shows the primary illustration of ViTs. In the beginning, the input image is divided into smaller patches. Each patch is then embedded using linear projection. Tokens are created from embedded patches that are given to the TE as inputs. Multihead attention and normalization are used by the TE to encode the information embedded in patches. The TE output is given to the MLP head, and the MLP head output is the input image class.
Figure 1.
ViT for Image Classification.
For image classification, the most popular architecture uses the TE to convert multiple input tokens. However, the transformer’s decoder can also be used for other purposes. As described in 2017, transformers have rapidly spread across NLP, becoming one of the most widely used and promising designs [15].
For CV tasks, ViTs were applied in 2020 [16]. The aim was to construct a sequence of patches that, once reconstructed into vectors, are interpreted as words by a standard transformer. Imagine that the attention mechanism of NLP transformers was designed to capture the relationships between different words within the text. In this case, the CV takes into account how the different patches of the image relate to one another.
In 2020, a pure transformer outperformed CNNs in image classification [16]. Later, a transformer backend was added to the conventional ResNet, drastically lowering costs while enhancing accuracy [17][18][17,18].
In the same year, several key ViT versions were released. These variants were more efficient, accurate, or applicable to specific regions. Swin transformers are the most prominent variants [19]. Using a multistage approach and altering the attention mechanism, the Swin transformer achieved cutting-edge performance on object detection datasets. There is also the TimeSformer, which was proposed for video comprehension issues and may capture spatial and temporal information through divided space–time attention [20].
ViT performance is influenced by decisions such as optimizers, dataset-specific hyperparameters, and network depth. Optimizing a CNN is significantly easier. Even when trained on data quantities that are not as large as those required by ViTs, CNNs perform admirably. Apparently, CNNs exhibit this distinct behavior because of some inductive biases that they can use to comprehend the particularities of images more rapidly, even if they end up restricting them, making it more difficult for them to recognize global connections. ViTs, on the other hand, are devoid of these biases, allowing them to capture a broader and more global set of relationships at the expense of more difficult data training [21].
ViTs are also more resistant to input visual distortions such as hostile patches and permutations [22]. Conversely, preferring one architecture over another may not be the best choice. The combination of convolutional layers with ViTs has been shown to yield excellent results in numerous CV tasks [23][24][25][23,24,25].
To train these models, alternate approaches were developed due to the massive amount of data required. It is feasible to train a neural network virtually autonomously, allowing it to infer the characteristics of a given issue without requiring a large dataset or precise labeling. It might be the ability to train ViTs without a massive vision dataset that makes this novel architecture so appealing.
ViTs have been employed in numerous CV jobs with outstanding and, in some cases, cutting-edge outcomes. The following are some of the important application areas:
  • Image classification;
  • Anomaly detection;
  • Object detection;
  • Image compression;
  • Image segmentation;
  • Video deepfake detection;
  • Cluster analysis.
Figure 2 shows that the percentage of the application of ViTs for image classification, object detection, image segmentation, image compression, image super-resolution, image denoising, and anomaly detection is 50%, 40%, 3%, less than 1%, less than 1%, 2%, and 3% respectively.
Figure 2.
Use of ViTs for CV applications.
ViTs have been widely utilized in CV tasks. ViTs can solve the problems faced by CNNs. Different variants of ViTs are used for image compression, super-resolution, denoising, and segmentation. With the advancement in the ViTs for CV applications, a state-of-the-art survey is needed to demonstrate the performance advantage of ViTs over current CV application approaches.

2. Advanced ViTs

In addition to their promising use in vision, some transformers have been particularly designed to perform a specific task or to solve a particular problem.

2.1. Task-Based ViTs

Task-based ViTs are those ViTs that are designed for a specific task and perform exceptionally well for that task. Lee et al. in [26][120] proposed the multipath ViT (MPViT) for dense prediction by embedding features of the same sequence length with the patches of the different scales. The model achieved superior performance for classification, object detection, and segmentation. However, the model is specific to dense prediction. In [27][121], the authors proposed the coarse-to-fine ViT (C2FViT) for medical image registration. C2FViT uses convolutional ViT [24][28][24,122] an ad multiresolution strategy [29][123] to learn global affine for image registration. The model was specifically designed for affine medical image registration. Similarly, in [30][124], the authors proposed TransMorph for medical image registration and achieved state-of-the-art results. However, these models are task-specific, which is why they are categorized as task-based ViTs here.

2.2. Problem-Based ViTs

Problem-based ViTs are those ViTs which are proposed to solve a particular problem that cannot be solved by pure ViTs. These types of ViTs are not dependent on tasks but rather on problems. For example, ViTs are not flexible. To make a ViT more flexible and to reduce its complexity, the authors in [31][125] proposed a messenger (MSG) transformer. They used specialized MSG tokens for each region. By manipulating these tokens, one can flexibly exchange visual information across the regions. This reduces the computational complexity of ViTs. Similarly, it has been discovered that mixup-based augmentation works well for generalizing models during training, especially for ViTs because they are prone to overfitting. However, the basic presumption of earlier mixup-based approaches is that the linearly interpolated ratio of targets should be maintained constantly with the percentage suggested by input interpolation. As a result, there may occasionally be no valid object in the mixed image due to the random augmentation procedure, but there is still a response in the label space. Chen et al. in [32][126] proposed TransMix for bridging this gap between the input and label spaces. TransMix blends labels based on the attention maps of ViTs. In ViTs, global attention is computationally expensive, whereas local attention provides limited interactions between tokens. To solve this problem, the authors in [33][127] proposed the CSWin transformer based on the cross-shaped window self-attention. This provided efficient computation of self-attention and achieved better results than did the pure ViTs.

3. Open Source ViTs

This section summarizes the available open-source ViTs with potential CV applications. Table 1 presents the comprehensive summary of the open-source ViTs for the different applications of CV such as image classification, object detection, instance segmentation, semantic segmentation, video action classification, and robustness evaluation.
Table 1.
Summary of the open-source ViTs present in the literature for different applications of CV.
[142], the authors proposed TransVisDrone, which is a spatio-temporal transformer for the detection of drones in aerial videos. The model obtained state-of-the-art performance on the NPS [54][143], FLDrones [55][144], and Airborne Object Tracking (AOT) datasets. Liu et al. [56][145] reported the use of ViT for drone crowd counting. The dataset used in the challenge was collected by drones. In [57][146], the authors used unmanned aerial vehicle (UAV) images of date palm trees to investigate the reliability and efficiency of various deep-ViTs. They used different ViTs such as Segformer [58][147], the Segmeter [59][148], the UperNet-Swin transformer, and dense prediction transformers (DPT) [60][149]. Based on the comprehensive experimental analysis, Segformer achieved the highest performance. Zhu et al. [61][150], proposed TPH-YOLOv5 for which they replaced the original prediction head of YOLOv5 with the transformer prediction head (TPH) to overcome the challenges of objection in the drone-captured images. In [62][151], the authors summarized the results of the challenger VisDrone-DET2021 in which the proponents used different transformers, such as Scaled-YOLOv4 with transformer and BiFPN (SOLOER), Swin-transformer (Swin-T), stronger visual information for tiny object detection (VistrongerDet), and EfficientDet for object detection in the drone imagery. Thai et al. [63][152] demonstrated the use of ViT for cassava leaf disease classification and achieved better performance than did the CNNs. A detailed summary of the existing ViTs for drone imagery data is presented in Table 2.
Table 2.
ViTs for drone imagery.
Ref. Model Dataset Objective Perf. Metric Value
[53][142] TransVisDrone
  • NPS
  • FLDrones
  • AOT
  • Img. class. a
  • Object det. b
  • Rob. eval. e
Drone detectionhttps://github.com/naver-ai/pit (accessed on 19 April 2023)
AP@0.5IoU
  • 0.95
  • 0.75
  • 0.80
[16]
[57][1462020 ]
  • Segformer
  • Segmenter
  • UperNet-Swin
  • DPT
ViT *
  • Img. class. a
https://github.com/google-research/vision_transformer (accessed on 19 April 2023)
Date palm trees Segmentation mIoU α
  • ≈86.2%
  • ≈85.3%
  • ≈85.8%
  • ≈85.4%
[19] 2021 Swin Transformer
  • Img. class. a
  • Object det. b
  • Semantic seg.
[57][146]
  • Segformer
  • Segmenter
  • UperNet-Swin
  • DPT
  • c
https://github.com/microsoft/Swin-Transformer (accessed on 19 April 2023)
Date palm trees Segmentation mF-Score β
  • ≈92.5%
  • ≈91.9%
  • ≈92.3%
  • ≈92.0%
[34][32]
[2021 57][146]
  • Segformer
  • Segmenter
  • UperNet-Swin
Cross-ViT
  • DPT
  • Img. class. a
  • Object det. b
https://github.com/IBM/CrossViT (accessed on 19 April 2023)
  • ≈92.4%
  • ≈91.7%
[28][122]
[612021 ][150] TPH-YOLOv5 ††CeiT γ
  • Img. class. a
https://github.com/rishikksh20/CeiT-pytorch (accessed on 19 April 2023)
VisDrone2021 Object detection mAP δ
  • 39.2%
[35][128]
[62][2022 151]
  • DBNet
  • SOLOer
  • Swin-T
  • TPH-YOLOv5
Swin Transformer V2
  • VistrongerDet
  • cascade
  • ††
  • DNEFS
  • EfficientDet
  • DPNet-ensemble
  • DroneEye2020
  • Cascade R-CNN
  • Img. class. a
  • Object det. b
  • Semantic seg. c
  • Vid. act. class. d
https://github.com/microsoft/Swin-Transformer (accessed on 19 April 2023)
VisDrone-DET2021 Object detection AP
  • 39.43%
  • 39.42%
  • 39.40%
  • 39.18%
  • 38.77%
  • 38.72%
  • 38.53%
  • 38.51%
  • 37.37%
  • 34.57%
  • 16.09%
[36][129] 2021 DVT
  • Img. class. a
https://github.com/blackfeather-wang/Dynamic-Vision-Transformer (accessed on 19 April 2023)
[37][130] 2021 PVT ††
  • Object det. b
  • Instance seg. c
  • Semantic seg. c
https://github.com/whai362/PVT (accessed on 19 April 2023)
[38][131] 2021 Twins
  • Img. class. a
  • b
  • Seg. c
https://github.com/Meituan-AutoML/Twins (accessed on 19 April 2023)
[39][132] 2021 Mobile-ViT
  • Object det. b
https://github.com/apple/ml-cvnets (accessed on 19 April 2023)
[40][133] 2021 Refiner
  • Img. class. a
https://github.com/zhoudaquan/Refiner_ViT (accessed on 19 April 2023)
[41][134] 2021 DeepViT †††
  • Img. class. b
https://github.com/zhoudaquan/dvit_repo (accessed on 19 April 2023)
[42][135] 2021 DeiT ††††
  • Img. class. a
https://github.com/facebookresearch/deit (accessed on 19 April 2023)
[43][136] 2021 Visformer
  • Img. class. a
https://github.com/danczs/Visformer (accessed on 19 April 2023)
Video Production Service