With the rise of general-purpose models, transformers have been adopted in visual object tracking algorithms as feature fusion networks. In these trackers, self-attention is used for global feature enhancement, while cross-attention fuses the features of the template and the search region to capture global information about the object. However, studies have found that the features fused by cross-attention do not pay enough attention to the object region. To strengthen cross-attention in the object region, an enhanced cross-attention (ECA) module is proposed.
1. Introduction
Single-object tracking is pivotal in computer vision
[1][2][3] and is tasked with estimating an object’s position and motion trajectory across frames. Challenges such as size variations, occlusions, and deformations remain persistent issues
[4][5].
In the field of computer vision, advancements have been made in CNN-based tracking algorithms, such as the Siamese networks
[6][7][8], particularly in addressing challenges like size variations, occlusions, and deformations. SiamFC
[6] is a classic example that employs cross-correlation operations to compute a score map, determining the target’s bounding box by convolving the feature maps of the template and search regions. However, because the score map differs significantly from the feature map, this approach discards the semantic information of the target, which hinders the subsequent classification and regression operations. To solve this problem, many transformer-based trackers
[9][10][11][12] are used for feature fusion. MixFormer
[10] proposes a mixed attention module (MAM), which applies the attention mechanism to simultaneously extract features and exchange information. OSTrack
[11] integrates the target information more widely into the search area to better capture their correlation. In
[12], the learning and memory capabilities of the transformer are utilized by encoding the information required for tracking into multiple tokens. Through cross-attention operations in the cross-feature augment (CFA) module, TransT
[9] obtains fused features containing rich semantic information from templates and search regions. This fusion feature can be directly used for the subsequent classification and regression operations to estimate the state of the target.
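The cross-attention fusion described above can be illustrated with a minimal, single-head sketch: queries come from the search region and keys/values from the template, so each search position aggregates template information. This is a plain NumPy illustration without learned query/key/value projections or multiple heads, not TransT’s exact CFA implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(search, template, d_k):
    """Search-region queries attend to template keys/values.

    search:   (Nq, d) feature vectors of the search region
    template: (Nk, d) feature vectors of the template
    Returns a fused feature of shape (Nq, d).
    """
    attn = softmax(search @ template.T / np.sqrt(d_k))  # (Nq, Nk), rows sum to 1
    return attn @ template                              # fused feature per search position
```

The fused output has one vector per search-region position, which is what allows the subsequent classification and regression heads to operate on it directly.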
With the rise of Vision Transformers
[13][14], self-attention techniques show promise in visual tasks. However, their quadratic computational complexity has prompted research into mitigating this challenge. Sparse global attention patterns
[15][16] and smaller attention windows
[17][18] can reduce costs, but at the risk of overlooking information or sacrificing long-term dependency modeling. In Fast re-OBJ
[19], the authors efficiently leverage the intermediate outputs of the instance segmentation backbone (ISB) for triplet-based training, avoiding redundant feature extraction between the ISB and the embedding generation module (EGM).
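The cost trade-off between full self-attention and smaller attention windows mentioned above can be made concrete with a toy count of query–key score entries (a back-of-the-envelope sketch, not the cost model of any specific method):

```python
def attention_cost(n, window=None):
    """Count the query-key score entries an attention layer computes.

    n:      sequence length
    window: local attention window size, or None for full attention
    """
    if window is None:
        return n * n           # full self-attention: quadratic in n
    return n * min(window, n)  # local window: linear in n for fixed window
```

For a 1024-token sequence, full attention computes 1,048,576 scores, while a 64-token window computes 65,536 — the linear-versus-quadratic gap that motivates the sparse and windowed schemes cited above, at the risk of losing long-range dependencies.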
2. Transformer-Based Visual Object Tracking
In recent years, convolutional networks have been widely used in computer vision tasks
[20], including object detection, segmentation, and tracking. SiamFC
[6] was the first to apply Siamese networks for visual object tracking, which predicts the object states by calculating the distance between the current frame and the template. SiamFC++
[8] is an improved version of SiamFC that introduces a new network architecture and training method, enabling object tracking in different scenarios. SiamMask
[21] uses Siamese networks for both object tracking and segmentation simultaneously. SA-Sia
[22] utilizes feature mapping to extract spatial information leading to more accurate and robust object tracking. SAT
[7] employs a deep Siamese network to assess the similarity between objects in each frame, offering a novel solution for evaluating and addressing the association challenges in consecutive frames. However, the local matching strategy, which is based on cross-correlation, can lead to sub-optimal results, especially when the object is occluded or only partially visible. Moreover, semantic information about the object may be lost during the correlation operation, resulting in imprecise object boundaries. Thus, improved transformer and attention mechanisms have been proposed to replace traditional correlation operations in object tracking. They effectively extract global contextual information while preserving the object’s semantic information, yielding more robust and accurate tracking results.
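The cross-correlation matching these Siamese trackers rely on can be sketched as a sliding-window dot product between the template feature map and the search feature map. The naive NumPy loop below uses hypothetical tensor shapes and is only an illustration of the operation, not SiamFC’s actual (batched, GPU) implementation:

```python
import numpy as np

def cross_correlation_score_map(search_feat, template_feat):
    """Slide the template feature map over the search feature map.

    search_feat:   (C, Hs, Ws) search-region features
    template_feat: (C, Ht, Wt) template features, used as the kernel
    Returns a (Hs-Ht+1, Ws-Wt+1) score map; its peak indicates the
    most likely target location.
    """
    C, Hs, Ws = search_feat.shape
    _, Ht, Wt = template_feat.shape
    out_h, out_w = Hs - Ht + 1, Ws - Wt + 1
    score = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = search_feat[:, i:i + Ht, j:j + Wt]
            score[i, j] = np.sum(patch * template_feat)  # dot product at (i, j)
    return score
```

Note that the output is a single scalar per location: the channel-wise semantic structure of the features is collapsed away, which is exactly the information loss the paragraph above points out.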
The transformer is a popular neural network architecture that is widely used in natural language processing (NLP) tasks
[23]. It consists of attention-based encoders and decoders. Self-attention, which is the main module of a transformer, can compute representations of input sequences. It allows each position in the input sequence to attend to the other positions and compute weighted averages of their values. DETR
[24] and ViT
[13] are early methods for introducing transformer models into the field of computer vision. In visual object tracking fields, transformer-based tracking methods achieve significant improvements compared to Siamese network-based trackers. The correlation operation in the Siamese network is replaced by a self-attention module from the transformer in TransT
[9] to fuse the information between the template and search region. Stark
[25] applies an encoder–decoder structure in tracking, where the encoder models the global spatiotemporal feature dependencies between the object and search region, while the decoder learns embedded queries to predict the spatial location of the object. SwinTrack
[26] introduced a fully attention-based transformer algorithm for feature extraction and feature fusion, enabling complete interaction between the object and the search region during tracking. The encoder–decoder framework based on transformers is widely applied to sequence prediction tasks, as self-attention models the interactions between elements of a sequence. Building on these works, researchers have found that traditional transformer methods struggle to distinguish the object to be tracked from similar interfering objects. To enhance the object information within the fused features, the ECA module is proposed to strengthen attention on the tracked object. The ECA module calculates the average attention level for each position in the fused feature sequence and up-weights the positions with higher attention levels. This enhances the feature information in the object region and improves matching accuracy.
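One plausible reading of the ECA re-weighting described above is sketched below: cross-attention is computed as usual, each search position’s attention level is summarized, and above-average positions are amplified. The specific gating rule here (each position’s peak template response, normalized by the mean over positions) is an illustrative assumption, not the paper’s exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def enhanced_cross_attention(search, template, d_k):
    """Cross-attention followed by a simple object-region gate.

    search:   (Nq, d) search-region features (queries)
    template: (Nk, d) template features (keys/values)
    Returns the gated fused feature and the per-position gate.
    """
    attn = softmax(search @ template.T / np.sqrt(d_k))  # (Nq, Nk)
    fused = attn @ template                             # standard fused feature
    # attention level of each search position: its strongest template response
    level = attn.max(axis=-1)                           # (Nq,)
    gate = level / level.mean()                         # >1 for above-average positions
    return fused * gate[:, None], gate
```

Positions whose attention to the template exceeds the average are scaled up, concentrating the fused feature on the likely object region before classification and regression.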
In recent years, a series of new attention mechanisms, such as fast attention
[27], have garnered research interest. Sparse attention
[28] reduces computational costs by constraining attention weights to consider only relationships within a neighborhood. Shaw et al.
[29] proposed an attention mechanism that embeds relative positional information into the attention calculation, allowing the model to better handle dependencies in long sequences. This mechanism has shown excellent performance in tasks involving long sequences, such as text generation. Orthogonal random features have also been proposed to achieve fast attention operations. They decompose the attention matrix into a product of random nonlinear functions of the original queries and keys, avoiding the explicit construction of a quadratic-sized attention matrix.
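The neighborhood constraint of sparse attention can be sketched with a banded mask that zeroes out attention beyond a fixed radius. This toy single-head NumPy version still materializes the full score matrix before masking; real sparse-attention implementations avoid computing the masked entries at all, which is where the cost saving comes from:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def banded_mask(n, radius):
    """True where |i - j| <= radius, i.e. within the local neighborhood."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= radius

def sparse_self_attention(x, radius, d_k):
    """Self-attention where each position attends only to its neighbors.

    x: (N, d) input sequence; radius: neighborhood half-width.
    """
    scores = x @ x.T / np.sqrt(d_k)
    # -inf outside the band -> zero weight after softmax
    scores = np.where(banded_mask(len(x), radius), scores, -np.inf)
    attn = softmax(scores)
    return attn @ x, attn
```

Each row of the attention matrix has at most 2·radius + 1 nonzero entries, so the effective cost per position is constant rather than growing with the sequence length.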