Multi-target tracking is an advanced visual task in computer vision and is essential for understanding the environment in autonomous driving. Owing to the excellent performance of deep learning in visual object tracking, many state-of-the-art multi-target tracking algorithms have been developed.
Tracking Algorithm Framework | Principle | Advantage | Disadvantage |
---|---|---|---|
TBD | All objects of interest are detected in each frame of the video and then associated with the objects detected in the previous frame to achieve tracking | Simple structure; strong interpretability | Over-reliance on object detector performance; bloated algorithm design |
JDT | End-to-end trainable paradigm that jointly learns detection and appearance features | Multi-module joint learning; weight sharing | Local receptive field; tracking degrades when the object is occluded |
Transformer-based | Transformer encoder-decoder architecture that captures global and rich contextual interdependencies for tracking | Parallel computing; rich global and contextual information; greatly improved tracking accuracy, with great potential in the field of computer vision | Very large parameter count and high computational overhead; Transformer-based networks have not yet been fully adapted to computer vision |
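To make the TBD association step above concrete, the following is a minimal sketch of matching the previous frame's tracks to the current frame's detections by IoU overlap with the Hungarian algorithm. All names are illustrative rather than taken from any specific tracker, and real systems typically add motion prediction (e.g., a Kalman filter) and appearance cues.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Match last-frame track boxes to current-frame detections by maximizing
    total IoU (Hungarian algorithm). Returns matched (track, detection) index
    pairs plus the indices of unmatched tracks and unmatched detections."""
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols)
               if 1.0 - cost[r, c] >= iou_threshold]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    return (matches,
            [i for i in range(len(tracks)) if i not in matched_t],
            [j for j in range(len(detections)) if j not in matched_d])
```

Unmatched detections then typically spawn new tracks, while unmatched tracks are kept alive for a few frames before deletion.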
Deep learning has the benefit over more conventional machine learning techniques in that it can automatically identify data attributes that are pertinent to a specific task. For deep learning-based computer vision algorithms, datasets are crucial. The datasets and traits frequently utilized in the field of autonomous driving tracking are outlined in the following. Due to their frequent updates and closer resemblance to real scenes, the MOT datasets [20,21] attract the most attention in the field of visual multi-object tracking, and most cutting-edge tracking methods are tested on them.
The MOT16 dataset [20] is used exclusively for tracking pedestrians. There are 14 videos in total: 7 training sets and 7 test sets. These videos were recorded in a variety of ways, including with fixed and moving cameras and from various shooting perspectives, and the shooting conditions also vary with day, night, and weather. MOT16 provides public detections from the DPM detector, which performs well at detecting pedestrians. The video content of the MOT17 dataset [20] is the same as that of MOT16, but it additionally provides the detection results of two further detectors, SDP and Faster R-CNN, and has more accurate ground-truth annotations. The MOT20 dataset [21] has 8 video sequences: 4 training sets and 4 test sets; the pedestrian density is further increased, with an average of 246 pedestrians per frame. The KITTI dataset [22,23] is currently the largest dataset for evaluating computer vision algorithms in autonomous driving scenarios. Its data are used to evaluate 3D object detection and tracking, visual odometry, stereo images, and optical flow.
The NuScenes dataset [24] provides a large dataset of full sensor data for autonomous vehicles, including six cameras, one lidar, five radars, as well as GPS and IMU. Compared with the KITTI dataset, it includes more than seven times more object annotations. For each scene, key frames are selected for annotation at a rate of 2 Hz. However, it is worth noting that, since 23 types of objects are annotated, the class-imbalance problem is more serious.
The Waymo dataset [25] is collected with five LiDARs and five high-resolution pinhole cameras. The entire dataset contains 1150 scenes, divided into 1000 training sets and 150 test sets, with a total of about 12 million LiDAR annotation boxes and approximately 12 million image annotation boxes.
The Mapillary Traffic Sign dataset [26] is the largest and most diverse traffic sign dataset in the world, and it can be used for research on the automatic detection and classification of traffic signs in autonomous driving.
To support visual multi-object tracking tasks, we gather and introduce the datasets listed in the table below. Most detection and tracking objects in these datasets are cars and pedestrians, which helps advance autonomous driving.
Ref. | Datasets | Year | Feature | DOI/URL |
---|---|---|---|---|
[20,21,27] | MOT15, 16, 17, 20 | 2015–2020 | Sub-datasets containing multiple different camera angles and scenes | https://doi.org/10.48550/arXiv.1504.01942 https://motchallenge.net/ |
[22,23] | KITTI-Tracking | 2012 | Provides annotations for cars and pedestrians; scene objects are sparse | https://doi.org/10.1177/0278364913491297 https://www.cvlibs.net/datasets/kitti/eval_tracking.php |
[24] | NuScenes | 2019 | Dense traffic and challenging driving conditions | https://doi.org/10.48550/arXiv.1903.11027 https://www.nuscenes.org/ |
[25] | Waymo | 2020 | Diversified driving environment, dense label information | https://doi.org/10.48550/arXiv.1912.04838 https://waymo.com/open/data/motion/tfexample |
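As a practical illustration of working with the MOT datasets above, the sketch below parses a MOT Challenge ground-truth file (gt.txt), whose rows follow the layout frame, track ID, box left, top, width, height, confidence/validity flag, class, visibility. The function name and the choice to drop rows flagged with confidence 0 are illustrative assumptions, not part of any official toolkit.

```python
import csv
from collections import defaultdict

def load_mot_gt(path):
    """Parse a MOT Challenge gt.txt into {frame: [(track_id, x, y, w, h), ...]}.
    Assumed row layout: frame, id, bb_left, bb_top, bb_width, bb_height,
    conf, class, visibility."""
    boxes_per_frame = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            frame, track_id = int(row[0]), int(row[1])
            x, y, w, h = map(float, row[2:6])
            if int(float(row[6])) == 0:  # flag 0 marks ignored annotations
                continue
            boxes_per_frame[frame].append((track_id, x, y, w, h))
    return boxes_per_frame
```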
Setting realistic and accurate evaluation metrics is essential for comparing the effectiveness of visual multi-object tracking algorithms in an unbiased and fair manner. Multi-object tracking assessment indicators are built around three criteria: whether the object detection is real-time, whether the predicted position matches the actual position, and whether each object maintains a distinct ID [28]. The MOT Challenge offers the recognized MOT evaluation metrics.
MOTA (Multi-Object Tracking Accuracy): the accuracy of multi-object tracking, used to count the accumulation of errors during tracking, including the number of tracked objects and whether they match:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}$$
where FN (False Negative) is the number of ground-truth boxes that do not match any predicted box; FP (False Positive) is the number of predicted boxes that do not match any ground-truth box; IDSW (ID Switch) is the number of times an object ID changes; GT (Ground Truth) is the number of ground-truth objects.
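A minimal sketch of this accumulation, assuming per-frame error counts have already been produced by a matching step (function and argument names are illustrative):

```python
def mota(fn_per_frame, fp_per_frame, idsw_per_frame, gt_per_frame):
    """MOTA accumulates errors over all frames:
    1 - (sum FN_t + sum FP_t + sum IDSW_t) / sum GT_t."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(idsw_per_frame)
    return 1.0 - errors / sum(gt_per_frame)
```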
MOTP (Multi-Object Tracking Precision): the precision of multi-object tracking, used to evaluate how accurately the object positions are localized:

$$\mathrm{MOTP} = \frac{\sum_{t,i} B_t(i)}{\sum_t C_t}$$
where $C_t$ is the number of matches between ground-truth objects and predicted objects in the $t$-th frame, and $B_t(i)$ is the distance between the position of the $i$-th matched object in the $t$-th frame and its predicted position, also known as the matching error.
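Correspondingly, a minimal MOTP sketch, assuming the per-frame matching distances $B_t(i)$ are available (names illustrative):

```python
def motp(match_distances_per_frame):
    """MOTP = total matching error / total number of matches.
    match_distances_per_frame: one list of B_t(i) distances per frame t;
    the number of distances in frame t equals C_t."""
    total_dist = sum(sum(dists) for dists in match_distances_per_frame)
    total_matches = sum(len(dists) for dists in match_distances_per_frame)
    return total_dist / total_matches
```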
AMOTA (Average Multi-Object Tracking Accuracy): summarizes MOTA over all object confidence thresholds instead of using a single threshold. Similar to mAP in object detection, it evaluates the overall accuracy of the tracking algorithm across all thresholds, which improves the robustness of the comparison. AMOTA is calculated by integrating MOTA under the recall curve, using interpolation to approximate the integral and simplify the calculation:

$$\mathrm{AMOTA} = \frac{1}{L} \sum_{r \in \{\frac{1}{L}, \frac{2}{L}, \ldots, 1\}} \mathrm{MOTA}_r$$
where $L$ is the number of recall values (confidence thresholds used for the integration); the higher $L$ is, the more accurate the approximate integral. $\mathrm{MOTA}_r$ denotes the multi-object tracking accuracy at a specific recall value $r$.
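A sketch of this interpolation; it assumes a hypothetical callable `mota_at_recall(r)` that evaluates the tracker at the confidence threshold achieving recall r (how that threshold is found is tracker-specific):

```python
def amota(mota_at_recall, L=40):
    """Approximate the integral of MOTA over recall by averaging MOTA at L
    evenly spaced recall values r = 1/L, 2/L, ..., 1 (L = 40 is a common
    benchmark choice)."""
    recalls = [(k + 1) / L for k in range(L)]
    return sum(mota_at_recall(r) for r in recalls) / L
```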
AMOTP (Average Multi-Object Tracking Precision): calculated in the same way as AMOTA, with recall as the abscissa and MOTP as the ordinate; interpolation is used to obtain AMOTP.
IDF1 (ID F1 score): measures how consistently the predicted IDs match the correct (ground-truth) IDs over entire trajectories.
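Concretely, IDF1 is conventionally computed from identity-level true positives (IDTP), false positives (IDFP), and false negatives (IDFN) obtained from an optimal one-to-one matching between predicted and ground-truth trajectories:

$$\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}$$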
MT (Mostly Tracked): the percentage of ground-truth trajectories that are successfully tracked for at least 80% of their length, out of all tracked objects.
ML (Mostly Lost): the percentage of ground-truth trajectories that are successfully tracked for no more than 20% of their length, out of all tracked objects.
FM (Fragmentation): evaluates tracking integrity; a fragmentation is counted whenever a trajectory changes its state from tracked to untracked and the same trajectory is tracked again at a later point in time.
HOTA (Higher Order Tracking Accuracy): a higher-order metric for evaluating MOT proposed in [29]. Previous metrics overemphasized the importance of either detection or association. This metric explicitly balances the effects of accurate detection, association, and localization in a unified score for comparing trackers, and HOTA scores are more consistent with human visual evaluation:

$$\mathrm{HOTA}_{\alpha} = \sqrt{\frac{\sum_{c \in \mathrm{TP}} \mathcal{A}(c)}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|}}, \qquad \mathcal{A}(c) = \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)| + |\mathrm{FNA}(c)| + |\mathrm{FPA}(c)|}$$
where $\alpha$ is the IoU threshold used for matching, and $c$ ranges over the true-positive matches between predicted and ground-truth trajectories. In an object tracking experiment there are predicted detection trajectories and ground-truth trajectories. The intersection between the two trajectories is called a true positive association (TPA); predicted detections outside the intersection are false positive associations (FPA); ground-truth detections outside the intersection are false negative associations (FNA).
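A minimal sketch of the HOTA computation at a single threshold $\alpha$ and its averaging over thresholds; the per-match association counts (TPA, FNA, FPA) are assumed to come from an upstream matching step, and all names are illustrative:

```python
import math

def hota_alpha(tp_assoc_counts, num_fn, num_fp):
    """HOTA at one IoU threshold alpha.
    tp_assoc_counts: list of (tpa, fna, fpa) triples, one per TP match c."""
    if not tp_assoc_counts:
        return 0.0
    assoc = sum(tpa / (tpa + fna + fpa) for tpa, fna, fpa in tp_assoc_counts)
    return math.sqrt(assoc / (len(tp_assoc_counts) + num_fn + num_fp))

def hota(hota_alpha_values):
    """Final HOTA averages HOTA_alpha over a range of IoU thresholds
    (e.g., alpha = 0.05, 0.10, ..., 0.95)."""
    return sum(hota_alpha_values) / len(hota_alpha_values)
```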