Pedestrian Tracking in Autonomous Vehicles: Comparison
Please note this is a comparison between Version 1 by Majdi Sukkar and Version 2 by Rita Xu.

Effective collision risk reduction in autonomous vehicles relies on robust and straightforward pedestrian tracking. Challenges posed by occlusion and switching scenarios significantly impede the reliability of pedestrian tracking.

  • pedestrian tracking
  • object detection
  • multi-object tracking (MOT)
  • autonomous vehicles

1. Introduction

In recent years, the significance of pedestrian tracking in autonomous vehicles has garnered considerable attention due to its pivotal role in ensuring pedestrian safety. Existing state-of-the-art approaches for pedestrian tracking in autonomous vehicles predominantly rely on object detection and tracking algorithms, including Faster R-CNN (Region-based Convolutional Neural Network), YOLO (You Only Look Once) and SORT algorithms [1]. However, these methods encounter issues in scenarios where pedestrians are occluded or only partially visible [2][2]. These challenges serve as the impetus for our research, as we strive to enhance the reliability and effectiveness of pedestrian tracking, particularly in complex scenarios.
More specifically, the challenges in pedestrian tracking are multifaceted, ranging from crowded urban environments to unpredictable pedestrian behavior [3]. Existing algorithms often struggle to handle scenarios where individuals move behind obstacles, cross paths, or exhibit sudden changes in direction. Furthermore, adverse weather conditions, low lighting, and dynamic urban landscapes pose additional hurdles for accurate pedestrian tracking [4]. These issues underscore the need for advanced pedestrian tracking solutions that can adapt to diverse and complex real-world scenarios.
In response to these challenges, theour research takes inspiration from recent advancements in deep learning and object detection. Deep learning techniques, with their ability to learn intricate patterns and representations from data, have shown promise in overcoming the limitations of traditional tracking methods. Leveraging the capabilities of the YOLOv8 algorithm for object detection, theour approach aims to enhance the accuracy and robustness of pedestrian tracking in dynamic and challenging environments [5]. By addressing the limitations of existing algorithms, rwesearchers aspire to contribute to the development of pedestrian tracking systems that are both reliable and adaptable.
The quest for improved pedestrian tracking is not solely confined to the domain of autonomous vehicles [6]. The relevant applications extend to various fields, such as surveillance, crowd management, and human–computer interaction. Accurate pedestrian tracking is crucial for ensuring public safety, optimizing traffic flow, and enhancing the overall efficiency of smart city initiatives [7][8][7,8]. The effectiveness of pedestrian tracking is also integral for a myriad of applications in the context of urban environments, and in smart cities, where the integration of technology aims to enhance the quality of life, with pedestrian tracking playing a crucial role. Efficient tracking systems can contribute to optimized traffic management, improved public safety, and enhanced urban planning [9]. Beyond traffic applications, pedestrian tracking finds applications in surveillance, where monitoring and analyzing pedestrian movement are essential for security [10]. Additionally, in human–computer interaction scenarios, accurate tracking is pivotal for creating responsive and adaptive interfaces, offering a wide array of possibilities for innovative applications [11]. Therefore, the advancements in pedestrian tracking have far-reaching implications, influencing various aspects of theour daily lives and the development of smart city ecosystems.

2. Pedestrian Tracking in Autonomous Vehicles

Recent advancements in pedestrian tracking encompass a diverse array of methods and algorithms, broadly classified into the following categories:
  • Multi-Object Tracking (MOT) Methods:
    • These methods are designed to concurrently track multiple pedestrians within a scene. Traditional approaches often employ the Hungarian algorithm for association, linking detections across frames [5].
    • Recent advancements leverage deep neural networks, such as TrackletNet [12][15] and DeepSORT, to learn features and association scores, enhancing tracking accuracy.
  • Re-identification-based Methods:
    • This category integrates appearance-based and geometric feature-based techniques to identify and track pedestrians across different camera views. Notably, Siamese networks are employed to learn a similarity metric between pairs of images [13][16].
    • Contemporary methods incorporate attention mechanisms to emphasize discriminative regions of pedestrians in order to improve tracking precision [14][17].
  • Online Multi-Object Tracking: Online tracking methods dynamically track multiple pedestrians in real-time. Techniques include deep learning-based re-identification [15][18] and online multiple hypothesis tracking [16][19].
  • Multi-Cue Fusion: Strategies in this category amalgamate various cues, such as color, shape, and motion, to enhance tracking robustness in complex scenes. For instance, the multi-cue multi-camera pedestrian tracking (MCMC-PT) method integrates color, shape, and motion cues from multiple cameras for comprehensive pedestrian tracking [15][18].
This diverse landscape of pedestrian tracking methodologies underscores the ongoing efforts to address challenges in occlusion, partial visibility, and dynamic scenarios. Each approach brings its unique strengths, contributing to the advancement of pedestrian tracking in autonomous vehicle applications. The subsequent sections of this paper delve into the differentiation between algorithms, detailed technology introductions, and present the proposed StrongSORT algorithm’s effectiveness in addressing these challenges. In the research presented in [17][20], three frameworks for multi-object tracking were evaluated: tracking-by-detection (TBD), joint-detection-and-tracking (JDT), and a transformer-based tracking method. DeepSORT and StrongSORT were classified under the TBD framework. The front-end detector’s performance significantly impacts tracking, and enhancing it is crucial. The transformer-based framework excels in MOTA (multiple object tracking accuracy) but has a large model size, while the JDT framework balances accuracy and real-time performance. In [18][21], an occlusion handling strategy for a multi-pedestrian tracker is proposed, capable of retrieving targets without the need for re-identification models. The tracker can manage inactive tracks and cope with tracks leaving the camera’s field of view, achieving state-of-the-art results on three popular benchmarks. However, the performed comparison does not include DeepSORT or StrongSORT algorithms. The authors of [19][13] propose a framework model based on YOLOv5 and StrongSORT tracking algorithms for real-time monitoring and tracking of workers wearing safety helmets in construction scenarios. The use of deep learning-based object detection and tracking algorithms improves accuracy and efficiency in helmet-wearing detection. The study suggests that changing the box regression loss function from CIOU to Focal-EIOU can further improve detection performance. However, the study’s limitation is its evaluation on a specific dataset, potentially limiting generalization to other datasets or scenarios. The VOT2020 challenge assessed different tracking scenarios, introducing innovations like using segmentation masks instead of bounding boxes and new evaluation methods. Most trackers relied on deep learning, particularly Siamese networks, emphasizing the role of AI and deep learning in advancing object tracking for future improvements [20][22]. In [21][23], the utilization of the YOLOv5 model and the StrongSORT algorithm for ship detection, classification, and tracking in maritime surveillance systems is examined. The practical results demonstrate high accuracy in ship classification and the capability to track at a speed approaching real-time. The influence of StrongSORT contributes to enhancing tracking speed, confirming its effectiveness in maritime surveillance systems. In [22][24], fine-tuning plays a crucial role in adapting the pre-trained YOLOv5 model for brain tumor detection, significantly enhancing the model’s performance in identifying specific brain tumors within radiological images. Thus, this study provides a valuable tool for medical image analysis. In another study [23][25], a direct comparison with the performance of StrongSORT in its advanced version was conspicuously omitted. While utilizing the KC-YOLO approach based on YOLOv5 in their detection algorithm, the study did not surpass the capabilities of more recent versions like YOLOv8. Introducing newer updates could potentially enhance StrongSORT++, as illustrated in [4], showcasing its substantial outperformance across various metrics, including HOTA and IDF1, on the MOT17 and MOT20 datasets. The impact of fine-tuning is evident in a machine learning study [24][26], where deep learning models identified diseases in maize leaves. Fine-tuning significantly improved the performance of pre-trained models, resulting in disease classification accuracy rates exceeding 93%. VGG16, InceptionV3, and Xception achieved accuracy rates surpassing 99%, demonstrating the effectiveness of transfer learning, as previously shown in [25][27], and the positive impact of fine-tuning on disease detection in maize leaves [24][26]. In Table 1, a comparison is presented between StrongSORT and other multi-object tracking methods with a focus on the advantages and disadvantages of its method.
Table 1. Comparison overview between StrongSORT and other multi-object trackers.
Relevant studies affirm the positive impact of YOLOv5 on tracking, aligning seamlessly with our detection-based tracking algorithm. The general aim of our research is to leverage the enhanced capabilities of the YOLOv8 version. The ongoing research underscores StrongSORT’s inherent advantages—simplicity, efficiency, and proficiency in effectively managing occlusions and identity switches. In recognizing the identified limitations within StrongSORT, such as a slightly slower runtime in specific cases, it becomes imperative to strike a balance in utilizing its features while simultaneously addressing challenges in diverse implementation environments. Future research endeavors will delve into a meticulous examination of the algorithm’s efficiency concerning occlusions and identity switches.

2.1. YOLOv8

YOLOv8 (You Only Look Once version 8) represents a significant advancement over its predecessors, YOLOv7 and YOLOv6, incorporating multiple features that enhance both speed and accuracy. A notable addition is the spatial pyramid pooling (SPP) module, enabling YOLOv8 to extract features at varying scales and resolutions [30][32]. This facilitates the precise detection of objects of different sizes. Another key feature is the cross stage partial network (CSP) block, reducing the network’s parameters without compromising accuracy, thereby improving both training times and overall performance [31][33]. Evaluation of prominent object detection benchmarks, including COCO and Pascal VOC datasets, showcases YOLOv8’s exceptional capabilities. The COCO dataset, with 80 object categories and complex scenes, saw YOLOv8 achieving the highest-ever mean average precision (mAP) score of 55.3% among single-stage object detection algorithms. Additionally, it achieved a remarkable real-time speed of 58 frames per second on an NVIDIA GTX 1080 Ti GPU. In summary, YOLOv8 stands out as an impressive object detection algorithm, delivering high accuracy and real-time performance. Its ability to process the entire image simultaneously makes it well suited for applications such as autonomous driving and robotics. With its innovative features and outstanding performance, YOLOv8 is poised to remain a preferred choice for object detection tasks in the foreseeable future [32][34]. Figure 1 illustrates the performance of YOLOv8 in real-world scenarios [33][35].
Figure 1. YOLOv8 performance.

2.2. DeepSORT

DeepSORT is an advanced real-time multiple objects tracking algorithm that leverages deep learning-based feature extraction coupled with the Hungarian algorithm for assignment. The code structure of DeepSORT comprises the following key components [34][36]:
  • Feature Extraction: Responsible for extracting features from input video frames, including bounding boxes and corresponding features.
  • Detection and Tracking: Detects objects in each video frame and associates them with their tracks using the Hungarian algorithm.
  • Kalman Filter: Predicts the location of each object in the next video frame based on its previous location and velocity.
  • Re-identification: Matches the appearance of an object in one video frame with its appearance in a previous frame.
  • Output: Generates the final output—a set of object tracks for each video frame.
These components collaboratively form a comprehensive tracking algorithm. DeepSORT can undergo training using a substantial dataset of video frames and corresponding object bounding boxes. The training process of DeepSORT fine-tunes the various algorithm’s components, such as the appearance model and feature extraction, aiming to enhance the algorithm’s overall performance. Despite its strengths, DeepSORT also faces unique challenges, including:
  • Blurred Objects: Tracking difficulties due to image artifacts caused by blurred objects [35][37].
  • Intra-object Changes: Challenges in handling changes in the shape or size of objects [36][38].
  • Occlusion: Tracking challenges when objects overlap or obstruct each other
  • [
  • ]
  • [
  • 18].
  • Scale Variation: Difficulty when objects appear at different scales in the image [40][42].
To address these challenges, the StrongSORT algorithm has been proposed aiming to offer solutions which enhance tracking robustness and efficiency [4].

2.3. StrongSORT

The StrongSORT algorithm enhances the original DeepSORT by introducing the AFLink algorithm for temporal matching to person tracking and the GSI algorithm for temporal interpolation of matched individuals. New configuration options have been incorporated to facilitate these improvements [4]:
  • AFLink: A flag indicating whether the AFLink algorithm should be utilized for temporal matching.
  • Path_AFLink: The path to the AFLink algorithm model to be employed.
  • Non-rigid Objects: Difficulty in tracking objects appearing for short durations [37][39].
  • Appearance Model: Stores and updates the appearance features of each object over time, facilitating re-identification and appearance updates.
  • GSI: A flag indicating whether the GSI algorithm should be employed for temporal interpolation.
  • Transparent Objects: Challenges in detecting objects made of transparent materials.
  • Non-linear Motion: Difficulty in tracking irregularly moving objects [38][40].
  • Fast Motion: Challenges posed by quickly moving objects.
  • Interval: The temporal interval to be applied in the GSI algorithm.
    Similar Objects: Difficulty in differentiating objects with similar appearances [39][41
  • Tau: The temporal interval to be used in the GSI algorithm.
The original DeepSORT application is updated through the integration of the AFLink and GSI algorithms, providing enhanced capabilities for temporal matching and interpolation of tracked individuals. Figure 2 presents the framework of the AFLink model. It adopts the spatio-temporal information of two tracklets as input and predicts their connectivity [4].
Figure 2. Framework of the AFLink model.
Video Production Service