DeepSORT

DeepSORT is a multi-object tracking algorithm that can follow multiple objects continuously in complex scenarios, such as crowded areas or environments with occlusions. It combines the appearance features and motion patterns of its targets to improve the stability and accuracy of tracking, and it is widely applied in fields such as security surveillance, autonomous driving, and sports analysis.


1. Overview

DeepSORT (Simple Online and Realtime Tracking with a Deep Association Metric) is a multi-object tracking (MOT) algorithm designed to address challenges such as occlusions, crowded environments, and varying object appearances in real-time applications. It extends its predecessor, SORT (Simple Online and Realtime Tracking), by combining learned appearance features with traditional motion-based tracking: a Kalman filter predicts motion, and a deep neural network models appearance. This combination reduces identity switches while keeping the computational cost low, which makes DeepSORT suitable for applications like surveillance, autonomous driving, and sports analytics.

2. Core Methodology 

The algorithm operates through four interconnected components: a motion model, data association, a matching cascade, and an appearance descriptor. First, the motion model uses an 8-dimensional state vector that holds the bounding-box center position, aspect ratio, and height of each tracked object, together with the velocity of each of these four quantities. A Kalman filter with a constant-velocity model predicts the next state and corrects it with each new detection, which lets the system bridge short-term occlusions. Tracks are categorized as tentative (new and unconfirmed), confirmed (stable), or deleted (terminated after prolonged mismatches) to manage tracking reliability.
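The prediction step of such a motion model can be sketched as follows. This is a minimal illustration that assumes the state layout (x, y, a, h, vx, vy, va, vh) and a unit time step; the helper names, the example track, and the process-noise values are placeholders chosen for the example, not values taken from any particular implementation.

```python
import numpy as np

def make_constant_velocity_model(dt=1.0):
    """Return the 8x8 transition matrix F and the 4x8 measurement matrix H."""
    F = np.eye(8)
    for i in range(4):
        F[i, i + 4] = dt      # each observed quantity advances by its velocity
    H = np.eye(4, 8)          # only (x, y, a, h) are measured by the detector
    return F, H

def predict(mean, cov, F, process_noise):
    """Kalman prediction: propagate the mean and covariance one frame ahead."""
    mean = F @ mean
    cov = F @ cov @ F.T + process_noise
    return mean, cov

F, H = make_constant_velocity_model()
# Hypothetical track: center (100, 50), aspect ratio 0.5, height 80,
# moving 2 px/frame to the right and 1 px/frame downwards.
mean = np.array([100.0, 50.0, 0.5, 80.0, 2.0, 1.0, 0.0, 0.0])
cov = np.eye(8)
Q = 0.01 * np.eye(8)          # placeholder process noise (assumption)
mean, cov = predict(mean, cov, F, Q)
print(mean[:4])               # predicted box parameters for the next frame
```

The covariance grows with every prediction that is not followed by a measurement update, which is how the filter expresses increasing uncertainty about an occluded object.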

For data association, DeepSORT uses a hybrid approach. It calculates the Mahalanobis distance between predicted and detected object positions to account for motion uncertainty, and it measures the cosine distance between the deep appearance feature of each detection and the features stored in each track's history. Each metric also acts as a gate that rejects implausible pairings. The two metrics are combined through a weighted sum whose balance can be adjusted to the environment, for instance by prioritizing appearance features in scenarios with significant camera movement.
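A minimal sketch of the combined cost is shown below. It assumes the squared Mahalanobis distance has already been computed by the Kalman filter and is passed in as a number. The motion gate is the 95% quantile of the chi-square distribution with four degrees of freedom; the appearance threshold of 0.2 and all function names are illustrative assumptions.

```python
import math

CHI2_95_4DOF = 9.4877  # 95% chi-square quantile for a 4-D measurement

def cosine_distance(a, b):
    """One minus the cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def appearance_distance(gallery, feature):
    """Smallest cosine distance between a detection and a track's gallery."""
    return min(cosine_distance(stored, feature) for stored in gallery)

def combined_cost(d_motion, d_appearance, lam=0.0, max_appearance=0.2):
    """Weighted sum of both metrics; math.inf if either gate rejects the pair."""
    if d_motion > CHI2_95_4DOF or d_appearance > max_appearance:
        return math.inf
    return lam * d_motion + (1.0 - lam) * d_appearance

# Hypothetical 2-D features; real descriptors are much longer.
gallery = [[1.0, 0.0], [0.0, 1.0]]
d_app = appearance_distance(gallery, [1.0, 0.0])
print(combined_cost(2.0, d_app))           # admissible pair
print(combined_cost(12.0, d_app))          # rejected by the motion gate
```

Setting `lam` to 0 makes the cost depend on appearance only while the motion gate still filters out physically implausible pairings, which corresponds to the camera-movement case mentioned above.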

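To mitigate identity switches during prolonged occlusions, the algorithm implements a cascaded matching strategy. Tracks are matched in order of the number of frames since their last successful association, so recently updated tracks come first. This counteracts a bias of the Mahalanobis distance: a track that has been occluded for a long time has a large predicted uncertainty, which shrinks its distance to any detection and would otherwise let it capture detections that belong to more recently seen tracks. A final intersection-over-union (IoU) matching step, applied to unconfirmed tracks and to tracks that were still matched in the previous frame, handles cases where appearance features become unreliable due to sudden appearance changes or partial obstructions.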
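The cascade can be sketched as a loop over track age, where age means the number of frames since a track was last matched. In the formulation of the original paper, the loop visits the most recently updated tracks first. The sketch below is an illustration under stated assumptions: `greedy_match` is a simple stand-in for the optimal assignment solver used in practice, and all names and thresholds are hypothetical.

```python
def greedy_match(cost, track_ids, det_ids, max_cost):
    """Stand-in matcher: repeatedly take the cheapest admissible pair."""
    pairs = sorted(
        (cost[t][d], t, d)
        for t in track_ids for d in det_ids
        if cost[t][d] <= max_cost
    )
    matches, used_tracks, used_dets = [], set(), set()
    for _, t, d in pairs:
        if t not in used_tracks and d not in used_dets:
            matches.append((t, d))
            used_tracks.add(t)
            used_dets.add(d)
    return matches

def matching_cascade(cost, time_since_update, num_dets, max_age, max_cost):
    """Match level by level, visiting the most recently updated tracks first."""
    unmatched_dets = list(range(num_dets))
    matches = []
    for age in range(1, max_age + 1):
        if not unmatched_dets:
            break
        level = [t for t, a in enumerate(time_since_update) if a == age]
        if not level:
            continue
        level_matches = greedy_match(cost, level, unmatched_dets, max_cost)
        matches.extend(level_matches)
        taken = {d for _, d in level_matches}
        unmatched_dets = [d for d in unmatched_dets if d not in taken]
    matched = {t for t, _ in matches}
    unmatched_tracks = [t for t in range(len(time_since_update)) if t not in matched]
    return matches, unmatched_tracks, unmatched_dets

# Track 0 has been unmatched for 3 frames, track 1 for 1 frame; one detection.
cost = [[0.10], [0.15]]       # cost[track][detection]
result = matching_cascade(cost, [3, 1], num_dets=1, max_age=30, max_cost=0.2)
print(result)
```

In this example the long-occluded track 0 has the lower cost, but the detection still goes to track 1 because its level is processed first.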

A deep convolutional neural network (CNN) extracts an appearance descriptor for each detection. The descriptors are normalized to unit length and compared with a gallery of descriptors previously associated with each track to compute the cosine distance. This helps maintain track identity across long-term occlusions.

The matching cascade handles tracks of different ages, where age is the number of frames since the last successful association. Tracks that were updated most recently are given priority, and tracks that have gone unmatched for longer are considered only afterwards, so tracks with lower uncertainty are matched first.

Tracks that exceed a predefined maximum age are deleted from the tracker. New tracks are initiated for unmatched detections, and tentative tracks that fail to associate with a detection during their first few frames are also removed. This keeps the tracker efficient and prevents it from accumulating spurious tracks.
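The track lifecycle can be sketched as a small state machine. The class below is illustrative only: the parameter names `n_init` and `max_age` and their default values (3 and 30) are assumptions chosen for the example, and the Kalman state and feature gallery that a real track carries are omitted.

```python
from enum import Enum

class TrackState(Enum):
    TENTATIVE = 1
    CONFIRMED = 2
    DELETED = 3

class Track:
    """Lifecycle bookkeeping only; motion and appearance state are left out."""

    def __init__(self, n_init=3, max_age=30):
        self.state = TrackState.TENTATIVE
        self.hits = 1                  # the detection that created the track
        self.time_since_update = 0
        self.n_init = n_init
        self.max_age = max_age

    def predict(self):
        """Called once per frame, before association."""
        self.time_since_update += 1

    def mark_matched(self):
        """A detection was associated with this track in the current frame."""
        self.hits += 1
        self.time_since_update = 0
        if self.state is TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED

    def mark_missed(self):
        """No detection was associated with this track in the current frame."""
        if self.state is TrackState.TENTATIVE:
            self.state = TrackState.DELETED
        elif self.time_since_update > self.max_age:
            self.state = TrackState.DELETED

track = Track()
for _ in range(2):                     # two further matched frames
    track.predict()
    track.mark_matched()
print(track.state)
```

A tentative track is dropped on its first miss, whereas a confirmed track survives up to `max_age` consecutive misses, which is what allows it to be recovered after an occlusion.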

3. Applications

DeepSORT's versatility has led to widespread adoption across industries. In security systems, it enables persistent tracking of individuals through crowded spaces despite temporary occlusions. Autonomous vehicle platforms utilize its robust tracking to monitor pedestrians and surrounding vehicles in dynamic environments. Sports analysts apply the algorithm to track players' movements for tactical analysis, while retail systems employ it for customer behavior studies. Open-source implementations, often integrated with popular object detectors like YOLO or Faster R-CNN, have further accelerated its deployment in research and commercial projects.

4. Integration Process

YOLO and DeepSORT divide the work between detection and tracking. YOLO detects objects in the first frame, and DeepSORT initializes a track for each detection, marking it as tentative until it has been matched over several consecutive frames. In each subsequent frame, YOLO supplies new detections, DeepSORT predicts the position of every track with the Kalman filter, and the detections are matched against the predicted tracks. The matching combines the Mahalanobis distance for motion information with the cosine distance for appearance information, and the Hungarian algorithm solves the resulting assignment problem. Tracks that are consistently matched with detections are confirmed and maintained, while tracks that remain unmatched for a certain number of frames are deleted. New tracks are initiated for unmatched detections, and tentative tracks that fail to match are removed to keep the tracker efficient.
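The assignment step can be illustrated with SciPy's `linear_sum_assignment`, which solves the same linear assignment problem that the Hungarian algorithm addresses. This sketch assumes NumPy and SciPy are installed; the function name `assign`, the threshold, and the cost values are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign(cost, max_cost):
    """Optimal one-to-one assignment; pairs above max_cost are rejected."""
    cost = np.asarray(cost, dtype=float)
    # Clip inadmissible entries to just above the threshold so the solver
    # receives a finite matrix; such pairs are filtered out afterwards.
    clipped = np.where(cost > max_cost, max_cost + 1e-5, cost)
    rows, cols = linear_sum_assignment(clipped)
    matches = [(int(r), int(c)) for r, c in zip(rows, cols)
               if cost[r, c] <= max_cost]
    matched_rows = {r for r, _ in matches}
    matched_cols = {c for _, c in matches}
    unmatched_tracks = [r for r in range(cost.shape[0]) if r not in matched_rows]
    unmatched_dets = [c for c in range(cost.shape[1]) if c not in matched_cols]
    return matches, unmatched_tracks, unmatched_dets

# Two tracks (rows) and three detections (columns).
cost = [[0.10, 0.90, 0.80],
        [0.70, 0.15, 0.90]]
matches, unmatched_tracks, unmatched_dets = assign(cost, max_cost=0.2)
print(matches, unmatched_tracks, unmatched_dets)
```

Detections left unmatched at this point are the ones that would start new tentative tracks, and unmatched tracks are the ones that accumulate missed frames.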

5. Advantages of the Combination

The combination of DeepSORT and YOLO offers several advantages. First, it can run in real time, because YOLO detects quickly and DeepSORT's association step is lightweight, which suits applications that require immediate feedback. Second, the appearance features that DeepSORT extracts with a deep convolutional neural network make the system more robust to occlusions: when an object is temporarily blocked from view, the tracker can recover its identity from its appearance once it reappears. Third, the integration can track multiple objects of different classes simultaneously, provided the appearance model has been trained for those classes; the feature model described in Section 6.2 targets person re-identification. These strengths make the system effective for applications such as surveillance, autonomous driving, and crowd analysis, where accurate and reliable detection and tracking are crucial.

6. Usage

6.1. Running the Tracker

The following example starts the tracker on one of the MOT16 benchmark sequences. It assumes that the resources have been extracted to the repository root directory and that the MOT16 benchmark data is in ./MOT16:

python deep_sort_app.py \
    --sequence_dir=./MOT16/test/MOT16-06 \
    --detection_file=./resources/detections/MOT16_POI_test/MOT16-06.npy \
    --min_confidence=0.3 \
    --nn_budget=100 \
    --display=True
 

Run python deep_sort_app.py -h for an overview of the available options. The repository also contains scripts to visualize results, generate videos, and evaluate the tracker on the MOT challenge benchmark.

6.2. Generating Detections

Besides the main tracking application, the repository contains a script that generates features for person re-identification. These features are suitable for comparing the visual appearance of pedestrian bounding boxes using cosine similarity. The following example generates them from standard MOT challenge detections:

python tools/generate_detections.py \
    --model=resources/networks/mars-small128.pb \
    --mot_dir=./MOT16/train \
    --output_dir=./resources/detections/MOT16_train

The model was generated with TensorFlow 1.5. If you run into incompatibilities, re-export the frozen inference graph to obtain a new mars-small128.pb that is compatible with your TensorFlow version:

python tools/freeze_model.py

The generate_detections.py script stores a separate binary file in NumPy native format for each sequence of the MOT16 dataset. Each file contains an array of shape N x 138, where N is the number of detections in the corresponding MOT sequence. The first 10 columns contain the raw MOT detection copied from the input file, and the remaining 128 columns store the appearance descriptor. The generated files can be used as input for deep_sort_app.py.
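The file layout described above can be unpacked as in the following sketch. It assumes the standard MOT challenge column order for the first 10 columns (frame, id, left, top, width, height, confidence, followed by three unused world coordinates); the function name is hypothetical, and a synthetic array stands in for a file that would normally be read with `np.load`.

```python
import numpy as np

def split_detections(rows):
    """Split an N x 138 detection array into MOT fields and descriptors."""
    rows = np.asarray(rows)
    if rows.ndim != 2 or rows.shape[1] != 138:
        raise ValueError("expected an array of shape (N, 138)")
    frame_idx = rows[:, 0].astype(int)   # frame number of each detection
    boxes_tlwh = rows[:, 2:6]            # left, top, width, height
    confidence = rows[:, 6]              # detector confidence score
    features = rows[:, 10:]              # 128-D appearance descriptor
    return frame_idx, boxes_tlwh, confidence, features

# Synthetic stand-in for an array loaded from one of the generated files.
demo = np.zeros((3, 138))
demo[:, 0] = [1, 1, 2]
frame_idx, boxes, conf, feats = split_detections(demo)
print(boxes.shape, feats.shape)
```

Grouping the rows by `frame_idx` yields the per-frame detections that a tracker would consume one frame at a time.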

NOTE: If python tools/generate_detections.py raises a TensorFlow error, try passing an absolute path to the --model argument. This might help in some cases.

7. Conclusion

DeepSORT represents a significant advancement in multi-object tracking by integrating deep learning with traditional tracking techniques. Its robustness and real-time performance make it a popular choice for a wide range of applications.

Update Date: 24 Mar 2025