Renu Khandelwal

Summary

ByteTrack is an advanced Multi-Object Tracking (MOT) technique that significantly improves tracking accuracy by associating both high and low score detection boxes, leveraging a simple yet effective association method to maintain object identities through occlusions, motion blur, and size changes.

Abstract

ByteTrack represents a novel approach in the field of Multi-Object Tracking (MOT), which is a critical component of computer vision. It stands out by considering all detection boxes, not just those with high confidence scores. This method utilizes a two-stage association process: the first stage associates high score detection boxes with tracklets, and the second stage matches low score detection boxes with unmatched tracklets. This approach ensures that objects are not lost during tracking due to common challenges such as occlusions, motion blur, and changes in object size. ByteTrack's effectiveness is demonstrated by its superior performance over existing techniques like SORT and Deep SORT, as evidenced by higher True Positives (TP), lower False Positives (FP), and fewer Identity Switches (IDs). The algorithm's robustness is further highlighted by its ability to handle complex scenarios, making it a significant advancement in the MOT domain.

Opinions

  • The authors of the ByteTrack technique emphasize the importance of valuing every detection box, not just those with high scores, to improve tracking accuracy.
  • Existing MOT techniques that only consider high score detection boxes are seen as less effective, particularly in challenging conditions such as crowded scenes or when objects undergo significant appearance changes.
  • The use of a Kalman filter for predicting new locations of tracklets and the incorporation of IoU or Re-ID feature distances for similarity computation are considered essential components of the ByteTrack algorithm.
  • The two-stage association process in ByteTrack is regarded as a key factor in its robustness, allowing for the recovery of objects that would otherwise be missed due to low detection scores.
  • The performance metrics (MOTA, IDF1, IDs) indicate that ByteTrack is not only more accurate but also more consistent in maintaining object identities across frames, which is crucial for real-world applications of MOT.
  • The authors suggest that the simplicity of the ByteTrack algorithm, combined with its effectiveness, makes it a preferable choice for MOT tasks compared to more complex existing methods.

ByteTrack: A Simple Yet Effective Multi-Object Tracking Technique

A simple, effective, and generic association method that tracks objects by associating almost every detection box instead of only the high score ones

The goal of Multi-Object Tracking (MOT) is to detect and identify objects in a video, draw bounding boxes around them, and maintain their trajectories with high accuracy.

Image by author

MOT takes a single continuous video as input and splits it into discrete frames at a specific frame rate. The output of MOT is:

  • Detection: what objects are present in each frame
  • Localization: where objects are in each frame
  • Association: whether detections in different frames belong to the same object or to different objects

Existing MOT Techniques

MOT methods based on tracking by detection rely on powerful detectors, such as the one-stage object detectors RetinaNet, CenterNet, or the YOLO series, to obtain high-performance tracking. Tracking by detection directly uses the detection boxes from a single image for tracking, and information from previous frames is usually leveraged to enhance video detection performance.

Detection by tracking uses a Kalman filter to predict the location of the tracklets in the next frame. It fuses the predicted boxes with the detection boxes to enhance the detection results. This method utilizes similarity with tracklets to strengthen the reliability of detection boxes.
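
To make the prediction step concrete, here is a minimal sketch of a constant-velocity Kalman filter for a single coordinate, such as the x position of a box center. The class name Kalman1D and the noise values q and r are assumptions chosen for illustration; a real tracker would filter the full box state rather than one number.

```python
from dataclasses import dataclass


@dataclass
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate (illustrative only)."""
    pos: float           # estimated position
    vel: float = 0.0     # estimated velocity, in units per frame
    p_pp: float = 1.0    # covariance entry var(pos)
    p_pv: float = 0.0    # covariance entry cov(pos, vel)
    p_vv: float = 1.0    # covariance entry var(vel)
    q: float = 0.01      # process noise added per frame (assumed value)
    r: float = 1.0       # measurement noise variance (assumed value)

    def predict(self) -> float:
        """Move the state one frame ahead: x = F x, P = F P F^T + Q, F = [[1, 1], [0, 1]]."""
        self.pos += self.vel
        self.p_pp = self.p_pp + 2 * self.p_pv + self.p_vv + self.q
        self.p_pv = self.p_pv + self.p_vv
        self.p_vv = self.p_vv + self.q
        return self.pos

    def update(self, z: float) -> None:
        """Correct the state with a measured position z (measurement matrix H = [1, 0])."""
        s = self.p_pp + self.r                      # innovation variance
        k_pos, k_vel = self.p_pp / s, self.p_pv / s  # Kalman gain
        residual = z - self.pos
        self.pos += k_pos * residual
        self.vel += k_vel * residual
        p_pp, p_pv, p_vv = self.p_pp, self.p_pv, self.p_vv
        self.p_pp = (1 - k_pos) * p_pp
        self.p_pv = (1 - k_pos) * p_pv
        self.p_vv = p_vv - k_vel * p_pv


# Feed three detections of an object moving right, then predict the next frame.
kf = Kalman1D(pos=0.0)
for measured_x in (10.0, 20.0, 30.0):
    kf.predict()
    kf.update(measured_x)
print("predicted x in the next frame:", kf.predict())
```

The predicted position is what gets compared with the new detection boxes, so a track can still be matched when the object has moved since the previous frame.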

Most MOT methods retain only the high score detection boxes above a certain threshold, e.g., 0.5, and use these boxes as the input for data association.

Data association matches tracklets to detection boxes according to their similarity.

The SORT algorithm uses the IoU between the detection boxes and the predicted boxes as the similarity score. Deep SORT additionally leverages appearance similarity, for which it extracts appearance (Re-ID) features from the detection boxes. Appearance similarity is measured by the cosine similarity of the Re-ID features and helps to re-identify an object that has been occluded for an extended period.
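
Both similarity measures are short to write down. The sketch below assumes boxes in (x1, y1, x2, y2) corner format and Re-ID features given as plain lists of floats; the function names are illustrative and not taken from SORT or Deep SORT.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


# A box shifted by half its width overlaps the original with IoU = 50 / 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

IoU only needs box geometry, so it is cheap but fails once the boxes stop overlapping; cosine similarity on Re-ID features is independent of position, which is why it helps after long occlusions.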

After the similarity scores are computed, the matching strategy assigns identities to the objects using the Hungarian algorithm or greedy assignment.
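
As a rough illustration of the greedy variant, the sketch below walks through the track/detection pairs from most to least similar and accepts a pair whenever both sides are still free. The function name greedy_match and the min_similarity gate of 0.3 are assumptions made for this example. For the Hungarian algorithm, SciPy's scipy.optimize.linear_sum_assignment is a commonly used solver.

```python
from typing import List, Sequence, Tuple


def greedy_match(
    similarity: Sequence[Sequence[float]],
    min_similarity: float = 0.3,  # assumed gate; pairs below it are never matched
) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """Greedy assignment on a similarity matrix.

    similarity[i][j] is the similarity between track i and detection j.
    Returns (matches, unmatched track indices, unmatched detection indices).
    With an empty matrix the number of detections is unknown and taken as 0.
    """
    num_tracks = len(similarity)
    num_dets = len(similarity[0]) if similarity else 0

    # Candidate pairs above the gate, best first.
    pairs = sorted(
        (
            (score, t, d)
            for t, row in enumerate(similarity)
            for d, score in enumerate(row)
            if score >= min_similarity
        ),
        reverse=True,
    )

    used_tracks, used_dets, matches = set(), set(), []
    for _score, t, d in pairs:
        if t not in used_tracks and d not in used_dets:
            matches.append((t, d))
            used_tracks.add(t)
            used_dets.add(d)

    unmatched_tracks = [t for t in range(num_tracks) if t not in used_tracks]
    unmatched_dets = [d for d in range(num_dets) if d not in used_dets]
    return matches, unmatched_tracks, unmatched_dets


sim = [
    [0.90, 0.80],  # track 0 vs detections 0 and 1
    [0.85, 0.10],  # track 1 vs detections 0 and 1
]
print(greedy_match(sim))
```

In this example greedy matching pairs track 0 with detection 0 and leaves track 1 unmatched, whereas a globally optimal assignment would pair track 0 with detection 1 and track 1 with detection 0. That trade-off between simplicity and optimality is the main difference between the two strategies.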

Challenges with Existing MOT Techniques

MOT is challenging in complex and crowded scenes because some tracklets go unmatched when no appropriate high score detection box is available for them.

  • Occlusions can cause missed object detections.
  • Motion blur can cause missed trajectories; when objects frequently enter and leave the frame, it leads to fragmented trajectories and ID switching.
  • Size changes can cause missed trajectories and ID switching.

BYTE is a simple and effective association method for MOT. It is named BYTE because it treats each detection box as a basic unit of the tracklet, like a byte in a computer program. The method values every detection box, not just the ones with high scores.

source: ByteTrack: Multi-Object Tracking by Associating Every Detection Box

In the figure above,

a) Displays the detection boxes identified using object detection.

b) Shows the tracklets obtained by associating only the high score detection boxes, i.e., those with a score ≥ 0.5. The same box color represents the same identity.

c) Implements BYTE where all the detection boxes are valued, even the low score ones. As a result, the occluded person with low detection scores is matched correctly to the previous tracklet, and the background in the right part of the image is removed.

ByteTrack Algorithm

ByteTrack performs MOT on a video using the high-performance detector YOLOX and performs association between the detection boxes and the tracks using BYTE.

BYTE keeps all detection boxes and separates them into high score ones (Dʰᶦᵍʰ) and low score ones (Dˡᵒʷ). BYTE then uses a Kalman filter to predict the new location, in the current frame, of each track in the set of tracks T.
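
The split itself is a simple filter on the detection score. In the sketch below, the Detection type, the function name, and both threshold defaults are assumptions for illustration: 0.5 mirrors the high score threshold mentioned earlier, while the low_thresh floor is not described in this article and can be set to 0.0 to keep every box.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    box: Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)
    score: float                            # detector confidence in [0, 1]


def split_by_score(
    detections: List[Detection],
    high_thresh: float = 0.5,  # boxes at or above this go to D_high
    low_thresh: float = 0.1,   # assumed floor; use 0.0 to keep all boxes
) -> Tuple[List[Detection], List[Detection]]:
    """Return (D_high, D_low) for one frame."""
    d_high = [d for d in detections if d.score >= high_thresh]
    d_low = [d for d in detections if low_thresh <= d.score < high_thresh]
    return d_high, d_low


frame_detections = [
    Detection((0, 0, 10, 10), 0.92),    # clearly visible object
    Detection((12, 0, 22, 10), 0.31),   # partially occluded object
    Detection((50, 50, 60, 60), 0.04),  # almost certainly noise
]
d_high, d_low = split_by_score(frame_detections)
print(len(d_high), len(d_low))
```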

The first association in BYTE is performed between the high score detection boxes Dʰᶦᵍʰ and all the tracklets. Similarity for the first association is computed using the IoU or the Re-ID feature distances between the detection boxes Dʰᶦᵍʰ and the predicted boxes of the tracks T.

Some tracklets remain unmatched because no appropriate high score detection box in Dʰᶦᵍʰ is available for them, which usually happens under occlusion, motion blur, or size changes.

After the first association, the second association is performed between the low score detection boxes Dˡᵒʷ and the remaining unmatched tracklets (Tʳᵉᵐᵃᶤⁿ) to recover the objects in low score detection boxes and filter out the background.

The tracks that are still unmatched after the second association are kept in Tʳᵉ-ʳᵉᵐᵃᶤⁿ, and all the unmatched low score detection boxes are deleted because they are considered background.

For long-range association, we need to preserve the identity of the tracks across multiple frames. The tracks that remain unmatched after the second association, Tʳᵉ-ʳᵉᵐᵃᶤⁿ, are put into Tˡᵒˢᵗ. Each track in Tˡᵒˢᵗ is kept for a certain number of frames, e.g., 30, to handle object rebirth; if it is not matched again within that time, it is deleted from the tracks T.
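
The steps above can be put together in a compact sketch. This is not the authors' implementation: it uses greedy IoU matching instead of the Hungarian algorithm, it compares detections with each track's last box instead of a Kalman prediction, and it assumes that unmatched high score boxes start new tracks, which this article does not cover. All names and default values are illustrative.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)
_next_id = count(1)                      # simple global ID generator


@dataclass
class Track:
    box: Box              # last matched box (stands in for the Kalman prediction)
    track_id: int
    frames_lost: int = 0  # consecutive frames without a match


def iou(a: Box, b: Box) -> float:
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks: List[Track], boxes: List[Box], min_iou: float):
    """Greedy IoU matching. Returns (matches, unmatched tracks, unmatched boxes)."""
    pairs = sorted(
        ((iou(t.box, b), ti, bi) for ti, t in enumerate(tracks) for bi, b in enumerate(boxes)),
        reverse=True,
    )
    used_t, used_b, matches = set(), set(), []
    for score, ti, bi in pairs:
        if score < min_iou:
            break  # pairs are sorted, so all remaining ones are too dissimilar
        if ti in used_t or bi in used_b:
            continue
        matches.append((tracks[ti], boxes[bi]))
        used_t.add(ti)
        used_b.add(bi)
    rest_t = [t for i, t in enumerate(tracks) if i not in used_t]
    rest_b = [b for i, b in enumerate(boxes) if i not in used_b]
    return matches, rest_t, rest_b


def byte_step(tracks, detections, high_thresh=0.5, min_iou=0.3, max_lost=30):
    """Process one frame. detections is a list of (box, score) pairs."""
    d_high = [box for box, score in detections if score >= high_thresh]
    d_low = [box for box, score in detections if score < high_thresh]

    # First association: high score boxes against all tracks.
    matched_1, t_remain, d_high_rest = associate(tracks, d_high, min_iou)
    # Second association: low score boxes against the tracks still unmatched.
    # Low score boxes left over are treated as background and dropped.
    matched_2, t_re_remain, _background = associate(t_remain, d_low, min_iou)

    kept = []
    for track, box in matched_1 + matched_2:
        track.box, track.frames_lost = box, 0
        kept.append(track)
    # Unmatched tracks become lost tracks and survive up to max_lost frames.
    for track in t_re_remain:
        track.frames_lost += 1
        if track.frames_lost <= max_lost:
            kept.append(track)
    # Assumption: every unmatched high score box starts a new track.
    kept.extend(Track(box, next(_next_id)) for box in d_high_rest)
    return kept


tracks = byte_step([], [((0, 0, 10, 10), 0.9)])
# Next frame: the same object is partly occluded and only scores 0.2.
tracks = byte_step(tracks, [((1, 0, 11, 10), 0.2), ((100, 100, 110, 110), 0.2)])
print(tracks)
```

In the second frame the occluded object keeps its identity through the low score box, while the far-away low score box matches no track and is discarded as background.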

ByteTrack performance

BYTE outperforms other association methods like SORT, Deep SORT, and MOTDT by a large margin.

BYTE is more robust to the detection score threshold than SORT due to the second association in BYTE. The second association in BYTE recovers the objects whose detection scores are lower as it considers every detection box regardless of its detection score.

BYTE obtains notably more TP (True Positives) than FP (False Positives) from the low score detection boxes. Compared with SORT, this increases MOTA from 74.6 to 76.6.

IDF1 likewise increases from 76.9 to 79.3, and the number of identity switches (IDs) decreases from 291 to 159, highlighting the importance of the low score detection boxes and demonstrating the ability of BYTE to recover objects from them.
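
For readers unfamiliar with these metrics, the sketch below shows the standard definitions of MOTA and IDF1 as small Python functions. The counts in the example are made up for illustration and are not taken from the paper.

```python
def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDs) / GT, where GT is the number of ground-truth boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN), computed on identity-level matches."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


# Illustrative counts only: recovering 40 missed objects at the cost of 5 extra
# false positives lowers FN by 40 and raises FP by 5, so MOTA goes up.
before = mota(false_negatives=200, false_positives=50, id_switches=20, num_gt=1000)
after = mota(false_negatives=160, false_positives=55, id_switches=20, num_gt=1000)
print(before, after)
```

This is why gaining more TP than FP from the low score boxes matters: every recovered object removes a false negative, and as long as fewer false positives are added than false negatives are removed, MOTA increases.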

ByteTrack results for motion blur. The yellow triangle represents the high detection score box, and the red triangle represents the low detection score box (source: ByteTrack: Multi-Object Tracking by Associating Every Detection Box)
ByteTrack results for occlusion. The yellow triangle represents the high detection score box, and the red triangle represents the low detection score box (source: ByteTrack: Multi-Object Tracking by Associating Every Detection Box)

Conclusion:

ByteTrack is a simple yet effective algorithm for multi-object tracking (MOT). It uses YOLOX, a high-performance object detector, and BYTE for data association. BYTE uses all of the detection results, those with low scores as well as those with high scores, to enhance the performance of ByteTrack. As a result, ByteTrack is robust to occlusion, motion blur, and size changes and performs accurate tracking.

References:

ByteTrack: Multi-Object Tracking by Associating Every Detection Box by Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, Xinggang Wang

YOLOX: Exceeding YOLO Series in 2021

An Introduction to Object Tracking

HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking
