ByteTrack: A Simple Yet Effective Multi-Object Tracking Technique

A simple, effective, and generic association method that tracks objects by associating almost every detection box, not just the high score ones

The goal of Multi-Object Tracking (MOT) is to detect and identify objects in a video, draw bounding boxes around them, and maintain their trajectories with high accuracy.

MOT takes a single continuous video as input and splits it into discrete frames at a specific frame rate. The output of MOT is:

Detection: what objects are present in each frame
Localization: where objects are in each frame
Association: whether detections in different frames belong to the same object or to different objects

Existing MOT Techniques

MOT methods based on tracking by detection rely on powerful detectors, such as the one-stage detectors RetinaNet, CenterNet, and the YOLO series, to obtain high-performance tracking. Tracking by detection directly uses the detection boxes from each single image for tracking, and information from previous frames is usually leveraged to enhance the video detection performance.

Detection by tracking uses a Kalman filter to predict the location of the tracklets in the next frame. It fuses the predicted boxes with the detection boxes to enhance the detection results. This method utilizes similarity with tracklets to strengthen the reliability of detection boxes.
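To make the prediction step concrete, here is a minimal sketch of a constant-velocity Kalman filter for a single coordinate (for example, the x position of a box center). The function names `kf_predict` and `kf_update` and the noise values `q` and `r` are assumptions chosen for illustration; real trackers typically filter a higher-dimensional box state.

```python
def kf_predict(x, P, dt=1.0, q=1e-2):
    """Predict the next state of a constant-velocity model.

    x is [position, velocity]; P is the 2x2 state covariance.
    q is an assumed process-noise level added to the diagonal.
    """
    p, v = x
    x_pred = [p + dt * v, v]
    # P' = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I
    p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + q
    p01 = P[0][1] + dt * P[1][1]
    p10 = P[1][0] + dt * P[1][1]
    p11 = P[1][1] + q
    return x_pred, [[p00, p01], [p10, p11]]


def kf_update(x, P, z, r=1.0):
    """Correct the predicted state with a measured position z.

    r is the assumed measurement-noise variance; only the position is observed.
    """
    y = z - x[0]                      # innovation
    s = P[0][0] + r                   # innovation variance
    k0, k1 = P[0][0] / s, P[1][0] / s  # Kalman gain
    x_new = [x[0] + k0 * y, x[1] + k1 * y]
    # P = (I - K H) P with H = [1, 0]
    P_new = [
        [(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
        [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]],
    ]
    return x_new, P_new
```

In a tracker, `kf_predict` would run once per frame for every tracklet, and `kf_update` would run only for tracklets that were matched to a detection in that frame.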

Most MOT methods retain only the detection boxes whose scores are above a certain threshold, e.g., 0.5, and use these high score detection boxes as the input for data association.

Data association matches the tracklets to the detection boxes according to their similarity.

The SORT algorithm uses the IoU between the detection boxes and the predicted boxes as the similarity score. Deep SORT additionally leverages appearance similarity, based on appearance features extracted from the detection boxes. Appearance similarity is measured by the cosine similarity of the Re-ID features and helps to re-identify an object that has been occluded for an extended period.
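The two similarity measures mentioned above can be sketched in a few lines of plain Python. The box format `(x1, y1, x2, y2)` and the function names are assumptions made for this example; the Re-ID features are represented as plain lists of floats.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0
```

IoU only needs box coordinates, so it is cheap and works well when objects move little between frames. Cosine similarity needs an extra feature extractor, but it does not depend on where the object reappears.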

The matching strategy assigns identities to the objects after the similarity scores are computed, using the Hungarian algorithm or greedy assignment.
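As an illustration of the simpler of the two strategies, here is a minimal sketch of greedy assignment over a similarity matrix. The function name `greedy_match` and the `min_similarity` cut-off of 0.2 are assumptions for this example, not values from the paper.

```python
def greedy_match(similarity, min_similarity=0.2):
    """Greedily pair tracks and detections, best similarity first.

    similarity[t][d] is the similarity between track t and detection d.
    Returns (matches, unmatched_tracks, unmatched_detections) as index lists.
    """
    n_tracks = len(similarity)
    n_dets = len(similarity[0]) if similarity else 0
    pairs = sorted(
        ((similarity[t][d], t, d) for t in range(n_tracks) for d in range(n_dets)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for score, t, d in pairs:
        if score < min_similarity:
            break  # all remaining pairs are weaker than the cut-off
        if t in used_tracks or d in used_dets:
            continue  # each track and detection can be used only once
        matches.append((t, d))
        used_tracks.add(t)
        used_dets.add(d)
    unmatched_tracks = [t for t in range(n_tracks) if t not in used_tracks]
    unmatched_dets = [d for d in range(n_dets) if d not in used_dets]
    return matches, unmatched_tracks, unmatched_dets
```

Greedy assignment is fast but can miss the globally best pairing. The Hungarian algorithm finds the optimal one-to-one assignment; in Python, `scipy.optimize.linear_sum_assignment` is a commonly used solver for that problem.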

Challenges with existing MOT Techniques

MOT is challenging in complex and crowded scenes, where some tracklets go unmatched because no suitable high score detection box is available for them.

Occlusion can cause missed object detections.
Motion blur can cause missed trajectories, and objects that frequently enter and leave the frame generate fragmented trajectories and ID switches.
Size changes can cause missed trajectories and ID switches.

BYTE is a simple and effective association method for MOT. It is named BYTE because it treats each detection box as a basic unit of the tracklet, like a byte in a computer program, and it values every detection box, not just the ones with high scores.

source: ByteTrack: Multi-Object Tracking by Associating Every Detection Box

In the figure above,

a) Displays the detection boxes identified using object detection.

b) Associates tracklets only with the high score detection boxes, i.e., those with a score of at least 0.5. The same box color represents the same identity.

c) Implements BYTE where all the detection boxes are valued, even the low score ones. As a result, the occluded person with low detection scores is matched correctly to the previous tracklet, and the background in the right part of the image is removed.

ByteTrack Algorithm

ByteTrack performs MOT on a video using the high-performance detector YOLOX and performs association between the detection boxes and the tracks using BYTE.

BYTE keeps all detection boxes and separates them into high score boxes (Dʰᶦᵍʰ) and low score boxes (Dˡᵒʷ). BYTE then uses a Kalman filter to predict the new location in the current frame of each track in the track set T.
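The split into high and low score boxes can be sketched as follows. The dictionary layout of a detection, the function name `split_by_score`, and the two thresholds are assumptions for illustration: 0.5 mirrors the threshold mentioned earlier, while the 0.1 floor is a hypothetical value for discarding near-zero-score noise.

```python
def split_by_score(detections, high_thresh=0.5, low_thresh=0.1):
    """Separate detections into high score and low score groups.

    Each detection is assumed to be a dict with a "score" key.
    Detections below low_thresh are dropped entirely.
    """
    d_high = [d for d in detections if d["score"] >= high_thresh]
    d_low = [d for d in detections if low_thresh <= d["score"] < high_thresh]
    return d_high, d_low


# Example: one confident box, one partially occluded box, one noise box.
detections = [
    {"box": (0, 0, 10, 10), "score": 0.9},
    {"box": (20, 20, 30, 30), "score": 0.3},
    {"box": (50, 50, 60, 60), "score": 0.05},
]
d_high, d_low = split_by_score(detections)
```

Setting `low_thresh=0.0` keeps every detection, which matches the description above most literally.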

The first association in BYTE is performed between the high score detection boxes Dʰᶦᵍʰ and all the tracklets. Similarity for the first association is computed using either the IoU or the Re-ID feature distance between the detection boxes in Dʰᶦᵍʰ and the predicted boxes of the tracks in T.

Some tracklets remain unmatched because no suitable high score detection box in Dʰᶦᵍʰ is available for them, which usually happens under occlusion, motion blur, or size change.

The second association is then performed between the low score detection boxes Dˡᵒʷ and the remaining unmatched tracklets (Tʳᵉᵐᵃᶤⁿ), to recover the objects in low score detection boxes and filter out the background.

The tracks that are still unmatched are kept in Tʳᵉ-ʳᵉᵐᵃᶤⁿ, and all the unmatched low score detection boxes are deleted, as they are considered background.
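The two association stages described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses greedy IoU matching instead of the Hungarian algorithm, and the function names, the `(box, score)` detection format, and the `min_iou` value of 0.3 are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(track_boxes, det_boxes, min_iou):
    """Greedy IoU matching; returns (matches, unmatched tracks, unmatched dets)."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(track_boxes)
         for di, d in enumerate(det_boxes)),
        reverse=True,
    )
    used_t, used_d, matches = set(), set(), []
    for score, ti, di in pairs:
        if score < min_iou:
            break
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    unmatched_t = [i for i in range(len(track_boxes)) if i not in used_t]
    unmatched_d = [i for i in range(len(det_boxes)) if i not in used_d]
    return matches, unmatched_t, unmatched_d


def byte_associate(predicted_tracks, detections, high_thresh=0.5, min_iou=0.3):
    """Two-stage association in the spirit of BYTE.

    predicted_tracks: {track_id: predicted box}; detections: list of (box, score).
    Returns (assigned boxes per track id, unmatched track ids, unmatched high boxes).
    """
    d_high = [box for box, score in detections if score >= high_thresh]
    d_low = [box for box, score in detections if score < high_thresh]
    ids = list(predicted_tracks)

    # First association: all tracks against the high score boxes.
    m1, rest, unmatched_high = associate(
        [predicted_tracks[i] for i in ids], d_high, min_iou)
    assigned = {ids[t]: d_high[d] for t, d in m1}

    # Second association: remaining tracks against the low score boxes.
    remain_ids = [ids[t] for t in rest]
    m2, still_unmatched, _ = associate(
        [predicted_tracks[i] for i in remain_ids], d_low, min_iou)
    assigned.update({remain_ids[t]: d_low[d] for t, d in m2})

    # Unmatched low score boxes are dropped as background. Unmatched high
    # score boxes are returned; in the paper these can start new tracks.
    return (assigned,
            [remain_ids[t] for t in still_unmatched],
            [d_high[d] for d in unmatched_high])
```

The key design point is the ordering: low score boxes never compete with high score boxes for a track, so they can only fill gaps left by the first stage.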

For long-range association, we need to preserve the identity of the tracks across multiple frames. The tracks that remain unmatched after the second association, Tʳᵉ-ʳᵉᵐᵃᶤⁿ, are put into Tˡᵒˢᵗ. Each track in Tˡᵒˢᵗ is kept for a certain number of frames, e.g., 30, to handle object rebirth, after which it is deleted from the tracks T.
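The bookkeeping for lost tracks can be sketched like this. The function name `update_lost`, the dictionary that maps a track id to the frame at which it was lost, and the default of 30 frames are assumptions made for this illustration.

```python
def update_lost(lost, unmatched_ids, rematched_ids, frame_id, max_lost=30):
    """Maintain the set of lost tracks for one frame.

    lost maps track id -> frame index at which the track was first lost.
    Returns the ids that expired in this frame and should be deleted.
    """
    # A lost track that was matched again is "reborn" and leaves the lost set.
    for tid in rematched_ids:
        lost.pop(tid, None)
    # Newly unmatched tracks enter the lost set; already lost ones keep
    # their original timestamp.
    for tid in unmatched_ids:
        lost.setdefault(tid, frame_id)
    # Tracks lost for more than max_lost frames are removed for good.
    expired = [tid for tid, since in lost.items() if frame_id - since > max_lost]
    for tid in expired:
        del lost[tid]
    return expired
```

For this to work, lost tracks must still be offered to the association step in later frames, so that a reappearing object can be matched to its old identity.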

ByteTrack performance

BYTE outperforms other association methods like SORT, Deep SORT, and MOTDT by a large margin.

BYTE is more robust to the detection score threshold than SORT due to the second association in BYTE. The second association in BYTE recovers the objects whose detection scores are lower as it considers every detection box regardless of its detection score.

BYTE obtains notably more TP (True Positives) than FP (False Positives) from the low score detection boxes, which raises MOTA from 74.6 for SORT to 76.6.

With BYTE, IDF1 increases from 76.9 to 79.3 and the number of identity switches (IDs) decreases from 291 to 159, highlighting the importance of the low score detection boxes and demonstrating the ability of BYTE to recover objects from them.
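To see why extra true positives and fewer identity switches both raise MOTA, it helps to look at how the metric is commonly defined: one minus the ratio of all errors (false negatives, false positives, and identity switches) to the number of ground-truth objects. Here is a small sketch; the function name and the example counts are made up for illustration and are not taken from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + ID switches) / number of ground-truth objects."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


# Hypothetical counts: recovering 3 missed objects at the cost of 1 extra
# false positive still improves the score.
before = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
after = mota(false_negatives=7, false_positives=6, id_switches=5, num_gt=100)
```

Every recovered object removes a false negative, so low score boxes help as long as they add fewer false positives than the misses they fix.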

ByteTrack results for motion blur. The yellow triangle represents the high detection score box, and the red triangle represents the low detection score box (source: ByteTrack: Multi-Object Tracking by Associating Every Detection Box)

ByteTrack results for occlusion. The yellow triangle represents the high detection score box, and the red triangle represents the low detection score box (source: ByteTrack: Multi-Object Tracking by Associating Every Detection Box)

Conclusion:

ByteTrack is a simple yet effective algorithm for multi-object tracking (MOT). It uses YOLOX, a high-performance object detector, and BYTE for data association. BYTE uses all of the detection results, both low and high score ones, to enhance the performance of ByteTrack. ByteTrack is robust to occlusion, motion blur, and size changes and performs accurate tracking.

References:

ByteTrack: Multi-Object Tracking by Associating Every Detection Box, by Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, Xinggang Wang

YOLOX: Exceeding YOLO Series in 2021

An Introduction to Object Tracking

HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking