A Cheat Sheet For Multi-Object Tracking
Everything about MOT in a nutshell
Multiple Object Tracking (MOT)
MOT takes a single continuous video and splits it into discrete frames at a specific frame rate (fps) to output:
- Detection: what objects are present in each frame
- Localization: where objects are in each frame
- Association: whether detections in different frames belong to the same object or to different objects
Typical Applications of MOT
Multi-object tracking (MOT) has applications in
- Video surveillance for traffic control, digital forensics
- Gesture recognition
- Robotics
- Augmented Reality
- Self-driving vehicles
Challenges with MOT
- Accurate detection: objects of interest must be detected in each frame with high confidence. Detection can go wrong by failing to detect an object of interest, assigning a wrong class label to a detected object, or incorrectly localizing an identified object.
- ID switching: occurs when two similar objects overlap or blend and their identities get swapped, which makes it difficult to keep track of each object's ID.
- Background distortion: a busy background makes it difficult to detect small objects during object detection.
- Occlusion: occurs when an object you want to track is hidden (occluded) by another object.
- Multiple spatial scales, deformation, or object rotation
- Illumination changes in the image
- Motion blur: visual streaking or smearing captured on camera
Characteristics of a Multi-Object Tracker
A good multi-object tracker
- Finds the correct number of objects at their precise locations in each frame.
- Identifies objects by tracking each individual object consistently over a long period.
- Tracks objects despite occlusion, illumination changes, busy backgrounds, motion blur, etc.
- Detects and tracks objects fast.
Popular MOT Algorithms
Centroid-based Object Tracking
Centroid-based object tracking associates objects across two consecutive frames of a video using the Euclidean distance between the centroids of the detected objects.
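As a minimal sketch of this idea, the snippet below pairs objects from the previous frame with detections in the current frame by centroid distance. The box format `(x1, y1, x2, y2)`, the greedy closest-first pairing, and the `max_distance` cut-off are assumptions made for illustration, not part of any specific tracker.

```python
import math

def centroid(box):
    """Return the (x, y) center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def match_by_centroid(prev_boxes, curr_boxes, max_distance=50.0):
    """Greedily pair previous objects with current detections, closest first.

    prev_boxes: dict of object id -> box from the previous frame.
    curr_boxes: list of boxes detected in the current frame.
    Returns a dict of object id -> index into curr_boxes.
    max_distance is a hypothetical cut-off in pixels.
    """
    # Collect every candidate pair whose centroids are close enough.
    pairs = []
    for obj_id, prev_box in prev_boxes.items():
        for j, curr_box in enumerate(curr_boxes):
            d = math.dist(centroid(prev_box), centroid(curr_box))
            if d <= max_distance:
                pairs.append((d, obj_id, j))
    pairs.sort()  # smallest Euclidean distance first

    matches, used = {}, set()
    for _, obj_id, j in pairs:
        if obj_id not in matches and j not in used:
            matches[obj_id] = j
            used.add(j)
    return matches

prev = {0: (0, 0, 10, 10), 1: (100, 100, 120, 120)}
curr = [(102, 101, 122, 121), (2, 1, 12, 11)]
print(match_by_centroid(prev, curr))
```

Objects left without a match would typically be marked as lost, and unmatched detections would start new IDs.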

IOU Object Tracker
Intersection-over-Union (IOU) tracking is another technique that associates detections in subsequent frames to tracks solely by their spatial overlap.
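The overlap measure itself is easy to compute. The sketch below assumes axis-aligned boxes in `(x1, y1, x2, y2)` format; the function name `iou` is chosen here for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A detection is then attached to the track whose last box overlaps it most, provided the overlap exceeds a chosen threshold.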
Visual IOU Object Tracker
The Visual IOU Object Tracker works in two directions: visual forward and backward tracking of the object helps merge discontinued tracks.
Simple Online Realtime Tracking (SORT)
The SORT method assumes tracking quality depends on object detection performance.
SORT starts by detecting objects using Faster Region-CNN (FrRCNN).
Each existing target's bounding box is propagated to the current frame by predicting its new location. When a detection is associated with a target, the detected bounding box is used to update the target state, which is solved optimally using a Kalman filter framework.
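To make the predict and update steps concrete, here is a minimal sketch of a constant-velocity Kalman filter for a single coordinate. It is a simplification for illustration: a real tracker filters the full bounding-box state, and the class name, initial covariance, and noise values below are assumptions.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over the state (position, velocity) of one coordinate."""

    def __init__(self, position, process_var=1.0, measurement_var=1.0):
        self.x = [float(position), 0.0]        # state estimate
        self.P = [[10.0, 0.0], [0.0, 10.0]]    # state covariance (assumed prior)
        self.q = process_var                   # process noise, Q = q * I
        self.r = measurement_var               # measurement noise

    def predict(self, dt=1.0):
        """Propagate the state one step: x = F x, P = F P F^T + Q."""
        p, v = self.x
        self.x = [p + dt * v, v]
        (a, b), (c, d) = self.P
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.x[0]

    def update(self, measured_position):
        """Correct the state with a measured position (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r                  # innovation covariance
        k0, k1 = a / s, c / s           # Kalman gain
        y = measured_position - self.x[0]
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P = (I - K H) P
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]

kf = ConstantVelocityKalman1D(position=0.0)
for z in range(1, 10):       # the object moves one unit per frame
    kf.predict()
    kf.update(float(z))
print(kf.predict())          # predicted position in the next frame
```

The predicted position is what gets compared against new detections, and the matched detection is fed back through `update`.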
The assignment cost matrix is computed as the intersection-over-union (IOU) distance between each detection and all predicted bounding boxes from the existing targets. The assignment is solved using the Hungarian algorithm.
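The assignment step can be sketched as follows. To stay dependency-free, this example finds the minimum-cost assignment by exhaustive search over permutations, which gives the same optimum as the Hungarian algorithm on small inputs but does not scale; it is a stand-in, not the Hungarian algorithm itself. The `iou_min` value and the function names are assumptions.

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def assign(predicted_boxes, detections, iou_min=0.3):
    """Return (track_index, detection_index) pairs with minimal total IOU distance."""
    # Cost matrix: IOU distance between every predicted box and every detection.
    cost = [[1.0 - iou(p, d) for d in detections] for p in predicted_boxes]
    n_t, n_d = len(predicted_boxes), len(detections)
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), perm))
                      for perm in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(perm, range(n_d)))
                      for perm in permutations(range(n_t), n_d))
    best = min(candidates, key=lambda pairs: sum(cost[t][d] for t, d in pairs))
    # Reject assignments whose overlap is below the minimum IOU.
    return [(t, d) for t, d in best if 1.0 - cost[t][d] >= iou_min]

predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
detected = [(21, 21, 31, 31), (1, 1, 11, 11), (100, 100, 110, 110)]
print(assign(predicted, detected))
```

In practice a polynomial-time solver would replace the permutation search; the cost matrix and the minimum-overlap check stay the same.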
SORT works well when object motion between frames is small. It may fail in challenging cases such as crowded scenes and fast motion, where occluded targets lead to ID switches.
Deep SORT
Deep SORT is an extension of SORT that incorporates appearance information through a pre-trained association metric.
Deep SORT allows for tracking through longer periods of occlusion, is simple to implement, and runs in real time.
Deep SORT adopts a conventional single-hypothesis tracking methodology with recursive Kalman filtering and frame-by-frame data association using the Hungarian algorithm.

The appearance feature is a descriptor vector computed from the image crop of each detected bounding box. Deep SORT also uses a matching cascade that gives priority to more frequently seen objects.
Deep SORT reduces ID switches, particularly those caused by occlusion.
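One way to picture the appearance metric is as a cosine distance between descriptor vectors, taking the smallest distance between a new detection and the descriptors previously stored for a track. The sketch below uses plain Python lists as descriptors; the function names and the tiny 2-D vectors are illustrative assumptions, since real descriptors come from a CNN and have many more dimensions.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def appearance_distance(track_gallery, detection_feature):
    """Smallest cosine distance between a detection and a track's stored descriptors."""
    return min(cosine_distance(f, detection_feature) for f in track_gallery)

# A track that has stored two appearance descriptors so far.
gallery = [[1.0, 0.0], [0.0, 1.0]]
print(appearance_distance(gallery, [0.0, 2.0]))
```

A small distance means the detection looks like something the track has already seen, which helps re-identify an object after it has been occluded.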
FairMOT (Multiple Object Tracking)
Unlike SORT and Deep SORT, FairMOT does not first detect objects and their bounding boxes and then track them in a separate step. FairMOT argues that such a network is biased toward the primary detection task, which is unfair to the re-ID (object tracking) task.
FairMOT therefore treats the object detection and re-ID tasks equally.

The input image is fed to an encoder-decoder network to extract high-resolution feature maps.
FairMOT then adds two homogeneous branches for detecting objects and extracting re-ID features to obtain a good trade-off between detection and re-ID.
Read this article for a detailed understanding of the different MOT algorithms
ByteTrack Algorithm
ByteTrack performs MOT on a video using the high-performance detector YOLOX and performs association between the detection boxes and the tracks using BYTE.
BYTE keeps all detection boxes and separates them into high-score ones (Dʰᶦᵍʰ) and low-score ones (Dˡᵒʷ). BYTE uses a Kalman filter to predict the new location in the current frame of each track in the set of tracks T.
The first association in BYTE is performed between the high-score detection boxes Dʰᶦᵍʰ and all the tracklets. Similarity for the first association is computed using IoU or the Re-ID feature distance between the detection boxes Dʰᶦᵍʰ and the predicted boxes of the tracks T.
Some tracklets remain unmatched because no appropriate high-score detection box in Dʰᶦᵍʰ matches them, which happens under occlusion, motion blur, or size change.

The second association is then performed between the low-score detection boxes Dˡᵒʷ and the remaining unmatched tracklets (Tʳᵉᵐᵃᶤⁿ) to recover the objects in low-score detection boxes and filter out the background.
The tracks that are still unmatched are kept in Tʳᵉ-ʳᵉᵐᵃᶤⁿ, and all the unmatched low-score detection boxes are deleted because they are considered background.
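The two association rounds can be sketched as below. This is an illustrative simplification, not the reference implementation: it uses greedy IoU matching instead of an optimal assignment, and the threshold values, data layout, and function names are assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_match(tracks, dets, iou_min):
    """Match track boxes to detection boxes, highest IoU first."""
    pairs = sorted(
        ((iou(tb, db), tid, j) for tid, tb in tracks.items() for j, db in dets.items()),
        reverse=True,
    )
    matches, used = {}, set()
    for overlap, tid, j in pairs:
        if overlap < iou_min:
            break  # remaining pairs overlap even less
        if tid not in matches and j not in used:
            matches[tid] = j
            used.add(j)
    unmatched_tracks = {tid: b for tid, b in tracks.items() if tid not in matches}
    unmatched_dets = [j for j in dets if j not in used]
    return matches, unmatched_tracks, unmatched_dets

def byte_associate(tracks, detections, high_thresh=0.6, iou_min=0.3):
    """tracks: id -> predicted box; detections: list of (box, score)."""
    high = {j: box for j, (box, s) in enumerate(detections) if s >= high_thresh}
    low = {j: box for j, (box, s) in enumerate(detections) if s < high_thresh}
    # First association: high-score boxes against all tracks.
    first, remaining, unmatched_high = greedy_match(tracks, high, iou_min)
    # Second association: low-score boxes against the remaining tracks.
    second, still_unmatched, _ = greedy_match(remaining, low, iou_min)
    # Unmatched low-score boxes are dropped as background.
    return {**first, **second}, still_unmatched, unmatched_high

tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [((1, 1, 11, 11), 0.9), ((51, 51, 61, 61), 0.3),
              ((200, 200, 210, 210), 0.2), ((300, 300, 310, 310), 0.95)]
print(byte_associate(tracks, detections))
```

Here track 2 is recovered only in the second round, from a low-score box that a score threshold alone would have thrown away.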
Characteristics of MOT Evaluation Metrics
MOT evaluation metrics need to exhibit two significant properties
- MOT evaluation metrics need to address five error types in MOT. These five error types are False Negatives (FN), False Positives (FP), Fragmentation, Mergers (ID Switch), and Deviation.
- MOT evaluation metrics should be monotonic, and the error types should be differentiable, so that the metrics provide information about the tracker's performance with respect to each of the five basic error types.
Commonly used MOT evaluation metrics
Track-mAP
Track-mAP performs both matching and association at a trajectory level and is biased toward measuring association. It operates on confidence-ranked potential tracking results. Track-mAP is non-monotonic in detection.
Multi-Object Tracking Accuracy (MOTA)
MOTA is the most widely used metric that closely represents human visual assessment. In MOTA, matching is done at a detection level. Association is measured in MOTA using Identity Switches (IDSW), which occur when a tracker wrongfully swaps object identities or when a track is lost and reinitialized with a different identity. MOTA measures three types of tracking errors: False Positives, False Negatives, and ID Switches.
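The three error counts combine into a single score as MOTA = 1 - (FN + FP + IDSW) / gtDet, where gtDet is the total number of ground-truth detections. A minimal sketch follows; the function and argument names are chosen here for illustration.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_detections):
    """MOTA = 1 - (FN + FP + IDSW) / number of ground-truth detections."""
    if num_gt_detections <= 0:
        raise ValueError("MOTA is undefined without ground-truth detections")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_detections

print(mota(false_negatives=10, false_positives=5, id_switches=5,
           num_gt_detections=100))
```

Because the errors are summed, MOTA can be negative when a tracker makes more errors than there are ground-truth detections.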

The Identification Metrics: IDF1
IDF1 emphasizes association accuracy rather than detection. IDF1 uses IDTP (Identity True Positives), where a prID is matched with a gtID on the parts of their trajectories where the similarity satisfies S ≥ α. IDF1 is the ratio of correctly identified detections over the average number of ground-truth and computed detections. The Hungarian algorithm selects which trajectories to match so as to minimize the sum of IDFP and IDFN.

IDF1 combines IDP (ID Precision) and IDR (ID Recall).
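In terms of counts, IDP = IDTP / (IDTP + IDFP), IDR = IDTP / (IDTP + IDFN), and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN), which is the harmonic mean of IDP and IDR. The sketch below assumes the identity counts have already been obtained from the trajectory matching; the function name is illustrative.

```python
def id_scores(idtp, idfp, idfn):
    """Return (IDP, IDR, IDF1) from identity true/false positive/negative counts."""
    idp = idtp / (idtp + idfp) if (idtp + idfp) > 0 else 0.0
    idr = idtp / (idtp + idfn) if (idtp + idfn) > 0 else 0.0
    denom = 2 * idtp + idfp + idfn
    idf1 = 2 * idtp / denom if denom > 0 else 0.0
    return idp, idr, idf1

print(id_scores(idtp=50, idfp=50, idfn=0))
```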

Higher-Order Tracking Accuracy (HOTA)
HOTA is a single unified metric for ranking trackers. HOTA can be decomposed into five components, each corresponding to a different error type: Detection Recall, Detection Precision, Association Recall, Association Precision, and Localisation Accuracy. As a result, HOTA's error types are differentiable and the metric is strictly monotonic, so it provides information about the tracker's performance with respect to each of the basic error types.
HOTA tracking errors are categorized into Detection errors, Association errors, and Localization errors.
- Detection error occurs when a tracker predicts detections that don’t exist in the ground truth or fails to predict detections in the ground truth. Detection errors can be further categorized as detection recall (measured by FNs) and detection precision (measured by FPs)
- Association error occurs when trackers assign the same prID to two detections with different gtIDs or assign different prIDs to two detections that should have the same gtID. Association errors are further categorized into errors of association recall (measured by FNAs) and association precision (measured by FPAs)
- Localization errors occur when prDets are not perfectly spatially aligned with gtDets.
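At a single localization threshold α, these pieces combine as HOTA_α = sqrt(DetA_α · AssA_α), where DetA_α = |TP| / (|TP| + |FN| + |FP|) and AssA_α is the average over all true positives c of A(c) = |TPA(c)| / (|TPA(c)| + |FNA(c)| + |FPA(c)|). The sketch below assumes the per-true-positive association scores A(c) have already been computed; the function name and inputs are illustrative.

```python
import math

def hota_at_alpha(tp_association_scores, num_fn, num_fp):
    """HOTA at one threshold from per-TP association scores and FN/FP counts.

    tp_association_scores: one value A(c) in [0, 1] for every true positive c.
    Returns (DetA, AssA, HOTA).
    """
    num_tp = len(tp_association_scores)
    if num_tp == 0:
        return 0.0, 0.0, 0.0
    det_a = num_tp / (num_tp + num_fn + num_fp)       # detection accuracy
    ass_a = sum(tp_association_scores) / num_tp       # association accuracy
    return det_a, ass_a, math.sqrt(det_a * ass_a)

print(hota_at_alpha([1.0, 1.0, 0.5, 0.5], num_fn=2, num_fp=2))
```

The final HOTA score averages HOTA_α over a range of α thresholds, which is how localization accuracy enters the metric.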

MOTA performs both matching and association scoring at a local detection level but accentuates detection accuracy, whereas IDF1 performs at a trajectory level by emphasizing the effect of association.
Track-mAP is similar to IDF1 as it performs both matching and association at a trajectory level and is biased toward measuring association.
HOTA balances both by being an explicit combination of a detection score and an association score by performing matches at the detection level while scoring association globally over trajectories.

Read this article for a detailed understanding of different MOT evaluation metrics
References:
Simple Online and Realtime Tracking, Alex Bewley et al.
Simple Online and Realtime Tracking with a Deep Association Metric
HOTA: A Higher-Order Metric for Evaluating Multi-Object Tracking
How to evaluate tracking with the HOTA metrics
MOT16: A Benchmark for Multi-Object Tracking
Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics
Evaluating Multi-Object Tracking
An Introduction to Object Tracking
ByteTrack: A Simple Yet Effective Multi-Object Tracking Technique






