An Introduction to Object Tracking
Common challenges with object tracking and different techniques for object tracking
What is Object Tracking?
Object tracking is a technique for identifying objects of interest in a video and assigning a unique ID to each distinct object as it moves. Tracking of an object starts when it enters the scene and ends when it leaves the scene.
Object tracking aims to accurately identify objects of interest, estimate their trajectories in videos and track them as they move. Tracking determines whether an object is the same as in the previous frames.
Object Tracking is composed of:
- Object Detection and Recognition: A technique to identify and localize the objects of interest by bounding boxes in each frame
- Object Tracking: Generating a unique ID for each detected object, tracking them as they move around in a video while maintaining the ID assignment.
Applications of Object Tracking
- Autonomous driving: tracking cars and other obstacles to plan a route and avoid collisions
- Traffic control: detecting moving vehicles and recognizing their make, model, and type to gather traffic information
- Digital forensics includes criminal and online forensic tracking for single or multiple targets.
- Visual surveillance to handle vehicle theft, crowd management, etc.
- Monitoring social distancing during COVID-19
- Identifying defective pieces on the production line
Common Challenges with Object Tracking
There are several challenges with Object Tracking; understanding and addressing these scenarios can aid in accurately tracking objects.
- Accurate object detection: The object detection algorithm must identify the objects of interest with high confidence. Typical detection errors are failing to detect an object of interest, assigning the wrong class label to a detected object, and incorrectly localizing an identified object.
The figure below shows two such errors: a person is not detected correctly when two people walk close to each other, and caution tape is wrongly labeled as a person. Incorrect detections like these prevent accurate object tracking.

- Speed of object detection: For real-time object tracking, object detection must take as little time as possible so that object IDs can be tracked accurately from frame to frame.
- ID switching: occurs when two similar objects overlap or blend, causing their identities to be switched and making it difficult to keep track of the object IDs.
The figure below shows that when three people come together, there will be a possibility of ID switching.

- Background distortion: Busy backgrounds make it difficult to detect small objects during object detection.

- Occlusion: occurs when an object of interest is hidden (occluded) by another object. As shown below, the lamp post is blocking the person behind it.

- Image illumination: Lighting has an enormous influence on object detection and recognition. The same objects look different depending on the lighting conditions; for example, the scene in the image below will look different in the early morning, the afternoon, and the late evening.

- Multiple Spatial Spaces, Deformation, or Object rotation: Objects can deform, change their shape, size, or aspect ratio, or rotate in unanticipated ways, which complicates object detection. Object detectors must be trained to assign the correct class to objects in all the shapes, sizes, aspect ratios, and movements they can take.
- Motion blurring: Visual streaking or smearing captured by the camera when the recorded object moves makes object detection difficult.
Object Tracking Techniques
Several object tracking techniques with a high-level overview of their workings are discussed here.
Centroid-based Object Tracking
Centroid-based object tracking utilizes the Euclidean distance between the centroids of the objects detected between two consecutive frames in a video.
Step 1: Detect the objects in the frame at time t-1 as bounding boxes and assign a unique ID to each object.
Step 2: Calculate the centroids of the objects detected in the frame at time t-1.
Step 3: Detect the objects in the frame at time t as bounding boxes.
Step 4: Calculate the centroids of the objects detected in the frame at time t.
Step 5: Calculate the Euclidean distance between the centroids of all the objects detected in frames t-1 and t.

Step 6: If the distance between a centroid at time t-1 and a centroid at time t is less than the threshold, it is the same object in motion. Hence, keep the existing object ID and update the object's bounding box coordinates to the new bounding box.
Step 7: If a centroid at time t is farther than the threshold from every centroid at time t-1, treat it as a new object and assign a new object ID.
Step 8: When an object from the previous frame cannot be matched to any object detected in the current frame, remove its object ID from tracking.
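The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name CentroidTracker, the (x1, y1, x2, y2) box format, the 50-pixel threshold, and the closest-pair-first matching order are all assumptions made for the example.

```python
import math
from itertools import count


def centroid(box):
    """Return the (x, y) center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


class CentroidTracker:
    """Hypothetical minimal tracker that follows Steps 1 to 8."""

    def __init__(self, max_distance=50.0):
        self.max_distance = max_distance  # assumed threshold, in pixels
        self.tracks = {}                  # object ID -> last bounding box
        self._ids = count()

    def update(self, boxes):
        """Match the boxes of the current frame to the existing object IDs."""
        # Step 5: distances between all previous and current centroids,
        # sorted so that the closest pairs are matched first.
        pairs = sorted(
            (math.dist(centroid(old_box), centroid(new_box)), track_id, j)
            for track_id, old_box in self.tracks.items()
            for j, new_box in enumerate(boxes)
        )
        updated, used = {}, set()
        for dist, track_id, j in pairs:
            if dist > self.max_distance:
                break  # all remaining pairs are even farther apart
            if track_id in updated or j in used:
                continue
            updated[track_id] = boxes[j]  # Step 6: same object, new box
            used.add(j)
        for j, box in enumerate(boxes):
            if j not in used:
                updated[next(self._ids)] = box  # Step 7: new object ID
        # Step 8: IDs that found no match are dropped by not copying them.
        self.tracks = updated
        return updated
```

Calling update once per frame with that frame's detected boxes returns the current mapping from object ID to bounding box. A practical tracker would usually keep an unmatched ID alive for a few frames before removing it.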

IOU Object Tracker
Intersection-over-Union (IOU) tracking is another technique for object tracking. It associates the detections of subsequent frames with tracks solely by their spatial overlap.
Each track is assigned the object detection that has the highest IOU with the track's last known object position. The association can be posed as a linear assignment problem and solved with the Hungarian algorithm to maximize the sum of all IOUs for the frame.
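As a rough sketch of this association, the Python below computes the IOU of two boxes and picks the track-to-detection assignment with the largest summed IOU. To stay dependency-free, an exhaustive search stands in for the Hungarian algorithm; it reaches the same optimum but is only practical for a handful of boxes. The function names, the (x1, y1, x2, y2) box format, and the iou_min value of 0.3 are assumptions for illustration.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_min=0.3):
    """Return (track index, detection index) pairs maximizing the summed IOU.

    Exhaustive search is used here in place of the Hungarian algorithm.
    """
    scores = [[iou(t, d) for d in detections] for t in tracks]
    # Every track may take one detection or stay unassigned (None).
    slots = list(range(len(detections))) + [None] * len(tracks)
    best, best_sum = (), -1.0
    for perm in permutations(slots, len(tracks)):
        total = sum(scores[i][j] for i, j in enumerate(perm) if j is not None)
        if total > best_sum:
            best, best_sum = perm, total
    # Discard assigned pairs whose overlap is below the IOU threshold.
    return [
        (i, j) for i, j in enumerate(best)
        if j is not None and scores[i][j] >= iou_min
    ]
```

Tracks that receive no pair were not matched in this frame, and detections that appear in no pair are candidates for new tracks.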

An improvement to the IOU-based Object Tracker is incorporating visual tracking into the intersection-over-union (IOU) tracker.
Visual IOU Object Tracker
The visual IOU object tracker runs visual tracking in two directions: forward and backward visual tracking of the object helps merge discontinued tracks.
- When no object detection satisfies the IOU threshold for an object track, the visual tracker is initialized on the last known object position at the previous frame and used to track the object for a specified number of frames.
- If a new detection satisfies the IOU threshold within the specified frames, the visual tracking is stopped, and the IOU tracker continues. Otherwise, the track is terminated.
The above approach helps compensate reliably for a few missing detections.

Visual IOU tracking reduces false negative detections, and as a result, the number of ID switches and fragmentations is reduced, significantly increasing the quality of the tracks.
Simple Online Realtime Tracking (SORT)
The SORT method is based on the assumption that tracking quality depends on object detection performance.

SORT starts by detecting objects using Faster Region-CNN (FrRCNN).
- The first stage of FrRCNN extracts features and proposes regions for the second stage.
- The second stage of FrRCNN classifies the object in the proposed region.
FrRCNN shares the parameters between the two stages, creating an efficient framework for detection.
Each target's bounding box is propagated into the current frame by predicting its new location. When a detection is associated with a target, the detected bounding box is used to update the target state, which is solved optimally using a Kalman filter framework.
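To make the predict and update cycle concrete, here is a minimal sketch of a constant-velocity Kalman filter over a single coordinate, such as the x position of a box center. SORT tracks a full bounding-box state; the one-dimensional state, the class name, and the noise levels q and r below are simplifying assumptions for illustration.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over one coordinate with state [position, velocity]."""

    def __init__(self, position, q=0.01, r=1.0):
        self.p, self.v = float(position), 0.0
        self.P = [[1.0, 0.0], [0.0, 10.0]]  # velocity starts very uncertain
        self.q, self.r = q, r               # assumed process/measurement noise

    def predict(self, dt=1.0):
        """Propagate the state one step ahead: P <- F P F^T + Q."""
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Correct the state with a measured position z."""
        (a, b), (c, d) = self.P
        s = a + self.r            # innovation variance
        k0, k1 = a / s, c / s     # Kalman gain for position and velocity
        y = z - self.p            # innovation (measurement residual)
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]
```

In each frame, predict gives the expected location used for association, and update is called only for targets that were matched to a detection.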
The assignment cost matrix is computed as the intersection-over-union (IOU) distance between each detection and all predicted bounding boxes from the existing targets. The assignment is solved using the Hungarian algorithm.
A new track is initiated for each detection that cannot be associated with an existing track, that is, for any detection whose overlap with every existing target is less than IOU(min), which signifies the existence of an untracked object.
Tracks are terminated if they are not detected for T(Lost) frames, and T(Lost) is usually set to 1. Tracks that exceed this maximum age are considered to have left the scene and are deleted.
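The creation and deletion rules can be sketched as a small bookkeeping function. It assumes the association step has already produced a list of (track index, detection index) matches; the Track dataclass, the manage_tracks name, and the ids iterator are hypothetical helpers introduced for this example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Track:
    track_id: int
    box: tuple
    misses: int = 0  # consecutive frames without a matched detection


def manage_tracks(tracks, detections, matches, ids, t_lost=1):
    """Apply one frame of track updates, deletions, and creations."""
    matched_tracks = {t for t, _ in matches}
    matched_dets = {d for _, d in matches}
    for t, d in matches:
        tracks[t].box = detections[d]  # matched: refresh the box
        tracks[t].misses = 0
    for i, track in enumerate(tracks):
        if i not in matched_tracks:
            track.misses += 1          # unmatched in this frame
    # Terminate tracks that went undetected for t_lost frames.
    kept = [track for track in tracks if track.misses < t_lost]
    # Every unmatched detection starts a new track.
    for j, box in enumerate(detections):
        if j not in matched_dets:
            kept.append(Track(next(ids), box))
    return kept
```

With t_lost=1, a track disappears as soon as it misses a single frame; larger values trade faster cleanup for some tolerance to missed detections.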
The SORT algorithm works well when object motion is small: it keeps ID switches low and copes with short-term occlusion by a passing target. SORT may fail in challenging cases such as crowded scenes and fast motion.
Deep SORT
Deep SORT is an extension of SORT incorporating appearance information through a pre-trained association metric.
Deep SORT and SORT are cascade-style object tracking methods. They first predict the bounding boxes for detected objects and then pool features from them to estimate the corresponding re-ID features.
Deep SORT allows for tracking through more extended periods of occlusion, is simple to implement, and runs in real-time.
Deep SORT adopts a conventional single hypothesis tracking methodology with recursive Kalman filtering and frame-by-frame data association using the Hungarian algorithm.
The Kalman filter predicts future object trajectories based on the current position. It takes the bounding box coordinates as direct observations of the object state to provide a rough estimate of the object's predicted location.
With the new bounding boxes predicted by the Kalman filter, we must associate new detections with the new predictions.
The association between the predicted Kalman states and the newly arrived measurements is formulated as an assignment problem, which is solved using the Hungarian algorithm.

Deep SORT employs a CNN trained on a large-scale person re-identification dataset. After the last classification layer is stripped, a final batch normalization and L2 normalization project the appearance features so that they are compatible with the cosine appearance metric.
The appearance feature is a descriptor of the visual content of a detected bounding box. Deep SORT also introduces a matching cascade that gives priority to more frequently seen objects.
Mahalanobis distance provides information about possible object locations based on the motion of the objects useful for short-term predictions. On the other hand, the cosine distance considers appearance information useful to recover identities after long-term occlusions.
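The two distances can be illustrated with a short, dependency-free sketch. The function names are hypothetical, the Mahalanobis distance assumes a diagonal covariance to avoid a matrix inverse, and taking the minimum over a gallery of a track's stored embeddings is an assumption about how the appearance distance is aggregated.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity; inputs need not be normalized beforehand."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def appearance_distance(gallery, embedding):
    """Smallest cosine distance between a detection's embedding and the
    embeddings stored for a track (the gallery)."""
    return min(cosine_distance(g, embedding) for g in gallery)


def squared_mahalanobis(measurement, mean, variances):
    """Squared Mahalanobis distance between a detection and a predicted
    state, assuming a diagonal covariance (a simplification)."""
    return sum(
        (z - m) ** 2 / s for z, m, s in zip(measurement, mean, variances)
    )
```

A small Mahalanobis distance means the detection lies where the motion model expects the object to be, while a small cosine distance means the detection looks like the object seen earlier.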
Deep SORT reduces ID switches and keeps object identities through occlusions better than SORT.
FairMOT (Multiple Object Tracking)
FairMOT does not use the multi-step approach of SORT and Deep SORT, which first detect objects and their bounding boxes and then track them. FairMOT considers that such a network is biased toward the primary detection task, which is unfair to the re-ID or object tracking task.
Why is Multi-step Object Tracking unfair?
Object tracking accuracy suffers from ID switches because the re-ID (tracking) task is not learned as fairly as the object detection task.
There are two main reasons why multi-step object tracking is unfair:
- Object Tracking depends on the accuracy of the primary task of object detection. Any incorrect/missed object detection will significantly impact the object tracking accuracy.
- The ROI-Align features used for object detection are also used for the re-ID task; however, detection and tracking are two different tasks and need different features. Tracking needs low-level features to discriminate between different instances of the same class. Object detection requires deep and abstract features to estimate object classes and positions; object detection features must be similar for different instances of the same class.

Apart from being unfairly trained, the two-step object tracking methods suffer from scalability issues and cannot achieve real-time inference speed, especially when there are a large number of objects in video.
The object detection and re-ID tasks are treated equally in FairMOT.

The input image is fed to an encoder-decoder network to extract high-resolution feature maps.
FairMOT then adds two homogeneous branches for detecting objects and extracting re-ID features to obtain a good trade-off between detection and re-ID.
ResNet-34 is the backbone of FairMOT to balance between accuracy and speed. An enhanced version of Deep Layer Aggregation (DLA) is applied to the backbone to fuse multi-layer features.
Detection branch
The detection branch is built on top of anchor-free CenterNet. Three parallel heads are appended to DLA-34 to estimate heatmaps, object center offsets, and bounding box sizes.
- The heatmap estimates the locations of the object centers.
- The box offset head aims to localize objects more precisely.
- The bounding box size head estimates the target box's height and width at each location.
Re-ID Branch
The re-ID branch aims to generate features that can distinguish objects. It assumes that the affinity between the features of different objects is smaller than the affinity between the features of the same object.
The re-ID features are learned through a classification task.
Training of FairMOT
Training is performed jointly on the detection and re-ID branches by adding their losses together. For a given input image, ground-truth heatmaps, box offset and size maps, and a one-hot class representation of the objects are generated. These are compared to the measures estimated by the network to obtain the losses used to train the whole network.
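As a rough illustration of the joint objective, the sketch below treats re-ID learning as a classification over training identities and adds the resulting loss to a detection loss. The detection loss is taken as a given number, and the function names and the plain unweighted sum are assumptions for this example.

```python
import math


def identity_loss(logits, true_id):
    """Softmax cross-entropy for one object.

    logits holds one score per training identity; true_id is the index
    of the object's identity (the position of the 1 in its one-hot label).
    """
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[true_id]


def total_loss(detection_loss, logits_per_object, ids):
    """Add the re-ID classification loss of all objects to the detection loss."""
    id_loss = sum(
        identity_loss(logits, i) for logits, i in zip(logits_per_object, ids)
    )
    return detection_loss + id_loss
```

When the network assigns a high score to the correct identity, identity_loss approaches zero, so minimizing the sum pushes the features of different objects apart while the detection heads are trained at the same time.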
Training of the re-ID features further enhances the association ability of the tracker.
Inference using FairMOT
The FairMOT network takes an input image of size 1088×608. Non-maximum suppression (NMS) is performed on the heatmap predicted by the detection branch, based on the heatmap scores, to extract the peak keypoints. The corresponding bounding boxes are then computed from the estimated offsets and box sizes, and the identity embeddings are extracted at the estimated object centers.
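The decoding described above can be sketched on plain nested lists. This is a simplified illustration rather than FairMOT's actual code: the function name, the 3x3 neighborhood used for heatmap NMS, the score threshold of 0.4, the stride of 4 between heatmap cells and image pixels, and the assumption that box sizes are given in image pixels are all choices made for the example.

```python
def decode_detections(heatmap, offsets, sizes, stride=4, score_thresh=0.4):
    """Turn per-cell head outputs into (x1, y1, x2, y2, score, (row, col)).

    heatmap[r][c] : object-center score in [0, 1]
    offsets[r][c] : (dx, dy) sub-cell correction of the center
    sizes[r][c]   : (w, h) box size, assumed to be in image pixels
    """
    rows, cols = len(heatmap), len(heatmap[0])
    detections = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < score_thresh:
                continue
            # Heatmap NMS: keep a cell only if it is the maximum of its
            # 3x3 neighborhood, i.e. a local peak.
            neighbors = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            if score < max(neighbors):
                continue
            dx, dy = offsets[r][c]
            w, h = sizes[r][c]
            # Refine the center with the offset, then map to image pixels.
            cx, cy = (c + dx) * stride, (r + dy) * stride
            detections.append(
                (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score, (r, c))
            )
    return detections
```

The (row, col) entry returned with each box is the peak location, which is where the identity embedding of that object would be read from the re-ID feature map.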
The next step is to associate the detected boxes over time using the re-ID features. In the first stage, a Kalman filter predicts the locations of the tracked objects in the current frame, and the Mahalanobis distance between the predicted and detected boxes is computed, similar to Deep SORT.
The re-ID branch estimates a re-ID feature for each pixel to characterize the object centered at that pixel, and tracking is based on the features at the predicted object centers.
In the second stage, unmatched detections and tracks are linked based on the overlap between the detected and predicted boxes. Detections that remain unmatched are initialized as new tracks, and unmatched tracklets are saved for 30 frames in case they reappear in the future. Matched tracks must meet the matching threshold.
FairMOT attains high levels of detection and tracking accuracy.
Conclusion:
Object tracking detects objects in each frame represented as bounding boxes, followed by a data association that aims to associate object detections across frames in a video sequence to the object tracks.
Object tracking can be implemented using a simple centroid-based approach or an IOU-based approach.
SORT and Deep SORT are cascade-based object tracking methods that utilize the Kalman filter. The Kalman filter predicts the future object trajectories based on the current position. The Hungarian algorithm is used to solve the assignment problem that associates the predicted Kalman states with the newly arrived measurements.
FairMOT is a simple architecture consisting of two homogeneous branches to detect objects and extract re-ID features. FairMOT also uses the Kalman filter to predict the future object trajectories, and then the association is done using the Hungarian algorithm. FairMOT achieves high levels of detection and tracking accuracy.
References:
Simple Online and Realtime Tracking, Alex Bewley
Simple Online and Realtime Tracking with a Deep Association Metric
