Video Object Tracking with Optical Flow and YOLO

Introduction
Object Detection and Object Tracking are highly useful in the modern world, especially for solving real-life problems in almost every business field, such as agriculture, robotics, and transportation. This article is meant to make you familiar with the terms “Detection” and “Tracking”, and, of course, to show you how to implement them in code and visualize the results.
What is the difference between “Detection” and “Tracking”?
When detecting an object, we limit ourselves to one frame: the Object Detection algorithm works on a single picture and only has to find certain objects in it. Object Tracking, in contrast, concerns the whole video. The algorithm needs to follow each object across the entire video and make sure its identity stays unique from frame to frame. This is the task of tracker algorithms (DeepSORT, SORT, the native OpenCV tracking algorithms)[1].
In general, this involves assigning an ID to each object, using a Kalman filter (to predict the object’s future position), or even optical flow (to track moving objects).
Methods and Algorithms Used
Okay, now that we understand what detection and tracking are, we can move on to the methodology and some advanced techniques.
Optical Flow
Optical flow is the pattern of apparent motion of image objects between two consecutive frames caused by the movement of an object or camera. It is a 2D vector field where each vector is a displacement vector showing the movement of points from the first frame to the second.

The classic illustration shows a ball moving in 5 consecutive frames, with an arrow marking its displacement vector[2]. Optical flow has many applications in areas like:
- Structure from Motion
- Video Compression
- Video Stabilization
More on optical flow in the “Making it Real” chapter.
YOLOv8 model
YOLOv8 is a model from the YOLO (You Only Look Once) family, developed by Ultralytics. Generally, this model specializes in:
- Detecting Objects
- Segmentation
- Classifying Objects
The YOLOv8 family of models is widely considered one of the best in the field, offering strong accuracy and fast performance. Part of its convenience comes from the fact that it ships as five separate models, each catering to different needs, time constraints, and scopes.

In summary, the models vary in mean average precision (mAP) and in the number of parameters they have. Some are more resource-intensive than others and differ in speed. For instance, the X model is the largest and most accurate, but it processes videos and images more slowly. On the other hand, the Nano model (N) is the fastest option but sacrifices some accuracy [3].
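For reference, here is a small sketch (my own, not from the article) of how the detection checkpoints are loaded with the ultralytics API; the file names yolov8n/s/m/l/x.pt are the ones Ultralytics publishes.
from ultralytics import YOLO

# The five detection checkpoints, smallest/fastest to largest/most accurate:
# yolov8n.pt, yolov8s.pt, yolov8m.pt, yolov8l.pt, yolov8x.pt
fast_model = YOLO("yolov8n.pt")      # Nano: fastest, lowest mAP
accurate_model = YOLO("yolov8x.pt")  # X: slowest, highest mAP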
SORT Algorithm
The SORT algorithm, by Alex Bewley, is a tracking algorithm for 2D multiple object tracking in video sequences. It serves as the foundation for other trackers such as DeepSORT. Due to its minimalist nature, it is straightforward to use and implement. In the author’s repository you can find more about the algorithm and even look into the source code.
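To make the interface concrete, here is a minimal sketch of how SORT is typically driven, assuming the sort.py module from Alex Bewley’s repository is on your path; the detections format ([x1, y1, x2, y2, score] per row) and the update() return format ([x1, y1, x2, y2, id] per row) follow that implementation.
import numpy as np
from sort import Sort

# One tracker instance is kept alive for the whole video
tracker = Sort(max_age=20, min_hits=3, iou_threshold=0.3)

# Per frame: every detection is a row [x1, y1, x2, y2, confidence]
detections = np.array([[100, 150, 220, 300, 0.91],
                       [400, 180, 520, 330, 0.84]])

# update() returns one row [x1, y1, x2, y2, id] per tracked object
tracks = tracker.update(detections)
for x1, y1, x2, y2, track_id in tracks:
    print(int(track_id), int(x1), int(y1), int(x2), int(y2))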
YOLOv8 is built on a convolutional neural network (CNN), and CNN-based appearance features are what extensions such as DeepSORT add on top of SORT, so let’s take a moment to explain what a CNN actually is.
Math Behind
CNNs

So, CNNs, or Convolutional Neural Networks, are neural networks built from convolution layers and pooling layers. As written in A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way, “The objective of the Convolution Operation is to extract the high-level features such as edges, from the input image”; put simply, convolution layers extract the most important features from the initial input. The pooling layer, on the other hand, simplifies things, or “is responsible for reducing the spatial size of the Convolved Feature”. This process enables the machine to understand the features of the initial input. Stacking convolutional and pooling layers on top of each other therefore yields a progressively more complex feature-learning process[4].
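To ground this, here is a minimal sketch of a stacked convolution + pooling feature extractor written with PyTorch (an assumption on my part; the article’s own code uses OpenCV and ultralytics, and YOLOv8’s real backbone is far larger):
import torch
import torch.nn as nn

# A toy CNN: convolution layers extract features, pooling layers shrink the spatial size
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn low-level features (edges, corners)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # halve the spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # learn higher-level features
    nn.ReLU(),
    nn.MaxPool2d(2),
)

x = torch.randn(1, 3, 64, 64)  # a dummy 64x64 RGB image
print(features(x).shape)       # torch.Size([1, 32, 16, 16])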
Optical Flow Math
Pixel movement between two consecutive frames is referred to as optical flow; the apparent motion can come either from the camera moving or from the scene moving. The fundamental goal of optical flow is to calculate the displacement vector of an object caused by camera motion or by the object’s own motion: for every image pixel (dense flow) or for a sparse set of feature points (sparse flow), we want to determine its displacement between the two frames.
If we were to illustrate the optical flow problem with a picture, it would look somewhat like this:

Optical flow defines a dense vector field and is a key component of several computer vision and machine learning applications, including object tracking, object recognition, movement detection, and robot navigation. In this field, each pixel is assigned its own displacement vector, which helps determine the direction and speed of every moving pixel in each frame of the input video sequence [5].
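As a concrete illustration, here is a minimal sketch of dense optical flow with OpenCV’s Farnebäck method, assuming a readable video at "data/los_angeles.mp4" (the clip the article uses later). The underlying math rests on the brightness-constancy assumption, I(x, y, t) ≈ I(x + dx, y + dy, t + dt), which linearizes to the constraint I_x·u + I_y·v + I_t = 0 for the per-pixel displacement (u, v).
import cv2
import numpy as np

cap = cv2.VideoCapture("data/los_angeles.mp4")
_, prev_frame = cap.read()
prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense flow: one (u, v) displacement vector per pixel
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Visualize direction as hue and speed as brightness
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(frame)
    hsv[..., 0] = ang * 180 / np.pi / 2
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    cv2.imshow("Optical Flow", cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
    prev_gray = gray
    if cv2.waitKey(1) == 27:  # press Esc to quit
        break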
Making it Real
Object Tracking with YOLOv8 and SORT
First of all, let’s understand how to work with the YOLOv8 model.
pip install ultralytics
# !pip install ultralytics in a Jupyter notebook
Then:
from ultralytics import YOLO
# Assuming you have OpenCV installed
import cv2

MODEL = "yolov8x.pt"
# Creating an instance of your chosen model
model = YOLO(MODEL)
results = model("people.jpg", show=True)
# waitKey(0) keeps the window open until any keypress
# waitKey(1) displays a frame for 1 ms (used for videos)
cv2.waitKey(0)

Now that you understand the basics, let’s move on to real object detection and tracking.
import cv2
from ultralytics import YOLO
import math
import numpy as np
# cvzone: OpenCV helpers, prettier and easier to use
import cvzone
# Importing everything from SORT (the Sort tracker class)
from sort import *

# cap = cv2.VideoCapture(0)  # for webcam
# cap.set(3, 1280)
# cap.set(4, 720)
cap = cv2.VideoCapture("data/los_angeles.mp4")
model = YOLO("yolos/yolov8n.pt")
The cap variable will be the instance of the video that we are using, and model the instance of the YOLOv8 model.
classes = {0: 'person',
           1: 'bicycle',
           2: 'car',
           ...
           78: 'hair drier',
           79: 'toothbrush'}
result_array = [classes[i] for i in range(len(classes))]
The class values you get from the YOLOv8 API are class IDs (numbers). Each number has a class name attached to it, so it is simplest to keep a dict for the mapping and then, if needed, transform it into an array (I got too lazy to type the array manually).
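As a side note (not from the original article), the ultralytics model already carries this mapping in its names attribute, so the same array could be built without typing the dict by hand:
# model.names is a dict {class_id: class_name} shipped with the weights
result_array = [model.names[i] for i in range(len(model.names))]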
# Line coordinates (explained below)
l = [593, 500, 958, 500]

while True:
    # Reading the content of the video frame by frame
    _, frame = cap.read()
    # Every frame goes through the YOLO model
    results = model(frame, stream=True)
    for r in results:
        # Bounding boxes detected in this frame
        boxes = r.boxes
        for box in boxes:
            # Extracting coordinates
            x1, y1, x2, y2 = box.xyxy[0]
            x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
            # Width and height of the bounding box
            w, h = x2 - x1, y2 - y1
            cvzone.cornerRect(frame, (x1, y1, w, h), l=5, rt=2,
                              colorC=(255, 215, 0), colorR=(255, 99, 71))
            # Confidence (accuracy) of every bounding box
            conf = math.ceil((box.conf[0] * 100)) / 100
            # Class id (number)
            cls = int(box.cls[0])
That was the part where we detected every object. Now it’s time to track and count every car on the road:
# The SORT tracker and the list of counted IDs must exist before the loop
# (the parameter values here are illustrative)
tracker = Sort(max_age=20, min_hits=3, iou_threshold=0.3)
totalCount = []

while True:
    _, frame = cap.read()
    results = model(frame, stream=True)
    detections = np.empty((0, 5))  # empty array for this frame's detections
    for r in results:
        boxes = r.boxes
        for box in boxes:
            ''' rest of the detection code from above '''
            ins = np.array([x1, y1, x2, y2, conf])  # every object is recorded like this
            detections = np.vstack((detections, ins))  # then stacked together in a common array
    tracks = tracker.update(detections)  # sending our detections to the tracker
    cv2.line(frame, (l[0], l[1]), (l[2], l[3]), color=(255, 0, 0), thickness=3)  # line as a threshold
Now, I create an array to store all of our detections. Next, I send this array to the tracker’s update function, which returns the unique IDs together with the bounding box coordinates (the same boxes as before). The important detail is the cv2.line call: I draw a line at specific coordinates, and whenever a car, identified by its ID, crosses this line, the overall count is incremented. In essence, we are building a car counter that operates on the car IDs[6].
    for result in tracks:
        x1, y1, x2, y2, id = result
        x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
        w, h = x2 - x1, y2 - y1
        # .putTextRect puts a labelled rectangle above the bounding box
        cvzone.putTextRect(frame, f'{result_array[cls]} {conf} id:{int(id)} ',
                           (max(0, x1), max(35, y1 - 20)),
                           scale=1, thickness=1, offset=3, colorR=(255, 99, 71))
        # Coordinates of the center of the bounding box
        cx, cy = x1 + w // 2, y1 + h // 2
        if l[0] < cx < l[2] and l[1] - 10 < cy < l[3] + 10:
            if totalCount.count(id) == 0:
                # Counting every new car that crosses the line
                totalCount.append(id)
                # The line changes its color when an object crosses it
                cv2.line(frame, (l[0], l[1]), (l[2], l[3]), color=(127, 255, 212), thickness=5)
    # Rectangle displaying the number of counted cars
    cvzone.putTextRect(frame, f' Total Count: {len(totalCount)} ', (70, 70),
                       scale=2, thickness=1, offset=3, colorR=(255, 99, 71))
    m.write(frame)  # m is a cv2.VideoWriter created beforehand (not shown) to save the output
    cv2.imshow("Image", frame)
    cv2.waitKey(1)
Now, onto the most interesting aspect: cx and cy are the coordinates of the center of the bounding box. Using these values, we can determine whether the car has crossed the designated line (check out the code). If the car has indeed crossed the line, the next step is verifying whether the ID assigned to this car is new, which means the car has not crossed the line before.
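Once the loop ends (or you interrupt it), the usual OpenCV cleanup applies; a short sketch, assuming m is the cv2.VideoWriter mentioned above:
# Release the video source and the writer, then close all OpenCV windows
cap.release()
m.release()
cv2.destroyAllWindows()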