Daryl Tan

Summary

FCOS is a fully convolutional one-stage object detection method that eliminates the need for anchor boxes and provides performance comparable to anchor-based methods.

Abstract

FCOS is a new approach to object detection that eliminates the need for anchor boxes, which are commonly used in methods such as FasterRCNN, RetinaNet, and SSD. This method directly finds objects based on points tiled on the image and has several advantages, including being anchor-free, proposal-free, and computing per-pixel predictions in a fully convolutional manner. FCOS is built on top of FPN, which aggregates multi-level features from the backbone as a pyramid, and outputs are fed through a subnetwork consisting of 3 branches: classification, center-ness, and regression. The article provides a detailed explanation of the forward pass pipeline, detection head, and mapping detection head predictions to location on the image.

Opinions

  • The author suggests that FCOS is a welcome step forward in object detection research, given its simplicity while being on par with anchor-based methods.
  • The author notes that FCOS is able to match performance with anchor-based methods while requiring fewer predictions per image.
  • The author suggests that FCOS is a more efficient method for object detection, as it eliminates the need for cumbersome IoU matching calculation and preset anchors.
  • The author notes that FCOS is able to improve detection and localization accuracy of objects with varying sizes, particularly small objects.
  • The author suggests that FCOS is able to suppress spurious detections that deviate from the center of the object, making it easier to remove overlapping positive boxes during non-maximum suppression.
  • The author notes that FCOS is able to match the performance of anchor-based methods while being simpler and more efficient.
  • The author suggests that FCOS is a promising approach for object detection research and looks forward to seeing more advancements and improvements in this area.

FCOS Walkthrough: The Fully Convolutional Approach to Object Detection

Going anchor-free for object detection

Source: Detection on Cityscapes

Introduction

Since the development of convolutional neural networks, object detection has been dominated by anchor-based methods such as FasterRCNN, RetinaNet and SSD. These methods rely on a large number of preset anchors tiled onto the image, with each anchor predicting whether an object is contained and refining the box coordinates.

Recently, more attention has been geared towards eliminating the requirement for preset anchors, which demand manual tuning of the scale, aspect ratio and number of anchors. To this end, an effective method, FCOS [1], was proposed which directly finds objects based on points tiled on the image.

The main characteristics of FCOS are:

  1. Anchor free: no cumbersome IoU matching calculation and no preset anchors.
  2. Proposal free ✅: one-stage detection.
  3. Computes per-pixel predictions in a fully convolutional manner: the number of detection predictions equals the spatial size of the feature maps.

This model was well received as it was able to match the performance of anchor-based methods while requiring fewer predictions per image. In this article, we will break down all the components and understand how this is done, while conducting some investigation on the Cityscapes dataset along the way.

Forward Pass

FCOS is built on top of FPN, which aggregates multi-level features from the backbone as a pyramid. Predictions are obtained across 5 feature levels from the FPN.

The outputs are then fed through a subnetwork consisting of 3 branches: classification, center-ness and regression.

We will discuss the forward pass pipeline here.

Figure 1. FCOS Architecture

Inputs: An image of size [B, H, W, 3]

Backbone: To be compatible with FPN, multi-scale features are extracted from a CNN encoder. Any existing encoder such as DenseNet or ResNet can be plugged in as the feature extractor.

For ResNet50, we extract the last feature maps from stages 1 to 5.

C1: [B, H/2, W/2, 64]    
C2: [B, H/4, W/4, 256] 
C3: [B, H/8, W/8, 512]  
C4: [B, H/16, W/16, 1024]
C5: [B, H/32, W/32, 2048]
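
As a rough sketch (not the authors' implementation), these stage outputs could be pulled from a torchvision ResNet-50 as follows. Note that PyTorch tensors are channels-first, so the shapes read [B, C, H, W] rather than the [B, H, W, C] notation above:

import torch
from torchvision.models import resnet50

def extract_stage_features(model, x):
    # Stem: stride-2 conv -> C1
    x = model.relu(model.bn1(model.conv1(x)))
    c1 = x                  # [B, 64, H/2, W/2]
    x = model.maxpool(x)
    # Residual stages, each halving the spatial resolution
    c2 = model.layer1(x)    # [B, 256,  H/4,  W/4]
    c3 = model.layer2(c2)   # [B, 512,  H/8,  W/8]
    c4 = model.layer3(c3)   # [B, 1024, H/16, W/16]
    c5 = model.layer4(c4)   # [B, 2048, H/32, W/32]
    return c1, c2, c3, c4, c5

backbone = resnet50()
c1, c2, c3, c4, c5 = extract_stage_features(backbone, torch.randn(1, 3, 512, 1024))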

FPN: FPN takes advantage of the scale-invariant properties of feature pyramids, which enables the model to detect objects over a wide range of scales. Deeper-layer features have lower resolution but are rich in semantic information, while shallow layers contain high-resolution but semantically weak features. To counterbalance both effects, lateral connections are employed to fuse features between shallow and deep layers in the pyramid. This enhances the detection and localization accuracy of objects with varying sizes; small objects, in particular, benefit.

Each successive feature map is downscaled by a factor of 2, with the output channels fixed at 256. The ratio of the output feature size relative to the input image is usually referred to as the output stride.

# fpn_stride = [8, 16, 32, 64, 128]
P3: [B, H/8, W/8, 256]    
P4: [B, H/16, W/16, 256] 
P5: [B, H/32, W/32, 256]  
P6: [B, H/64, W/64, 256]
P7: [B, H/128, W/128, 256]
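
As a minimal sketch of how such a pyramid could be wired up (lateral 1x1 convs plus top-down addition, with P6/P7 produced by strided convs on P5, roughly following the paper's setup; the class and layer names are mine, and even feature sizes are assumed):

import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN sketch: lateral 1x1 convs, top-down addition, P6/P7 from strided convs."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.output = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])
        self.p6 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        # Top-down pathway with lateral connections
        m5 = self.lateral[2](c5)
        m4 = self.lateral[1](c4) + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = self.lateral[0](c3) + F.interpolate(m4, scale_factor=2, mode="nearest")
        p3, p4, p5 = self.output[0](m3), self.output[1](m4), self.output[2](m5)
        p6 = self.p6(p5)           # stride 64
        p7 = self.p7(F.relu(p6))   # stride 128
        return p3, p4, p5, p6, p7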

Detection Head: Per Pixel Predictions

In the same vein as fully convolutional segmentation CNNs, where each pixel in the output layer corresponds to a semantic confidence score, FCOS outputs its predictions in the same fashion for all FPN levels.

Shared Head Branches: The per-pixel prediction is estimated by 3 heads, with each branch being a fully convolutional network (FCN) with a similar architecture, given below.

head = [[Conv2d, GroupNormalization, relu],
        [Conv2d, GroupNormalization, relu],
        [Conv2d, GroupNormalization, relu],
        [Conv2d, GroupNormalization, relu]]

Note that the head is shared across all FPN features, i.e. each level in FPN is fed through the same head.
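
A minimal PyTorch sketch of one such shared tower (my own helper, not the reference code); the same module instance would be applied to every FPN level:

import torch.nn as nn

def make_tower(channels=256, num_convs=4, num_groups=32):
    """One shared conv-GN-ReLU tower, applied to every FPN level (a sketch)."""
    layers = []
    for _ in range(num_convs):
        layers += [
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(num_groups, channels),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

cls_tower = make_tower()   # feeds the classification and center-ness predictors
reg_tower = make_tower()   # feeds the 4-channel box regression predictor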

Centerness head: The center-ness describes the deviation of a location from the center of the object it falls in. The authors suggested adding this branch because they observed that scores for low-quality boxes predicted at locations far away from the center remained high. Such boxes can be suppressed by learning this center-ness scale factor.

The center-ness head outputs, per feature level, the normalized distance from the center of the object each location is responsible for. The closer a location is to the center, the higher the normalized value.

P3_ctrness: sigmoid(head(P3))   # [B, H/8, W/8, 1]  
P4_ctrness: sigmoid(head(P4))   # [B, H/16, W/16, 1]
P5_ctrness: sigmoid(head(P5))   # [B, H/32, W/32, 1] 
P6_ctrness: sigmoid(head(P6))   # [B, H/64, W/64, 1] 
P7_ctrness: sigmoid(head(P7))   # [B, H/128, W/128, 1]

Class prediction head: Predicts the per-pixel class probability, weighted by the center-ness score. As mentioned above, the final class score is obtained by multiplying the class probability with the center-ness score.

P3_class_prob: sigmoid(head(P3)) * P3_ctrness # [B, H/8, W/8, C]  
P4_class_prob: sigmoid(head(P4)) * P4_ctrness # [B, H/16, W/16, C]
P5_class_prob: sigmoid(head(P5)) * P5_ctrness # [B, H/32, W/32, C]
P6_class_prob: sigmoid(head(P6)) * P6_ctrness # [B, H/64, W/64, C]
P7_class_prob: sigmoid(head(P7)) * P7_ctrness # [B, H/128, W/128, C]

Box regression head: Predicts the distances (l, t, r, b) from the location to the four sides of the box. See figure 2.

P3_reg: conv2d(head(P3))   # [B, H/8, W/8, 4]  
P4_reg: conv2d(head(P4))   # [B, H/16, W/16, 4]
P5_reg: conv2d(head(P5))   # [B, H/32, W/32, 4] 
P6_reg: conv2d(head(P6))   # [B, H/64, W/64, 4] 
P7_reg: conv2d(head(P7))   # [B, H/128, W/128, 4]

Note that the regression head is trained to predict scale-normalized distances. Therefore, we have to denormalize them back to image scale during inference: reg = reg_pred * stride. The mapping of pixel predictions to locations on the image is explained in the next section.
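
A minimal illustration of that denormalization step, assuming reg_pred is a list of the raw per-level outputs (the variable names are mine):

# reg_pred: list of per-level tensors [B, H_l, W_l, 4] holding scale-normalized (l, t, r, b)
strides = [8, 16, 32, 64, 128]
reg_image = [pred * stride for pred, stride in zip(reg_pred, strides)]  # distances in image pixels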

Figure 2. Regression points

Mapping Detection Head Prediction to Location on Image

In order to make sense of the predictions, the features are mapped and tiled onto locations (as they are referred to in the paper) on the image plane.

The total number of locations and predictions on the image equals the size of the feature FPN maps:

num locations = [H/8 * W/8] + [H/16 * W/16] + [H/32 * W/32] + [H/64 * W/64] + [H/128 * W/128]
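
For a (1024, 2048) image, this can be checked with a quick sketch (matching the 43,648 locations quoted in Figure 3 below):

# Total locations for a 1024 x 2048 image across strides [8, 16, 32, 64, 128]
H, W = 1024, 2048
num_locations = sum((H // s) * (W // s) for s in [8, 16, 32, 64, 128])
print(num_locations)  # 43648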

Each location is positioned near the center of the receptive field of the corresponding feature-map pixel:

# stride per fpn level: [8, 16, 32, 64, 128]
# [featx, featy]: per level pixel coordinate 
position on image = [stride/2 + featx * stride, 
                     stride/2 + featy * stride]
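
As a concrete sketch, the grid of location centers for one feature level could be generated with numpy (the function and variable names are mine):

import numpy as np

def compute_locations(feat_h, feat_w, stride):
    """Map every feature-map pixel to its (x, y) center on the image plane."""
    xs = np.arange(feat_w) * stride + stride // 2
    ys = np.arange(feat_h) * stride + stride // 2
    grid_x, grid_y = np.meshgrid(xs, ys)
    return np.stack([grid_x.ravel(), grid_y.ravel()], axis=1)  # [feat_h * feat_w, 2]

# e.g. P5 of a 1024 x 2048 image: 32 x 64 locations with stride 32
locations_p5 = compute_locations(1024 // 32, 2048 // 32, 32)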

The figure below shows the relative positions of all the locations, nicely overlayed on the image. Purple corresponds to the P5 locations, for reference. I suggest you zoom in to see the other locations 🧐

Figure 3. Feature locations on the image. In this example, for an image of size (1024, 2048), the total number of locations = 43,648

Training Mode

Let's discuss the training framework in this section. The idea is to encode every location shown in figure 3 with ground-truth information. For each location, we encode the ground-truth label and regression targets based on the following criteria:

Determine if a location is a positive or negative sample

The initial version of FCOS implemented a simple way to mark a location as a positive sample:

  1. A location is considered positive if it lies within a ground truth box.
  2. Positive samples are further filtered based on scale constraints, depending on the size of the feature maps.

Classification targets

If positive, the location is labeled with the class of the box; otherwise, it is set to the background class. For overlapping object boxes, the author suggested choosing the class of the box with the smallest area as the label (fig 4).
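
A small numpy sketch of that smallest-area rule (my own helper, assuming boxes are given as (x1, y1, x2, y2) and labels as integer class ids):

import numpy as np

def assign_min_area_box(location, boxes, labels, background=0):
    """Return the class of the smallest ground-truth box containing the location."""
    x, y = location
    inside = (boxes[:, 0] <= x) & (x <= boxes[:, 2]) & (boxes[:, 1] <= y) & (y <= boxes[:, 3])
    if not inside.any():
        return background          # not inside any box -> negative sample
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    areas = np.where(inside, areas, np.inf)   # ignore boxes that do not contain the location
    return labels[np.argmin(areas)]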

Regression targets

For positive samples, set the regression targets l*, t*, r*, b* based on the bounding box. As mentioned in the paper, and to further elaborate point 2, regression targets are ignored if they fall beyond a certain range.

if max(l*, t*, r*, b*) > m_i or max(l*, t*, r*, b*) < m_(i-1):
   location_label = negative

for m = {0, 64, 128, 256, 512, ∞}, where m_i is the maximum distance that feature level i is responsible for. This constrains the regression range defined for each pyramid level, reducing overlapping matches across feature levels.

For example, feature level P3 is only responsible for regressing boxes with a maximum side length of 64, while P4 regresses boxes between 64 and 128.
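
A sketch of this scale-based filtering, with m = {0, 64, 128, 256, 512, ∞} hard-coded as per-level ranges (the names are mine, not the reference implementation):

import numpy as np

# Size-of-interest ranges per level, for P3 .. P7
SIZE_RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, np.inf)]

def is_in_level_range(reg_targets, level):
    """reg_targets: [num_locations, 4] array of (l*, t*, r*, b*) for one level."""
    max_reg = reg_targets.max(axis=1)
    lo, hi = SIZE_RANGES[level]
    return (max_reg >= lo) & (max_reg <= hi)   # False -> treated as negative at this level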

Centerness targets

The center-ness depicts the normalized distance from the location to the center of the object that the location is responsible for. Given the regression targets (l*, t*, r*, b*) for a location, the center-ness target is defined as,

centerness* = sqrt(min(l*, r*) / max(l*, r*) × 
                   min(t*, b*) / max(t*, b*))

The sqrt slows down the decay of the center-ness. The center-ness ranges from 0 to 1 and is thus trained with a binary cross-entropy loss.
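
A numpy sketch of that target computation (assuming reg_targets holds the positive (l*, t*, r*, b*) values for a batch of locations):

import numpy as np

def centerness_target(reg_targets):
    """reg_targets: [N, 4] array of positive (l*, t*, r*, b*) values."""
    l, t, r, b = reg_targets[:, 0], reg_targets[:, 1], reg_targets[:, 2], reg_targets[:, 3]
    lr = np.minimum(l, r) / np.maximum(l, r)
    tb = np.minimum(t, b) / np.maximum(t, b)
    return np.sqrt(lr * tb)   # in [0, 1], trained with binary cross-entropy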

Figure 4. (Left) Locations within boxes are annotated as positive, color-coded by class. For overlapping regions, e.g. the intersection between the motorcycle, rider and car is labeled as car as it has the smallest area. (Right) Filtered by limiting regression targets.

Scaled normalized distance: The model is trained on regression targets normalized by the stride of the corresponding pyramid feature. This stabilizes training in the presence of regression outliers to some extent.

Other tricks

Several other tricks were implemented to improve detection performance. Center sampling was proposed to only consider locations close to the center of the object as positive, which greatly improves the mAP score. Another technique was to merge the center-ness layer with the regression head.

Further Analysis of Tiling Mechanism

As explained above, location centers are spaced stride pixels apart within each feature level (offset by stride/2 from the border). This works because each location's receptive field covers enough area without overlapping too much with another location's region. Recall that the receptive field of a CNN increases deeper into the network as information is propagated; the covered region thus increases from P3 to P7 in the FPN.

Example: Consider figure 5. Suppose we have an image of size [40, 80] and consider the feature maps from P3 and P4; the location centers can be computed using the formulation above. See the figure as well.

We observe that the P4 feature at P4[2, 1] covers a larger region, as shown, and is thus responsible for detecting larger objects compared to P3. In this example, since the pixel feature is outside the 🚘 bounding box, it is marked as a negative sample even though it has clearly encoded information about the 🚘 (but much of the background as well).

The pixel feature P3[2, 5], on the other hand, is within the object's box. However, its receptive field does not cover the entire 🚘, so its class probability score might be lower than that of another location.

Figure 5. Mapping of feature pixel to location

In addition, the regression limit prevents low-level feature locations from regressing very large objects in the image, even when the location is well contained within the object, further reducing overlapping matches across separate levels.

Computing the Loss

The total loss is given by:

Loss = loss_ctrness + loss_class + loss_reg

Centerness loss: Binary cross-entropy for sigmoid outputs.

Class loss: Focal loss remains the de facto choice for object detection, having proven itself at tackling the problem of imbalanced classes.

Regression Loss (IoU/GIoU): FCOS steered away from the standard L2-norm loss, as it is argued that minimizing the L2 loss does not correlate strongly with improving the IoU between the ground-truth box and the predicted box; a good local optimum for the L2 objective may not be a local optimum for IoU. The problem is exemplified in the figure below. Hence, an IoU loss was adopted, which directly measures the intersection over union between prediction and target.

One drawback of the IoU loss is that if the two boxes do not overlap, the IoU is zero, its gradient is zero, and the model does not learn. Generalized IoU loss [2] was therefore introduced to circumvent this flaw.

So even if the IoU is 0, a loss is still induced and its magnitude depends on how close the boxes are (formulation shown below). The additional penalty term pushes the predicted box towards the target box in non-overlapping cases:

when IoU = 0,
gIoU loss = 1 - (IoU - (Area_C - (Area_pred + Area_gt)) / Area_C)
          = 1 + (Area_C - (Area_pred + Area_gt)) / Area_C
Figure 6. Non-overlapping regions
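
A self-contained PyTorch sketch of the GIoU loss described above, for corner-format (x1, y1, x2, y2) boxes (not the exact FCOS implementation, which works on the (l, t, r, b) distances):

import torch

def giou_loss(pred, target, eps=1e-7):
    """pred, target: [N, 4] boxes as (x1, y1, x2, y2)."""
    # Intersection
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_pred + area_gt - inter
    iou = inter / (union + eps)

    # Smallest enclosing box C
    cx1 = torch.min(pred[:, 0], target[:, 0])
    cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2])
    cy2 = torch.max(pred[:, 3], target[:, 3])
    area_c = (cx2 - cx1) * (cy2 - cy1)

    giou = iou - (area_c - union) / (area_c + eps)
    return (1 - giou).mean()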

Inference Mode

At inference, pixel-wise predictions from the heads are converted to bounding box classes and coordinates as follows:

  1. Determine the top k positive samples/locations by applying a predefined threshold to the class probability score
  2. Obtain bounding box corners (x1, y1, x2, y2) from regression prediction
# location: [num_loc, (x, y)]
# reg: [num_loc, (l, t, r, b)]
detections = stack([locations[:, 0] - reg[:, 0],
                    locations[:, 1] - reg[:, 1],
                    locations[:, 0] + reg[:, 2],
                    locations[:, 1] + reg[:, 3]],
                    axis=1)

  3. Apply non-maximum suppression to remove overlapping positive boxes
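
A short sketch of steps 1 and 3 using torchvision's NMS op (the scores tensor and the threshold values here are illustrative assumptions, not the paper's settings):

import torch
from torchvision.ops import nms

# detections: [num_loc, 4] boxes from the snippet above
# scores: [num_loc] final class scores (class probability * center-ness)
score_threshold, iou_threshold = 0.05, 0.6
keep_mask = scores > score_threshold              # step 1: threshold on the score
boxes, kept_scores = detections[keep_mask], scores[keep_mask]
keep = nms(boxes, kept_scores, iou_threshold)     # step 3: remove overlapping boxes
final_boxes, final_scores = boxes[keep], kept_scores[keep]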

Effectiveness of center-ness layer

Using a trained model, figure 7 shows the class probability confidence scores, with their magnitude displayed as the brightness intensity of each point, with and without weighting the scores by the center-ness layer.

It is observed that locations that are still within the object but deviate from its center have their confidence scores reduced. This makes these spurious detections easy to remove during non-maximum suppression.

Figure 7. (Left) Probability score without center-ness. (Right) Score suppress with center-ness

Conclusion

FCOS is definitely a welcome step forward in object detection research, given its simplicity while being on par with anchor-based methods. I am excited to see more advancements and improvements in this area.

References

[1] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully Convolutional One-Stage Object Detection. In ICCV, 2019.

[2] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian D. Reid, and Silvio Savarese. Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression. In CVPR, 2019.
