Summary

This article discusses the use of a variant of focal loss, a distance-aware cross entropy loss, for semantic segmentation problems with sparse labels, focusing on its advantages over standard cross entropy loss and mean square error (MSE) loss.

Abstract

The article begins by introducing focal loss as a more focused version of cross entropy loss, which is commonly used in semantic segmentation problems. It then highlights an issue with standard cross entropy loss when used in tasks with sparse labels, where background pixels are punished equally regardless of their distance to foreground pixels. To address this, the author proposes a distance-aware cross entropy loss, which is a modified version of focal loss that takes into account the distance to foreground labels. This new loss function is compared to MSE loss, which is used in tasks with highly imbalanced labels but loses semantic information. The distance-aware cross entropy loss is shown to be more effective in preserving semantic information while learning the positions of foreground labels.

Opinions

Standard cross entropy loss has an issue when used in tasks with sparse labels, where background pixels are punished equally regardless of their distance to foreground pixels.
MSE loss is used in tasks with highly imbalanced labels but loses semantic information.
The proposed distance-aware cross entropy loss is a modified version of focal loss that takes into account the distance to foreground labels.
The distance-aware cross entropy loss is more effective in preserving semantic information while learning the positions of foreground labels.
The article recommends using the distance-aware cross entropy loss for semantic segmentation problems with sparse labels.
The article also recommends reading the author's previous post for more details on focal loss.
The article concludes by providing references to related works.

Understanding Focal Loss for Pixel-level Classification in Convolutional Neural Networks

A smooth version of cross entropy loss for quick convergence

As I wrote in the last article of this series, focal loss is a more focused cross entropy loss. In semantic segmentation problems, focal loss can help the model focus on pixels that have not been well trained yet, which is more effective and purposeful than cross entropy loss. I recommend the article if you haven ‘t read it yet.

Demystifying Focal Loss I: A More Focused Version of Cross Entropy Loss

In computer vision, cross entropy is a widely used loss item in classification problems. In semantic segmentation, we…

medium.com

Again in this article, I ‘d like to talk about a variant of focal loss that can be used as a distance-aware cross entropy loss in semantic segmentation problems, especially with sparse labels.

The issue of standard cross entropy loss

Cross entropy loss is typically used in semantic segmentation, in which each pixel of the image is labeled with a category number, and a model is trained to predict that number for each pixel.

This per-pixel prediction fashion is widely used in almost all academic papers about semantic segmentation. However, there exists an issue, which may cause bad influence on convergence and is seldomly discussed. The sparser the labels are, the more significant the issue will be.

Let ‘s take a binary classification scenario as an example. Extension to multi-class classification is straightforward. Look at the following Figure 1, in which a blue pixel is labeled with 1, and all other pixels are labeled with 0. This labeled image is extremely sparse and foreground-background imbalanced because only one pixel is labeled as foreground class.

Figure 1: a labeled image for binary classification

Now, imaging Figure 1 is input to our model, and we get an output image in which the label of every pixel is predicted. As conceptually shown in Figure 2, among all the resulting pixels, we take two of them for discussion. The two pixels are colored in green, and the labeled blue pixel is also preserved here for ease of explanation.

Figure 2: a prediction result for Figure 1

Suppose the two green pixels are predicted with the same probability 0.9, which means that the probabilities of their labels to be 1 is 0.9, and the probabilities of their labels to be 0 is 0.1. As their real labels are 0, the two green pixels will be punished heavily during this round of training because the probability for the real label is merely 0.1, which is too small.

Even though their distances to the foreground blue pixel are totally different, the two green pixels will be punished equally if standard binary cross entropy loss is adopted. This is not problematic if both the labels are distributed evenly. However, since the foreground-background labels are highly imbalanced in this example, we hope that the model can be trained to focus on and converge the prediction results to the foreground pixel. More specifically, background pixels within a small radius of the foreground pixel are allowed to be predicted to 1, with the allowance extent varying in a monotonic fashion according to their distances to the foreground pixel. However, this cannot be implemented with standard cross entropy loss, in which each pixel is trained individually, independently from other pixels. This is just the issue of standard cross entropy loss when used in tasks with sparse labels.

A distance-aware cross entropy loss

Since the issues with standard cross entropy loss discussed above, tasks with highly imbalanced labels such as facial point detection [Sun 2013], human pose detection [Newell 2016] etc. adopted mean square error (MSE) loss for training. The label images are shown in Figure 3.

Figure 3: facial point detection (above), and human pose detection (below)

MSE works by minimizing the difference between real and predicted key-point positions in a pixel level regression framework. The MSE loss function is shown in Figure 4.

The issue of this approach is that the predictions only contain the positions of pixels, while their semantic information is lost. If someone wants to use the predicted key points for some post process, she has to work out a pattern to deal with the intrinsic relationship among the points in advance, which might be complicated.

Therefore, in such tasks with highly imbalanced labels, can we train a model to learn the positions of foreground labels effectively while preserving their semantic information?

As discussed above, the answer is Yes! We can modify the standard cross entropy loss in two steps to get such a new loss function. Firstly, modify it to a focal loss [Lin 2017]. Secondly, modify the focal loss to be aware of distance to foreground labels. I call it distance-aware cross entropy loss [Law 2018].

In the first step, standard cross entropy loss is modified to focal loss as shown in Figure 5. In the equation, pt is a measurement of prediction accuracy. The higher the pt is, the more accurate the prediction is. Since pt is between 0 and 1, the factor 1-pt can be used to decrease the standard cross entropy loss if the accuracy is already good enough, thus making the model focus on areas that have not been well trained yet. I recommend reading my last post for more details.

In the second step, as discussed above, background pixels within a small radius of the foreground pixel are allowed to be predicted to 1, with the allowance extent varying in a monotonic fashion according to their distances to the foreground pixel. Therefore, for the background pixels with label 0, a distance factor is added to decrease their loss values according to their distances to the nearest foreground pixel with label 1.

Figure 6: distance aware cross entropy loss [Law 2018]

In order to get that loss function, we split the focal loss into two equations according to different label values (0 and 1). Then a distance factor ycij is added as shown in Figure 6. Here, the distance factor ycij is a Gaussian function centered on a foreground label position in the 2D image space as shown in Figure 7. The function value at the center is 1 and decreases monotonically as getting far from center. Thus, the focal loss doesn ‘t change for foreground pixels (upper equation in Figure 6). For background pixels, under the influence of the factor 1-ycij, the focal loss decreases largely as getting close to foreground pixels, while keeps almost unchanged as getting far from them (lower equation in Figure 6).

As a consequence, with the help from the distance factor, focal loss can be modified to focus on and converge the prediction results to the foreground pixels for tasks with highly imbalanced foreground-background labels. Focal loss refines the traditional pixel-wise discrete cross entropy loss to a continuous smooth loss as shown in Fig. 7, which is faster to converge.

References

Deep Convolutional Network Cascade for Facial Point Detection, Sun et al., CVPR 2013

Stacked Hourglass Networks for Human Pose Estimation, Newell et al., ECCV 2016

Focal Loss for Dense Object Detection, Lin et al., ICCV 2017

CornerNet: Detecting Objects as Paired Keypoints, Law et al., ECCV 2018

Join Medium with my referral link - Shuchen Du

Read every story from Shuchen Du (and thousands of other writers on Medium). Your membership fee directly supports…

dushuchen.medium.com