Summary

The web content discusses semantic segmentation, its applications, and three popular deep learning architectures used for this task: Fully Convolutional Networks (FCN), U-Net, and Mask RCNN.

Abstract

Semantic segmentation is a pivotal task in computer vision that involves classifying each pixel in an image into a specific class, which is crucial for applications like self-driving cars, robotic systems, and damage detection. The article delves into three prominent deep learning model architectures for semantic segmentation: Fully Convolutional Networks (FCN), which use convolution blocks and upsampling to classify pixels; U-Net, an FCN variant optimized for medical imaging with symmetric architecture and skip connections for improved segmentation; and Mask RCNN, which extends Faster RCNN by adding a mask branch for pixel-level object segmentation. Each architecture is detailed with its unique features, such as FCN's use of deconvolution layers, U-Net's concatenation of feature maps for local and global information fusion, and Mask RCNN's combination of object detection and segmentation tasks. The author also provides personal insights into implementing these models and invites collaboration on AI projects through their consultancy.

Opinions

The author emphasizes the importance of semantic segmentation for precise understanding in various applications.
FCN is praised for its ability to recover fine-grained spatial information using skip connections.
U-Net is highlighted for its speed and effectiveness in medical imaging, attributed to its symmetric architecture and feature map concatenation.
Mask RCNN is noted for its accuracy and the seamless integration of object detection and instance segmentation, making it a powerful tool for tasks requiring both functions.
The author expresses a preference for U-Net due to its faster runtime compared to FCN and Mask RCNN.
The author showcases their expertise and hands-on experience by mentioning the implementation of these models in practical scenarios, such as smoke and car damage segmentation.
There is an implicit endorsement of Mask RCNN for tasks that require high accuracy, as it combines the strengths of Faster RCNN and FCN.
The author encourages readers to explore further through provided resources and invites collaboration, indicating a commitment to advancing AI solutions.

Semantic Segmentation — Popular Architectures

Doing cool things with data!

What is semantic segmentation?

Semantic segmentation is the task of classifying each and very pixel in an image into a class as shown in the image below. Here you can see that all persons are red, the road is purple, the vehicles are blue, street signs are yellow etc.

Semantic segmentation is different from instance segmentation which is that different objects of the same class will have different labels as in person1, person2 and hence different colours. The picture below very crisply illustrates the difference between instance and semantic segmentation.

One important question can be why do we need this granularity of understanding pixel by pixel location?

Some examples that come to mind are:

i) Self Driving Cars — May need to know exactly where another car is on the road or the location of a human crossing the road

ii) Robotic systems — Robots that say join two parts together will perform better if they know the exact locations of the two parts

iii) Damage Detection — It maybe important in this case to know the exact extent of damage

Deep Learning Model Architectures for Semantic Segmentation

Lets now talk about 3 model architectures that do semantic segmentation.

1. Fully Convolutional Network (FCN)

FCN is a popular algorithm for doing semantic segmentation. This model uses various blocks of convolution and max pool layers to first decompress an image to 1/32th of its original size. It then makes a class prediction at this level of granularity. Finally it uses up sampling and deconvolution layers to resize the image to its original dimensions.

These models typically don’t have any fully connected layers. The goal of down sampling steps is to capture semantic/contextual information while the goal of up sampling is to recover spatial information. Also there are no limitations on image size. The final image is the same size as the original image. To fully recover the fine grained spatial information lost in down sampling, skip connections are used. A skip connection is a connection that bypasses at least one layer. Here it is used to pass information from the down sampling step to the up sampling step. Merging features from various resolution levels helps combining context information with spatial information.

I trained a FCN to perform semantic segmentation or road vs non road pixels for a self driving car.

2. U-Net

The U-Net architecture is built upon the Fully Convolutional Network (FCN) and modified in a way that it yields better segmentation in medical imaging.

Compared to FCN-8, the two main differences are:

(1) U-net is symmetric and

(2) the skip connections between the downsampling path and the upsampling path apply a concatenation operator instead of a sum.

These skip connections intend to provide local information to the global information while upsampling. Because of its symmetry, the network has a large number of feature maps in the upsampling path, which allows to transfer information. B

The U-Net owes its name to its symmetric shape, which is different from other FCN variants.

U-Net architecture is separated in 3 parts:

1 : The contracting/downsampling path 2 : Bottleneck 3 : The expanding/upsampling path

I have implemented U-net for smoke segmentation. A major advantage of U-net is that it is much faster to run than FCN or Mask RCNN.

3. Mask RCNN

Lets start with a gentle introduction to Mask RCNN.

Faster RCNN is a very good algorithm that is used for object detection. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.

Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask — which is a binary mask that indicates the pixels where the object is in the bounding box. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. To do this Mask RCNN uses the Fully Convolution Network (FCN).

So in short we can say that Mask RCNN combines the two networks — Faster RCNN and FCN in one mega architecture. The loss function for the model is the total loss in doing classification, generating bounding box and generating the mask.

Mask RCNN has a couple of additional improvements that make it much more accurate than FCN. You can read more about them in their paper.

I have trained custom Mask RCNN models using Keras Matterport github and Tensorflow object detection. To learn how to build a Mask RCNN yourself, please follow the tutorial at Car Damage Detection Blog.

I have my own deep learning consultancy and love to work on interesting problems. I have helped many startups deploy innovative AI based solutions. Check us out at — http://deeplearninganalytics.org/.

You can also see my other writings at: https://medium.com/@priya.dwivedi

If you have a project that we can collaborate on, then please contact me through my website or at [email protected]

References:

More about FCN

More about U-Net