Semantic Segmentation — Popular Architectures
Doing cool things with data!

What is semantic segmentation?
Semantic segmentation is the task of classifying each and very pixel in an image into a class as shown in the image below. Here you can see that all persons are red, the road is purple, the vehicles are blue, street signs are yellow etc.
Semantic segmentation is different from instance segmentation which is that different objects of the same class will have different labels as in person1, person2 and hence different colours. The picture below very crisply illustrates the difference between instance and semantic segmentation.

One important question can be why do we need this granularity of understanding pixel by pixel location?
Some examples that come to mind are:
i) Self Driving Cars — May need to know exactly where another car is on the road or the location of a human crossing the road
ii) Robotic systems — Robots that say join two parts together will perform better if they know the exact locations of the two parts
iii) Damage Detection — It maybe important in this case to know the exact extent of damage
Deep Learning Model Architectures for Semantic Segmentation
Lets now talk about 3 model architectures that do semantic segmentation.
1. Fully Convolutional Network (FCN)
FCN is a popular algorithm for doing semantic segmentation. This model uses various blocks of convolution and max pool layers to first decompress an image to 1/32th of its original size. It then makes a class prediction at this level of granularity. Finally it uses up sampling and deconvolution layers to resize the image to its original dimensions.

These models typically don’t have any fully connected layers. The goal of down sampling steps is to capture semantic/contextual information while the goal of up sampling is to recover spatial information. Also there are no limitations on image size. The final image is the same size as the original image. To fully recover the fine grained spatial information lost in down sampling, skip connections are used. A skip connection is a connection that bypasses at least one layer. Here it is used to pass information from the down sampling step to the up sampling step. Merging features from various resolution levels helps combining context information with spatial information.
I trained a FCN to perform semantic segmentation or road vs non road pixels for a self driving car.
2. U-Net
The U-Net architecture is built upon the Fully Convolutional Network (FCN) and modified in a way that it yields better segmentation in medical imaging.
Compared to FCN-8, the two main differences are:
(1) U-net is symmetric and
(2) the skip connections between the downsampling path and the upsampling path apply a concatenation operator instead of a sum.
These skip connections intend to provide local information to the global information while upsampling. Because of its symmetry, the network has a large number of feature maps in the upsampling path, which allows to transfer information. B
The U-Net owes its name to its symmetric shape, which is different from other FCN variants.
U-Net architecture is separated in 3 parts:
1 : The contracting/downsampling path 2 : Bottleneck 3 : The expanding/upsampling path

I have implemented U-net for smoke segmentation. A major advantage of U-net is that it is much faster to run than FCN or Mask RCNN.
3. Mask RCNN
Lets start with a gentle introduction to Mask RCNN.

Faster RCNN is a very good algorithm that is used for object detection. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.
Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask — which is a binary mask that indicates the pixels where the object is in the bounding box. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. To do this Mask RCNN uses the Fully Convolution Network (FCN).
So in short we can say that Mask RCNN combines the two networks — Faster RCNN and FCN in one mega architecture. The loss function for the model is the total loss in doing classification, generating bounding box and generating the mask.
Mask RCNN has a couple of additional improvements that make it much more accurate than FCN. You can read more about them in their paper.
I have trained custom Mask RCNN models using Keras Matterport github and Tensorflow object detection. To learn how to build a Mask RCNN yourself, please follow the tutorial at Car Damage Detection Blog.
I have my own deep learning consultancy and love to work on interesting problems. I have helped many startups deploy innovative AI based solutions. Check us out at — http://deeplearninganalytics.org/.
You can also see my other writings at: https://medium.com/@priya.dwivedi
If you have a project that we can collaborate on, then please contact me through my website or at [email protected]






