Monocular Bird’s-Eye-View Semantic Segmentation for Autonomous Driving
A review of BEV semantic segmentation as of 2020
Updates:
- Add BEV feat stitching, 2021/01/31
- Add PYVA, 2021/10/01
- Add Panoptic BEV, 2021/10/04
- TODO: Add BEV-Seg, CaDDN, FIERY, HDMapNet.
I have also written an updated blog post regarding BEV object detection, especially with Transformers.
Autonomous driving requires an accurate representation of the environment around the ego vehicle. The environment includes static elements such as road layout and lane structures, and also dynamic elements such as other cars, pedestrians, and other types of road users. The static elements can be captured by an HD map containing lane level information.
There are two types of mapping methods: offline and online. For offline mapping and the application of deep learning in offline mapping, please refer to my previous post. Online mapping is useful in places where there is no map support or where the autonomous vehicle has never been. One conventional method for online mapping is SLAM (simultaneous localization and mapping), which relies on detecting and matching geometric features across a sequence of images, sometimes with the added notion of objects.
This post focuses on another way to do online mapping: bird’s-eye-view (BEV) semantic segmentation. Whereas SLAM requires a sequence of images from the same moving camera over time, BEV semantic segmentation is based on images captured at the same time by multiple cameras looking in different directions around the vehicle. It can therefore extract more useful information from a one-shot collection of data than SLAM. In addition, BEV semantic segmentation still works when the ego car is stationary or moving slowly, whereas SLAM performs poorly or fails.

Why BEV semantic maps?
In a typical autonomous driving stack, Behavior Prediction and Planning are generally done in a top-down view (or bird’s-eye-view, BEV), as height information is less important and most of the information an autonomous vehicle needs can be conveniently represented in BEV. This BEV space can be loosely referred to as the 3D space. (For example, object detection in BEV space is typically referred to as 3D localization, to distinguish it from full-blown 3D object detection.)
It is therefore standard practice to rasterize HD maps into a BEV image and combine it with dynamic object detection for behavior prediction and planning. Recent research exploring this strategy includes IntentNet (Uber ATG, 2018), ChauffeurNet (Waymo, 2019), Rules of the Road (Zoox, 2019), Lyft Prediction Dataset (Lyft, 2020), among many others.

Traditional computer vision tasks such as object detection and semantic segmentation involve making estimations in the same coordinate frame as the input image. As a consequence, the Perception stack of autonomous driving typically happens in the same space as the onboard camera image — the perspective view space.

The gap between the representation used in perception and the one used in downstream tasks such as prediction and planning is typically bridged in the Sensor Fusion stack, which lifts the 2D observation in perspective space to 3D or BEV, usually with the help of active sensors such as radar or lidar. That said, it is beneficial for perception across modalities to use the BEV representation. First of all, it is interpretable and facilitates debugging of the inherent failure modes of each sensing modality. It is also easily extensible to new modalities and simplifies the task of late fusion. In addition, as mentioned above, the perception results in this representation can be readily consumed by the prediction and planning stack.
Lifting Perspective RGB images to BEV
The data from active sensors such as radar or lidar lend themselves to the BEV representation, as the measurements are inherently metric in 3D. However, due to the ubiquitous presence and low cost of surround-view camera sensors, the generation of BEV images with semantic meaning has attracted a lot of attention recently.
In the title of this post, “monocular” refers to the fact that the inputs of the pipeline are images obtained from monocular RGB cameras, without explicit depth information. Monocular RGB images captured onboard autonomous vehicles are perspective projections of the 3D space, and lifting 2D perspective observations back into 3D is an inherently ill-posed inverse problem.
Challenges, IPM and Beyond
One obvious challenge for BEV semantic segmentation is the view transformation. In order to properly restore the BEV representation of the 3D space, the algorithm has to leverage both hard (but potentially noisy) geometric priors such as the camera intrinsics and extrinsics, and soft priors such as the knowledge corpus of road layouts and common sense (cars do not overlap in BEV, etc.). Conventionally, inverse perspective mapping (IPM) has been the go-to method for this task, assuming a flat ground and fixed camera extrinsics. But this method does not work well on non-flat surfaces or on bumpy roads where the camera extrinsics vary.
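To make the flat-ground assumption concrete, here is a minimal NumPy sketch of IPM. It is my own illustration, not code from any of the papers reviewed here. It assumes a pinhole camera with intrinsics `K` and world-to-camera extrinsics `R`, `t`, a world frame with x forward, y left and z up, and a ground plane at z = 0. The function names and the nearest-neighbour sampling are hypothetical choices made for brevity.

```python
import numpy as np

def ground_to_image_homography(K, R, t):
    """Homography that maps ground-plane points (x, y, 1), with z = 0, to pixels.

    Assumes X_cam = R @ X_world + t. Because z = 0 on the ground, the third
    column of R drops out, which is exactly the flat-ground assumption.
    """
    return K @ np.column_stack((R[:, 0], R[:, 1], t))

def ipm_warp(image, K, R, t, x_range, y_range, resolution):
    """Resample `image` onto a BEV grid (rows = forward x, cols = lateral y)."""
    H = ground_to_image_homography(K, R, t)
    xs = np.arange(x_range[0], x_range[1], resolution)  # metres ahead
    ys = np.arange(y_range[0], y_range[1], resolution)  # metres to the left
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    ground = np.stack([gx.ravel(), gy.ravel(), np.ones(gx.size)])  # 3 x N
    uvw = H @ ground
    in_front = uvw[2] > 1e-6                   # drop points behind the camera
    depth = np.where(in_front, uvw[2], 1.0)    # avoid division by zero
    u = np.round(uvw[0] / depth).astype(int)
    v = np.round(uvw[1] / depth).astype(int)
    img_h, img_w = image.shape[:2]
    valid = in_front & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    bev = np.zeros((gx.size,) + image.shape[2:], dtype=image.dtype)
    bev[valid] = image[v[valid], u[valid]]     # nearest-neighbour lookup
    return bev.reshape(gx.shape + image.shape[2:])
```

Any pixel that does not actually lie on the ground (the side of a car, for instance) is still treated as a ground point, so it gets smeared away from the camera in the output. A change in pitch on a bumpy road changes `R`, which invalidates a precomputed `H`.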

The other challenge lies in the collection of data and annotation for such a task. One way to do this is to have a drone following the autonomous vehicle at all times (similar to what was shown in Mobileye’s CES 2020 talk), and then ask humans to annotate the semantic segmentation. This method is obviously neither practical nor scalable. Many studies have relied on synthetic data or unpaired map data for training the lifting algorithm.
In the following sections, I will review recent advances in the field and highlight the commonalities. These studies can be largely grouped into two types depending on the supervision signal used. The first type of study resorts to simulation for indirect supervision and the second type directly leverages the recently released multi-modal datasets for direct supervision.
Simulation and Semantic Segmentation
The seminal studies in this field use simulation to generate the necessary data and annotation to lift perspective images into BEV. To bridge the simulation-to-reality (sim2real) domain gap, many of them use semantic segmentation as an intermediate representation.
VPN (View Parsing Network, RAL 2020)
VPN (Cross-view Semantic Segmentation for Sensing Surroundings) is among the first works to explore BEV semantic segmentation and refers to it as “cross-view semantic segmentation”. The View Parsing Network (VPN) uses a view transformer module to model the transformation from perspective to BEV. This module is implemented as a multilayer perceptron (MLP) that flattens the 2D spatial extent into a 1D vector and then performs a fully connected operation on it. In other words, it ignores strong geometric priors and adopts a purely data-driven approach to learn the perspective-to-BEV warping. This warping is camera-specific, so one network has to be learned per camera.
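As a rough illustration of what such an MLP-based view transformer does, here is a minimal NumPy sketch. It is not the VPN implementation. The two-layer structure, the ReLU, the sharing of weights across channels and the name `view_transform_mlp` are all assumptions on my side. The point is only that the spatial dimensions are flattened, so every output location can depend on every input location.

```python
import numpy as np

def view_transform_mlp(features, w1, b1, w2, b2):
    """Apply a two-layer MLP over the flattened spatial dimensions.

    features: (C, H, W) perspective-view feature map.
    w1, w2:   (H*W, H*W) weight matrices, shared across channels.
    b1, b2:   (H*W,) biases.
    Returns a (C, H, W) feature map, interpreted as living in BEV space.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)           # flatten 2D extent into 1D
    hidden = np.maximum(flat @ w1 + b1, 0.0)    # fully connected + ReLU
    out = hidden @ w2 + b2                      # fully connected
    return out.reshape(c, h, w)                 # same size as the input
```

Nothing in the weights encodes camera intrinsics or extrinsics, which is why the learned warping only fits the camera it was trained on. The weight matrices also grow quadratically with `H * W`, so this only works on small feature maps.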

VPN uses synthetic data (generated with CARLA) and adversarial loss for domain adaptation during training. In addition, it uses a semantic mask as an intermediate representation without the photorealistic texture gap.
The input and output of the view transformer module are of the same size. The paper mentions that this makes it easy to plug into other architectures. As I see it, this is not really necessary: the perspective view and BEV are inherently different spaces, so there is no need to enforce the same pixel format or even the same aspect ratio between the input and output. Code is available on GitHub.
Fishing Net (CVPR 2020)
Fishing Net converts lidar, radar, and camera inputs into a single unified representation in BEV space. This representation makes it much easier to perform late fusion across different modalities. The view transformation module (the purple block in the vision path) is similar to the MLP-based VPN. The input to the view transformation network is a sequence of images, but they are simply concatenated along the channel dimension and fed into the network, instead of leveraging an RNN structure.

The groundtruth is generated from 3D annotations in lidar, and it mainly focuses on dynamic objects such as vehicles and VRUs (vulnerable road users, such as pedestrians and cyclists). Everything else is represented by a background class.
The BEV semantic grid has a resolution of 10 cm/pixel or 20 cm/pixel. This is much coarser than the typical value of 4 or 5 cm/pixel used in offline mapping. Following the convention of VPN, the dimensions of the input images match the output resolution of 192 x 320. The talk at CVPR 2020 can be found on YouTube.
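To get a feeling for these numbers, the small helper below converts a grid size and a resolution into a metric extent. It is plain arithmetic and the function name is my own. I am assuming that the same 192 x 320 grid is used at both resolutions, which is how I read the paper.

```python
def bev_extent_m(rows_px, cols_px, resolution_cm_per_px):
    """Metric size (in metres) covered by a BEV grid at a given resolution."""
    scale = resolution_cm_per_px / 100.0
    return rows_px * scale, cols_px * scale

for res_cm in (10, 20):
    rows_m, cols_m = bev_extent_m(192, 320, res_cm)
    print(f"{res_cm} cm/pixel: {rows_m:.1f} m x {cols_m:.1f} m")
```

Under that assumption, the grid covers 19.2 m x 32 m at 10 cm/pixel and 38.4 m x 64 m at 20 cm/pixel. Covering the same area at 5 cm/pixel would need four times as many pixels per side as the 20 cm grid.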
VED (ICRA 2019)
VED (Monocular Semantic Occupancy Grid Mapping with Convolutional Variational Encoder-Decoder Networks) exploits a variational encoder-decoder (VED) architecture for semantic occupancy grid map prediction. It encodes the front-view visual information for the driving scene and subsequently decodes it into a BEV semantic occupancy grid.

This groundtruth is generated using a disparity map from stereo matching in the Cityscapes dataset. This process may be noisy, which actually prompted the use of VED and sampling from the latent space to make the model robust to imperfect GT. However, by virtue of being a VAE, it often does not produce sharp edges, perhaps due to the Gaussian prior and mean-squared error.
The input image and output are 256×512 and 64×64, respectively. VED builds on the architecture of a vanilla SegNet (a relatively strong baseline for conventional semantic segmentation) and introduces one 1×2 pooling layer to accommodate the different aspect ratios of the input and output.
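The effect of a 1×2 pooling layer on the aspect ratio is easy to check with a toy max-pooling function. This is only a shape illustration in NumPy with a hypothetical `max_pool` helper, not the VED architecture, and it says nothing about where the layer sits in the network.

```python
import numpy as np

def max_pool(x, kh, kw):
    """Non-overlapping max pooling of a 2D array with a (kh, kw) kernel."""
    h, w = x.shape
    assert h % kh == 0 and w % kw == 0, "sizes must be divisible by the kernel"
    return x.reshape(h // kh, kh, w // kw, kw).max(axis=(1, 3))

feature = np.zeros((256, 512))
print(max_pool(feature, 2, 2).shape)  # regular pooling keeps the 1:2 aspect ratio
print(max_pool(feature, 1, 2).shape)  # 1x2 pooling halves the width only
```

A regular 2×2 pooling keeps the 1:2 aspect ratio of the input, whereas a single 1×2 pooling turns the 256×512 map into a square 256×256 one. The usual 2×2 stages can then take it down to 64×64.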
Learning to Look around Objects (ECCV 2018)
Learning to Look around Objects for Top-View Representations of Outdoor Scenes hallucinates occluded areas in BEV, and leverages simulation and map data to help.
Personally, I think this is quite a seminal paper in the field of BEV semantic segmentation, but it does not seem to have received much attention. Maybe it needs a catchy name?

The view transformation is done via pixel-wise depth prediction followed by projection to BEV. This partially overcomes the lack of training data in BEV space. The same approach is taken in later work such as Lift, Splat, Shoot (ECCV 2020), reviewed below.
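The depth-then-project idea can be sketched in a few lines of NumPy. This is my own simplified illustration and not the paper's code. It assumes a pinhole camera with intrinsics `K`, a camera frame with x right, y down and z forward, and integer per-pixel semantic labels. The function name and the per-cell class counting are hypothetical.

```python
import numpy as np

def lift_to_bev(depth, labels, K, x_range, z_range, resolution, num_classes):
    """Unproject per-pixel depth and labels, then count classes per BEV cell.

    depth:  (H, W) predicted depth in metres (<= 0 means invalid).
    labels: (H, W) integer class ids from semantic segmentation.
    Returns an array of shape (n_rows, n_cols, num_classes) with
    rows = forward distance (z) and cols = lateral offset (x).
    """
    h, w = depth.shape
    u, _ = np.meshgrid(np.arange(w), np.arange(h))
    fx, cx = K[0, 0], K[0, 2]
    x = (u - cx) * depth / fx          # lateral position; the height is dropped
    z = depth                          # forward distance
    n_rows = int(round((z_range[1] - z_range[0]) / resolution))
    n_cols = int(round((x_range[1] - x_range[0]) / resolution))
    rows = np.floor((z - z_range[0]) / resolution).astype(int)
    cols = np.floor((x - x_range[0]) / resolution).astype(int)
    valid = (depth > 0) & (rows >= 0) & (rows < n_rows) & (cols >= 0) & (cols < n_cols)
    counts = np.zeros((n_rows, n_cols, num_classes), dtype=np.int64)
    np.add.at(counts, (rows[valid], cols[valid], labels[valid]), 1)
    return counts
```

Taking `counts.argmax(axis=-1)` gives a crude BEV semantic map. Cells that no pixel projects into stay empty, which is one reason a refinement network in BEV space is useful afterwards.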
The trick the paper uses to learn to hallucinate (predict occluded portions) is quite amazing. Dynamic objects, for which the GT depth is hard to obtain, are filtered out of the loss. Blocks of the image are then randomly masked out and the model is asked to hallucinate the masked content. Because the GT for these blocks is known, the loss on them serves as the supervision signal.
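Here is how I picture the masking trick, as a minimal NumPy sketch. It reflects my reading of the idea rather than the paper's implementation. The block size, the number of blocks, the L1 loss and both function names are assumptions.

```python
import numpy as np

def mask_random_blocks(image, num_blocks, block_size, rng):
    """Zero out random square blocks; return the masked copy and the block mask."""
    h, w = image.shape[:2]
    hidden = np.zeros((h, w), dtype=bool)
    for _ in range(num_blocks):
        top = rng.integers(0, h - block_size + 1)
        left = rng.integers(0, w - block_size + 1)
        hidden[top:top + block_size, left:left + block_size] = True
    masked = image.copy()
    masked[hidden] = 0
    return masked, hidden

def hallucination_loss(pred, target, dynamic_mask):
    """Mean absolute error over pixels with known GT (dynamic objects excluded)."""
    keep = ~dynamic_mask
    if not keep.any():
        return 0.0
    return float(np.abs(pred - target)[keep].mean())
```

The network sees `masked` as input but is scored against the unmasked target, so it has to fill in the hidden blocks. Pixels under real dynamic objects never contribute to the loss, since nobody knows what is behind them.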

As it is hard to obtain explicitly paired supervision in BEV space, the paper used the adversarial loss to guide the learning with simulation and OpenStreetMap data to ensure that the generated road layout looks like realistic road layouts. This trick is also used in later work as in MonoLayout (WACV 2020).
It employs one CNN in image space for depth and semantics prediction, lifts the predictions to 3D space and renders in BEV, and finally uses another CNN in BEV space for refinement. This refinement module in BEV is also used in many other works such as Cam2BEV (ITSC 2020) and Lift, Splat, Shoot (ECCV 2020).
Cam2BEV (ITSC 2020)

Cam2BEV (A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird’s Eye View) uses a spatial transformer module with IPM to transform perspective features to BEV space. The neural network architecture takes in four images captured by different cameras and applies an IPM transformation to each of them before concatenating them together.

Cam2BEV uses synthetic data generated from the VTD (Virtual Test Drive) simulation environment. It takes in four semantic segmentation images and focuses on the lifting process, avoiding the sim2real domain gap.
Cam2BEV has a rather focused scope and makes many design choices that make it highly practical. First of all, it only works in semantic space and thus avoids the question of the sim2real domain gap. It has a preprocessing stage that deliberately masks out occluded regions to avoid reasoning about occlusion, which arguably makes the problem more tractable. To ease the lifting process, it also takes as input a “homography image”, generated by IPM of the semantic segmentation results and concatenated into a 360 deg BEV image. Thus the main goal of Cam2BEV is to reason about the physical extent of the 3D objects in BEV, which may be elongated in the homography image.

Cam2BEV aims to correct IPM, but only in the sense that IPM distorts 3D objects that rise above the road surface, such as cars. It still cannot handle non-flat road surfaces or pitch changes during the drive. Both the input and output of Cam2BEV are 256x512 pixels. Code is available on GitHub. It also provides a nice baseline implementation of IPM.
All you need is (multimodal) datasets
The recent release of many multi-modality datasets (Lyft, nuScenes, Argoverse, etc.) makes direct supervision of the monocular BEV semantic segmentation task possible. These datasets provide not only 3D object detection information but also an HD map, along with localization information to pinpoint the ego vehicle on the HD map at each timestamp.
The BEV segmentation task has two parts, the (dynamic) object segmentation task, and the (static) road layout segmentation task. For object segmentation, 3D bounding boxes are rasterized into the BEV image to generate annotation. For static road layouts, maps are transformed into the ego vehicle frame based on provided localization results and rasterized into BEV annotation.
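Both annotation steps are simple geometry. The NumPy sketch below is a hedged illustration and does not use any dataset devkit. It assumes a 2D ego pose given as position plus yaw, an ego frame with x forward and y left, and a BEV grid whose rows index x and whose columns index y. The function names are mine.

```python
import numpy as np

def world_to_ego(points_xy, ego_xy, ego_yaw):
    """Transform (N, 2) world-frame points (e.g. map polylines) into the ego frame."""
    c, s = np.cos(ego_yaw), np.sin(ego_yaw)
    rot_world_from_ego = np.array([[c, -s], [s, c]])
    # p_ego = R^T (p_world - t); with row vectors this is (p - t) @ R.
    return (np.asarray(points_xy, dtype=float) - np.asarray(ego_xy)) @ rot_world_from_ego

def rasterize_box(grid, center_xy, length, width, yaw, x_min, y_min, resolution, label):
    """Write `label` into every BEV cell whose centre lies inside an oriented box."""
    n_rows, n_cols = grid.shape
    xs = x_min + (np.arange(n_rows) + 0.5) * resolution   # cell centres, forward
    ys = y_min + (np.arange(n_cols) + 0.5) * resolution   # cell centres, lateral
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    dx, dy = gx - center_xy[0], gy - center_xy[1]
    c, s = np.cos(yaw), np.sin(yaw)
    along = c * dx + s * dy        # offset along the box heading
    across = -s * dx + c * dy      # offset perpendicular to the heading
    inside = (np.abs(along) <= length / 2) & (np.abs(across) <= width / 2)
    grid[inside] = label
    return grid
```

For the dynamic classes, each 3D box is reduced to its footprint (centre, length, width, yaw) and passed to `rasterize_box`. For the static classes, map geometry goes through `world_to_ego` first and is then rasterized in the same grid.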
MonoLayout (WACV 2020)
MonoLayout: Amodal scene layout from a single image focuses on lifting a single camera image into a semantic BEV space. The focus of the paper is on amodal completion, which reasons about occluded areas. It seems to be heavily influenced by Learning to Look around Objects (ECCV 2018).

The view transformation is performed via an encoder-decoder structure and the latent feature is called “shared context”. Two decoders are used to decode the static and dynamic class separately. The authors also reported negative results of using a combined decoder to handle both static and dynamic objects in the ablation study.

Though HD Map groundtruth is available in the Argoverse dataset, MonoLayout chooses to use it only for evaluation and not for training (hindsight or deliberate design choice?). For training, MonoLayout uses a temporal sensor fusion process to generate weak groundtruth by aggregating 2D semantic segmentation results throughout a video with localization information. It uses monodepth2 to lift RGB pixels to a point cloud. It also discards anything more than 5 m away from the ego car, as such points could be noisy. To encourage the network to output plausible scene layouts, MonoLayout uses adversarial feature learning (similar to that used in Learning to Look around Objects). The prior data distribution is obtained from OpenStreetMap.
MonoLayout has a spatial resolution of about 30 cm/pixel, so the 128 x 128 output corresponds to roughly 40 m x 40 m in BEV space (128 x 0.3 m = 38.4 m). Code is available on GitHub.
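The temporal aggregation can be pictured as voting in a shared BEV grid. The sketch below is my own simplification in NumPy, not MonoLayout's pipeline. It assumes 2D poses (x, y, yaw), points already lifted into each frame's ego coordinates (x forward, y left) with a semantic label per point, and a range cut applied per frame. All names are hypothetical.

```python
import numpy as np

def rotation(yaw):
    """2D rotation matrix for a yaw angle in radians."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s], [s, c]])

def aggregate_labels(frames, ref_pose, max_range, x_min, y_min, resolution,
                     shape, num_classes):
    """Accumulate per-class votes from several frames in the reference frame's grid.

    frames: iterable of (points_xy, labels, (x, y, yaw)), points in that frame's
            ego coordinates. `shape` is a (n_rows, n_cols) tuple.
    """
    votes = np.zeros(shape + (num_classes,), dtype=np.int64)
    ref_x, ref_y, ref_yaw = ref_pose
    for points, labels, (x, y, yaw) in frames:
        points = np.asarray(points, dtype=float)
        labels = np.asarray(labels)
        near = np.linalg.norm(points, axis=1) <= max_range   # drop noisy far points
        world = points[near] @ rotation(yaw).T + np.array([x, y])
        in_ref = (world - np.array([ref_x, ref_y])) @ rotation(ref_yaw)
        rows = np.floor((in_ref[:, 0] - x_min) / resolution).astype(int)
        cols = np.floor((in_ref[:, 1] - y_min) / resolution).astype(int)
        ok = (rows >= 0) & (rows < shape[0]) & (cols >= 0) & (cols < shape[1])
        np.add.at(votes, (rows[ok], cols[ok], labels[near][ok]), 1)
    return votes
```

Each frame only contributes its nearby, more reliable points, yet the grid fills up as the car moves. Taking the per-cell majority class of `votes` gives the kind of weak groundtruth described above.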
