Monocular Bird’s-Eye-View Semantic Segmentation for Autonomous Driving

A review of BEV semantic segmentation as of 2020

Updates:

Add BEV feat stitching, 2021/01/31
Add PYVA, 2021/10/01
Add Panoptic BEV, 2021/10/04
TODO: Add BEV-Seg, CaDDN, FIERY, HDMapNet.

I have also written an updated blog post regarding BEV object detection, especially with Transformers.

Monocular BEV Perception with Transformers in Autonomous Driving

A review of academic literature and industry practice as of late 2021

towardsdatascience.com

Autonomous driving requires an accurate representation of the environment around the ego vehicle. The environment includes static elements such as road layout and lane structures, and also dynamic elements such as other cars, pedestrians, and other types of road users. The static elements can be captured by an HD map containing lane level information.

There are two types of mapping methods, offline and online. For offline mapping and the application of deep learning in offline mapping, please refer to my previous post. In places where there is no map support or the autonomous vehicle has never been to, the online mapping would be useful. For online mapping, one conventional method is SLAM (simultaneous localization and mapping) which relies on the detection and matching of geometric features on a sequence of images, or with a twist of the added notion of object.

Deep Learning in Mapping for Autonomous Driving

A literature review as of 2020

towardsdatascience.com

This post will focus on another way to do online mapping — bird’s-eye-view (BEV) semantic segmentation. Compared with SLAM which requires a sequence of images from the same moving camera over time, BEV semantic segmentation is based on images captured by multiple cameras looking at different directions of the vehicle at the same time. It is, therefore, able to generate more useful information from the one-shot collection of data than SLAM. In addition, when the ego car is stationary or slowly moving, BEV semantic segmentation would still work, while SLAM will perform poorly or fail.

The difference in input to BEV semantic segmentation vs SLAM (*Image by the author of this post*)

Why BEV semantic maps?

In a typical autonomous driving stack, Behavior Prediction and Planning are generally done in this a top-down view (or bird’s-eye-view, BEV), as hight information is less important and most of the information an autonomous vehicle would need can be conveniently represented with BEV. This BEV space can be loosely referred to as the 3D space. (For example, object detection in BEV space is typically referred to as 3D localization, to differ from full-blown 3D object detection.)

It is therefore standard practice to rasterize HD maps into a BEV image and combine with dynamic object detection in behavior prediction planning. Recent research exploring this strategy includes IntentNet (Uber ATG, 2018), ChauffeurNet (Waymo, 2019), Rules of the Road (Zoox, 2019), Lyft Prediction Dataset (Lyft, 2020), among many others.

Recent research rendering HD Maps as a BEV semantic map (*Image compiled by the author of this post,* sources from publications in reference)

Traditional computer vision tasks such as object detection and semantic segmentation involve making estimations in the same coordinate frame as the input image. As a consequence, the Perception stack of autonomous driving typically happens in the same space as the onboard camera image — the perspective view space.

Perception happens in image space (left: SegNet) while Planning happens in BEV space (right: NMP) (source)

The gap between the representation used in perception and downstream tasks such as prediction and planning are typically bridged in the Sensor Fusion stack, which lifts the 2D observation in perspective space to 3D or BEV, usually with the help of active sensors such as radar or lidar. That said, it is beneficial for perception across modalities to use BEV representation. First of all, it is interpretable and facilitates debugging about inherent failure modes for each sensing modality. It is also easily extensible to other new modalities and simplifies the task of late fusion. In addition, as mentioned above, the perception results in this representation can be readily consumed by prediction and planning stack.

Lifting Perspective RGB images to BEV

The data from active sensors such as radar or lidar lend themselves to the BEV representation as the measurement are inherently metric in 3D. However, due to the ubiquitous presence and low cost of the surround-view camera sensors, the generation of BEV images with semantic meaning has attracted a lot of attention recently.

In the title of this post, “monocular” refers to the fact that the input of the pipeline are images obtained from monocular RGB cameras, without explicit depth information. Monocular RGB images captured onboard autonomous vehicles are perspective projections of the 3D space, and the inverse problem of lifting 2D perspective observations into 3D is an inherently ill-posed problem.

Challenges, IPM and Beyond

One obvious challenge for BEV semantic segmentation is the view transformation. In order to properly restore the BEV representation of the 3D space, the algorithm has to leverage both hard (but potentially noisy) geometric priors such as the camera intrinsics and extrinsics, and also soft priors such as the knowledge corpus of road layout, and common sense (cars do not overlap in BEV, etc). Conventionally, inverse perspective mapping (IPM) has been the go-to method for this task, assuming a flat ground assumption and a fixed camera extrinsics. But this task does not work well for non-flat surface or on a bumpy road when camera extrinsics vary.

IPM generates artifacts around 3D objects such as cars (source)

The other challenge lies in the collection of data and annotation for such a task. One way to do this is to have a drone following the autonomous vehicle at all times (similar to MobileEye’s CES 2020 talk), and then ask human annotation of semantic segmentation. This method is obviously not practical and scalable. Many studies have relied on synthetic data or unpaired map data for training the lifting algorithm.

In the following sessions, I will review recent advances in the field and highlight the commonalities. These studies can be largely grouped into two types depending on the supervision signal used. The first type of study resorts to simulation for indirect supervision and the second type directly leverages the recently released multi-modal datasets for direct supervision.

Simulation and Semantic Segmentation

The seminal studies in this field use simulation to generate the necessary data and annotation to lift perspective images into BEV. To bridge the simulation-to-reality (sim2real) domain gap, many of them use semantic segmentation as an intermediate representation.

VPN (View Parser Network, RAL 2020)

VPN (Cross-view Semantic Segmentation for Sensing Surroundings) is among the first works to explore BEV semantic segmentation and refers to it as “cross-view semantic segmentation”. The View Parsing Network (VPN) uses a view transformer module to model the transformation from perspective to BEV. This module is implemented as a multilayer perceptron (MLP) that stretches the 2D physical extent into a 1D vector and then perform a fully connected operation on it. In other words, it ignores strong geometric priors but purely adopts a data-driven approach to learn the perspective-to-BEV warping. This warping is camera-specific and one network has to be learned per camera.

The overall pipeline of View Parse Network (source)

VPN uses synthetic data (generated with CARLA) and adversarial loss for domain adaptation during training. In addition, it uses a semantic mask as an intermediate representation without the photorealistic texture gap.

The input and output of the view transformer module are of the same size. The paper mentioned that this makes it easily plugged-in to other architectures. It is actually quite not necessary as I see it, as the perspective view and BEV are inherently different spaces and therefore no need to enforce the same pixel format nor even the aspect ratio between the input and output. Code is available on github.

Fishing Net (CVPR 2020)

Fishing Net convert lidar, radar, and camera fusion in a single unified representation in BEV space. This representation makes it much easier to perform late fusion across different modalities. The view transformation module (the purple block in the vision path) is similar to the MLP-based VPN. The input to the view transformation network is a sequence of images, but they are just concatenated across the channel dimension and fed into the network, instead of leveraging an RNN structure.

Fishing Net uses BEV representation to facilitate late fusion across sensor modalities (source)

The groundtruth generation is with 3D annotation in lidar, and it mainly focuses on dynamic objects such as vehicles and VRU (vulnerable road users, such as pedestrians and cyclists). All the rest are represented by a background class.

The BEV semantic grid has a resolution of 10 cm and 20 cm/pixel. This is much coarser than the typical value of 4 or 5 cm/pixel used in the offline mapping. Following the convention of VPN, the dimensions of the images match the output resolution of 192 x 320. Talk at CVPR 2020 can be found on Youtube.

VED (ICRA 2019)

VED (Monocular Semantic Occupancy Grid Mapping with Convolutional Variational Encoder-Decoder Networks) exploits a variational encoder-decoder (VED) architecture for semantic occupancy grid map prediction. It encodes the front-view visual information for the driving scene and subsequently decodes it into a BEV semantic occupancy grid.

Variational Encoder-Decoder network for view transformation in VED (source)

This groundtruth is generated using a disparity map from stereo matching in the CityScape dataset. This process may be noisy, and this actually prompted the use of VED and the sampling from the latent space to make the model robust to imperfect GT. However, by virtue of being a VAE, it often does not produces sharp edges, perhaps due to the Gaussian prior and mean-squared error.

The input image and output are 256×512 and 64×64. VED leveraged the architecture of a vanilla SegNet (a relatively strong baseline for conventional semantic segmentation) and introduced one 1x2pooling layer in order to accommodate the different aspect ratio of input and output.

Learning to Look around Objects (ECCV 2018)

Learning to Look around Objects for Top-View Representations of Outdoor Scenes hallucinates occluded areas in BEV, and leverages simulation and map data to help.

Personally, I think this is quite a seminal paper in the field of BEV semantic segmentation, but it does not seem to have received much attention. Maybe it needs a catchy name?

Overall pipeline for Learning to Look around Objects (source)

The view transformation is done via pixel-wise depth prediction and project to BEV. This partially overcomes the issue of lack of training data in BEV space. This is also done in later work as in Lift, Splat, Shoot (ECCV 2020) reviewed below.

The trick used by the paper to learn to hallucinate (predict occluded portions) is quite amazing. For dynamic objects whose GT depth is hard to find, we filter out loss. Randomly masking out blocks of images and ask the model to hallucinate. Use the loss as a supervision signal.

Ignore dynamic regions, but explicitly drop-out static regions to directly supervise hallucination (source)

As it is hard to obtain explicitly paired supervision in BEV space, the paper used the adversarial loss to guide the learning with simulation and OpenStreetMap data to ensure that the generated road layout looks like realistic road layouts. This trick is also used in later work as in MonoLayout (WACV 2020).

It employs one CNN in image space for depth and semantics prediction, lifts the predictions to 3D space and renders in BEV, and finally uses another CNN in BEV space for refinement. This refinement module in BEV is also used in many other works such as Cam2BEV (ITSC 2020) and Lift, Splat, Shoot (ECCV 2020).

Cam2BEV (ITSC 2020)

Input semantic segmentation images and predicted BEV map in Cam2BEV (source)

Cam2BEV (A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird’s Eye View) uses a spatial transformer module with IPM to transform perspective features to BEV space. The neural network architecture takes in four images captured by different cameras, and for each of them apply IPM transformation before concatenating them together.

Cam2BEV uses deterministic IPM to transform intermediate feature maps (source)

Cam2BEV uses synthetic data generated from VTD (virtual test drive) simulation environment. It takes in four semantic segmentation image and focus on the lifting process and avoided dealing with the sim2real domain gap.

Cam2BEV has a rather focused scope and many design choices which makes it highly practical. First of all, it only works in semantic space, and thus avoided the question of sim2real domain gap. It has a preprocessing stage to deliberately mask out occluded regions to avoid reasoning about occlusion and arguably make the problem more tractable. To ease the lifting process, it also takes as input a “homography image”, generated by IPM of semantic segmentation results and concatenated into a 360 deg BEV image. Thus the main goal of Cam2BEV is to reason the physical extent of the 3D objects in the BEV, which may be elongated in the homography image.

Preprocessed Homography image and the target GT for the neural network to predict in Cam2BEV (source)

Cam2BEV targets to correct IPM but in the sense that IPM distort 3D objects such as cars that are not on the road surface. Yet it still cannot handle non-flat road surface or pitch changes during the drive. Both the input and output of Cam2BEV is 256x512 pixels. Code is available in github. It also provides a nice baseline implementation of IPM.

All you need is (multimodal) datasets

The recent release of many multi-modality datasets (Lyft, Nuscenes, Argoverse, etc) makes direct supervision of the monocular BEV semantic segmentation task possible. These datasets provide not only 3D object detection information but also an HD map along with localization information to pinpoint ego vehicle at each timestamp on the HD map.

The BEV segmentation task has two parts, the (dynamic) object segmentation task, and the (static) road layout segmentation task. For object segmentation, 3D bounding boxes are rasterized into the BEV image to generate annotation. For static road layouts, maps are transformed into the ego vehicle frame based on provided localization results and rasterized into BEV annotation.

MonoLayout (WACV 2020)

MonoLayout: Amodal scene layout from a single image focuses on the lifting of a single camera into a semantic BEV space. The focus of the paper is on amodal completion which reasons for the occluded area. It seems to be heavily influenced by Learning to Look around Objects (ECCV 2018).

MonoLayout hallucinates occluded regions in KITTI (left) and Argoverse (right) (source)

The view transformation is performed via an encoder-decoder structure and the latent feature is called “shared context”. Two decoders are used to decode the static and dynamic class separately. The authors also reported negative results of using a combined decoder to handle both static and dynamic objects in the ablation study.

The Encoder-decoder design of MonoLayout (source)

Though HD Map groundtruth is available in Argoverse dataset, MonoLayout chooses to use it only for evaluation but not for training (hindsight or deliberate design choice?). For training, MonoLayout uses a temporal sensor fusion process to generated weak groundtruth by aggregating 2D semantic segmentation results throughout a video with localization information. It uses monodepth2 to lift RGB pixels to point cloud. It also discards anything 5 m away from the ego car as they could be noisy. To encourage the network to output conceivable scene layout, MonoLayout used adversarial feature learning (similar to that used in Learning to Look around Objects). The prior data distribution is obtained from OpenStreetMap.

MonoLayout has a spatial resolution of 30 cm/pixel, and thus the 128 x 128 output corresponds to 40 m x 40 m in BEV space. Code is available in github.

PyrOccNet (CVPR 2020)

PyrOccNet: Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks predicts BEV semantic map from monocular images and fuses them into a coherent view with Bayesian Filtering.

Demo results of PyrOccNet on NuScenes dataset (source)

The core component of view transformation in PyrOccNet is performed via a dense transformer module (note that this transformer is NOT attention based!). It seems to be heavily inspired by OFT (BMVC 2019) from the same authors. OFT uniformly smears the feature at a pixel location along the ray projected back into 3D space, which closely resembles the backprojection algorithm used in Computational Tomography. The Dense Transformer module in PyrOccNet takes it one step further by using an FC layer to expand along the depth axis. In practice, there are multiple dense transformers operating at different scales, focusing on different distance ranges in BEV space.

The dense transformer layer is inspired by the observation that while the network needs a lot of vertical context to map features to the birds-eye-view (due to occlusion, lack of depth information, and the unknown ground topology), in the horizontal direction the relationship between BEV locations and image locations can be established using simple camera geometry. — from PyrOccNet paper

Architecture diagram of PyrOccNet with Dense Transformer (source)

The training data comes from the multimodal dataset of the Argoverse dataset and nuScenes dataset, which has both map data and 3D object detection ground truth

PyrOccNet uses Bayesian Filtering to fuse the information across multiple cameras and across time in a coherent manner. It drawing inspiration from the old idea of the binary Bayesian occupancy grid and boosts the interpretability of the network output. The temporal fusion results closely resemble a mapping process and is quite similar to the “temporal sensor fusion” process used to generate weak GT in MonoLayout.

Temporal fusion results of PyrOccNet for only static classes (source)

PyrOccNet has a spatial resolution of 25 cm/pixel, and thus the 200 x 200 output corresponds to 50 m x 50 m in BEV space. Code will be available in github.

Lift, Splat, Shoot (ECCV 2020)

Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D performs dense pixel-wise depth estimation for the view transformation. It first uses per-camera CNNs to perform probabilistic pixel-wise depth prediction to lift each perspective image into a 3D point cloud and then uses camera extrinsics to splat on BEV. Finally, a BEV CNN is used to refine the predictions. The “shoot” part means path planning and will be skipped as it is outside the scope of this post.

Lift and Splat pipeline (left) and probabilistic depth prediction (right) (source)

It proposed probabilistic 3D lifting through prediction of depth distribution for a pixel in the RGB image. In a way, it unifies the one-hot lifting of pseudo-lidar (CVPR 2019) and the uniform lifting of OFT (BMVC 2019). Actually this “soft” prediction is a trick commonly used in differentiable rendering. The concurrent work of Pseudo-Lidar v3 (CVPR 2020) also uses this soft rasterization trick to make depth lifting and projection differentiable.

The training data comes from the multimodal dataset of the Lyft dataset and nuScenes dataset, which has both map data and 3D object detection ground truth.

Lift-Splat-Shoot has a input resolution of 128x352, and the BEV grid is 200x200 with a resolution of 0.5 m/pixel = 100m x 100m. Code is available in github.

BEV Feature Stitching

Both PyrOccNet and Lift-Splat-Shoot focuses on stitching synchronized images from multiple cameras into a coherent 360-degree view. BEV-Feature-Stitching (Understanding Bird's-Eye View Semantic HD-Maps Using an Onboard Monocular Camera) fuses a monocular video (with estimated ego pose) into a coherent forward view.

BEV Feature Stitching takes in a monocular video (a sequence of images) as the input of the model. To fuse the information from multiple frames, it introduces a BEV temporal aggregation module. This module first projects the intermediate feature map, and then aggregate the feature maps with ego pose (estimated from odometry pipeline) into a coherent and extended BEV space. In addition, it intermediate features of each image frame is supervised in camera space with reprojected BEV ground truth.

BEV Feature Stitching has a BEV grid of 200x200 pixels, with a resolution of 0.25 m/pixel = 50m x 50m.

PYVA: Projecting Your View Attentively (CVPR 2021)

It is difficult for CNN to directly fit a view projection model due to the locally confined receptive fields of convolutional layers. Therefore, most of the CNN-based methods explicitly encodes the view transformation into the CNN architecture (homographic transformation of feature maps with camera extrincis, etc).

Transformers, on the other hand, are more suitable to do this job due to the global attention mechanism. PYVA (Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation, CVPR 2021) uses a cross-attention transformer module (although they call it attention and did not spell the transformers out out explicitly) to lift image features to BEV and perform road layout and vehicle segmentation on it.

Following VPN, PYVA uses MLP to lift projective image features into BEV. Then it uses a cross-attention transformer to do further refinement of the lifting. The lifted BEV feature X’ is used as Query, and image feature X (or cycled reprojected image feature X’’) act as Key and Value.

The idea behind the use of the cross-attention transformer module is intuitive: for each pixel in Query (BEV feature), which pixel in the Key (image feature) shall the network pay attention to? Unfortunately this paper did not show some example of the attention maps inside the transformer module, which would have illustrated this intuition nicely.

PYVA uses a cross-view transformer module to lift front view into BEV

PYVA follow the idea of MonoLayout to use adversarial training loss to encourage the decoder to output more reasonable road layout and vehicle locations. PYVA takes it one step further by not only encouraging the network to output reasonable vehicle location but also reasonable relationship between the road and vehicles, as the road layout provide useful prior to predict vehicle location and orientation (cars are more likely to park alongside the road, for example).

Panoptic BEV

Panoptic BEV (Bird’s-Eye-View Panoptic Segmentation Using Monocular Frontal View Images) notes correctly that the notion of instance is critical to downstream. FIERY extended the semantic segmentation idea to instance segmentation, and Panoptic BEV takes it one step further and performs panoptic segmentation in BEV space.

Panoptic BEV uses a ResNet backbone with BiFPN (as in EfficientDet) neck. At each level (P2-P5), image features are projected into a BEV by a dense transformer module (note that this transformer is NOT attention based!).

Network architecture of Panoptic BEV (transformer her is not attention-based)

In each dense transformer, there is a distinct vertical transformer and a flat transformer. First the input image is binary semantically segmented into a vertical mask and a horizontal mask. The image masked by vertical mask is fed into the vertical transformer. The vertical transformer uses a volumetric lattice to model the intermediate 3D space which is then flattened to generated the vertical BEV features. This vertical transformer is quite similar to that of Lift-Splat-Shoot. The flat transformer uses IPM followed by an Error Correction Module (ECM) to generate the flat BEV features. This largely follows the idea of PyrOccNet.

Dense transformer lifts image to BEV, and has a vertical transformer and flat transformer

The entire design seems to be built on top of Lift-Splat-Shoot and PyrOccNet, by injecting the prior information of IPM into the general lift-and-splat pipeline. The module design indeed look quite reasonable, but they feel too much hand-crafted and complicated and may not lead to the best performance once the model can get access to enough data.

The influence by PyrOccNet is also apparent from the naming of the view transformation module as a “dense transformer” and reference of the “vertical features”. The results look quite promising from the demo video. The code to be released on Github.

Limitations and Future Directions

Although much progress has been made in BEV semantic segmentation, there exist several key gaps as I see it before its wide deployment in production systems.

First of all, for the dynamic participants, there is no concept of instance yet. This makes it hard to leverage prior knowledge of dynamic objects in the behavior prediction. For example, cars follow a certain motion model (such as a bicycle model) and have limited patterns of future trajectory, while pedestrians tend to have more random motion. Many of the existing approaches tend to connect multiple cars into one contiguous region in the semantic segmentation results.

The dynamic semantic classes cannot be reused and are largely “disposable” information, while static semantic classes (such as the road layout and markings on the road) in the BEV image can be seen as an “online map” and should be harvested and recycled. How to aggregate BEV semantic segmentation over multiple timestamps to estimate a better map is another critical question to answer. The temporal sensor fusion approach in MonoLayout and PyrOccNet may be useful, but they need to be benchmarked against traditional approaches such as SLAM.

How to convert the online pixel-wise semantic map into a lightweight and structured map for future reuse. In order not to throw away precious mapping cycles onboard, the online map has to be converted to some format which the ego car or other cars can effectively leverage in the future.

Takeaway

View Transformation: many of the existing work ignores the strong geometric prior info of camera extrinsics. This should be avoided. PyrOccNet and Lift-Splat-Shoot seem to be in the correct direction.
Data and Supervision: Most of the research prior to 2020 are based on simulation data and use semantic segmentation as the intermediate representation to bridge the sim2real domain gap. More recent works leverage multi-modal datasets to perform direct supervision on the task, achieving very promising results.
I do feel that perception in BEV space is the future of perception, especially with the help of differentiable rendering, the view transformation can be implemented as a differentiable module and plug into an end-to-end model to directly lift perspective images into BEV space.