Summary

The web content discusses the process of inverse projection transformation in computer vision, which involves reconstructing 3D scenes from 2D images using depth information.

Abstract

The article delves into the concept of inverse projection, a method used in computer vision to recover the three-dimensional structure of a scene from a two-dimensional image. It explains the loss of depth information in standard camera projections and the necessity of depth maps to accurately reconstruct the scene. The text covers the challenges of perceiving depth from perspective views, the importance of depth estimation in applications like autonomous vehicles, and the ongoing research in this field, including techniques such as stereo vision and deep learning. The mathematical foundations of back-projection are detailed, along with practical examples using Python and a simulator called CARLA. The article also touches on the pinhole camera model, camera calibration, and the process of obtaining intrinsic parameters. Finally, it explores the application of orthographic projection for creating top-down views, which are beneficial for mobile robot navigation.

Opinions

The author emphasizes the difficulty in discerning absolute distances and relative depths from 2D images due to perspective projection, highlighting the importance of depth cues and estimation techniques.
The article suggests that depth estimation using deep learning has shown remarkable performance, indicating a significant advancement in the field of computer vision.
The author provides a personal opinion on the complexity of reasoning about depth in perspective views and the benefits of having a depth map for accurate 3D positioning.
There is an endorsement of the checkerboard method for camera calibration and the use of PnP, Direct Linear Transform, or RANSAC for improving robustness in determining camera parameters.
The author expresses enthusiasm for the reader to engage with the content by including interactive elements such as exercises to derive focal lengths and optical centers, as well as providing access to code and further reading.
The text concludes with an appreciation for the practical applications of back-projection, such as in Structure from Motion and mobile robot navigation, showcasing the real-world relevance of the discussed concepts.

Inverse Projection Transformation

Fig 1: 3D points back-projected from a RGB and Depth image

Depth and Inverse Projection

When an image of a scene is captured by a camera, we lose depth information as objects and points in 3D space are mapped onto a 2D image plane. This is also known as a projective transformation, in which points in the world are converted to pixels on a 2d plane.

However, what if we want to do the inverse? That is, we want to recover and reconstruct the scene given only 2D image. To do that, we would need to know the depth or Z-component of each corresponding pixels. Depth can be represented as an image as shown in fig 2 (centre). With brighter intensity denoting point further away.

Understanding depth perception is essential in many computer vision applications. For example, able to measure depth for an autonomous vehicle allows for better decision making as the agent is fully aware of the separation distances between other vehicles and pedestrians.

Fig 2: (from left) RGB, Depth, 3D back-projected points

Difficulty reasoning in perspective view

Consider fig 2 above, given only the RGB image. It is hard to tell what is the absolute distance between the 2 cars on the left lane. Also, I most certainly find it difficult to perceive whether the trees on the left are indeed very near or it is extremely far from the house. The above phenomenon is a consequence of perspective projection, which requires us to rely on various cues to come up with a good estimate of the distance. I discussed some of the issues with perspectivity in this post which might interest you:)

However, if we were to reproject it back to 3d with the aid of a depth map (fig 2. right), we could accurately position the trees and perceive that they are actually distanced away from the building. Yes, we are really bad at figuring the relative depth when an object is occluded behind another object. The main takeaway from this is looking at images alone, it is hard to discern depth.

The problem of estimating depth is an on-going research and has well progressed over the years. Many techniques have been developed and the most successful methods come from determining depth using stereo vision[1]. And in recent years, depth estimation using deep learning has shown incredible performance [2], [3].

In this article, we will take a tour and understand the mathematics and concepts of performing back-projection from 2D pixels coordinate to 3D points. I will then go through a simple example in Python to show the projection in action. Code is available here. We will assume that a depth map is provided to perform the 3D reconstruction. The concepts that we will go through are camera calibration parameters, projective transformation using intrinsic and its inverse, coordinate transformation between frames.

3D reconstructed point clouds from the scene in Fig 2

Central Projection of Pinhole Camera Model

First and foremost, understanding the geometrical model of the camera projection serves as the core idea. What we are ultimately interested in is the depth, parameter Z. Here, we consider the simplest pinhole camera model with no skew or distortion factor.

3D points are mapped to the image plane (u, v) = f(X, Y, Z). The complete mathematical model that describes this transformation can be written as p = K[R|t] * P.

where

p is the projected point on the image plane
K is the camera intrinsics matrix
[R|t] is the extrinsic parameters describing the relative transformation of the point in the world frame to the camera frame
P, [X, Y, Z, 1] represents the 3D point expressed in a predefined world coordinate system in Euclidean space
Aspect ratio scaling, s: controls how pixels are scaled in the x and y direction as focal length changes

Intrinsic parameter matrix

The matrix K is responsible for projecting 3D points to the image plane. To do that, the following quantities must be defined as

Focal length (fx, fy): measure the position of the image plane wrt to the camera centre.
Principal point (u0, v0): The optical centre of the image plane
Skew factor: The misalignment from a square pixel if the image plane axes are not perpendicular. In our example, this is set to zero.

The most common way of solving all the parameters is using the checkerboard method. Where several 2D-3D correspondences are obtained through matching and solving the unknown parameters by means of PnP, Direct Linear Transform or RANSAC to improve robustness.

With all the unknowns determine, we can finally proceed to recover the 3D points (X, Y, Z) by applying the inverse.

Backprojection

Consider the equation in Fig 4. Suppose (X, Y, Z, 1) is in the camera coordinate frame. i.e. we do not need to consider the extrinsic matrix [R|t]. Expanding the equation would give as

The 3D points can be recovered with Z given by the depth map and solving for X and Y. We can then further transform the points back to the world frame if needed.

Inverse projection example

Let's go through a simple example to digest the concepts. We will use the RGB and depth image as shown in figure 1. Pictures are acquired from a camera mounted on a car in a simulator, CARLA. The depth map is stored as float32 and encodes up to a maximum of 1000m for depth values at infinity.

Intrinsic Parameters from Field of View

Instead of determining the intrinsic parameters using checkerboard, one can calculate the focal lengths and optical centre for the pinhole camera model. The information needed is the imaging sensor height and width in pixels and the effective field of view in the vertical and horizontal direction. Camera manufacturer usually provides those. In our example, we will use +-45 degrees both in the vertical and horizontal direction. We will set the scale factor to 1.

Referring to fig 3, the focal lengths (fx, fy) and the principal point (u0, v0) can be determined using simple trigonometry. I leave it up to you to derive it as an exercise or you can look it up in the code!

Now, we can compute the inverse as follows

Obtain the intrinsic camera parameters, K
Find the inverse of K
Apply equation in fig 5 with Z as depth from a depth map.

# Using Linear Algebra
cam_coords = K_inv @ pixel_coords * depth.flatten()

A slower but more intuitive way of writing step 3 is

cam_points = np.zeros((img_h * img_w, 3))
i = 0
# Loop through each pixel in the image
for v in range(height):
    for u in range(width):
        # Apply equation in fig 5
        x = (u - u0) * depth[v, u] / fx
        y = (v - v0) * depth[v, u] / fy
        z = depth[v, u]
        cam_points[i] = (x, y, z)
        i += 1

You will get the same results!

Conclusion

There we have it, I have gone through the basic concepts required to do a back-projection.

Back projecting to 3D forms the basis of 3D scene reconstruction via Structure form Motion where several images are captured from a moving camera, along with its depth known or computed. Thereafter, matching and stitching together to get a complete understanding of the scene structure.

darylclimb/cvml_project

Projects and application using computer vision and machine learning - darylclimb/cvml_project

github.com

Orthographic Projection: Top View (Optional)

With the points represented in 3D, one interesting application is to project it to a top-down view of the scene. This is usually a useful representation for mobile robots as the distances between obstacles are preserved. Furthermore, it is easy to interpret and utilize to perform path planning and navigation task. For this, we need to know the coordinate system in which the points references.

We will use the right-hand coordinate system as defined below

Fig 6: Camera coordinate and pixel coordinate

For this simple example, which plane do you think the points should be projected?

If your guess is on the plane y= 0, you are right as y represent height as defined by the camera coordinate system. We simply collapse the y component in the projection matrix.

Looking at the figure below, you can easily measure the separation distances between all vehicles and objects.

Reference

[1] Hirschmuller, H. (2005). Accurate and Efficient Stereo Processing by Semi Global Matching and Mutual Information. CVPR

[2] Tinghui Zhou, Matthew Brown, Noah Snavely, and David Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017

[3] Clement Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.