Depth Estimation in Structured Driving Scenes

Depth estimation is an important part of the perception system in autonomous driving. Current studies often reconstruct dense depth maps from RGB images together with sparse depth maps obtained from other sensors. However, existing methods often pay insufficient attention to latent semantic information. Considering the highly structured nature of driving scenes, the researchers propose a dual-branch network that predicts dense depth maps by fusing radar and RGB images. The proposed architecture divides the driving scene into three parts, each of which predicts its own depth map; these maps are finally merged into one by a fusion strategy, so that the latent semantic information in the driving scene is fully exploited.
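The exact architecture and fusion strategy are not reproduced here. Purely as a hypothetical illustration of how region-wise depth predictions could be merged with per-pixel weights, the following PyTorch sketch assumes three branch outputs and soft region scores (`region_scores`) produced elsewhere in the network; none of the names or shapes come from the described method.

```python
import torch

def merge_region_depths(depth_maps, region_scores):
    """Merge per-region depth predictions into a single dense depth map.

    depth_maps:    list of three tensors, each (B, 1, H, W), one per scene part
    region_scores: tensor (B, 3, H, W), unnormalized per-pixel region scores
    """
    stacked = torch.cat(depth_maps, dim=1)                # (B, 3, H, W)
    weights = torch.softmax(region_scores, dim=1)         # per-pixel region weights
    return (stacked * weights).sum(dim=1, keepdim=True)   # (B, 1, H, W)

# Example with random tensors standing in for the three branch outputs.
parts = [torch.rand(1, 1, 8, 16) for _ in range(3)]
fused = merge_region_depths(parts, torch.randn(1, 3, 8, 16))
```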

  • depth estimation
  • camera
  • dual-branch network

1. Introduction

In recent years, the automotive industry has entered an era of change driven by the rapid development of automated driving technologies, with companies such as Google, Huawei, Baidu, and Tesla competing in research and continually advancing the field. Automated driving is a comprehensive subject that crosses many disciplines, and its research content can be roughly divided into three sub-modules: perception, decision-making, and control [1]. Among them, the perception system, as the “eyes” of the automated driving system, is responsible for collecting road information and processing it to provide the necessary input for subsequent decision-making, so its importance is self-evident. Depth estimation aims to estimate the depth information of a scene from image information, which effectively facilitates 3D object detection [2][3][4], scene reconstruction [5][6][7], and other tasks [8], and therefore has extremely important application value for automated driving.
Accurate depth estimation, which aims to predict the depth value of each pixel from an RGB image, can effectively guarantee the safety of autonomous vehicles. Since convolutional neural networks (CNNs) have achieved great success in many other fields, they have also been introduced extensively into depth estimation, predicting the dense depth map corresponding to an input RGB image in an end-to-end manner, and they have indeed shown strong capability [9][10][11]. Owing to the use of CNNs, existing algorithms have greatly improved the accuracy of depth estimation compared with conventional methods, and this remains the common paradigm for the problem today. Many different network architectures and loss functions have been proposed successively in pursuit of better depth estimation performance.
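To make the end-to-end regression paradigm concrete, the following is a minimal PyTorch sketch of an encoder–decoder that maps an RGB image to a dense depth map; the tiny layer counts and channel widths are illustrative placeholders and do not correspond to any specific published network.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder that regresses a dense depth map from an RGB image."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(   # downsample 4x while extracting features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(   # upsample back to the input resolution
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, rgb):             # rgb: (B, 3, H, W) -> depth: (B, 1, H, W)
        return self.decoder(self.encoder(rgb))

depth = TinyDepthNet()(torch.randn(1, 3, 64, 64))   # (1, 1, 64, 64)
```

In practice the network would be trained with a pixel-wise regression loss (e.g., L1 or berHu) against ground-truth depth.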

2. Fusion of Radar and Camera Applications

Radar technology is known for its high-precision obstacle detection capability and can detect a wide range of targets such as vehicles, pedestrians, and buildings. Unlike cameras, radar operates using electromagnetic waves and is therefore unaffected by low-light conditions (e.g., sunset, nighttime, or inclement weather), enabling it to operate reliably in a wide range of environments and significantly improving the reliability of autonomous driving systems. Fusing radar with other sensors, such as cameras, not only provides richer information but also enables the system to understand the surrounding environment more accurately. Radar–camera fusion has been widely used in many fields, such as security, earthquake relief, and autonomous driving. In [12][13], fusing the target velocity and azimuth information obtained by radar with image information not only ensures consistency of target tracking but also improves tracking accuracy. The fusion of camera and radar has been shown to be effective not only for target detection but also in gesture recognition [14], vital signs monitoring [15], human skeleton detection [16], and earthquake rescue [17], among other fields. It is especially important in autonomous driving. For example, Hussain et al. [18] designed a low-cost method for detecting drivable areas at long range for self-driving cars by fusing radar and camera. Similarly, Wu et al. [19] addressed the challenge of missing parking boundaries on maps and difficult parking-spot detection by jointly using radar and cameras. Beyond sensing the external environment, Kubo et al. [20] proposed a non-contact method that estimates driver drowsiness with high accuracy and detail. De Araujo et al. [21] estimated the vehicle’s position based on the fusion of radar and camera sensors with speed and direction information. In summary, the fusion of radar and camera plays a key role in multidisciplinary applications, providing more accurate and comprehensive information by integrating the advantages of different sensors and promoting the development of automation technology in various fields.

3. Monocular Depth Estimation

Monocular depth estimation is quite a challenging subject, since 3D information is lost when images are captured by a monocular camera. Traditional algorithms rely heavily on hand-crafted features, such as texture and geometry, combined with a probabilistic model [22][23]. Over the past few years, CNNs have achieved convincing results in image processing, so they have been introduced into depth estimation and have gradually become the most popular approach for this task. In general, depth estimation is treated as a regression problem. Eigen et al. [24] were the first to construct a multi-scale deep CNN to generate a dense depth map. Some methods attempt to combine CNNs with conditional random fields (CRFs) to improve performance [25], but this also increases system complexity. New architectures have been proposed to better extract features. Laina et al. [26] designed a fully convolutional residual network (FCRN) for depth estimation with an encoder–decoder structure, in which the fully connected layer and the last pooling layer are replaced by an efficient decoder consisting of a series of up-convolutional modules, significantly reducing the number of parameters. Several improvements inspired by this design, such as [27], have been proposed since then. Other methods improve the model by introducing attention mechanisms [28][29]. Chen et al. [30] used a novel spatial attention block to guide different feature layers to focus on different structural information, i.e., local details or global structures. On the other hand, many researchers treat depth estimation as a classification problem, dividing the depth range into discrete bins into which each pixel falls, followed by post-processing that maps back to continuous space and yields the final continuous depth map. Typical methods include Cao et al. [31], who used a CNN to determine the appropriate bin for each pixel and then employed a fully connected CRF to generate a continuous prediction map from the classification result, and Fu et al. [32], who discretized the depth values in log space and treated the task as an ordinal regression problem.
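As a concrete illustration of the classification view, the sketch below discretizes a depth range into bins that are uniform in log space, in the spirit of [32]; the depth range, bin count, and geometric-center decoding are arbitrary assumptions for the example, not the published formulation.

```python
import numpy as np

def log_space_bins(d_min=1.0, d_max=80.0, n_bins=64):
    """Bin edges uniformly spaced in log(depth), so near ranges get finer bins."""
    return np.exp(np.linspace(np.log(d_min), np.log(d_max), n_bins + 1))

def depth_to_bin(depth, edges):
    """Map continuous depth values to discrete bin indices (0 .. n_bins-1)."""
    return np.clip(np.digitize(depth, edges) - 1, 0, len(edges) - 2)

def bin_to_depth(idx, edges):
    """Map bin indices back to a representative depth (geometric bin center)."""
    return np.sqrt(edges[idx] * edges[idx + 1])

edges = log_space_bins()
bins = depth_to_bin(np.array([1.5, 20.0, 79.0]), edges)   # class labels for training
recon = bin_to_depth(bins, edges)                         # post-processed depths
```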

4. Depth Completion

Depth completion uses RGB images to guide the densification of sparse depth maps. Two typical challenges are how to better handle sparse data and how to integrate the two modalities efficiently. Many CNN-based methods have been proposed. As shown in Figure 1, early fusion models concatenate the sparse depth map directly with the RGB image and feed the result into a single network. Because the initial depth map is highly sparse, conventional convolution handles it poorly; Uhrig et al. [33] therefore proposed a sparsity-invariant convolution that uses a binary validity mask to adapt to sparse input (see the sketch below). For the late fusion strategy, a dual-branch encoder is common: features of the RGB image and the sparse depth map are extracted separately, fused at intermediate layers, and passed to the decoder [34]; alternatively, depth maps are inferred from the respective features and the final map is obtained by merging the two outputs [35]. Some works devise a two-stage network that first predicts a coarse depth map and then passes it to a refinement stage. Cheng et al. [36] proposed the convolutional spatial propagation network (CSPN), which refines the results by learning an affinity matrix and propagating sparse depth samples within local neighborhoods.
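The following is a minimal sketch of the mask-aware, renormalized convolution idea behind sparsity-invariant convolution [33]; the normalization details, kernel handling, and epsilon are illustrative simplifications rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def sparse_conv(depth, valid_mask, weight, eps=1e-8):
    """Mask-aware convolution: only observed depth pixels contribute, and the
    response is renormalized by how many valid pixels the kernel covered.

    depth:      (B, 1, H, W) sparse depth map, zeros where no measurement exists
    valid_mask: (B, 1, H, W) binary mask, 1 where depth is observed
    weight:     (1, 1, k, k) convolution kernel
    """
    k = weight.shape[-1]
    pad = k // 2
    num = F.conv2d(depth * valid_mask, weight, padding=pad)
    den = F.conv2d(valid_mask, torch.ones_like(weight), padding=pad)
    out = num / (den + eps)
    # An output location is valid if at least one observed pixel fell in the window.
    new_mask = (den > 0).float()
    return out, new_mask

sparse = torch.rand(1, 1, 16, 16) * (torch.rand(1, 1, 16, 16) > 0.9)   # ~10% coverage
mask = (sparse > 0).float()
dense, new_mask = sparse_conv(sparse, mask, torch.ones(1, 1, 3, 3))
```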
Figure 1. Different stages of fusion. (a) Early fusion; (b) late fusion.
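To make the distinction in Figure 1 concrete, the schematic fragment below contrasts the two strategies; `network`, `rgb_encoder`, `depth_encoder`, and `decoder` are placeholders for whatever backbones a given method uses.

```python
import torch

# Early fusion: concatenate RGB and sparse depth along channels, one shared network.
def early_fusion(rgb, sparse_depth, network):
    x = torch.cat([rgb, sparse_depth], dim=1)   # (B, 3+1, H, W)
    return network(x)

# Late fusion: separate encoders per modality, features merged before decoding.
def late_fusion(rgb, sparse_depth, rgb_encoder, depth_encoder, decoder):
    f_rgb = rgb_encoder(rgb)
    f_depth = depth_encoder(sparse_depth)
    return decoder(torch.cat([f_rgb, f_depth], dim=1))
```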

5. Depth Estimation with Semantic Information

The depth information in an image describes the spatial layout of the scene, while the semantic information represents its physical content; the two share similar contextual information. It is therefore natural for researchers to introduce semantic information into depth estimation as an auxiliary cue. Typically, depth estimation and semantic segmentation are trained jointly, with the parameters of the two tasks shared in a unified structure so that they promote each other. Zhang et al. [37] proposed a joint learning framework that recursively refines the results of both tasks. Zhu et al. [38] used semantic segmentation to improve the depth quality in object edge regions.
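A minimal sketch of the shared-parameter idea follows: one encoder feeds both a depth head and a segmentation head, so the two tasks are trained jointly; the layer widths and the number of semantic classes are illustrative assumptions, not taken from any cited method.

```python
import torch
import torch.nn as nn

class JointDepthSegNet(nn.Module):
    """Toy multi-task network: a shared encoder with a depth head and a segmentation head."""
    def __init__(self, n_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.depth_head = nn.Conv2d(64, 1, 3, padding=1)         # regress depth
        self.seg_head = nn.Conv2d(64, n_classes, 3, padding=1)   # per-pixel class logits

    def forward(self, rgb):
        feats = self.encoder(rgb)                # shared features serve both tasks
        return self.depth_head(feats), self.seg_head(feats)

net = JointDepthSegNet()
depth, seg_logits = net(torch.randn(1, 3, 64, 64))
# Joint training sums a depth regression loss and a segmentation loss on these outputs.
```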

References

  1. Ju, Z.; Zhang, H.; Li, X.; Chen, X.; Han, J.; Yang, M. A survey on attack detection and resilience for connected and automated vehicles: From vehicle dynamics and control perspective. IEEE Trans. Intell. Veh. 2022, 7, 815–837.
  2. Peng, X.; Zhu, X.; Wang, T.; Ma, Y. SIDE: Center-based stereo 3D detector with structure-aware instance depth estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 119–128.
  3. Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 13–14 February 2023; Volume 37, pp. 1477–1485.
  4. Alaba, S.Y.; Ball, J.E. Deep Learning-Based Image 3-D Object Detection for Autonomous Driving. IEEE Sens. J. 2023, 23, 3378–3394.
  5. Wei, R.; Li, B.; Mo, H.; Zhong, F.; Long, Y.; Dou, Q.; Liu, Y.H.; Sun, D. Distilled Visual and Robot Kinematics Embeddings for Metric Depth Estimation in Monocular Scene Reconstruction. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 8072–8077.
  6. Sayed, M.; Gibson, J.; Watson, J.; Prisacariu, V.; Firman, M.; Godard, C. SimpleRecon: 3D reconstruction without 3D convolutions. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–19.
  7. Xu, R.; Dong, W.; Sharma, A.; Kaess, M. Learned depth estimation of 3d imaging radar for indoor mapping. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 13260–13267.
  8. Hong, F.T.; Zhang, L.; Shen, L.; Xu, D. Depth-aware generative adversarial network for talking head video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3397–3406.
  9. Lee, J.H.; Heo, M.; Kim, K.R.; Kim, C.S. Single-image depth estimation based on fourier domain analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 330–339.
  10. Ramamonjisoa, M.; Du, Y.; Lepetit, V. Predicting sharp and accurate occlusion boundaries in monocular depth estimation using displacement fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14648–14657.
  11. Qi, X.; Liao, R.; Liu, Z.; Urtasun, R.; Jia, J. Geonet: Geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 283–291.
  12. Vandana, G.; Pardhasaradhi, B.; Srihari, P. Intruder detection and tracking using 77 ghz fmcw radar and camera data. In Proceedings of the 2022 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 8–10 July 2022; pp. 1–6.
  13. Ram, S.S. Fusion of inverse synthetic aperture radar and camera images for automotive target tracking. IEEE J. Sel. Top. Signal Process. 2022, 17, 431–444.
  14. Hazra, S.; Feng, H.; Kiprit, G.N.; Stephan, M.; Servadei, L.; Wille, R.; Weigel, R.; Santra, A. Cross-modal learning of graph representations using radar point cloud for long-range gesture recognition. In Proceedings of the 2022 IEEE 12th Sensor Array and Multichannel Signal Processing Workshop (SAM), Trondheim, Norway, 20–23 June 2022; pp. 350–354.
  15. Shokouhmand, A.; Eckstrom, S.; Gholami, B.; Tavassolian, N. Camera-augmented non-contact vital sign monitoring in real time. IEEE Sens. J. 2022, 22, 11965–11978.
  16. Sengupta, A.; Cao, S. mmpose-nlp: A natural language processing approach to precise skeletal pose estimation using mmwave radars. IEEE Trans. Neural Netw. Learn. Syst. 2022.
  17. Schroth, C.A.; Eckrich, C.; Kakouche, I.; Fabian, S.; von Stryk, O.; Zoubir, A.M.; Muma, M. Emergency Response Person Localization and Vital Sign Estimation Using a Semi-Autonomous Robot Mounted SFCW Radar. arXiv 2023, arXiv:2305.15795.
  18. Hussain, M.I.; Azam, S.; Rafique, M.A.; Sheri, A.M.; Jeon, M. Drivable region estimation for self-driving vehicles using radar. IEEE Trans. Veh. Technol. 2022, 71, 5971–5982.
  19. Wu, B.X.; Lin, J.J.; Kuo, H.K.; Chen, P.Y.; Guo, J.I. Radar and Camera Fusion for Vacant Parking Space Detection. In Proceedings of the 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), Incheon, Republic of Korea, 13–15 June 2022; pp. 242–245.
  20. Kubo, K.; Ito, T. Driver’s Sleepiness Estimation Using Millimeter Wave Radar and Camera. In Proceedings of the 2022 IEEE CPMT Symposium Japan (ICSJ), Kyoto, Japan, 9–11 November 2022; pp. 98–99.
  21. de Araujo, P.R.M.; Elhabiby, M.; Givigi, S.; Noureldin, A. A Novel Method for Land Vehicle Positioning: Invariant Kalman Filters and Deep-Learning-Based Radar Speed Estimation. IEEE Trans. Intell. Veh. 2023, 1–12.
  22. Liu, B.; Gould, S.; Koller, D. Single image depth estimation from predicted semantic labels. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 1253–1260.
  23. Ladicky, L.; Shi, J.; Pollefeys, M. Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 89–96.
  24. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014.
  25. Li, B.; Shen, C.; Dai, Y.; Van Den Hengel, A.; He, M. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1119–1127.
  26. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248.
  27. Hu, J.; Ozay, M.; Zhang, Y.; Okatani, T. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1043–1051.
  28. Chen, Y.; Zhao, H.; Hu, Z.; Peng, J. Attention-based context aggregation network for monocular depth estimation. Int. J. Mach. Learn. Cybern. 2021, 12, 1583–1596.
  29. Xu, D.; Wang, W.; Tang, H.; Liu, H.; Sebe, N.; Ricci, E. Structured attention guided convolutional neural fields for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3917–3925.
  30. Chen, T.; An, S.; Zhang, Y.; Ma, C.; Wang, H.; Guo, X.; Zheng, W. Improving monocular depth estimation by leveraging structural awareness and complementary datasets. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 90–108.
  31. Cao, Y.; Wu, Z.; Shen, C. Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 3174–3182.
  32. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2002–2011.
  33. Uhrig, J.; Schneider, N.; Schneider, L.; Franke, U.; Brox, T.; Geiger, A. Sparsity invariant cnns. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 11–20.
  34. Jaritz, M.; De Charette, R.; Wirbel, E.; Perrotton, X.; Nashashibi, F. Sparse and dense data with cnns: Depth completion and semantic segmentation. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 52–60.
  35. Hu, M.; Wang, S.; Li, B.; Ning, S.; Fan, L.; Gong, X. Penet: Towards precise and efficient image guided depth completion. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13656–13662.
  36. Cheng, X.; Wang, P.; Yang, R. Learning depth with convolutional spatial propagation network. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2361–2379.
  37. Zhang, Z.; Cui, Z.; Xu, C.; Jie, Z.; Li, X.; Yang, J. Joint task-recursive learning for RGB-D scene understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2608–2623.
  38. Zhu, S.; Brazil, G.; Liu, X. The edge of depth: Explicit constraints between segmentation and depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13116–13125.