LiDAR Point Clouds Semantic Segmentation in Autonomous Driving: Comparison

Although semantic segmentation of 2D images is crucial to attaining scene understanding, visual sensors still have limitations, such as poor performance under insufficient light, the lack of depth information, and a limited field of view. In contrast, LiDAR obtains accurate depth information at high density and over a wide field of view regardless of lighting conditions, which makes it a more reliable source of information for environmental perception.

  • LiDAR point clouds
  • semantic segmentation
  • deep learning
  • asymmetric convolution

1. Introduction

Scene understanding is one of the most critical tasks in autonomous driving, and a detailed and accurate understanding of the road scene has become an essential component of any outdoor autonomous robotic system. Although semantic segmentation of 2D images is crucial to attaining scene understanding, visual sensors still have limitations, such as poor performance under insufficient light, the lack of depth information, and a limited field of view. In contrast, LiDAR obtains accurate depth information at high density and over a wide field of view regardless of lighting conditions, which makes it a more reliable source of information for environmental perception. Therefore, semantic segmentation of LiDAR point clouds for scene understanding has become a focal point in autonomous driving.
According to how the point clouds are encoded, current LiDAR point cloud semantic segmentation methods can be divided into three categories: point-based methods, voxel-based methods, and projection-based methods. In terms of speed, point-based and voxel-based methods demand large amounts of computation and memory, which makes real-time operation on an on-board computing platform difficult. In autonomous driving, real-time performance deserves a higher priority than segmentation accuracy alone. In contrast, projection-based methods are lightweight and fast, so real-time operation can be achieved during deployment. In terms of segmentation accuracy, projection-based methods have shown some success; however, since the point cloud information is not fully exploited during feature extraction, there is still room for improvement.
When real-time operation is already achievable, improving segmentation accuracy becomes highly relevant in autonomous driving scenarios. PolarNet [1], the baseline network of ACPNet, encodes point clouds through a polar bird's-eye-view (BEV) representation. A BEV is a top-down perspective that views an object or scene from above, like a bird looking down at the ground; it is commonly used as a coordinate frame for describing perception of the surrounding world. The polar BEV has two advantages. First, in terms of point allocation within grid cells, the polar partition distributes the points more evenly across the cells. Second, because this partition yields a more balanced point distribution, the theoretical upper bound on the accuracy of per-cell semantic prediction is raised, which in turn improves the performance of downstream semantic segmentation models [1]. In ACPNet, the encoded point cloud features are fed into an Asymmetric Convolution Backbone Network (ACBN) for feature extraction. The features extracted by the backbone are then passed to the Contextual Feature Enhancement Module (CFEM) to further mine contextual features. Moreover, global scaling and global translation are used as Enhanced Data Augmentation (EDA) while ACPNet is being trained.
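As an illustration of the polar BEV partition described above, the following minimal Python/NumPy sketch assigns raw LiDAR points to radial and angular grid cells. The grid resolution (num_r, num_theta) and the maximum range r_max are illustrative assumptions, and the per-cell simplified-PointNet encoding used by PolarNet and ACPNet is not shown.

```python
import numpy as np

def polar_bev_indices(points, num_r=480, num_theta=360, r_max=50.0):
    """Assign points (N, >=2) to cells of a polar BEV grid.

    Returns per-point (radial_bin, angular_bin) indices. The grid size and
    range are illustrative, not the exact settings of PolarNet/ACPNet.
    """
    x, y = points[:, 0], points[:, 1]
    r = np.sqrt(x ** 2 + y ** 2)                      # distance in the xy-plane
    theta = np.arctan2(y, x)                          # azimuth in (-pi, pi]
    r_bin = np.clip((r / r_max * num_r).astype(int), 0, num_r - 1)
    t_bin = ((theta + np.pi) / (2 * np.pi) * num_theta).astype(int) % num_theta
    return r_bin, t_bin

# Cells near the sensor are small and cells far away are large, so the number
# of points per cell is more balanced than with a uniform Cartesian grid.
pts = np.random.uniform(-50.0, 50.0, size=(1000, 3))
r_bin, t_bin = polar_bev_indices(pts)
```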

2. LiDAR Point Clouds Semantic Segmentation in Autonomous Driving Based on Asymmetrical Convolution

Due to the sparsity and disorder of point clouds, encoding the input point cloud is a crucial issue when using convolutional neural networks for semantic segmentation of 3D point clouds. According to how the point clouds are encoded, existing methods can be divided into three categories: point-based methods, voxel-based methods, and projection-based methods.

2.1. Point-Based Methods

PointNet [5][2] learns point cloud features point-wise and uses max pooling to aggregate a global feature. PointNet++ [6][3] extends PointNet and strengthens the ability to extract local information at different scales. A spatially continuous convolution is proposed in PointConv [7][4], which effectively reduces the memory consumption of the algorithm. For semantic segmentation of large-scale point cloud scenes, the point clouds are represented as interconnected superpoint graphs in SPG [8][5], and PointNet is then used to learn the features of the superpoint graph. An attention-based module is designed in RandLA-Net [9][6] to integrate local features, achieving efficient segmentation of large-scale point clouds. Segmentation performance is further improved in KPConv [10][7] with a novel spatial kernel-based point convolution. Lu et al. [11][8] suggested using distinct aggregation strategies for within-category and between-category data. Employing aggregation or enhancement techniques on local features [12][9] can effectively improve the perception of intricate details. Furthermore, to learn features effectively from extensive point clouds containing diverse target types, Fan et al. [13][10] introduced SCF-Net, which incorporates a dual-distance attention mechanism and global contextual features to enhance semantic segmentation performance. Point-based methods work directly on the raw point clouds without excessive initialization or transformation steps. However, when handling large point cloud scenes, local nearest-neighbor search is inevitably involved, which is computationally inefficient. Thus, there is still clear room for improvement in point-based methods.
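For illustration, the following simplified PyTorch sketch shows the core PointNet idea cited above: a shared per-point MLP followed by max pooling, which yields a permutation-invariant global feature. The layer widths are assumptions, and the input/feature transform networks of the original PointNet are omitted.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style encoder: shared per-point MLP + max pooling."""

    def __init__(self, in_dim=3, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )

    def forward(self, points):                    # points: (B, N, in_dim)
        per_point = self.mlp(points)              # (B, N, feat_dim)
        global_feat, _ = per_point.max(dim=1)     # (B, feat_dim), order-invariant
        return global_feat

# Example: encode two clouds of 1024 points each into 64-dim global features.
features = TinyPointNet()(torch.randn(2, 1024, 3))
```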

2.2. Voxel-Based Methods

In voxel-based methods, point clouds are regularly divided into 3D cubic voxels, and 3D convolutions are employed to extract features. SEGCloud [14][11] is one of the earlier methods for semantic segmentation based on a voxel representation. In order to use 3D convolution efficiently and expand the receptive field, 3D sparse convolution [15][12] is used in the Minkowski CNN [16][13], which reduces the computational complexity of convolution. In pursuit of higher segmentation accuracy, a neural architecture search (NAS)-based model, SPVNAS [17][14], is proposed, which trades high computational cost for accuracy. In order to fit the spatial distribution of LiDAR point clouds, a cylindrical voxel partition is proposed in Cylinder3D [18][15], which achieves high accuracy. In order to streamline computation and better preserve the details of smaller instances, an attention-focused feature fusion module and an adaptive feature selection module are proposed by Cheng et al. [19][16]. To improve the speed of voxel-based networks, a point-to-voxel knowledge distillation method is proposed in PVKD [20][17] to achieve model compression. Voxel-based methods typically achieve high segmentation accuracy. However, 3D convolution is inevitably used, resulting in significant memory occupation and high computational cost.
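To make the cost argument concrete, the following NumPy sketch shows a plain cubic voxelization step: each point is mapped to an integer voxel coordinate, and only occupied voxels are kept, as sparse-convolution methods do. The voxel size and point cloud range are illustrative assumptions, not the settings of any specific method above.

```python
import numpy as np

def voxelize(points, voxel_size=0.2, pc_range=(-50.0, -50.0, -4.0, 50.0, 50.0, 2.0)):
    """Map points (N, 3) to integer voxel coordinates on a cubic grid.

    With this range and resolution the dense grid has roughly
    500 * 500 * 30 = 7.5 million cells, which is why dense 3D convolution is
    expensive and sparse 3D convolution is preferred in practice.
    """
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    mask = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
            (points[:, 1] >= y_min) & (points[:, 1] < y_max) &
            (points[:, 2] >= z_min) & (points[:, 2] < z_max))
    inside = points[mask]
    coords = ((inside - np.array([x_min, y_min, z_min])) / voxel_size).astype(int)
    occupied = np.unique(coords, axis=0)          # sparse methods only visit these
    return coords, occupied
```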

2.3. Projection-Based Methods

The basic idea behind projection-based methods is to transform point clouds into images on which 2D convolution operations can be applied. The SqueezeSeg [21,22,23][18][19][20] series of algorithms, based on SqueezeNet [24][21], performs semantic segmentation after projecting the point clouds. RangeNet++ [25][22] implements semantic segmentation on top of the DarkNet53 [26][23] backbone and proposes a K-Nearest Neighbor (KNN) post-processing step to improve segmentation accuracy. 3D-MiniNet [27][24] builds its network on a lightweight backbone, achieving faster speed. A polar BEV representation is proposed in PolarNet [1], which uses a simplified version of PointNet to encode the points in each polar grid cell into a pseudo-image, so that the KNN post-processing operation is no longer needed. Peng et al. [28][25] introduced a multi-attention mechanism to enhance the understanding of driving scenes, focusing on dense top-view semantic segmentation from sparse LiDAR data. SalsaNext [29][26] introduced a new context module, replaced the ResNet encoder blocks with a residual convolution stack with increasing receptive fields, and incorporated a pixel-shuffle layer into the decoder. MINet [30][27] employs multiple paths with varying scales to effectively distribute computational resources across scales. FIDNet-Point [31][28] designed a fully interpolation decoding module that directly upsamples the multi-resolution feature maps using bilinear interpolation. CENet+KNN [32][29] incorporated convolutional layers with larger kernel sizes in place of MLPs and integrated multiple auxiliary segmentation heads into its architecture. Projection-based methods have obvious advantages in computational complexity and speed. Therefore, improving their segmentation accuracy is significant for practical application in autonomous driving.
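Methods in the SqueezeSeg and RangeNet++ family rely on a spherical projection of the scan into a range image before applying 2D convolutions. The following NumPy sketch shows one common form of this projection; the image resolution and the vertical field of view (3° up, −25° down, roughly matching a 64-beam sensor) are illustrative assumptions.

```python
import numpy as np

def spherical_projection(points, H=64, W=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project points (N, >=3) onto an H x W range image.

    Returns per-point (row, col) pixel indices: azimuth maps to columns and
    elevation maps to rows. Field-of-view values are illustrative.
    """
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down
    depth = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(depth, 1e-8))
    col = (0.5 * (1.0 - yaw / np.pi) * W).astype(int) % W   # azimuth -> column
    row = np.clip(((fov_up - pitch) / fov * H).astype(int), 0, H - 1)  # elevation -> row
    return row, col
```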

References

  1. Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. Polarnet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9601–9610.
  2. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
  3. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5099–5108.
  4. Wu, W.; Qi, Z.; Fuxin, L. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9621–9630.
  5. Landrieu, L.; Simonovsky, M. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4558–4567.
  6. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117.
  7. Thomas, H.; Qi, C.R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420.
  8. Lu, T.; Wang, L.; Wu, G. Cga-net: Category guided aggregation for point cloud semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11693–11702.
  9. Qiu, S.; Anwar, S.; Barnes, N. Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1757–1767.
  10. Fan, S.; Dong, Q.; Zhu, F.; Lv, Y.; Ye, P.; Wang, F.-Y. SCF-Net: Learning spatial contextual features for large-scale point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14504–14513.
  11. Tchapmi, L.; Choy, C.; Armeni, I.; Gwak, J.; Savarese, S. Segcloud: Semantic segmentation of 3d point clouds. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 537–547.
  12. Graham, B.; Engelcke, M.; Van Der Maaten, L. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9224–9232.
  13. Choy, C.; Gwak, J.; Savarese, S. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3075–3084.
  14. Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching efficient 3d architectures with sparse point-voxel convolution. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 685–702.
  15. Zhou, H.; Zhu, X.; Song, X.; Ma, Y.; Wang, Z.; Li, H.; Lin, D. Cylinder3d: An effective 3d framework for driving-scene lidar semantic segmentation. arXiv 2020, arXiv:2008.01550.
  16. Cheng, R.; Razani, R.; Taghavi, E.; Li, E.; Liu, B. 2-s3net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12547–12556.
  17. Hou, Y.; Zhu, X.; Ma, Y.; Loy, C.C.; Li, Y. Point-to-voxel knowledge distillation for lidar semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8479–8488.
  18. Wu, B.; Wan, A.; Yue, X.; Keutzer, K. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1887–1893.
  19. Wu, B.; Zhou, X.; Zhao, S.; Yue, X.; Keutzer, K. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4376–4382.
  20. Xu, C.; Wu, B.; Wang, Z.; Zhan, W.; Vajda, P.; Keutzer, K.; Tomizuka, M. Squeezesegv3: Spatially-adaptive convolution for efficient point-cloud segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 1–19.
  21. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
  22. Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. Rangenet++: Fast and accurate lidar semantic segmentation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4213–4220.
  23. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  24. Alonso, I.; Riazuelo, L.; Montesano, L.; Murillo, A.C. 3d-mininet: Learning a 2d representation from point clouds for fast and efficient 3d lidar semantic segmentation. IEEE Robot. Autom. Lett. 2020, 5, 5432–5439.
  25. Peng, K.; Fei, J.; Yang, K.; Roitberg, A.; Zhang, J.; Bieder, F.; Heidenreich, P.; Stiller, C.; Stiefelhagen, R. MASS: Multi-attentional semantic segmentation of LiDAR data for dense top-view understanding. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15824–15840.
  26. Cortinhal, T.; Tzelepis, G.; Erdal Aksoy, E. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds. In Proceedings of the Advances in Visual Computing: 15th International Symposium, ISVC 2020, San Diego, CA, USA, 5–7 October 2020; Proceedings, Part II 15. pp. 207–222.
  27. Li, S.; Chen, X.; Liu, Y.; Dai, D.; Stachniss, C.; Gall, J. Multi-scale interaction for real-time lidar data segmentation on an embedded platform. IEEE Robot. Autom. Lett. 2021, 7, 738–745.
  28. Zhao, Y.; Bai, L.; Huang, X. Fidnet: Lidar point cloud semantic segmentation with fully interpolation decoding. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 4453–4458.
  29. Cheng, H.X.; Han, X.F.; Xiao, G.Q. Cenet: Toward concise and efficient lidar semantic segmentation for autonomous driving. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6.