1. Introduction
Scene understanding is one of the most critical tasks in autonomous driving. A detailed and accurate understanding of the road scene has become a core requirement of any outdoor autonomous robotic system. Although semantic segmentation of 2D images is crucial to scene understanding, visual sensors have limitations: they acquire information poorly under insufficient light, lack depth information, and have a limited field of view. In contrast, LiDAR obtains accurate depth information with higher density and a wider field of view regardless of lighting conditions, which makes it a more reliable source of information for environmental perception. Therefore, semantic segmentation of LiDAR point clouds has become a focal point of scene understanding in autonomous driving.
According to how the point cloud is encoded, current LiDAR point cloud semantic segmentation methods can be divided into three categories: point-based methods, voxel-based methods, and projection-based methods. In terms of speed, point-based and voxel-based methods consume a large amount of computation and memory, which makes it difficult to run them in real time on an on-board computing platform. In autonomous driving, real-time performance should take priority over segmentation accuracy. Projection-based methods, in contrast, are lightweight and fast, so real-time operation can be achieved during deployment. In terms of segmentation accuracy, projection-based methods have shown some success. However, since the point cloud information is not fully utilized during feature extraction, there is still room for improving their segmentation accuracy.
Improving segmentation accuracy while retaining real-time performance is therefore highly relevant in autonomous driving scenarios. PolarNet [1], which encodes point clouds through a polar bird’s-eye-view (BEV) representation, is the baseline network of ACPNet. BEV is a perspective that views an object or scene from above, like a bird looking down at the ground from the air. It is also known as the God’s-eye view, a perspective or coordinate system used to describe the perceived world. Polar BEV has two advantages. First, the polar BEV partition assigns points to their respective grid cells more evenly. Second, since this partitioning brings about a more balanced distribution of points, the theoretical upper limit of prediction accuracy for the semantic classification of point clouds is raised, thereby improving the performance of downstream semantic segmentation models [1]. In ACPNet, the encoded point cloud features are fed into an Asymmetric Convolution Backbone Network (ACBN) for feature extraction. Then, the features extracted by the backbone are input to the Contextual Feature Enhancement Module (CFEM) for further mining of contextual features. Moreover, global scaling and global translation are used as Enhanced Data Augmentation (EDA) while ACPNet is being trained.
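The polar BEV partition described above can be illustrated with a minimal sketch that maps each point to a (ring, sector) cell from its radius and azimuth. This is not PolarNet's or ACPNet's implementation: the function names `polar_bev_cell` and `group_points`, and the grid parameters `r_max`, `n_rings`, and `n_sectors`, are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def polar_bev_cell(x, y, r_max=50.0, n_rings=4, n_sectors=8):
    """Map a point's (x, y) to a (ring, sector) index on a polar BEV grid.

    r_max, n_rings and n_sectors are illustrative assumptions, not values
    taken from the paper. Returns None for points at or beyond r_max.
    """
    r = math.hypot(x, y)               # radial distance in the ground plane
    if r >= r_max:
        return None
    theta = math.atan2(y, x)           # azimuth in [-pi, pi]
    ring = int(r / r_max * n_rings)
    # Shift the azimuth to [0, 2*pi] and wrap the upper edge back to sector 0.
    sector = int((theta + math.pi) / (2 * math.pi) * n_sectors) % n_sectors
    return ring, sector


def group_points(points, **grid):
    """Collect (x, y, z) points per occupied polar cell; height is kept as a feature."""
    cells = defaultdict(list)
    for x, y, z in points:
        cell = polar_bev_cell(x, y, **grid)
        if cell is not None:
            cells[cell].append((x, y, z))
    return dict(cells)


if __name__ == "__main__":
    cloud = [(2.0, 1.0, 0.0), (3.0, 1.0, 0.5), (-30.0, -1.0, 0.2), (0.0, 60.0, 0.0)]
    for cell, pts in sorted(group_points(cloud).items()):
        print(cell, len(pts))
```

Because cell area grows with radius, distant cells cover more ground than nearby ones, which is one intuition for why a polar partition can balance the number of points per cell on LiDAR scans whose density falls off with range.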
2. LiDAR Point Clouds Semantic Segmentation in Autonomous Driving Based on Asymmetrical Convolution
Due to the sparsity and disorder of point clouds, encoding the input point cloud is a crucial issue when using convolutional neural networks for semantic segmentation of 3D point clouds. According to how the point cloud is encoded, existing methods can be divided into three categories: point-based methods, voxel-based methods, and projection-based methods.
2.1. Point-Based Methods
PointNet [2] is a point-wise learning method for point cloud features that uses max pooling to integrate global features. PointNet++ [3] extends PointNet and strengthens its ability to extract local information at different scales. PointConv [4] proposes a spatially continuous convolution that effectively reduces the memory consumption of the algorithm. For semantic segmentation of large-scale point cloud scenes, SPG [5] represents the point cloud as interconnected superpoint graphs and then uses PointNet to learn the features of the superpoint graph. RandLA-Net [6] designs an attention-based module to integrate local features, achieving efficient segmentation of large-scale point clouds. KPConv [7] further improves segmentation performance with a novel spatial kernel-based point convolution. Lu et al. [8] suggested the use of distinct aggregation strategies for within-category and between-category data. Employing aggregation or enhancement techniques on local features [9] can effectively enhance the perception of intricate details. Furthermore, to effectively learn features from extensive point clouds encompassing diverse target types, Fan et al. [10] introduced SCF-Net, which incorporates a dual-distance attention mechanism and global contextual features to enhance semantic segmentation performance.
Point-based methods work directly on the raw point clouds without extensive preprocessing or transformation steps. However, handling large point cloud scenes inevitably involves local nearest-neighbor search, which is computationally inefficient. Thus, there is still clear room for improvement in point-based methods.
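The point-wise learning and max pooling idea attributed to PointNet above can be sketched in a few lines. The following is a toy, single-layer illustration with hand-picked weights, not the PointNet architecture: the function names and the weight values are assumptions made for the example. It shows why max pooling over per-point features gives a global feature that does not depend on the order of the points.

```python
def shared_linear_relu(point, weights, bias):
    """Apply one linear layer followed by ReLU to a single point.

    The same weights are reused for every point, which is what
    'point-wise' (shared) feature learning means here.
    """
    return [
        max(0.0, sum(w * p for w, p in zip(row, point)) + b)
        for row, b in zip(weights, bias)
    ]


def global_feature(points, weights, bias):
    """Compute per-point features, then pool each channel with max."""
    feats = [shared_linear_relu(p, weights, bias) for p in points]
    return [max(channel) for channel in zip(*feats)]


if __name__ == "__main__":
    # Illustrative 3 -> 4 layer; these weights are arbitrary, not learned.
    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0], [1.0, 1.0, 1.0]]
    b = [0.0, 0.0, 0.0, -1.0]
    cloud = [(1.0, 2.0, 3.0), (4.0, -1.0, 0.0), (0.0, 0.0, -5.0)]
    print(global_feature(cloud, W, b))
    print(global_feature(list(reversed(cloud)), W, b))  # same result, different order
```

Since the maximum of a set is independent of ordering, the pooled feature is identical for any permutation of the input points, which suits the unordered nature of point clouds.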
2.2. Voxel-Based Methods
Voxel-based methods divide point clouds regularly into 3D cubic voxels and employ 3D convolution to extract features. SEGCloud [11] is one of the earlier methods for semantic segmentation based on voxel representation. To utilize 3D convolution efficiently and expand the receptive field, Minkowski CNN [13] uses 3D sparse convolution [12], which reduces the computational complexity of convolution. In pursuit of higher segmentation accuracy, SPVNAS [14], a model based on neural architecture search (NAS), accepts a high computational cost in exchange for accuracy. To fit the spatial distribution of LiDAR point clouds, Cylinder3D [15] proposes a cylindrical voxel division method, with which it obtains high accuracy. To streamline computation and better capture the details of smaller instances, Cheng et al. [16] proposed an attention-focused feature fusion module and an adaptive feature selection module. To improve the speed of voxel-based networks, PVKD [17] proposes point-to-voxel knowledge distillation to achieve model compression.
Voxel-based methods typically achieve high segmentation accuracy. However, they inevitably rely on 3D convolution, resulting in significant memory occupation and high computational cost.
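As a minimal sketch of the regular cubic division mentioned above, the snippet below assigns each point to the integer index of the voxel that contains it and stores only the occupied voxels. The function name `voxelize` and the default `voxel_size` are assumptions for illustration; none of the cited methods is reproduced here.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size=0.5):
    """Group points by the integer index of the cubic voxel containing them.

    Only occupied voxels are stored, so memory scales with the number of
    non-empty voxels rather than with the full dense grid.
    """
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        voxels[key].append((x, y, z))
    return dict(voxels)


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (1.2, 0.1, 0.1), (-0.1, 0.0, 0.0)]
    for key, pts in sorted(voxelize(cloud).items()):
        print(key, len(pts))
```

In a LiDAR scan most voxels of a dense grid are empty, which is the motivation for sparse representations: a dense 3D convolution would still visit every cell, whereas a sparse one only needs the occupied keys.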
2.3. Projection-Based Methods
The basic concept behind projection-based methods is to transform point clouds into images that can undergo 2D convolution operations. The SqueezeSeg [18][19][20] series of algorithms, based on SqueezeNet [21], performs semantic segmentation after projecting point clouds. RangeNet++ [22] implements semantic segmentation with a DarkNet53 [23] backbone and proposes a K-Nearest Neighbor (KNN) post-processing algorithm to improve segmentation accuracy. 3D-MiniNet [24] builds its network on a lightweight backbone, achieving faster speed. PolarNet [1] proposes a polar BEV representation, which uses a simplified version of PointNet to encode the points in each polar grid cell into a pseudo image, so that the KNN post-processing operation is no longer needed. Peng et al. [25] introduced a multi-attention mechanism to enhance the understanding of driving scenes, focusing on dense top-view semantic segmentation from sparse LiDAR data. SalsaNext [26] introduced a new context module and replaced the ResNet encoder blocks with a residual convolution stack that has increasing receptive fields. Additionally, it incorporated a pixel-shuffle layer into the decoder. MINet [27] employed multiple paths with varying scales to distribute computational resources effectively across scales. FIDNet-Point [28] designed a fully interpolation decoding module that directly upsamples the multi-resolution feature maps using bilinear interpolation. CENet+KNN [29] incorporated convolutional layers with larger kernel sizes in place of MLP and integrated multiple auxiliary segmentation heads into its architecture.
Projection-based methods have obvious advantages in computational complexity and speed. Therefore, improving their segmentation accuracy is important for practical application in autonomous driving.
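To make the projection step concrete, the following sketch maps a 3D point to a pixel of a range image using a spherical projection, a formulation commonly used by range-image methods. It is not taken from any of the cited implementations: the function name `spherical_pixel`, the image size, and the vertical field-of-view limits are illustrative assumptions that would need to match the actual sensor.

```python
import math


def spherical_pixel(x, y, z, width=2048, height=64,
                    fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project a 3D point to (row, col) in a range image and return its range.

    Image size and field-of-view limits are assumed values for illustration.
    The point must not be at the origin (range 0).
    """
    r = math.sqrt(x * x + y * y + z * z)
    yaw = math.atan2(y, x)             # horizontal angle
    pitch = math.asin(z / r)           # vertical angle
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down            # total vertical field of view

    u = 0.5 * (1.0 - yaw / math.pi) * width            # column coordinate
    v = (1.0 - (pitch - fov_down) / fov) * height      # row coordinate (top = fov_up)

    # Clamp to the image so points outside the field of view land on the border.
    col = min(width - 1, max(0, int(u)))
    row = min(height - 1, max(0, int(v)))
    return row, col, r


if __name__ == "__main__":
    print(spherical_pixel(10.0, 0.0, 0.0))
    print(spherical_pixel(10.0, 0.0, 10.0))  # above the field of view, clamped to row 0
```

Several points can fall into the same pixel after this mapping, which is one way to see why information is lost during projection and why some methods add a post-processing step to recover per-point labels.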