Autonomous vehicles driving in off-road environments need a sensitive environmental perception capability. However, semantic segmentation in complex scenes remains challenging, and most current methods for off-road environments are limited to a single type of scene and achieve low accuracy.
1. Introduction
Autonomous vehicles are an innovative mode of transportation built on advanced sensors, computer vision, and artificial intelligence. Because they do not require human intervention, they can perform driving tasks in various environments more safely and efficiently. In addition to operating on regular, structured roads, autonomous vehicles sometimes need to work in scenes such as battlefields or post-disaster areas. There the road is usually unstructured, and the setting is described as an off-road environment. A perception algorithm designed for off-road driving is therefore crucial, as it helps the unmanned platform achieve full-scene perception.
Environmental perception is an important part of the unmanned platform. Cameras and LiDAR (light detection and ranging) are the two main sensors used to obtain environmental information. Camera-based semantic segmentation algorithms mostly rely on texture or color features of roads, such as boundaries [1], lane lines [2], or vanishing points [3]. Some depth camera-based methods [2] use depth information as an auxiliary cue for semantic segmentation in off-road environments. Although these methods achieve good segmentation results, they are not robust to illumination changes. Camera-based perception fundamentally depends on color and texture, which are strongly affected by light, so it tends to fail at night. Because off-road vehicles must also operate at night, the camera is not suitable as the main sensor. LiDAR, an active sensor, is also widely used in environmental perception algorithms. Compared to other onboard sensors, LiDAR can provide richer environmental information [4]. Yu et al. [5] used LiDAR to perceive street light poles, and Liu et al. [6] used LiDAR to greatly improve the detection distance of vehicles, pedestrians, and cyclists. Unlike a passive receiving device such as a camera, LiDAR remains usable on rainy or foggy days and where the light intensity changes drastically, because its sensitivity to such environmental conditions is extremely low. LiDAR has therefore been widely used for vehicle-side environmental perception, and this robustness makes it well suited to off-road environments.
2. Structured Road Scene 3D Semantic Segmentation
Semantic segmentation of a point cloud assigns each input point to a class so that different types of objects can be distinguished. For the semantic segmentation of 3D point clouds in structured outdoor scenes, the input point cloud can be encoded in three ways: voxel-based, point-based, and projection-based.
In projection-based algorithms, the 3D point cloud is projected into 2D space, and a semantic segmentation network then processes the resulting “pseudo image”. The segmentation result is back-projected to the coordinate space of the 3D point cloud by interpolation, which yields the semantic segmentation of the original point cloud. Among these methods, SqueezeSeg [7], SqueezeSegv2 [8], SqueezeSegv3 [9], SalsaNext [10], etc., use spherical projection, while PolarNet [11] and VD3D-FCN [12] use bird’s-eye view projection for feature extraction.
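The spherical projection step described above can be illustrated with a minimal NumPy sketch. It is not the implementation used by any of the cited networks: the function name `spherical_projection`, the image size, and the vertical field-of-view values are assumptions chosen for illustration, and the back-projection shown in the usage note is a plain nearest-pixel lookup rather than the interpolation mentioned in the text.

```python
import numpy as np

def spherical_projection(points, height=64, width=1024,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) point cloud onto a (height, width) range image.

    Returns the range image plus the pixel row/column of every point, so
    that per-pixel predictions can later be looked up per point.
    All default parameters are illustrative assumptions.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.maximum(np.linalg.norm(points, axis=1), 1e-8)  # avoid /0

    yaw = np.arctan2(y, x)        # horizontal angle in [-pi, pi]
    pitch = np.arcsin(z / depth)  # vertical angle

    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down

    # Normalise the angles to image coordinates.
    u = 0.5 * (1.0 - yaw / np.pi) * width            # column
    v = (1.0 - (pitch - fov_down) / fov) * height    # row

    cols = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    rows = np.clip(np.floor(v), 0, height - 1).astype(np.int64)

    # Empty pixels are marked with -1. Points are written far-to-near so
    # that the nearest point wins when several fall into the same pixel.
    range_image = np.full((height, width), -1.0, dtype=np.float32)
    order = np.argsort(depth)[::-1]
    range_image[rows[order], cols[order]] = depth[order]
    return range_image, rows, cols
```

Given a per-pixel label image `label_image` predicted by a 2D network (hypothetical here), labels for the original points could be recovered with `label_image[rows, cols]`. The collisions handled in the last step are one source of the precision loss that projection introduces.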
Voxel-based semantic segmentation algorithms re-encode the 3D space with voxels. VoxelNet [13], originally proposed for 3D object detection, is a typical example of voxel encoding for 3D point clouds. It divides the 3D space into cells of equal size, called voxels. A VFE (voxel feature encoding) layer converts each voxel into a unified feature vector, and feature extraction is then performed on this representation.
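A minimal sketch of the voxel partition may make this concrete. It only groups points into equally sized cells and averages them per voxel; the per-voxel mean is an assumed stand-in for the learned VFE layer, and the function name `voxelize` and the default voxel size are hypothetical.

```python
import numpy as np

def voxelize(points, voxel_size=0.5):
    """Group (N, 3) points into equally sized voxels.

    Returns the integer grid coordinates of the occupied voxels, the voxel
    index of each point, and a per-voxel centroid. The centroid is only a
    placeholder feature, not the learned VFE encoding.
    """
    origin = points.min(axis=0)
    grid_idx = np.floor((points - origin) / voxel_size).astype(np.int64)

    # Each unique grid coordinate is one occupied voxel.
    voxels, inverse = np.unique(grid_idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flat index regardless of NumPy version

    counts = np.bincount(inverse, minlength=len(voxels))
    sums = np.zeros((len(voxels), 3))
    np.add.at(sums, inverse, points)  # accumulate points per voxel
    centroids = sums / counts[:, None]
    return voxels, inverse, centroids
```

Only occupied voxels are stored here. A dense grid over the whole scene would grow with the cube of the resolution, which is one reason 3D convolution becomes costly on large scenes.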
Point-based semantic segmentation extracts features directly from the original unordered point cloud sequence. Typically, a multi-layer perceptron performs semantic encoding and spatial position calculation on the points themselves. PointNet [14], PointNet++ [15], RandLA-Net [16], and KPConv [17] are examples of point-based algorithms.
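The idea of applying a shared multi-layer perceptron to an unordered point set can be sketched as follows. This is a heavily simplified, PointNet-style data flow with random, untrained weights, so the predicted labels are meaningless. The function names, layer sizes, and number of classes are assumptions for illustration only.

```python
import numpy as np

def shared_mlp(x, weights, biases):
    """Apply the same dense layers (with ReLU) to every point independently."""
    for w, b in zip(weights, biases):
        x = np.maximum(x @ w + b, 0.0)
    return x

def pointnet_like_segmentation(points, num_classes=4, hidden=16, seed=0):
    """Return per-point class indices and the global feature vector.

    Weights are random and untrained: this shows the data flow only.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.1, size=(3, hidden))
    b1 = np.zeros(hidden)

    # Per-point features: no point sees its neighbours at this stage.
    local = shared_mlp(points, [w1], [b1])            # (N, hidden)

    # Max pooling over points is independent of the point order.
    global_feat = local.max(axis=0)                   # (hidden,)

    # Concatenate local and global features, then classify each point.
    tiled = np.broadcast_to(global_feat, local.shape)
    fused = np.concatenate([local, tiled], axis=1)    # (N, 2 * hidden)
    w2 = rng.normal(scale=0.1, size=(2 * hidden, num_classes))
    logits = fused @ w2
    return logits.argmax(axis=1), global_feat
```

Because each point is encoded on its own and context enters only through the pooled global vector, a design like this has weak locality, which is consistent with the difficulty of separating small objects noted for point-based methods.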
Each of these three feature extraction methods has advantages and disadvantages. Projection-based methods are usually faster than methods that extract features directly in three-dimensional space, but the precision loss caused by the forward and reverse projections cannot be ignored. Voxel-based methods are also widely used: after voxel encoding, target recognition or segmentation can be performed effectively with either deep learning or traditional clustering algorithms. However, 3D convolution becomes less efficient as the data size increases. Point-based algorithms are computationally efficient but capture local structure poorly and lose features easily, which makes it difficult to separate small objects from large ones. The PVCNN [18] algorithm fused voxel and point representations for feature extraction, which greatly improved accuracy and efficiency. Later, RPVNet [19] fused all three feature extraction methods and obtained excellent semantic segmentation results.
3. Off-Road Point Cloud Semantic Segmentation
The above methods for structured roads provide many valuable references for off-road segmentation. The biggest difference between off-road environments and structured road scenes is that the drivable area in an off-road environment has no lane lines, no obvious road boundaries, and not even a regular shape. Because of this, 3D point cloud semantic segmentation algorithms developed for structured road scenes are difficult to apply directly. At present, semantic segmentation algorithms for off-road scenes fall into three main categories: feature engineering based on point clouds, weakly supervised learning, and transfer learning.
Feature engineering algorithms based on point clouds perform road segmentation by extracting the geometric features of roads in off-road scenes. Liu et al. [20] focus on identifying negative obstacles on the road. First, three LiDARs are installed directly above and on both sides of the vehicle. Then a mathematical model of the LiDAR scan line is established, and an adaptive filtering algorithm based on this model is proposed to identify negative obstacles. Finally, the results from the three LiDARs are fused to detect the drivable area and negative obstacles in the off-road scene. Liang et al. [21] instead project the LiDAR point cloud into a two-dimensional image plane and generate a histogram from it. Water, positive obstacles, and drivable areas are detected from the histogram, and the result is back-projected into the LiDAR coordinate system. Although feature engineering methods achieve good results in specific off-road scenes, they are significantly constrained: they can classify only a few specific elements in the scene and do not adapt to varied off-road scenes.
In the weakly supervised category, Gao et al. [22] projected the original point cloud onto the image plane as a bird’s-eye view and used the GPS information of the moving vehicle to obtain the driving trajectory. On the projected image, a region growing algorithm is run from the driving trajectory to automatically generate labels for the drivable region. These labels are combined with a small amount of manually labeled data to form the training dataset, which yields good segmentation results while greatly reducing the manual annotation workload. In the transfer learning category, Holder et al. [23] pre-train an existing CNN framework on a dataset of urban structured road scenes and then use a small dataset of off-road scenes to re-determine the segmentation classes. This approach achieves good results while effectively reducing the amount of off-road LiDAR point cloud data that must be labeled.
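The automatic labeling step attributed to Gao et al. can be illustrated with a generic region growing sketch on a bird’s-eye view height grid. This is not the authors’ implementation: the source does not state the growth criterion, so the height-difference threshold `max_step`, the 4-neighbour connectivity, and the function name `grow_drivable_region` are all assumptions made for illustration.

```python
from collections import deque

import numpy as np

def grow_drivable_region(height_map, seeds, max_step=0.15):
    """Grow a drivable-region mask from trajectory cells on a BEV grid.

    height_map: (H, W) array of per-cell ground height; NaN marks empty cells.
    seeds: iterable of (row, col) cells the vehicle actually drove over.
    max_step: assumed maximum height difference between neighbouring cells.
    """
    n_rows, n_cols = height_map.shape
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    queue = deque()

    for r, c in seeds:
        if 0 <= r < n_rows and 0 <= c < n_cols and not mask[r, c]:
            mask[r, c] = True
            queue.append((r, c))

    # Breadth-first growth over 4-connected neighbours.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n_rows and 0 <= nc < n_cols and not mask[nr, nc]:
                # Comparisons with NaN are False, so empty cells stop growth.
                if abs(height_map[nr, nc] - height_map[r, c]) <= max_step:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask
```

The resulting mask plays the role of an automatically generated drivable-area label. In a pipeline like the one described, such labels would then be combined with a small manually labeled set for training.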
To sum up, the main problem in designing semantic segmentation algorithms for off-road scenes is the lack of datasets. Existing algorithms mainly use geometric features or combine specific algorithms with a small amount of data to perform semantic segmentation, and low accuracy remains a major problem. Research should therefore focus on two directions. First, large amounts of high-quality data are needed; computer simulation can be used to build typical off-road scenes and obtain large, accurately labeled datasets. Second, more targeted algorithms should be designed according to the characteristics of off-road scenes. Both directions have important engineering value and academic significance for improving the semantic segmentation accuracy of off-road scenes.
This entry is adapted from the peer-reviewed paper 10.3390/wevj14100291