2. Lane Marking Detection
As aforementioned, traditional lane marking detection approaches generally rely on sophisticated model design and hand-crafted features, involving color conversion
[4], combination of Kalman and particle filter
[18], bar filter
[19] and Hough transform
[5]. These approaches directly output lane segments, which are further post-processed to remove false positives and grouped to form the lane markings. Aly
[20] proposed a robust real-time lane marking detection method, which first generated a top view image by projection transform and then extracted lane markings using a bar filter and a simple Hough transform. Assidiq et al.
[21] detected edges with the Canny operator and extracted line features through the Hough transform. The lane marking was obtained by line fitting to selected pixels. However, limited by the poor feature representation, traditional methods show inrobustness in complex scenarios, such as with broken lane markings or occlusion by vehicles and pedestrians.
In recent years, the deep learning technique has significantly boosted the lane marking detection performance. According to the modeling strategy, such approaches can be classified into four categories: segmentation-based, anchor-based, row-wise detection, and parametric prediction methods. The segmentation-based methods commonly adopt the semantic segmentation or instance segmentation to make pixel-wise predictions
[6,9,10][6][9][10]. Supervised by a sufficient amount of labeled data, these approaches show advantages in detecting various kinds of lane markings. The aerial LaneNet
[22] proposed a fully convolutional neural network in a symmetrical structure, which is enhanced by wavelet transform for lane marking segmentation in aerial imagery. Guan et al.
[23] incorporated the attention mechanism into FPN networks to extract better road marking segmentation results from high resolution UAV images. The anchor-based methods leverage the anchor concept from traditional object detection, but differ from them by taking into account the shape characteristics of lane markings. For instance, the PointLaneNet
[7] and CurveLane-NAS
[24] define anchors with vertical lines, while the Line-CNN
[11] and LaneATT
[12] adopt the Line Proposal Unit, which resembles the Region Proposal network (RPN) of the Faster-RCNN
[25]. The row-wise detection approaches make full use of the prior shape of lane markings as well as their spatial distribution characteristics. They divide the image into grids and make row-wise predictions to locate the lane markings
[26,27,28][26][27][28]. In contrast, the parametric prediction methods define lane markings (especially lane lines) as curve functions with a set of parameters, such as polynomials
[13[13][14],
14], and Bézier curves
[15]. Their interpretations are significantly different from the above-mentioned methods and the corresponding curve parameters are difficult to learn. In addition, to solve the problem of difficult scenes for lane marking detection such as occlusion and low-visibility, Wang et al.
[29] proposed a dynamic data augmentation framework based on imitating real scenes.
3. Lane Detection
The task of lane detection is also known as the drivable area detection, which is mainly classified as a segmentation task at present. As a result of the great successes of the deep learning, many methods based on semantic segmentation and instance segmentation can be transferred to the drivable area detection. The FCN
[30] is the first work to introduce the fully convolutional network to semantic segmentation, which makes CNN-based methods widely applicable for lane detection. The UNet
[31] further constructs an encoder–decoder framework to extract lane semantic information from high-dimensional features. The DeepLabV3
[32] combines the atrous convolutions
[33] with different artous rates to fuse the feature pyramid, namely ASPP, obtaining different receptive fields on feature maps. The PSPNet
[34] proposes the pyramid pooling module for feature extraction of various scales, which enhances the accuracy of the model. It is also worth noting that both DeepLabV3 and PSPNet leverage the fusion of multi-scale feature information to improve the segmentation performance. He et al.
[35] embedded the Swin transformer into the classical network (UNet) to improve the semantic segmentation performance for remote sensing images. Xie et al.
[36] presented a segmentation method for RGB-D data and adopted the motion detection to improve the inference accuracy. Meyer et al.
[37] expanded the Cityscapes dataset
[38] by lane-level annotations and presented a novel lane detection pipeline, which used the stereo system to convert the front-view segmentation results into a form of 3D point cloud and projected it to the top-view. Sun et al.
[39] proposed to leverage crowd-sourced GPS data to extract roads from an aerial image, which achieved improved road segmentation compared to previous works. Fontanelli et al.
[40] performed lane detection in the front-view image and projected it to the top-view for the construction of the path, which is used to plan the future motion of the robot.
4. Multi-Task Approaches
Although previous studies have achieved excellent performance in a single detection or segmentation task, the multi-task architecture to process perception information is more friendly to practical applications. The goal of multi-task approaches is to establish a trade-off between the detection performance and the computational complexity by utilizing the shared feature information and model structure.
The MultiNet
[41] first introduces a multi-task architecture into the autonomous driving perception task. The architecture adopts a shared backbone and three decoders to perform tasks of road segmentation, vehicle detection, and scene classification simultaneously. The DLT-Net
[42] inherits the encoder–decoder architecture with a shared backbone and multi-task decoders. It transmits the information from the drivable area decoder, namely the context tensor, to both the lane marking decoder and the traffic object decoder, thus sharing the decoder information to a certain extent. The RBNet
[43] proposes a multi-task neural network model for unified detection of road and road boundary, which combined the input image, road and road boundary as three nodes into a Bayesian network. Zhang et al.
[44] considered the geometric constraint between the road and its boundaries and constructed interlinked sub-networks for overall performance improvement of both detection tasks. The RoadNet
[45] develops a multi-task convolutional neural network to simultaneously make predictions of road boundaries, surfaces, and centerline based on the high-resolution images from remote sensing. The HYDRO-3D
[46] incorporates object detection features with historical object tracking information to improve the performance of both tasks, which achieves robust object detection. Xia et al.
[47] proposes a platform for automated driving system data acquisition and analysis, which presents a holistic pipeline for data processing based on connected automated vehicles. However, the exploration on the interaction between lane and lane marking information is still insufficient in the above-mentioned studies.