The Detection of Lanes and Lane Markings

Please note this is a comparison between Version 1 by Xianwang Yu and Version 2 by Wendy Huang.

Vision-based identification of lane areas and lane markings on the road is an indispensable function for intelligent driving vehicles, especially for localization, mapping and planning tasks. However, because traffic scenes are increasingly complex and often involve occlusion and discontinuous markings, detecting lanes and lane markings in an image captured by a monocular camera remains challenging. Lanes and lane markings are strongly correlated in position and are constrained by the spatial geometry prior of the driving scene. Most existing studies address only a single task, i.e., either lane marking detection or lane detection, and neither consider this inherent connection nor model the relationship between the two elements to improve detection performance on both tasks.

lane marking detection
lanes
deep learning
multi-task
autonomous driving

1. Introduction

Lanes and lane markings are essential road information for intelligent driving vehicles. Lane marking detection aims to accurately locate road elements such as lane lines, crosswalks, and stop zones, while lane detection focuses on segmenting the lane-level areas where vehicles can drive. Because they are low in cost and capture rich scene information, optical sensors such as on-board cameras are widely adopted for road information perception. By applying lane and lane marking detection approaches, visual features of road symbols, arrows, lane markings, pedestrian crosswalks, vehicle drivable areas, etc., are extracted from the image. These features are indispensable both for high-level autonomous driving and for general advanced driver assistance systems (ADAS). They can serve as elements in high-definition map construction, or be converted into the information required by the planning and control system to assist the driving behavior of vehicles, especially in applications such as adaptive cruise control (ACC), driving route navigation, and lane keeping assistance (LKA), thus ensuring driving safety and reliability [1,2,3].

Generally, approaches to detecting lanes and lane markings fall into two categories: the traditional paradigm [4,5] and the deep learning paradigm [6,7,8]. Traditional methods rely on hand-crafted features and carefully designed rules that exploit color-space or shape-structure information to detect lanes and lane markings. Because of their limited feature representation capability, these methods scale poorly to varied scenes. In recent years, deep learning approaches in computer vision have achieved remarkable progress, especially in object detection and semantic segmentation tasks. Since lanes and lane lines are typically long and thin, and sometimes irregular in shape, the difficulty lies in learning effective representations of their complex structures.

In current studies, the detection of lanes and the detection of lane markings are treated as two separate tasks. Lane detection is typically formulated as a pixel-wise semantic segmentation problem, while lane markings can be predicted with various formulations such as instance segmentation [9,10], point regression [11,12], curve parameter estimation [13,14,15], etc. Although both tasks have seen steady progress in recent years, especially on public benchmarks [9,16,17], one neglected fact is that the information carried by lanes and by lane markings in a road scene is complementary. For instance, on structured roads, the associated lane lines can be used to identify the lane boundaries, while in scenarios where lane lines or crosswalks are partially missing or broken (e.g., due to occlusion), they can still be inferred from the width of the lane. Thus, the detection of lanes and the detection of lane markings are inherently correlated through their spatial connectivity. In real driving scenarios, a single task on its own is not robust and is easily disturbed when visual markings disappear, e.g., due to occlusions. Modeling the spatial connection between lanes and lane markings can improve detection robustness, but this has not been studied in existing methods.
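
The lane-width argument above can be made concrete with a small sketch. It is only an illustration under strong assumptions: the two lane lines are given as per-row x positions in a top-view image (so the lane width is roughly constant across rows), and occluded positions are marked with None. The function name and data layout are hypothetical and do not come from any of the cited methods.

```python
def fill_missing_boundaries(left, right):
    """Fill occluded lane-line positions using the mean observed lane width.

    left, right: per-row x positions of the two lane lines in a top-view
    image, with None where the line is not visible (assumed layout).
    """
    # Estimate the lane width from rows where both lines are visible.
    widths = [r - l for l, r in zip(left, right) if l is not None and r is not None]
    if not widths:
        return list(left), list(right)  # no row shows both lines, nothing to infer
    width = sum(widths) / len(widths)

    filled_left, filled_right = [], []
    for l, r in zip(left, right):
        if l is None and r is not None:
            l = r - width  # infer the left line from the right one
        elif r is None and l is not None:
            r = l + width  # infer the right line from the left one
        filled_left.append(l)
        filled_right.append(r)
    return filled_left, filled_right


left = [100.0, None, 104.0, 106.0]
right = [300.0, 302.0, None, 306.0]
filled_left, filled_right = fill_missing_boundaries(left, right)
```

Rows where both lines are missing stay as None, since a width alone cannot place the lane.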

2. Lane Marking Detection

As mentioned above, traditional lane marking detection approaches generally rely on sophisticated model design and hand-crafted features, involving color conversion [4], a combination of Kalman and particle filters [18], bar filters [19], and the Hough transform [5]. These approaches directly output lane segments, which are then post-processed to remove false positives and grouped to form the lane markings. Aly [20] proposed a robust real-time lane marking detection method, which first generated a top-view image by a projective transform and then extracted lane markings using a bar filter and a simple Hough transform. Assidiq et al. [21] detected edges with the Canny operator and extracted line features through the Hough transform; the lane marking was then obtained by fitting a line to selected pixels. However, limited by their poor feature representation, traditional methods lack robustness in complex scenarios, such as those with broken lane markings or occlusion by vehicles and pedestrians.

In recent years, deep learning has significantly boosted lane marking detection performance. According to the modeling strategy, such approaches can be classified into four categories: segmentation-based, anchor-based, row-wise detection, and parametric prediction methods. The segmentation-based methods commonly adopt semantic segmentation or instance segmentation to make pixel-wise predictions [6,9,10]. Supervised by a sufficient amount of labeled data, these approaches show advantages in detecting various kinds of lane markings. The Aerial LaneNet [22] proposed a fully convolutional neural network with a symmetrical structure, enhanced by the wavelet transform, for lane marking segmentation in aerial imagery. Guan et al. [23] incorporated the attention mechanism into FPN networks to obtain better road marking segmentation results from high-resolution UAV images.
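
As a rough illustration of the Hough-transform step used by the traditional pipelines above, the following sketch votes in (rho, theta) space for a set of edge points. It assumes the edge points have already been extracted (for example by an edge detector) and uses a plain dictionary accumulator; the function name and parameters are illustrative and are not taken from [5], [20], or [21].

```python
import math


def hough_lines(edge_points, n_theta=180, top_k=1):
    """Return the top_k (rho, theta, votes) lines for the given edge points.

    A line is parameterized as rho = x * cos(theta) + y * sin(theta),
    with theta sampled uniformly in [0, pi) and rho rounded to whole pixels.
    """
    thetas = [math.pi * t / n_theta for t in range(n_theta)]
    votes = {}
    for x, y in edge_points:
        for t_idx, theta in enumerate(thetas):
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            votes[(rho, t_idx)] = votes.get((rho, t_idx), 0) + 1
    # Strongest accumulator cells first; ties are broken by the cell key.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rho, thetas[t_idx], count) for (rho, t_idx), count in ranked[:top_k]]


# Synthetic edge points on the vertical line x = 5.
points = [(5, y) for y in range(20)]
best_rho, best_theta, best_votes = hough_lines(points)[0]
```

Because neighbouring accumulator cells can collect the same number of votes, practical pipelines usually apply non-maximum suppression and the post-processing described above before grouping segments into lane markings.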
The anchor-based methods borrow the anchor concept from generic object detection but adapt it to the shape characteristics of lane markings. For instance, the PointLaneNet [7] and CurveLane-NAS [24] define anchors with vertical lines, while the Line-CNN [11] and LaneATT [12] adopt the Line Proposal Unit, which resembles the Region Proposal Network (RPN) of the Faster R-CNN [25]. The row-wise detection approaches make full use of the prior shape of lane markings as well as their spatial distribution characteristics. They divide the image into grids and make row-wise predictions to locate the lane markings [26,27,28]. In contrast, the parametric prediction methods define lane markings (especially lane lines) as curve functions with a set of parameters, such as polynomials [13,14] and Bézier curves [15]. This formulation differs significantly from the methods above, and the corresponding curve parameters are difficult to learn. In addition, to handle difficult scenes for lane marking detection, such as occlusion and low visibility, Wang et al. [29] proposed a dynamic data augmentation framework based on imitating real scenes.
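
The row-wise formulation can be sketched as a decoding step: for every row, pick the grid cell with the highest score and convert it to an x coordinate. The sketch below assumes that the network output is a list of per-row score vectors whose last entry is an extra "no marking in this row" class; this layout, like the function name, is an assumption made for illustration and is not the exact output format of [26], [27], or [28].

```python
def decode_row_wise(scores, image_width):
    """Convert per-row grid scores into x positions (or None if absent).

    scores[r][c] is the score of grid cell c in row r; the last column of
    each row is assumed to mean "no marking in this row".
    """
    xs = []
    for row in scores:
        n_cells = len(row) - 1
        best = max(range(len(row)), key=lambda c: row[c])
        if best == n_cells:
            xs.append(None)  # the "absent" class won
            continue
        cell_width = image_width / n_cells
        xs.append((best + 0.5) * cell_width)  # center of the winning cell
    return xs


scores = [
    [0.10, 0.70, 0.10, 0.05, 0.05],
    [0.10, 0.20, 0.60, 0.05, 0.05],
    [0.00, 0.00, 0.00, 0.10, 0.90],
]
xs = decode_row_wise(scores, image_width=400)
```

With four grid cells over a 400-pixel-wide image, the first two rows decode to the centers of cells 1 and 2, and the third row is reported as having no marking.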

3. Lane Detection

The task of lane detection is also known as drivable area detection and is currently treated mainly as a segmentation task. Owing to the great success of deep learning, many methods based on semantic segmentation and instance segmentation can be transferred to drivable area detection. The FCN [30] is the first work to introduce the fully convolutional network to semantic segmentation, which makes CNN-based methods widely applicable to lane detection. The UNet [31] further constructs an encoder–decoder framework to extract lane semantic information from high-dimensional features. The DeepLabV3 [32] applies atrous convolutions [33] with different atrous rates in parallel and fuses the results, a module known as atrous spatial pyramid pooling (ASPP), to obtain different receptive fields on the feature maps. The PSPNet [34] proposes the pyramid pooling module for feature extraction at various scales, which enhances the accuracy of the model. Both DeepLabV3 and PSPNet thus leverage the fusion of multi-scale feature information to improve segmentation performance. He et al. [35] embedded the Swin Transformer into the classical UNet to improve semantic segmentation performance for remote sensing images. Xie et al. [36] presented a segmentation method for RGB-D data and adopted motion detection to improve inference accuracy. Meyer et al. [37] extended the Cityscapes dataset [38] with lane-level annotations and presented a novel lane detection pipeline, which used a stereo system to convert the front-view segmentation results into a 3D point cloud and projected it to the top view. Sun et al. [39] proposed leveraging crowd-sourced GPS data to extract roads from aerial images, which improved road segmentation compared to previous works. Fontanelli et al. [40] performed lane detection in the front-view image and projected it to the top view to construct the path used to plan the future motion of the robot.
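
To illustrate how the atrous rate changes the receptive field, the following one-dimensional sketch applies the same three-tap kernel at several rates and stacks the branch outputs per position. It is a simplified stand-in for the idea behind ASPP, not the DeepLabV3 module itself: real implementations work on 2-D feature maps with learned kernels, and all names here are illustrative.

```python
def receptive_field(kernel_size, rate):
    """Number of input positions spanned by one dilated kernel application."""
    return (kernel_size - 1) * rate + 1


def dilated_conv1d(signal, kernel, rate):
    """Same-size 1-D dilated convolution with zero padding (odd kernel length)."""
    half = (len(kernel) // 2) * rate
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j * rate] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def aspp_like(signal, kernel, rates):
    """Run one branch per atrous rate and stack the responses per position."""
    branches = [dilated_conv1d(signal, kernel, r) for r in rates]
    return [tuple(b[i] for b in branches) for i in range(len(signal))]


# A single impulse shows which positions each branch can "see".
impulse = [0.0] * 4 + [1.0] + [0.0] * 4
fused = aspp_like(impulse, [1.0, 1.0, 1.0], rates=(1, 3))
```

With rate 1 the impulse influences only its immediate neighbours, while with rate 3 the same three weights reach positions three steps away, which is how parallel branches cover several scales at the same parameter cost.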

4. Multi-Task Approaches

Although previous studies have achieved excellent performance on single detection or segmentation tasks, a multi-task architecture for processing perception information is better suited to practical applications. The goal of multi-task approaches is to balance detection performance against computational complexity by sharing feature information and model structure. The MultiNet [41] was the first to introduce a multi-task architecture into the autonomous driving perception task. It adopts a shared backbone and three decoders to perform road segmentation, vehicle detection, and scene classification simultaneously. The DLT-Net [42] inherits the encoder–decoder architecture with a shared backbone and multi-task decoders. It transmits information from the drivable area decoder, namely the context tensor, to both the lane marking decoder and the traffic object decoder, thus sharing decoder information to a certain extent. The RBNet [43] proposes a multi-task neural network model for unified detection of the road and the road boundary, which combines the input image, the road, and the road boundary as three nodes in a Bayesian network. Zhang et al. [44] considered the geometric constraint between the road and its boundaries and constructed interlinked sub-networks to improve the overall performance of both detection tasks. The RoadNet [45] develops a multi-task convolutional neural network to simultaneously predict road surfaces, boundaries, and centerlines from high-resolution remote sensing images. The HYDRO-3D [46] combines object detection features with historical object tracking information to improve the performance of both tasks, achieving robust object detection. Xia et al. [47] proposed a platform for automated driving system data acquisition and analysis, which presents a holistic pipeline for data processing based on connected automated vehicles. However, the interaction between lane and lane marking information remains insufficiently explored in the above-mentioned studies.
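
The shared-backbone idea, together with passing context from the drivable area decoder to the lane marking decoder, can be sketched with toy stand-ins. Everything below is hypothetical: the "backbone" and "heads" are hand-written rules on a single image row with hand-set thresholds, not learned networks, and the code does not reproduce MultiNet or DLT-Net. It only shows the data flow, in which features are computed once and the marking head is gated by the drivable-area output.

```python
def backbone(pixels):
    """Shared encoder stand-in: scale raw intensities to [0, 1]."""
    peak = max(pixels) or 1
    return [p / peak for p in pixels]


def drivable_head(features, threshold=0.2):
    """Mark the connected run around the image center that looks like road."""
    n = len(features)
    mask = [False] * n
    center = n // 2
    if features[center] < threshold:
        return mask
    left = right = center
    while left > 0 and features[left - 1] >= threshold:
        left -= 1
    while right < n - 1 and features[right + 1] >= threshold:
        right += 1
    for i in range(left, right + 1):
        mask[i] = True
    return mask


def marking_head(features, context, threshold=0.8):
    """Bright pixels count as markings only inside the drivable context."""
    return [f >= threshold and c for f, c in zip(features, context)]


def multi_task_forward(pixels):
    features = backbone(pixels)  # computed once, reused by both heads
    drivable = drivable_head(features)
    markings = marking_head(features, context=drivable)
    return drivable, markings


# One image row: dark roadside, grey road, a painted line, and a bright
# off-road object at the far right.
row = [0, 10, 80, 90, 240, 85, 80, 0, 255]
drivable, markings = multi_task_forward(row)
```

In this toy row the bright off-road pixel is rejected as a marking because it lies outside the drivable context, which hints at why modeling the interaction between the two outputs can help.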