Advanced techniques such as meta-learning can be explored to improve the generalization of CNN-based methods. Generalization can also be improved by diversifying the training dataset. Data augmentation plays a significant role in diversifying and enlarging an image dataset: in this process, images can be cropped, rotated, brightened, and mirrored to vary the training data, as illustrated in Figure 1.
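The augmentation operations listed above can be sketched with plain NumPy array operations; this is a minimal illustration only, and production pipelines typically rely on dedicated libraries (the brightness factor and crop size chosen here are arbitrary):

```python
import numpy as np

def augment(img, rng):
    """Return simple augmented variants of an H x W x 3 image (values in [0, 255])."""
    variants = {}
    # Mirror: flip the image horizontally.
    variants["mirrored"] = img[:, ::-1]
    # Rotate: 90-degree rotation (arbitrary-angle rotation would need interpolation).
    variants["rotated"] = np.rot90(img)
    # Brighten: scale intensities and clip back to the valid range.
    variants["brightened"] = np.clip(img.astype(np.float32) * 1.3, 0, 255).astype(img.dtype)
    # Crop: take a random patch of half the original height and width.
    h, w = img.shape[:2]
    top = rng.integers(0, h // 2)
    left = rng.integers(0, w // 2)
    variants["cropped"] = img[top:top + h // 2, left:left + w // 2]
    return variants

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 80, 3), dtype=np.uint8)
aug = augment(img, rng)
```

Each variant can be added to the training set alongside the original image, multiplying the effective dataset size.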
There are many strategies for detecting lane markings using a deep learning network, and these strategies can be categorized by how the LMD task is defined: as object detection, classification, or segmentation of lanes. In object detection approaches, every feature point on the lane segments is labeled, and the lanes are detected as objects by regressing their coordinates. In classification techniques, by comparison, the lane position is determined by combining prior information. In segmentation approaches, background and lane pixels are labeled as distinct classes, and the lane is detected through semantic or instance segmentation. Some LMD techniques also serve multiple purposes beyond detecting lane marks, such as road marking detection, road type classification, and drivable area detection. The architectural backbone is typically drawn from a standard convolutional network such as ResNet, VGG, or FCN.
Though it has improved LMD compared to the traditional methods, it also has some research limitations: the approach requires a complex data-processing unit and has a complex eight-layer architecture. Therefore, other researchers have developed improved deep neural networks to overcome these limitations.
Image classification refers to discriminating among the objects present in an input image frame. However, the lane location cannot be tracked through this process alone, so the classification technique requires some modification to localize the lane. Let the modified classification be y = f(x, pm(p)), where f(x) is the CNN mapping function and pm(p) is prior knowledge of the lane location. Gurghian et al. [18] proposed DeepLane based on this idea; its network architecture is shown in Figure 5. DeepLane was trained on a dataset created from image frames of a downward-facing camera. The frames were classified into 317 classes, of which 316 represented probable lane positions and the remaining one a missing lane. A softmax function was applied to the last fully connected layer to obtain a probability distribution, and the lane position was estimated as the expected value over the position classes, E[i] = Σ_i i·p_i, where p_i is the softmax probability of position class i.
Figure 5. Schematic diagram of DeepLane.
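The classification-with-expectation step described above can be sketched in NumPy: a softmax over the 317 class logits yields a probability distribution, and the position estimate is the expected index over the 316 position classes (the class layout follows the description in the text; the logit values here are made up for illustration):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

def estimate_lane_position(logits):
    """Expected lane position E[i] = sum_i i * p_i over the 316 position classes.

    logits: length-317 vector; classes 0..315 are lane positions and
    class 316 means "no lane visible".
    """
    p = softmax(logits)
    if p[316] > 0.5:                     # missing-lane class dominates
        return None
    p_pos = p[:316] / p[:316].sum()      # renormalize over position classes only
    return float(np.sum(np.arange(316) * p_pos))

logits = np.zeros(317)
logits[100] = 12.0                       # network is confident the lane is near position 100
pos = estimate_lane_position(logits)
```

Taking the expectation rather than the argmax gives a sub-class-resolution position estimate from a discrete classifier.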
Though DeepLane achieved better results than more complex networks [17], fixing the lane position a priori limited its robustness. In addition, classification techniques are not well suited to lane marking detection, since classification is a high-level task. As discussed earlier, regressing the lane coordinates as an object detection process is another viable way to detect lane markings.
2.2.4. Lane Detection Based on Segmentation
Segmentation approaches such as [29,30,31] may be the best option for lane marking detection, as noted by Shriyash et al. [32]. These approaches strictly emphasize per-pixel classification rather than focusing on particular shapes. Lane detection based on a segmentation framework has achieved more efficient results, apart from the limitation mentioned above. Many strategies address this problem, such as that proposed by Chiu et al. [33], which treated the lane marking detection system as an image segmentation problem. However, the conventional segmentation approaches did not remain in use for long.
For this reason, researchers started to apply end-to-end segmentation approaches to lane marking detection, where a larger convolution kernel allows the network to carry more features. Zhang et al. introduced a GCN [34] algorithm to detect particular lane areas. Riera et al. proposed a lane departure system based on Mask R-CNN [35] to detect the lane marks, with an additional Kalman filter to track the lanes. Shriyash et al. [36] proposed a CNN architecture consisting of ten neuron layers to detect the lanes in real time. Distinguishing different types of lanes also makes a notable contribution to more comprehensive detection. Fabio et al. [37] designed a modified ERFNet architecture to classify the road lanes and identify the drivable area.
Semantic segmentation through a DCNN may have some deficiencies, as its pooling stages have no learnable parameters; for instance, there are no learnable parameters in max-pooling or upsampling layers. Therefore, many features are likely to be lost when attempting to capture a large receptive field. Koltun et al. [38] introduced dilated convolution to resolve this issue, which can be studied further in [39]. Although this framework has significant advantages, effectively designing a CNN architecture around dilated convolution became a new issue.
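The effect of dilation can be sketched with a 1-D convolution in NumPy: a dilation rate d samples the input with gaps of d, so a kernel of size k covers a receptive field of d·(k−1)+1 without adding parameters (an illustrative sketch, not any particular paper's implementation):

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """Valid 1-D convolution of signal x with kernel w at the given dilation rate."""
    k = len(w)
    span = dilation * (k - 1) + 1          # receptive field of one output element
    out = np.empty(len(x) - span + 1)
    for t in range(len(out)):
        # Sample the input every `dilation` steps under the kernel.
        out[t] = sum(w[j] * x[t + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
y1 = dilated_conv1d(x, w, dilation=1)      # receptive field 3
y2 = dilated_conv1d(x, w, dilation=2)      # receptive field 5, same 3 weights
```

The same three weights see a wider context at dilation 2, which is why stacking dilated layers enlarges the receptive field cheaply.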
Chen et al. proposed a deep convolutional neural network-based lane markings detector (LMD), aiming at an optimal CNN architecture design with dilated convolution [40]. The lane markings detector uses an encoder similar to ResNet [41] and VGG [42] for classification, while DeconvNet [43], U-Net [44], and FCN [45] are used as decoders to create the feature maps. Additionally, dilated convolutions are embedded in the encoder–decoder section of the architecture, as shown in Figure 6. Lo et al. [40] introduced a CNN architecture based on a Digressive Dilation Block (DDB) and Feature Size Selection (FSS), considering the spatial and downsampling operations, which is likewise embedded with dilated convolution [46].
Figure 6. Schematic diagram of the deep convolutional neural network-based lane markings detector (LMD).
Long-range information is another concern in lane marking detection. Wang et al. [47] designed a non-local operation based on the non-local framework [48]. The model can extract long-distance information, which matters because extending over a long range is one of a lane's properties. Li et al. [49] proposed the Instance batch normalization and Attention Network (IANet) to make the model focus on a particular lane region. According to the experimental results, it is most appropriate for two-class segmentation scenarios.
Considering efficient classification by focusing on pixels rather than shapes, Jan et al. [50] came up with the adversarial framework known as generative adversarial networks (GANs). A GAN has a generator that creates synthetic data and a discriminator that differentiates real data from the generator's output. The initial concept of the GAN was to generate data closely approximating the real data; the more recent emphasis is on accurately determining whether an input is generated or real. The reader can consult [51,52,53] for further information about GANs. Ghafoorian et al. [19] designed the Embedding loss GAN (EL-GAN) based on the GAN concept. The framework is divided into two segments, a generator and a discriminator; the schematic diagram of the EL-GAN framework is shown in Figure 7. A U-Net-style architecture is applied as the generator to train on the input, with Tiramisu DenseNet [54] used for detecting the lane markings, and this process continues until convergence. For the discriminator, DenseNet [55,56] is used with fully connected layers for GAN classification [57].
Figure 7. Schematic diagram of EL-GAN.
The framework's generator is trained with the adversarial embedding loss and the Adam optimizer, whereas the discriminator is trained with stochastic gradient descent and ordinary cross-entropy. The embedding loss can be considered a perceptual loss [58]; EL-GAN thus combines perceptual loss with a conditional GAN (CGAN).
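The adversarial training dynamic can be illustrated with a deliberately tiny NumPy sketch: a linear generator G(z) = a·z + b, a logistic discriminator D(x) = σ(w·x + c), and one hand-derived gradient step for each player. The 1-D data, initial parameters, and learning rate are arbitrary choices for illustration; real GANs use deep networks with automatic differentiation:

```python
import numpy as np

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

# Generator G(z) = a*z + b and discriminator D(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0          # generator parameters
w, c = 0.5, 0.0          # discriminator parameters
lr = 0.05

x_real, z = 4.0, 0.5     # one real sample and one latent draw
x_fake = a * z + b

def d_loss(w, c):
    # Standard discriminator objective: -log D(real) - log(1 - D(fake)).
    return -np.log(sigmoid(w * x_real + c)) - np.log(1 - sigmoid(w * x_fake + c))

# --- one discriminator gradient step ---
before = d_loss(w, c)
grad_w = -(1 - sigmoid(w * x_real + c)) * x_real + sigmoid(w * x_fake + c) * x_fake
grad_c = -(1 - sigmoid(w * x_real + c)) + sigmoid(w * x_fake + c)
w, c = w - lr * grad_w, c - lr * grad_c
after = d_loss(w, c)

# --- one generator gradient step (non-saturating loss -log D(G(z))) ---
x_fake = a * z + b
g_before = -np.log(sigmoid(w * x_fake + c))
grad_a = -(1 - sigmoid(w * x_fake + c)) * w * z
grad_b = -(1 - sigmoid(w * x_fake + c)) * w
a, b = a - lr * grad_a, b - lr * grad_b
g_after = -np.log(sigmoid(w * (a * z + b) + c))
```

Each small step reduces that player's own loss; alternating the two updates is the essence of adversarial training.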
Geometric features of roads also play an important role in lane marking detection, and networks exploiting them have achieved better performance than VPGNet. Zhang et al. [59] proposed the Geometric Constrained Network (GLCNet), a multitask network that interlinks the lane boundary and lane segmentation sub-structures. The architecture of GLCNet [59] is shown in Figure 8: every decoder section is linked to the encoder section to transfer the corresponding features to the two distinct tasks, so the information from the decoder sections can mutually reinforce each other. This multitask strategy opened the door for researchers to develop frameworks linking the lane boundary and the lane area. Following the same idea as GLCNet, John et al. [60] designed PSINet for multiple detection purposes, such as road scene labels, lane marks, and free space on the road.
Figure 8. Schematic diagram of GLCNet.
In addition to geometric or spatial features, temporal correlation can have a significant effect where a lane cannot be detected in a single frame, since the captured video has a sequential structure. Because long short-term memory (LSTM) networks can retain information over time, lane features can be carried over from previous frames. Hence, Qin et al. [61] proposed a CNN-LSTM method that includes two LSTM layers between the encoder and decoder stages. The major achievement of this method is its improved performance under different occlusion scenarios. The architecture of the CNN-LSTM method is depicted in Figure 9, which shows the temporal information being transferred between the encoder and decoder stages through the LSTM layers.
Figure 9. Schematic diagram of CNN-LSTM.
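The memory mechanism that lets an LSTM carry lane evidence across frames can be sketched as a single NumPy cell; the gate equations below are the standard LSTM formulation, while the dimensions and random weights are arbitrary illustration (the real method uses learned weights inside a full encoder–decoder):

```python
import numpy as np

def lstm_cell(x, h, c, W, U, bias):
    """One LSTM step. x: input, h: hidden state, c: cell (memory) state.

    W, U, bias each stack the input (i), forget (f), output (o),
    and candidate (g) gate parameters along the first axis.
    """
    sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
    z = W @ x + U @ h + bias                # all four gates in one affine map
    n = len(h)
    i = sigmoid(z[0:n])                     # input gate: what to write
    f = sigmoid(z[n:2 * n])                 # forget gate: what to keep
    o = sigmoid(z[2 * n:3 * n])             # output gate: what to expose
    g = np.tanh(z[3 * n:4 * n])             # candidate values
    c_new = f * c + i * g                   # memory carried across time steps
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(1)
d_in, d_hid = 8, 4                          # e.g. per-frame feature -> hidden state
W = rng.normal(0, 0.1, (4 * d_hid, d_in))
U = rng.normal(0, 0.1, (4 * d_hid, d_hid))
bias = np.zeros(4 * d_hid)

h = np.zeros(d_hid)
c = np.zeros(d_hid)
for _ in range(5):                          # feed five consecutive "frames"
    x = rng.normal(size=d_in)
    h, c = lstm_cell(x, h, c, W, U, bias)
```

The cell state c accumulates information across the five frames, which is why an occluded lane in one frame can still be inferred from its predecessors.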
2.2.5. Simplification of the Post-Processing Step
The frameworks described above extract lane features efficiently without any optimization from a post-processing step, yet it is very challenging to separate the individual lanes in their output without such a step. Effective strategies are more important than any particular network architecture for reaching the optimal result. This sub-section therefore focuses on these strategies for lane marking detection rather than on deep neural network (DNN) architectures.
There are two possible types of algorithmic output for lane marking detection using a DNN: lane points and lane lines. Hence, the possibility arises of exploiting different lane features while omitting post-processing steps. There are three possible ways to overcome this constraint: semantic segmentation, labelling each lane line as a separate class; instance segmentation, treating every lane as a different instance; and a multi-branch CNN structure, detecting every lane line through an individual branch.
Xingang et al. [20] applied a Spatial Convolutional Neural Network (SCNN) to detect lanes under occlusion as multi-class semantic segmentation. The SCNN framework is based on the LargeFOV layout [62], and the weights of the initial thirteen convolution layers are taken from VGG16 [42]. To predict the lanes precisely, it generates pixel-wise probability maps for training the network, then applies a CNN to differentiate the lane markings on its own. Finally, the probability maps are sent to the system to predict the lane markings of the different classes. The architecture of the SCNN is shown in Figure 10, where separate branches are designed to predict the different lane classes.
Figure 10. Schematic diagram of SCNN.
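SCNN's central idea, passing information slice by slice across the feature map so that evidence propagates along thin, partly occluded lanes, can be sketched in NumPy for the downward direction. This single-channel, fixed-kernel version is only a sketch of the mechanism; the actual network learns the kernels and applies the pass in all four directions:

```python
import numpy as np

def scnn_downward(feat, kernel):
    """Propagate information row by row, top to bottom (the SCNN_D direction).

    Each row is updated by adding a ReLU-activated 1-D convolution of the
    row above, so activations travel down the image.
    """
    out = feat.copy()
    for r in range(1, out.shape[0]):
        msg = np.convolve(out[r - 1], kernel, mode="same")
        out[r] = out[r] + np.maximum(msg, 0.0)   # ReLU before adding
    return out

feat = np.zeros((6, 9))
feat[0, 4] = 1.0                 # lane evidence present only in the top row
kernel = np.array([0.3, 0.4, 0.3])
out = scnn_downward(feat, kernel)
```

After the pass, the single activation in the top row has spread to every row below it, which is how SCNN bridges gaps where the lane marking is missing.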
Shriyash et al. [32] proposed the Coordinate Network (CooNet), a lane point regression approach. It is a multi-branch neural network, shown in Figure 11, in which each lane is predicted in its respective branch. The network needs no clustering process, as it directly provides the lane output through coordinate regression.
Figure 11. Schematic diagram of CooNet.
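The multi-branch coordinate-regression idea can be sketched as a shared feature extractor followed by one regression head per lane, each emitting a fixed number of point coordinates directly, with no clustering step. The layer sizes, the one-layer "backbone", and the choice of four branches are illustrative assumptions, not CooNet's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)

n_branches = 4          # one branch per lane line
n_points = 10           # each branch regresses 10 (x, y) points
d_feat = 32             # size of the shared feature vector

# Shared "backbone": here just one linear layer over a flattened input.
img = rng.normal(size=128)
W_shared = rng.normal(0, 0.1, (d_feat, 128))
shared = np.tanh(W_shared @ img)

# Independent regression heads, one per lane.
heads = [rng.normal(0, 0.1, (n_points * 2, d_feat)) for _ in range(n_branches)]
lanes = [(H @ shared).reshape(n_points, 2) for H in heads]
```

Because each branch outputs its own lane's coordinates, the per-lane assignment is built into the architecture rather than recovered by post-processing.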
To detect multiple lanes and cope with lane changes, Davy et al. [63] introduced an end-to-end lane detection approach, LaneNet, a deep learning method based on the encoder–decoder network E-Net [64], as shown in Figure 12. A shared encoder processes the input images and produces a binary segmentation together with a per-pixel embedding used to cluster lane pixels, so that each pixel can be associated with its neighbouring pixels. LaneNet also utilizes H-Net, which learns the parameters of an ideal perspective transformation conditioned on the input image. The research aimed to handle lane changes, unlike fixed bird's-eye-view approaches. Additionally, this approach has no limitation on the number of lanes, whereas CooNet and SCNN can detect only up to four lanes.
Figure 12. Schematic diagram of LaneNet.
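The instance-embedding idea behind LaneNet can be sketched in NumPy: lane pixels are mapped into an embedding space where pixels of the same lane lie close together, and a simple threshold-based grouping then recovers the lane instances. The embeddings below are synthetic, and the greedy grouping stands in for the mean-shift clustering the paper uses on embeddings learned with a discriminative loss:

```python
import numpy as np

def cluster_embeddings(emb, delta=1.0):
    """Greedy clustering: grow each cluster from an unassigned pixel,
    repeatedly absorbing pixels within `delta` of the cluster mean."""
    labels = -np.ones(len(emb), dtype=int)
    next_label = 0
    for seed in range(len(emb)):
        if labels[seed] != -1:
            continue
        labels[seed] = next_label
        changed = True
        while changed:
            mean = emb[labels == next_label].mean(axis=0)
            near = (labels == -1) & (np.linalg.norm(emb - mean, axis=1) < delta)
            changed = near.any()
            labels[near] = next_label
        next_label += 1
    return labels

rng = np.random.default_rng(3)
# Two synthetic lanes: embeddings concentrated around two well-separated centers.
lane_a = rng.normal([0.0, 0.0], 0.1, size=(20, 2))
lane_b = rng.normal([5.0, 5.0], 0.1, size=(20, 2))
emb = np.vstack([lane_a, lane_b])
labels = cluster_embeddings(emb, delta=1.0)
```

Since the number of clusters falls out of the data rather than the architecture, this formulation imposes no upper bound on the number of lanes.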