Autonomous vehicle systems (AVS) have advanced rapidly, particularly owing to improvements in artificial intelligence, with a significant impact on society, road safety and the future of transportation systems. Deep learning is fast becoming a successful alternative approach for perception-based AVS, as it reduces both cost and dependency on sensor fusion.
1. Introduction
Recently, the autonomous vehicle system (AVS) has become one of the most active research domains, focusing on driverless intelligent transport for better safety and reliability on roads [1]. One of the main motives for advancing AVS development is its ability to overcome human driving mistakes, including distraction, discomfort and lack of experience, which cause nearly 94% of accidents according to a statistical survey by the National Highway Traffic Safety Administration (NHTSA) [2]. In addition, almost 50 million people are severely injured in road collisions, and over 1.25 million people worldwide are killed annually in highway accidents. These casualties may stem from too little emphasis on educating drivers with behavior guidance, poorly developed driver training procedures, fatigue while driving and visual complexities, that is, human error, which could potentially be addressed by adopting highly efficient self-driving vehicles [3][4]. The NHTSA and the U.S. Department of Transportation adopted the SAE International levels of driving automation, which classify autonomous vehicles (AV) from ‘level 0’ to ‘level 5’ [5], where levels 3 to 5 are considered automated driving and level 5 is full automation. However, as of 2019, level 1 to 3 vehicle systems were in production, while level 4 vehicle systems were still in the testing phase [6]. Moreover, autonomous vehicles are widely expected to support people in need of mobility, assist people who cannot drive and reduce the cost and time of transport [7][8]. In the past couple of years, not only academic institutions working on autonomous driving but also giant tech companies like Google, Baidu, Uber and Nvidia have shown great interest [9][10][11], and vehicle manufacturers such as Toyota, BMW and Tesla are already working on launching AVS within the first half of this decade [12]. Although conventional AVS use different sensors such as radar, lidar, geodimetric, computer vision, Kinect and GPS to perceive the environment [13][14][15][16][17], equipping vehicles with these sensors is expensive, and their high cost often limits their use in on-road vehicles [18].
Table 1 shows a comparison of three major vision sensors based on a total of nine factors. While the concept of driverless vehicles has existed for decades, exorbitant costs have inhibited large-scale deployment [19]. To resolve this issue and build a cost-efficient system with high accuracy, deep learning-based vision systems that use an RGB camera as the only sensor are becoming more popular. Recent developments in deep learning have increased its potential for solving complex real-world challenges [20].
Table 1. Comparison of vision sensors.
| VS | VR | FoV | Cost | PT | DA | AAD | FE | LLP | AWP |
|---|---|---|---|---|---|---|---|---|---|
| Camera | High | High | Low | Medium | Medium | High | High | Medium | Medium |
| Lidar | High | Medium | High | Medium | High | Medium | Medium | High | Medium |
| Radar | Medium | Low | Medium | High | High | Low | Low | High | Low |
In addition, considerable attention has been given to developing safe AVS for pedestrian detection. Multiple deep learning approaches, such as DNN, CNN, YOLO V3-Tiny, DeepSort R-CNN, single-shot late-fusion CNN, Faster R-CNN, R-CNN combined with an ACF model, dark channel prior-based SVM and attention-guided encoder–decoder CNN, outperformed the baselines on the applied datasets. They provided a faster warning area by bounding each pedestrian in real time [21], detection in crowded environments and in dim lighting or haze scenarios [22][23] for position estimation [23], and lower computational cost while outperforming state-of-the-art methods [24]. These approaches could offer an ideal pedestrian detection method once their technical challenges have been overcome, for example, dependency on preliminary boxing during detection, the presumption of constant depth in the input image and a high miss rate in complex environments.
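The bounding-box and miss-rate notions used above can be made concrete with a minimal sketch. It is not taken from any of the reviewed papers: the box format, the greedy one-to-one matching and the 0.5 IoU threshold are all illustrative assumptions.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def miss_rate(ground_truth: List[Box], detections: List[Box],
              threshold: float = 0.5) -> float:
    """Fraction of ground-truth pedestrians left without a matching detection.

    Greedy one-to-one matching; the 0.5 threshold is an assumption.
    """
    unused = list(detections)
    missed = 0
    for gt in ground_truth:
        # Pick the still-unmatched detection that overlaps this pedestrian most.
        best = max(unused, key=lambda d: iou(gt, d), default=None)
        if best is not None and iou(gt, best) >= threshold:
            unused.remove(best)
        else:
            missed += 1
    return missed / len(ground_truth) if ground_truth else 0.0
```

With two annotated pedestrians and a single detection that overlaps only the first one, `miss_rate` reports that half of the pedestrians were missed.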
Moreover, multiple methods, such as multimodal multitask-based CNN [25], CNN with LSTM [26] and ST-LSTM [27], were studied in this research for the end-to-end control system of AVS. These methods estimate steering angle and velocity while controlling lane keeping or lane changing, overcome slow drifting, act on human weak zones such as blind spots and reduce manual labelling of training data.
Furthermore, one of the most predominant segments of AVS, traffic scene analysis, was covered to understand scenes in challenging, crowded and dynamic environments [28], improve performance through more expressive spatial features and risk prediction [29] and detect on-road damage [24]. For this purpose, HRNet + contrastive loss [30], Multi-Stage Deep CNN [31], 2D-LSTM with RNN [32], DNN with Hadamard layer [33], Spatial CNN [29], OP-DNN [34] and the methods listed in Table 2 were reviewed. However, some limitations remain, for instance, data dependency or reliance on pre-labelled data and decreased accuracy in challenging traffic or at nighttime.
Table 2. Summary of multiple deep learning methods for traffic scene analysis.
| Ref. | Method | Outcomes | Advantages | Limitations |
|---|---|---|---|---|
| [35] | VGG-19 SegNet | Highest 91% classification accuracy. | Efficient in specified scene understanding, reducing the person manipulation. | Showed false rate for not having high-resolution labelled dataset. |
| [28] | Markov Chain Monte Carlo | Identify intersections with 90% accuracy. | Identified intersections from challenging and crowded urban scenario. | Independent tractlets caused unpredictable collision in complex scenarios. |
| [36] | HRNet | 81.1% mIoU. | Able to perform semantic segmentation with high resolution. | Required huge memory size. |
| [30] | HRNet + contrastive loss | 82.2% mIoU. | Contrastive loss with pixel-to-pixel dependencies enhanced performance. | Did not show success of contrastive learning in limited data-labelled cases. |
| [37] | DeepLabV3 and ResNet-50 | 79% mIoU with 50% less labelled dataset. | Reduce dependency on huge labelled data with softmax fine-tuning. | Dependency on labelled dataset. |
| [31] | Multistage Deep CNN | Highest 92.90% accuracy. | Less model complexity and three times less time complexity than GoogleNet. | Did not demonstrate for challenging scenes. |
| [38] | Fine- and coarse-resolution CNN | 13.2% error rate. | Applicable at different scale. | Multilabel classification from scene was missing. |
| [32] | 2D-LSTM with RNN | 78.52% accuracy. | Able to avoid the confusion of ambiguous labels by increasing the contrast. | Suffered scene segmentation in foggy vision. |
| [39] | CDN | Achieved 80.5% mean IoU. | Fixed image semantic information and outperformed expressive spatial feature. | Unable to focus on each object in low-resolution images. |
| [33] | DNN with Hadamard layer | 0.65 F1 score, 0.67 precision and 0.64 recall. | Foresaw road topology with pixel-dense categorization and less computing cost. | Restrictions by the double-loss function caused difficulties in optimizing the process. |
| [40] | CNN with pyramid pooling | Scored 54.5 mIoU. | Developed novel image augmentation technique from fisheye images. | Not applicable for far field of view. |
| [29] | Spatial CNN | 96.53% accuracy and 68.2% mIoU. | Re-architected CNN for long continuous road and traffic scenarios. | Performance dropped significantly during low-light and rainy scenarios. |
| [34] | OP-DNN | 91.1% accuracy after 7000 iterations. | Decreased the issue of overfitting in small-scale training set. | Required re-weighting for improved result but inapplicable in uncertain environment. |
| [41] | CNN and LSTM | 90% accuracy in 3 s. | Predict risk of accidents lane merging, tollgate and unsigned intersections. | Slower computational time and tested in similar kinds of traffic scenes. |
| [42] | DNN | 68.95% accuracy and 77% recall. | Determined risk of class from traffic scene. | Sensitivity analysis was not used for crack detection. |
| [43] | Graph-Q and DeepScene-Q | Obtained p-value of 0.0011. | Developed dynamic interaction-aware-based scene understanding for AVS. | Unable to see fast lane result and slow performance of agent. |
| [44] | PCA with CNN | High accuracy for transverse classification. | Identified damages and cracks in the road, without pre-processing. | Required manual labelling which was time consuming. |
| [45] | CNN | 92.51%, 89.65% recall and F1 score, respectively. | Automatic learning feature and tested in complex background. | Had not performed in real-time driving environment. |
| [24] | SegNet and SqueezedNet | Highest accuracy (98.93%) in GAPs dataset. | Identified potholes with texture-reliant approach. | Failed cases due to confusing with texture of the restoration patches. |
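Table 2 reports results as mIoU, precision, recall and F1 score. As a reading aid, the sketch below shows how these metrics are commonly computed. It assumes a confusion matrix indexed as `confusion[true_class][predicted_class]` and is not the evaluation code of any cited work.

```python
from typing import List


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_iou(confusion: List[List[int]]) -> float:
    """Mean intersection over union across classes.

    confusion[i][j] counts pixels of true class i predicted as class j
    (an assumed layout; some toolkits use the transpose).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0
```

As a consistency check, the precision (0.67) and recall (0.64) reported for [33] give `f1_score(0.67, 0.64)` of about 0.655, in line with the F1 score of 0.65 listed in the table.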
Taking into account all taxonomies as features, the decision-making process for AVS was broadly analyzed for driving decisions such as overtaking, emergency braking, lane shifting with collision avoidance and safe driving at intersections, adopting methods such as deep recurrent reinforcement learning [46], actor-critic-based DRL with DDPG [47], double DQN, TD3, SAC [48], dueling DQN [49], gradient boosting decision tree [50], and deep RL using Q-masking and automatically generated curriculum-based DRL [51]. Despite solving most of the tasks for safe deployment in level 4 or 5 AVS, challenges remain, such as costly and complex training, lack of proper analysis of surrounding vehicles’ behavior and unresolved cases in complex scenarios.
Some problems remain to be resolved for better outcomes, such as the requirement for a larger labelled dataset [52], difficulty classifying in blurry visual conditions [53], small traffic signs in a far field of view [54], background complexity [55] and the detection of two traffic signs rather than one, which occurred for different locations of the proposed region [56].
Apart from these, one of the most complicated tasks for AVS, vision-only path and motion planning, was analyzed by reviewing approaches such as deep inverse reinforcement learning, DQN time-to-go method, MPC, Dijkstra with TEB method, DNN, discrete optimizer-based approach, artificial potential field, MPC with LSTM-RNN, advance dynamic window using, 3D-CNN, spatio-temporal LSTM and fuzzy logic. These approaches provided solutions by avoiding cost functions and manual labelling, reducing the limitations of rule-based methods for safe navigation [57], and enabling better path planning for intersections [58], motion planning by analyzing risks and predicting the motions of surrounding vehicles [59], hazard detection-based safe navigation [60], obstacle avoidance for smooth planning in multilane scenarios [61], lower computational cost [62] and path planning that replicates human-like control thinking in ambiguous circumstances. Nevertheless, these approaches faced challenges such as lack of live testing, low accuracy over a far prediction horizon, impaired performance in complex situations, being limited to non-rule-based approaches and constrained kinematics, or even difficulty in establishing a rule base to tackle unstructured conditions.
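Of the planning approaches listed above, the artificial potential field is simple enough to sketch in a few lines. The version below is a textbook 2D formulation and not the implementation of any reviewed paper: the gains `k_att` and `k_rep`, the influence radius, the fixed step length and the point-obstacle model are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def apf_step(pos: Point, goal: Point, obstacles: List[Point],
             k_att: float = 1.0, k_rep: float = 100.0,
             influence: float = 5.0, step: float = 0.5) -> Point:
    """Move one fixed-length step along the combined potential-field force."""
    # Attractive force pulls the vehicle straight towards the goal.
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0.0 < d < influence:
            # Negative gradient of 0.5 * k_rep * (1/d - 1/influence)^2,
            # directed away from the obstacle.
            mag = k_rep * (1.0 / d - 1.0 / influence) / (d * d)
            fx += mag * dx / d
            fy += mag * dy / d
    norm = math.hypot(fx, fy)
    if norm == 0.0:
        return pos  # forces cancel: a local minimum
    return (pos[0] + step * fx / norm, pos[1] + step * fy / norm)


def plan_path(start: Point, goal: Point, obstacles: List[Point],
              max_steps: int = 500, tol: float = 0.5) -> List[Point]:
    """Iterate apf_step until the goal is within tol or max_steps is hit."""
    path = [start]
    pos = start
    for _ in range(max_steps):
        if math.hypot(goal[0] - pos[0], goal[1] - pos[1]) <= tol:
            break
        pos = apf_step(pos, goal, obstacles)
        path.append(pos)
    return path
```

A known weakness of this formulation is that attractive and repulsive forces can cancel, leaving the planner stuck in a local minimum; `plan_path` then simply stops making progress until `max_steps` runs out.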
Finally, to visualize the outcomes generated by the previous methods as overlays on a front head-up display or smart windshield, augmented reality-based approaches combined with deep learning methods were reviewed in the last section. AR-HUD-based solutions were covered, such as 3D surface reconstruction, object marking, path overlaying, reducing the load on drivers’ attention and boosting visualization in hazy or low-light conditions by overlaying lanes, traffic signs and on-road objects to reduce accidents, using deep CNN, RANSAC, TTC methods and so on. However, there are still many challenges for practical execution, such as human adoption of AR-based HUD UI, limited visualization in bright daytime conditions, overlapping non-superior objects as well as visualization delay for fast-moving on-road objects. In summary, this research on vision-based deep learning approaches across 10 taxonomies for AVS, with its discussion of outcomes, challenges and limitations, could be a pathway to improve and rapidly develop cost-efficient level 4 or 5 AVS without depending on expensive and complex sensor fusion.
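The TTC (time-to-collision) methods mentioned for AR-HUD warnings can be illustrated with a minimal constant-velocity sketch. The function names, the SI units and the 2.5 s warning threshold are hypothetical choices for illustration and are not taken from the reviewed work.

```python
from typing import Optional


def time_to_collision(gap_m: float, ego_speed_mps: float,
                      lead_speed_mps: float) -> Optional[float]:
    """Constant-velocity TTC in seconds, or None if the gap is not closing."""
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0.0:
        return None  # the lead object is as fast or faster: no collision course
    return gap_m / closing_speed


def should_warn(ttc_s: Optional[float], threshold_s: float = 2.5) -> bool:
    """Decide whether a HUD warning would be shown (threshold is an assumption)."""
    return ttc_s is not None and ttc_s < threshold_s
```

For example, a 30 m gap closing at 10 m/s gives a TTC of 3 s, which stays below the warning level under the assumed threshold, while a 10 m gap at the same closing speed would trigger the overlay.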