Deep learning has proven to be effective in various fields, including computer vision and image processing, with applications such as autonomous driving. In this section, we report the deep neural network architectures that have been used in the literature for the RBA tasks cited in the previous section. Specifically, we highlight the contributions of the related state-of-the-art approaches organized by their core architectures. The goal of this study is to help practitioners decide which meta-architectures and feature extractors are most suitable for their application.
4.1. Deep Convolutional Neural Networks
CNNs are the most appealing variant of deep neural networks for vision-related tasks. They were originally developed by LeCun et al. in [52] and successfully applied to the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competition by Krizhevsky et al. [21]. CNNs perform feature learning to obtain consistent representations from visual data. A CNN architecture includes three kinds of layers/mapping functions: convolution, pooling, and, at the end, one or more fully connected or GAP (Global Average Pooling) layers. Dropout and batch normalization are generally used in CNNs for regularization. Their biggest advantage is weight sharing (in contrast to conventional neural networks), which saves memory and reduces complexity. As a result, CNNs have shown impressive performance and outperformed earlier machine learning techniques devoted to visual understanding.
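As an illustration of this layer arrangement, the following minimal sketch (in PyTorch, with hypothetical layer sizes and class count) stacks convolution, batch normalization, pooling, GAP, dropout, and a fully connected output layer:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: conv -> BN -> ReLU -> pool blocks, then GAP, dropout, and an FC head."""
    def __init__(self, num_classes=10):  # hypothetical class count
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # weight-shared filters
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # spatial pooling
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)                 # Global Average Pooling
        self.dropout = nn.Dropout(0.5)                     # regularization
        self.fc = nn.Linear(64, num_classes)               # decision layer

    def forward(self, x):
        x = self.features(x)
        x = self.gap(x).flatten(1)
        return self.fc(self.dropout(x))

logits = TinyCNN()(torch.randn(1, 3, 224, 224))  # e.g., one RGB road-scene image
```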
One of the main applications of CNNs is object detection, which represents an important task and a major challenge in computer vision [53]. Its challenges are mainly related to object localization and classification. Several CNN-based methods exist in the literature, and they fall into two categories: (i) single-stage methods, which perform object localization and classification in a single network, such as the Single-Shot Detector (SSD) [54] and You Only Look Once (YOLO) [55]; both architectures output, for each detected object, a bounding box, its class, and its confidence score [53]; and (ii) two-stage methods, which have two separate networks for each of these tasks.
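As a brief illustration of how such a single-stage detector is typically used in practice, the following sketch runs a pre-trained SSD from torchvision (assuming a recent torchvision release; the weights and score threshold are illustrative, not tied to any work surveyed here):

```python
import torch
from torchvision.models.detection import ssd300_vgg16

# Pre-trained single-stage detector (SSD with a VGG-16 backbone).
detector = ssd300_vgg16(weights="DEFAULT").eval()

image = torch.rand(3, 300, 300)          # placeholder for a road-scene image
with torch.no_grad():
    predictions = detector([image])[0]   # dict with 'boxes', 'labels', 'scores'

keep = predictions["scores"] > 0.5       # illustrative confidence threshold
boxes, labels = predictions["boxes"][keep], predictions["labels"][keep]
```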
Among the milestone models, the most important CNN engines are AlexNet [
21], VGG-16 [
56], GoogleNet/Inception [
57], ResNet [
58], and Xception [
59]. Autonomous driving systems take advantage of pre-trained models, i.e., models already trained on other visual tasks. Specifically, for a new problem or dataset, the ITS community uses publicly available models to either (i) train an open-access meta-architecture from scratch on their own data, or (ii) use the pre-trained models as feature extractors by feeding new data through the trained weights and fine-tuning them for the new task, which is known as transfer learning. Commonly, the output layer is replaced with a new, fine-tuned layer that measures the deviation of the predictions from the labeled data. Thus, with a dataset of modest size, three training strategies can be considered: train only the output layer, train the last few layers, or train the whole CNN. The ITS community makes use of all these learning strategies to build, train, or retrain their CNN backbones. For on-road behavior tasks, while many endeavors succeed in generating robust features from pre-trained models [
17,
32,
35], major improvements have recently been proposed at many levels of the standard architecture with regard to spatial exploitation, depth, and feature-map exploitation.
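The three fine-tuning strategies mentioned above can be sketched as follows (a minimal PyTorch example assuming an ImageNet pre-trained ResNet-50 and a hypothetical number of behavior classes):

```python
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V2")        # ImageNet pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, 5)    # new output layer (hypothetical 5 classes)

# Strategy 1: train only the new output layer.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# Strategy 2: also unfreeze the last few layers (here, the last residual stage).
for name, param in model.named_parameters():
    if name.startswith(("layer4", "fc")):
        param.requires_grad = True

# Strategy 3: fine-tune the whole CNN.
for param in model.parameters():
    param.requires_grad = True
```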
Graph Convolutional Network (GCN). GCNs rely on graph-based learning instead of learning from data represented in Euclidean space (such as images or videos). The strength of the graph-based framework is the ease with which it represents the interactions occurring between an instance's components. This is of great importance in road scenes, as it helps describe the causality between the intention and the attention of drivers and of all road participants. This paradigm has been exploited by Li et al. in [23] to model interactions, as detailed earlier in Section 4. The two proposed networks, the Ego-Thing Graph and the Ego-Stuff Graph, advanced the state of the art of GCNs by extending the backbone proposed in Ref. [60] with two streams instead of one. First, to generate the graphs, 3D convolutions are applied to obtain a first level of visual features that feed both streams, the Thing and Stuff graphs. Then, as a second step, the Thing and Stuff representations are extracted using RoIAlign [61] and a newly proposed approach called MaskAlign, which handles irregular objects. The extracted features from both streams are then passed to the graph generators in a frame-wise fashion. The Ego-Thing and Ego-Stuff Graphs subsequently propagate the connections between different kinds of objects through these GCNs. The outputs of the two streams are fused and fed into a temporal module, which aggregates the spatial features using max-pooling to compute the final GCN output.
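For readers unfamiliar with graph convolutions, the following sketch shows the basic propagation rule underlying such GCN layers (a generic single-layer example with a hypothetical adjacency matrix; it is not the exact formulation used in [23]):

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One GCN layer: H' = ReLU(A_norm @ H @ W), with a row-normalized adjacency."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, features, adjacency):
        adjacency = adjacency + torch.eye(adjacency.size(0))        # add self-loops
        adjacency = adjacency / adjacency.sum(dim=1, keepdim=True)  # row-normalize
        return torch.relu(adjacency @ self.weight(features))

# e.g., 6 road agents ("things") with 256-d appearance features each (hypothetical sizes)
features = torch.randn(6, 256)
adjacency = (torch.rand(6, 6) > 0.5).float()   # placeholder interaction graph
out = GraphConvLayer(256, 128)(features, adjacency)
```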
Fully Convolutional Networks (FCN). Unlike a standard CNN, an FCN is an end-to-end network in which the Fully Connected (FC) layer [62] or MLP (Multi-Layer Perceptron) network [63] usually placed on top of a CNN-like network is replaced by convolutions, so that filters are learned at all levels, including the decision-making layers. This allows FCNs to learn representations and scoring based on local features/data. Ref. [50] exploits this fact to propose an end-to-end architecture for generic motion models for autonomous vehicles in crowded contexts. The authors propose a so-called dilated FCN approach: taking advantage of pre-trained CNN models, the second and fifth pooling layers are removed, and dilated convolution layers replace the third convolution layer through FC7. The difference between the usual strided convolution and the dilated one is that the latter expands the filter's receptive field, by inserting gaps between the filter elements, before performing the convolution.
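The contrast between a strided and a dilated convolution can be illustrated as follows (a generic PyTorch sketch, not the exact configuration of [50]):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)  # hypothetical feature map

# Strided convolution: enlarges the receptive field but downsamples the output.
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)

# Dilated convolution: inserts gaps in the kernel, enlarging the receptive
# field while keeping the spatial resolution (and the same parameter count).
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)

print(strided(x).shape)  # torch.Size([1, 64, 28, 28])
print(dilated(x).shape)  # torch.Size([1, 64, 56, 56])
```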
Among other applications of FCNs is semantic segmentation, where the output of the model has the same resolution as the input images, with a class prediction for each pixel. Since the pioneering work of Long et al. [64], in which the base structure of the model consists of an encoder and a decoder stream, many variants have been proposed. Most of these works focus on improving segmentation accuracy. However, real-time performance is very important for autonomous driving. Combining light architectures such as SkipNet [64] and ShuffleNet [65] for the encoder and decoder parts, respectively, allows segmentation rates of about 16 FPS on a Jetson TX2 while maintaining high accuracy [66].
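As an illustration of such an encoder-decoder segmentation model, the following sketch runs a pre-trained FCN from torchvision to obtain a per-pixel class map (the backbone and input size are illustrative, not those of [66]):

```python
import torch
from torchvision.models.segmentation import fcn_resnet50

# Pre-trained FCN with a ResNet-50 encoder; the decoder upsamples back to input resolution.
model = fcn_resnet50(weights="DEFAULT").eval()

image = torch.rand(1, 3, 512, 1024)      # placeholder road-scene image
with torch.no_grad():
    logits = model(image)["out"]         # (1, num_classes, 512, 1024)
segmentation = logits.argmax(dim=1)      # one class prediction per pixel
```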
3D CNNs. 3D CNNs are an extension of CNNs in which a 3D activation map is generated by the convolution layers. The intuition behind this is to encode data represented in more than two dimensions, such as volumetric or temporal data. 3D CNNs have been used in different contexts such as 3D shape estimation [
67], human activity detection [
68], and recently explored in RBA systems [
69] to further enhance the understanding of drivers' behaviors. Indeed, a TRB (Temporal Reasoning Block) has been introduced to model the causes of behaviors. The aim of this 3D CNN is to discriminate spatio-temporal representations with attention saliency mechanisms. Taking coarse-grained videos as inputs, the main contribution is a novel reasoning block composed of two layers: the first one consists of a fine-grained 3D convolution and the second one preserves temporal continuity.
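The basic building block of such spatio-temporal models is the 3D convolution, which slides a kernel over time as well as space; a minimal sketch with hypothetical clip dimensions (unrelated to the TRB of [69]) is:

```python
import torch
import torch.nn as nn

# A short video clip: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)

# 3D convolution: the kernel spans 3 frames as well as a 3x3 spatial window,
# so the activation map is itself three-dimensional (time x height x width).
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)

features = conv3d(clip)   # torch.Size([1, 64, 16, 112, 112])
```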
CNN Feature Extractors. With the availability of large-scale visual data, along with optimization algorithms and powerful CPUs/GPUs, it has become possible to train deep networks that achieve impressive performance on roughly all the challenging tasks. The obtained models are shared with the community in order to avoid training from scratch and to make it easier for researchers to enhance the available models or reuse them as feature extractors. For example, a CNN that has been trained to classify road objects outputs features from the low-level to the high-level layers with increasing complexity and abstraction. Complexity, in this case, goes from pixels, blobs, circles, wheels, stuff, faces, and hoods up to bicycles, cars, pedestrians, and the whole scene. A number of neurons are expected to be activated at each of these abstraction levels. The key point is that another classification task, devoted for example to indoor or in-the-wild scenes, can make use of the same low- and mid-level features, which are present in all domains.
The ImageNet [
21] and the COCO [
70] pre-trained models are widely considered for RBA tasks and reused as backbones for feature extraction. The neural network (meta-)architectures reviewed in this survey make use of state-of-the-art backbones along with task-specific layers (e.g., classification or detection heads). Thus, choosing the right feature extractor is important since its properties (mainly the type of layers and the number of parameters) directly affect the performance of the whole network. Barring the few examples that decouple the backbone from the meta-architecture [
71], the main feature extractors used in the related works are AlexNet [
21], ResNet-50 [
58], VGG-16 [
56], Inception-v2 [
72], and InceptionResnet-v2 [
73], used in the following non-exhaustive list of papers detailed above [
16,
17,
18,
32,
43], respectively.
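A typical way of reusing such a pre-trained backbone purely as a feature extractor is sketched below (here an ImageNet pre-trained ResNet-50 with its classification head removed; the choice of backbone and input size is illustrative):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights="IMAGENET1K_V2")
backbone.fc = nn.Identity()         # drop the ImageNet classification head
backbone.eval()

frame = torch.rand(1, 3, 224, 224)  # placeholder road-scene frame
with torch.no_grad():
    feature = backbone(frame)       # 2048-d descriptor for downstream RBA heads
```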
4.2. Deep Recurrent Neural Networks
Despite their impressive performance on image analysis tasks, CNNs examine only the current input and fail to handle sequential data. RNNs (Recurrent Neural Networks), by contrast, are designed to process sequential data [74]. Often composed of a single node with internal memory, RNNs operate by memorizing their outputs and feeding them back as inputs, repeating this until the output of the layer is predicted. Thus, RNNs retain information from the past and look for patterns over time and across the length of the sequence. Applied to ADAS and AV systems, such end-to-end machines have recently been used to automatically model non-linear discriminative representations and improve the performance of vision-based analysis tools related to RBA. We can categorize the related work, depending on the architecture, into two groups: approaches based on Long Short-Term Memory (LSTM) [
75] and GRU (Gated Recurrent Units) [
76].
To start, an LSTM network has been used in Ref. [
17] to propose the so-called ADMD system, composed of four blocks, to learn driving maneuvers as spatio-temporal sequences. The features that serve as inputs to the predictor, i.e., the LSTM network, were transferred from a CNN model (InceptionResnet-v2 [
73]), as in [
22], an attention map generator, and raw vehicle signals.
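The kind of pipeline described here, per-frame CNN features fed to an LSTM that classifies the maneuver, can be sketched as follows (generic PyTorch code with hypothetical feature and class dimensions, not the actual ADMD implementation):

```python
import torch
import torch.nn as nn

class ManeuverLSTM(nn.Module):
    """Classify a maneuver from a sequence of per-frame CNN feature vectors."""
    def __init__(self, feat_dim=1536, hidden_dim=256, num_maneuvers=6):  # hypothetical sizes
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_maneuvers)

    def forward(self, features):              # features: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(features)     # keep the last hidden state
        return self.head(h_n[-1])

clip_features = torch.randn(2, 30, 1536)      # e.g., 30 frames of CNN features
maneuver_logits = ManeuverLSTM()(clip_features)
```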
Similarly, a novel network architecture called TraPHic is introduced in [
39] to predict road-agent trajectories in complex contexts. The issue with LSTMs under such circumstances is their inability to model relationships between heterogeneous road agents, since the parameters of one LSTM unit are independent of the others. To capture the temporal dependencies of objects' spatial coordinates, LSTMs are employed and combined with a CNN to boost the learning of local object relationships in space and time. In Ref. [43], a two-stream deep architecture for event proposal and prediction is proposed. The first stream proposes the key-frames of the event sequences; a standard LSTM classifier is employed to predict the corresponding class among approaching, entering, and passing. The output of this first stream, the candidate frames, is then forwarded to the second, prediction stream, where the frames are aggregated through GAP to output the event class.
As for GRU networks applied to RBA tasks, this is still a niche area. For instance, Hong et al. [51] propose a new encoder-decoder architecture in which an RNN composed of a single GRU cell is employed as the decoder. Unlike LSTM units, GRUs can control the inputs without memorizing intermediate states, which allows a less complex decoder block in this encoder-decoder architecture [51]. A more elaborate unit, called GRFU (Gated Recurrent Fusion Unit), has been proposed in Ref. [44] with the aim of learning temporal data and fusion simultaneously. Precisely, the novel gating mechanism learns a representation of every single mode, i.e., sensor, for each instance in order to infer the best fusion strategy among Late Recurrent Summation (LRS), Early Gated Recurrent Fusion (EGRF), and Late Gated Recurrent State Fusion (LGRF).
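To illustrate the lighter gating of a GRU compared to an LSTM, the following sketch uses a single GRU cell as the decoder of a simple encoder-decoder trajectory predictor (hypothetical dimensions and horizon; not the architecture of [51]):

```python
import torch
import torch.nn as nn

class GRUTrajectoryDecoder(nn.Module):
    """Decode a fixed-length future trajectory from an encoded scene/agent state."""
    def __init__(self, state_dim=128, horizon=12):  # hypothetical sizes
        super().__init__()
        self.cell = nn.GRUCell(2, state_dim)   # input: previous (x, y) position
        self.out = nn.Linear(state_dim, 2)     # output: next (x, y) position
        self.horizon = horizon

    def forward(self, encoded_state, last_position):
        hidden, position, trajectory = encoded_state, last_position, []
        for _ in range(self.horizon):
            hidden = self.cell(position, hidden)   # single GRU cell, no separate cell state
            position = self.out(hidden)
            trajectory.append(position)
        return torch.stack(trajectory, dim=1)      # (batch, horizon, 2)

future = GRUTrajectoryDecoder()(torch.randn(4, 128), torch.zeros(4, 2))
```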