In recent years, the autonomous driving field has developed rapidly, attracting considerable interest and expanding into many sub-fields that cover all aspects of the self-driving vehicle. Examples are vehicle-to-vehicle communications, energy-storage devices, sensors, safety devices, and more. Among them, a fundamental field is scene understanding, a challenging Computer Vision (CV) task that processes raw environmental data to construct a representation of the scene in front of the car, which in turn enables subsequent interaction with the environment (e.g., route planning, safety brake engagement, packet transmission optimization, etc.).
Scene understanding is the process of perceiving, analysing, and elaborating an interpretation of an observed scene through a network of sensors. It involves several complex tasks, from image classification to more advanced ones like object detection and Semantic Segmentation (SS). The first task assigns a global label to an input image; however, it is of limited use in the autonomous driving scenario, given the need to localize the various elements in the environment. The second task provides a more detailed description, localizing all identified objects and providing classification information for them. The third task is the most challenging one, requiring the assignment of a class to each pixel of an input image.
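As a rough illustration of how the output of image classification differs from that of semantic segmentation, the following minimal Python sketch converts class scores into labels. The class list, the score values, and the function names `classify_image` and `segment_image` are hypothetical and chosen only for illustration; in practice the scores would come from a trained network.

```python
# Hypothetical class list, used only for this illustration.
CLASSES = ["road", "car", "pedestrian"]

def argmax(scores):
    """Return the index of the largest value in a flat list of scores."""
    return max(range(len(scores)), key=lambda i: scores[i])

def classify_image(image_scores):
    """Image classification: one score vector -> one global label."""
    return CLASSES[argmax(image_scores)]

def segment_image(pixel_scores):
    """Semantic segmentation: one score vector per pixel -> one label per pixel.

    pixel_scores[r][c] is the list of class scores for pixel (r, c).
    """
    return [[CLASSES[argmax(s)] for s in row] for row in pixel_scores]

# A toy 2x2 "image" with made-up per-pixel scores.
scores = [
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
    [[0.8, 0.1, 0.1], [0.1, 0.2, 0.7]],
]
label_map = segment_image(scores)
```

The classification output is a single label, whereas `label_map` has the same spatial layout as the input, which is what makes the result usable for localizing elements in the scene.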
2. Semantic Segmentation with Deep Learning
A graphic example of a possible deployment of the task in autonomous driving scenarios is reported in Figure 1.
Figure 1. The car screen shows an example of semantic segmentation of the scene in front of the car.
Early approaches to semantic segmentation applied classifiers to small image patches, until the introduction of deep learning enabled great improvements in this field as well.
The first approach to showcase the potential of deep learning on this task introduced an end-to-end convolutional model, the so-called Fully Convolutional Network (FCN), which is made of an encoder (or contraction segment) and a decoder (or expansion segment). The former maps the input into a low-resolution feature representation, which is then upsampled in the expansion block. The encoder (also called backbone) is typically a pretrained image classification network used as a feature extractor. Among these networks, popular choices are VGG, ResNet, or the more lightweight MobileNet.
Other remarkable architectures that followed FCN are ParseNet (Liu et al.), which models global context directly rather than relying only on a larger receptive field, and DeconvNet (Noh et al.), which proposes an architecture containing overlapping deconvolution and unpooling layers to perform nonlinear upsampling, improving performance at the cost of a more complex training procedure.
A slightly different approach is proposed in the Feature Pyramid Network (FPN), developed by Lin et al., where a bottom-up pathway, a top-down pathway, and lateral connections are used to join low-resolution and high-resolution features and to better propagate low-level information through the network. Inspired by the FPN model, Chen et al. propose the DeepLab architecture, which adopts pyramid pooling modules wherein the feature maps are implicitly downsampled through dilated convolutions of different rates. According to the authors, dilated convolutions allow for an exponential increase in the receptive field without a decrease in resolution or an increase in the number of parameters, as may happen with traditional pooling or stride-based approaches. Chen et al. further extended the work by employing depth-wise separable convolutions.
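The claim about the receptive field can be checked with a short calculation. The sketch below is not taken from the DeepLab work; it assumes a stack of stride-1 convolutions with a 3x3 kernel and doubling dilation rates (1, 2, 4, 8), both chosen only for illustration, and uses the standard rule that each layer widens the receptive field by (k - 1) * d pixels along one axis.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (pixels along one axis) of stacked stride-1 convolutions.

    A layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

def weights_per_channel_pair(kernel_size, num_layers):
    """Kernel weights per input/output channel pair; dilation does not change it."""
    return num_layers * kernel_size * kernel_size

# Four 3x3 layers without dilation versus four 3x3 layers with doubling rates.
plain = receptive_field(3, [1, 1, 1, 1])      # 9
dilated = receptive_field(3, [1, 2, 4, 8])    # 31
same_cost = weights_per_channel_pair(3, 4)    # 36 in both cases
```

With doubling rates the field grows as 2^(n+1) - 1 for n layers of 3x3 kernels, whereas without dilation it grows only linearly (2n + 1), and the number of weights is identical in both cases.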
The current objective in semantic segmentation is to improve multiscale feature learning while trading off a low inference time against a larger receptive field and upsampling capability.
One recent strategy is feature merging through attention-based methods. Such techniques have recently gained a lot of traction in Computer Vision, following their success in Natural Language Processing (NLP) tasks. The most famous approach of this class is the transformer architecture, introduced by Vaswani et al. in 2017 in an effort to reduce the dependence of NLP architectures on recurrent blocks, which have difficulty handling long-range relationships between input data. This architecture has been adapted to the image understanding field in the Vision Transformer (ViT) work, which presents a convolution-free, transformer-based vision approach able to surpass previous state-of-the-art techniques in image classification (at the cost of much higher memory and training data requirements). Transformers have also been used for semantic segmentation in numerous works.
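The core operation of the transformer is scaled dot-product attention, softmax(QK^T / sqrt(d)) V. The following is a minimal pure-Python sketch of that operation, under simplifying assumptions: the learned query, key, and value projections and the multi-head structure are omitted, and the toy `tokens` are made-up values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a flat list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Self-attention: the same toy tokens act as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = attention(tokens, tokens, tokens)
```

Every output vector depends on all input tokens in a single step, which is how the architecture relates distant inputs without recurrent blocks.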
Although semantic segmentation was originally tackled with RGB data, many researchers have recently started investigating its application to LiDAR data. The development of such approaches is supported by an ever-increasing number of datasets that provide labeled training samples, e.g., Semantic KITTI. In more detail, PointNet was one of the first general-purpose 3D point cloud segmentation architectures; although it achieved state-of-the-art results on indoor scenes, the sparse nature of LiDAR data led to a significant performance decrease in outdoor settings, limiting its applicability in autonomous driving scenarios. An evolution of this technique is RandLA-Net, where an additional grid-based downsampling step is added as preprocessing, together with a feature aggregation based on random-centered KD-trees, to better handle the sparse nature of LiDAR samples. Other approaches are SqueezeSeg and RangeNet, wherein the segmentation is performed by a CNN architecture: the LiDAR data is converted to a spherical coordinate representation, which makes it possible to exploit 2D semantic segmentation techniques developed for images. The most recent and best-performing of these architectures is Cylinder3D, which exploits prior knowledge of LiDAR topologies (in particular their cylindrical aspect) to better represent the data fed into the architecture. The underlying idea is that the density of points in each voxel decreases with the distance from the sensor; therefore the architecture samples the data on a cylindrical grid rather than a cuboid one, leading to a more uniform point density.
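The two coordinate changes mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not the code of any of the cited architectures: the image size (64 x 2048), the vertical field of view (+3° to -25°), the cylindrical grid size, and the range limits are all assumed values, and the function names are hypothetical.

```python
import math

def spherical_pixel(x, y, z, height=64, width=2048,
                    fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map one LiDAR point to (row, col, range) in a 2D range image."""
    rng = math.sqrt(x * x + y * y + z * z)
    if rng == 0.0:
        raise ValueError("a point at the sensor origin has no direction")
    yaw = math.atan2(y, x)          # azimuth in [-pi, pi]
    pitch = math.asin(z / rng)      # elevation in [-pi/2, pi/2]
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down
    col = 0.5 * (1.0 - yaw / math.pi) * width
    row = (1.0 - (pitch - fov_down) / fov) * height
    # Clamp to valid pixel indices (points outside the field of view hit the border).
    col = min(width - 1, max(0, int(col)))
    row = min(height - 1, max(0, int(row)))
    return row, col, rng

def cylindrical_voxel(x, y, z, grid=(480, 360, 32),
                      rho_max=50.0, z_min=-4.0, z_max=2.0):
    """Map one LiDAR point to a (radius, angle, height) cell of a cylindrical grid."""
    rho = math.hypot(x, y)
    phi = math.atan2(y, x)
    fractions = (rho / rho_max,
                 (phi + math.pi) / (2.0 * math.pi),
                 (z - z_min) / (z_max - z_min))
    return tuple(min(n - 1, max(0, int(f * n))) for f, n in zip(fractions, grid))

pixel = spherical_pixel(1.0, 0.0, 0.0)
cell = cylindrical_voxel(10.5, 0.0, 0.0)
```

In the cylindrical grid, cells far from the sensor cover a larger area than cells close to it, which compensates for the lower point density at long range.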
RGB data carries a wealth of visual and textural information, which in many cases has been used successfully to enable semantic segmentation. Nevertheless, depth measurements provide useful geometric cues, which help significantly in resolving visual ambiguities, e.g., distinguishing between two objects with a similar appearance. Moreover, RGB cameras are sensitive to light and weather conditions, which can lead to failures in outdoor environments. Thermal cameras capture temperature-based characteristics of objects, which can improve the recognition of some objects and thereby the resilience of semantic scene understanding in challenging lighting conditions.
3. Multimodal Segmentation Techniques in Autonomous Driving
Table 1 summarizes the methods, comparing them according to:
modalities used for the fusion;
datasets used for training and validation;
approach to feature fusion (e.g., sum, concatenation, attention, etc.); and
fusion network location (e.g., encoder, decoder, specific modality branch, etc.).
Table 1. Summary of recent multimodal semantic segmentation architectures. Modality shorthand: Dm, raw depth map; Dh, depth HHA; De, depth estimated internally; E, event camera; T, thermal; Lp, light polarization; Li, LiDAR; Ls, LiDAR spherical; F, optical flow. Location: D, decoder; E, encoder. Direction: D, decoder; C, color; B, bi-directional; M, other modality.
Table 2, on the other hand, reports the numerical score (mIoU) attained by the methods on three benchmark datasets: Cityscapes for 2.5D SS in Table 2a, KITTI for 2D + 3D SS in Table 2b, and MSSSD/MF for RGB + Thermal SS in Table 2c.
Table 2. Architectures Performance Comparison.
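For reference, the mIoU score used in Table 2 is the per-class intersection over union, TP / (TP + FP + FN), averaged over the classes. The following is a minimal sketch on flattened label lists; it assumes integer class indices, skips classes that appear in neither the ground truth nor the prediction, and does not handle the "ignore" labels that the benchmarks' own evaluation protocols may define.

```python
def confusion_matrix(gt, pred, num_classes):
    """m[g][p] counts pixels with ground-truth class g predicted as class p."""
    m = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        m[g][p] += 1
    return m

def mean_iou(gt, pred, num_classes):
    """Mean intersection over union across the classes that occur."""
    m = confusion_matrix(gt, pred, num_classes)
    ious = []
    for c in range(num_classes):
        tp = m[c][c]
        fn = sum(m[c]) - tp                                   # class c missed
        fp = sum(m[r][c] for r in range(num_classes)) - tp    # wrongly predicted as c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with four pixels and two classes (made-up labels).
score = mean_iou(gt=[0, 1, 1, 1], pred=[0, 0, 1, 1], num_classes=2)
```

In this toy example class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so the mean is 7/12.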
Early attempts at multimodal semantic segmentation combined RGB data and other modalities into multi-channel representations that were then fed into classical semantic segmentation networks based on the encoder-decoder framework. This simple early-fusion strategy is not very effective because it struggles to capture the different types of information carried by the different modalities (e.g., RGB images contain color and texture, whereas the other modalities typically better represent the spatial relations among objects). Following this reasoning, feature-level and late-fusion approaches have been developed. Fusion strategies have typically been categorized into early, feature-level, and late fusion, depending on whether the fusion happens at the input level, at some intermediate stage, or at the end of the understanding process. However, most recent approaches try to get the best of the three strategies by performing multiple fusion operations at different stages of the deep network.
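The difference between the two extremes of this categorization can be illustrated with a short Python sketch. It is only a toy example on nested lists: `early_fusion` stacks the modalities into one multi-channel input, while `late_fusion` averages per-pixel class scores predicted separately from each modality. Averaging is just one possible late-fusion rule, assumed here for simplicity, and both function names are hypothetical.

```python
def early_fusion(rgb, depth):
    """Stack an H x W x 3 RGB image and an H x W depth map into H x W x 4."""
    return [[list(px) + [d] for px, d in zip(rgb_row, depth_row)]
            for rgb_row, depth_row in zip(rgb, depth)]

def late_fusion(scores_a, scores_b):
    """Average per-pixel class scores produced independently for each modality."""
    return [[[0.5 * (a + b) for a, b in zip(px_a, px_b)]
             for px_a, px_b in zip(row_a, row_b)]
            for row_a, row_b in zip(scores_a, scores_b)]

# One-pixel toy inputs with made-up values.
fused_input = early_fusion([[(10, 20, 30)]], [[0.5]])
fused_scores = late_fusion([[[1.0, 0.0]]], [[[0.0, 1.0]]])
```

With early fusion a single network sees all channels from the first layer, whereas with late fusion each modality is processed by its own network and only the predictions are merged.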
A very common architectural choice is to adopt a multi-stream encoder, with one network branch processing each modality (e.g., a two-stream architecture for RGB and depth), plus additional network modules that connect the branches, combining modality-specific features into fused ones and/or carrying information across branches. This hierarchical fusion strategy leverages multilevel features via progressive feature merging and generates a refined feature map. It entails fusing features at various levels rather than only at the early or late stages.
The feature fusion can take place through simple operations (e.g., concatenation, element-wise addition, multiplication, etc.) or through a mixture of these, which is typically referred to as a fusion block, attention module, or gate module. In this fashion, multi-level features can be fed from one modality to another (e.g., depth cues fed to the RGB branch) or exchanged mutually between modalities. The fused content can either reach the next layer or reach the decoder directly through skip connections.
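The fusion operations listed above can be sketched on plain feature vectors as follows. The gated variant is a hypothetical, minimal gate (a single scalar sigmoid gate computed from both inputs) and does not reproduce the fusion block of any specific architecture; `gate_weights` and `gate_bias` stand in for parameters that would normally be learned.

```python
import math

def fuse_concat(a, b):
    """Concatenation: keeps both feature vectors, doubling the channel count."""
    return a + b

def fuse_add(a, b):
    """Element-wise addition."""
    return [x + y for x, y in zip(a, b)]

def fuse_mul(a, b):
    """Element-wise multiplication."""
    return [x * y for x, y in zip(a, b)]

def fuse_gated(a, b, gate_weights, gate_bias=0.0):
    """Gated fusion: a gate g in (0, 1) blends the inputs as g * a + (1 - g) * b."""
    z = sum(w * v for w, v in zip(gate_weights, a + b)) + gate_bias
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * x + (1.0 - g) * y for x, y in zip(a, b)]

# Made-up RGB and depth feature vectors for one spatial location.
rgb_feat, depth_feat = [2.0, 4.0], [0.0, 0.0]
blended = fuse_gated(rgb_feat, depth_feat, gate_weights=[0.0, 0.0, 0.0, 0.0])
```

With all-zero gate weights the gate equals 0.5, so `blended` is the plain average of the two inputs; learned weights would let the network favour one modality depending on the content.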
The segmentation map is typically computed by a decoder that takes as input the fused features and/or the output of some of the branches. Multiple decoders can also be used, but this is a less common choice.