Robot Environment Perception for Navigation: Comparison

The basic principle of robot perception is sensing the robot's external and internal environment by extracting raw sensor data and interpreting it. In both modular and end-to-end robot navigation approaches, sensors play a critical role in capturing the environment or internal robot attributes for robot perception. A sensor modality refers to sensors that take a particular form of energy as input and process the signal in similar ways. Modalities include raw input types such as sound, pressure, light (infrared, visible), or magnetic fields. Robot perception sensor modalities commonly include cameras (infrared, RGB, or depth), LiDAR, radar, sonar, GNSS, IMU, and odometry sensors.

  • unstructured environments
  • mobile robots
  • robot navigation
  • perception
  • robot vision
  • sensor fusion

1. Vision and Ranging Sensors

Researchers usually incorporate camera sensors in vision applications to retrieve environmental information for mobile robots. However, LiDAR sensors have shown more reliability than cameras in low-light environmental conditions and produce highly accurate depth measurements. LiDAR sensors come with 2D or 3D mapping capability [1], and these sensors can generate high-fidelity point clouds of outdoor environments. These unstructured point clouds, however, tend to become increasingly sparse as the sensing range increases.
  • Vision-based Sensor Types
Vision is crucial for a robot to navigate in unknown environments. Robots are equipped with vision-based sensors like cameras to understand environmental information. Cameras are generally cheaper than LiDAR sensors and are a type of passive sensor (although some depth cameras use active sensing). The monocular configuration is widely used in standard RGB cameras, for example, the GoPro Hero camera series. These are compact passive sensors that consume low power, approximately 5 watts, depending on resolution and mode settings. Monocular SLAM has been explored in the research literature due to its simple hardware design [2][3]. However, the algorithms used for it are very complex because depth measurements cannot be directly retrieved from static monocular images. Monocular cameras also suffer from pose estimation problems [4]. The pose of the camera is obtained by referencing previous poses; hence, errors in pose estimation propagate through the process, a phenomenon called scale drift [5][6][7].
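To give a feel for how scale drift compounds, the following toy sketch (with hypothetical step lengths and a one-percent per-frame scale error) chains frame-to-frame translation estimates whose scale errors multiply along the trajectory:

```python
import numpy as np

# Minimal scale-drift illustration (hypothetical values): monocular visual
# odometry recovers each frame-to-frame translation only up to scale, so a
# small per-frame scale error compounds multiplicatively along the trajectory.
rng = np.random.default_rng(0)
true_step = 1.0                      # metres travelled between frames
scale_error_std = 0.01               # 1% per-frame scale estimation error
n_frames = 500

true_position = 0.0
estimated_position = 0.0
accumulated_scale = 1.0
for _ in range(n_frames):
    accumulated_scale *= 1.0 + rng.normal(0.0, scale_error_std)
    true_position += true_step
    estimated_position += accumulated_scale * true_step

print(f"true distance:      {true_position:.1f} m")
print(f"estimated distance: {estimated_position:.1f} m")
print(f"drift:              {estimated_position - true_position:+.1f} m")
```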
Stereo cameras are inspired by human eyes and use two lenses with separate passive image sensors to obtain two perspectives of the same scene. They have been used in indoor and outdoor SLAM applications. These cameras use the disparity between the two images to calculate depth information, and stereo vision does not suffer from scale drift. Popular commercially available stereo cameras include the Bumblebee 2, Bumblebee XB3, Surveyor stereo vision system, Capella, Minoru 3D Webcam, Ensenso N10, ZED 2, and PCI nDepth vision system. Stereo camera power consumption is generally between 2 and 15 watts. The maximum range of stereo cameras varies from 5 to 40 m at different depth resolution values, and the accuracy of these sensors varies from around a few millimetres to 5 cm at the maximum range [8]. The cost of these sensors ranges from one hundred to several thousand Australian dollars. RGB-D camera sensors consist of monocular or stereo cameras coupled to infrared transmitters and receivers. The Kinect camera from Microsoft is a relatively inexpensive RGB-D sensor that provides colour images and depth information for image pixels. The Kinect sensor is mainly used for indoor robot applications because sunlight saturates its infrared receivers in outdoor scenarios [9]. The Kinect sensor has three major versions: Kinect 1, Kinect 2, and Azure Kinect (the latest version). Kinect 1 uses structured light for depth measurement, and the other models use Time-Of-Flight (TOF) as the depth measuring principle. The newest model, Azure Kinect, also generates substantial noise in outdoor bright-light conditions, with a practical measuring range below 1.5 m [10]. In general, RGB-D sensors utilise three depth measurement principles: structured light, TOF, and active infrared stereo. Structured light RGB-D sensors underperform compared to TOF techniques in measuring the range of distant objects, and the structured light technique is vulnerable to multi-device interference. TOF methods suffer from multi-path interference and motion artefacts. The active infrared stereo principle has drawbacks due to common stereo matching problems such as occluded pixels and flying pixels near contour edges [11]. Active infrared stereo cameras can be seen as an extension of passive stereo cameras; they offer more reliable performance in indoor and some outdoor scenarios but require high computational capability. Table 1 shows the main depth measurement methods used in RGB-D sensors.
Table 1. Depth sensor modalities.
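The disparity-to-depth relationship used by the stereo cameras described above is Z = fB/d, with focal length f in pixels, baseline B, and disparity d. The short sketch below uses illustrative camera parameters (not those of any specific product) and also shows why depth error grows roughly quadratically with range for a fixed disparity error:

```python
import numpy as np

# Stereo depth from disparity: Z = f * B / d, where f is the focal length in
# pixels, B the baseline between the two lenses, and d the per-pixel disparity.
# The numbers below are illustrative, not taken from any specific camera.
focal_length_px = 700.0          # focal length in pixels
baseline_m = 0.12                # distance between the two camera centres (m)

disparity_px = np.array([140.0, 35.0, 7.0, 2.1])   # matched pixel disparities
depth_m = focal_length_px * baseline_m / disparity_px
print(depth_m)   # larger disparity -> closer object; small disparities -> distant object

# Depth uncertainty grows roughly quadratically with range for a fixed
# disparity matching error: dZ ~= Z^2 / (f * B) * dd.
disparity_error_px = 0.25
depth_error_m = (depth_m ** 2) / (focal_length_px * baseline_m) * disparity_error_px
print(depth_error_m)
```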
Event cameras are asynchronous sensors that use different visual information acquisition principles than standard frame-based image sensors. This camera type samples light based on changes in scene dynamics, asynchronously measuring per-pixel brightness changes. These cameras currently cost several thousand dollars. Event cameras have advantages such as very high temporal resolution, high dynamic range, low latency (on the order of microseconds), and lower power consumption than standard cameras. However, this camera type is not suitable for the detection of static objects. The main burden is the need for new methods (algorithms and hardware, e.g., neuromorphic approaches or spiking neural networks) to process event camera outputs, because traditional image processing methods are not directly applicable [12].
Omnidirectional cameras are utilised in robotic applications where more information is needed about the surrounding environment. These cameras provide a wider-angle view than conventional cameras, which typically have a limited field of view. Due to the lens configuration of omnidirectional cameras, the obtained images are distorted. Therefore, these cameras require different mathematical models, such as the unified projection model, to correct image distortions. A summary of these different camera types and their advantages and disadvantages is shown in Table 2.
Table 2. Camera configurations.
  • Active Ranging Sensors
Active sensors emit energy into the environment and measure the return signal. There are several types of active ranging sensors, such as reflectivity, ultrasonic, laser rangefinder (e.g., LiDAR), optical triangulation (1D), and structured light (2D) sensors. The commonly used LiDAR sensor is an active ranging sensor that shows improved performance compared to ultrasonic TOF sensors for perception in autonomous navigation systems [14]. LiDAR imaging is one of the most studied topics in the optoelectronics field. LiDAR sensors use the TOF measurement principle to obtain depth measurements in different environments. Rotating LiDAR imagers were the first type to successfully achieve acceptable performance, using a rotating-mirror mechanical configuration with multiple stacked laser detectors [15]. New trends in LiDAR technology are towards the development of low-cost, compact, solid-state commercial LiDAR sensors [16][17]. The three most widely used LiDAR techniques utilise pulsed, amplitude-modulated, and frequency-modulated laser beams. The most common commercially available method uses a pulsed laser beam, which directly measures the time taken by the pulsed signal to return to the sensor. Such sensors require time measurements with picosecond resolution (high-speed photon detectors). Therefore, the cost of pulsed LiDAR sensors is comparably higher than the other two methods for equivalent range and resolution [18]. The range of a pulsed LiDAR is limited by the signal-to-noise ratio (SNR) of the sensor. Pulsed LiDAR sensors are suitable for both indoor and outdoor environments because their instantaneous peak pulse power is high relative to ambient irradiance noise. However, Amplitude Modulated Continuous Wave (AMCW) LiDAR sensors with the same average signal power have lower continuous-wave peaks and are hence vulnerable to solar irradiance. AMCW LiDAR sensors are therefore popular in indoor robot applications, where their lower SNR is less of a limitation. In general, there is no significant depth resolution difference between pulsed and AMCW LiDAR, but pulsed LiDAR may outperform AMCW LiDAR accuracy at the same optical power levels because of the higher SNR. Increasing the modulation frequency of AMCW LiDAR sensors improves depth resolution but reduces the ambiguity distance. Thus, pulsed sensors have longer ranges than AMCW sensors. AMCW sensors also have slower frame rates than pulsed sensors and usually underperform in dynamic environments. Frequency Modulated Continuous Wave (FMCW) sensors have higher depth resolution than pulsed and AMCW LiDAR sensors. These sensors can measure the depth and velocity of targets simultaneously, a highly advantageous feature for the autonomous vehicle industry. They can also avoid interference from other LiDAR sensors because of the frequency modulation. FMCW sensors, however, require accurate frequency sweeps to generate emitter signals, which is a challenging task. FMCW sensor technology has been under continuous development but has not yet established itself at the commercial level. A summary of these three LiDAR technologies is provided in Table 3.
Table 3. LiDAR sensor types.
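The range relationships summarised above can be made concrete with a short numeric sketch based on the standard time-of-flight and ambiguity-distance formulas (d = ct/2 and d_amb = c/(2·f_mod)); all parameter values below are illustrative:

```python
# Pulsed vs. AMCW LiDAR: back-of-the-envelope range relationships.
# All parameter values are illustrative only.
C = 299_792_458.0                     # speed of light (m/s)

# Pulsed (direct) TOF: distance from the measured round-trip time, d = c * t / 2.
round_trip_time_s = 400e-9            # 400 ns echo delay
pulsed_range_m = C * round_trip_time_s / 2.0
print(f"pulsed TOF range: {pulsed_range_m:.1f} m")

# Timing resolution needed for centimetre-level depth resolution,
# dt = 2 * dz / c, which lands in the tens of picoseconds.
depth_resolution_m = 0.01
timing_resolution_s = 2.0 * depth_resolution_m / C
print(f"required timing resolution: {timing_resolution_s * 1e12:.1f} ps")

# AMCW: the phase measurement repeats every modulation wavelength, so the
# unambiguous range is d_amb = c / (2 * f_mod); raising f_mod improves
# resolution but shrinks the ambiguity distance.
for f_mod_hz in (10e6, 30e6, 100e6):
    ambiguity_range_m = C / (2.0 * f_mod_hz)
    print(f"f_mod = {f_mod_hz / 1e6:5.0f} MHz -> ambiguity distance = {ambiguity_range_m:6.1f} m")
```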
The concept of robotic perception pertains to various robotics applications that utilise sensory information and modern deep learning methods. These applications include identifying objects, creating representations of the environment, comprehending scenes, detecting humans/pedestrians, recognising activities, categorising locations semantically, scene reconstruction, and others. Towards the goal of fully autonomous robot navigation, robust and accurate environmental perception is a necessity. Passive RGB cameras are relatively low cost and can capture rich environmental detail. However, their perception abilities are vulnerable to illumination changes and occlusions. Therefore, sensor fusion methods are important to achieve robust perception.

2. LiDAR and Camera Data Fusion

Achieving a reliable real-time understanding of external 3D environments is very important for safe robot navigation. Visual perception using cameras is commonly employed in many mobile robotic systems, and camera images can be efficiently and often effectively processed by CNN-based deep learning architectures. However, relying on a single sensor can lead to robustness challenges, particularly in applications like self-driving vehicles and autonomous field robots. Therefore, different sensor modalities are often combined to achieve better reliability and robustness for autonomous systems. The fusion of LiDAR and camera data is one of the most investigated sensor fusion areas in the multimodal perception literature [19]. Camera-LiDAR fusion has been applied in various engineering fields and has shown better robustness than camera-only robot navigation approaches [20]. This fusion strategy is more effective and popular than other sensor fusion approaches such as radar-camera, LiDAR-radar, and LiDAR-thermal camera fusion. Still, technical challenges, sensor costs, and processing power requirements have constrained the application of these methods in more general, everyday settings. Recent deep learning algorithms have significantly improved the performance of camera-LiDAR fusion methods [21], with monocular and stereo cameras mainly used with LiDAR sensors to fuse images and point cloud data [22].
Deep learning-based LiDAR and camera sensor fusion methods have been applied in depth completion, object detection, and semantic and instance segmentation. Image and point cloud scene representations include volumetric, 2D multi-view projection, graph-based and point-based. In general, most of the early methods fuse LiDAR and camera image semantics using 2D CNNs. Many 2D-based networks project LiDAR points onto respective image planes to process feature maps through 2D convolutions [23][24][25][26]. Several works have used point cloud processing techniques such as PointNet [27] to extract features or 3D convolutions [28] to detect objects in volumetric representations [29]. Some other LiDAR and image fusion methods use 2D LiDAR representations for feature fusion and then cluster and segment 3D LiDAR points to generate 3D region proposals [30]. Voxel-based representations and multi-view camera-LiDAR fusion approaches are utilised to generate 3D proposals in object detection. State-of-the-art camera-LiDAR semantic segmentation methods employ feature fusion methods to obtain 2D and 3D voxel-based segmentation results. Multi-view approaches map RGB camera images onto the LiDAR Bird’s-Eye-View (BEV) plane to align respective features from the RGB image plane to the BEV plane [31][32][33][34], and several other methods propose to combine LiDAR BEV features with RGB image features directly [19][22]. These direct mapping methods use trained CNNs to align image features with LiDAR BEV features from different viewpoints.
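Many of the 2D projection-based fusion methods mentioned above share a basic preprocessing step: transforming LiDAR points into the camera frame and projecting them onto the image plane. The following is a minimal sketch of that step using a pinhole camera model; the intrinsic matrix, extrinsic transform, and synthetic points are illustrative placeholders rather than values from any particular sensor or dataset.

```python
import numpy as np

# Project LiDAR points onto the camera image plane with a pinhole model.
# K (intrinsics) and T_cam_lidar (extrinsics) are hypothetical placeholders;
# for simplicity the synthetic points are already expressed in a camera-like
# frame (z forward), so the extrinsic rotation is left as identity.
K = np.array([[700.0,   0.0, 640.0],
              [  0.0, 700.0, 360.0],
              [  0.0,   0.0,   1.0]])
T_cam_lidar = np.eye(4)
T_cam_lidar[:3, 3] = [0.02, -0.08, -0.27]   # example translation offset (m)

def project_to_image(points, K, T_cam_lidar, image_shape):
    """Return pixel coordinates and depths of the points visible in the image."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0.0]       # keep points in front of the camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                  # perspective division
    h, w = image_shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[inside], pts_cam[inside, 2]

points = np.random.uniform([-20.0, -3.0, 1.0], [20.0, 3.0, 40.0], size=(5000, 3))
pixels, depths = project_to_image(points, K, T_cam_lidar, image_shape=(720, 1280))
print(pixels.shape, depths.min(), depths.max())
```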
Computer vision has been a rapidly growing field in the past decade, and the development of machine learning methods has only accelerated this growth. Recently, deep learning strategies have driven the rapid advancement of various computer vision algorithms. Computer vision includes subtopics like object detection, depth estimation, semantic segmentation, instance segmentation, scene reconstruction, motion estimation, object tracking, scene understanding, and end-to-end learning [9]. Computer vision methods have been applied to a great extent in emerging autonomous navigation applications. However, these vision techniques may be less effective in previously unseen or complex environments and rely heavily on the domain they were trained on. Therefore, continuous improvements are being made towards the development of fully autonomous systems. Several state-of-the-art benchmarking datasets have been used to compare the performance of different autonomous driving vision methods; KITTI [35], Waymo [36], A2D2 [37], nuScenes [38], and Cityscapes [39] are some examples.
  • Dense Depth Prediction
Dense depth completion is a technique that estimates dense depth images from sparse depth measurements. Depth perception is of significant importance in many engineering industries and research applications such as autonomous robotics, self-driving vehicles, augmented reality, and 3D map construction. LiDAR sensors, monocular cameras, stereo cameras, and RGB-D cameras have been the most utilised sensors in dense depth estimation applications, but each has specific limitations in the estimation process. LiDAR sensors with high accuracy are costly to use in large-scale applications. Three main challenges have been identified in LiDAR depth completion [40]. The first relates to the fact that, in general, even expensive LiDAR sensors produce sparse measurements for distant objects; in addition, LiDAR points are irregularly spaced compared to monocular RGB images, so it is non-trivial to increase depth prediction accuracy using the corresponding colour image. Secondly, there are difficulties in combining multiple sensor modalities. The third challenge is that deep learning-based depth estimation is limited by the availability of pixel-level ground truth depth labels for training networks. Another possible approach to depth estimation is to use stereo cameras. Stereo cameras, however, require accurate calibration, have high computational requirements, and fail in featureless or uniformly patterned environments. RGB-D cameras are capable of depth sensing but have a limited measuring range and poor performance in outdoor environments. A technique called depth inpainting can be used as a depth completion method for structured light sensors like the Microsoft Kinect 1; these sensors produce relatively dense depth measurements but are generally only usable in indoor environments. Dense depth estimation techniques generally up-sample sparse and irregular LiDAR depth measurements into dense and regular depth predictions. Depth completion methods, however, still have a variety of problems to overcome. These challenges are primarily sensor-dependent, and solutions should address the respective difficulties at the algorithm development stage.
Many state-of-the-art dense depth prediction networks combine relatively dense depth measurements or sparse depth maps with RGB images to assist the prediction process. In general, retrieving dense depth detail from relatively dense depth measurements is easier than from sparse depth maps. In relatively dense depth images, a higher percentage of pixels (typically over 80%) have observed depth values, so predicting dense depth in such scenarios is relatively less complex. However, in autonomous navigation applications, 3D LiDAR measurements cover only approximately 4% of pixels when mapped onto the camera image space, which creates challenges in generating reliable dense depth images [40].
  • Dense Depth from Monocular Camera and LiDAR Fusion
Depth estimation based solely on monocular images is not reliable or robust. Therefore, to address these monocular camera limitations, LiDAR-monocular camera fusion-based depth estimation has been proposed by researchers. Using monocular RGB images and sparse LiDAR depth maps, a residual-learning-based encoder-decoder network was introduced in [41] to estimate dense depth maps. However, this method needs ground truth depth images, from which the sparse depth inputs are sampled during the network training process. In practice, obtaining such ground truth images is not simple or easily scalable [40]. To mitigate the requirement for ground truth depth images, ref. [40] presented a self-supervised model-based network that only requires a monocular RGB image sequence and sparse LiDAR depth images in the network training step. This network consists of a deep regression model that learns a transformation from a sparse LiDAR depth map to a dense map. The method achieved state-of-the-art performance on the KITTI dataset and treats pixel-level depth estimation as a deep regression problem in machine learning. LiDAR sparse depth maps store per-pixel depth, and pixels without measured depth are set to zero. The proposed network follows an encoder-decoder architecture: the encoder has a sequence of convolutions, and the decoder has a set of transposed convolutions to up-sample the spatial resolution of the feature maps. Convolved sparse depth data and colour images are concatenated into a single tensor and input to the residual blocks of ResNet-34 [42]. The self-supervised training framework requires only colour/intensity images from monocular cameras and sparse depth images from LiDAR sensors. In the network training step, a separate RGB frame (a nearby frame) is used as a supervision signal, and the LiDAR sparse depth can also be used as supervision. However, this framework requires a static environment to be able to warp the second RGB frame to the first one.
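As a rough illustration of the early-fusion input described above (a sketch, not the authors' implementation), an RGB image and a sparse depth map with zeros at unobserved pixels can be concatenated into a single four-channel tensor before being passed to a convolutional encoder; the tiny PyTorch module below stands in for the ResNet-34-based encoder:

```python
import torch
import torch.nn as nn

# Toy early-fusion input: concatenate a 3-channel RGB image and a 1-channel
# sparse depth map (zeros where the LiDAR gave no return) into a 4-channel
# tensor and pass it through a small convolutional encoder. This stands in for
# the ResNet-34-based encoder-decoder described in the text.
class TinyRGBDEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, rgb, sparse_depth):
        x = torch.cat([rgb, sparse_depth], dim=1)        # (B, 4, H, W)
        return self.net(x)

rgb = torch.rand(1, 3, 96, 320)                          # dummy colour image
sparse_depth = torch.zeros(1, 1, 96, 320)                # mostly-empty depth channel
mask = torch.rand(1, 1, 96, 320) < 0.04                  # ~4% of pixels carry a LiDAR depth
sparse_depth[mask] = torch.rand(int(mask.sum())) * 80.0  # depths up to 80 m

features = TinyRGBDEncoder()(rgb, sparse_depth)
print(features.shape)                                    # torch.Size([1, 64, 24, 80])
```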
With this implementation, the root-mean-square depth prediction error decreases as a power function as the resolution of the LiDAR sensor (i.e., the number of scan lines) increases. One limitation of this approach is that the observed environment needs to be stationary; if not, the network will not generate accurate results. Large moving objects and surfaces with specular reflectance can cause the network training process to fail. These factors reduce the applicability of this method in the dynamic situations that are often present in outdoor environments. In addition, the network training process may become stuck in local minima of the photometric loss function due to improper network weight initialisation. This effect may result in output depth images that are not close enough to the ground truth because of the erroneous training process.
In [43], a real-time sparse-to-dense depth map is constructed using a Convolutional Spatial Propagation Network (CSPN). The propagation process preserves the sparse LiDAR input depth values in the final depth map. The network aims to extract the affinity matrix for the respective image and learns this affinity matrix using a deep convolutional neural network. The network model is trained with a stochastic gradient descent optimiser. This implementation showed memory paging cost to be a dominant factor when larger images were fed into the PyTorch-based network. The CSPN network has shown good real-time performance and is thus well suited to applications such as robotics and autonomous driving. CSPN++ [44] is an improved version of the CSPN network that adaptively learns the convolutional kernel sizes and the number of propagation iterations. The network training experiments were carried out using four NVIDIA Tesla P40 Graphics Processing Units (GPUs) on the KITTI dataset. This research shows that hyper-parameter learning from weighted assembling can lead to significant accuracy improvements, and that weighted selection can reduce computational resources while achieving the same or better accuracy compared to the CSPN network.
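The propagation idea can be illustrated with a simplified, NumPy-only sketch: each iteration blends every pixel's depth with its neighbours using affinity weights (uniform here, whereas CSPN learns them per pixel with a CNN) and then re-injects the valid sparse LiDAR depths so that measured values are preserved:

```python
import numpy as np

# Simplified spatial-propagation sketch in the spirit of CSPN: each iteration
# blends every pixel's depth with its 8 neighbours using affinity weights
# (uniform here; CSPN learns them per pixel with a CNN), then re-injects the
# valid sparse LiDAR depths so the measured values are preserved.
def propagate(depth, sparse_depth, valid_mask, iterations=10, self_weight=0.5):
    neighbour_offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                         if not (dy == 0 and dx == 0)]
    neighbour_weight = (1.0 - self_weight) / len(neighbour_offsets)
    for _ in range(iterations):
        blended = self_weight * depth
        for dy, dx in neighbour_offsets:
            blended += neighbour_weight * np.roll(depth, shift=(dy, dx), axis=(0, 1))
        depth = blended
        depth[valid_mask] = sparse_depth[valid_mask]   # replacement step
    return depth

h, w = 64, 64
rng = np.random.default_rng(1)
valid_mask = rng.random((h, w)) < 0.05                 # ~5% sparse measurements
sparse_depth = np.where(valid_mask, rng.uniform(2.0, 30.0, (h, w)), 0.0)
initial_guess = np.full((h, w), sparse_depth[valid_mask].mean())
dense_depth = propagate(initial_guess, sparse_depth, valid_mask)
print(dense_depth.shape, dense_depth.min(), dense_depth.max())
```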
  • Dense Depth from Stereo Camera and LiDAR Fusion
Estimating depth using stereo cameras provides more reliable results than monocular cameras. LiDAR sensors can produce depth measurements with improved accuracy over longer ranges and varying lighting conditions. The fusion of LiDAR and stereo camera sensors produces more accurate 3D mappings of environments than LiDAR-monocular camera depth completion. However, stereo cameras commonly have shorter detection ranges, and depth estimation becomes challenging in textureless environments and high-occlusion scenarios. One of the significant works in LiDAR-stereo camera fusion is presented in [45], which introduces the first unsupervised LiDAR-stereo fusion network. The network does not require dense ground truth maps, and training is done in an end-to-end manner, showing broad generalisation capability in various real-world scenarios. The sparsity of LiDAR depth measurements can vary in real-world applications, and one advantage of the proposed network is that it handles a wide range of sparsity, up to the extreme case where the LiDAR sensor provides no depth measurements. A feedback loop connects outputs with inputs to compensate for noisy LiDAR depth measurements and misalignments between the LiDAR and stereo sensors. This network is currently regarded as one of the state-of-the-art methods for LiDAR-stereo camera fusion.
  • Multimodal Object Detection
Reliable object detection is a vital part of the autonomous navigation of robots/vehicles. Object detection in autonomous navigation means identifying and locating the various objects in an environment scene, both dynamic and static, in the form of bounding boxes. Object detection may become difficult due to sensor accuracy, lighting conditions, occlusions, shadows, reflections, etc. One major challenge in object detection is occlusion, which comes in different types; the main ones are self-occlusion, inter-object occlusion, and object-to-background occlusion [46]. Early image-based object detection algorithms commonly included two steps. The first stage divided the image into multiple smaller sections. These sections were then passed to an image classifier to determine whether each section contained an object or not. If an object was detected in an image section, the respective portion of the original image was marked with the relevant object label. The sliding window approach is one way of achieving the above-mentioned first step [47].
In contrast to the sliding window approach, a different set of algorithms groups similar pixels of an image to form regions. These regions are then fed to a classifier to identify semantic classes (with the grouping done using image segmentation methods). Further improved image segmentation can be achieved using the selective search algorithm [48], which is based on hierarchical grouping. In this method, initially detected image regions are merged in a stepwise manner by selecting the most similar segments until the whole image is represented by a single region. The regions resulting from each step are added to the region proposals and fed to a classifier, whose performance depends on the region proposal method used. This object detection approach does not produce real-time performance suitable for autonomous navigation applications. However, advances such as SSPnet [49], Fast Region-based Convolutional Neural Networks (R-CNNs) [50], and Faster R-CNN [51] were introduced to address this issue. The Faster R-CNN network generates a feature map by utilising the CNN layer output, and region proposal generation is achieved by sliding a window (comprising three different aspect ratios and sizes) over the feature map. Each sliding window position is mapped to a vector and fed to two parallel classification and regression networks. The classification network calculates the probability of region proposals containing objects, and the regression network predicts the coordinates of each proposal.
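The proposal-enumeration step of such region proposal networks can be sketched as follows; the anchor scales, aspect ratios, and feature-map stride are illustrative values rather than those of any specific implementation:

```python
import numpy as np

# Enumerate anchor boxes over a CNN feature map, in the spirit of a region
# proposal network: every feature-map cell maps back to a stride-spaced
# location in the input image and receives one anchor per (scale, ratio) pair.
def generate_anchors(feature_h, feature_w, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for cy in (np.arange(feature_h) + 0.5) * stride:
        for cx in (np.arange(feature_w) + 0.5) * stride:
            for scale in scales:
                for ratio in ratios:
                    w = scale * np.sqrt(ratio)     # box width for this aspect ratio
                    h = scale / np.sqrt(ratio)     # box height (area stays ~scale^2)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)                       # boxes as (x1, y1, x2, y2)

anchors = generate_anchors(feature_h=38, feature_w=50)
print(anchors.shape)                               # (38 * 50 * 9, 4) = (17100, 4)
```

In the full network, each such anchor would additionally receive an objectness score from the classification head and four box offsets from the regression head.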
Object detection research has been mainly employed in the autonomous vehicle industry (for vehicle and pedestrian detection [52]) and mobile robotics. In contrast to camera-only object detection, sensor fusion has been implemented in different real-world applications to obtain more accurate and robust detection results. As previously discussed, LiDAR-camera sensor fusion is among the most used and highest-performing sensor fusion approaches. LiDAR-camera fusion object detection approaches consist of two main techniques: sequential and one-step models [22]. Sequential models use 2D proposal-based methods or direct 3D proposals to detect objects. In the sequential approach, 2D/3D regions are proposed in the first stage, and 3D bounding box regression is done in the second stage. The 2D/3D region proposal stage identifies fused image-point cloud regions that may contain objects. In the bounding box regression stage, features are extracted from the region proposals and bounding boxes are predicted. One-step models generate region proposals and perform bounding box regression in parallel in a single step. The 2D proposal-based sequential approach uses 2D image semantics to generate 2D proposals and point cloud processing methods to detect dynamic objects. This approach utilises already developed image processing models to identify 2D object proposals and then projects these proposals into the LiDAR 3D point cloud space for object detection.
Two approaches are mainly used to manipulate image-based 2D object proposals and irregular 3D LiDAR data. In the first method, image-based 2D bounding boxes are projected into the LiDAR 3D point cloud so that 3D point cloud object detection algorithms can be applied. The second approach projects the point cloud onto the 2D images and applies 2D semantic segmentation techniques to obtain point-wise semantic labels for the points within the semantic regions [22]. LiDAR-camera 2D sequential object detection methods include result-level, feature-level, and multi-level fusion strategies. The 2D proposal-based result-level fusion methods use image object detection algorithms to retrieve 2D region proposals. These retrieved 2D object bounding boxes are then mapped onto the 3D LiDAR point cloud, and the points enclosed by the resulting frustum proposals are passed to a point cloud-based 3D object detection algorithm [25]. The overall performance of this approach depends on the performance of the 2D detection module, and sequential fusion may lose complementary information in the LiDAR point cloud because of the initial 2D image proposal step. Two-dimensional proposal-based feature-level fusion maps the 3D LiDAR point cloud onto the respective 2D images and employs image-based techniques for feature extraction and object detection. One of these approaches appends per-point 2D semantic details as additional channels of the LiDAR 3D point cloud and uses an existing LiDAR-based object detection method [53]. However, this approach is not optimal for identifying objects in a three-dimensional world because 3D details in the point cloud may be lost due to the projection.
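The feature-level strategy of appending per-point 2D semantic details can be sketched as follows, assuming the LiDAR points have already been projected to pixel coordinates (for example with calibration code like the earlier projection sketch); the score map and point cloud here are random placeholders:

```python
import numpy as np

# "Paint" LiDAR points with 2D semantic scores: look up each projected point's
# pixel in the segmentation score map and append those class scores as extra
# per-point channels. Inputs here are random placeholders.
num_points, num_classes = 2000, 4
points_xyz = np.random.uniform(-40.0, 40.0, size=(num_points, 3))      # LiDAR points
pixel_uv = np.random.randint(0, [1280, 720], size=(num_points, 2))     # projected (u, v)
segmentation_scores = np.random.rand(720, 1280, num_classes)           # 2D network output
segmentation_scores /= segmentation_scores.sum(axis=-1, keepdims=True) # per-pixel normalisation

per_point_scores = segmentation_scores[pixel_uv[:, 1], pixel_uv[:, 0]] # index by (v, u)
painted_points = np.hstack([points_xyz, per_point_scores])             # (N, 3 + num_classes)
print(painted_points.shape)                                            # (2000, 7)
```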
Multi-level fusion combines 2D result-level fusion with feature-level fusion. This approach uses already available 2D object detectors to generate 2D bounding boxes, and the points within these bounding boxes are then detected. Subsequently, image and point cloud features are combined within the bounding boxes to estimate 3D objects. LiDAR-camera object detection using 3D proposal-based sequential models avoids 2D-to-3D proposal transformations and directly generates 3D proposals from 2D or 3D data. This technique consists of two approaches, namely multi-view and voxel-based. MV3D [34] is a multi-view object detection network that uses LiDAR and camera data to predict the full 3D envelope of objects in the 3D environment. A deep fusion scheme was proposed to fuse features from multiple views in the respective regions. The detection network comprises two networks: the 3D proposal network and the region-based fusion network. As inputs, the LiDAR BEV, the LiDAR front view, and the RGB camera image are fed to the network. The LiDAR BEV is fed to the 3D proposal network to retrieve 3D box proposals. These proposals are used to extract features from the LiDAR front-view and camera RGB image inputs. Using these extracted features, the deep fusion network then predicts object size, orientation, and location in 3D space. The network was built on the 16-layer VGGnet [54], and the KITTI dataset was used for training.
One of the drawbacks of the multi-level fusion method is the loss of small objects in the detection stage due to feature map down-sampling, and combining image and point cloud feature maps by RoI (Regions of Interest) pooling decreases the fine-grained spatial detail. MVX-Net [33] introduces a method to fuse point cloud and image data voxel-wise or point-wise. Two-dimensional CNNs are used for image feature extraction, and a VoxelNet [55] based network detects 3D objects in the voxel representation. In the point-wise fusion method, the input 3D LiDAR point cloud is mapped to the 2D image for image feature extraction, and voxelisation and processing are then done using VoxelNet. In voxel-wise fusion, the point cloud is first voxelised and then projected onto the image-based 2D feature representation to extract features. This sequential approach achieved state-of-the-art performance for 3D object detection at the time of its publication. Object detection utilising one-stage models performs object proposal retrieval and bounding box prediction in a single process. These detection models are suitable for real-time autonomous robot decision-making scenarios. State-of-the-art single-stage object detection methods, such as [56], simultaneously process depth images and RGB images to fuse points with image features, and the generated feature map is then used for bounding box prediction. The introduced method [56] utilises two CNN-based networks to process the point cloud and RGB front-view images in parallel: one CNN identifies LiDAR features, and the other extracts RGB image features. These RGB image features are then mapped into the LiDAR range view. Finally, the mapped RGB and LiDAR image features are concatenated and fed into LaserNet [57] for object detection and semantic segmentation. The network has been trained in an end-to-end manner; training was done for 300K iterations with a batch size of 128, distributed over 32 GPUs. The image fusion, object detection, and semantic segmentation process took 38 milliseconds on an Nvidia Titan Xp GPU.
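The voxelisation step that VoxelNet-style pipelines perform before 3D feature extraction can be sketched as below; the grid extent and voxel size are illustrative, and a real pipeline would follow this grouping with a learned per-voxel feature encoder:

```python
import numpy as np

# Assign LiDAR points to voxels: compute an integer grid index per point and
# group points that share an index. Grid extent and voxel size are illustrative.
points = np.random.uniform([-40.0, -40.0, -3.0], [40.0, 40.0, 1.0], size=(10000, 3))
grid_min = np.array([-40.0, -40.0, -3.0])
voxel_size = np.array([0.2, 0.2, 0.4])

voxel_indices = np.floor((points - grid_min) / voxel_size).astype(np.int64)
unique_voxels, inverse, counts = np.unique(
    voxel_indices, axis=0, return_inverse=True, return_counts=True)

print(f"{len(points)} points fell into {len(unique_voxels)} occupied voxels")
print(f"max points in a single voxel: {counts.max()}")
# 'inverse' maps every point to its voxel, so per-voxel features (e.g., a small
# PointNet over the points in each voxel) can be computed by grouping on it.
```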
  • Multimodal Semantic Segmentation
Scene semantic segmentation assigns a semantic category label to each pixel in a scene image and can be regarded as a refinement of object detection [58]. A scene can incorporate obstacles, free space, and living creatures (and is not necessarily limited to these categories). Complete semantic segmentation applies these semantic categories to all pixels across an image. Many recent computer vision methods rely on CNN architectures, and these networks favour dense depth image data over sparse data. In general, LiDAR sensors produce irregular, sparse depth measurements. Ref. [41] introduced a technique that utilises sparse LiDAR data and RGB images to achieve depth completion and semantic segmentation in a 2D view. This network can work with sparse depth measurement densities as low as 0.8%, and at the time of its publication, it showed state-of-the-art performance on the KITTI benchmark dataset. The base network of this prior work was adopted from NASNet [59], which has an encoder-decoder architecture. Using LiDAR and RGB image feature fusion, ref. [60] proposed a novel method for 2D semantic segmentation of images. This method introduced a self-supervised network that generalises across different object categories, geometric locations, and environmental contexts. The network uses two sensor-modality-specific encoder streams, which are concatenated into a single intermediate encoder and then connected to a decoder that fuses the complementary features. The segmentation part is achieved with a network termed AdapNet++ [60] that consists of an encoder-decoder architecture. All of these network models were implemented using the TensorFlow deep learning library. Another high-performing deep learning-based LiDAR-camera 2D semantic segmentation method was presented in [61]. In this method, the 3D LiDAR point data is mapped to the 2D image and up-sampled to retrieve a 2D image set that encodes spatial information. Fully convolutional networks are then used to segment the image using three approaches: signal-level fusion, feature-level fusion, and cross-fusion. In the cross-fusion method, the network is designed to learn the fusion strategy directly from the input data.
The techniques discussed up to now have been 2D semantic segmentation methods. In contrast to 2D methods, 3D semantic segmentation approaches provide a realistic 3D inference of environments. An early 3D scene semantic segmentation network, termed 3DMV, is presented in [62]. This method requires relatively dense depth scans along with RGB images and was developed to map indoor scenarios. Voxelised 3D geometries are built using the depth scans. Two-dimensional feature maps are extracted from the RGB images using CNNs, and these image feature maps are mapped voxel-wise onto the 3D grid geometry. This fused 3D geometry is then fed into 3D CNNs to obtain per-voxel semantic predictions. The overall performance of the approach depends on the voxel resolution, and real-time processing is challenging for higher voxel resolutions; the dense volumetric grid therefore becomes impractical at high resolutions. The system was implemented using PyTorch and utilised the 2D and 3D convolution layers already provided by the application programming interface. Semantic segmentation of point clouds is challenging for structureless and featureless regions [63]. A point-based 3D semantic segmentation framework was introduced in [63]. This approach effectively optimises the geometric construction and pixel-level features of outdoor scenes. The network projects features of detected RGB images into LiDAR space and learns 2D surface texture and 3D geometric attributes. These multi-viewpoint features are extracted by a semantic segmentation network and then fused point-wise into the point cloud. The point data is then passed to a PointNet++ [64] based network to predict per-point semantic labels. A similar approach was followed by Multi-view PointNet [65] to fuse RGB semantics with 3D geometry and obtain per-point semantic labels for LiDAR points.
Instead of localised or point cloud representations, ref. [66] used a high-dimensional lattice representation for LiDAR and camera data processing. This representation reduces memory usage and computational cost by utilising bilateral convolutional layers, which apply convolutions only to the occupied sections of the generated lattice representation. First, the identified point cloud features are mapped to the high-dimensional lattice, and convolutions are then applied. Following this, CNNs are applied to detect image features from multi-view images, and these features are projected into the lattice representation to be combined with the three-dimensional lattice features. The generated lattice feature map is then processed by 3D CNNs to obtain point-wise labels. A spatially aware and hierarchical learning strategy is incorporated to learn 3D features. The introduced network can be trained in an end-to-end manner.
  • Multimodal Instance Segmentation
Instance segmentation identifies individual instances within a semantic category and is considered a more advanced form of semantic segmentation. This method not only provides per-pixel semantic categories but also distinguishes object instances, which is more advantageous for robot scene understanding. However, instance segmentation in autonomous navigation introduces more challenges than semantic segmentation. Instance segmentation approaches based on fused LiDAR-camera sensor data follow proposal-based and proposal-free architectures. A voxel-wise 3D instance segmentation approach introduced in [67] consists of a two-stage 3D CNN. A feature map is extracted from the low-resolution voxel grid using 3D CNNs. Another feature map is obtained from the RGB multi-view images using 2D CNNs and projected onto the associated voxels in the volumetric grid to be appended to the respective 3D geometry features. Object classes and 3D bounding boxes are then predicted by feeding these fused features to a 3D CNN architecture. In the second stage, another 3D CNN estimates the per-voxel object instances using the already identified features, object classes, and bounding boxes.
These voxel-based segmentation methods are constrained by the voxel resolution and require increased computational capability at higher grid resolutions. Applying LiDAR-camera fusion instance segmentation in real-time systems is challenging. Some research studies had limitations; for example, the system developed in [68] does not support dynamic environments. A proposal-free deep learning framework that jointly performs 3D semantic and instance segmentation is presented in [69]. This method performs 3D instance segmentation in the BEV of point clouds; however, it is less effective at identifying vertically oriented objects because of the BEV segmentation approach. The method first extracts a 2D semantic and instance map from a 2D BEV representation of the observed point cloud. Then, using the mean shift algorithm [70] and the semantic features of the 2D BEV, instance segmentation is achieved by propagating the features onto the 3D point cloud. It should be noted that the instance segmentation approaches discussed here were developed to segment 3D point clouds of static indoor environments, and these methods have not demonstrated segmentation capabilities in dynamic environments.
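The instance-grouping step based on the mean shift algorithm can be illustrated with scikit-learn's MeanShift applied to per-point embeddings; the synthetic two-dimensional embeddings below stand in for the learned features a trained network would produce:

```python
import numpy as np
from sklearn.cluster import MeanShift

# Group per-point instance embeddings with mean shift: points belonging to the
# same object should be mapped to nearby embedding locations by the network, so
# clustering the embeddings yields instance IDs. Synthetic 2D embeddings below
# stand in for learned ones.
rng = np.random.default_rng(3)
instance_centres = np.array([[0.0, 0.0], [4.0, 4.0], [8.0, 0.0]])
embeddings = np.vstack([
    centre + rng.normal(scale=0.4, size=(200, 2)) for centre in instance_centres
])

instance_ids = MeanShift(bandwidth=1.5).fit_predict(embeddings)
print(np.unique(instance_ids))        # typically one label per simulated instance
```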
Overall, while significant progress has been made in the perception capabilities of autonomous robots, particularly in regard to object detection and scene segmentation, many of the existing approaches have only been tested in relatively structured indoor environments, and substantial additional work may be required to adapt these techniques to be used in unstructured outdoor environments. Nonetheless, some very useful research directions have been identified that have the potential to significantly advance this field.

References

  1. Carrasco, P.; Cuesta, F.; Caballero, R.; Perez-Grau, F.J.; Viguria, A. Multi-sensor fusion for aerial robots in industrial GNSS-denied environments. Appl. Sci. 2021, 11, 3921.
  2. Galvao, L.G.; Abbod, M.; Kalganova, T.; Palade, V.; Huda, M.N. Pedestrian and vehicle detection in autonomous vehicle perception systems—A review. Sensors 2021, 21, 7267.
  3. Li, R.; Wang, S.; Gu, D. DeepSLAM: A robust monocular SLAM system with unsupervised deep learning. IEEE Trans. Ind. Electron. 2020, 68, 3577–3587.
  4. Aguiar, A.; Santos, F.; Sousa, A.J.; Santos, L. Fast-fusion: An improved accuracy omnidirectional visual odometry system with sensor fusion and GPU optimization for embedded low cost hardware. Appl. Sci. 2019, 9, 5516.
  5. Fayyad, J.; Jaradat, M.A.; Gruyer, D.; Najjaran, H. Deep learning sensor fusion for autonomous vehicle perception and localization: A review. Sensors 2020, 20, 4220.
  6. Valada, A.; Oliveira, G.L.; Brox, T.; Burgard, W. Deep multispectral semantic scene understanding of forested environments using multimodal fusion. In Proceedings of the International Symposium on Experimental Robotics, Tokyo, Japan, 3–6 October 2016; pp. 465–477.
  7. Li, Y.; Brasch, N.; Wang, Y.; Navab, N.; Tombari, F. Structure-SLAM: Low-drift monocular SLAM in indoor environments. IEEE Robot. Autom. Lett. 2020, 5, 6583–6590.
  8. Zaffar, M.; Ehsan, S.; Stolkin, R.; Maier, K.M. Sensors, SLAM and long-term autonomy: A review. In Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Edinburgh, UK, 6–9 August 2018; pp. 285–290.
  9. Sabattini, L.; Levratti, A.; Venturi, F.; Amplo, E.; Fantuzzi, C.; Secchi, C. Experimental comparison of 3D vision sensors for mobile robot localization for industrial application: Stereo-camera and RGB-D sensor. In Proceedings of the 12th International Conference on Control Automation Robotics & Vision (ICARCV), Guangzhou, China, 5–7 December 2012; pp. 823–828.
  10. Tölgyessy, M.; Dekan, M.; Chovanec, L.; Hubinskỳ, P. Evaluation of the Azure Kinect and its comparison to Kinect V1 and Kinect V2. Sensors 2021, 21, 413.
  11. Evangelidis, G.D.; Hansard, M.; Horaud, R. Fusion of range and stereo data for high-resolution scene-modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 2178–2192.
  12. Glover, A.; Bartolozzi, C. Robust visual tracking with a freely-moving event camera. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 3769–3776.
  13. Gallego, G.; Delbrück, T.; Orchard, G.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; et al. Event-based vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 154–180.
  14. Yuan, W.; Li, J.; Bhatta, M.; Shi, Y.; Baenziger, P.S.; Ge, Y. Wheat height estimation using LiDAR in comparison to ultrasonic sensor and UAS. Sensors 2018, 18, 3731.
  15. Moosmann, F.; Stiller, C. Velodyne SLAM. In Proceedings of the IEEE Intelligent Vehicles Symposium, Baden-Baden, Germany, 5–9 June 2011; pp. 393–398.
  16. Li, K.; Li, M.; Hanebeck, U.D. Towards high-performance solid-state-lidar-inertial odometry and mapping. IEEE Robot. Autom. Lett. 2021, 6, 5167–5174.
  17. Poulton, C.V.; Yaacobi, A.; Cole, D.B.; Byrd, M.J.; Raval, M.; Vermeulen, D.; Watts, M.R. Coherent solid-state LIDAR with silicon photonic optical phased arrays. Opt. Lett. 2017, 42, 4091–4094.
  18. Behroozpour, B.; Sandborn, P.A.; Wu, M.C.; Boser, B.E. LiDAR system architectures and circuits. IEEE Commun. Mag. 2017, 55, 135–142.
  19. Xu, X.; Zhang, L.; Yang, J.; Cao, C.; Wang, W.; Ran, Y.; Tan, Z.; Luo, M. A review of multi-sensor fusion SLAM systems based on 3D LIDAR. Remote Sens. 2022, 14, 2835.
  20. Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. Deepfusion: LiDAR-camera deep fusion for multi-modal 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17182–17191.
  21. Zheng, W.; Xie, H.; Chen, Y.; Roh, J.; Shin, H. PIFNet: 3D object detection using joint image and point cloud features for autonomous driving. Appl. Sci. 2022, 12, 3686.
  22. Cui, Y.; Chen, R.; Chu, W.; Chen, L.; Tian, D.; Li, Y.; Cao, D. Deep learning for image and point cloud fusion in autonomous driving: A review. IEEE Trans. Intell. Transp. Syst. 2021, 23, 722–739.
  23. Du, X.; Ang, M.H.; Karaman, S.; Rus, D. A general pipeline for 3D detection of vehicles. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 3194–3200.
  24. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Ipod: Intensive point-based object detector for point cloud. arXiv 2018, arXiv:1812.05276.
  25. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3D object detection from RGB-D data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927.
  26. Shin, K.; Kwon, Y.P.; Tomizuka, M. Roarnet: A robust 3D object detection based on region approximation refinement. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2510–2515.
  27. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
  28. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
  29. Maturana, D.; Scherer, S. Voxnet: A 3D convolutional neural network for real-time object recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 922–928.
  30. Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 244–253.
  31. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8.
  32. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3D object detection. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656.
  33. Sindagi, V.A.; Zhou, Y.; Tuzel, O. MVX-Net: Multimodal voxelnet for 3D object detection. In Proceedings of the International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7276–7282.
  34. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
  35. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
  36. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454.
  37. Geyer, J.; Kassahun, Y.; Mahmudi, M.; Ricou, X.; Durgesh, R.; Chung, A.S.; Hauswald, L.; Pham, V.H.; Mühlegg, M.; Dorn, S.; et al. A2d2: Audi autonomous driving dataset. arXiv 2020, arXiv:2004.06320.
  38. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631.
  39. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
  40. Ma, F.; Cavalheiro, G.V.; Karaman, S. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera. In Proceedings of the International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 3288–3295.
  41. Ma, F.; Karaman, S. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In Proceedings of the IEEE international conference on robotics and automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 4796–4803.
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  43. Cheng, X.; Wang, P.; Yang, R. Depth estimation via affinity learned with convolutional spatial propagation network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 103–119.
  44. Cheng, X.; Wang, P.; Guan, C.; Yang, R. CSPN++: Learning context and resource aware convolutional spatial propagation networks for depth completion. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 10615–10622.
  45. Cheng, X.; Zhong, Y.; Dai, Y.; Ji, P.; Li, H. Noise-aware unsupervised deep LiDAR-stereo fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6339–6348.
  46. Jalal, A.S.; Singh, V. The state-of-the-art in visual object tracking. Informatica 2012, 36, 1–22.
  47. Tang, P.; Wang, X.; Wang, A.; Yan, Y.; Liu, W.; Huang, J.; Yuille, A. Weakly supervised region proposal network and object detection. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 352–368.
  48. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
  49. Hong, M.; Li, S.; Yang, Y.; Zhu, F.; Zhao, Q.; Lu, L. SSPNet: Scale selection pyramid network for tiny person detection from UAV images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
  50. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  51. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1–9.
  52. Kim, J.; Cho, J. Exploring a multimodal mixture-of-YOLOs framework for advanced real-time object detection. Appl. Sci. 2020, 10, 612.
  53. Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning rich features from RGB-D images for object detection and segmentation. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 345–360.
  54. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
  55. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499.
  56. Meyer, G.P.; Charland, J.; Hegde, D.; Laddha, A.; Vallespi-Gonzalez, C. Sensor fusion for joint 3D object detection and semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 1–8.
  57. Meyer, G.P.; Laddha, A.; Kee, E.; Vallespi-Gonzalez, C.; Wellington, C.K. Lasernet: An efficient probabilistic 3D object detector for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12677–12686.
  58. Guo, Z.; Huang, Y.; Hu, X.; Wei, H.; Zhao, B. A survey on deep learning based approaches for scene understanding in autonomous driving. Electronics 2021, 10, 471.
  59. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710.
  60. Valada, A.; Mohan, R.; Burgard, W. Self-supervised model adaptation for multimodal semantic segmentation. Int. J. Comput. Vis. 2020, 128, 1239–1285.
  61. Caltagirone, L.; Bellone, M.; Svensson, L.; Wahde, M. LIDAR–camera fusion for road detection using fully convolutional neural networks. Robot. Auton. Syst. 2019, 111, 125–131.
  62. Dai, A.; Nießner, M. 3DMV: Joint 3D-multi-view prediction for 3D semantic scene segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 452–468.
  63. Chiang, H.Y.; Lin, Y.L.; Liu, Y.C.; Hsu, W.H. A unified point-based framework for 3D segmentation. In Proceedings of the International Conference on 3D Vision (3DV), Québec, QC, Canada, 16–19 September 2019; pp. 155–163.
  64. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 1–10.
  65. Jaritz, M.; Gu, J.; Su, H. Multi-view pointnet for 3D scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 1–9.
  66. Su, H.; Jampani, V.; Sun, D.; Maji, S.; Kalogerakis, E.; Yang, M.H.; Kautz, J. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2530–2539.
  67. Hou, J.; Dai, A.; Nießner, M. 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4421–4430.
  68. Narita, G.; Seno, T.; Ishikawa, T.; Kaji, Y. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4–8 November 2019; pp. 4205–4212.
  69. Elich, C.; Engelmann, F.; Kontogianni, T.; Leibe, B. 3D bird’s-eye-view instance segmentation. In Proceedings of the 41st DAGM German Conference on Pattern Recognition, Dortmund, Germany, 10–13 September 2019; pp. 48–61.
  70. Comaniciu, D.; Meer, P. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 603–619.