1. Visual-SLAM
Visual Simultaneous Localisation and Mapping (Visual-SLAM) relies on cameras and visual sensors, which have been a main research direction for SLAM solutions due to their capability of collecting a large amount of information and their measurement range for mapping. The principle of Visual-SLAM lies in sequentially estimating the camera motion from the perceived movements of pixels in the image sequence. Besides robotics, Visual-SLAM is also essential for many vision-based applications such as virtual and augmented reality. Many existing Visual-SLAM methods explicitly model camera projections, motions, and environments based on visual geometry. Recently, many methods have incorporated semantic meaning into the observed objects to provide localisation that is more robust against observation noise and dynamic objects. The following sections review the different families of algorithms within the branch of Visual-SLAM.
2. Feature-Based and Direct SLAM
Feature-based SLAM can be divided into filter-based and bundle-adjustment-based methods, as introduced in previous sections. Earlier SLAM approaches utilised extended Kalman filters (EKFs) for estimating the robot pose while simultaneously updating the landmarks observed by the robot [1][2][3]. However, the computational complexity of these methods grows with the number of landmarks, and they do not handle non-linearities in the measurements efficiently [4]. FastSLAM was proposed to improve on EKF-SLAM by combining particle filters with EKFs for landmark estimation [5]. However, it also suffers from sample degeneracy when sampling the proposal distribution. Parallel Tracking and Mapping [6] was proposed to address this issue by splitting pose and map estimation into separate threads, which enhances real-time performance [7][8].
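To make the filter-based formulation concrete, the following is a minimal sketch, in Python/NumPy, of one EKF-SLAM prediction and range-bearing landmark update for an assumed 2D robot with point landmarks and simplified noise models. It illustrates why the cost grows with the map size; it is not the implementation used by the cited works.

```python
import numpy as np

# State: [x, y, theta, l1x, l1y, ..., lNx, lNy]; the filter jointly tracks the
# robot pose and every observed landmark, which is why cost grows with the map.

def ekf_predict(mu, Sigma, v, w, dt, R_motion):
    """Propagate the robot part of the state with a unicycle motion model."""
    theta = mu[2]
    mu = mu.copy()
    mu[0] += v * dt * np.cos(theta)
    mu[1] += v * dt * np.sin(theta)
    mu[2] += w * dt
    F = np.eye(len(mu))                         # motion Jacobian (identity on landmarks)
    F[0, 2] = -v * dt * np.sin(theta)
    F[1, 2] = v * dt * np.cos(theta)
    Q = np.zeros_like(Sigma)
    Q[:3, :3] = R_motion                        # noise only enters the pose block
    return mu, F @ Sigma @ F.T + Q

def ekf_update_landmark(mu, Sigma, z, idx, Q_meas):
    """Range-bearing update for landmark `idx` (0-based index into the map)."""
    lx, ly = mu[3 + 2 * idx], mu[4 + 2 * idx]
    dx, dy = lx - mu[0], ly - mu[1]
    q = dx**2 + dy**2
    z_hat = np.array([np.sqrt(q), np.arctan2(dy, dx) - mu[2]])
    H = np.zeros((2, len(mu)))
    H[:, :3] = np.array([[-dx / np.sqrt(q), -dy / np.sqrt(q), 0.0],
                         [dy / q,           -dx / q,         -1.0]])
    H[:, 3 + 2 * idx: 5 + 2 * idx] = np.array([[dx / np.sqrt(q), dy / np.sqrt(q)],
                                               [-dy / q,          dx / q]])
    S = H @ Sigma @ H.T + Q_meas
    K = Sigma @ H.T @ np.linalg.inv(S)          # Kalman gain
    innov = z - z_hat
    innov[1] = (innov[1] + np.pi) % (2 * np.pi) - np.pi   # wrap the bearing error
    mu = mu + K @ innov
    Sigma = (np.eye(len(mu)) - K @ H) @ Sigma
    return mu, Sigma
```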
A place recognition system with ORB features, built on top of Bag-of-Words (BoW), was first proposed in [9]. ORB is a rotation-invariant and scale-aware feature [10] that can be extracted at high frequency, so place recognition algorithms based on it can be highly efficient and run in real time. The algorithm is helpful for relocalisation and loop closure in Visual-SLAM, and it was further developed with monocular cameras to operate in large-scale environments [11].
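As a minimal sketch of the underlying idea, the snippet below uses OpenCV's ORB implementation and a brute-force Hamming matcher with a ratio test to score the similarity of two frames; real place-recognition systems quantise the descriptors into a BoW vocabulary instead of matching images pairwise, but the feature extraction step is the same.

```python
import cv2

def orb_match_score(img_a, img_b, n_features=1000, ratio=0.75):
    """Count ratio-test ORB matches between two grayscale images as a crude
    similarity score for loop-closure candidates."""
    orb = cv2.ORB_create(nfeatures=n_features)
    _, des_a = orb.detectAndCompute(img_a, None)
    _, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return len(good)
```

A BoW back-end such as a vocabulary tree replaces the pairwise matching with an inverted-index lookup so that thousands of keyframes can be scored in real time.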
RGB-D SLAM [11] is another feature-based SLAM that uses feature points to generate dense and accurate 3D maps. Several models utilise the active camera sensor to develop a 6-DOF motion tracking model capable of 3D reconstruction, achieving impressive performance even under challenging scenarios [12][13]. In contrast to low-level point features, high-level objects often provide more accurate tracking. For example, a planar SLAM system can detect planes in the environment to yield a planar map while detecting objects such as desks and chairs for localisation [14]. The recognition of these objects, however, requires an offline supervised-learning procedure before the SLAM procedure can be executed.
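The dense maps mentioned above are built by back-projecting depth pixels into 3D. The following is a minimal sketch, assuming a pinhole camera with intrinsics fx, fy, cx, cy and a depth image in millimetres, of how a single RGB-D frame becomes a coloured point cloud; RGB-D SLAM systems register many such clouds using the tracked camera poses.

```python
import numpy as np

def backproject_rgbd(depth, rgb, fx, fy, cx, cy, depth_scale=1000.0):
    """Turn a depth image (uint16, millimetres) and an aligned RGB image into
    an (N, 6) array of [x, y, z, r, g, b] points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64) / depth_scale
    valid = z > 0                               # drop missing depth readings
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x[valid], y[valid], z[valid]], axis=1)
    colours = rgb[valid].astype(np.float64)     # (N, 3) colours of the valid pixels
    return np.hstack([points, colours])
```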
Direct SLAM refers to methods that use the input images directly, without any feature detectors or descriptors. In contrast to feature-based methods, these feature-less approaches generally rely on photometric consistency to register two successive images. Using deep-learning models to extract a feature representation of the environment is promising in numerous robotic domains [15][16]. For example, DTAM [17], Large-Scale Direct (LSD) SLAM [18] and SVO [19] are some of the models that have achieved considerable success. DSO models [20][21] have also been shown to use a bundle-adjustment pipeline over temporal multi-view stereo to achieve high accuracy in a real-time system. In addition, models such as CodeSLAM [22] and CNN-SLAM [23] use deep-learning approaches to extract dense representations of the environment for direct SLAM. However, direct SLAM is often more time-consuming than feature-based SLAM since it operates directly on the image space.
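The photometric consistency that direct methods rely on can be summarised by the residual sketched below (a simplified illustration: bilinear interpolation and robust weighting are omitted, and `K` denotes an assumed camera intrinsic matrix). A pixel with known depth is warped from a reference frame into the current frame by a candidate pose, and the intensity difference is the error to be minimised.

```python
import numpy as np

def photometric_residual(I_ref, I_cur, u, v, depth, T_cur_ref, K):
    """Intensity difference for one reference pixel (u, v) with known depth,
    warped into the current image by the relative pose T_cur_ref (4x4)."""
    # Back-project the reference pixel into 3D (reference camera frame).
    p_ref = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Transform into the current camera frame and project.
    p_cur = T_cur_ref[:3, :3] @ p_ref + T_cur_ref[:3, 3]
    if p_cur[2] <= 0:
        return None                             # point is behind the camera
    uv_cur = K @ (p_cur / p_cur[2])
    u2, v2 = int(round(uv_cur[0])), int(round(uv_cur[1]))
    h, w = I_cur.shape
    if not (0 <= u2 < w and 0 <= v2 < h):
        return None                             # warped outside the image
    # Direct methods sum the squared residual over many such pixels and
    # optimise T_cur_ref (and optionally the depths) with Gauss-Newton.
    return float(I_cur[v2, u2]) - float(I_ref[int(v), int(u)])
```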
3. Localisation with Scene Modelling
Deep learning plays an essential role in scene understanding through techniques such as CNN-based classification. CNNs can be applied to RGB images to extract semantic information, such as detecting scenes or pedestrians within the images [24][25][26]. CNNs can also operate directly on point clouds captured by range-based sensors such as Light Detection and Ranging (LiDAR). Models such as PointNet [27], shown in Figure 1, can classify objects based purely on point clouds. For example, PointNet++ [27], TangentConvolutions [28], DOPS [29], and RandLA-Net [30] are some of the recent deep-learning models that can perform semantic understanding on large-scale point clouds. Most of these models are trained on point cloud datasets, which enables them to infer object and scene information based purely on the geometry of the input points.
Figure 1. Example of using PointNet [27] for performing part segmentation directly on input point clouds.
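The core of these point-based networks is a shared per-point MLP followed by a symmetric pooling operation, which makes the output invariant to the ordering of the input points. The following is a minimal PyTorch sketch of that idea for classification, omitting the input and feature transform networks of the original PointNet.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + global max-pool: the order-invariant core of PointNet."""
    def __init__(self, num_classes=10):
        super().__init__()
        # 1x1 convolutions act as an MLP applied independently to every point.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):                       # points: (B, N, 3)
        x = self.point_mlp(points.transpose(1, 2))   # (B, 1024, N)
        x = torch.max(x, dim=2).values               # symmetric pooling over points
        return self.classifier(x)                    # (B, num_classes) logits

logits = TinyPointNet()(torch.randn(4, 2048, 3))     # e.g. 4 clouds of 2048 points
```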
Dynamic objects can introduce difficulties in SLAM, particularly during loop closure, because of their motion. SLAM can tackle this difficulty by utilising semantic information to filter dynamic objects from the input images [31]. Using a scene-understanding module, moving objects can be filtered out of the images to prevent the SLAM algorithm from conditioning on dynamic objects. For example, the SUMA++ model illustrated on the right of Figure 2 obtains a semantic understanding of each detected object to filter out dynamic objects such as pedestrians and other moving vehicles. However, the increased robustness can come at the cost of a less accurate estimated robot pose, since the method neglects part of the perceived information.
Figure 2. Using environment features to create a semantic map. SUMA++ [32] operating in an environment using LiDAR sensors, which provide rich information for understanding the environment around the vehicle.
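A minimal sketch of the filtering step described above is shown below: given per-pixel semantic labels from a segmentation network, measurements that land on dynamic classes are discarded before tracking. The class indices here are hypothetical and depend on the label map of the chosen segmentation model.

```python
import numpy as np

# Hypothetical label ids for classes treated as dynamic.
DYNAMIC_CLASSES = {11: "person", 13: "car", 14: "truck", 17: "cyclist"}

def filter_dynamic_keypoints(keypoints, semantic_labels):
    """Keep only keypoints that do not land on a dynamic object.

    keypoints       : (N, 2) array of (u, v) pixel coordinates
    semantic_labels : (H, W) array of per-pixel class ids from a segmentation CNN
    """
    u = keypoints[:, 0].astype(int)
    v = keypoints[:, 1].astype(int)
    classes = semantic_labels[v, u]
    static_mask = ~np.isin(classes, list(DYNAMIC_CLASSES.keys()))
    return keypoints[static_mask]
```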
4. Scene Modelling with Topological Relationships and Dynamic Models
Scene graphs are a different approach to building a model of the environment that includes the metric, semantic, and primary topological relationships between scene objects and the overall environment [33]. Scene graphs can construct an environmental graph that spans an entire building, including the objects, materials, and rooms within it [34]. The main disadvantage of scene graphs is that they need to be computed offline: previous approaches rely on registering RGB images with a known 3D mesh of the building to generate the 3D scene graph, which limits their applicability to static environments.
Figure 3 illustrates one of these approaches, dynamic scene graphs (DSGs) [35], which can also include dynamic elements within the environment. For example, a DSG can model humans navigating within the building. The original DSG approach needs to be built offline, but an extension has been proposed [35] that is capable of building a dynamic 3D scene graph from visual-inertial data in a fully automatic manner. The approach first builds a 3D mesh-based semantic map that is fed to the dynamic scene generator.
Figure 3. Using a Dynamic Scene Graph (DSG) [35] for generating a multi-layer abstraction of an indoor environment.
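To illustrate the layered structure that such approaches build, here is a minimal sketch using the networkx library: nodes for the building, rooms, objects, and agents are connected by containment edges. The rooms, objects, and attributes are made-up examples, not the data structures of the cited systems.

```python
import networkx as nx

# Build a toy multi-layer scene graph: building -> rooms -> objects/agents.
G = nx.DiGraph()
G.add_node("building", layer="building")
for room in ["kitchen", "corridor"]:
    G.add_node(room, layer="room")
    G.add_edge("building", room, relation="contains")

# Static objects live in the object layer and are attached to their room.
G.add_node("table_1", layer="object", centroid=(2.0, 1.5, 0.7))
G.add_edge("kitchen", "table_1", relation="contains")

# Dynamic scene graphs additionally track agents such as humans over time.
G.add_node("person_1", layer="agent", trajectory=[(0.0, 0.0), (0.5, 0.1)])
G.add_edge("corridor", "person_1", relation="contains")

# Topological queries then become graph traversals, e.g. everything in a room:
print(list(G.successors("kitchen")))          # ['table_1']
```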
In addition, one can reason about the current situation by projecting what is likely to happen based on previous events [36]. This class of methods predicts possible future states of the robot by conditioning on the current belief of the robot state and the robot's dynamic model [37]. Dynamic models can also be attached to objects in the surrounding environment, such as pedestrians and vehicles, so that the model can predict the future poses of nearby objects with some amount of uncertainty [38].
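One of the simplest dynamic models used for such predictions is a constant-velocity model, sketched below: the object's state and covariance are propagated forward in time, so the predicted pose comes with a growing uncertainty. The noise parameters here are illustrative assumptions.

```python
import numpy as np

def predict_constant_velocity(x, P, dt, accel_std=0.5):
    """Propagate state x = [px, py, vx, vy] and covariance P forward by dt
    under a constant-velocity model with white acceleration noise."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    # Process noise from an assumed piecewise-constant random acceleration.
    G = np.array([[0.5 * dt**2, 0],
                  [0, 0.5 * dt**2],
                  [dt, 0],
                  [0, dt]])
    Q = G @ G.T * accel_std**2
    return F @ x, F @ P @ F.T + Q

# Predicting one second ahead grows the positional uncertainty of a pedestrian.
x, P = np.array([1.0, 2.0, 0.8, 0.0]), np.eye(4) * 0.1
x_pred, P_pred = predict_constant_velocity(x, P, dt=1.0)
```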
5. Semantic Understanding with Segmentation
Pixel-wise semantic segmentation is another promising direction for semantic understanding in SLAM. FCN [24] is a fully convolutional neural network from the computer vision community that performs pixel-wise segmentation and has been applied to SLAM. ParseNet [39] derives a similar CNN architecture and injects global context information through global pooling layers in FCN; this global context allows the model to achieve better scene segmentation with a more feature-rich representation. SegNet [40] is another network that uses an encoder-decoder architecture for segmentation, in which the decoder upsamples the low-resolution features captured from the images. Bayesian approaches are helpful in many learning-based robotics applications [41][42]. Bayesian SegNet [41] takes a probabilistic approach by using dropout layers in the original SegNet for sampling; it estimates per-pixel segmentation probabilities and often outperforms the original approach. Conditional Random Fields have also been combined with CNN architectures [43] by formulating mean-field approximate inference as a Recurrent Neural Network.
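The Monte-Carlo dropout idea behind Bayesian SegNet can be sketched as follows: dropout layers are kept active at test time, several stochastic forward passes are averaged to obtain per-pixel class probabilities, and the predictive entropy serves as an uncertainty estimate. The `model` here is any PyTorch segmentation network containing dropout; this is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_segmentation(model, image, n_samples=20):
    """Average n_samples stochastic forward passes of a segmentation network.

    image : (1, 3, H, W) tensor; the model is assumed to return (1, C, H, W) logits.
    Returns the mean per-pixel class probabilities and the predictive entropy.
    """
    model.eval()
    # Re-enable only the dropout layers so they keep sampling at test time.
    for module in model.modules():
        if isinstance(module, (nn.Dropout, nn.Dropout2d)):
            module.train()

    probs = torch.stack([torch.softmax(model(image), dim=1)
                         for _ in range(n_samples)]).mean(dim=0)   # (1, C, H, W)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)       # (1, H, W)
    return probs, entropy
```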
Semantic information is particularly valuable in environments where a robot needs to interact with humans [44]. Progress in deep-learning-based semantic segmentation in computer vision has been instrumental in advancing research on semantic SLAM. By combining model-based SLAM methods with spatio-temporal CNN-based semantic segmentation [45], one can often provide the SLAM model with a more informative feature representation for localisation. The proposed system can simultaneously perform 3D semantic scene mapping and 6-DOF localisation even in a large indoor environment. The pixel-voxel network [46] is another similar approach that uses a CNN-like architecture for semantic mapping. SemanticFusion [47] integrates CNN-based semantic segmentation with the dense SLAM system ElasticFusion [48], resulting in a model that produces a dense semantic map and performs well in indoor environments.
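The map update in SemanticFusion-style systems amounts to a recursive Bayesian label fusion, sketched below: each map element keeps a class distribution that is multiplied by the CNN's per-pixel probabilities whenever the element is observed and then renormalised. The association between pixels and map elements is assumed to be provided by the SLAM back-end.

```python
import numpy as np

def fuse_semantic_observation(map_probs, pixel_probs, eps=1e-12):
    """Recursive Bayesian update of per-element class distributions.

    map_probs   : (M, C) current class distribution of M map elements (surfels/voxels)
    pixel_probs : (M, C) CNN class probabilities of the pixels associated with them
    """
    fused = map_probs * pixel_probs                     # element-wise Bayes update
    fused /= fused.sum(axis=1, keepdims=True) + eps     # renormalise per element
    return fused

# Example: a surfel initially uncertain between two classes becomes confident
# after repeated consistent observations.
surfel = np.array([[0.5, 0.5]])
for _ in range(3):
    surfel = fuse_semantic_observation(surfel, np.array([[0.3, 0.7]]))
print(surfel)   # probability mass concentrates on the second class
```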
6. Sensor Fusion for Semantic Scene Understanding
With the recent advancements in deep learning, numerous Visual-SLAM systems have also had great success in using learned models for semantic understanding through data fusion. Models such as Frustum PointNets [49] utilise both an RGB camera and LiDAR sensors to improve the accuracy of scene semantics. Figure 4 illustrates how Frustum PointNets utilises information from both sensors for data fusion, where a PointNet is applied for object instance segmentation and amodal bounding-box regression. Sensor fusion provides a richer feature representation for performing data association. For example, VINet [50] is a sensor fusion network that combines the pose estimated by DeepVO [51] with inertial sensor readings using an LSTM. During training, the prediction and fusion networks are trained jointly so that the gradient passes through the entire network. Therefore, both networks can compensate for each other, and the fused system performs well compared to traditional sensor fusion methods. The same methodology can also be used as a fusion system [52] capable of fusing 6-DOF pose data from cameras and magnetic sensors [53].
Figure 4. The multi-modal model Frustum PointNets [49], which uses a CNN to project detected objects from RGB images into 3D space, thus improving the accuracy of semantic understanding.
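A minimal sketch of the frustum-proposal step illustrated in Figure 4 is given below: LiDAR points, assumed to be already transformed into the camera frame by an extrinsic calibration, are projected with the intrinsic matrix `K`, and only those falling inside a 2D detection box are kept for the subsequent 3D network.

```python
import numpy as np

def crop_frustum_points(points_cam, K, box_2d):
    """Select the LiDAR points whose image projection falls inside a 2D box.

    points_cam : (N, 3) points already expressed in the camera frame
    K          : (3, 3) camera intrinsic matrix
    box_2d     : (u_min, v_min, u_max, v_max) detection box in pixels
    """
    u_min, v_min, u_max, v_max = box_2d
    in_front = points_cam[:, 2] > 0               # keep points in front of the camera
    uv = (K @ points_cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    inside = ((uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) &
              (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max))
    return points_cam[in_front & inside]
```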
The information obtained from a camera can also be fused with GPS, INS, and wheel-odometry readings in an ego-motion estimation system [54]. The model essentially uses deep learning to capture the temporal motion dynamics: the motion observed by the camera is fed to a mixture density network that constructs an optical-flow vector for better estimation. Direct methods for visual odometry (VO) can exploit information at the intensity level of the input images; however, these methods cannot guarantee optimality compared to feature-based methods. Semi-direct VO (SVO2) [55] is a hybrid method that uses direct methods to track pixels while relying on feature-based methods for the joint optimisation of structure and motion. Such hybrid methods take advantage of both approaches to improve the robustness of VO. Similar approaches such as VINS-Fusion [56] can fuse an IMU with monocular visual input to estimate odometry with high reliability. Deep neural networks can further learn rigid-body motion in a CNN architecture [57], using raw point cloud data as input to predict the SE(3) rigid transformation of the robot.
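Regardless of which sensors are fused, the predicted relative motion is usually expressed as an SE(3) transform. As a minimal sketch (assuming the network outputs an axis-angle rotation and a translation as a 6-vector, which is one common parameterisation rather than the cited papers' exact one), the prediction can be converted to a 4x4 matrix via the Rodrigues formula and chained into an odometry trajectory:

```python
import numpy as np

def se3_from_prediction(pred):
    """Convert a 6-vector [wx, wy, wz, tx, ty, tz] into a 4x4 SE(3) matrix."""
    w, t = pred[:3], pred[3:]
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        R = np.eye(3)
    else:
        k = w / theta
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])
        # Rodrigues formula: rotation about axis k by angle theta.
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Chaining per-frame relative transforms gives the robot's trajectory.
pose = np.eye(4)
for pred in [np.array([0, 0, 0.05, 0.1, 0, 0]),
             np.array([0, 0, 0.05, 0.1, 0, 0])]:
    pose = pose @ se3_from_prediction(pred)
```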