Visual-SLAM: History
Subjects: Robotics

Simultaneous Localisation and Mapping (SLAM) is one of the fundamental problems in autonomous mobile robotics, where a robot must reconstruct a previously unseen environment while simultaneously localising itself with respect to the map it builds. In particular, Visual-SLAM uses visual sensors, such as cameras, mounted on the mobile robot to collect the observations from which the map representation is built.

  • Visual SLAM
  • Simultaneous Localisation and Mapping
  • deep-learning SLAM

1. Visual-SLAM

Visual-SLAM and camera sensors have been a main research direction for SLAM solutions because cameras collect a large amount of information over a wide measurement range, which is valuable for mapping. The principle of Visual-SLAM lies in the sequential estimation of camera motion from the perceived movement of pixels across the image sequence. Besides robotics, Visual-SLAM is also essential for numerous vision-based applications such as virtual and augmented reality. Many existing Visual-SLAM methods explicitly model camera projections, motions, and environments based on visual geometry. More recently, many methods have assigned semantic meaning to the observed objects to make localisation more robust against observation noise and dynamic objects. In this section, we review the different families of algorithms within the branch of Visual-SLAM.

2. Feature-Based and Direct SLAM

Feature-based SLAM can be divided into filter-based and Bundle Adjustment (BA)-based methods, introduced in previous sections. Early SLAM approaches used Extended Kalman Filters (EKFs) to estimate the robot pose while simultaneously updating the landmarks observed by the robot [53,54,55]. However, the computational complexity of these methods grew with the number of landmarks, and they did not efficiently handle non-linearities in the measurements [56]. FastSLAM was proposed to improve on EKF-SLAM by combining particle filters with per-landmark EKFs [45]; however, it suffered from sample degeneracy when sampling the proposal distribution. Parallel Tracking and Mapping (PTAM) [57] was proposed to address these issues by splitting pose tracking and map estimation into separate threads, which enhances real-time performance [58,59].
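To make the filter-based formulation concrete, the following is a minimal sketch (NumPy, a hypothetical single-landmark setup with a range-bearing sensor, not any published system) of one EKF-SLAM predict/update cycle. The state stacks the robot pose and one landmark position; real systems maintain many landmarks and a correspondingly larger covariance matrix.

```python
import numpy as np

def ekf_slam_step(x, P, u, z, dt, Q, R):
    """One EKF-SLAM predict/update cycle for a single landmark (illustrative only).

    x: state [rx, ry, rtheta, lx, ly] (robot pose + landmark position)
    P: 5x5 state covariance, Q: 5x5 process noise, R: 2x2 measurement noise
    u: control [v, w] (linear and angular velocity)
    z: measurement [range, bearing] of the landmark
    """
    rx, ry, rth, lx, ly = x
    v, w = u

    # --- Predict: apply the velocity motion model to the robot part of the state ---
    x_pred = x.copy()
    x_pred[0] = rx + v * np.cos(rth) * dt
    x_pred[1] = ry + v * np.sin(rth) * dt
    x_pred[2] = rth + w * dt

    F = np.eye(5)                          # Jacobian of the motion model
    F[0, 2] = -v * np.sin(rth) * dt
    F[1, 2] = v * np.cos(rth) * dt
    P_pred = F @ P @ F.T + Q

    # --- Update: range-bearing observation of the landmark ---
    dx, dy = x_pred[3] - x_pred[0], x_pred[4] - x_pred[1]
    q = dx**2 + dy**2
    z_hat = np.array([np.sqrt(q), np.arctan2(dy, dx) - x_pred[2]])

    H = np.array([                          # Jacobian of the measurement model
        [-dx/np.sqrt(q), -dy/np.sqrt(q), 0.0,  dx/np.sqrt(q),  dy/np.sqrt(q)],
        [ dy/q,          -dx/q,         -1.0, -dy/q,           dx/q        ],
    ])
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)

    innov = z - z_hat
    innov[1] = (innov[1] + np.pi) % (2 * np.pi) - np.pi   # wrap the bearing residual
    x_new = x_pred + K @ innov
    P_new = (np.eye(5) - K @ H) @ P_pred
    return x_new, P_new
```

The quadratic growth of the covariance matrix with the number of landmarks is exactly the scalability limitation mentioned above.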
A place recognition system using ORB features was first proposed in [60], built on the Bag-of-Words (BoW) model. ORB is a rotation-invariant and scale-aware feature [61] that can be extracted at high frame rates. Place recognition algorithms of this kind are often highly efficient and run in real time. The algorithm is helpful for relocalisation and loop closure in Visual-SLAM, and it was further developed with monocular cameras for operation in large-scale environments [62].
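As an illustration of how ORB features support relocalisation and loop-closure candidates, the sketch below (assuming OpenCV and two hypothetical greyscale frames on disk) extracts ORB descriptors and matches them with a Hamming-distance brute-force matcher; the BoW vocabulary in [60] is built over descriptors of this kind to index places efficiently.

```python
import cv2

# Hypothetical file names; any pair of greyscale frames will do.
img1 = cv2.imread("frame_query.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_candidate.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)          # rotation-invariant, scale-aware binary feature
kp1, des1 = orb.detectAndCompute(img1, None)  # keypoints + binary descriptors
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared with the Hamming distance.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Many low-distance matches suggest the two frames observe the same place,
# which is the signal exploited when proposing loop-closure candidates.
print(f"{len(matches)} cross-checked matches; best distance = {matches[0].distance}")
```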
RGB-D SLAM [62] is another feature-based SLAM that uses feature points to generate dense and accurate 3D maps. Several models utilise active camera sensors to build 6-DOF motion tracking systems capable of 3D reconstruction, achieving impressive performance even under challenging scenarios [63,64]. In contrast to low-level point features, high-level objects often provide more accurate tracking. For example, a planar SLAM system can detect planar surfaces in the environment to yield a planar map while detecting objects such as desks and chairs for localisation [65]. Recognising these objects, however, requires an offline supervised-learning procedure before the SLAM procedure can be executed.
Direct SLAM refers to methods that use the input images directly, without any feature detectors or descriptors. In contrast to feature-based methods, these feature-less approaches generally rely on photometric consistency to register two successive images. Using deep-learning models to extract a feature representation of the environment has also proven promising in numerous robotic domains [66,67]. For example, DTAM [68], LSD-SLAM [69] and SVO [70] are models that have achieved considerable success. DSO models [71,72] have also been shown to be capable of using a bundle-adjustment pipeline over temporal multi-view stereo to achieve high accuracy in a real-time system. In addition, models such as CodeSLAM [73] and CNN-SLAM [74] use deep-learning approaches to extract dense representations of the environment for performing direct SLAM. However, direct SLAM is often more time-consuming than feature-based SLAM because it operates directly on the image space.
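To make the photometric consistency principle concrete, here is a minimal sketch (NumPy, assuming a pinhole camera model and known per-pixel depth for the reference frame) that warps reference pixels into the current frame under a candidate camera motion and sums the squared intensity differences; direct methods minimise this kind of residual over the motion parameters instead of matching descriptors.

```python
import numpy as np

def photometric_error(I_ref, I_cur, depth_ref, K, R, t):
    """Sum of squared intensity differences after warping reference pixels
    into the current frame under a candidate rotation R and translation t.

    I_ref, I_cur: greyscale images (H x W float arrays)
    depth_ref:    per-pixel depth of the reference frame (H x W)
    K:            3x3 pinhole intrinsics
    """
    H, W = I_ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N

    # Back-project reference pixels to 3D, apply the candidate motion, re-project.
    pts = np.linalg.inv(K) @ pix * depth_ref.reshape(1, -1)
    pts = R @ pts + t.reshape(3, 1)
    proj = K @ pts
    u2, v2 = proj[0] / proj[2], proj[1] / proj[2]

    # Keep only pixels that land inside the current image (nearest-neighbour lookup).
    valid = (u2 >= 0) & (u2 < W - 1) & (v2 >= 0) & (v2 < H - 1) & (proj[2] > 0)
    I_warp = I_cur[np.round(v2[valid]).astype(int), np.round(u2[valid]).astype(int)]
    residual = I_ref.reshape(-1)[valid] - I_warp
    return np.sum(residual ** 2)
```

In practice, systems such as LSD-SLAM and DSO minimise a robustified version of this residual over selected pixels with sub-pixel interpolation, rather than over the full dense image as in this sketch.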

3. Localisation with Scene Modelling

Deep learning plays an essential role in scene understanding by exploiting a range of information through techniques such as CNN classification. CNNs can be applied to RGB images to extract semantic information, such as detecting scenes or pedestrians within the images [75,76,77]. CNNs can also operate directly on point clouds captured by range-based sensors such as LiDAR. Models such as PointNet [78], shown in Figure 3, can classify objects based purely on point clouds. For example, PointNet++ [78], TangentConvolutions [79], DOPS [80], and RandLA-Net [81] are recent deep-learning models that perform semantic understanding on large-scale point clouds. Most of these models are trained on point cloud datasets, which enables them to infer object and scene information based purely on the geometric arrangement of the input points.
Figure 3. Example of using PointNet [78] for performing part segmentation directly on input point clouds.
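To give a flavour of how a PointNet-style network consumes raw point clouds, the sketch below (PyTorch, a heavily simplified stand-in rather than the published architecture) applies shared per-point MLPs followed by a symmetric max-pooling, so the prediction is invariant to the ordering of the input points.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Heavily simplified PointNet-style classifier (illustrative, not the published model)."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Shared per-point MLP, implemented as 1-D convolutions over the point axis.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):            # points: (batch, N, 3)
        x = points.transpose(1, 2)        # -> (batch, 3, N)
        x = self.point_mlp(x)             # per-point features: (batch, 1024, N)
        x = torch.max(x, dim=2).values    # symmetric pooling: order-invariant global feature
        return self.classifier(x)         # class logits

logits = TinyPointNet()(torch.randn(2, 2048, 3))  # two random clouds of 2048 points each
print(logits.shape)                               # torch.Size([2, 10])
```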
Dynamic objects can introduce difficulties in SLAM, particularly during loop closure. SLAM can tackle this difficulty by utilising semantic information to filter dynamic objects out of the input data [82]. Using a scene understanding module, moving objects can be removed from the input to prevent the SLAM algorithm from conditioning on dynamic objects. For example, the SUMA++ model illustrated on the right of Figure 4 obtains a semantic understanding of each detected object in order to filter out dynamic objects such as pedestrians and other moving vehicles. However, this increased robustness comes at the cost of neglecting part of the perceived information, which can lower the accuracy of the estimated robot pose.
Figure 4. Using environment features to create a semantic map. SUMA++ [86] operating in an environment using LiDAR sensors, which provide rich information for understanding the environment around the vehicle.
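A minimal sketch of the filtering idea is shown below (NumPy, with hypothetical class IDs rather than the label set of any specific system): given per-point semantic labels, points belonging to potentially dynamic classes are dropped before being passed to the SLAM front-end.

```python
import numpy as np

# Hypothetical label IDs produced by a semantic segmentation network.
DYNAMIC_CLASSES = {11: "pedestrian", 13: "car", 14: "bicycle"}

def filter_dynamic_points(points, labels):
    """Remove points whose semantic class is potentially dynamic.

    points: (N, 3) LiDAR points
    labels: (N,) integer class ID per point
    """
    static_mask = ~np.isin(labels, list(DYNAMIC_CLASSES.keys()))
    return points[static_mask]

# Example: 5 points, two of which are labelled as a pedestrian and a car.
pts = np.random.rand(5, 3)
lbl = np.array([0, 11, 3, 13, 3])
print(filter_dynamic_points(pts, lbl).shape)  # (3, 3) -- only static points remain
```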

4. Scene Modelling with Topological Relationships and Dynamic Models

Scene graphs are a different approach to building a model of the environment that captures the metric, semantic, and topological relationships between scene objects and the overall environment [83]. Scene graphs can construct an environmental graph spanning an entire building, including the objects, materials, and rooms within it [84]. Their main disadvantage is that they need to be computed offline: previous approaches rely on registering RGB images against a known 3D mesh of the building to generate the 3D scene graph, which limits their applicability to static environments. Figure 5 illustrates Dynamic Scene Graphs (DSGs) [85], an approach that can also include dynamic elements within the environment; for example, a DSG can model humans navigating within the building. The original DSG needs to be built offline, but an extension has been proposed [85] that builds a 3D dynamic scene graph from visual-inertial data in a fully automatic manner. The approach first builds a 3D mesh-based semantic map, which is then fed to the dynamic scene graph generator.
Figure 5. Using Dynamic Scene Graph (DSG) [85] for generating multi-layer abstraction of an indoor environment.
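The layered structure of such a graph can be sketched with a few plain data classes (Python, with hypothetical field names chosen for illustration): nodes carry metric and semantic attributes, while edges encode topological relations such as "room contains object" or "building contains room".

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    node_id: int
    layer: str                  # e.g. "building", "room", "object", "agent"
    label: str                  # semantic class, e.g. "kitchen", "chair", "human"
    centroid: tuple             # metric information (x, y, z)
    attributes: dict = field(default_factory=dict)

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)     # node_id -> SceneNode
    edges: list = field(default_factory=list)     # (parent_id, child_id, relation)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def relate(self, parent_id, child_id, relation):
        self.edges.append((parent_id, child_id, relation))

# Building -> room -> object/agent hierarchy, including a dynamic human agent.
g = SceneGraph()
g.add_node(SceneNode(0, "building", "office_block", (0, 0, 0)))
g.add_node(SceneNode(1, "room", "kitchen", (5, 2, 0)))
g.add_node(SceneNode(2, "object", "chair", (5.4, 2.1, 0)))
g.add_node(SceneNode(3, "agent", "human", (4.8, 1.7, 0), {"dynamic": True}))
g.relate(0, 1, "contains")
g.relate(1, 2, "contains")
g.relate(1, 3, "contains")
```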
In addition, we can reason about the current situation by projecting what is likely to happen based on previous events [87]. This class of methods predicts possible future states of the robot by conditioning on the current belief over the robot state and the robot's dynamic model [88]. Dynamic models can also be attached to objects in the surrounding environment, such as pedestrians and vehicles, so that the model can predict the future poses of nearby objects together with an associated uncertainty [89].
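A minimal sketch of such a dynamic model is given below (NumPy, assuming a constant-velocity model, which is only one of many possible choices): the future state of a nearby pedestrian is propagated forward, and its covariance grows to reflect the increasing uncertainty of the prediction.

```python
import numpy as np

def predict_constant_velocity(x, P, dt, accel_noise=0.5):
    """Propagate a [px, py, vx, vy] state and its covariance forward by dt seconds."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    # Process noise driven by unknown acceleration; it grows with the prediction horizon.
    G = np.array([[0.5 * dt**2, 0],
                  [0, 0.5 * dt**2],
                  [dt, 0],
                  [0, dt]])
    Q = accel_noise**2 * G @ G.T
    return F @ x, F @ P @ F.T + Q

# Pedestrian at the origin moving at 1 m/s along x; predict 2 s ahead.
x = np.array([0.0, 0.0, 1.0, 0.0])
P = np.eye(4) * 0.1
x_pred, P_pred = predict_constant_velocity(x, P, dt=2.0)
print(x_pred[:2], np.diag(P_pred)[:2])   # predicted position and its grown variance
```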

5. Semantic Understanding with Segmentation

Pixel-wise semantic segmentation is another promising direction for semantic understanding in SLAM. FCN [75] is a fully convolutional network for pixel-wise segmentation that is widely used in the computer vision community and is applicable to SLAM. ParseNet [90] derived a similar CNN architecture and injected global context information through global pooling layers added to FCN; this global context allows the model to achieve better scene segmentation from a more feature-rich representation. SegNet [91] is another network that uses an encoder-decoder architecture for segmentation, where the decoder upsamples the low-resolution features captured from the images. Bayesian approaches are helpful in many learning-based robotics applications [92,93]. Bayesian SegNet [92] takes a probabilistic approach, using the dropout layers of the original SegNet for sampling; the resulting estimate of per-pixel class probabilities often outperforms the original approach. Conditional Random Fields have also been combined with CNN architectures [94] by formulating mean-field approximate inference as a Recurrent Neural Network.
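To illustrate the Bayesian idea behind Bayesian SegNet in isolation, here is a minimal sketch (PyTorch, a toy encoder-decoder rather than the published network) that keeps dropout active at test time and averages several stochastic forward passes to obtain per-pixel class probabilities together with an uncertainty estimate.

```python
import torch
import torch.nn as nn

class ToySegNet(nn.Module):
    """Tiny encoder-decoder segmentation network (illustrative only)."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.Dropout2d(0.5),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, num_classes, 3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def mc_dropout_segment(model, image, samples=10):
    """Monte Carlo dropout: average softmax maps over several stochastic passes."""
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(image), dim=1) for _ in range(samples)])
    mean = probs.mean(dim=0)                                    # per-pixel class probabilities
    entropy = -(mean * mean.clamp_min(1e-8).log()).sum(dim=1)   # per-pixel uncertainty
    return mean.argmax(dim=1), entropy

labels, uncertainty = mc_dropout_segment(ToySegNet(), torch.randn(1, 3, 64, 64))
print(labels.shape, uncertainty.shape)  # torch.Size([1, 64, 64]) for both outputs
```

The per-pixel entropy map is what downstream SLAM modules can use to down-weight unreliable semantic labels.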
Semantic information is particularly valuable in environments where a robot needs to interact with humans [95]. Progress in deep-learning-based semantic segmentation in computer vision has been instrumental in advancing research in semantic SLAM. By combining model-based SLAM methods with spatio-temporal CNN-based semantic segmentation [96], the SLAM model can be provided with a more informative feature representation for localisation; the proposed system can simultaneously perform 3D semantic scene mapping and 6-DOF localisation even in a large indoor environment. The pixel-voxel network [97] is another similar approach that uses a CNN-like architecture for semantic mapping. SemanticFusion [98] integrates CNN-based semantic segmentation with the dense SLAM system ElasticFusion [99], resulting in a model that produces a dense semantic map and performs well in indoor environments.

6. Sensor Fusion for Semantic Scene Understanding

With recent advancements in deep learning, numerous Visual-SLAM systems have also achieved tremendous success in using learned models for semantic understanding via data fusion. Models such as Frustum PointNets [100] utilise both RGB cameras and LiDAR sensors to improve the accuracy of scene semantics. Figure 6 illustrates how Frustum PointNets fuses information from both sensors: a PointNet is applied for object instance segmentation and amodal bounding-box regression. Sensor fusion provides a richer feature representation for data association. For example, VINet [101] is a sensor fusion network that combines the pose estimated by DeepVO [102] with inertial sensor readings using an LSTM. During training, the prediction and fusion networks are trained jointly so that gradients pass through the entire network; the two networks therefore compensate for each other, and the fusion system achieves high performance compared to traditional sensor fusion methods. The same methodology can also be used to build a fusion system [103] capable of fusing 6-DOF pose data from cameras and magnetic sensors [104].
Figure 6. The multi-modal model Frustum PointNets [100], which uses a CNN model to project detected objects from RGB images into 3D space, thus improving the accuracy of semantic understanding.
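As a sketch of the VINet-style fusion idea (PyTorch, with hypothetical feature dimensions, not the published architecture), the snippet below concatenates a per-frame visual feature with a short window of IMU readings and feeds the sequence through an LSTM that regresses a 6-DOF pose increment; because everything is differentiable, a visual encoder and the fusion LSTM can be trained jointly end to end.

```python
import torch
import torch.nn as nn

class VisualInertialFusion(nn.Module):
    """Illustrative fusion network: visual feature + IMU window -> 6-DOF pose increment."""
    def __init__(self, visual_dim=256, imu_per_frame=10 * 6, hidden=128):
        super().__init__()
        self.imu_encoder = nn.Linear(imu_per_frame, 64)       # flattened IMU window per frame
        self.lstm = nn.LSTM(visual_dim + 64, hidden, batch_first=True)
        self.pose_head = nn.Linear(hidden, 6)                 # (tx, ty, tz, roll, pitch, yaw)

    def forward(self, visual_feats, imu_windows):
        # visual_feats: (batch, T, visual_dim) features from a visual encoder
        # imu_windows:  (batch, T, imu_per_frame) IMU samples between consecutive frames
        fused = torch.cat([visual_feats, self.imu_encoder(imu_windows)], dim=-1)
        out, _ = self.lstm(fused)
        return self.pose_head(out)                            # per-frame pose increments

# Ten-frame sequence with random stand-in features; gradients flow through both branches.
model = VisualInertialFusion()
pred = model(torch.randn(2, 10, 256), torch.randn(2, 10, 60))
print(pred.shape)  # torch.Size([2, 10, 6])
```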
Information obtained from a camera can also be fused with GPS, INS, and wheel odometry readings to form an ego-motion estimation system [105]. The model essentially uses deep learning to capture the temporal motion dynamics: the camera motion is fed to a mixture density network that constructs an optical flow vector for better estimation. Direct methods for visual odometry (VO) can exploit intensity-level information gathered from the input images, but they cannot guarantee optimality compared to feature-based methods. Semi-direct VO (SVO2) [106] is a hybrid method that uses direct methods to track pixels while relying on feature-based methods for the joint optimisation of structure and motion; the hybrid approach takes advantage of both to improve the robustness of VO. Similar approaches such as VINS-Fusion [107] fuse an IMU with monocular visual input to estimate odometry with high reliability. Deep neural networks can further learn rigid-body motion within a CNN architecture [108], using raw point cloud data as input to predict the SE(3) rigid transformation of the robot.
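The per-step outputs of such odometry networks are relative rigid-body motions; the sketch below (NumPy, with hypothetical predicted rotations and translations) shows how successive SE(3) increments are composed into a global trajectory estimate, which is the step where accumulated drift arises.

```python
import numpy as np

def to_se3(R, t):
    """Pack a rotation matrix and translation vector into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def accumulate_trajectory(relative_transforms):
    """Compose per-step SE(3) increments into world-frame poses."""
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for T_step in relative_transforms:
        pose = pose @ T_step          # right-multiply: motion expressed in the current frame
        trajectory.append(pose.copy())
    return trajectory

# Two hypothetical network outputs: move 1 m forward, then turn 90 degrees and move again.
R_turn = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
steps = [to_se3(np.eye(3), [1, 0, 0]), to_se3(R_turn, [1, 0, 0])]
for T in accumulate_trajectory(steps):
    print(T[:3, 3])   # world-frame positions along the trajectory
```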

This entry is adapted from the peer-reviewed paper 10.3390/s22197265
