The visual-based simultaneous localization and mapping (SLAM) techniques use one or more cameras in the sensor system, receiving 2D images as the source of information. In general, visual-based SLAM algorithms are divided into three main threads: initialization, tracking, and mapping.
1. Introduction
Simultaneous localization and mapping (SLAM) technology, first proposed by Smith and Cheeseman in 1986 ^{[1]}, is used in an extensive range of applications, especially in the domains of augmented reality (AR) ^{[2]}^{[3]}^{[4]} and robotics ^{[5]}^{[6]}^{[7]}. The SLAM process aims at mapping an unknown environment while simultaneously locating a sensor system in it through the signals provided by the sensor(s). In robotics, the construction of a map is a crucial task, since it registers landmarks and thus supports the visualization of the environment. In addition, it can help in the state estimation of the robot, relocating it and decreasing estimation errors when revisiting registered areas ^{[8]}.
Map construction comes with two other tasks: localization and path planning. According to Stachniss ^{[9]}, the mapping problem may be described by examining three questions from the robot's perspective: What does the world look like? Where am I? And how can I reach a given location? The first question is addressed by the mapping task, which seeks to construct a map, i.e., a model of the environment. To do so, it requires the location of the observed landmarks, i.e., the answer to the second question, provided by the localization task. The localization task seeks to determine the robot's pose, i.e., its orientation and position, and consequently locates the robot on the map. Building on the first two tasks, path planning answers the last question, seeking to estimate a trajectory for the robot to reach a given location. It relies on the robot's current pose, provided by the localization task, and on the environment's characteristics, provided by the mapping task. SLAM is a solution that integrates the mapping and localization tasks.
Visual-based SLAM algorithms are especially attractive due to their simple sensor configuration, miniaturized size, and low cost. The visual-based approaches can be divided into three main categories: visual-only SLAM, visual-inertial (VI) SLAM, and RGB-D SLAM. The first refers to SLAM techniques based only on 2D images provided by a monocular or stereo camera; they present a higher technical difficulty due to their limited visual input ^{[10]}. The tracking robustness of visual-SLAM algorithms may be increased by adding an inertial measurement unit (IMU), which is available in miniaturized sizes and at low cost while achieving high accuracy, essential aspects for applications that require a lightweight design, such as autonomous race cars ^{[11]}. In addition, visual-based SLAM systems may employ a depth sensor and process the depth information by applying an RGB-D-based approach.
2. Visual-Based SLAM
Figure 1 shows a general view of the three main parts generally present in visual-based SLAM approaches.
Figure 1. General components of a visual-based SLAM. The depth and inertial data may be added to the 2D visual input to generate a sparse map (generated with the ORB-SLAM3 algorithm ^{[12]} in the MH_01 sequence ^{[13]}), a semi-dense map (obtained with LSD-SLAM ^{[14]} in the dataset provided by the authors), and a dense reconstruction (Reprinted from ^{[15]}).
2.1. Visual-Only SLAM
The visual-only SLAM systems are based on 2D image processing. After acquiring images from more than one point of view, the system performs the initialization process to define a global coordinate system and reconstruct an initial map. In feature-based algorithms relying on filters (filtering-based algorithms), the first step consists of initializing the map points with high uncertainty, which may converge later to their actual positions. This procedure is followed by tracking, which attempts to estimate the camera pose. Simultaneously, the mapping process includes new points in the 3D reconstruction as more unknown scenes are observed.
2.1.1. Feature-Based Methods
SLAM algorithms based on features consider a certain number of points of interest, called keypoints. They can be detected in several images and matched by comparing their descriptors; this process provides the information for camera pose estimation. The descriptor data and the keypoint location compose the feature, i.e., the data used by the algorithm to perform tracking and mapping. As the feature-based methods do not use all the frame information, they are well suited to embedded implementations. However, the feature extraction may fail in a textureless environment ^{[16]}, and the resulting map is sparse, providing less information than a dense one.
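As an illustrative sketch of the matching step, the fragment below implements nearest-neighbour matching of binary descriptors (such as ORB's 256-bit strings) with a ratio test. The descriptor size, the ratio threshold, and the brute-force search are illustrative choices, not tied to any particular SLAM system.

```python
import numpy as np

def hamming(a, b):
    # Hamming distance between two binary descriptors (uint8 arrays)
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def match_descriptors(desc_a, desc_b, ratio=0.8):
    # Nearest-neighbour matching with a ratio test: accept a match only
    # when the best candidate is clearly better than the second best.
    matches = []
    for i, d in enumerate(desc_a):
        dists = [hamming(d, e) for e in desc_b]
        order = np.argsort(dists)
        if dists[order[0]] < ratio * dists[order[1]]:
            matches.append((i, int(order[0])))
    return matches
```

In a real system, the accepted pairs would then feed the pose estimator; the brute-force loop is typically replaced by an approximate nearest-neighbour index for speed.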
2.1.2. Direct Methods
In contrast with the feature-based methods, the direct methods use the sensor data without preprocessing, considering the pixels' intensities and minimizing the photometric error. There are many different algorithms based on this methodology, and depending on the chosen technique, the reconstruction may be dense, semi-dense, or sparse. The reconstruction density is a substantial constraint on the algorithm's real-time operation, since the joint optimization of both structure and camera positions is more computationally expensive for dense and semi-dense reconstructions than for a sparse one ^{[17]}. Figure 2 shows the main difference between feature-based (indirect) and direct methods according to their front-end and back-end, that is, the part of the algorithm responsible for abstracting the sensor data and the part responsible for interpreting the abstracted data, respectively.
Figure 2. General differences between feature-based and direct methods. Top: main steps followed by the feature-based methods, resulting in a sparse reconstruction (map generated with the ORB-SLAM3 algorithm ^{[12]} in the MH_01/EuRoC sequence ^{[13]}). Bottom: main steps followed by a direct method, which may result in a sparse (generated from the reconstruction of sequence_02/TUM MonoVO ^{[18]} with the DSO algorithm ^{[19]}) or dense reconstruction (Reprinted from ^{[15]}), according to the chosen technique.
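To make the contrast concrete, here is a toy version of the photometric error a direct method minimizes. A pure 2-D pixel shift stands in for the full warp induced by camera motion and scene depth, which is an intentional simplification for the sketch.

```python
import numpy as np

def photometric_error(img_ref, img_cur, pixels, shift):
    # Sum of squared intensity differences between reference pixels and
    # the locations a candidate motion hypothesis (here, a plain 2-D
    # pixel shift) maps them to in the current image.
    err = 0.0
    du, dv = shift
    for u, v in pixels:
        err += (float(img_ref[v, u]) - float(img_cur[v + dv, u + du])) ** 2
    return err
```

A direct method searches over motion hypotheses for the one yielding the smallest error; here that would simply mean evaluating `photometric_error` for several candidate shifts and keeping the minimum.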
2.2. Visual-Inertial SLAM
The VI-SLAM approach incorporates inertial measurements to estimate the structure and the sensor pose. The inertial data are provided by an inertial measurement unit (IMU), which combines gyroscope, accelerometer, and, optionally, magnetometer devices. This way, the IMU is capable of providing information on the angular rate (gyroscope) and acceleration (accelerometer) along the x-, y-, and z-axes and, additionally, on the magnetic field around the device (magnetometer). While adding an IMU may increase the richness of the information about the environment and provide higher accuracy, it also increases the algorithm's complexity, especially during the initialization step, since, besides the initial estimation of the camera pose, the algorithm also has to estimate the IMU poses. VI-SLAM algorithms can be divided according to the type of fusion between the camera and IMU data, which can be loosely or tightly coupled. The loosely coupled methods do not merge the IMU states to estimate the full pose; instead, the IMU data are used to estimate the orientation and changes in the sensor's position ^{[20]}. On the other hand, the tightly coupled methods fuse the camera and IMU data into a motion equation, resulting in a state estimation that considers both sources.
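A minimal taste of this kind of inertial fusion is a one-axis complementary filter, a deliberately simplified stand-in for the estimators used in practice: it integrates the gyroscope for smooth short-term tracking and slowly corrects the result with the accelerometer's drift-free but noisy absolute tilt estimate. All rates and angles below are invented for the demonstration.

```python
def complementary_filter(gyro_rates, accel_angles, dt=0.01, alpha=0.98):
    # Blend the integrated gyro (weight alpha) with the accelerometer's
    # absolute tilt reading (weight 1 - alpha) at every sample.
    angle = accel_angles[0]
    history = []
    for w, a in zip(gyro_rates, accel_angles):
        angle = alpha * (angle + w * dt) + (1.0 - alpha) * a
        history.append(angle)
    return history

# A gyro with a constant 0.01 rad/s bias would drift on its own; the
# accelerometer term keeps the fused estimate near the true 0.5 rad.
est = complementary_filter([0.01] * 1000, [0.5] * 1000)
```

Pure gyro integration over these 10 s would drift by 0.1 rad; the fused estimate stays within a few milliradians of the true angle.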
2.3. RGB-D SLAM
SLAM systems based on RGB-D data started to attract more attention with the advent of Microsoft's Kinect in 2010. RGB-D sensors consist of a monocular RGB camera and a depth sensor, allowing SLAM systems to directly acquire depth information with reasonable accuracy in real time on low-cost hardware. As the RGB-D devices directly provide the depth map to the SLAM systems, the general framework of this approach differs from the ones already presented.
Most of the RGB-D-based systems make use of the iterative closest point (ICP) algorithm to locate the sensor, fusing the depth maps to obtain the reconstruction of the whole structure. RGB-D systems present advantages such as providing color image data and a dense depth map without any preprocessing step, hence decreasing the complexity of the SLAM initialization ^{[10]}. Despite this, the approach is best suited to indoor environments and requires large amounts of memory and power ^{[21]}.
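The core of one ICP iteration can be sketched as the closed-form rigid alignment of two point sets with known correspondences; real ICP alternates this step with re-estimating the correspondences by nearest-neighbour search. This is the standard SVD (Kabsch) solution, shown here as a minimal sketch.

```python
import numpy as np

def align_rigid(src, dst):
    # Rigid transform (R, t) minimising sum ||R @ src_i + t - dst_i||^2,
    # assuming src[i] corresponds to dst[i].
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)        # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

Given noisy depth maps, a full ICP loop would re-associate points, re-solve, and repeat until the alignment error stops decreasing.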
3. Visual-SLAM Algorithms
Considering the general approach of the SLAM systems, six criteria that influence system dimensioning, accuracy, and hardware implementation were established: algorithm type, map density, global optimization, loop closure, availability, and embedded implementations.

Algorithm type: this criterion indicates the methodology adopted by the algorithm. The visual-only algorithms are divided into feature-based, hybrid, and direct methods. The visual-inertial algorithms may be filtering-based or optimization-based. Lastly, the RGB-D approaches can be divided according to their tracking method, which can be direct, hybrid, or feature-based.

Map density: in general, a dense reconstruction requires more computational resources than a sparse one, with an impact on memory usage and processing cost. On the other hand, it provides a more detailed and accurate reconstruction, which may be a key factor in a SLAM project.

Global optimization: SLAM algorithms may include global map optimization, a technique that seeks to compensate for the accumulated error introduced by the camera movement, considering the consistency of the entire structure.

Loop closure: loop-closure detection refers to the capability of the SLAM algorithm to identify previously observed scenes in order to estimate and correct the drift accumulated during the sensor movement.

Availability: several SLAM algorithms are open source and made available by the authors or have their implementations made available by third parties, facilitating their usage and reproduction.

Embedded implementations: embedded SLAM implementation is an emerging field with several applications, especially in the robotics and automotive domains. This criterion depends on each algorithm's hardware constraints and specificity, since there must be a trade-off between the algorithm's architecture and its energy, memory, and processing usage.
3.1. Visual-Only SLAM
3.1.1. MonoSLAM (2007)
The first monocular SLAM algorithm was MonoSLAM, proposed by Davison et al. ^{[22]} in 2007. The first step of the algorithm consists of the system's initialization. Then, it updates the state vector considering a constant-velocity motion model, where the camera motion and environment structure are estimated in real time using an extended Kalman filter (EKF). The algorithm is represented in Figure 3.
Figure 3. Diagram representing the MonoSLAM algorithm.
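The EKF prediction step under a constant-velocity model can be sketched on a toy 1-D state; MonoSLAM applies the same idea to its full 13-dimensional camera state, so the dimensions and the noise model below are illustrative only.

```python
import numpy as np

def ekf_predict(x, P, dt, q):
    # Constant-velocity prediction for a 1-D state [position, velocity]:
    # propagate the mean through the motion model and inflate the
    # covariance with process noise of spectral density q.
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])                        # motion model Jacobian
    Q = q * np.array([[dt**3 / 3.0, dt**2 / 2.0],
                      [dt**2 / 2.0, dt]])             # process noise
    return F @ x, F @ P @ F.T + Q
```

Between camera measurements, repeated prediction steps let the uncertainty grow; the EKF update (not shown) shrinks it again when features are observed.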
3.1.2. Parallel Tracking and Mapping (2007)
Another pioneering algorithm is Parallel Tracking and Mapping (PTAM) ^{[23]}. PTAM was the first algorithm to separate tracking and mapping into two different threads and to apply the concept of keyframes to the mapping thread. First, the mapping thread performs the map initialization. New keyframes are added to the system as the camera moves, and the initial map is expanded. The depth of new points is computed by triangulation between two consecutive keyframes. The tracking thread computes the camera poses: for each new frame, it estimates an initial pose and projects the map points onto the image. PTAM then uses the correspondences to compute the camera pose by minimizing the reprojection error. Figure 4 represents the steps performed by the PTAM algorithm.
Figure 4. Diagram representing the PTAM algorithm.
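The triangulation of a new map point from two keyframes can be sketched with the standard linear (DLT) method; the projection matrices below assume normalized cameras for simplicity, which is an assumption of the sketch rather than of PTAM itself.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # Linear (DLT) triangulation of one map point observed in two
    # keyframes.  P1, P2 are 3x4 projection matrices; x1, x2 are the
    # 2-D observations.  The point is the null vector of A.
    A = np.stack([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # de-homogenise
```

With noisy observations, the SVD solution is a good initializer that bundle adjustment later refines.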
3.1.3. Dense Tracking and Mapping (2011)
Dense tracking and mapping (DTAM), proposed by Newcombe et al. ^{[24]}, was the first fully direct method in the literature. The algorithm is divided into two main parts: dense mapping and dense tracking. The first stage estimates the depth values by defining a cost volume representing the average photometric error of multiple frames, computed over the inverse depths of the current frame. The inverse depth that minimizes the photometric error is selected for the reconstruction. In the dense tracking stage, DTAM estimates the motion parameters by aligning the current frame with an image of the dense model projected into a virtual camera. Figure 5 shows a general view of the DTAM algorithm.
Figure 5. Diagram representing the DTAM algorithm.
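A single-pixel slice of such a cost volume can be sketched as follows, assuming a rectified two-view geometry with unit focal length so that disparity is simply baseline times inverse depth; DTAM's actual formulation averages the error over many frames and adds a smoothness regularizer.

```python
import numpy as np

def inverse_depth_costs(I_ref, I_cur, u, baseline, inv_depths):
    # Photometric cost of reference pixel u for each inverse-depth
    # hypothesis rho, where the hypothesised match in the second view
    # sits at u - baseline * rho (rectified geometry, focal length 1).
    costs = []
    for rho in inv_depths:
        u2 = int(round(u - baseline * rho))
        costs.append(abs(float(I_ref[u]) - float(I_cur[u2])))
    best = inv_depths[int(np.argmin(costs))]
    return best, costs
```

Stacking such per-pixel cost slices over the whole image gives the volume whose per-pixel minima form the initial depth map.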
3.1.4. Semi-Direct Visual Odometry (2014)
The semi-direct visual odometry (SVO) algorithm ^{[25]} combines the advantages of both feature-based and direct methods. The algorithm is divided into two main threads: motion estimation and mapping. The first thread estimates the sensor's motion parameters by minimizing the photometric error. The mapping thread is based on probabilistic depth filters and seeks to estimate the optimum depth value for each 2D feature. When the depth uncertainty becomes low enough, the algorithm inserts the 3D point in the reconstruction, as shown in Figure 6.
Figure 6. Diagram representing the SVO algorithm. Adapted from ^{[25]}.
3.1.5. Large-Scale Direct Monocular SLAM (2014)
Large-scale direct monocular SLAM (LSD-SLAM) ^{[14]} is a direct algorithm that performs a semi-dense reconstruction. The algorithm consists of three main steps: tracking, depth map estimation, and map optimization. The first step minimizes the photometric error to estimate the sensor's pose. Next, LSD-SLAM performs keyframe selection in the depth map estimation step. If a new keyframe is added, its depth map is initialized; otherwise, the depth map of the current keyframe is refined by performing several small-baseline stereo comparisons. Finally, in the map optimization step, LSD-SLAM incorporates the new keyframe in the map and optimizes it by applying a pose-graph optimization algorithm. Figure 7 illustrates the procedure.
Figure 7. Diagram representing the LSD-SLAM algorithm. Adapted from ^{[14]}.
3.1.6. ORB-SLAM 2.0 (2017)
The ORB-SLAM2 algorithm ^{[26]}, which originated from ORB-SLAM ^{[27]}, is considered the state of the art among feature-based algorithms. It works in three parallel threads: tracking, local mapping, and loop closing. The first thread locates the sensor by finding feature correspondences and minimizing the reprojection error. The local mapping thread is responsible for the map management operations. The last thread, loop closing, is in charge of detecting new loops and correcting the drift error in the loop. After processing the three threads, the algorithm also considers the consistency of the whole structure and estimated motion by performing a full bundle adjustment. Figure 8 represents the threads that constitute the algorithm.
Figure 8. Diagram representing the ORB-SLAM 2.0 algorithm. Adapted from ^{[27]}.
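The reprojection error that the tracking thread minimizes can be sketched as follows; the intrinsic matrix and the point set are invented for the example, and a real system would minimize this quantity over the pose with a robust nonlinear least-squares solver rather than merely evaluate it.

```python
import numpy as np

def mean_reprojection_error(points3d, observations, R, t, K):
    # Mean pixel distance between observed keypoints and the projections
    # of the associated 3-D map points under the pose hypothesis (R, t).
    errs = []
    for X, uv in zip(points3d, observations):
        p = K @ (R @ X + t)              # world -> camera -> image
        errs.append(np.linalg.norm(p[:2] / p[2] - uv))
    return float(np.mean(errs))
```

The correct pose drives this error towards the keypoint detection noise; a wrong pose hypothesis shows up immediately as a large mean residual.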
3.1.7. CNN-SLAM (2017)
CNN-SLAM ^{[28]} is one of the first works to present a real-time SLAM system based on convolutional neural networks (CNNs). The algorithm may be divided into two different pipelines: one applied to every input frame and another to every keyframe. The first is responsible for the camera pose estimation, minimizing the photometric error between the current frame and the nearest keyframe. In parallel, for every keyframe, the depth is predicted by a CNN. In addition, the algorithm predicts the semantic segmentation of each frame. After these processing steps, the algorithm performs a pose-graph optimization to obtain a globally optimized pose estimation, as shown in Figure 9.
Figure 9. Diagram representing the CNN-SLAM algorithm. Adapted from ^{[28]}.
3.1.8. Direct Sparse Odometry (2018)
The direct sparse odometry (DSO) algorithm ^{[19]} combines a direct approach with a sparse reconstruction. DSO considers a window of the most recent frames and performs a continuous optimization by applying a local bundle adjustment that optimizes the keyframe window and the inverse depth map. The algorithm divides the image into several blocks and selects the highest-intensity points. To increase robustness, DSO also considers exposure time and lens distortion in the optimization. Originally, the algorithm did not include global optimization or loop closure, but Xiang et al. ^{[29]} proposed an extension of DSO that adds loop-closure detection and pose-graph optimization. The main steps of DSO are represented in Figure 10.
Figure 10. Diagram representing the DSO algorithm.
3.2. Visual-Inertial SLAM
3.2.1. Multi-State Constraint Kalman Filter (2007)
The multi-state constraint Kalman filter (MSCKF) ^{[30]} can be implemented using both monocular and stereo cameras ^{[31]}. The algorithm's pipeline consists of three main steps: propagation, image registration, and update. In the first step, the MSCKF discretizes a continuous-time IMU model to propagate the filter state and covariance. Then, the image registration performs a state augmentation each time a new image is recorded; this estimate is added to the state and covariance matrix to initiate the image processing module (feature extraction). Finally, the algorithm performs the filter update. Figure 11 represents the algorithm.
Figure 11. Diagram representing the MSCKF algorithm.
3.2.2. Open Keyframe-Based Visual-Inertial SLAM (2014)
Open Keyframe-based Visual-Inertial SLAM (OKVIS) ^{[32]} is an optimization-based method. It combines the IMU data and reprojection terms into an objective function, allowing the algorithm to jointly optimize both the weighted reprojection error and the temporal error from the IMU. The algorithm builds a local map, and subsequent keyframes are selected according to the area of matched keypoints. The algorithm is depicted in Figure 12.
Figure 12. Diagram representing the OKVIS algorithm.
3.2.3. Robust Visual Inertial Odometry (2015)
The Robust Visual Inertial Odometry (ROVIO) algorithm ^{[33]} is another filter-based method that uses the EKF approach. Similar to other filter-based methods, it uses the IMU data for state propagation and the camera data for the filter update. However, besides performing feature extraction, ROVIO also extracts multilevel patches around the features, as illustrated in Figure 13.
Figure 13. Diagram representing the feature handling performed by the ROVIO algorithm. Adapted from ^{[33]}.
3.2.4. Visual-Inertial ORB-SLAM (2017)
The Visual-Inertial ORB-SLAM (VI-ORB) algorithm ^{[34]} is based on the already presented ORB-SLAM algorithm ^{[26]}. As such, the system also comprises three main threads: tracking, local mapping, and loop closing. In VI-ORB, the tracking thread estimates the sensor pose, velocity, and IMU biases. Additionally, this thread performs the joint optimization of the reprojection error of the matched points and the IMU error data. The local mapping thread adopts a different culling policy that considers the IMU operation. Finally, the loop closing thread implements a place recognition module to identify the keyframes already visited by the sensors. Furthermore, the algorithm performs an optimization to minimize the accumulated error. Figure 14 illustrates the main differences between the ORB-SLAM algorithm (see Figure 8) and its visual-inertial version.
Figure 14. Diagram representing the VI-ORB algorithm.
3.2.5. Monocular Visual-Inertial System (2018)
The Monocular Visual-Inertial System (VINS-Mono) ^{[35]} is a monocular visual-inertial state estimator. It starts with a measurement process responsible for feature extraction and tracking, and for the preintegration of the IMU data between frames. Then, the algorithm performs an initialization process to provide the initial values for a nonlinear optimization that minimizes the visual and inertial errors. VINS-Mono also implements a relocalization module and a pose-graph optimization module that merges the IMU measurements and feature observations. Figure 15 illustrates the VINS-Mono algorithm.
Figure 15. Diagram representing the VINS-Mono algorithm.
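The idea behind IMU preintegration can be sketched in one dimension, assuming gravity has already been removed from the accelerometer samples; the point of the technique is that the resulting increments depend only on the IMU data between two frames, so they need not be recomputed when the optimizer adjusts the frame poses.

```python
def preintegrate(accels, dt):
    # Accumulate the velocity and position increments implied by the
    # accelerometer samples between two consecutive frames (1-D case,
    # gravity-compensated input assumed).
    dv, dp = 0.0, 0.0
    for a in accels:
        dp += dv * dt + 0.5 * a * dt * dt
        dv += a * dt
    return dp, dv
```

In the real 3-D formulation the same accumulation carries rotation, velocity, and position increments plus their Jacobians with respect to the IMU biases.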
3.2.6. Visual-Inertial Direct Sparse Odometry (2018)
The Visual-Inertial Direct Sparse Odometry (VI-DSO) algorithm ^{[36]} is based on the already presented DSO algorithm ^{[19]}. The algorithm seeks to minimize an energy function that combines the photometric and inertial errors, built considering a nonlinear dynamic model. Figure 16 shows an overview of the VI-DSO algorithm and its main differences with respect to the DSO technique.
Figure 16. Diagram representing the VI-DSO algorithm.
3.2.7. ORB-SLAM3 (2020)
The ORB-SLAM3 algorithm ^{[37]} is a technique that combines the ORB-SLAM and VI-ORB algorithms. As with its predecessors, the algorithm is divided into three main threads: tracking, local mapping, and, instead of loop closing, loop closing and map merging. In addition, ORB-SLAM3 maintains a multi-map representation called Atlas, which holds an active map used by the tracking thread and non-active maps used for relocalization and place recognition. The first two threads follow the same principle as VI-ORB, while map merging is added to the last thread. The loop closing and map merging thread uses all the maps in Atlas to identify common parts and, depending on the location of the overlapped area, performs loop correction or merges maps and changes the active map. Another important aspect of ORB-SLAM3 is its initialization technique, which relies on maximum-a-posteriori estimation applied individually to the visual and inertial estimates, which are later jointly optimized. The algorithm can be used with monocular, stereo, and RGB-D cameras, and implements global optimization and loop-closure techniques.
3.3. RGB-D SLAM
3.3.1. KinectFusion (2011)
The KinectFusion algorithm ^{[38]} was the first algorithm based on an RGB-D sensor to operate in real time. The algorithm includes four main steps: measurement, pose estimation, reconstruction update, and surface prediction. In the first step, the RGB image and depth data are used to generate a vertex map and a normal map. In the pose estimation step, the algorithm applies ICP alignment between the current surface and the predicted one (provided by the previous step). Then, the reconstruction update step integrates the new depth frame into the 3D reconstruction, which is raycast into the new estimated frame to obtain a new dense surface prediction. The KinectFusion algorithm is capable of good mapping in rooms up to medium size ^{[38]}. However, it accumulates drift errors, since it does not perform loop closing ^{[39]}. An overview of the steps performed by the algorithm is shown in Figure 17.
Figure 17. Diagram representing the KinectFusion algorithm. Adapted from ^{[38]}.
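The reconstruction update can be sketched as the per-voxel fusion rule KinectFusion applies to its truncated signed distance function (TSDF); the truncation distance and the unit measurement weight below are illustrative choices for the sketch.

```python
def tsdf_update(tsdf, weight, sdf_meas, trunc=0.1):
    # Fuse one new signed-distance measurement into a voxel: truncate
    # and normalise the measurement, then take a weighted running
    # average with the value already stored.
    d = max(-trunc, min(trunc, sdf_meas)) / trunc
    return (weight * tsdf + d) / (weight + 1), weight + 1
```

Repeated consistent measurements leave the stored value unchanged while its weight (confidence) grows; an outlier is averaged down rather than overwriting the surface.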
3.3.2. SLAM++ (2013)
The SLAM++ algorithm ^{[40]} is an object-oriented SLAM algorithm that takes advantage of previously known scenes containing repeated objects and structures, such as a classroom. After the system initialization, SLAM++ operates in four steps: camera pose estimation, object insertion and pose update, pose-graph optimization, and surface rendering. The first step estimates the current camera pose by applying the ICP algorithm, considering the dense multi-object prediction of the current SLAM graph. Next, the algorithm seeks to identify objects in the current frame using the database information. The third step inserts the recognized objects in the SLAM graph by performing a pose-graph optimization operation. Finally, the algorithm renders the objects in the graph, as shown in Figure 18. SLAM++ performs loop-closure detection and, by exploiting object repeatability, increases its efficiency and scene description. Nevertheless, the algorithm is most suitable for scenes known in advance.
Figure 18. Diagram representing the SLAM++ algorithm. Adapted from ^{[40]}.
3.3.3. Dense Visual Odometry (2013)
The dense visual odometry SLAM (DVO-SLAM) algorithm, proposed by Kerl et al. ^{[41]}, is a keyframe-based technique. It minimizes the photometric error between keyframes, using the depth values and pixel coordinates, to estimate the camera motion. For each input frame, the algorithm calculates an entropy value that is compared to a threshold to decide whether a new keyframe is needed; the same principle, with a different threshold, is used for loop detection. The map is represented by a SLAM graph in which the vertices hold camera poses and the edges represent the transformations between keyframes. The algorithm is robust to textureless scenes and performs loop-closure detection. The map representation relies on the keyframes, and the algorithm does not perform an explicit map reconstruction. Figure 19 shows an overview of the DVO algorithm.
Figure 19. Diagram representing the DVO algorithm.
3.3.4. RGBD-SLAMv2 (2014)
RGBD-SLAMv2 ^{[42]} is one of the most popular RGB-D-based algorithms and relies on feature extraction. It applies the RANSAC algorithm to estimate the transformation between matched features and the ICP algorithm to obtain the pose estimation. Finally, the system executes a global optimization and loop closure to eliminate the accumulated error. In addition, this method proposes an environment measurement model (EMM) to validate the transformations obtained between the frames. The algorithm is based on SIFT features, which degrades its real-time performance; it presents high computational consumption and requires slow sensor movement for correct operation ^{[39]}. Figure 20 represents the algorithm.
Figure 20. Diagram representing the RGBD-SLAMv2 algorithm. Adapted from ^{[42]}.