Geometry and Semantic-Based Dynamic SLAM

Simultaneous localization and mapping (SLAM) is a crucial technology for advanced robotics applications, enabling collision-free navigation and environment exploration. In terms of the localization aspect of SLAM, the accuracy of pose estimation is greatly affected by the proportion of dynamic feature points being tracked in the field of view. 

  • VSLAM
  • dynamic environments
  • object detection
  • geometric constraint

1. Introduction

Simultaneous localization and mapping (SLAM) [1] is a crucial technology for advanced robotics applications, enabling collision-free navigation and environment exploration [2]. SLAM relies heavily on the sensors carried by robots to simultaneously achieve high-precision localization and environment mapping. Visual SLAM (VSLAM) [3][4] utilizes cameras to estimate the robot’s position, offering several advantages such as cost-effectiveness, lower energy consumption, and reduced computational requirements. Over the last decade, the VSLAM framework has witnessed rapid development, with notable frameworks such as SOFT2 [5], VINS-Mono [6], ORB-SLAM3 [7], and DM-VIO [8]. Most of these algorithms employ optimization-based methods, constructing epipolar constraints, bundle adjustment (BA), or photometric-error minimization over features in the environment. VINS-Fusion [9] leverages optical flow to track feature points in the front end and minimizes the reprojection error with BA in the back end to solve for the poses. ORB-SLAM2 [10] uses ORB feature points to improve tracking and incorporates a loop-closure thread to achieve higher accuracy in global pose estimation. Building upon ORB-SLAM2, ORB-SLAM3 integrates an inertial measurement unit (IMU) to enhance system robustness and stands as one of the most advanced VSLAM solutions to date.
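To make the back-end objective concrete, the following is a minimal sketch of reprojection-error-based pose estimation using OpenCV. The intrinsics, ground-truth pose, and synthetic points are placeholder values, not taken from any of the cited systems; full BA would jointly refine poses and map points, whereas this sketch solves a single pose.

```python
import cv2
import numpy as np

# Synthetic setup (all values hypothetical): a calibrated pinhole camera
# observing known 3D points, so pose recovery can be shown end to end.
K = np.array([[718.856, 0.0, 607.19],
              [0.0, 718.856, 185.22],
              [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
pts3d = rng.uniform([-2.0, -2.0, 4.0], [2.0, 2.0, 8.0], (120, 3))  # points in front of the camera
rvec_gt = np.array([0.02, -0.01, 0.005])  # "true" rotation (Rodrigues vector)
tvec_gt = np.array([0.10, 0.00, 0.30])    # "true" translation

pts2d, _ = cv2.projectPoints(pts3d, rvec_gt, tvec_gt, K, None)
pts2d = pts2d.reshape(-1, 2) + rng.normal(0.0, 0.5, (120, 2))  # pixel noise

# Solve for the pose that minimizes the reprojection error; RANSAC
# discards correspondences inconsistent with the recovered model.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d, pts2d, K, None, reprojectionError=2.0, confidence=0.99)

if ok:
    # Reproject with the estimated pose to measure the residual that
    # bundle adjustment would further minimize.
    reproj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)
    err = np.linalg.norm(reproj.reshape(-1, 2) - pts2d, axis=1)
    print(f"{len(inliers)} inliers, mean reprojection error {err.mean():.2f} px")
```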
The essence of SLAM pose estimation lies in the robot’s perception of its relative movement in the environment. In terms of the localization aspect of SLAM, the accuracy of pose estimation is greatly affected by the proportion of dynamic feature points being tracked in the field of view. When this proportion is relatively small, non-dynamic SLAM algorithms can utilize statistical methods like RANSAC [11] to identify and discard the few dynamic points as outliers. However, when dynamic objects occupy the majority of the field of view, few static feature points remain available for tracking; this is a significant challenge that must be addressed with dedicated dynamic SLAM algorithms. In such cases, the accuracy of SLAM pose estimation decreases significantly and tracking can even fail, especially for feature-based VSLAM approaches [5][6][7][8]. Consequently, these open-source algorithms often lose accuracy or fail outright when deployed in dynamic environments such as city streets or rural roads with numerous dynamic objects.
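The RANSAC-based outlier rejection mentioned above can be sketched with OpenCV’s fundamental-matrix estimator; the matched coordinates below are synthetic placeholders standing in for real feature tracks.

```python
import cv2
import numpy as np

# Hypothetical matched feature coordinates between consecutive frames
# (N x 2 pixel positions); in a real pipeline these would come from
# ORB matching or optical-flow tracking.
rng = np.random.default_rng(1)
pts_prev = rng.uniform(0, 640, (200, 2)).astype(np.float32)
pts_curr = pts_prev + rng.normal(0, 1.0, (200, 2)).astype(np.float32)

# Fit a fundamental matrix with RANSAC; the inlier mask separates points
# consistent with a single rigid camera motion (the static scene) from
# outliers (mismatches and points on dynamic objects).
F, mask = cv2.findFundamentalMat(pts_prev, pts_curr, cv2.FM_RANSAC,
                                 ransacReprojThreshold=1.0, confidence=0.99)
if F is not None:
    static_pts = pts_curr[mask.ravel() == 1]
    print(f"kept {len(static_pts)} of {len(pts_curr)} matches as static")
# Caveat: once dynamic points form the majority of the matches, the
# RANSAC consensus can lock onto a moving object instead of the world.
```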

2. Geometry-Based Dynamic SLAM

Geometry-based methods rely on geometric constraints between camera frames to eliminate outliers. Dynamic objects can be identified because they deviate from the geometric motion consistency observed between frames, and statistical analysis allows inliers (static points) to be differentiated from outliers (dynamic points). Most SLAM systems, like VINS-Mono, use RANSAC with epipolar constraints to remove outliers, calculating the fundamental matrix with the eight-point method. However, RANSAC becomes less effective when outliers dominate the dataset. DGS-SLAM [12] presents an RGB-D SLAM approach specifically designed for dynamic environments; it limits the impact of outliers during optimization by introducing new robust kernel functions. DynaVINS [13] introduces a novel loss function that incorporates IMU pre-integration results as priors in BA. In its loop-closure detection module, loops from different features are grouped for selective optimization. PFD-SLAM [14] utilizes GMS (grid-based motion statistics) [15] to ensure accurate matching with RANSAC. Subsequently, it calculates the homography transformation to extract the dynamic region, which is accurately determined using particle filtering. ClusterSLAM [16] clusters feature points based on motion consistency to reject dynamic objects. In general, geometry-based methods offer higher accuracy and lower computational costs than semantic-based methods, but they lack the semantic information required for precise segmentation. Moreover, geometry-based methods rely heavily on experience-based hyperparameters, which can significantly limit their feasibility; a per-point epipolar consistency test of this kind is sketched below.
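A minimal sketch of the per-point geometric consistency test underlying these methods: given a fundamental matrix F, a match is flagged as dynamic when the current point lies too far from the epipolar line induced by its previous position. The function names and the 1.5-pixel threshold are illustrative assumptions.

```python
import numpy as np

def epipolar_distance(F, pt_prev, pt_curr):
    """Distance (pixels) from pt_curr to the epipolar line of pt_prev.

    F is the 3x3 fundamental matrix mapping previous-frame points to
    epipolar lines in the current frame; pt_* are (x, y) pixel coords.
    """
    x1 = np.array([pt_prev[0], pt_prev[1], 1.0])
    x2 = np.array([pt_curr[0], pt_curr[1], 1.0])
    line = F @ x1                              # line a*x + b*y + c = 0
    return abs(x2 @ line) / np.hypot(line[0], line[1])

def is_dynamic(F, pt_prev, pt_curr, thresh_px=1.5):
    # The 1.5 px threshold is an illustrative, experience-based
    # hyperparameter of exactly the kind the text warns about.
    return epipolar_distance(F, pt_prev, pt_curr) > thresh_px
```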

3. Semantic-Based Dynamic SLAM

Deep-learning networks have achieved remarkable advancements in speed and accuracy across various computer vision tasks, including object detection, semantic segmentation, and optical flow. These networks provide object detection results, such as bounding boxes, which can be utilized in dynamic SLAM systems. To accurately detect dynamic objects, deep-learning-based methods often incorporate geometric information to capture the real motion state in the current frame. For example, DynaSLAM [17] is an early dynamic SLAM system that combines multi-view geometry with deep learning. It utilizes Mask R-CNN, which offers pixel-level semantic priors for potential dynamic objects in images. Dynamic-SLAM [18] detects dynamic objects using the SSD (single-shot multi-box detector) [19] object detection network and addresses missed detections by employing a constant-velocity motion model. Moreover, it sets a threshold on the average parallax of features within the bounding-box area to further reject dynamic features. However, this method’s reliance on bounding boxes may incorrectly reject static feature points belonging to the background (see the sketch after this paragraph). DS-SLAM [20] employs the SegNet network to eliminate dynamic object features, which are then tracked using the Lucas–Kanade (LK) optical flow algorithm [21]. The fundamental matrix is calculated using RANSAC, the distance between each matched point and its epipolar line is computed, and, if that distance exceeds a certain threshold, the point is considered dynamic and removed.
Additionally, depth information from RGB-D cameras is often employed for dynamic object detection. RS-SLAM [22] detects dynamic features with semantic segmentation, and a Bayesian update method based on previous segmentation results refines the current coarse segmentation. It also utilizes depth images to compute the Euclidean distance between two movable regions. Dynamic-VINS [23] proposes an RGB-D-based visual–inertial odometry approach specifically designed for embedded platforms. It reduces the computational burden by employing grid-based feature detection algorithms. Semantic labels and the depth information of dynamic features are combined to separate the foreground and background, and a moving consistency check based on IMU pre-integration is introduced to address missed detections. YOLO-SLAM [24] is an RGB-D SLAM system that obtains an object’s semantic labels using Darknet19-YOLOv3; its drawback is that it cannot run in real time. SG-SLAM [25] is a real-time RGB-D SLAM system that adds a dynamic object detection thread and a semantic mapping thread to ORB-SLAM2 for creating global static 3D reconstruction maps.
In general, geometry-based methods offer faster processing but lack semantic information. In contrast, deep-learning-based methods excel at dynamic object detection by identifying potential dynamic objects with semantic information. However, running deep-learning algorithms in real time on embedded platforms is challenging, and their accuracy relies heavily on training results. Moreover, most of these methods use RGB-D cameras, which tightly couple geometric and depth information, making them more suitable for indoor environments; few algorithms are specifically designed for outdoor dynamic scenes.
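As an illustration of the bounding-box masking strategy discussed above (and its drawback), the following sketch drops all features inside detected boxes of potentially dynamic classes. The class list and the (class, x_min, y_min, x_max, y_max) box format are assumptions for illustration, not the API of any specific detector or SLAM system.

```python
import numpy as np

# Classes assumed to be potentially dynamic (illustrative choice).
DYNAMIC_CLASSES = {"person", "car", "bicycle"}

def filter_dynamic_features(features, detections):
    """Drop features inside bounding boxes of potentially dynamic objects.

    features:   (N, 2) array of pixel coordinates.
    detections: iterable of (class_name, x_min, y_min, x_max, y_max).
    """
    keep = np.ones(len(features), dtype=bool)
    for cls, x0, y0, x1, y1 in detections:
        if cls not in DYNAMIC_CLASSES:
            continue
        inside = ((features[:, 0] >= x0) & (features[:, 0] <= x1) &
                  (features[:, 1] >= y0) & (features[:, 1] <= y1))
        # Rejecting everything in the box also removes static background
        # points that happen to fall inside it -- the drawback noted above.
        keep &= ~inside
    return features[keep]
```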