Geometry and Semantic-Based Dynamic SLAM: History

Simultaneous localization and mapping (SLAM) is a crucial technology for advanced robotics applications, enabling collision-free navigation and environment exploration. In terms of the localization aspect of SLAM, the accuracy of pose estimation is greatly affected by the proportion of dynamic feature points being tracked in the field of view. 

  • dynamic environments
  • object detection
  • geometric constraint

1. Introduction

Simultaneous localization and mapping (SLAM) [1] is a crucial technology for advanced robotics applications, enabling collision-free navigation and environment exploration [2]. SLAM relies on the sensors carried by a robot to achieve high-precision localization and environment mapping simultaneously. Visual SLAM (VSLAM) [3][4] uses cameras to estimate the robot’s pose, offering advantages such as low cost, low energy consumption, and modest computational requirements. Over the last decade, the VSLAM framework has developed rapidly, with notable frameworks such as SOFT2 [5], VINS-Mono [6], ORB-SLAM3 [7], and DM-VIO [8]. Most of these algorithms employ optimization-based methods that build epipolar constraints, bundle adjustment (BA), or photometric error minimization over features in the environment. VINS-Fusion [9] uses optical flow to track feature points in the front end and minimizes the reprojection error with BA in the back end. ORB-SLAM2 [10] uses ORB feature points to improve tracking and adds a loop-closure thread to achieve more accurate global pose estimation. Building upon ORB-SLAM2, ORB-SLAM3 integrates an IMU to enhance system robustness and remains one of the most advanced VSLAM solutions to date.
The essence of SLAM pose estimation lies in the robot’s perception of its relative motion in the environment. The accuracy of pose estimation is strongly affected by the proportion of dynamic feature points being tracked in the field of view. When this proportion is small, non-dynamic SLAM algorithms can use statistical methods such as RANSAC [11] to identify and discard the few dynamic points as outliers. However, when dynamic objects occupy the majority of the field of view, few static feature points remain available for tracking, and the accuracy of pose estimation degrades significantly or the system fails outright, especially for feature-based VSLAM approaches [5][6][7][8]. This is the challenge that dynamic SLAM algorithms are designed to address. Consequently, the open-source algorithms above often lose accuracy or even fail when deployed in dynamic environments such as city streets or rural roads with many moving objects.

2. Geometry-Based Dynamic SLAM

Geometry-based methods rely on geometric constraints between camera frames to eliminate outliers: dynamic objects can be identified because they deviate from the motion consistency observed between frames, and statistical analysis separates inliers (static points) from outliers (dynamic points). Most SLAM systems, such as VINS-Mono, use RANSAC with epipolar constraints to remove outliers, computing the fundamental matrix with the eight-point method. However, RANSAC becomes less effective when outliers dominate the data.
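This RANSAC-plus-epipolar-constraint scheme can be sketched in a few dozen lines of numpy. The sketch below is illustrative, not any cited system's implementation: the function names, the 1-pixel inlier threshold, and the iteration count are assumptions, and the Hartley normalization and refinement steps used by production systems are omitted for brevity.

```python
import numpy as np

def eight_point(x1, x2):
    """Estimate the fundamental matrix F (x2^T F x1 = 0) from >= 8 matches.
    x1, x2: (N, 2) pixel coordinates. Plain (unnormalized) eight-point for
    brevity; real systems add Hartley normalization for numerical conditioning."""
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    A = np.column_stack([u2*u1, u2*v1, u2, v2*u1, v2*v1, v2,
                         u1, v1, np.ones(len(x1))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)         # null vector of A, reshaped to 3x3
    U, S, Vt = np.linalg.svd(F)      # enforce the rank-2 constraint
    S[2] = 0.0
    return U @ np.diag(S) @ Vt

def epipolar_distance(F, x1, x2):
    """Distance (pixels) of each point in x2 from its epipolar line F @ x1."""
    x1h = np.column_stack([x1, np.ones(len(x1))])
    x2h = np.column_stack([x2, np.ones(len(x2))])
    lines = x1h @ F.T                            # epipolar lines in image 2
    num = np.abs(np.sum(x2h * lines, axis=1))
    return num / np.hypot(lines[:, 0], lines[:, 1])

def ransac_fundamental(x1, x2, iters=300, thresh=1.0, seed=0):
    """Sample 8 matches, fit F, keep the largest inlier set, then refit.
    Points flagged as outliers are candidates for dynamic features."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(x1), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(x1), 8, replace=False)
        F = eight_point(x1[idx], x2[idx])
        inliers = epipolar_distance(F, x1, x2) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return eight_point(x1[best], x2[best]), best
```

Note the failure mode discussed above: the loop keeps the largest consensus set, so once dynamic points outnumber static ones, the "consensus" can lock onto the moving object instead of the static background.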
DGS-SLAM [12] presents an RGB-D SLAM approach specifically designed for dynamic environments. It addresses outlier impacts during optimization by introducing new robust kernel functions.
DynaVINS [13] introduces a novel loss function that incorporates IMU pre-integration results as priors in BA. In the loop closure detection module, loops from different features are grouped for selective optimization.
PFD-SLAM [14] uses GMS (grid-based motion statistics) [15] to obtain accurate matches, then applies RANSAC to compute the homography transformation and extract the dynamic region, which is refined using particle filtering.
ClusterSLAM [16] clusters feature points based on motion consistency to reject dynamic objects.
In general, geometry-based methods offer higher accuracy and lower computational costs compared to semantic-based methods. However, they lack the semantic information required for precise segmentation. Moreover, geometry-based methods heavily rely on experience-based hyperparameters, which can significantly limit algorithm feasibility.

3. Semantic-Based Dynamic SLAM

Deep-learning networks have achieved remarkable advancements in speed and accuracy in various computer vision tasks, including object detection, semantic segmentation, and optical flow. These networks can provide object detection results, such as bounding boxes, which can be utilized in dynamic SLAM systems. To accurately detect dynamic objects, deep-learning-based methods often incorporate geometric information to capture the real motion state in the current frame.
For example, DynaSLAM [17] is an early dynamic SLAM system that combines multi-view geometry with deep learning. It uses Mask R-CNN, which provides pixel-level semantic priors for potentially dynamic objects in images.
Dynamic-SLAM [18] detects dynamic objects using the SSD (single shot multi-box detector) [19] object detection network and addresses missed detections by employing a constant velocity motion model. Moreover, it sets a threshold for the average parallax of features within the bounding box area to further reject dynamic features. However, this method’s reliance on bounding boxes may incorrectly reject static feature points belonging to the background.
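The bounding-box parallax test, and its drawback, can be sketched as follows. This is a simplified illustration, not Dynamic-SLAM's actual code: the helper name, the box format, and the 3-pixel default threshold are assumptions, and a real system would also compensate for the parallax induced by camera motion itself.

```python
import numpy as np

def dynamic_box_mask(pts_prev, pts_cur, box, parallax_thresh=3.0):
    """Flag every feature inside a detection box as dynamic when the mean
    frame-to-frame displacement (parallax) of the box's features exceeds
    the threshold.
    pts_prev, pts_cur: (N, 2) matched feature positions in two frames.
    box: (x0, y0, x1, y1) bounding box in the current frame."""
    x0, y0, x1, y1 = box
    in_box = ((pts_cur[:, 0] >= x0) & (pts_cur[:, 0] <= x1) &
              (pts_cur[:, 1] >= y0) & (pts_cur[:, 1] <= y1))
    if not in_box.any():
        return np.zeros(len(pts_cur), dtype=bool)
    parallax = np.linalg.norm(pts_cur - pts_prev, axis=1)
    if parallax[in_box].mean() > parallax_thresh:
        # Every feature in the box is rejected, including static background
        # points that merely fall inside it -- the drawback noted above.
        return in_box
    return np.zeros(len(pts_cur), dtype=bool)
```

The returned mask makes the limitation concrete: rejection operates on the whole box, so background features inside the box are discarded along with the object's features.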
DS-SLAM [20] tracks features with the Lucas–Kanade (LK) optical flow algorithm [21] and employs the SegNet network to eliminate features on dynamic objects. The fundamental matrix is computed with RANSAC; the distance between each matched point and its epipolar line is then computed, and, if this distance exceeds a threshold, the point is considered dynamic and removed. Additionally, depth information from RGB-D cameras is often employed for dynamic object detection.
RS-SLAM [22] detects dynamic features with semantic segmentation, and a Bayesian update method based on the previous segmentation results is used to refine the current coarse segmentation. It also uses depth images to compute the Euclidean distance between two movable regions.
Dynamic-VINS [23] proposes an RGB-D-based visual–inertial odometry approach specifically designed for embedded platforms. It reduces the computational burden by employing grid-based feature detection algorithms. Semantic labels and the depth information of dynamic features are combined to separate the foreground and background. A moving consistency check based on IMU pre-integration is introduced to address missed detection issues.
YOLO-SLAM [24] is an RGB-D SLAM system that obtains an object’s semantic labels using Darknet19-YOLOv3. Its drawback is that it cannot run in real time.
SG-SLAM [25] is a real-time RGB-D SLAM system that adds a dynamic object detection thread and semantic mapping thread based on ORB-SLAM2 for creating global static 3D reconstruction maps.
In general, geometry-based methods offer faster processing but lack semantic information. In contrast, deep-learning-based methods excel at dynamic object detection because they identify potentially dynamic objects with semantic information. However, running deep-learning algorithms in real time on embedded platforms remains challenging, and their accuracy heavily depends on the training results. Moreover, most of these methods use RGB-D cameras, tightly coupling geometric and depth information, which makes them more suitable for indoor environments; few algorithms are specifically designed for outdoor dynamic scenes.

This entry is adapted from the peer-reviewed paper 10.3390/rs15153881


  1. Kazerouni, I.A.; Fitzgerald, L.; Dooly, G.; Toal, D. A survey of state-of-the-art on visual SLAM. Expert Syst. Appl. 2022, 205, 117734.
  2. Covolan, J.P.M.; Sementille, A.C.; Sanches, S.R.R. A Mapping of Visual SLAM Algorithms and Their Applications in Augmented Reality. In Proceedings of the 2020 22nd Symposium on Virtual and Augmented Reality (SVR), Porto de Galinhas, Brazil, 7–10 November 2020; pp. 20–29.
  3. Tourani, A.; Bavle, H.; Sanchez-Lopez, J.L.; Voos, H. Visual SLAM: What Are the Current Trends and What to Expect? Sensors 2022, 22, 9297.
  4. Chen, C.; Zhu, H.; Li, M.; You, S. A Review of Visual-Inertial Simultaneous Localization and Mapping from Filtering-Based and Optimization-Based Perspectives. Robotics 2018, 7, 45.
  5. Cvisic, I.; Markovic, I.; Petrovic, I. SOFT2: Stereo Visual Odometry for Road Vehicles Based on a Point-to-Epipolar-Line Metric. IEEE Trans. Robot. 2022, 39, 273–288.
  6. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020.
  7. Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890.
  8. von Stumberg, L.; Cremers, D. DM-VIO: Delayed Marginalization Visual-Inertial Odometry. IEEE Robot. Autom. Lett. 2022, 7, 1408–1415.
  9. Qin, T.; Cao, S.; Pan, J.; Shen, S. A General Optimization-Based Framework for Global Pose Estimation with Multiple Sensors. arXiv 2019, arXiv:1901.03642.
  10. Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
  11. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. In Readings in Computer Vision; Fischler, M.A., Firschein, O., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 1987; pp. 726–740. ISBN 978-0-08-051581-6.
  12. Yan, L.; Hu, X.; Zhao, L.; Chen, Y.; Wei, P.; Xie, H. DGS-SLAM: A Fast and Robust RGBD SLAM in Dynamic Environments Combined by Geometric and Semantic Information. Remote Sens. 2022, 14, 795.
  13. Song, S.; Lim, H.; Lee, A.J.; Myung, H. DynaVINS: A Visual-Inertial SLAM for Dynamic Environments. IEEE Robot. Autom. Lett. 2022, 7, 11523–11530.
  14. Zhang, C.; Zhang, R.; Jin, S.; Yi, X. PFD-SLAM: A New RGB-D SLAM for Dynamic Indoor Environments Based on Non-Prior Semantic Segmentation. Remote Sens. 2022, 14, 2445.
  15. Bian, J.; Lin, W.-Y.; Matsushita, Y.; Yeung, S.-K.; Nguyen, T.-D.; Cheng, M.-M. GMS: Grid-Based Motion Statistics for Fast, Ultra-Robust Feature Correspondence. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2828–2837.
  16. Huang, J.; Yang, S.; Zhao, Z.; Lai, Y.-K.; Hu, S. ClusterSLAM: A SLAM Backend for Simultaneous Rigid Body Clustering and Motion Estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5874–5883.
  17. Bescos, B.; Facil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083.
  18. Xiao, L.; Wang, J.; Qiu, X.; Rong, Z.; Zou, X. Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robot. Auton. Syst. 2019, 117, 1–16.
  19. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
  20. Yu, C.; Liu, Z.; Liu, X.-J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174.
  21. Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the DARPA Image Understanding Workshop, Washington, DC, USA, 21–23 April 1981; pp. 674–679.
  22. Ran, T.; Yuan, L.; Zhang, J.; Tang, D.; He, L. RS-SLAM: A Robust Semantic SLAM in Dynamic Environments Based on RGB-D Sensor. IEEE Sens. J. 2021, 21, 20657–20664.
  23. Liu, J.; Li, X.; Liu, Y.; Chen, H. Dynamic-VINS: RGB-D Inertial Odometry for a Resource-Restricted Robot in Dynamic Environments. IEEE Robot. Autom. Lett. 2022, 7, 9573–9580.
  24. Wu, W.; Guo, L.; Gao, H.; You, Z.; Liu, Y.; Chen, Z. YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint. Neural Comput. Appl. 2022, 34, 6011–6026.
  25. Cheng, S.; Sun, C.; Zhang, S.; Zhang, D. SG-SLAM: A Real-Time RGB-D Visual SLAM Toward Dynamic Scenes With Semantic and Geometric Information. IEEE Trans. Instrum. Meas. 2023, 72, 7501012.