Environment Perception System of Quadruped Robots

Owing to their high stability and adaptability, quadruped robots are currently a major topic of discussion in the robotics field. To cope with complicated indoor and outdoor environments, a quadruped robot should be equipped with an environment perception system, which typically contains a LiDAR or a vision sensor and runs a SLAM (Simultaneous Localization and Mapping) algorithm.

  • quadruped robot
  • simultaneous localization and mapping
  • deep learning

1. Introduction

According to their type of motion, mobile robots can be classified into three categories: wheeled, crawler, and legged [1]. Wheeled robots are suitable for simple terrain, while crawler robots can move on complex terrain but with poor movement flexibility. Compared to the former two, legged robots only require discrete footholds rather than a continuous path when planning their motion, allowing them to adapt to more complex terrain [2]. Legged robots can be further divided into monopods [3], bipeds [4], quadrupeds [5], hexapods [6], etc. Among them, quadruped robots offer both high stability and adaptability, allowing them to navigate more complex terrain than biped robots without the mechanical complexity of hexapod robots. As a result, they have become a research hotspot in the field of robotics. Within quadruped robot research, improving adaptability to the external environment, specifically the ability to autonomously perceive and interact with it, is a highly active topic. An autonomous legged robot requires an accurate simultaneous localization and mapping (SLAM) algorithm that runs in real time without human intervention [7].
Most outdoor navigation systems, such as those on surface ships, use a Global Navigation Satellite System (GNSS) [8], such as the Global Positioning System (GPS), to measure their position. Xia X et al. proposed an autonomous-vehicle sideslip angle estimation algorithm based on consensus and vehicle kinematics/dynamics synthesis: based on the velocity error measurements between a reduced Inertial Navigation System (R-INS) and the GNSS, a velocity-based Kalman filter is formulated to estimate the velocity errors, attitude errors, and gyro bias errors of the R-INS [9]. Gao L et al. proposed a vehicle localization system based on vehicle chassis sensors that accounts for lateral velocity, improving the accuracy of stand-alone vehicle localization under highly dynamic driving conditions during GNSS outages [10]. However, GNSS signals are weak and vulnerable to intentional or unintentional interference. To address these problems, SLAM has emerged as a research hotspot in the field of autonomous robot navigation. The two mainstream SLAM technologies are laser SLAM and visual SLAM, based on LiDAR sensors and visual sensors, respectively. Each sensor has its advantages and disadvantages. Visual sensors can obtain relatively accurate detection results at close range, but their detection distance is limited and they are more sensitive to the external environment; they are usually used for semantic interpretation of the scene but perform poorly under harsh lighting conditions. LiDAR sensors, on the other hand, can detect at longer distances and have stronger anti-jamming capabilities, making them important for obstacle detection and tracking, but they perform poorly in detecting color, texture, and visual features. Therefore, fusing LiDAR and visual information can overcome their respective drawbacks and improve the stability and accuracy of detection [11].
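As an illustration of the velocity-based Kalman filtering idea mentioned above, the following is a minimal sketch of a linear Kalman filter that estimates a slowly varying INS velocity error from noisy GNSS/INS velocity differences. It is not the algorithm of [9] or [10]; the state layout, noise values, and synthetic measurements are assumptions made for illustration.

```python
# Minimal, illustrative Kalman filter for estimating an INS velocity error
# from GNSS velocity measurements. Not the algorithm from [9]/[10]; the
# state layout, noise values, and data below are assumptions.
import numpy as np

dt = 0.1                      # update period [s]
F = np.array([[1.0, dt],      # state: [velocity error, error drift rate]
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])    # only the velocity error is observed
Q = np.diag([1e-4, 1e-5])     # process noise (assumed)
R = np.array([[0.25]])        # GNSS velocity measurement noise (assumed)

x = np.zeros((2, 1))          # initial state estimate
P = np.eye(2)                 # initial covariance

def kf_step(x, P, z):
    """One predict/update cycle with measurement z = v_INS - v_GNSS."""
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                         # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Example: feed synthetic velocity-difference measurements.
rng = np.random.default_rng(0)
for _ in range(100):
    z = np.array([[0.5 + rng.normal(scale=0.5)]])  # true error ~0.5 m/s
    x, P = kf_step(x, P, z)
print("estimated velocity error [m/s]:", x[0, 0])
```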
In addition, machine learning and deep learning techniques are widely used for more complex object detection and scene perception tasks, including image classification and object detection. Commonly used algorithms include Convolutional Neural Networks (CNNs) and the YOLO (You Only Look Once) family of networks. Liang Y et al. presented a novel lightweight convolutional module (LCM), namely the convolutional layers module (CEModule). CEModule increases the number of key features to maintain a high level of classification accuracy while employing a group convolution strategy to reduce the floating-point operations (FLOPs) incurred during training [12]. Zhou P et al. proposed a lightweight video object detection method based on spatial-temporal correlation, an efficient deep learning model for unmanned aerial vehicles (UAVs) designed to fit the restrictions of low computational power and low power consumption [13].
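To make the FLOP/parameter-reduction idea behind group convolution concrete, the sketch below compares a standard convolution with a grouped convolution in PyTorch. It is a generic illustration rather than the CEModule of [12]; the channel counts, group number, and feature-map size are arbitrary assumptions.

```python
# Illustration of how group convolution reduces parameters/FLOPs compared
# with a standard convolution. Generic sketch, not the CEModule of [12].
import torch
import torch.nn as nn

in_ch, out_ch, k = 64, 64, 3

standard = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1, groups=1)
grouped  = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1, groups=8)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, in_ch, 56, 56)          # dummy feature map (assumed size)
print("standard conv params:", n_params(standard))   # 36928
print("grouped  conv params:", n_params(grouped))    # 4672 (8x fewer weights)
print("output shapes match:", standard(x).shape == grouped(x).shape)
```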

2. Single Sensor Detection

The current research on perceiving the external environment with a single sensor is relatively mature. Manuel et al. proposed an algorithm that performs autonomous 3D reconstruction of an environment using a single 2D LiDAR sensor and implemented it on a mobile platform using the Robot Operating System (ROS) [14]. Woo et al. proposed a Ceiling Vision-based Simultaneous Localization and Mapping (CV-SLAM) technique using a single ceiling vision sensor [15]. They addressed the rotation and affine-transform problems of ceiling vision by using a 3D gradient orientation estimation method and a multi-view description of landmarks, and on that basis reconstructed a 3D landmark map in real time within an Extended Kalman filter-based SLAM framework. Davison et al. presented the MonoSLAM algorithm, which can recover the 3D trajectory of a monocular camera [16]. The core of that work is the online creation of a sparse but persistent map of natural landmarks within a probabilistic framework; it also extended the range of applicable robotic systems to humanoid robots and to augmented reality with a hand-held camera. Dominik Belter applied a simultaneous localization and mapping algorithm to localize a hexapod robot using data from a compact RGB-D sensor. The approach combines fast visual odometry for tracking the sensor motion with the matching of visual features across scans. Experiments showed that these visual features can be used to accurately estimate the robot's trajectory on a wide range of datasets [17].
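The probabilistic landmark-map idea behind EKF-based SLAM approaches such as [15][16] can be illustrated with a minimal two-dimensional sketch: a joint state holds the robot pose and one landmark, a motion model propagates the robot part, and a range-bearing observation updates both. The state layout, noise matrices, and measurement below are illustrative assumptions, not values from the cited works.

```python
# Minimal 2D EKF-SLAM sketch with a single landmark, illustrating the
# "sparse persistent landmark map in a probabilistic framework" idea.
# State layout, noise levels, and data are assumptions for illustration.
import numpy as np

# Joint state: [x_robot, y_robot, heading, x_landmark, y_landmark]
x = np.array([0.0, 0.0, 0.0, 5.0, 2.0])
P = np.diag([0.01, 0.01, 0.01, 1.0, 1.0])   # landmark initially uncertain
Q = np.diag([0.02, 0.02, 0.01, 0.0, 0.0])   # motion noise (robot states only)
R = np.diag([0.1, 0.05])                    # range/bearing measurement noise

def predict(x, P, v, w, dt=0.1):
    """Propagate the robot part of the state with a unicycle motion model."""
    th = x[2]
    x = x.copy()
    x[0] += v * dt * np.cos(th)
    x[1] += v * dt * np.sin(th)
    x[2] += w * dt
    F = np.eye(5)
    F[0, 2] = -v * dt * np.sin(th)
    F[1, 2] =  v * dt * np.cos(th)
    return x, F @ P @ F.T + Q

def update(x, P, z):
    """EKF update from a range-bearing observation of the landmark."""
    dx, dy = x[3] - x[0], x[4] - x[1]
    r = np.hypot(dx, dy)
    zhat = np.array([r, np.arctan2(dy, dx) - x[2]])
    H = np.array([
        [-dx / r,     -dy / r,      0.0,  dx / r,     dy / r],
        [ dy / r**2,  -dx / r**2,  -1.0, -dy / r**2,  dx / r**2],
    ])
    y = z - zhat
    y[1] = (y[1] + np.pi) % (2 * np.pi) - np.pi    # wrap bearing residual
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(5) - K @ H) @ P

# One illustrative predict/update cycle with a synthetic measurement.
x, P = predict(x, P, v=1.0, w=0.1)
x, P = update(x, P, z=np.array([5.3, 0.38]))
print("robot pose:", x[:3], "landmark:", x[3:])
```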

3. Multi-Sensor Fusion

Multi-sensor fusion is an effective way to improve a robot’s ability to perceive the external environment [18]. For example, one common approach is to combine cameras and LiDARs. Cameras can capture rich external environment information at a high frame rate and high resolution, but they are easily affected by lighting conditions. LiDAR, on the other hand, is less affected by light and can provide more accurate position and depth information, but it cannot capture visual appearance. By fusing the data from these two sensors, the robustness of perception can be greatly improved [19]. Joel et al. fused LiDAR and color imagery for pedestrian detection using CNNs [20]. They incorporated LiDAR by up-sampling the point cloud into a dense depth map and extracting three features representing horizontal disparity, height above ground, and angle (HHA). These features were then used as extra image channels and fed into CNNs to learn a deep hierarchy of feature representations. Mohamed Dhouioui proposed an embedded system based on two types of data, radar signals and camera images, aiming to identify and classify obstacles on the road; machine learning methods and signal processing techniques were used to optimize overall computational performance and efficiency [21]. Elena et al. incorporated vision and laser fusion for the simultaneous localization and mapping of Micro Air Vehicles (MAVs) in indoor rescue and/or identification navigation missions. The technique fused laser and visual information, as well as measurements from inertial components, to obtain a reliable 6-DoF pose estimate of the MAV within a local map. Experimental results showed that sensor fusion improves position estimation under different test conditions and yields accurate maps [22]. When considering robotic applications in complex scenarios, traditional geometric maps are insufficient because they do not support interaction with the environment. Based on this, Jing Li et al. proposed building a large-scale, precise three-dimensional (3D) semantic map by integrating LiDAR and camera information to more accurately represent real-time road scenes [23]. First, they performed SLAM through multi-sensor fusion of LiDAR and inertial measurement unit (IMU) data to locate the robot and build a map of the surrounding scene while the robot moves; they then employed CNN-based image semantic segmentation to develop a semantic map of the environment. To address the incompleteness of environmental perception when using only a 2D LiDAR, they calibrated the point cloud information from the RGB-D camera Kinect v2 and the 2D LiDAR using intrinsic and extrinsic parameters based on the Cartographer algorithm [24]. Precise calibration of the rigid-body transform between the sensors is crucial for correct data fusion. To simplify the calibration process, Michelle et al. presented the first framework that uses CNNs for odometry estimation by fusing data from 2D laser scanners and monocular cameras without requiring sensor calibration [25]. Mary et al. presented a fusion of a six-degrees-of-freedom (6-DoF) inertial sensor and monocular vision [26]. They integrated a monocular vision-based object detection algorithm using the Speeded-Up Robust Feature (SURF) and Random Sample Consensus (RANSAC) algorithms to improve detection accuracy, and by fusing data from the inertial sensor and a camera with an Extended Kalman Filter (EKF), they estimated the position and orientation of the mobile robot. Xia X et al. proposed an automated driving systems data acquisition and analytics platform. It provides a holistic pipeline from raw advanced sensor data collection to data processing, and is capable of processing sensor data from multiple connected automated vehicles (CAVs) and extracting each object’s identity (ID) number, position, speed, and orientation in map and Frenet coordinates [27]. Liu W et al. proposed a novel kinematic-model-based vehicle slip angle (VSA) estimation method that fuses information from a GNSS and an IMU [28]. Xia X et al. proposed a method that fuses an IMU with automotive onboard sensors to autonomously estimate the yaw misalignment [29].
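A common first step in the LiDAR-camera fusion described above (e.g., before up-sampling the point cloud into dense depth channels as in [20]) is projecting LiDAR points into the image plane using the calibrated intrinsic and extrinsic parameters. The sketch below shows this projection with placeholder calibration values and a synthetic point cloud; it is a generic illustration, not the cited authors' implementation.

```python
# Illustrative projection of LiDAR points into a camera image to build a
# sparse depth map, a typical first step in LiDAR-camera fusion (cf. [20]).
# The intrinsics, extrinsics, and point cloud below are placeholder values.
import numpy as np

K = np.array([[600.0,   0.0, 320.0],     # assumed pinhole intrinsics
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                            # assumed LiDAR->camera rotation
t = np.array([0.1, -0.05, 0.0])          # assumed LiDAR->camera translation [m]
H, W = 480, 640                          # image size

points = np.random.uniform([-5, -2, 2], [5, 2, 30], size=(1000, 3))  # synthetic cloud

cam = points @ R.T + t                   # transform into the camera frame
cam = cam[cam[:, 2] > 0]                 # keep points in front of the camera
uvw = cam @ K.T                          # perspective projection
u = (uvw[:, 0] / uvw[:, 2]).astype(int)
v = (uvw[:, 1] / uvw[:, 2]).astype(int)
ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)

depth = np.zeros((H, W))                 # sparse depth image (0 = no return)
depth[v[ok], u[ok]] = cam[ok, 2]         # later points overwrite earlier ones
print("pixels with depth:", int((depth > 0).sum()))
```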

4. Deep Learning Method

In assisted driving systems, a model is required that can accurately identify partially occluded targets against complex backgrounds and that supports short-term tracking and early warning of fully occluded targets. Based on this, Kun Wang et al. proposed a method based on YOLOv3 [30] that improves detection accuracy while supporting real-time operation and realizes real-time alarms for completely occluded targets. They first obtained a more appropriate prior (anchor) box setting through categorical K-means clustering, and then used DIoU-NMS instead of traditional non-maximum suppression (NMS). Additionally, to improve the system’s ability to identify occluded targets, they proposed a tracking algorithm based on a Kalman filter and Hungarian matching. Qiu et al. proposed an Adaptive Spatial Feature Fusion (ASFF) YOLOv5 network (ASFF-YOLOv5) to improve the accuracy of recognizing and detecting multiple multiscale road traffic elements [31]. Their first step was to use the K-means algorithm to cluster the size ranges of multiscale road traffic elements; they then employed a spatial pyramid pooling fast (SPPF) structure to enhance the accuracy of information extraction. To address the problems of object detection in drone-captured scenarios caused by varying altitudes and high drone speeds, Zhu et al. proposed TPH-YOLOv5 to handle different object scales and motion blur [32]. Building on YOLOv5, they added an additional prediction head to detect objects of different scales, replaced the original prediction heads with Transformer Prediction Heads (TPH), and integrated the Convolutional Block Attention Module (CBAM) to identify attention regions in scenarios with dense objects. Experiments on the VisDrone2021 dataset demonstrated that TPH-YOLOv5 performs well, with impressive interpretability, in drone-captured scenarios. Liu W et al. proposed an algorithm referred to as YOLOv5-tassel to detect tassels in UAV-based RGB imagery [33].
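Both [30] and [31] tune prior (anchor) boxes by clustering the bounding-box sizes of the training data with K-means. The sketch below shows a generic version of that step using a 1 - IoU distance; the box data, number of clusters, and iteration count are illustrative assumptions, not the cited authors' settings.

```python
# Generic K-means clustering of bounding-box (width, height) pairs with an
# IoU-based distance, as commonly used to choose YOLO anchor boxes.
# The box data, k, and iteration count below are illustrative assumptions.
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors, assuming shared top-left corners."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    area_b = boxes[:, 0] * boxes[:, 1]
    area_a = anchors[:, 0] * anchors[:, 1]
    return inter / (area_b[:, None] + area_a[None, :] - inter)

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # nearest = highest IoU
        for j in range(k):
            members = boxes[assign == j]
            if len(members):
                anchors[j] = np.median(members, axis=0)      # robust cluster center
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sort by area

# Synthetic (width, height) pairs standing in for dataset annotations.
boxes = np.random.default_rng(1).uniform(10, 300, size=(500, 2))
print(kmeans_anchors(boxes, k=9))
```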

This entry is adapted from the peer-reviewed paper 10.3390/drones7050329
