Stereo cameras allow mobile robots to perceive depth in their surroundings by capturing two images from slightly different viewpoints. This capability is essential for tasks such as obstacle avoidance, navigation, and spatial mapping.
1. Introduction
Mobile robots have surged in popularity and find versatile applications in numerous fields [1]. One compelling use case is their deployment in hazardous settings, such as automated agriculture and the handling of dangerous materials, where they can replace human workers [2]. To perform well in such settings, however, mobile robots must quickly and accurately estimate the geometric attributes of their surroundings, specifically depth information. Depth estimation plays a pivotal role in enabling mobile robots to excel at various tasks: it allows them to detect obstacles [3], construct detailed environmental maps [4], and recognize objects [5]. One potential solution for depth estimation is stereo matching [6]. Stereo matching is a computer vision technique that mimics human binocular vision by analyzing a pair of 2D images captured from slightly different viewpoints to reconstruct the 3D scene. Its primary objective is to establish correspondences between pixels in the two input images and then compute a depth value for each pixel. This is done by identifying the disparity, i.e., the horizontal displacement between corresponding pixels in the two images [7]. Accurate calculation of this disparity is instrumental in recovering depth, thereby enabling mobile robots to navigate, interact with, and operate effectively in their surroundings.
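The disparity-to-depth relation is worth making explicit. For a rectified stereo pair with focal length f (in pixels) and baseline B (the distance between the two camera centers), a pixel with disparity d lies at depth

$$ Z = \frac{f \cdot B}{d} $$

For example, with f = 700 px, B = 0.12 m, and d = 28 px, the point lies at Z = 700 × 0.12 / 28 = 3 m. Because depth is inversely proportional to disparity, a small disparity error on a distant object produces a large depth error, which is why sub-pixel disparity accuracy matters for mobile robots.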
To determine the disparity accurately, recent studies have applied deep learning methods and achieved promising results [8]. Typically, these works first use a convolutional neural network (CNN) to extract features from the two 2D images, then concatenate the left features with the right features shifted across a range of candidate disparities to construct a 4D cost volume (height × width × disparity × features). The 4D cost volume is then fed into a 3D CNN for regularization into a 3D cost volume (height × width × disparity). Finally, the predicted disparity is regressed from the cost volume via a softmax operation [9].
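The following PyTorch-style sketch illustrates the two ends of this pipeline: building a concatenation-based 4D cost volume from left/right feature maps, and regressing disparity from a regularized 3D cost volume with a softmax-weighted sum (the soft argmin used by GC-Net [9]). Function and variable names are illustrative, not taken from any of the cited implementations.

```python
import torch
import torch.nn.functional as F

def build_concat_cost_volume(left_feat, right_feat, max_disp):
    """Concatenation-based 4D cost volume of shape (B, 2C, D, H, W).

    left_feat, right_feat: feature maps of shape (B, C, H, W).
    max_disp: number of disparity hypotheses D at feature resolution.
    """
    b, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :c, d] = left_feat
            volume[:, c:, d] = right_feat
        else:
            # Pair each left pixel with the right pixel d columns to its left.
            volume[:, :c, d, :, d:] = left_feat[:, :, :, d:]
            volume[:, c:, d, :, d:] = right_feat[:, :, :, :-d]
    return volume

def soft_argmin(cost, max_disp):
    """Disparity regression from a regularized 3D cost volume (B, D, H, W)."""
    prob = F.softmax(-cost, dim=1)  # low cost -> high probability
    disp_values = torch.arange(max_disp, dtype=prob.dtype, device=prob.device)
    disp_values = disp_values.view(1, max_disp, 1, 1)
    # Expected disparity under the softmax distribution: fully differentiable
    # and naturally sub-pixel, unlike a hard argmin over disparity bins.
    return torch.sum(prob * disp_values, dim=1)  # (B, H, W)
```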
For example, GC-Net [9] learns the context of the cost volume through an encoder–decoder 3D CNN architecture. PSMNet [10] utilizes a feature extractor with a spatial pyramid pooling module and regularizes the cost volume using a 3D CNN based on a stacked hourglass architecture. GA-Net [11] incorporates a 3D CNN with semi-global matching for cost filtering. These approaches have demonstrated cutting-edge performance in stereo matching. Despite their high accuracy, however, the computational cost of these methods is a critical challenge when applying them to mobile robots, which often have limited computational power.
As reported in [12], PSMNet [10] runs at only approximately 0.16 frames per second (fps) on an NVIDIA Jetson TX2 module. Similarly, although it was proposed specifically for mobile robots, StereoNet [13] delivers fewer than 2 fps on the same device. Such performance falls far short of the requirement for real-time applications on mobile robots, which is typically a minimum of 30 fps [14].
Recently, the authors of [12] proposed attention-aware feature aggregation (AAFS) to obtain a better trade-off between computation time and accuracy for real-time stereo matching on edge devices. They reported that AAFS runs at up to 33 fps on low-budget devices such as the NVIDIA Jetson TX2. However, the accuracy of AAFS remains limited because it cannot efficiently exploit the contextual information of stereo images: to limit the computational cost, AAFS avoids increasing the number of feature maps in its cascaded 3D CNN. In this case, leveraging the idea of a deep convolutional encoder–decoder, which is designed for dense prediction tasks, is a potential solution. Deep encoder–decoder architectures can reduce the computational cost by compressing the input data and then decoding the compressed representation back to the input dimension [15]. For example, a stacked hourglass based on a deep encoder–decoder consists of hourglass blocks that apply 3D convolutions with a stride of two to reduce the cost volume resolution by a factor of four [16]. This allows the feature dimension to be increased with little impact on computational resources. Then, 3D transposed convolutions decode the volume back to its original dimensions.
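A minimal sketch of one such hourglass block over a cost-volume tensor (batch, channels, disparity, height, width) might look as follows. The channel widths, normalization, and skip connections are illustrative choices in the spirit of the stacked hourglass in [16], not an exact reproduction of it; the block assumes the disparity and spatial dimensions are divisible by four.

```python
import torch
import torch.nn as nn

class Hourglass3D(nn.Module):
    """One encoder-decoder hourglass block over a cost volume (B, C, D, H, W).

    Two stride-2 3D convolutions compress the volume to 1/4 resolution;
    two transposed convolutions restore the original size, with skip
    connections preserving fine detail.
    """
    def __init__(self, channels):
        super().__init__()
        self.down1 = nn.Sequential(
            nn.Conv3d(channels, channels * 2, 3, stride=2, padding=1),
            nn.BatchNorm3d(channels * 2), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(
            nn.Conv3d(channels * 2, channels * 2, 3, stride=2, padding=1),
            nn.BatchNorm3d(channels * 2), nn.ReLU(inplace=True))
        self.up1 = nn.Sequential(
            nn.ConvTranspose3d(channels * 2, channels * 2, 3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm3d(channels * 2))
        self.up2 = nn.Sequential(
            nn.ConvTranspose3d(channels * 2, channels, 3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm3d(channels))

    def forward(self, x):
        skip = self.down1(x)                    # 1/2 resolution, 2C channels
        mid = self.down2(skip)                  # 1/4 resolution
        up = torch.relu(self.up1(mid) + skip)   # decode with skip at 1/2
        return torch.relu(self.up2(up) + x)     # back to the input size
```

The computational saving comes from the middle of the block: the widest convolutions operate on a volume with 64x fewer voxels than the input, so doubling the channel count there is comparatively cheap.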
2. Hourglass 3D CNN for Stereo Disparity Estimation for Mobile Robots
Zbontar et al. [17] originally introduced a CNN-based stereo-matching technique in which a convolutional neural network learns a similarity metric over small patch pairs. GC-Net [9] was one of the first methods incorporating a 4D cost volume, using a soft argmin operation in the disparity regression step to obtain the best matching disparity. PSMNet [10] introduced a spatial pyramid pooling module and 3D stacked hourglass networks and yielded promising results. The authors of [18] proposed GwcNet, which combines a modified 3D stacked hourglass architecture with a 3D cost volume based on group-wise correlation. GA-Net [11] includes a semi-global aggregation layer and a local guided aggregation layer to replace several 3D convolution layers. To replace the 3D architecture entirely, AANet [19] employs intra-scale and cross-scale cost aggregation modules, which reduce inference time while maintaining equivalent accuracy. DeepPruner [20], a coarse-to-fine approach, includes a differentiable PatchMatch-based module to estimate a pruned search range for each pixel. Although 4D cost-volume-based methods have achieved promising results, they operate at a high computational cost and do not accommodate real-time operation on low-budget devices.
Therefore, some recent studies have focused on lightweight stereo networks based on 4D cost volumes to achieve real-time performance while maintaining competitive accuracy. These methods typically construct and aggregate cost volumes at low resolution to significantly reduce the computational cost. For instance, StereoNet [13] uses an edge-preserving refinement network that takes the left image as guidance to recover high-frequency details. Gu et al. [21] proposed a cascade cost volume consisting of two stages: the cost volume of the early stage is built on a low-resolution feature map, and the later stage then uses the disparity map estimated by the earlier stage to construct new cost volumes over finer, more semantic features. This leads to a remarkable improvement in GPU memory consumption and computation time.
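The cascade idea can be sketched as follows: once a coarse disparity map exists, the later stage only needs to search a narrow band of offsets around it. The warping-based formulation below is a hypothetical illustration of this narrowing, assuming bilinear sampling of the right feature map; it is not taken from the implementation in [21].

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right_feat, disp):
    """Sample right-view features at x - d for every left-view pixel.

    right_feat: (B, C, H, W) feature map; disp: (B, H, W) float disparities.
    """
    b, c, h, w = right_feat.shape
    xs = torch.linspace(-1.0, 1.0, w, device=disp.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1.0, 1.0, h, device=disp.device).view(1, h, 1).expand(b, h, w)
    # Shift the normalized x-coordinates left by the disparity.
    grid = torch.stack((xs - 2.0 * disp / (w - 1), ys), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(right_feat, grid, align_corners=True)

def residual_cost_volume(left_feat, right_feat, coarse_disp, radius):
    """Later-stage cost volume over offsets in [-radius, radius].

    Instead of sweeping all disparities, only 2 * radius + 1 hypotheses
    around the (upsampled) coarse disparity are evaluated.
    """
    costs = []
    for offset in range(-radius, radius + 1):
        warped = warp_right_to_left(right_feat, coarse_disp + offset)
        costs.append((left_feat - warped).abs().mean(dim=1))  # (B, H, W)
    return torch.stack(costs, dim=1)  # (B, 2*radius+1, H, W)
```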
AAFS [12] constructs its 4D cost volume by adopting a distance metric, collapsing the feature dimension to a single matching cost (height × width × disparity × 1). A disparity map is then computed at the lowest resolution, and disparity residuals are estimated in later stages. However, its 3D CNN cannot exploit the contextual information needed for cost volume regularization, which limits its estimation accuracy.
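A minimal sketch of such a distance-metric cost volume is shown below, using a mean absolute feature difference as the metric; the actual metric and resolution schedule in AAFS [12] may differ.

```python
import torch

def build_distance_cost_volume(left_feat, right_feat, max_disp):
    """Cost volume with one scalar cost per disparity: (B, D, H, W).

    Collapsing the feature dimension (height x width x disparity x 1)
    keeps the volume far smaller than concatenation-based 4D volumes,
    which is what makes real-time aggregation at low resolution feasible.
    """
    b, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (left_feat - right_feat).abs().mean(dim=1)
        else:
            # Mean absolute difference between left pixels and right
            # pixels shifted d columns; out-of-view columns stay zero.
            volume[:, d, :, d:] = (
                left_feat[:, :, :, d:] - right_feat[:, :, :, :-d]
            ).abs().mean(dim=1)
    return volume
```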
This entry is adapted from the peer-reviewed paper 10.3390/app131910677