Sequential Spacecraft Depth Completion

The recently proposed spacecraft three-dimensional (3D) structure recovery method based on optical images and LIDAR has extended the working distance of a spacecraft’s 3D perception system.

  • depth completion
  • sequential depth completion
  • multi-modal fusion

1. Introduction

The number of satellites in Earth’s orbit has surged in recent years. However, as many satellites encounter malfunctions or deplete their fuel reserves, the need for on-orbit maintenance [1] or the recovery of critical components [2] has become imperative. During the execution of on-orbit maintenance tasks, acquiring a target’s precise three-dimensional point cloud data is paramount, as these data play a pivotal role in various aspects of space operations, such as navigation [3], three-dimensional (3D) reconstruction, pose estimation [4][5], component identification and localization [6], and decision-making. Consequently, the acquisition of precise 3D point cloud data from a target object has emerged as a critical and fundamental requirement for the successful execution of numerous space missions conducted in the dynamic and challenging space environment.
To date, different sensor options have been proposed to obtain point cloud data efficiently and accurately, and they can be categorized into multi-camera vision systems [7], time-of-flight (TOF) cameras [8], and techniques that combine monocular and LIDAR systems [9]. Among these, multi-camera-based solutions utilize the triangulation principle to recover the depths of extracted feature points, though they struggle with smooth surfaces or repetitive textures. Furthermore, a binocular camera’s baseline dramatically limits such a system’s working distance, making it difficult to meet the requirements of on-orbit tasks. Unlike binocular systems, TOF cameras accurately determine depths by gauging the time of flight of laser pulses. Although capable of obtaining precise, high-density depths, TOF cameras generally have working distances of less than 10 m, hindering their use in practical applications. Recently, combinations of monocular and LIDAR systems have been proposed; they utilize optical images and sparse ranging information to restore a spacecraft’s dense depths. Compared with binocular systems and TOF cameras, combining a monocular camera with LIDAR can effectively increase a system’s working distance and reduce its sensitivity to lighting conditions and materials, making it more suitable for practical applications in space. Therefore, this paper aims to reconstruct a spacecraft’s detailed depth using images obtained with an optical camera and sparse depths obtained via LIDAR.
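To illustrate why a binocular camera’s baseline limits the working distance, the sketch below applies the triangulation relation Z = fB/d together with its first-order error propagation; the focal length, baseline, and disparity-noise values are hypothetical placeholders rather than parameters of the cited systems.

```python
# Illustrative sketch (not from the cited works): stereo triangulation and how a
# fixed baseline limits the useful working distance. All numbers are hypothetical.
import numpy as np

def stereo_depth(focal_px: float, baseline_m: float, disparity_px: np.ndarray) -> np.ndarray:
    """Depth from disparity via the triangulation principle: Z = f * B / d."""
    return focal_px * baseline_m / np.clip(disparity_px, 1e-6, None)

def depth_error(focal_px: float, baseline_m: float, depth_m: np.ndarray,
                disparity_noise_px: float = 0.5) -> np.ndarray:
    """First-order depth uncertainty: dZ ~ Z^2 / (f * B) * dd.
    The error grows quadratically with range, so a short baseline quickly
    becomes the bottleneck at typical rendezvous distances."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_noise_px

# Example: with f = 1000 px and B = 0.3 m, the depth error at 50 m already exceeds 4 m.
print(depth_error(1000.0, 0.3, np.array([10.0, 50.0, 100.0])))
```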
Numerous learning-based depth completion algorithms have been proposed in recent years, tailored to the demands of diverse applications relying on depth information. Existing methods can be roughly categorized into early and late fusion models, depending on the layers where the multimodal data are fused. Early fusion models [10][11][12][13] concatenated the visible images and the depth maps directly and fed them into a U-Net-like network to regress dense depths. Late fusion models [9][14][15][16][17] adopted multiple sub-networks to extract the unimodal features contained in optical images and LIDAR separately; the extracted unimodal features were fused through various fusion modules and fed into a decoder to regress dense depths.
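The two fusion families can be contrasted with a minimal sketch. The toy networks below are illustrative assumptions (layer widths, module names, and the simple 1×1 fusion convolution are not taken from the cited methods); they only show where the RGB image and the sparse depth map are combined in each family.

```python
# Minimal PyTorch sketch contrasting early and late fusion for depth completion.
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Early fusion: concatenate the RGB image and the sparse depth map into a
    4-channel tensor and regress dense depth with a single encoder-decoder."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))

    def forward(self, rgb, sparse_depth):
        return self.decoder(self.encoder(torch.cat([rgb, sparse_depth], dim=1)))

class LateFusionNet(nn.Module):
    """Late fusion: two parallel encoders extract unimodal features that are
    merged by a fusion module before a shared decoder."""
    def __init__(self):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.rgb_branch, self.depth_branch = branch(3), branch(1)
        self.fuse = nn.Conv2d(128, 64, 1)  # stand-in for a learned fusion module
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))

    def forward(self, rgb, sparse_depth):
        feats = torch.cat([self.rgb_branch(rgb), self.depth_branch(sparse_depth)], dim=1)
        return self.decoder(self.fuse(feats))

rgb, sd = torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64)
print(EarlyFusionNet()(rgb, sd).shape, LateFusionNet()(rgb, sd).shape)
```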

2. Sequential Spacecraft Depth Completion

LIDAR and monocular-based depth completion tasks aim to reconstruct pixel-wise depths from sparse ranging information obtained via LIDAR under the guidance of an optical image, and they have received considerable research interest due to their significance in different applications. Early research works [18][19][20][21] generally utilized traditional image-processing techniques (such as bilateral filters [22], global optimization [23], etc.) to generate dense depth maps. More recently, neural networks’ powerful feature extraction capabilities have propelled learning-based methods to outperform conventional techniques in both accuracy and efficiency. According to their LIDAR/monocular fusion strategies, learning-based depth completion methods can roughly be classified into early fusion models and late fusion models.
Early fusion models [10][11][12][13] treated sparse depths as additional channels and fed the concatenated RGB-D data into a U-Net-like network to predict dense depths. For instance, sparse-to-dense [10] employed a regression network to predict pixel-wise depths with RGB-D data as input. Although the structure of such methods is simple and easy to implement, it is difficult to fully exploit the complementary information of the different modalities due to the lack of adequate guidance, leading to blurry depth predictions. Therefore, various spatial propagation networks (SPNs) [24][25][26][27][28][29] have been introduced to improve the quality of depth maps derived from early fusion models. Specifically, the SPN [24] learned an affinity matrix that captures pairwise interactions within an image and established a three-way connection to facilitate spatial propagation. The CSPN [25] replaced the three-way connection propagation with recurrent convolution operations, overcoming the SPN’s limitation of not considering all local neighbors simultaneously. On this basis, more and more variants (such as learning adaptive kernel sizes and adjusting iterations for each pixel [26], applying non-local propagation [27], making use of non-linear propagation [28], etc.) were proposed and yielded better depth completion results.
Late fusion models employed two parallel neural network branches to extract features from RGB images and depth data concurrently. Parallel neural network architectures have seen extensive adoption across diverse image-processing tasks, such as image classification [30], multi-sensor data fusion [31][32], object detection [33], etc., and their widespread usage verifies their versatility and effectiveness in multi-source data processing. In depth completion tasks, the extracted image features are generally incorporated into depth features through various finely designed fusion modules before being fed into a decoder to generate dense depths. Specifically, FusionNet [14] adopted 2D and 3D convolutions to extract 2D and 3D features, respectively; the 3D features were then projected into the 2D space, and the composite representations were generated by adding the 2D features and the projected 3D features. Inspired by guided filtering [22], GuideNet [15] proposed a guided unit to predict content-dependent kernels, which were then leveraged for extracting depth features. FCFRNet [16] combined RGB-D features by employing channel-shuffling and energy-based fusion operations.
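As a rough illustration of the spatial propagation idea behind the CSPN family discussed above, the sketch below refines a coarse depth map by repeatedly blending each pixel with an affinity-weighted combination of its eight neighbors while re-anchoring on the measured LIDAR points; the kernel size, iteration count, and normalization details are simplifying assumptions rather than the exact formulation of the cited works.

```python
# Simplified sketch of CSPN-style spatial propagation for depth refinement.
import torch
import torch.nn.functional as F

def cspn_refine(coarse_depth: torch.Tensor, affinity: torch.Tensor,
                sparse_depth: torch.Tensor, iterations: int = 12) -> torch.Tensor:
    """coarse_depth: (B,1,H,W); affinity: (B,8,H,W), predicted by a network;
    sparse_depth: (B,1,H,W) with zeros where LIDAR gives no return."""
    # Normalize neighbor weights so their absolute values sum to one; the
    # residual 1 - sum(w) keeps part of the center value (CSPN-style normalization).
    abs_sum = affinity.abs().sum(dim=1, keepdim=True).clamp(min=1e-6)
    neigh_w = affinity / abs_sum
    center_w = 1.0 - neigh_w.sum(dim=1, keepdim=True)

    depth = coarse_depth
    valid = (sparse_depth > 0).float()
    for _ in range(iterations):
        # Gather the 8-neighborhood of every pixel (3x3 patch minus its center).
        patches = F.unfold(F.pad(depth, (1, 1, 1, 1), mode='replicate'), kernel_size=3)
        patches = patches.view(depth.size(0), 9, *depth.shape[-2:])
        neighbors = torch.cat([patches[:, :4], patches[:, 5:]], dim=1)
        depth = center_w * depth + (neigh_w * neighbors).sum(dim=1, keepdim=True)
        # Re-anchor on measured LIDAR points so propagation never drifts from them.
        depth = valid * sparse_depth + (1.0 - valid) * depth
    return depth
```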
SDCNet [9] proposed an attention-based feature fusion module, facilitating the aggregation of complementary information from diverse inputs. In addition to single-frame depth completion methods, a few works have been dedicated to sequential depth completion [34][35][36]. Giang et al. [34] performed feature warping by utilizing the relative poses between frames and incorporated the warped features into the current features through a confidence-based integration module. Nguyen et al. [35] directly fed the prediction results of FusionNet [14] into recurrent neural networks to exploit temporal information, helping mitigate the mismatch between frames. Moreover, Chen et al. [36] utilized CoarseNet, PoseNet, and DepthNet to predict coarse dense maps, relative poses between frames, and final depth maps, respectively.
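The pose-based feature warping used by the sequential methods above can be sketched as an inverse-warping step: a coarse depth estimate of the current frame back-projects each pixel to 3D, the relative pose moves it into the previous camera frame, and the previous-frame features are resampled at the reprojected locations before a confidence-weighted blend. The function names, the simple blending rule, and the use of a coarse current depth are assumptions for illustration, not the cited implementations.

```python
# Illustrative sketch of pose-based feature warping for sequential depth completion.
import torch
import torch.nn.functional as F

def warp_previous_features(prev_feat, cur_depth, K, T_prev_cur):
    """prev_feat: (B,C,H,W) features of the previous frame;
    cur_depth: (B,1,H,W) coarse depth of the current frame;
    K: (B,3,3) camera intrinsics; T_prev_cur: (B,4,4) transform from the current
    to the previous camera frame. Returns previous features resampled into the
    current view (inverse warping)."""
    B, C, H, W = prev_feat.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing='ij')
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)        # (3,H,W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                     # (B,3,HW)
    # Back-project current pixels to 3D, move them into the previous camera frame,
    # and project them onto the previous image plane.
    pts = torch.linalg.inv(K) @ pix * cur_depth.view(B, 1, -1)     # (B,3,HW)
    pts = T_prev_cur[:, :3, :3] @ pts + T_prev_cur[:, :3, 3:]      # (B,3,HW)
    proj = K @ pts
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                 # (B,2,HW)
    # Normalize to [-1, 1] for grid_sample.
    grid = uv.permute(0, 2, 1).reshape(B, H, W, 2).clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(prev_feat, grid, align_corners=True)

def confidence_fuse(cur_feat, warped_feat, confidence):
    """Blend current and warped features with a per-pixel confidence in [0, 1]."""
    return confidence * warped_feat + (1.0 - confidence) * cur_feat
```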