Vision-Based Pose Estimation of Non-Cooperative Target: Comparison
Please note this is a comparison between Version 1 by Xiao Ling and Version 2 by Rita Xu.

In non-cooperative space security and on-orbit servicing, a significant challenge is accurately determining the pose of abandoned satellites using imaging sensors. Traditional methods for estimating the target's pose suffer from stray light interference in space, which leads to inaccurate results.

  • non-cooperative targets
  • stray light interference
  • vision-based pose estimation

1. Introduction

As human exploration and development of outer space advances, countries demand higher levels of space technology [1]. Some of the key challenges in the aerospace field are spacecraft rendezvous and docking, on-orbit capture and repair of malfunctioning satellites, and space debris removal [2]. These challenges require the ability to perform rendezvous, docking, and capture of non-cooperative targets [3]. However, this task depends on the relative pose measurement of non-cooperative targets, which is difficult to achieve due to the poor quality of space images. Space images often have low contrast and texture and are affected by stray light in space. Non-cooperative targets lack artificial markers and feature cursors for auxiliary measurement, making it hard to obtain geometric, grayscale, depth, and other information about the target surface [4]. Various factors limit the availability of samples, which poses problems and challenges for attitude measurement.
There are various methods to achieve the pose measurement of non-cooperative targets, depending on the sensors used. These methods include visual target measurement, scanning laser radar measurement, non-scanning three-dimensional laser imaging measurement [5], pose measurement method based on multi-sensor fusion [6], and so on. The visual measurement method uses a camera to obtain the target image. This method is simple and does not require complex structures or too many devices. It can measure the target with only a camera and a computer, but it requires high computing power. Binocular vision can calculate the target distance and real size using the principle of triangulation, which is more suitable for the pose measurement of space non-cooperative targets [7]. However, this method also requires that the pose estimation algorithm can detect and process image feature information. Moreover, the optical images are more vulnerable to stray light, which affects the recognition and detection of space targets and indirectly leads to the scarcity of data set samples.
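The triangulation principle behind binocular vision can be sketched in a few lines. This is a minimal illustration, assuming an ideal rectified stereo pair with identical focal lengths and a horizontal baseline; the `StereoRig` type, its field names, and the numeric values are hypothetical and do not come from the cited works.

```python
from dataclasses import dataclass


@dataclass
class StereoRig:
    focal_px: float    # focal length in pixels (assumed equal for both cameras)
    baseline_m: float  # distance between the two optical centres, in metres
    cx: float          # principal point, x (pixels)
    cy: float          # principal point, y (pixels)


def triangulate(rig, u_left, v_left, u_right):
    """Recover a 3D point (metres, left-camera frame) from a matched pixel pair."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("non-positive disparity: mismatch or point at infinity")
    z = rig.focal_px * rig.baseline_m / disparity  # depth from similar triangles
    x = (u_left - rig.cx) * z / rig.focal_px
    y = (v_left - rig.cy) * z / rig.focal_px
    return x, y, z


def real_size(rig, pixel_extent, depth_m):
    """Convert an extent measured in pixels to metres at a known depth."""
    return pixel_extent * depth_m / rig.focal_px


rig = StereoRig(focal_px=1000.0, baseline_m=0.5, cx=320.0, cy=240.0)
print(triangulate(rig, u_left=420.0, v_left=240.0, u_right=370.0))
print(real_size(rig, pixel_extent=200.0, depth_m=10.0))
```

Depth is inversely proportional to disparity, so a one-pixel matching error matters far more for distant targets than for close ones. This is one reason degraded image features hurt stereo-based pose measurement.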
Deep learning methods have now been applied to many fields beyond image recognition, and the Transformer model is a rising star in non-cooperative target detection and recognition. Since the Transformer structure was carried over from natural language processing to computer vision, it has overcome the limited receptive field of CNNs. It has gained significant attention because it predicts detection boxes and classes directly: it does not require proposals like Faster R-CNN, anchors like YOLO, or center points like CenterNet, and it needs no post-processing steps such as NMS. A detector is typically organized into three parts. The Backbone, as a feature extraction network, extracts relevant information from images for subsequent stages. The Neck fuses and enhances the features extracted by the Backbone before passing them to the Head. The Head uses these features to predict the position and class of objects [8]. As a target detection method, DETR brought Transformers into the field of object detection, opening up new research avenues [9]. YOLOS is a series of ViT-based object detection models with minimal modifications and inductive biases [10]. DETR also has several variants. To address the slow convergence of DETR, researchers proposed Deformable DETR, TSP-FCOS, and TSP-RCNN [11,12]. Deformable DETR uses deformable attention, inspired by deformable convolution, to address slow convergence and low detection accuracy for small objects through sparse spatial sampling. ATC primarily alleviates redundancy in the attention maps of DETR and the feature redundancy that arises as the encoder deepens. The Transformer network in the Neck section therefore has mature research solutions that can significantly enhance accuracy. Furthermore, for non-cooperative targets, appropriate modifications can prevent the loss of information when reading patch information. This approach can retain more feature information, which matters given the scarcity of information sources.
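To make "reading patch information" concrete, the sketch below splits a grayscale image into flattened patches, which is the tokenization step that ViT-style models apply before the Transformer layers. The function name and the `stride` parameter are assumptions for illustration. Using a stride smaller than the patch size (overlapping patches) is shown only as one hypothetical way to avoid cutting features at patch borders, not as the modification the text refers to.

```python
def split_into_patches(image, patch, stride=None):
    """Split a 2D image (list of rows) into flattened square patches.

    stride defaults to the patch size (non-overlapping patches);
    a smaller stride yields overlapping patches.
    """
    stride = patch if stride is None else stride
    height, width = len(image), len(image[0])
    if height < patch or width < patch:
        raise ValueError("image is smaller than one patch")
    tokens = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            # Flatten the patch row by row into one token vector.
            token = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            tokens.append(token)
    return tokens


# A 4x4 toy image whose pixel values are 0..15 in row-major order.
image = [[4 * r + c for c in range(4)] for r in range(4)]
print(len(split_into_patches(image, patch=2)))            # non-overlapping
print(len(split_into_patches(image, patch=2, stride=1)))  # overlapping
```

With non-overlapping patches, a small feature that straddles a patch border is divided between two tokens. Overlap keeps it whole in at least one token, at the cost of a longer token sequence.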

2. Traditional and Deep Learning Methods

To acquire target model information in noisy environments, some traditional research methods transform the pose estimation problem into a template matching problem and use essential matrices for pose initialization. Pose calculation involves image filtering, edge detection, line extraction, and stereo matching. A three-dimensional model of non-cooperative micro- and nanosatellites is reconstructed using a stereo vision system [13]. A feature-matching method then estimates the target's relative pose, and ground experiments assess the algorithm's accuracy. Segal S et al. [14] employ the principles of binocular vision measurement and use an Extended Kalman Filter to track and observe target feature points, achieving pose measurement for non-cooperative spacecraft; they also construct a trial system for estimating non-cooperative target poses. However, non-cooperative images often vary in quality, and traditional methods lose significant accuracy on blurry or smooth-edged targets, making them inadequate for complex non-cooperative target measurements. Although algorithms based on horizontal and vertical feature lines have been proposed to derive fundamental matrices without paired point information, their reliance on high-quality imagery conflicts with the scarcity of suitable non-cooperative target image datasets. As a result, these methods face significant limitations in practical applications. Deep learning methods do not depend on the target model, do not need manual feature design, and generalize better when the training data are adequate. Li K et al. [15] proposed a method that outperforms heatmap- and regression-based methods and improves uncertainty prediction. Zhu Z et al. [16] suggested an algorithm that can effectively suppress interference points and enhance the accuracy of non-cooperative target pose estimation. Despond F T [17] used a novel convolutional model to estimate the relative x, y, and attitude of the target spacecraft. Deep learning methods are more versatile and robust across different targets and scenarios than traditional methods and can be applied more effectively to non-cooperative pose estimation.
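The idea of tracking feature points with a filter can be illustrated with a deliberately simplified example. The sketch below is a linear constant-velocity Kalman filter for one image coordinate of one feature point. It is a stand-in for, not a reproduction of, the Extended Kalman Filter used in [14]; the state layout, the noise values `q` and `r`, and the function name are assumptions chosen for illustration.

```python
def kalman_step(state, cov, measurement, dt=1.0, q=1e-4, r=1.0):
    """One predict/update cycle for state [position, velocity].

    cov is the 2x2 state covariance as nested lists; only position is measured.
    """
    pos, vel = state
    # Predict with F = [[1, dt], [0, 1]] and process noise Q = q * I.
    pos_pred = pos + dt * vel
    vel_pred = vel
    p00 = cov[0][0] + dt * (cov[0][1] + cov[1][0]) + dt * dt * cov[1][1] + q
    p01 = cov[0][1] + dt * cov[1][1]
    p10 = cov[1][0] + dt * cov[1][1]
    p11 = cov[1][1] + q
    # Update with H = [1, 0]: innovation, its variance, and the Kalman gain.
    innovation = measurement - pos_pred
    s = p00 + r
    k0, k1 = p00 / s, p10 / s
    new_state = [pos_pred + k0 * innovation, vel_pred + k1 * innovation]
    new_cov = [
        [(1.0 - k0) * p00, (1.0 - k0) * p01],
        [p10 - k1 * p00, p11 - k1 * p01],
    ]
    return new_state, new_cov


# Track a feature drifting at 2 pixels per frame, starting from a poor guess.
state, cov = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]
for frame in range(1, 31):
    state, cov = kalman_step(state, cov, measurement=2.0 * frame)
print(state)
```

The filter recovers the velocity from position-only measurements. The same mechanism lets a tracker bridge frames where a feature is briefly lost or corrupted.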

3. Small-Sample Training

To address the challenge of pose estimation for non-cooperative space targets with limited real samples, researchers have also turned to deep learning methods and conducted a series of studies. As the most mature image processing networks, neural network approaches have been widely employed in non-cooperative target pose estimation, forming the basis for numerous improved and optimized algorithms that address various scenarios. Pasqualetto Cassinis L et al. [18] combine convolutional neural network-based feature extraction with the CEPPnP (covariant efficient Procrustes perspective-n-point) method and Extended Kalman Filtering for non-cooperative target pose estimation. Hou X et al. [19] introduce a hybrid artificial neural network estimation algorithm based on dual quaternion vectors. Ma C et al. [20] propose a Neural Network-Enhanced Kalman Filter (NNEKF), which improves filter performance using the virtual observation of inertial characteristics. Huan W et al. [21] employ existing object detection networks and keypoint regression networks to predict 2D keypoint coordinates, reconstructing a 3D model through multi-viewpoint triangulation and refining the 3D coordinates with nonlinear least squares to predict position and orientation. Li Xiang et al. [22] designed a non-cooperative target pose estimation network based on the Google Inception Net model. Outside the space domain, an MEGNN-based method applied to the PHM 2010 milling TCM dataset outperformed three deep learning baselines (CNN, AlexNet, ResNet) under small samples [23]. Pan T et al. [24] proposed a generative adversarial network (GAN), which is considered a promising way to solve the small-sample problem. Ma et al. [25] proposed a face recognition method based on sparse representation of deep learning features. This method first extracts face features using a deep CNN and then classifies them by sparse representation. Experiments show that this method improves recognition accuracy by 6–60% compared with traditional methods, copes effectively with interference caused by intra-class changes such as lighting, pose, expression, and occlusion, and has a greater advantage on small-sample problems. Although deep learning methods have been applied to space target scenarios, their efficacy is still hampered by the scarcity of real samples; they often rely on simulation datasets for training, which leaves room for improvement in accuracy and methodology.
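Keypoint-based pipelines like those above score a candidate pose by how closely the projected model keypoints match the detected 2D keypoints. A minimal sketch of that reprojection error follows, assuming an ideal pinhole camera without lens distortion. The function names, camera parameters, and keypoints are hypothetical, and this is not the implementation of any cited work. A PnP solver or nonlinear least-squares routine would search for the rotation and translation that minimize this quantity.

```python
import math


def transform(rotation, translation, point):
    """Apply X_cam = R * X_model + t, with R given as a 3x3 nested list."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point_cam, focal_px, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def reprojection_rmse(model_points, observed_pixels, rotation, translation,
                      focal_px, cx, cy):
    """Root-mean-square pixel distance between projected and observed keypoints."""
    squared = 0.0
    for point, (u_obs, v_obs) in zip(model_points, observed_pixels):
        u, v = project(transform(rotation, translation, point), focal_px, cx, cy)
        squared += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return math.sqrt(squared / len(model_points))


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
keypoints = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)]
true_t = (0.0, 0.0, 10.0)
pixels = [project(transform(IDENTITY, true_t, p), 1000.0, 320.0, 240.0)
          for p in keypoints]

print(reprojection_rmse(keypoints, pixels, IDENTITY, true_t, 1000.0, 320.0, 240.0))
print(reprojection_rmse(keypoints, pixels, IDENTITY, (0.0, 0.0, 12.0),
                        1000.0, 320.0, 240.0))
```

The error is zero at the true pose and grows as the candidate translation moves away from it, which is what makes it usable as an optimization objective.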

4. Stray Light

When optical sensors collect space signals, non-target light is captured in the form of stray light, and this interference is difficult to suppress or eliminate completely. Existing methods can only reduce the impact of stray light interference [26]. For complex space environments, many studies have also incorporated methods for handling noise specific to space. Yang Ming et al. [27] address the significant effects of lighting and the Earth background on non-cooperative spacecraft attitude measurement, proposing an end-to-end attitude estimation method based on convolutional neural networks with AlexNet and ResNet architectures. Compared with using regression methods alone for attitude estimation, this approach effectively reduces the mean absolute error, standard deviation, and maximum error of attitude estimation. The synthetic images used for network training account for factors such as noise and lighting in orbit. Additionally, Sharma S et al. [28] introduce the SPN (spacecraft pose network) model, which trains the network using grayscale images. The SPN model consists of three branches: the first uses a detector to find the bounding boxes of the target in the input image, and the other two use the regions within the 2D bounding boxes to determine the relative pose. Improving accuracy also raises another issue: the scarcity of samples in space target data. To address the small-sample problem, the dataset of the target is built using Unity3D 2019 software [29]. To simulate the space lighting environment, the brightness of the simulated sunlight is set randomly, starry background noise is added randomly, and the data are normalized for data enhancement. Jiang Zhaoyang et al. [30] designed a dual-channel neural network based on VGG and DenseNet architectures to locate the pixels corresponding to feature points in the image and output their pixel coordinates, and proposed a neural network pruning method to make the network lightweight. Addressing the interference of space lighting and the issue of small samples, Sharma S et al. [31] present a monocular image-based pose estimation network. Phisannupawong T et al. and Chen B et al. [32,33] achieve 6-DOF pose estimation for non-cooperative spacecraft using pre-trained deep models. Although Sonawani S et al. [34] were the first to create a dataset for non-cooperative targets using a semi-physical simulation platform, there has not yet been extensive research into algorithms that simultaneously handle stray light and small sample sizes.
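The simulated-lighting augmentation described above (random sunlight brightness, random starry background noise, and normalization) can be sketched on a plain grayscale array. This is an illustrative stand-in rather than the Unity-based pipeline of [29]; the function name, the gain range, the star probability, and the star intensity range are all assumed values.

```python
import random


def augment(image, rng, gain_range=(0.5, 1.5), star_prob=0.01):
    """Return a brightness-jittered, star-speckled copy of image scaled to [0, 1].

    image is a list of rows of 8-bit grayscale values (0..255);
    rng is a random.Random instance, so results are reproducible.
    """
    gain = rng.uniform(*gain_range)  # one global gain per image: "sunlight" level
    augmented = []
    for row in image:
        new_row = []
        for pixel in row:
            value = min(255.0, pixel * gain)  # clip saturated pixels
            if rng.random() < star_prob:
                # Overwrite with a bright speck to imitate a background star.
                value = max(value, rng.uniform(200.0, 255.0))
            new_row.append(value / 255.0)  # normalize to [0, 1]
        augmented.append(new_row)
    return augmented


rng = random.Random(0)
frame = [[0, 30, 60], [90, 120, 150], [180, 210, 240]]
samples = [augment(frame, rng) for _ in range(4)]  # four variants of one render
print(len(samples), len(samples[0]), len(samples[0][0]))
```

Drawing several variants from each rendered frame is a cheap way to enlarge a small dataset, although it cannot replace the diversity of real on-orbit imagery.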