1. Introduction
Accurately estimating the six-degree-of-freedom (6-DoF) pose of objects is a critical task in various applications, including robotics, autonomous driving, and virtual reality. For instance, the precise estimation of an object's spatial coordinates and rotational orientation is essential for robotic tasks such as manipulation, navigation, and assembly. However, achieving robustness in 6-DoF detection remains a challenging problem: in real-world applications, many object types exhibit significant occlusion and large variations in lighting conditions. With the increasing reliability of RGB-D image sensors, 6-DoF detection of visual targets based on multi-source image information is flourishing. Researchers have explored a number of ways [1,2,3] to fuse RGB and depth image data to guide 6-DoF detection of visual targets with impressive accuracy. Different research teams employ various frameworks to investigate the 6-DoF pose estimation problem; some focus on the overall algorithmic framework, while others delve into efficient feature extraction.
Regarding the problem of object pose estimation, earlier approaches predominantly employed adaptive matrices. However, with the rise of convolutional neural networks (CNNs) and transformers, deep learning (DL) based methods are now used to solve the 6-DoF estimation problem. There are two main types of DL-based frameworks for 6-DoF pose estimation of objects: end-to-end architectures [4,5] and two-stage segmentation-pose regression architectures [6,7]. End-to-end models integrate multiple visual processing steps into a single model; therefore, their pipelines are less complex and less computationally intensive. A single network processes the pixel information of the image to deduce the region where the candidate target is located and its corresponding 6-DoF pose. However, the internal structure and decision-making process of such a network are more opaque and less interpretable, and the network is more difficult to train. The two-stage segmentation-pose regression architecture, on the other hand, first segments the visual target from the scene and then obtains its pose by regression. This approach can focus on the visual target being detected and exclude interference from the background, yielding more reliable results.
In the process of estimating 6-DoF pose from image features, there have been numerous prior efforts. Some employ manually designed features (such as SIFT) to extract object characteristics for subsequent pose regression; however, the limited number of such features that can be extracted may cause pose regression to fail. Depth images provide dense features, yet enhancing the robustness of these depth features remains an unsolved challenge. Relying solely on RGB or depth information addresses only one facet of the problem, so fusing RGB-D data is a natural way to accomplish the task. Prior research has made significant strides in exploring the fusion of RGB and depth images, with many studies delving into techniques and algorithms that aim to exploit the complementary information these modalities provide. Despite these efforts, achieving seamless integration between RGB and depth images remains an ongoing and formidable challenge. Existing methods often struggle to align the two modalities accurately, resulting in suboptimal fusion outcomes. Moreover, inherent differences in the intrinsic characteristics of RGB and depth data, together with variations in lighting conditions and occlusion, further amplify the complexity of the fusion process. As such, continued research and innovation are needed to advance the fusion of RGB and depth images for target pose detection.
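As a concrete illustration of what aligning the two modalities entails, the following minimal Python sketch back-projects a depth map into a per-pixel 3D point map using pinhole camera intrinsics, so that each RGB pixel can be paired with a 3D point before any fusion. The intrinsic values and the random depth map are placeholders, not values taken from any method discussed here.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (metres) into an HxWx3 point map in the
    camera frame, so each pixel's colour and 3D point stay aligned."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Placeholder intrinsics and depth; real values come from sensor calibration.
depth = np.random.uniform(0.5, 2.0, size=(480, 640)).astype(np.float32)
points = depth_to_points(depth, fx=572.4, fy=573.6, cx=325.3, cy=242.0)
print(points.shape)  # (480, 640, 3)
```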
2. Enhancing 6-DoF Object Pose Estimation
2.1. Feature Representation
In vision tasks, the representation of image features plays a crucial role in various applications, including visual target recognition and detection. In the context of target pose estimation, it is essential for the features of visual targets to exhibit robustness against translation, rotation, and scaling. Additionally, these features should possess local descriptive capabilities and resistance to noise.
In previous studies, researchers have utilized image feature matching to detect the position of visual targets; the pose of the target can then be obtained by solving the 2D-to-3D PnP problem. Artificially designed features such as SIFT [8,9], SURF [10], DAISY [11], ORB [12], BRIEF [13], BRISK [14], and FREAK [15] have demonstrated robustness against occlusion and scale variation, and these descriptors have been widely adopted in models for target position detection. Similarly, 3D local features such as PFH [16,17,18], FPFH [19], SHOT [20], C-SHOT [21], and RSD [22] can effectively extract features and detect the position of targets in 3D point clouds. Recently, machine learning-based feature descriptor algorithms [23,24] have been receiving more and more attention in the field of image matching. These methods employ PCA [25], random trees [26], random ferns [27], and boosting [28] to obtain features that are more robust than hand-designed ones.
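To make the descriptor-matching pipeline concrete, the sketch below (using OpenCV, with ORB standing in for the descriptors above) matches template and query keypoints and recovers the pose via RANSAC-based PnP. The `model_points_3d` lookup, which maps each template keypoint index to a 3D point on the object model, is a hypothetical placeholder; building it depends on the specific object model used.

```python
import cv2
import numpy as np

def estimate_pose(template_img, query_img, model_points_3d, K):
    """Descriptor matching followed by 2D-to-3D PnP (illustrative sketch)."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp_t, des_t = orb.detectAndCompute(template_img, None)
    kp_q, des_q = orb.detectAndCompute(query_img, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_t, des_q), key=lambda m: m.distance)

    # 3D model points for the matched template keypoints, and their
    # 2D observations in the query image.
    obj_pts = np.float32([model_points_3d[m.queryIdx] for m in matches])
    img_pts = np.float32([kp_q[m.trainIdx].pt for m in matches])

    # Robust PnP with RANSAC to reject outlier matches.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
    return ok, rvec, tvec
```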
However, when the surface of the visual target is smooth and lacks texture, only a limited number of manually designed feature points can be extracted, which adversely affects the reliability of object pose estimation. Furthermore, high apparent similarity among visual targets also makes it challenging to accurately estimate the pose of the detected target.
In addition to manually designed features, there are supervised learning-based feature description methods such as the triplet CNN descriptor [29], LIFT [30], L2-Net [31], HardNet [32], and GeoDesc [33]. For the recognition of textureless objects, global features can be built by utilizing image gradients or surface normals as shape attributes. Among these, template-based global features aim to identify the region of the observed image that bears the closest resemblance to the object template; commonly employed template-based algorithms include Line-MOD [34] and DTT-OPT [35]. In recent years, novel 3D deep learning methods have emerged, such as OctNet [36], PointNet [37], PointNet++ [38], and MeshNet [39]. These methods are capable of extracting distinctive deep representations through learning and can be employed for 3D object recognition or retrieval.
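The core idea behind PointNet-style deep 3D features can be summarized in a few lines: a shared per-point MLP followed by a symmetric max-pooling operation that makes the global descriptor invariant to point ordering. The PyTorch sketch below is a simplified illustration of that idea, not a reproduction of any of the cited architectures.

```python
import torch
import torch.nn as nn

class PointNetFeat(nn.Module):
    """Minimal PointNet-style encoder: a shared per-point MLP followed by
    a symmetric max-pool that yields an order-invariant global feature."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, out_dim, 1),
        )

    def forward(self, xyz):                       # xyz: (B, N, 3)
        feat = self.mlp(xyz.transpose(1, 2))      # (B, out_dim, N)
        return feat.max(dim=2).values             # (B, out_dim) global descriptor

points = torch.randn(2, 2048, 3)
print(PointNetFeat()(points).shape)               # torch.Size([2, 1024])
```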
2.2. Two-Stage or Single-Shot Approach
In the realm of object 6-DoF pose estimation frameworks, two main types can be identified: end-to-end architectures and two-stage segmentation-pose regression architectures.
In the field of object detection, notable end-to-end frameworks like YOLO [40] and SSD [41] have emerged. These frameworks have been extended to address the challenge of target pose detection. Poirson et al. [42] proposed an end-to-end object and pose detection architecture based on SSD, treating pose estimation as a classification problem using RGB images. Another extension, SSD-6D [43], utilizes multi-scale features to regress bounding boxes and classify the pose into discrete viewpoints. He et al. [44] introduced PVN3D, a method based on a deep 3D Hough voting network that fuses appearance and geometric information from RGB-D images.
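The 3D-keypoint voting idea used by PVN3D can be illustrated schematically: every visible object point predicts an offset to each predefined keypoint, and the resulting votes are aggregated to localize the keypoints. The sketch below aggregates votes with a simple mean for brevity, whereas the original method clusters them; the array shapes and the number of keypoints are illustrative assumptions.

```python
import numpy as np

def vote_keypoints(points, pred_offsets):
    """Keypoint voting sketch: each scene point casts a vote for every
    object keypoint by adding its predicted offset, and the votes are
    averaged (PVN3D itself clusters the votes instead of averaging).
    points:       (N, 3) visible object points
    pred_offsets: (N, K, 3) per-point offsets to K keypoints"""
    votes = points[:, None, :] + pred_offsets     # (N, K, 3)
    return votes.mean(axis=0)                     # (K, 3) keypoint estimates

pts = np.random.rand(500, 3)
offs = np.random.randn(500, 8, 3) * 0.01
print(vote_keypoints(pts, offs).shape)            # (8, 3)
```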
Two-stage architectures segment the visual target and then estimate its pose through regression. For example, in [45], pose estimation was treated as a classification problem using the 2D bounding box. Mousavian et al. [46] utilized a VGG backbone to classify the pose based on the 2D bounding box and regress the offset. Pereira et al. [7] proposed MaskedFusion, a two-stage network that employs an encoder–decoder architecture for image segmentation and fuses RGB-D data for pose estimation and refinement. This two-stage neural network effectively leverages the rich semantic information provided by RGB images and exhibits good decoupling, allowing the algorithm of a specific stage to be conveniently replaced when improvements are required. Additionally, this design helps reduce training costs.
However, the first stage of MaskedFusion relies solely on RGB image information, which often yields insufficient and inaccurate semantic information in low-light and low-texture scenarios. This leads to issues such as blurry edges and erroneous segmentation in the mask produced by the segmentation network in practical applications.
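A two-stage segmentation-then-regression pipeline of the kind described above can be sketched as follows. The segmentation and pose sub-networks are left as placeholders, and the quaternion-plus-translation output head is an illustrative assumption rather than the exact MaskedFusion design.

```python
import torch
import torch.nn as nn

class TwoStagePoseEstimator(nn.Module):
    """Sketch of a two-stage pipeline: stage 1 segments the object, stage 2
    regresses rotation (quaternion) and translation from the masked RGB-D
    input. Both sub-networks are placeholders supplied by the caller."""
    def __init__(self, seg_net: nn.Module, pose_net: nn.Module):
        super().__init__()
        self.seg_net = seg_net    # expected to output a (B, 1, H, W) logit map
        self.pose_net = pose_net  # expected to output a (B, 7) pose vector

    def forward(self, rgb, depth):
        mask = torch.sigmoid(self.seg_net(rgb)) > 0.5        # (B, 1, H, W)
        masked = torch.cat([rgb * mask, depth * mask], dim=1)  # keep object pixels only
        q_t = self.pose_net(masked)                            # (B, 7): quaternion + translation
        q, t = q_t[:, :4], q_t[:, 4:]
        return q / q.norm(dim=1, keepdim=True), t              # unit quaternion, translation
```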
2.3. Single Modality or Multi-Modality Fusion
2.3.1. RGB Single Modal Based Object Pose Estimation
For visual target position detection, RGB images have traditionally been used as the primary data source, and feature matching techniques are commonly employed to localize targets within 2D images. PoseCNN [47] utilizes a convolutional neural network and Hough voting to estimate the target’s pose. PVNet [48] extracts keypoints from RGB images and employs a vector-field representation for localization. Hu et al. [49] proposed a segmentation-driven framework that uses a CNN to extract features from RGB images and assigns target category labels to virtual meshes. The ROPE framework [50] incorporates holistic pose representation learning and dynamic amplification for accurate and efficient pose estimation. SilhoNet [51] also predicts object poses using a convolutional neural network pipeline. Zhang et al. [52] proposed an end-to-end deep learning architecture for object detection and pose recovery from single RGB modal data. Aing et al. [53] introduced informative features and techniques for segmentation and pose estimation.
Although image-based methods have achieved promising results in 6-DoF estimation, their performance tends to degrade when dealing with textureless and occluded scenarios.
2.3.2. 3D Cloud or Depth Image Based Object Pose Estimation
Recovering the position of a visual target from 3D point cloud or depth image data is also a common approach. The RGM method [54] introduces deep graph matching for point cloud registration, leveraging correspondences and graph structure to handle outliers. This approach replaces explicit feature matching and RANSAC with an attention mechanism, enabling an end-to-end framework that directly predicts the correspondence set; rigid transformations can then be estimated directly from the predicted correspondences without additional post-processing. The BUFFER method [55] enhances computational efficiency by predicting keypoints and improves feature representation by estimating their orientation; it utilizes a patch-wise embedder with a lightweight local feature learner for efficient and versatile piecewise features. The ICG framework [56] presents a probabilistic tracker that incorporates region and depth information and relies solely on object geometry.
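The step mentioned above for RGM, recovering a rigid transformation directly from predicted correspondences, is commonly solved in closed form with the SVD-based Kabsch procedure. The snippet below is a generic implementation of that standard solution, not RGM's own code.

```python
import numpy as np

def rigid_transform_from_correspondences(src, dst):
    """Least-squares rigid transform (Kabsch/SVD): find R, t such that
    R @ src_i + t ≈ dst_i for matched point pairs src (N,3) and dst (N,3)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)   # centroids
    H = (src - src_c).T @ (dst - dst_c)                 # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                            # fix a possible reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t
```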
Nonetheless, point cloud data are inherently sparse and lack sufficient texture information, which limits the performance of these methods. Consequently, incorporating RGB image information is a crucial complement for improving the accuracy and effectiveness of position estimation.
2.3.3. Multi-Modal Data Based Object Pose Estimation
In the field of target position detection, the fusion of information from multiple sensors has emerged as a cutting-edge research area for accurate position detection. Zhang et al. [57] proposed a hybrid Transformer-CNN method for 2-DoF object pose detection; they further proposed a bilateral neural network architecture [58] for RGB and depth image fusion and achieved promising results. In the area of 6-DoF pose detection, Wang et al. [6] introduced the DenseFusion framework for precise 6-DoF pose estimation using two data sources and a dense fusion network. MaskedFusion [7] achieved superior performance by incorporating object masking into the pipeline. Se(3)-TrackNet [59] presented a data-driven optimization approach for long-term 6D pose tracking. PVN3D [44] adopted a keypoint-based approach for robust 6-DoF object pose estimation from a single RGB-D image. FFB6D [5] introduced a bi-directional fusion network for 6-DoF pose estimation, exploiting the complementary nature of RGB and depth images. The ICG+ algorithm [60] incorporated additional texture patterns for flexible multi-camera information fusion. However, existing methods still face challenges in extracting feature information from RGB-D data.
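At its simplest, the dense per-pixel fusion used by methods such as DenseFusion concatenates the colour feature extracted at each object pixel with the geometric feature of the corresponding 3D point before further processing. The block below is a heavily simplified sketch of that idea, with feature dimensions chosen arbitrarily; it is not the architecture of any specific cited method.

```python
import torch
import torch.nn as nn

class DenseFusionBlock(nn.Module):
    """Simplified per-point fusion in the spirit of DenseFusion/FFB6D:
    colour features sampled at each object pixel are concatenated with the
    geometric feature of the corresponding 3D point and mixed by an MLP."""
    def __init__(self, rgb_dim=32, geo_dim=32, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(rgb_dim + geo_dim, out_dim, 1), nn.ReLU(),
            nn.Conv1d(out_dim, out_dim, 1),
        )

    def forward(self, rgb_feat, geo_feat):        # both (B, C, N), one feature per point
        fused = torch.cat([rgb_feat, geo_feat], dim=1)
        return self.mlp(fused)                    # (B, out_dim, N) per-point fused features

rgb_f = torch.randn(2, 32, 500)
geo_f = torch.randn(2, 32, 500)
print(DenseFusionBlock()(rgb_f, geo_f).shape)     # torch.Size([2, 128, 500])
```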