1. Introduction
The development of driving assistance systems holds the possibility of reducing accidents, reducing environmental emissions, and easing the stress associated with driving
[1][2][3][4]. Several levels of automation have been proposed, based on their technology capacities and human interaction
[5][6]. The most widely known levels can be broken down into six groups
[7]. Beginning with Level 0 (Driver-Only Level), the complete control of the vehicle, including steering, braking, accelerating, and decelerating, is completely under the control of the driver
[8]. As the level increases from Level 1 to Level 3, the user interaction is reduced and the level of automation is increased
[8]. In contrast to earlier levels of autonomous driving, Level 4 (High Driving Automation) and Level 5 (Full Driving Automation) attain fully autonomous driving, where the vehicle can be operated without the need for any driving experience or even a driving licence
[9]. The difference between Level 4 and Level 5 is that autonomous vehicles categorized in Level 5 can drive entirely automatically in all driving domains and require no human input or interaction. Thus, Level 5 prototypes remove the steering wheel and the pedals; hence, the role of the driver is diminished to that of a mere passenger
[10]. Consequently, there is a constraint on any vehicle equipped with a driving assistance system in order to evolve into a practical reality. This vehicle must be equipped with a perception system that enables high levels of awareness and intelligence necessary to deal with stressful real-world situations, make wise choices, and always act in a manner that is secure and accountable
[5][11][12].
The perception system of autonomous cars performs a variety of tasks. The ultimate purpose of these tasks is to fully understand the vehicle’s immediate surroundings with low latency so that the vehicle can make decisions or interact with the real world. Object detection, tracking, and driving area analysis are examples of such activities. 3D object properties detection is a significant area of study in the autonomous vehicles perception system
[13]. Current 3D object detection techniques can be primarily classified into Lidar-based or vision-based techniques
[14]. Lidar-based 3D object detection techniques are accurate and efficient but expensive, which limits their use in industry
[15]. On the other hand, vision-based techniques
[16] can also be categorized into two distinct groups: monocular and binocular vision. Vision-based perception systems are widely used due to their low cost and rich features (critical semantic elements within the image that carry valuable information and are used to understand and process the visual data). The most significant downside of monocular vision is that it cannot determine depth directly from image information, which may cause errors in 3D pose estimation in monocular object detection. The cost of binocular vision is higher than that of monocular vision, although it can provide more accurate depth information than monocular vision. Moreover, binocular vision yields a narrower visual field range, which cannot meet the requirements of certain operating conditions
[17].
For the case of monocular vision systems, the camera projects a 3D point (defined in the 3D world’s coordinate frame of the object) into the 2D image coordinate frame. This is a forward mathematical operation, which removes the projected object’s depth information. Thus, the inverse projection of the point from the 2D image coordinate frame back into the 3D world’s coordinate frame of the object is not a trivial mathematical task
[18].
2. 3D Object Detection Methods
The various methods employed for 3D object detection can be classified into two distinct categories: conventional techniques or deep learning techniques. This section provides a comprehensive summary of the present 3D object detection methodologies as summarized in
Figure 1. The 3D-data-based techniques in both categories achieve superior detection results over 2D-data-based techniques. However, they come with additional cost related to the sensor setup, as mentioned earlier. On the other hand, 2D-data-based techniques are cost efficient, which comes with sacrificing the depth estimation accuracy
[15][17].
Figure 1. 3D object detection methodologies.
2.1. Conventional Techniques
Conventional techniques depend heavily on a prior knowledge of the object appearance to retrieve the essential suite of object features using hand-crafted feature descriptors, followed by the formation of a database by appropriately attributing features with the 3D models
[19][20][21]. In order to provide quick detection results, the model database is organized and searched using effective indexing algorithms such as hierarchical RANSAC search, and voting is performed by using ranked criteria
[22]. Other approaches depend on different hashing techniques like geometric hashing
[19] and elastic hash table
[23]. The conventional techniques can be categorized according to the input information into 3D-data-based techniques and 2D-data-based techniques
[24].
2.1.1. 3D-Data-Based Techniques
With the advancement of sensor technology, more devices capable of capturing 3D environmental data, such as depth cameras and 3D scanners
[25][26], are developed. When compared to 2D-data techniques, 3D-data techniques retain the object’s genuine physical characteristics, which make them a superior choice to quantify the 6D pose
[27]. The two primary categories of conventional 3D-data-based approaches are as follows.
Local-Descriptor-Based Approaches: These depend on the local descriptor and utilize an offline global descriptor applied to the model. This global descriptor needs to be rotation and translation invariant. Then, the local descriptor is subsequently generated online and matched with the global descriptor. Super Key 4-Points Congruent Sets (SK-4PCS)
[28] can be paired with invariant local properties of 3D shapes to optimize the quantity of data handled. The iterative Closest Point (ICP)
[29] approach is a traditional method that can compute the pose transformation between two sets of point clouds. Another approach
[30] establishes a correspondence between the online local descriptor and the saved model database by employing the enhanced Oriented Fast and Rotated BRIEF (ORB)
[31] feature and the RBRIEF descriptor
[32]. A different approach centered around semi-global descriptors
[33] can be used to assess the 6D pose of large-scale occluded objects.
Matching-Based Approaches: The objective of matching-based approaches is to find the template in the database that is almost identical to the input sample as well as retrieving its 6D pose. An innovative Multi-Task Template Matching (MTTM) framework was proposed in
[34] to increase matching reliability. It locates the closest template of a target object while estimating the masks of segmentation and the object’s pose transformation. In order to enhance the data storage footprint and the lookup searching time, fused Balanced Pose Tree (BPT) and PCOF-MOD (multimodal PCOF) under optimum storage space restructuring was proposed in
[35] in order to yield memory-efficient 6D pose estimation. A study on the control of micro-electro-mechanical system (MEMS) microassembly was conducted in
[36] to allow for an accurate control to enable minimizing 3D residuals.
2.1.2. 2D-Data-Based Techniques
To measure the 6D pose of objects, a variety of features that are represented in the 2D image can be used. These features may include texture features, geometric features, color features, and more
[24][37][38]. Speeded Up Robust Features (SURF)
[39] and Scale-Invariant Feature Transform (SIFT) features
[40] are considered the primary features that can be utilized for object pose estimation. Color features
[41], as well as geometric features
[42], can be utilized to enhance the pose estimation performance. The conventional 2D-data-based techniques can be further broken down according to the used matching template into CAD-model-based approaches and real-appearance-based approaches.
CAD-Model-Based Approaches: CAD-modelbbased approaches depend mainly on the rendered templates of CAD models. These approaches are suitable for industrial applications, because illumination and blurring have no effect on the rendering process
[24]. A Perspective Cumulated Orientation Feature (PCOF) based on orientation histograms was proposed in
[43] to estimate a robust object’s pose. The Fine Pose parts-based Model (FPM)
[44] was implemented to localize objects in a 2D image using the given CAD models. Moreover, the edge correspondences can be used to estimate the pose
[45]. The multi-view geometry can be adapted in order to extract the object’s geometric features
[46], while the epipolar geometry can be used to generate the transformation matrix
[47]. Some other approaches suggest visuals-based tracking like
[48], which performs tracking approach based on CAD model for micro-assembly and like
[49] which employs 6D posture estimation for end-effector tracking in a scanning electron microscope to enable higher-quality automated processes and accurate measurements.
Real-Appearance-Based Approaches: Despite the accuracy that can be achieved by employing 3D CAD models, reliable 3D CAD models are not always obtained. Under these conditions, real object images are used as the template. The 6D pose of an object can be measured depending on multi-cooperative logos
[50]. The Histogram Of Gradients (HOG)
[51] is an effective technique for improving the pose estimation performance. Another approach
[52] suggested template matching in order to provide stability against small image variations. As obtaining sufficient real image templates is laborious and time-consuming, and producing images via CAD model is becoming simpler, approaches that depend on CAD models are much more popular rather than approaches that depend on real appearance
[24].
The main downside of the conventional techniques is that they are very sensitive to noise and are not efficient in terms of computational burden and storage resources
[53]. The performance of the 2D-data-based techniques heavily depends on the total number of templates, which impacts the pose estimation accuracy, as these approaches depend on the object appearance experience. The greater the number of templates, the more accurate the posture assessment
[54][55]. Consequently, a large number of templates necessitates a significant amount of storage and search time
[56]. Also, it is not practical to obtain 360° features of the model
[57][58]. Because of how the 3D-data-based approaches function, the major limitation for these approaches is that they fail to function adequately whenever the object has a high level of reflected brightness
[24]. Another problem is that the efficiency of these approaches is somewhat limited, as point clouds and depth images involve an enormous quantity of data, leading to a large computational burden
[24].
2.2. Deep Learning-Based Techniques
Convolution Neural Networks (CNNs) are commonly used in 3D object detection methodologies based on deep learning strategies to retrieve a hierarchical suite of abstracted features from each object in order to record the object’s essential information
[59]. Unlike conventional 3D object detection techniques, which depend on hand-crafted feature descriptors, deep learning-based techniques utilize learnable weights that can be automatically tuned during the training phase
[60]. Thus, these techniques provide more robustness against environmental variations. The deep learning techniques can also be categorized into 3D- and 2D-data-based techniques, as shown in
Figure 1.
2.2.1. 3D-Data-Based Techniques
In recent years, LiDAR-based 3D detection
[61][62][63][64] has advanced rapidly. LiDAR sensors collect accurate 3D measurement data gathered from the surroundings in the pattern of 3D points (x, y, z), at which x, y, z are each point’s absolute 3D coordinates. LiDAR point cloud representations, by definition, necessitate an architecture that allows convolution procedures to be performed efficiently. Thus, deep learning-based techniques, which depend on LiDAR 3D detection, can be divided into Voxel-based approaches, Point-based approaches, Pillar-based approaches, and Frustum-based approaches
[65][66].
Voxel-Based Approaches: These partition point clouds into similarly sized 3D voxels. Following that, for every single voxel, feature extraction can be used to acquire features from a group of points. The aforementioned approach minimizes the overall size of the point cloud, conserving storage space
[66]. Voxel-based approaches, such as VoxelNet
[67] and SECOND
[68], augment the 2D image characterization into 3D space by splitting the 3D space into voxels. Other deep learning models like Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection
[69], Afdet
[70], Center-based 3D object detection and tracking
[71], and Psanet
[72] utilize the Voxel-based approach to perform 3D object detection.
Point-Based Approaches: These were introduced to deal with raw unstructured point clouds. Point-based approaches, such as PointNet
[73] and PointNet++
[74] issue raw point clouds as input and retrieve point-wise characteristics for 3D object detection using formations such as multi-layer perceptrons. Other models like PointRCNN
[64], PointRGCN
[75], RoarNet
[76], LaserNet
[77], and PointPaiting
[78] are examples of deep learning neural networks that adopt a Point-Based approach.
Pillar-Based Approaches: These structure the LiDAR 3D point cloud into vertical columns termed as pillars. Utilizing pillar-based approaches enables the tuning of the 3D point cloud arranging process in the x–y plane, removing the z coordinate, as demonstrated in PointPillars
[79].
Frustum-Based Approaches: These partition the LiDAR 3D point clouds into frustums. The 3D object detector models that crop point cloud regions recognized by an RGB image object detector include SIFRNet
[80], Frustum ConvNet
[81], and Frustum PointNet
[82].
2.2.2. 2D-Data-Based Techniques
RGB images are the main input for deep learning-based techniques that rely on 2D data. Despite the outstanding performance of 2D object detection networks, generating 3D bounding boxes merely from the 2D image plane is a substantially more complicated challenge owing to the lack of absolute depth information
[83]. Thus, the 2D-data-based techniques for 3D object detection can be categorized according to the adopted method to generate the depth information into Stereographically Based Approaches, Depth-Aided Approaches, and Single-Image-Based Approaches.
Stereographically Based Approaches: Strategies in
[84][85][86][87][88] process the pair of stereo images using a Siamese network and create a 3D cost volume to determine the matching cost for stereo matching using neural networks. The preliminary work 3DOP
[89] creates 3D propositions by tinkering with a wide range of extracted features such as stereo reconstruction, and object size previous convictions. MVS-Machine
[90] considers differentiable projection and reprojection for improved 3D volume construction handling from multi-view images.
Depth-Aided Approaches: Due to the missing depth information in single monocular image input, several researchers tried to make use of the progress in depth estimation neural networks. Previous research works
[91][92][93] convert images into pseudo-LiDAR perceptions by harnessing off-the-shelf depth map predictors and calibration parameters. Then, they deploy established LiDAR-based 3D detection methods to output 3D bounding boxes, leading to lesser effectiveness. D4LCN
[94] and DDMP-3D
[95] emphasize a fusion-based strategy between image and estimated depth using cleverly engineered deep CNNs. However, most of the aforementioned methods that utilize off-the-shelf depth estimators directly pay substantial computational costs and achieve only limited improvement due to inaccuracy in the estimated depth map
[96].
Single-Image-Based Approaches: Recently, several research works
[13][97][98][99][100] employed only a monocular RGB image as input to 3D object detection. PGD-FCOS3D
[101] establishes geometric correlation graphs between detected objects then utilizes the constructed graphs to improve the accuracy of the depth estimation. Some research works like RTM3D
[102], SMOKE
[103], and KM3D-Net
[104] anticipate key points of the 3D bounding box as an adjacent procedure for establishing spatial information of the observed object. Decoupled-3D
[105] presents an innovative framework for the decomposition of the detection problem into two tasks: a structured polygon prediction task and a depth recovery task. QD-3DT
[106] presents a system for tracking moving objects over time and estimating their full 3D bounding box from an ongoing series of 2D monocular images. The association stage of the tracked objects depends on quasi-dense similarity learning to identify objects of interest in different positions and views based on visual features. MonoCInIS
[107] presents a method that utilizes instance segmentation in order to estimate the object’s pose. The proposed method is camera-independent in order to account for variable camera perspectives. MonoPair
[108] investigates spatial pair-wise interactions among objects to enhance detection capability. Many recent research works depend on a prior 2D object detection stage. Deep3Dbox
[109] suggests an innovative way to predict orientation and dimensions. M3D-RPN
[110] considers a depth-aware convolution to anticipate 3D objects and produces 3D object properties with 2D detection requirements. GS3D
[111] extends Deep3Dbox with a feature extraction module for visible surfaces. Due to the total loss of depth information and the necessity for a vast search space, it is not trivial to estimate the object’s spatial position immediately
[17]. As a result, PoseCNN
[112] recognizes an object’s position in a 2D image while simultaneously predicting its depth to determine its 3D position. Because rotation space is nonlinear, it is challenging to estimate 3D rotation directly with PoseCNN
[17].
Although deep learning methods that rely on a single RGB image provide some appropriate projection-based constraints, they fail to achieve promising results due to an absence of depth spatial data
[91][113].