An autonomous driving system consists of perception, planning, decision, and control subsystems, as illustrated in Figure 1. The perception subsystem is the basis for the other subsystems. It takes data captured by different sensors as input to estimate the vehicle's position and the location, size, and direction of surrounding objects. Autonomous vehicles are often equipped with a variety of sensors, including LiDARs, cameras, millimeter-wave radars, and GPS, as illustrated in Figure 2.
A perception subsystem needs to be accurate and robust to ensure safe driving. It is composed of several important modules, such as object detection, tracking, and Simultaneous Localization and Mapping (SLAM). Object detection is a fundamental ability that aims to detect all objects of interest and obtain their locations and categories from captured data, such as images or point clouds. Images are captured by cameras and provide rich texture information. Cameras are cheap but cannot provide accurate depth information, and they are sensitive to changes in illumination and weather, such as low luminosity at night, extreme brightness disparity when entering or leaving tunnels, and rainy or snowy weather. Point clouds are captured by LiDARs and provide accurate 3D spatial information. They are robust to weather and extreme lighting conditions, but they are sparse and non-uniformly distributed in space. In addition, LiDARs are expensive sensors. Therefore, considering the complementary characteristics of point clouds and images, both cameras and LiDARs are used as indispensable sensors to ensure the driving safety of intelligent vehicles.
Notably, failure to detect objects might lead to safety-related incidents. For example, failing to detect a leading vehicle may result in a traffic accident that threatens human lives. To avoid collisions with surrounding vehicles and pedestrians, object detection is an essential technique for analyzing perceived images and point clouds, in which objects must be both identified and localized. The general framework is illustrated in Figure 3. With the development of deep learning, 2D object detection has become an extensively researched topic in computer vision. CNN-based 2D object detectors achieve excellent performance on some public datasets. However, 2D object detection only provides 2D bounding boxes and cannot provide the depth information of objects, which is crucial for safe driving. Compared with 2D object detection, 3D object detection provides more spatial information, such as location, direction, and object size, which makes it more significant in autonomous driving. 3D detection needs to estimate more parameters for the oriented 3D boxes of objects, such as the 3D coordinates of the center, the length, width, and height, and the deflection angle of a bounding box. In addition, 3D object detection still faces difficult problems, including complex interactions between objects, occlusion, changes in perspective and scale, and the limited information provided by 3D data.
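The oriented 3D box parameters listed above (center, length, width, height, and deflection angle) can be made concrete by expanding a box into its eight corners. The following is a minimal sketch, assuming a z-up coordinate frame, a center located at the geometric middle of the box, and a deflection (yaw) angle measured about the vertical axis; the function name `box_corners` and the example values are hypothetical, and individual datasets may use different conventions.

```python
import math

def box_corners(cx, cy, cz, length, width, height, yaw):
    """Return the 8 corners of a 3D box rotated by `yaw` about the z axis.

    Assumes a z-up frame with (cx, cy, cz) at the geometric center of the box.
    """
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-0.5, 0.5):
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                # Corner offset expressed in the box's own frame
                x, y, z = dx * length, dy * width, dz * height
                # Rotate about z, then translate to the box center
                corners.append((
                    cx + x * cos_y - y * sin_y,
                    cy + x * sin_y + y * cos_y,
                    cz + z,
                ))
    return corners

# Hypothetical car-sized box, turned 90 degrees so its length lies along y
corners = box_corners(10.0, 2.0, 0.8, 4.0, 1.8, 1.6, math.pi / 2)
```

Seven numbers fully determine the box in this parameterization, compared with four for an axis-aligned 2D bounding box.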
In this paper, we present a review of 3D object detection methods to summarize the development and challenges of 3D object detection, and we analyze the potential advantages and limitations of these methods. The existing 3D object detection methods are divided into image-based methods, point cloud-based methods, and multimodal fusion-based methods. A general framework of the existing object detection methods is shown in Figure 3. The categories and their limitations are briefly described in Table 1.
| Category | Subcategory | Description | Limitations |
|---|---|---|---|
| Image | | Apply images to predict the bounding boxes of 3D objects. 2D bounding boxes are predicted and then extrapolated to 3D by reprojection constraints or a regression model. | Depth information is deficient and the accuracy of the detection results is low. |
| Point cloud | Projection | Project a point cloud onto a 2D plane and use 2D detection frameworks to regress 3D bounding boxes on the projected images. | Information is lost in the process of projection. |
| Point cloud | Volumetric | Voxelize the point cloud into 3D voxels and apply convolutional operations on the voxels to generate a representation for predicting the 3D bounding boxes of objects. | Expensive 3D convolutional operations increase inference time. The computation is heavy. |
| Point cloud | PointNet | Apply the raw point cloud to predict the 3D bounding boxes of objects directly. | The large scale of the point cloud increases running time. It is difficult to generate region proposals. |
| Multi-sensor fusion | | Fuse images and point clouds to predict 3D bounding boxes. The two modalities complement each other, which makes detection robust. | Fusion methods are computationally expensive and are not mature enough. |
RGB-D images provide depth information and are used in some works. For example, Chen et al. use the poses of 3D bounding boxes to establish an energy function and train a structured SVM to minimize it. In DSS, a multi-scale 3D RPN is used to propose objects on stereo images, which allows objects of different sizes to be detected. Deng et al. use a 2.5D method for object detection: they build a model to detect 2D objects and then convert the 2D targets to 3D space to realize 3D object detection. Because processing RGB-D images is computationally expensive, monocular images are also used for 3D object detection.
In the early days, Chen et al. propose Mono3D, which uses monocular images to generate 3D candidates and then scores each candidate through an energy model based on semantics, context information, hand-designed shape features, and location priors, as illustrated in Figure 4. Based on these candidates, Fast RCNN is used to further refine the 3D bounding boxes by location regression. The network improves detection performance, but it depends on the object classes and needs a large number of candidates to achieve high recall, which increases the computational cost. To overcome this limitation, Pham and Jeon propose the DeepStereoOP architecture, a class-independent algorithm that exploits not only RGB images but also depth information.
Hu et al. propose a multi-task framework to associate detections of moving objects over time and estimate 3D bounding box information from a sequence of images. They leverage 3D box depth-ordering matching for robust instance association and use 3D trajectory prediction for the identification of occluded vehicles. Building on the benefits of multi-task learning, Center3D is proposed to efficiently estimate the 3D location and depth of objects using only monocular images. It is an extension of CenterNet.
In recent years, 3D object detection from a 2D perspective has attracted the attention of many researchers. Lahoud and Ghanem propose a 2D-driven 3D object detection method to reduce the search space for 3D objects. They use hand-crafted features to train a multi-layer perceptron network to predict 3D boxes. Later, they extend the work and propose a multimodal region proposal network to generate region proposals, which uses extended 2D boxes to generate 3D boxes. MonoDIS leverages a novel disentangling transformation for the 2D and 3D detection losses and a self-supervised confidence score for 3D bounding boxes.
3D processing directly uses the raw point cloud as the network input to extract suitable point cloud features. For example, 3D FCN and Vote3Deep directly use a 3D convolutional network to detect the 3D bounding boxes of objects. However, the point cloud is sparse and the computation of a 3D CNN is expensive. Additionally, limited by its receptive field, a traditional 3D convolutional network cannot effectively learn local features at different scales. To learn a more effective spatial geometric representation from the point cloud, some specific network frameworks have been proposed, such as PointNet, PointNet++, PointCNN, Dynamic Graph CNN, and Point-GNN. PointNets can directly process LiDAR point clouds and extract point cloud features through a symmetric max-pooling function, which handles the unordered nature of the points. The network architecture of PointNet is illustrated in Figure 5. Thanks to these networks, the performance of 3D object detection is improved, but the computation of point-based methods is expensive, especially when large scenes are captured with a Velodyne HDL-64E LiDAR and one scan contains more than 100K points. Therefore, some preprocessing operations, such as downsampling, need to be conducted.
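The role of the symmetric max-pooling function can be illustrated with a toy example. The sketch below is not the PointNet architecture itself: it assumes a single shared linear layer with ReLU in place of the per-point MLP, uses random (untrained) weights, and all names (`shared_mlp`, `global_feature`) are hypothetical. It only shows why a channel-wise maximum over per-point features does not depend on the order of the points.

```python
import random

def shared_mlp(point, weights):
    """Apply the same linear layer + ReLU to a single (x, y, z) point."""
    return [max(0.0, sum(w * p for w, p in zip(row, point))) for row in weights]

def global_feature(points, weights):
    """Max-pool per-point features channel-wise; the result ignores point order."""
    per_point = [shared_mlp(p, weights) for p in points]
    return [max(channel) for channel in zip(*per_point)]

rng = random.Random(0)
# Hypothetical untrained weights mapping 3 input coordinates to 4 channels
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
points = [(rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(-1, 1))
          for _ in range(16)]

# Feeding the same points in a different order yields the same global feature
shuffled = points[:]
rng.shuffle(shuffled)
same = global_feature(points, weights) == global_feature(shuffled, weights)
```

Because every point passes through the same function and the maximum is taken per channel, `same` evaluates to `True` for any permutation of the input.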
After these point cloud feature learning models were proposed, PointRCNN constructs a PointNet++-based architecture to detect 3D objects, which is illustrated in simplified form in Figure 6. After the bottom-up 3D RPN, a subnetwork transforms the proposal information into canonical coordinates to learn better local spatial features. Combining these with the global semantic features of each point improves the accuracy of the detected bounding boxes. Similarly, Yang et al. add a proposal generation module based on spherical anchors, which uses PointNet++ as the backbone network to extract semantic context features for each point. In addition, in the second stage of box prediction, an IoU estimation branch is added for postprocessing, which further improves the accuracy of object detection.
Benefiting from multi-task learning, LiDARMTL utilizes an encoder-decoder architecture to predict perception parameters for 3D object detection and road understanding, which can be leveraged for online localization. Although the localization accuracy of objects is improved compared with previous methods, the computational burden is heavy due to the large scale of the point cloud. To deal with this drawback, AFDet adopts an anchor-free and Non-Maximum Suppression-free single-stage framework to detect objects, which is advantageous for embedded systems.
With the help of point cloud data, the performance of 3D object detection is significantly improved. In general, the accuracy of 3D bounding boxes from image-based methods is much lower than that of point cloud-based methods. Currently, LiDAR point cloud-based 3D object detection has become a main trend, but point clouds cannot provide the texture information needed to efficiently discriminate between categories of objects. Moreover, the density of points decreases as the distance between the object and the LiDAR increases, which degrades the performance of detectors, while images can still capture faraway objects. Therefore, multi-sensor fusion-based methods are proposed to improve the overall performance.
Considering the advantages and disadvantages of image-based and point cloud-based methods, some methods try to fuse both modalities with different strategies. A common approach first applies a projection transformation to the LiDAR point cloud and then integrates the multi-view projected planes with the image through different feature fusion schemes, as in MV3D and AVOD. There are three fusion schemes, namely early fusion, late fusion, and deep fusion, which are illustrated in Figure 7. MV3D aggregates features by using a deep fusion scheme, where feature maps can hierarchically interact with one another. AVOD is the first approach to introduce early fusion. The features of each modality's proposal are merged, followed by an FC layer that outputs the category and the coordinates of the 3D box for each proposal. These methods lose spatial information in the projection transformation, and their detection performance on small targets is poor. In addition, ROI feature fusion only uses high-level features, and the sparsity of the LiDAR point cloud limits fusion-based methods.
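The projection transformation, and the information loss it causes, can be sketched with a simple bird's-eye-view (BEV) height map. This is an illustrative assumption rather than the exact encoding used by MV3D or AVOD: the grid extent, the cell size, the choice of keeping only the maximum height per cell, and the function name `bev_height_map` are all hypothetical.

```python
def bev_height_map(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0), cell=1.0):
    """Rasterize (x, y, z) points into a grid storing the maximum height per cell."""
    cols = int((x_range[1] - x_range[0]) / cell)
    rows = int((y_range[1] - y_range[0]) / cell)
    grid = [[None] * cols for _ in range(rows)]
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # drop points outside the region of interest
        c = int((x - x_range[0]) / cell)
        r = int((y - y_range[0]) / cell)
        # Only the highest point in a cell survives; the others are discarded
        if grid[r][c] is None or z > grid[r][c]:
            grid[r][c] = z
    return grid

# Hypothetical points: the first two fall into the same cell, the last is out of range
points = [(0.5, -1.5, 0.2), (0.7, -1.2, 1.4), (3.9, 1.1, 0.3), (9.0, 0.0, 1.0)]
bev = bev_height_map(points)
occupied = sum(1 for row in bev for value in row if value is not None)
```

Four input points end up as two occupied cells: one point is outside the grid, and two points collapse into a single height value, which is a small-scale version of the spatial information lost by projection.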
To address the problem of information loss, 3D-CVF combines the features of the camera and LiDAR by using a cross-view spatial feature fusion strategy. Auto-calibrated projection is applied to transform the image features into a smooth spatial feature map with the highest correspondence to the LiDAR features in the BEV domain. A gated feature fusion network is then used to mix the features appropriately. Additionally, fusion methods based on the BEV or voxel format are not accurate enough. Thus, PI-RCNN proposes a novel fusion method, the Point-based Attentive Cont-conv Fusion module, to fuse multi-sensor features directly on 3D points. In addition to continuous convolution, Point-Pooling and Attentive Aggregation are used to fuse features expressively.
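The idea of gating can be illustrated with a generic element-wise scheme. The sketch below is not the 3D-CVF network: it assumes two already aligned feature vectors, replaces the learned convolution over the concatenated feature maps with fixed hypothetical weights (`w_cam`, `w_lidar`), and simply returns the gated features side by side.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_fusion(cam_feat, lidar_feat, w_cam=(1.0, -1.0), w_lidar=(-1.0, 1.0)):
    """Fuse two aligned feature vectors with element-wise sigmoid gates.

    Each gate sees both modalities (a stand-in for a learned convolution over
    the concatenated features) and scales the feature of one modality.
    """
    fused = []
    for c, l in zip(cam_feat, lidar_feat):
        g_cam = sigmoid(w_cam[0] * c + w_cam[1] * l)
        g_lidar = sigmoid(w_lidar[0] * c + w_lidar[1] * l)
        fused.append((g_cam * c, g_lidar * l))  # gated features kept side by side
    return fused

# With equal inputs both gates are 0.5, so each feature is halved
balanced = gated_fusion([2.0], [2.0])
# With a strong camera response and no LiDAR response, the camera gate opens
camera_dominant = gated_fusion([3.0], [0.0])
```

In a trained network the gate weights would be learned, so that each modality is attenuated where it is unreliable.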
In 3D object detection, inconsistency between the localization confidence and the classification confidence is a critical issue. To solve this problem, EPNet uses a consistency enforcing loss to increase the consistency between localization and classification. Moreover, the point features are enhanced with semantic image features in a point-wise manner without image annotations.
Besides the fusion of camera and LiDAR, radar data are also used for 3D object detection. CenterFusion first associates radar detections with their corresponding objects in 3D space. Then, these radar detections are mapped onto the image plane to complement the image features in a middle-fusion approach.
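Mapping 3D detections onto the image plane is commonly done with a pinhole camera model. The sketch below is a simplified illustration and not CenterFusion's implementation: it assumes the points are already expressed in the camera frame (x right, y down, z forward), ignores lens distortion, and uses hypothetical intrinsic parameters (`fx`, `fy`, `cx`, `cy`).

```python
def project_to_image(point_cam, fx=700.0, fy=700.0, cx=640.0, cy=360.0):
    """Project a 3D point in the camera frame (x right, y down, z forward) to pixels."""
    x, y, z = point_cam
    if z <= 0:
        return None  # behind the camera, so it cannot appear in the image
    return (fx * x / z + cx, fy * y / z + cy)

# Hypothetical radar detections, already transformed into the camera frame (meters)
detections = [(2.0, 0.5, 20.0), (-1.0, 0.0, 10.0), (0.0, 0.0, -5.0)]
pixels = [project_to_image(p) for p in detections]
```

In practice, an extrinsic calibration between the radar and the camera is needed first to bring the detections into the camera frame.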