Self-Attention-Based 3D Object Detection for Autonomous Driving

Autonomous vehicles (AVs) play a crucial role in enhancing urban mobility within the context of a smarter and more connected urban environment. Three-dimensional object detection in AVs is an essential task for comprehending the driving environment to contribute to their safe use in urban environments. 

  • smart cities
  • 3D object detection
  • semantic feature learning
  • self-attention

1. Introduction

Smart sustainable cities use ICT for efficient operations, information sharing, better government services, and citizen well-being, prioritizing technological efficiency and availability to improve urban life [1][2][3][4]. Autonomous vehicles offer immersive user experiences, shaping future human–machine interactions in smart cities [5][6]. Mobility as a service is set to transform urban mobility in terms of sustainability [7]. Cities seek smart mobility solutions to address transport issues [8]. The perceived benefits of AVs drive their adoption and help mitigate safety concerns. AVs promise improved traffic flow, enhanced public transport, safer streets, and a better quality of life in eco-conscious digital cities [9].
At the core of AV technology lies 3D object detection, a fundamental capability enabling AVs to perceive their surroundings in three dimensions. This 3D object detection is vital for safe autonomous vehicle navigation in smart cities [10][11]. It identifies and comprehends surrounding objects in 3D, enabling obstacle avoidance, path planning, and collision prevention [12]. Advancements in this technology enhance urban life through improved autonomous vehicle perception [13][14]. Autonomous vehicles are equipped with various sensors, including cameras, LiDAR (light detection and ranging), radar, and sometimes ultrasonic sensors. These sensors capture data about the surrounding environment [15].
Recent advancements in autonomous driving technology have significantly propelled the development of sustainable smart cities [16][17][18]. Notably, 3D object detection has emerged as a pivotal element within autonomous vehicles, forming the basis for efficient planning and control in alignment with smart city principles of optimization and enhancing citizens’ quality of life, particularly in ensuring the safe navigation of autonomous vehicles (AVs) [19][20][21]. LiDAR, an active sensor that scans the environment with laser beams, is extensively integrated into AVs to provide 3D perception in urban environments. Various autonomous driving datasets, such as KITTI, have been developed to enable mass mobility in smart cities [22][23]. Although 3D LiDAR point clouds are rich in depth and spatial information and less susceptible to lighting variations, they are irregular and sparse, particularly at longer distances, which can jeopardize the safety of pedestrians and cyclists. Traditional methods for learning point cloud features struggle to capture the geometrical characteristics of smaller and distant objects [24][25].
To overcome these geometric challenges and enable deep neural networks (DNNs) to process 3D smart city datasets for safe autonomous vehicle (AV) navigation, custom discretization or voxelization techniques are employed [26][27][28][29][30][31][32][33][34]. These methods convert 3D point clouds into voxel representations, enabling the application of 2D or 3D convolutions. However, they may discard geometric detail and suffer from quantization loss and computational bottlenecks, posing sustainability challenges for AVs in smart cities. Region proposal network (RPN) backbones exhibit high accuracy and recall but struggle with average precision (AP), particularly for distant or smaller objects. Poor AP hinders AV integration in sustainable smart cities because it directly affects object detection at varying distances [35][36].

2. Sustainable Transportation and Urban Planning

Sustainability has become a paramount concern across industries, with particular focus on the transportation sector. Numerous studies have addressed the implications of autonomous vehicles (AVs) and their potential to revolutionize urban living in smart cities [1][2][3][4][5][6][7][8][9][11]. Shi et al. [2] introduced a semantic understanding framework that enhances detection accuracy and scene comprehension in smart cities. Yigitcanlar et al. [6] highlighted the need for urban planners and managers to formulate AV strategies for addressing the challenges of vehicle automation in urban areas. Manfreda et al. [8] emphasized that the perceived benefits of AVs play a significant role in their adoption, especially where safety is concerned. Campisi et al. [9] discussed the potential of the AV revolution to improve traffic flow, enhance public transport, optimize urban space, and increase safety for pedestrians and cyclists, ultimately enhancing the quality of life in cities. Duarte et al. [10] explored the impact of AVs on road infrastructure and how they could reshape urban living and city planning, akin to the transformative shift brought about by the automobile in the past. Heinrichs et al. [11] delved into the unique characteristics and prospective applications of autonomous transportation, which has the potential to influence land use and urban planning in distinct ways. Stead et al. [18] conducted scenario studies to analyze the intricate effects of AVs on urban structure, including factors such as population density, functional diversity, urban layout, and accessibility to public transit. Li et al. [26] proposed a deep learning method combining LiDAR and camera data for precise object detection, while Seuwou et al. [37] examined smart mobility initiatives and challenges within smart cities, emphasizing the significance of connected and autonomous vehicles (CAVs) in sustainable intelligent transportation systems. Xu et al. [38] introduced a fusion strategy utilizing LiDAR, cameras, and radar to enhance object detection in dense urban areas. These studies collectively underscore the importance of developing 3D object detection methods to ensure safe and efficient transportation systems in smart cities, addressing critical sustainability challenges.

3. Point Cloud Representations for 3D Object Detection

LiDAR is vital for AVs, generating unstructured, unordered, and irregular point clouds. Processing these raw points conventionally is challenging. Numerous 3D object detection methods have emerged in recent years [2][26][27][28][29][30][31][33][34][39][40][41][42][43][44]. These methods are categorized based on their approach to handling 3D LiDAR point cloud input.

3.1. Voxel-Based Methods

Studies have aimed to convert irregular point clouds into regular voxel grids and use CNNs to learn geometric patterns [25][30][34][39]. Early research used high-density voxelization and CNNs for voxel data analysis [26][43][44]. Yan et al. introduced the SECOND architecture, which improves memory and computational efficiency through 3D sub-manifold sparse convolution [34]. PointPillars simplified the voxel representation to vertical pillars [39]. Existing single-stage and two-stage detectors often lack accuracy, especially for small objects [29][32]. ImVoxelNet by Rukhovich et al. projects image features into a voxel grid, which increases memory and computational costs [25]. Zhou et al. transformed point clouds into regularly arranged 3D voxels and applied a 3D CNN for object detection [30]. Noh et al. integrated voxel-based and point-based features for efficient single-stage 3D object detection [43]. Shi et al. proposed a voxel-based roadside LiDAR feature encoding module that voxelizes raw point clouds and projects them into a bird’s-eye view (BEV) for dense feature representation with reduced computational overhead [2]. Voxel-based approaches offer reasonable 3D object detection performance and efficiency but may suffer from quantization loss and structural complexity, making it difficult to choose a resolution that preserves local geometry and context.
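As a concrete illustration of the voxelization step these methods share, the following minimal Python sketch bins a raw point cloud into a regular grid; the voxel size, point cap, and function name are illustrative assumptions, not taken from any cited implementation.

    # Minimal voxelization sketch: bin raw points into a regular 3D grid and
    # keep up to `max_pts` points per occupied voxel for later feature encoding.
    # Voxel size, point cap, and function name are illustrative assumptions.
    import numpy as np

    def voxelize(points, voxel_size=(0.2, 0.2, 0.4), max_pts=32):
        """points: (N, 3) array of x, y, z coordinates."""
        coords = np.floor(points / np.asarray(voxel_size)).astype(np.int32)
        voxels = {}
        for pt, c in zip(points, map(tuple, coords)):
            bucket = voxels.setdefault(c, [])
            if len(bucket) < max_pts:      # cap the points stored per voxel
                bucket.append(pt)
        return voxels                      # {voxel index: list of points}

    pts = np.random.uniform(0, 10, size=(1000, 3))   # toy point cloud
    print(len(voxelize(pts)), "occupied voxels")

In practice most voxels are empty, which is why sparse convolutions such as those in SECOND [34] operate only on the occupied ones rather than on the full dense grid.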

3.2. Point-Based Methods

Unlike voxel-based methods, point-based methods detect 3D objects by directly learning unstructured geometry from raw point clouds [28][42]. To deal with the unordered nature of 3D point clouds, point-based methods incorporate PointNet [41] and its variants [29][45] to aggregate point-wise features using symmetric functions. Shi et al. [29] presented PointRCNN, a two-stage, region-proposal-based 3D object detection framework: it first generates object proposals from segmented foreground points and then exploits local spatial and semantic features to regress high-quality 3D bounding boxes.
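To make the symmetric-function idea concrete, here is a minimal PointNet-style sketch in PyTorch: a shared per-point MLP followed by order-invariant max pooling. The layer widths and class name are illustrative assumptions, not PointNet’s actual architecture.

    # PointNet-style symmetric aggregation: a shared per-point MLP followed by
    # max pooling, which is invariant to the ordering of the input points.
    import torch
    import torch.nn as nn

    class PointFeatureAggregator(nn.Module):
        def __init__(self, in_dim=3, feat_dim=128):
            super().__init__()
            # The same MLP is applied to every point independently.
            self.mlp = nn.Sequential(
                nn.Linear(in_dim, 64), nn.ReLU(),
                nn.Linear(64, feat_dim), nn.ReLU(),
            )

        def forward(self, pts):              # pts: (B, N, 3)
            feats = self.mlp(pts)            # (B, N, feat_dim)
            # Max over the point axis is a symmetric function: permuting
            # the input points leaves the pooled feature unchanged.
            return feats.max(dim=1).values   # (B, feat_dim)

    pts = torch.rand(2, 1024, 3)
    print(PointFeatureAggregator()(pts).shape)   # torch.Size([2, 128])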
Qi et al. [46] proposed VoteNet, a single-stage, point-based 3D object detector that uses deep Hough voting to predict instance centroids. Yang et al. [47] proposed 3DSSD, a single-stage 3D object detection framework that adopts a fusion sampling strategy combining farthest point sampling (FPS) in Euclidean space with feature-space FPS. Point-GNN [48] is a generalized graph neural network for 3D object detection. Point-based methods are less resource-intensive than voxel-based methods; they are intuitive, require no extra pre-processing, and take raw point clouds directly as input. Their drawback is limited efficiency and insufficient feature-learning ability.
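Farthest point sampling, the downsampling primitive behind 3DSSD and related point-based detectors, can be sketched as follows; this is the textbook algorithm, not code from any cited work.

    # Farthest point sampling (FPS): iteratively pick the point farthest from
    # everything chosen so far, yielding k well-spread sample points.
    import numpy as np

    def farthest_point_sampling(points, k):
        """points: (N, 3); returns indices of k well-spread points."""
        n = points.shape[0]
        chosen = np.zeros(k, dtype=np.int64)   # chosen[0] = 0: arbitrary seed
        dist = np.full(n, np.inf)              # sq. distance to nearest chosen point
        for i in range(1, k):
            diff = points - points[chosen[i - 1]]
            dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
            chosen[i] = np.argmax(dist)        # farthest remaining point
        return chosen

    pts = np.random.rand(2048, 3)
    print(pts[farthest_point_sampling(pts, 512)].shape)   # (512, 3)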

3.3. Weak Semantic Information for 3D Object Detection

In autonomous driving, point cloud sampling often yields sparse coverage. For example, when aligning KITTI dataset color images with raw point clouds, only about 3% of pixels have corresponding points [49][50]. This extreme sparsity challenges high-level semantic perception from point clouds. Existing 3D object detection methods [29][30][31][33][34][39] typically extract local features from raw point clouds but struggle to capture comprehensive feature information and feature interactions. Sparse point cloud data, limitations in local feature extraction, and insufficient feature interactions lead to weak semantic information in 3D object detection models, notably affecting performance for distant and smaller objects.
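The coverage figure above can be made concrete with a small sketch that projects LiDAR points through a pinhole camera and counts the fraction of pixels hit at least once; the intrinsics, image size, and synthetic point cloud below are placeholders rather than real KITTI calibration data, so the printed coverage will not reproduce the ~3% figure.

    # Sketch of measuring point-cloud/image coverage via pinhole projection.
    # Intrinsics, image size, and the synthetic cloud are placeholders.
    import numpy as np

    H, W = 375, 1242                          # typical KITTI image resolution
    K = np.array([[720.0, 0.0, W / 2],        # placeholder intrinsic matrix
                  [0.0, 720.0, H / 2],
                  [0.0,   0.0,   1.0]])

    points = np.random.uniform([0, -20, -2], [70, 20, 2], size=(120_000, 3))
    cam = points[:, [1, 2, 0]] * [-1, -1, 1]  # toy LiDAR-to-camera axis swap
    cam = cam[cam[:, 2] > 0.5]                # keep points in front of camera
    uv = (K @ cam.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    covered = len(set(map(tuple, uv[valid])))
    print(f"pixel coverage: {covered / (H * W):.1%}")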
Both voxel-based [30][34][39] and point-based [29][41] methods face weak semantic information challenges in sparse point clouds. For example, Chen et al. [51] proposed focal sparse convolution with multi-modal expansion, at the cost of high computational complexity. Chen et al. [52] introduced a sparse activation map (SAM) for voxel-based techniques, and Sun et al. [53] developed range sparse net (RSN) for real-time 3D object detection from dense range images, though it suffers from spatial depth information issues. Ren et al. [54] introduced a sparse blocks network (SBNet) for voxel-based methods. Shi et al. [2] incorporated multi-head self-attention and deformable cross-attention to model interactions among vehicles. Existing methods focus on downstream tasks, under-utilize object feature information, and are often limited to either voxel-based or point-based models, reducing their generalizability.

4. Self-Attention Mechanism

The recent success of transformers in various computer vision domains [49][55] has led to a new paradigm in object detection. Transformers have proven highly effective at learning local context-aware representations. DETR [55] introduced this paradigm by treating object detection as a set prediction problem, employing a transformer with parallel decoding to detect objects in 2D images. The application of point transformers [49] in self-attention networks for 3D point cloud processing and object classification has recently gained attention. In particular, the point cloud transformer (PCT) framework [21] has been used to learn from point clouds and improve the embedded input representation; it incorporates essential functionalities such as farthest-point sampling and nearest-neighbor search. In the context of 3D object detection, Bhattacharyya et al. [56] proposed two variants of self-attention for contextual modeling, augmenting convolutional features with self-attention features to enhance overall detection performance. Additionally, Mao et al. [57] introduced the voxel transformer (VoTr), a novel and effective voxel-based transformer backbone specifically designed for point cloud 3D object detection. Shi et al. [2] employed multi-head self-attention and cross-attention to establish a dense feature representation through feature re-weighting.
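At the heart of all of these models is the same scaled dot-product self-attention operation; the following minimal sketch applies it to a set of point features. The dimensions and random weights are illustrative, not drawn from any cited architecture.

    # Scaled dot-product self-attention over a set of point features: every
    # point attends to every other point, producing context-aware features.
    import torch

    def self_attention(x, w_q, w_k, w_v):
        """x: (N, d) point features; w_*: (d, d) projection matrices."""
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (N, N)
        return attn @ v          # each point aggregates global context

    d = 64
    x = torch.randn(1024, d)                          # features of 1024 points
    w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
    print(self_attention(x, w_q, w_k, w_v).shape)     # torch.Size([1024, 64])

The (N, N) attention matrix makes the cost quadratic in the number of points, which is one motivation for restricting attention to local or sparse neighborhoods in backbones such as VoTr [57].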
Overall, these studies highlight the importance of 3D object detection techniques in enhancing the perception capabilities of autonomous vehicles and contribute to the development of safer and more efficient transportation systems in smart cities. 