In summary, S3DIS, ScanNet-V2, and SUN RGB-D are datasets that are particularly well-suited for recreating real spaces in the cloud due to their realistic indoor scene captures and extensive annotations. The ONCE dataset focuses on part-level co-segmentation, which can contribute to accurate 3D space recreation. However, datasets like ModelNet10/ModelNet40, ScanObjectNN, ShapeNet, and ShapeNetPart are better suited to tasks like object recognition, instance segmentation, and part-level semantic segmentation of 3D models rather than full-scale real-space recreation.
This part introduces and compares the imaging resolutions of three methods: LiDAR, photogrammetry, and structured light.
Structured light systems project known patterns onto a scene and use cameras to capture the resulting deformations. The imaging resolution depends on factors such as the number and complexity of the projected patterns, the camera sensor resolution, and the accuracy of calibration; higher-resolution cameras and more detailed patterns increase it. Structured light techniques can provide accurate and detailed point clouds with relatively good resolution, and they can capture color and texture information alongside geometric data. Their main limitations are that they require careful system setup and calibration, and that resolution and accuracy can be degraded by ambient lighting conditions and by reflective or glossy surfaces.
1.4. Point Cloud Transformation Algorithms
ICP is an iterative algorithm used for point cloud registration and alignment. Its computational time depends on the number of iterations required to converge and on the complexity of the distance calculations, and is typically O(N²), where N is the number of points. ICP can be time-consuming, especially for large point clouds, and may require an initial alignment estimate to converge. However, variants and optimizations are available, such as parallelization and approximate nearest-neighbor search, to improve efficiency.
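As a rough sketch (the function names are illustrative, not from any particular library), one ICP iteration can be written as nearest-neighbour matching followed by a closed-form rigid alignment via SVD; the brute-force matching is the O(N²) step noted above:

```python
import numpy as np

def icp_step(source, target):
    """One ICP iteration: match each source point to its nearest
    target point, then solve the rigid alignment in closed form."""
    # Brute-force nearest-neighbour search -- the O(N^2) part.
    d2 = ((source[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    matched = target[d2.argmin(axis=1)]
    # Kabsch: optimal rotation/translation between the matched sets.
    mu_s, mu_t = source.mean(0), matched.mean(0)
    U, _, Vt = np.linalg.svd((source - mu_s).T @ (matched - mu_t))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:           # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_t - R @ mu_s
    return source @ R.T + t

def icp(source, target, iters=30):
    for _ in range(iters):
        source = icp_step(source, target)
    return source
```

Replacing the brute-force search with an approximate nearest-neighbor structure (e.g., a kd-tree) is what makes the practical variants mentioned above faster.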
NDT is a technique used for point cloud registration that estimates a probability distribution over the point data. Its computational time depends on the voxel grid resolution and is typically O(N), where N is the number of points. NDT can be computationally efficient, especially for large point clouds, as it uses voxel grids to accelerate computations; however, higher grid resolutions increase memory requirements and may impact processing time.
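The step that gives NDT its efficiency is a single linear pass that summarizes each occupied voxel by a Gaussian (mean and covariance), which transformed points are later scored against. A minimal sketch, assuming a simple dictionary-of-voxels layout (names are illustrative):

```python
import numpy as np
from collections import defaultdict

def ndt_cells(points, voxel_size):
    """Summarize each occupied voxel by a Gaussian (mean, covariance):
    the statistics NDT later matches transformed points against."""
    buckets = defaultdict(list)
    for p in points:                            # one O(N) pass over the cloud
        buckets[tuple(np.floor(p / voxel_size).astype(int))].append(p)
    cells = {}
    for key, pts in buckets.items():
        pts = np.asarray(pts)
        if len(pts) >= 3:                       # need enough points for a covariance
            cells[key] = (pts.mean(axis=0), np.cov(pts.T))
    return cells
```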
MLS is a method used for point cloud smoothing and surface reconstruction. Its computational time depends on the radius used for local computations and the number of neighbors, and is typically O(N log N), where N is the number of points. MLS can be relatively efficient, especially with optimized data structures such as kd-trees for nearest-neighbor searches; however, larger radii and denser neighborhood computations can impact processing time.
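As an illustrative first-order approximation of MLS (not a full polynomial fit), each point can be projected onto the least-squares plane of its k nearest neighbours; the brute-force neighbour search used here is exactly what a kd-tree would accelerate:

```python
import numpy as np

def mls_smooth(points, k=8):
    """First-order MLS-style smoothing: project each point onto the
    total-least-squares plane of its k nearest neighbours."""
    smoothed = np.empty_like(points)
    for i, p in enumerate(points):
        nbrs = points[np.argsort(((points - p) ** 2).sum(1))[:k]]
        mu = nbrs.mean(axis=0)
        # Plane normal = eigenvector of the smallest covariance eigenvalue.
        _, evecs = np.linalg.eigh(np.cov((nbrs - mu).T))
        n = evecs[:, 0]
        smoothed[i] = p - np.dot(p - mu, n) * n   # project onto the local plane
    return smoothed
```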
Voxel grid downsampling is a technique used to reduce the density of a point cloud by grouping points within voxel volumes. Its computational time is typically O(N), where N is the number of points. Voxel grid downsampling is efficient because it relies on spatial partitioning, enabling faster processing of large point clouds; the processing time is influenced by the size of the voxel grid used.
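The O(N) behavior comes from assigning each point to a voxel once and accumulating centroids in a single pass. A minimal NumPy sketch (illustrative, not a library API):

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Replace all points that fall into the same voxel by their centroid."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True,
                                   return_counts=True)
    inverse = inverse.reshape(-1)     # flatten (shape differs across NumPy versions)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)  # accumulate per-voxel sums in one pass
    return sums / counts[:, None]
```

Larger voxels merge more points and produce a coarser cloud, which is the grid-size influence noted above.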
PCA is a statistical method used for feature extraction and dimensionality reduction in point clouds. Its computational time depends on the number of dimensions and the number of points, and is typically O(DN²), where D is the number of dimensions and N is the number of points. PCA can be computationally efficient for moderate-sized point clouds, but for high-dimensional data, dimensionality reduction techniques may be required to maintain efficiency.
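A minimal sketch of PCA on a 3D point cloud (names are illustrative): the principal axes are the eigenvectors of the 3x3 covariance matrix, sorted by eigenvalue.

```python
import numpy as np

def pca_axes(points):
    """Principal axes of a point cloud: eigen-decomposition of the
    3x3 covariance matrix, eigenvalues sorted in descending order."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    evals, evecs = np.linalg.eigh(cov)        # eigh returns ascending order
    order = evals.argsort()[::-1]
    return evals[order], evecs[:, order]
```

For a locally planar patch, the eigenvector of the smallest eigenvalue approximates the surface normal, a standard use of PCA in point cloud processing.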
2. The Representation of 3D Models
This method is the simplest way to show a 3D model. A 2D image carries less information than a 3D model, and it is easy for an observer to describe a 3D model from a single viewpoint. Therefore, a series of 2D captures from different viewpoints can be used to represent a 3D shape. Because one dimension is removed, this makes it relatively convenient and efficient to record a 3D shape while shrinking the size of the data.
A depth image stores, at each pixel, the distance between the camera and the scene. Depth images can be obtained from multi-view or stereo images, where a disparity map is calculated for each pixel; more commonly, such images are represented as RGB-D data. Because RGB-D data consist of a color image and a corresponding depth image, depth sensors such as the Kinect can obtain them easily. Since a depth image is captured from a single viewpoint, it only shows one side of the object and therefore cannot describe the shape entirely. Fortunately, thanks to huge advances in 2D processing, many 2D algorithms can use these data directly.
A point cloud is a group of unordered points in 3D space, represented by coordinates on the x, y, and z axes, from which a specific 3D shape can be formed. The coordinates of these points can be obtained from one or more views using a 3D scanner, such as the RGB-D cameras or LiDAR mentioned earlier. At the same time, RGB cameras can capture color information, which can be selectively superimposed on the point cloud as additional information to enrich its content. A point cloud is an unordered set, so it differs from an image, which is usually represented by a matrix. Therefore, a permutation-invariant method is crucial for processing such data, to ensure that the results do not change with the order of the points in the cloud.
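One common permutation-invariant construction, in the spirit of PointNet, applies the same transform to every point and aggregates with a symmetric function such as max-pooling. A toy NumPy sketch (the shared "MLP" is reduced to a single random linear layer here):

```python
import numpy as np

def global_feature(points, weights):
    """PointNet-style permutation-invariant feature: a shared per-point
    transform followed by a symmetric max-pool over all points."""
    per_point = np.maximum(points @ weights, 0.0)  # shared linear layer + ReLU
    return per_point.max(axis=0)                   # order-independent pooling
```

Because max-pooling is symmetric, shuffling the input points leaves the output unchanged.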
In a picture, pixels are the small squares that make up the image; each has a definite position and a specific color, and together they determine the appearance of the image. We can define an analogous concept, the "voxel", for 3D space, where a voxel representation provides information on a regular grid. Voxels can be obtained from point clouds through a voxelization process, in which all features of the 3D points within a voxel are grouped for subsequent processing. The structure of 3D voxels is similar to that of 2D images; convolution, for example, extends naturally: in 2D convolution the kernel slides over two dimensions, while in 3D convolution it slides over three. Since voxel grids contain a large number of empty volumes corresponding to the space around the object, the voxel representation is generally sparse. In addition, since most capture sensors can only collect information on the surface of an object, the interior of the object is also represented by empty volume.
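A minimal occupancy-grid voxelization sketch (names are illustrative): each point marks the voxel it falls into, and most of the grid remains empty, illustrating the sparsity mentioned above.

```python
import numpy as np

def voxelize(points, grid_size, voxel_size):
    """Binary occupancy grid: a voxel is True if it contains any point."""
    grid = np.zeros((grid_size,) * 3, dtype=bool)
    idx = np.floor(points / voxel_size).astype(int)
    idx = idx[((idx >= 0) & (idx < grid_size)).all(axis=1)]  # drop out-of-grid points
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid
```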
Unlike voxels, a mesh incorporates more elements: it is a collection of vertices, edges, and faces (polygons). Its basic components are polygons, planar shapes defined by connecting a set of 3D vertices. Point clouds, in contrast, provide only vertex locations; because meshes incorporate more elements, they can also contain information about the surface of an object. This way of representing 3D models is very common in computer graphics applications. Nonetheless, surface information is difficult to process directly with deep learning methods, so many techniques sample points from the surface to transform the mesh representation into a point cloud.
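Such surface sampling is commonly done by choosing triangles with probability proportional to their area and then drawing uniform barycentric coordinates. A hedged NumPy sketch (the function name is illustrative):

```python
import numpy as np

def sample_surface(vertices, faces, n, rng=None):
    """Sample n points uniformly from a triangle mesh: pick faces with
    probability proportional to area, then sample barycentric coordinates."""
    if rng is None:
        rng = np.random.default_rng()
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    f = rng.choice(len(faces), size=n, p=areas / areas.sum())
    u = np.sqrt(rng.random(n))     # sqrt warps to a uniform density in the triangle
    v = rng.random(n)
    return ((1 - u)[:, None] * v0[f]
            + (u * (1 - v))[:, None] * v1[f]
            + (u * v)[:, None] * v2[f])
```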
3. 3D Transformer
3.1. 3D Transformer Architecture
The archetypal Transformer model, which employs an encoder–decoder framework, is illustrated here, where the encoder represents the upper module, and the decoder, the lower. This section provides a comprehensive introduction to both the encoder and decoder.
The decoder follows a similar pattern, with each block adding a multi-head cross-attention sub-layer compared to its encoder counterpart: a decoder block thus contains a multi-head self-attention sub-layer, a multi-head cross-attention sub-layer, and a feed-forward network. The multi-head self-attention sub-layer captures the relationships between the decoder elements, while the multi-head cross-attention sub-layer uses the encoder output as its keys and values to attend to the encoded input. As in the encoder, a multilayer perceptron in the feed-forward network transforms the features of each input element, and a normalization operation and a residual connection follow each sub-layer, mirroring the encoder's structure.
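Both kinds of attention sub-layer are built from the same scaled dot-product attention. A single-head NumPy sketch (the multi-head split and the learned Q/K/V projections are omitted): with Q, K, and V all taken from the same sequence it acts as self-attention, while taking K and V from the encoder output gives cross-attention.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # rows sum to 1
    return w @ V
```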
The architecture of the model is illustrated in Figure 1.
Figure 1. The transformer model.
3.2. Classification of 3D Transformers
Point Transformer-based: Initially, it should be noted that points follow an irregular format, unlike the regular structure of voxels. Therefore, during point-to-voxel conversion, the constraints imposed by this regimented format inevitably cause some loss of geometric information. Conversely, since the point cloud is the rawest representation, formed by the aggregation of points, comprehensive geometric information is inherent to it. Consequently, the majority of Transformer-based point cloud processing frameworks fall under the point Transformer-based category.
Voxel Transformer-based: 3D point clouds are typically unstructured, in stark contrast to images. Conventional convolution operators therefore cannot process this kind of data directly. However, this challenge can be addressed by converting the 3D point cloud into 3D voxels, since the 3D voxel structure bears similarities to images.
4.1. 3D Object Detection
The objective of 3D object detection is to predict the rotated bounding box of a 3D object. Three-dimensional object detectors differ distinctly from 2D detectors. For instance, Vote3Deep leverages feature-centric voting to efficiently process sparse 3D point clouds on evenly spaced 3D voxels. VoxelNet uses PointNet within each voxel to produce a unified feature representation, which is then processed by a combination of 3D convolutions and 2D convolutions in the detection head. Building upon this, SECOND simplifies the VoxelNet pipeline and makes the 3D convolutions sparse.
In the realm of outdoor 3D object detection, 3D object detection plays a pivotal role in autonomous driving. The KITTI dataset 
is one of the most frequently used datasets in this field due to its precise and clear provision of 3D object detection annotations. The KITTI dataset encompasses 7518 test samples and 7481 training samples, with standard average precision being used for easy, medium, and hard difficulty levels. The KITTI dataset enables the use of either LiDAR or RGB as input, or both.
4.2. 3D Object Classification
Object classification in deep learning pertains to the identification of an object’s category or class present in data sources such as images, videos, or other types of data. This involves training a neural network model on a substantial dataset of labeled images, with each image being associated with a distinct object class. The trained model can subsequently be employed to predict the class of objects in novel, unseen images.
Point cloud classification strives to classify each point in the cloud into a predefined set of categories or classes. This task frequently arises in the fields of robotics, autonomous vehicles, and other computer vision applications where sensor data are represented in the form of point clouds. In order to classify a given 3D shape into a specific category, certain unique characteristics must be identified.
4.3. 3D Object Tracking
Three-dimensional object tracking in deep learning refers to the detection and tracking of the 3D position and movement of one or multiple objects within a scene over time. This process involves training a neural network model on an extensive dataset of labeled 3D objects or scenes, each annotated with its corresponding 3D position and movement over a period of time.
The purpose of 3D object tracking is to precisely track the movement of one or multiple objects in real-world environments, a crucial component in various computer vision applications, such as robotics, autonomous vehicles, and surveillance.
A deep learning model is trained on a large dataset of labeled 3D objects or scenes for 3D object tracking, with each object or scene annotated according to its respective 3D position and movement over time. The model learns to extract pertinent features from the 3D data and to track the object’s or objects’ movement in real time. During inference, the trained model is applied to new, unseen 3D data to track the object’s or objects’ movement in the scene over time. The model output comprises a set of 3D coordinates and trajectories, representing the movement of the object or objects in 3D space over time. Figure 2 illustrates various methods for 3D object tracking.
Figure 2. Various methods for 3D object tracking.
4.4. 3D Estimation
Three-dimensional pose estimation in deep learning pertains to estimating the 3D position and orientation of an object or scene from a 2D image or set of 2D images. This process involves training a neural network model on an extensive dataset of labeled images and their corresponding 3D poses, with each pose representing the position and orientation of the object or scene in 3D space.
Three-dimensional pose estimation aims to accurately estimate the 3D pose of an object or scene in real-world settings, a key aspect in various computer vision applications, such as robotics, augmented reality, and autonomous vehicles.
To perform 3D pose estimation, a deep learning model is trained on an extensive dataset of labeled images and their corresponding 3D poses. The model learns to extract pertinent features from the 2D images and to estimate the 3D pose of the object or scene. During inference, the trained model is applied to new, unseen images to estimate the 3D pose in real time. The model output comprises a set of 3D coordinates and orientations, representing the position and orientation of the object or scene in 3D space.
4.5. 3D Segmentation
Three-dimensional segmentation in deep learning involves dividing a 3D object or scene into meaningful parts or regions. This process necessitates training a neural network model on an extensive dataset of labeled 3D objects or scenes, with each object or scene segmented into its constituent parts or regions. The trained model can then predict segmentation labels for new, unseen 3D data.
In point cloud 3D segmentation, the goal is to partition a 3D point cloud into distinct regions based on their semantic meaning. This task is vital in robotics, autonomous vehicles, and other computer vision applications where sensor data are represented in the form of point clouds.
4.6. 3D Point Cloud Completion
Three-dimensional point cloud completion in deep learning pertains to reconstructing missing or incomplete 3D point cloud data. This process involves training a neural network model on a comprehensive dataset of incomplete point clouds, where each point cloud lacks some points or possesses incomplete information. The trained model can then generate complete point clouds from new, incomplete point cloud data.
The purpose of 3D point cloud completion is to recover the missing information within the point cloud and create a comprehensive 3D representation of the object or scene. This task holds significant importance in robotics, autonomous vehicles, and other computer vision applications where sensor data may be incomplete or noisy. For instance, point cloud completion can generate a comprehensive 3D map of a scene, even when some parts of the scene are obscured or missing due to sensor limitations.