Point Cloud Object Classifications

Point Cloud Object Classifications: Comparison

Please note this is a comparison between Version 1 by Haijian Lai and Version 2 by Peter Tang.

A point cloud is a set of individual data points in a three-dimensional (3D) space. Proper collection of these data points may create an identifiable 3D structure, map, or model.

point cloud classification
deep learning
lightweight model

1. Introduction

A point cloud is a set of individual data points in a three-dimensional (3D) space. Proper collection of these data points may create an identifiable 3D structure, map, or model. The effective analysis of point cloud data [1], such as semantic segmentation and object detection, is crucial for applications in 3D computer vision, including robotic sensing and autonomous driving. In general, the data points in point cloud contain geometric information, such as geometric coordinates, and other optional information (e.g., colors, normal vectors, etc.). However, the disordered and irregular nature of point cloud data makes analysis challenging. The recent advancement in deep learning has been phenomenal. Among them, the PointNet ^[2][3][3,4] was introduced as a solution to overcome the three non-trivial issues of permutation invariance, interactions between points, and transformation invariance. While newer designs, such as PointConv ^[4][5], KPConv ^[5][6], DeepGCNs ^[6][7], and PointTransformer ^[7][8], have shown better performance, they are expensive to run, thus limiting their practical uses. Hence, there is a need for lightweight and faster applications while still offering high precision and accuracy.

In 3D computer vision, methods of learning geometric features evolve from landmark MLP models (PointNet, PointNet++) to using convolutions, graphs, and transformer models. Though these newer designs obtain better performance, they mostly execute on expensive commodities, e.g., the high-end GPU card with a large amount of memory. It is atypical to deploy a heavy, slow, but high-performance application in 3D shape classification, object detection, etc. Instead, a lighter, faster, and relatively high-performance application is what users desire in general. In addition, limited improvement for some newer models can be observed through the experiments of benchmark datasets, such as ModelNet40 ^[8][9], ScanObjectNN ^[9][10], S3DIS ^[10][11], etc. Recently, many latest models such as PointNeXt ^[11][12], and PointMLP ^[12][13], begin to rethink network design and training strategies. They show that advanced extractors (MLPs, convolutions, graphs, attentions) may not be the main factors to improve overall performance. For instance, PointNeXt performed lots of experiments to assess different training strategies, and their results showed that training strategies are factors in raising the ranks in benchmark tests.

2. Deep Learning on Point Cloud

There are two main types of 3D data representations: point-cloud-based and non-point-cloud-based. For the point cloud data, PointNet is the baseline model for processing the point cloud data. Another possible choice is to convert the point cloud to voxel data ^[13][14] or multi-view images ^[14][15] or other methods; for example, implicit representations ^[15][16] could be used to enrich point clouds for better 3D analysis. Traditionally, voxel-based structures are used to study 3D representations, but they require extensive computations to carry out 3D convolutions or run neural networks on 2D multi-view images. On the other hand, multi-view images are a desirable option because they allow for the use of well-designed 2D neural networks to interpret 3D visual data. It is generally observed that state-of-the-art works can be broadly categorized into two types: those that use pre-training with point cloud data and those that do not. This trend is also reflected in the benchmarks for classification models. Therefore, classification models can be broadly categorized into two types: those that use pre-training methods with point cloud data and those that do not. Generally speaking, pre-training methods have an advantage in terms of performance, if we do not consider the voting strategy. The use of large models trained with multimodality, which connect computer vision and natural language processing, has allowed models to reach new heights. However, it is also possible to achieve very good performance by training models from scratch without using pre-training. In addition, models trained from scratch tend to have significantly fewer parameters than those that use pre-training methods. Here are some nice SOTA works using pre-training: RECON ^[16][17] unifies the data scaling capacity of contrastive models and the representation generalization of generative models to achieve a new state-of-the-art in 3D representation learning on ScanObjectNN. I2P-MAE ^[17][18] is an alternative method to obtain high-quality 3D representations from 2D pre-trained models via image-to-point masked autoencoders, which leverage well-learned 2D knowledge to guide 3D masked autoencoding. ULIP ^[18][19] is a framework that pre-trains 3D models on object triplets from images, texts, and 3D point clouds by leveraging a pre-trained vision-language model, achieving state-of-the-art performance on 3D classification tasks. These pre-training methods almost aim to improve the representation learning of point clouds for better performance on downstream tasks. The basic point cloud data structure contains points, and each point is like a pixel in a picture. PointNet ^[2][3] is one deep learning neural network model to process the point cloud data. Vanilla PointNet uses an MLP and max pooling to learn a feature. Apart from MLP, there are other approaches using attention (also known as transformers), convolutions, and graphs to produce good benchmark results. Unlike MLPs, these design models are usually complicated, and their approaches are tedious to work on. Hence, most accepted approaches use MLPs, such as the PointNeXt ^[11][12], PointMLP, and RepSurf ^[19][20], etc. PointNeXt reuses the architecture of PointNet++ ^[3][4] with some modifications and uses training strategies to achieve good performance. Then the PointMLP is totally a different architectural MLP design that is not based on the PointNet model, and RepSurf focuses mostly on the point cloud representation. Potentially, a good point cloud representation (triangular RepSurf and umbrella RepSurf) may lead a selected model (e.g., PointNet++) to achieve excellent performance which is comparable to that of the state-of-the-art design. Recent advancements in hyperbolic neural networks (HNN) ^[20][21] have led to several studies attempting to project data into the hyperbolic space to improve performance, such as HyCoRe ^[21][22], which proposes a new method of embedding point cloud classifier features into hyperbolic space with explicit regularization to capture the inherent compositional nature of 3D objects, achieving a significant improvement in the performance of supervised models for point cloud classification. In summary, there are various ways to process point cloud data using deep learning models, and the choice of model depends on the specific application and performance requirements. While PointNet is a popular choice for point cloud processing, other models such as PointNeXt, PointMLP, and RepSurf can also achieve excellent results with some modifications.

3. PointMLP

PointMLP ^[12][13] is an extension of the traditional MLP architecture that incorporates residual blocks for better performance on point cloud analysis tasks. The key idea behind PointMLP is that local geometric information may not be as important as previously thought and that simple yet effective methods such as the geometric affine sampling technique can achieve high accuracy without the need for complex geometric feature extraction. This approach not only improves robustness but also allows for faster computation and reduced model complexity. The use of residual blocks in PointMLP further enhances the network’s ability to learn complex representations by enabling it to capture both short-term and long-term dependencies. Additionally, the FPS sampling with the geometric affine method allows for better handling of variations in scale, rotation, and translation, making PointMLP suitable for a wide range of point cloud analysis tasks. Overall, The main contribution of PointMLP is its use of residual blocks in an MLP framework for point cloud analysis, challenging the notion that local geometric information is necessary for accurate analysis. Additionally, the use of the geometric affine sampling method enhances the robustness and accuracy of the model. PointMLP represents a promising advancement in the field of deep learning for point cloud analysis.

4. GhostMLP

GhostMLP is a novel neural network architecture that introduces a new residual MLP module that significantly reduces the number of parameters without compromising the classification performances. The GhostMLP is evaluated and validated with the two popular point cloud classification datasets: ScanObjectNN and ModelNet40. The results show that GhostMLP and its variant GhostMLP-S achieve competitive performance and fast inference compared with other state-of-the-art methods and their baseline. GhostMLP also could be to other tasks such as part segmentation on ShapeNet-Part and classification on mobile lidar scan (MLS) data, demonstrating its versatility and efficiency for different point cloud recognition applications.