Lightweight Convolutional Neural Network

Please note this is an old version of this entry, which may differ significantly from the current revision.

Subjects: Computer Science, Artificial Intelligence

Contributors: Yu-Ming Zhang, Chia-Yuan Cheng, Chih-Lung Lin, Chun-Chieh Lee, Kuo-Chin Fan

Biometrics has become an important research issue, and the use of deep learning neural networks has made it possible to develop more reliable and efficient recognition systems. Palms have been identified as one of the most promising candidates among various biometrics due to their unique features and easy accessibility.

multi-view projection
lightweight convolutional neural network
biometrics
palm recognition

1. Introduction

With the rapid development of the information age, biometrics has become increasingly familiar in recent years. Many daily activities require verification of personal identity, such as access control in sensitive places or epidemic control systems during the COVID-19 pandemic. It is therefore necessary to build an ID recognition model on a reliable token, such as an inherent part of the human body: the palm. Everyone has two palms, and each is unique. Palms carry rich features such as texture, finger length, and shape. Based on this intuition, a palm represented as a 3D point cloud preserves these biological characteristics well and can serve as the token for an ID recognition system. In addition to a reliable token, a robust model is a cornerstone of such a system. Today, many CNN-based models, such as VGG [1], ResNet [2], DenseNet [3], and EfficientNet [4], show extraordinary performance. Although 3D point clouds are a richer form of data, they also introduce additional complexity. For example, a 3D point cloud is unordered and lacks structure. In addition, the input of a 3D CNN is a sequence of points, and the elements, length, and order of that sequence can vary in many ways, which makes it far more complicated than a 2D image. To address this complexity, we propose the Multi-View Projection (MVP) method, which projects 3D palm data onto 2D images from several different views, much as humans observe their own palms. We then propose Tiny-MobileNet (TMBNet), which combines advanced feature fusion and extraction methods. Finally, our experiments show that this approach performs significantly better than 3D CNN baselines such as PointNet [5] and PointNet++ [6].
Overall, the proposed TMBNet with MVP efficiently addresses the challenges of using 3D palms as a reliable token for an ID recognition system. By projecting 3D palms onto 2D images and combining advanced feature fusion and extraction methods, it achieves better performance than the classic 3D models.

2. Overview for Palm Recognition

Previously, the common approach was to select critical features with Principal Component Analysis (PCA) and then classify them with a Support Vector Machine (SVM). Several works proposed variants of PCA: for example, ref. [7] combined Gabor wavelets with PCA to represent 2D palm images, and ref. [8] proposed QPCA, a multispectral version of PCA. Later, with the rapid development of deep learning, ref. [9] was the first to use AlexNet to identify palms. Some researchers focused on new loss functions [10,11], which improved CNN performance at the time, and other studies [12,13] presented synthesized algorithms that combine palm data with other prior knowledge.

3. Overview of 3D Convolutional Neural Networks

Today, some palm recognition benchmarks [14] are provided as point clouds, and in recent years much work has gone into building 3D CNNs for 3D point clouds [5,15,16,17]. PointNet [5] and its derivative PointNet++ [6] are essential baselines among these 3D models. PointNet uses a symmetric function to handle the unordered nature of 3D point clouds and uses Multilayer Perceptrons (MLPs) to extract high-level features. It also attaches a small matrix network, T-Net, at the beginning of the model to realign the input. Once the input point cloud has been aligned and its per-point features extracted, the features are aggregated by a global pooling layer (the symmetric function, which is max pooling in the original PointNet) before the final prediction. PointNet is a cornerstone for 3D point clouds, and many later studies have proposed novel methods based on it. PointNet++ [6] improves considerably on PointNet through its local neighborhood sampling and representation method and its multi-level encoder-decoder network structure. Although these 3D CNNs perform well, they are inherently more complex than 2D CNNs because of the difficult properties of 3D point clouds, such as their lack of order. In practice, 3D CNNs are hard to train to convergence when training data are scarce, so applying 3D data augmentation has become common practice [18,19,20]. However, these methods usually rely on point matching, which is computationally expensive.
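The role of the symmetric function can be illustrated with a minimal sketch. It is not the PointNet implementation: it assumes a single shared linear layer with ReLU in place of the MLP, omits T-Net, and uses hypothetical toy sizes (16 points, 8 channels) with random weights. It only shows that a channel-wise maximum over per-point features yields the same global feature regardless of the order of the points.

```python
import random

def shared_layer(point, weights, bias):
    """Apply the same linear layer + ReLU to one point (x, y, z)."""
    return [max(0.0, sum(w * p for w, p in zip(row, point)) + b)
            for row, b in zip(weights, bias)]

def global_max_pool(features):
    """Symmetric aggregation: channel-wise maximum over all points."""
    return [max(channel) for channel in zip(*features)]

def encode(points, weights, bias):
    """Order-independent global feature for a set of points."""
    return global_max_pool([shared_layer(p, weights, bias) for p in points])

rng = random.Random(0)
# Hypothetical toy sizes: 16 points, 8 feature channels, random weights.
points = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(16)]
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]
bias = [rng.uniform(-1, 1) for _ in range(8)]

# Present the same points in a different order.
shuffled = points[:]
rng.shuffle(shuffled)

original_feature = encode(points, weights, bias)
shuffled_feature = encode(shuffled, weights, bias)
print(original_feature == shuffled_feature)
```

Because every point passes through the same layer and the maximum does not depend on ordering, the two features are identical. An order-sensitive operation, such as flattening the per-point features into one vector, would not have this property.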

4. Overview of 2D Convolutional Neural Networks

Ref. [21] uses AlexNet, VGG-16, GoogLeNet, and ResNet-50 to reach more impressive results than traditional methods in palm recognition tasks. In other words, classic 2D CNNs such as VGG, ResNet, DenseNet, MobileNet, and EfficientNet have been shown to achieve robust performance. VGG [1] opened the era of widespread use of convolution layers with a 3 × 3 kernel. ResNet [2] proposes the skip connection to address the difficulty that stacked nonlinear layers have in fitting the identity function. DenseNet [3] takes another route to the same goal: each convolution layer generates only a few channels, which are concatenated with the earlier ones so that the channel count grows gradually. The authors argue that this passes features on more directly than a skip connection does. MobileNet [22] proposes the depthwise separable convolution as an approximation of the standard 3 × 3 convolution and uses it to build a lightweight CNN backbone with fewer parameters and fewer FLOPs than other CNN backbones. MobileNetV2 [23] found that when the channel size is too small, useful information is lost because ReLU [24] leaves units dead. To address this, it proposes the Inverted Residual Block (IRB), which consists of a pointwise convolution (a 1 × 1 convolution) followed by a separable convolution. The first pointwise convolution of the IRB expands the channels, adding redundancy that compensates for the lost information. EfficientNet [4], built with the NAS [25] technique, proposes a comprehensive family of models scaled from B0 to B7, each of which led its scale at the time.
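The parameter savings of the separable convolution can be checked with simple arithmetic. The sketch below counts only convolution weights and ignores biases and normalization layers. The channel sizes (32 in, 64 out) and the expansion factor of 6 are assumptions chosen for illustration, not values taken from this entry.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k=3):
    """Depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

def inverted_residual_params(c_in, c_out, expansion=6, k=3):
    """Pointwise expansion followed by a separable convolution that projects down."""
    hidden = c_in * expansion
    expand = c_in * hidden     # first 1 x 1 convolution widens the channels
    return expand + separable_conv_params(hidden, c_out, k)

c_in, c_out = 32, 64  # hypothetical channel sizes
standard = standard_conv_params(c_in, c_out)
separable = separable_conv_params(c_in, c_out)
irb = inverted_residual_params(c_in, c_out)
print(standard, separable, irb)  # 18432 2336 20160
```

With these sizes the separable convolution needs roughly one eighth of the weights of the standard 3 × 3 convolution. The inverted residual block spends some of that saving again on the expanded hidden channels, which is the redundancy described above.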

5. Projection Methods

As discussed above, because of the difficult properties of 3D point clouds, some studies project them to 2D data to reduce complexity. Some directly project the 3D point cloud into images [15,26,27], while others convert it to a voxel (volume pixel) format [28]. Ref. [26] concludes that a collection of 2D images from different views can be highly informative for 3D shape recognition, and ref. [27] feeds multiple groups of 2D images from different views to a learnable CNN to further strengthen the extracted features. Overall, besides processing 3D point clouds directly as model input, it is also possible to project them to a 2D format. However, dimensionality reduction inevitably causes information loss. How to reduce this loss and preserve the richness of the data is the main problem in this field.
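The general idea of projecting a point cloud from several views can be sketched as follows. This is a generic illustration, not the MVP method of the paper or the method of any cited work: it assumes the points are normalized to the unit sphere, rotates the cloud around the y-axis in equal steps, and renders each view as a small orthographic depth image. The function names, the view count, and the image size are all hypothetical.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point around the y-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project_depth(points, size=8):
    """Orthographic projection onto the x-y plane as a size x size depth image.

    Coordinates are assumed to lie in [-1, 1]. Each pixel keeps the largest
    depth value (z rescaled to [0, 1]); empty pixels stay at 0.0.
    """
    image = [[0.0] * size for _ in range(size)]
    for x, y, z in points:
        col = math.floor((x + 1) / 2 * (size - 1) + 0.5)
        row = math.floor((1 - y) / 2 * (size - 1) + 0.5)
        if 0 <= row < size and 0 <= col < size:  # skip points outside the frame
            image[row][col] = max(image[row][col], (z + 1) / 2)
    return image

def multi_view(points, num_views=4, size=8):
    """Render the cloud from `num_views` equally spaced rotations around y."""
    views = []
    for i in range(num_views):
        angle = 2 * math.pi * i / num_views
        rotated = [rotate_y(p, angle) for p in points]
        views.append(project_depth(rotated, size))
    return views

# Hypothetical toy cloud with four points inside the unit sphere.
points = [(0.0, 0.0, 0.5), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (-0.5, -0.5, -0.5)]
views = multi_view(points)
print(len(views), len(views[0]), len(views[0][0]))  # 4 8 8
```

Each view keeps only one depth value per pixel, so points hidden behind others are discarded. This is a concrete instance of the information loss mentioned above, and it is the reason several views are combined rather than relying on a single projection.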

This entry is adapted from the peer-reviewed paper 10.3390/info14070381

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply.