2. Literature Review of Facial Anti-Spoofing Approaches
2.1. Feature-Based Facial Anti-Spoofing Approaches
Liu et al. proposed an improved local binary pattern for face maps that could be used as a classification feature
[16]. When these characteristics were put into a support vector machine (SVM) classifier, the face maps could be classified as genuine or false. Tan et al. proposed a technique for formulating the anti-spoofing objective as a binary classification problem
[17]. They implemented Difference of Gaussian (DoG) filtering to eliminate noise from 2D Fourier spectra and then extended a sparse logistic regression classifier both nonlinearly and spatially for the classification. Määttä et al. utilized multiscale local binary patterns to examine the texture of facial images
[18]. Afterward, the macrotexture patterns were encoded into an improved feature histogram, which was then input into an SVM classifier.
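Several of these pipelines share the same shape: LBP texture histograms fed to an SVM. A minimal sketch of such a baseline, assuming scikit-image and scikit-learn, with illustrative radii, neighbor counts and kernel settings rather than the exact configurations of [16] or [18]:

```python
# Texture-based anti-spoofing baseline: multiscale LBP histograms fed to an
# SVM. Parameters (radii, neighbors, kernel) are illustrative assumptions,
# not the exact settings of [16] or [18].
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(gray, points=8, radius=1):
    # Uniform LBP produces points + 2 distinct codes per pixel.
    codes = local_binary_pattern(gray, points, radius, method="uniform")
    hist, _ = np.histogram(codes, bins=points + 2, range=(0, points + 2),
                           density=True)
    return hist

def multiscale_lbp(gray):
    # Concatenate histograms computed at several radii (macrotexture cue).
    return np.concatenate([lbp_histogram(gray, 8, r) for r in (1, 2, 3)])

def train_classifier(face_crops, labels):
    # face_crops: grayscale face images; labels: 1 = genuine, 0 = spoof.
    features = np.stack([multiscale_lbp(f) for f in face_crops])
    return SVC(kernel="rbf").fit(features, labels)
```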
Parveen et al. proposed a texture descriptor that they described as the dynamic local ternary pattern, in which the textural features of skin were examined using an adaptive threshold configuration and the classification was performed using a support vector machine with a linear kernel
[19]. Das et al. proposed a method that was based on a frequency and texture analysis to distinguish between real and artificial faces
[20]. The frequency evaluation was accomplished by Fourier transforming the images into the frequency domain and then calculating the frequency descriptor to detect dynamic variations in the faces. LBP was employed to analyze the texture and an SVM classifier with a radial basis function kernel was used to classify the generated feature vectors.
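The frequency-analysis component can be illustrated with a simple Fourier-domain statistic; a minimal sketch, assuming a high-frequency energy ratio as the descriptor (our stand-in; the frequency descriptor of [20] is defined differently):

```python
# Frequency-domain cue: printed or replayed faces tend to lose high-frequency
# detail. Illustrative descriptor only; [20] computes a different statistic.
import numpy as np

def high_frequency_ratio(gray, cutoff_fraction=0.25):
    # Centered log-magnitude spectrum of the face image.
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    magnitude = np.abs(spectrum)
    h, w = gray.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - cy, xx - cx)
    # Split the spectrum into low- and high-frequency bands by radius.
    cutoff = cutoff_fraction * min(h, w) / 2
    low = magnitude[radius <= cutoff].sum()
    high = magnitude[radius > cutoff].sum()
    return high / (low + high + 1e-12)
```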
The method that was proposed by Luan et al. involved three characteristics: blurriness, the specular reflection ratio and color channel distribution characteristics
[21]. The images were classified using an SVM, according to these three characteristics. Chan et al. developed a technique that utilized flash to defend against 2D spoofing attacks
[22]. This technique captured two images per individual: one with flash and the other without. Three additional descriptors were employed to extract the textural information of the faces: a descriptor that was based on uniform LBP and the standard deviation and mean of the grayscale difference between the two recorded images of each person. The difference between the pictures with and without flash, as assessed by the four descriptors, was used to classify the images. Kim et al. proposed a method that defined the differences between the surface characteristics of live and fake faces by calculating the diffusion speeds and extracting anti-spoofing characteristics that were based on the local patterns of the diffusion speeds
[23]. These characteristics were input into a linear SVM classifier to assess the liveness of the images. Yeh et al. developed a method that utilized digital focus properties with different depths of field to accomplish liveness detection
[24]. The nose and the lower right portion of the cheek were examined during preprocessing. Due to the impact of the depth of field, the degree of blurriness differed between real and false images, and the k-nearest neighbor method was used to classify the results. Although hand-crafted feature extraction was also utilized in [12,13], the researchers adopted a patch-based CNN that was built on the VGG-16 architecture to conduct feature extraction automatically, which removed the need for hand-crafted feature extraction.
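The depth-of-field cue of [24] can be illustrated with a standard sharpness measure; a minimal sketch, assuming a variance-of-Laplacian blurriness score (our choice, not necessarily the metric of [24]) and scikit-learn's k-NN:

```python
# Depth-of-field cue: with a shallow depth of field, a flat spoof medium
# blurs differently from a real 3D face. Variance of the Laplacian is our
# stand-in sharpness measure, not necessarily the metric used in [24].
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def blurriness(gray_region):
    # Higher variance of the Laplacian means a sharper region.
    return cv2.Laplacian(gray_region, cv2.CV_64F).var()

def focus_feature(nose_region, cheek_region):
    # Compare sharpness of two facial regions lying at different depths.
    return np.array([blurriness(nose_region), blurriness(cheek_region)])

def train_knn(features, labels, k=3):
    # features: (n_samples, 2) focus features; labels: 1 = live, 0 = spoof.
    return KNeighborsClassifier(n_neighbors=k).fit(features, labels)
```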
2.2. Deep Learning-Based Facial Anti-Spoofing Approaches
Deep CNN models have been employed in recent facial liveness identification studies because they provide more accurate liveness detection than the previously presented strategies
[7,8,14,15]. For facial anti-spoofing detection, Atoum et al. suggested a two-stream CNN-based method that included a patch-based CNN and a depth-based CNN
[25]. While the first CNN extracted local features from face image patches, the second extracted depth features by estimating the depth of the entire face image and then used an SVM for classification. This two-stream network could be trained end-to-end to discriminate between real and fake faces based on their rich appearance attributes by combining these cues. Rehman et al. proposed a technique that concentrated on data randomization in mini-batches to train deep CNNs for liveness detection
[26].
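A minimal sketch of such mini-batch randomization, assuming PyTorch (the dataset wrapper and batch size are illustrative, not the configuration of [26]):

```python
# Data randomization in mini-batches: reshuffle frame-level samples every
# epoch so each mini-batch mixes frames from many subjects and videos.
# Illustrative sketch; not the exact pipeline of [26].
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(frames, labels, batch_size=32):
    # frames: float tensor (N, C, H, W); labels: long tensor (N,).
    dataset = TensorDataset(frames, labels)
    # shuffle=True re-randomizes the sample order at every epoch.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```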
The method that was proposed by Alotaibi et al. combined the diffusion of the input facial images with a three-layer CNN architecture
[27]. For facial liveness detection, Alotaibi et al. suggested an approach that employed nonlinear diffusion, accompanied by a tailored deep convolutional network
[28]. By rapidly diffusing the input image, nonlinear diffusion aided in differentiating between fake and genuine images. As a result, the edges of flat images faded away while the edges of genuine faces stayed visible. Furthermore, to extract the most significant properties for classification, a customized deep convolutional neural network was suggested. Koshy et al. proposed a method that combined nonlinear diffusion with three architectures: CNN-5, ResNet50 and Inception v4. They found that the Inception v4 architecture was the best
[29]. Jourabloo et al. addressed the facial anti-spoofing problem as an image denoising problem, which resulted in the development of an anti-spoofing method that was based on CNNs to achieve the facial anti-spoofing objective
[30]. In the first layer of a LeNet-5-based neural network model, De Souza et al. applied the LBP descriptor, which improved the accuracy of facial spoofing detection
[31]. An improved version of LBPnet, which is called n-LBPnet, was suggested to achieve higher accuracy for real-time facial spoofing detection by integrating the local response normalization (LRN) step into the second layer of the network.
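The nonlinear diffusion preprocessing behind [27,28,29] can be illustrated with the classic Perona–Malik scheme; a minimal sketch using an explicit update (shown for clarity; the exact numerical scheme of the cited works may differ):

```python
# Nonlinear (Perona-Malik) diffusion: smooths flat regions quickly while
# preserving strong edges, so printed or replayed faces lose edge detail
# faster than genuine 3D faces. Explicit update shown for clarity only.
import numpy as np

def perona_malik(gray, iterations=20, kappa=30.0, step=0.2):
    # Edge-stopping conductance: near zero across strong edges, so edges
    # survive while flat regions are smoothed away.
    def g(d):
        return np.exp(-((d / kappa) ** 2))

    img = gray.astype(np.float64)
    for _ in range(iterations):
        # Differences toward the four neighbors (borders wrap via np.roll,
        # which is acceptable for a sketch).
        n = np.roll(img, -1, axis=0) - img
        s = np.roll(img, 1, axis=0) - img
        e = np.roll(img, -1, axis=1) - img
        w = np.roll(img, 1, axis=1) - img
        img += step * (g(n) * n + g(s) * s + g(e) * e + g(w) * w)
    return img
```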
Xu et al. demonstrated that facial anti-spoofing in videos could be achieved using a deep architecture that integrated LSTM and a CNN
[32]. The LSTM obtained the temporal correlations in the input sequences while the CNN retrieved the local and dense features. Tu et al. proposed that facial spoofing detection in video sequences could be achieved using a CNN–LSTM network, which concentrated on motion cues throughout video frames
[33]. To enhance the facial expressions of the people in the videos, they used Eulerian motion magnification. Highly discriminative features were extracted from the video frames using a CNN, and an LSTM was used to capture the temporal dynamics of the videos. Recently, Khade et al. proposed an iris liveness detection technique that used several deep convolutional networks
[34]. DenseNet121, Inception v3, VGG-16, EfficientNetB7 and ResNet50 were employed in that study to identify iris liveness using transfer learning approaches; the limited dataset necessitated transfer learning to prevent overfitting. As stated above, numerous well-referenced CNN models have demonstrated the ability to distinguish between real and fake faces. Therefore, this paper presents a patch-based CNN architecture for learning complicated and discriminative features to improve the security of current facial recognition-based biometric authentication systems against printed-image and replay attacks.
3. Architecture of the Proposed System
3.1. Liveness Detection
A patch-based CNN that was built on the deep VGG-16 architecture was proposed for liveness detection to improve security. After the patches are constructed, the input images are transmitted sequentially to the CNN, which serves as the front-end of the architecture. The CNN output is then passed to an LSTM, which identifies temporal features in the sequence, and a dense layer at the output of the network determines whether the input is real or fake. The workflow of the proposed method is shown in Figure 1.
Figure 1. The workflow of the patch-based CNN–LSTM architecture with the modified VGG-16.
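A minimal sketch of this front-end/back-end arrangement, assuming PyTorch and a torchvision VGG-16 backbone (the backbone choice, hidden size and pooling are our assumptions; the modified VGG-16 of the proposed system differs):

```python
# CNN front-end + LSTM back-end + dense output, mirroring the workflow in
# Figure 1. The torchvision VGG-16 backbone and layer sizes are assumptions,
# not the exact modified VGG-16 of the proposed system.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PatchCnnLstm(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.cnn = vgg16(weights=None).features       # convolutional front-end
        self.pool = nn.AdaptiveAvgPool2d((1, 1))      # 512-d vector per patch
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)              # dense real/fake output

    def forward(self, patches):
        # patches: (batch, sequence, 3, H, W) - a sequence of face patches.
        b, s = patches.shape[:2]
        x = patches.flatten(0, 1)                     # merge batch and sequence
        x = self.pool(self.cnn(x)).flatten(1)         # (b*s, 512)
        x = x.view(b, s, 512)
        out, _ = self.lstm(x)                         # temporal features
        return torch.sigmoid(self.head(out[:, -1]))   # score from last step
```

Here, the sequence dimension carries the stream of patches that the front-end CNN processes sequentially, matching the workflow described above.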
3.2. Patch-Based CNN
Before applying the patch-based approach
[23,24,35] for real-time testing purposes, an image processing algorithm (LBPH) was adopted using OpenCV
[36], which allowed facial boundaries to be detected within images. Further, there were several motivations for the proposed CNN to use patches instead of the whole face. Firstly, the number of CNN learning samples needed to be increased due to the limited number of samples that were available for training in all of the accessible anti-spoofing datasets. Although hundreds of faces could be taken from individual videos by cropping faces frame by frame, overfitting posed a huge problem while training the CNN because of the significant similarities between the images.
Secondly, classic CNNs need to resize faces when employing full facial photos as inputs because the photos have various resolutions. These shifts in size could reduce the discriminability of the images. In contrast, adopting local patches maintains the native resolution of the original images and preserves the discriminatory capability of the system. To address this, a patch-based CNN was deployed in which each frame was converted into patches that were then classified separately. As the spoof-specific discriminatory information is present across the whole facial region, patch-level inputs enable the CNN to detect this information irrespective of the patch position, despite this being a more complicated process than using the entire facial image.
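A minimal sketch of this patch-level input strategy, assuming fixed-size patches sampled at random positions from the face crop at its native resolution (patch size and count are illustrative):

```python
# Extract fixed-size patches at random positions from a face crop, keeping
# the native resolution (no resizing of the full face). Patch size and count
# are illustrative choices, not the configuration of the proposed system.
import numpy as np

def extract_patches(face, patch_size=96, num_patches=10, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    h, w = face.shape[:2]
    assert h >= patch_size and w >= patch_size, "face crop too small"
    patches = []
    for _ in range(num_patches):
        # Random top-left corner; the patch stays inside the face crop.
        y = rng.integers(0, h - patch_size + 1)
        x = rng.integers(0, w - patch_size + 1)
        patches.append(face[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)
```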
Furthermore, the input features were selected by converting the images to the HSV color space to obtain discriminative descriptors, as the scope of anti-spoofing methods that use RGB images is limited [25] and the HSV color space also separates chrominance and luminance information. Then, pixel-wise LBP maps were randomly extracted for spatial texture information, rather than being used as traditional histogram descriptors
[37,38,39]. Using a pre-trained Haar cascade model for frontal face detection, the maps were transformed into feature representations, as well as fixed-sized patches, for processing using the VGG-16-based CNN–LSTM model.
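A minimal sketch of this preprocessing chain, assuming OpenCV's bundled frontal-face Haar cascade and scikit-image's LBP (the parameters are library defaults or illustrative values, not those of the proposed system):

```python
# Preprocessing chain: detect the face (Haar cascade), convert to HSV, and
# compute a pixel-wise LBP map as a spatial texture channel. Parameters are
# library defaults / illustrative values, not those of the proposed system.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess(bgr_frame):
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                      # take the first detected face
    face = bgr_frame[y:y + h, x:x + w]
    hsv = cv2.cvtColor(face, cv2.COLOR_BGR2HSV)        # chrominance/luminance
    lbp = local_binary_pattern(gray[y:y + h, x:x + w],  # pixel-wise texture map
                               P=8, R=1, method="uniform")
    # Stack HSV channels with the LBP map as a 4-channel input representation.
    return np.dstack([hsv, lbp.astype(np.uint8)])
```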