Recognizing facial expressions plays a crucial role in various multimedia applications, such as human–computer interactions and the functioning of autonomous vehicles. An individual’s facial expressions (FEs) or countenance convey their psychological reactions and intentions in response to a social or personal event. These expressions convey non-verbal stealth messages. With technological advancements, human behavior can be understood through facial expression recognition (FER). To express emotions, humans use facial expressions as their primary nonverbal communication method.
1. Introduction
Biometric authentication and nonverbal communication applications have recently drawn considerable attention to facial recognition. The movie industry has also conducted studies that predict emotions experienced during scenes. These works sought to identify a person’s mood or emotion by analyzing facial expressions.
With landmarks in 2D and 3D
[3][1], facial appearances
[4][2], geometry
[5][3], and 2D/3D spatiotemporal representations
[6][4], facial models can be developed. Review papers provide comprehensive reviews
[6,7,8][4][5][6]. It is also possible to categorize approaches based on images, deep learning, and model-based approaches. Engineered topographies such as HOGs
[9][7], local binary pattern histograms (LBPHs)
[10][8], and Gabor filters
[11][9] are used in many image-based approaches.
The most recent works in this field focus on hand-engineered features, but various techniques have been developed
[12,13,14,15][10][11][12][13]. In today’s image processing and computer vision applications, deep neural networks are the best choice due to the variety and size of datasets
[16,17,18][14][15][16]. Conventional deep networks can easily handle spatial images
[18][16]. Traditional feature extraction and classification schemes are computationally complex and difficult to use for achieving high recognition rates. A DNN based on convolutional coatings and residuals is proposed for recognizing facial emotions
[19,20][17][18]. By learning the subtle features of each expression, the proposed model can distinguish them
[21,22][19][20]. The proposed model presents a facial emotion classification report, along with the confusion matrix derived from the image dataset.
2. Human Emotions and Facial Expressions
Human emotions are expressed through facial expressions in social communication. Three orthogonal planes are used to extract local binary pattern features (LBP-TOP)
[23][21]. Based on computed histograms, the proposed LBP-TOP operator determines expression features from video sequences from three orthogonal planes. The author classified expressions using video sequences based on the extracted features of LBP-TOPs using a machine learning (ML) classifier. According to
[24][22], fuzzy ARTMAP neural networks (FAMNNs) are used for VFER. In addition, particle swarm optimization (PSO) has been used to determine hyperparameters for the FAMNN network. A definite method
[25][23] categorizes emotions based on their types, such as sadness, happiness, fear, and anger, according to the dimensional method
[26][24].
SVM has also been used to categorize facial expressions
[27][25]. Using 15 different feature points and their representations of neutral faces, the authors proposed a method measuring Euclidean distances between them. In the JAFFE dataset, 92% of the datasets are recognized, while in the CK dataset, 86.33% are recognized. These facial expression classification results demonstrate the effectiveness of SVMs in recognizing emotions. Also, SVMs have been employed for formalizing faces
[28][26] and recognizing faces
[29,30][27][28].
A large sample size is essential for developing algorithms for automatically recognizing facial expressions and related tasks. CK
[31][29] and MMI
[32][30] are three facial expression databases used to test and evaluate expression recognition algorithms. There are many databases where participants are asked to present certain facial expressions (e.g., frowns) rather than naturally occurring expressions (e.g., smiles). A spontaneous facial expression does not follow the same spatial or temporal pattern as a deliberately posed expression
[33][31]. Over 90% of facial expressions can be detected by several algorithms. However, it is much harder to recognize spontaneous expressions
[21,34][19][32]. A naturalistic setting is the best place for automatic FER. The working flow of FER methods
[35][33], their integral and intermediate steps, and pattern structures are thoroughly analyzed and surveyed in th
ise study in order to address these missing aspects. Furthermore, the limitations of existing FER surveys are discussed. A detailed analysis of FER datasets follows, followed by a discussion of the challenges and problems related to these datasets.
3. Deep-Learning-Based Face Recognition
During the training process, deep learning can auto-learn the new features based on stored information, thus minimizing the need to train the system repeatedly for new features. As deep learning algorithms do not require manual preprocessing, they can also handle large amounts of data. Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are two algorithms used in deep learning
[36][34].
With RNN, relative dependencies with images are learned by recollecting information about past inputs, which is an advantage over CNN. It is widely used to combine RNNs and CNNs for image processing tasks, such as image recognition
[37][35] and segmentation
[38][36]. When input successions and hidden states are mapped to yields, a recurrent neural network (RNN) learns quick progression. In their paper, the authors proposed an improved method for representing spatial–temporal dependencies between two signals
[39][37]. The CNN model is used in various fields, such as IOT
[40][38], offloading
[41][39], speech recognition
[42][40], and traffic sign recognition
[43][41].
A CNN and an RNN are combined in the HDCR-NN
[25][23]. A facial expression classification system is adapted to it. Hybridizing convolutional neural networks
[44][42] and recurrent neural networks
[45][43] are automatically motivated by their wide acceptance of learning feature attributes.
WThe
researchers used the Japanese Female Facial Expression (JAFFE) and Karolinska Directed Emotional Faces (KDEF) datasets to evaluate the proposed methodology.
Using cross-channel convolutional neural networks (CC-CNNs), the authors
[46][44] propose a method for calculating VFER. The author of
[47][45] proposes a method for VFER based on hierarchical bidirectional recurrent neural networks (PHRNNs). In their proposed framework (MSCNN), spatial information was extracted from still frames of an expression sequence using an MSCNN. Combining PHRNN and MSCNN further enhances VFER by extracting dynamic stills, parts and wholes, and geometry appearance information. An effective VFER method was demonstrated by combining spatiotemporal fusion and transfer learning. CNNs and RNNs are combined in the FER to incorporate audio and visual expressions.
An alternative to deep-learning-based methods used for recognizing facial emotions was proposed with a simple machine learning method
[48][46]. Regional distortions are useful for analyzing facial expressions. On top of the convolutional neural network, they trained a manifold network to learn a variety of facial distortions. A set of low-level features has been considered for inferring human action
[49][47] rather than only trying to extract facial features. An entropy-based method used for facial emotion classification was developed by
[50][48]. An innovative method for recognizing facial expressions from videos was presented by
[51][49]. Multiple feature descriptors have described face and motion information. To exploit complementary and discriminative distance metrics, the researchers carried out the learning of multiple distance metrics. An ANN that learns and fuses the spatial–temporal features was proposed by
[52][50]. The authors
[50][48] proposed several preprocessing steps to recognize facial expressions.