Algorithms for Facial Expression Recognition in the Wild

Facial expression recognition (FER) in the wild has attracted much attention due to its wide range of applications. Most approaches rely on deep learning models trained on relatively large images, which causes a significant drop in accuracy when the models must infer from low-resolution images.

  • facial expression recognition
  • emotions
  • low resolution
  • AffectNet
  • RAF-DB
  • voting

1. Introduction

In recent years, interest in human emotion recognition has grown across various fields, including brain–computer interfaces [1][2], assistive technologies [3], medicine [4], psychology [5][6], and marketing [7]. Facial expressions are one of the primary nonverbal means of conveying emotion and play an important role in everyday human communication. According to a seminal paper [8], more than half of the messages related to feelings and attitudes are conveyed through facial expressions. Emotions are continuous in nature; however, it is common to measure them on a discrete scale. Ekman and Friesen [9] identified six universal emotions based on a study of people from different cultures, which showed that people, regardless of their culture, perceive certain basic emotions in the same way. These basic emotions are happiness, anger, disgust, sadness, surprise, and fear. Over time, critics of this model have emerged [10], arguing that emotions are not universal and have a strong cultural component; nevertheless, the six-basic-emotions model continues to be widely used in emotion recognition [11].
In the last few decades, facial expression recognition (FER) has come a long way thanks to advances in computer vision and machine learning [12]. Traditionally, this task has been addressed with feature-extraction algorithms such as the scale-invariant feature transform or local binary patterns, combined with classifiers such as support vector machines or artificial neural networks [13][14][15]. The current trend, however, is to use convolutional neural networks (CNNs), which perform feature extraction and classification at the same time [16]. FER datasets can be divided into two main categories, depending on how the samples were obtained: laboratory-controlled and in-the-wild. In laboratory-controlled datasets, all images are taken under the same conditions, so there are no variations in illumination, occlusion, or pose. Under these conditions, it is relatively easy to achieve high classification accuracy without resorting to complex models; in fact, on some datasets, such as CK+ or JAFFE, 100% of the images can be correctly classified [17][18]. In-the-wild datasets, on the other hand, contain images taken under uncontrolled conditions, such as those found in the real world. In this scenario, classification accuracy is significantly lower than on laboratory-controlled datasets [19][20].
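To make the traditional pipeline concrete, the following minimal sketch extracts local binary pattern (LBP) histograms from grayscale face crops and classifies them with a support vector machine. It assumes scikit-image and scikit-learn are available; random arrays stand in for a real FER dataset.

```python
# Minimal sketch of a traditional FER pipeline: local binary pattern (LBP)
# histograms as features, classified by a support vector machine.
# Random arrays stand in for a real dataset of grayscale face crops.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def lbp_histogram(face, points=8, radius=1):
    """Uniform-LBP histogram: a fixed-length texture descriptor."""
    lbp = local_binary_pattern(face, points, radius, method="uniform")
    bins = points + 2  # uniform patterns plus one "non-uniform" bin
    hist, _ = np.histogram(lbp.ravel(), bins=bins, range=(0, bins), density=True)
    return hist

rng = np.random.default_rng(0)
faces = rng.integers(0, 256, size=(200, 48, 48)).astype(np.uint8)  # stand-in data
labels = rng.integers(0, 6, size=200)                              # 6 basic emotions

X = np.stack([lbp_histogram(f) for f in faces])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```

In a CNN, by contrast, the feature extractor and the classifier above would be learned jointly from the raw pixels.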
FER in-the-wild datasets typically contain images of many different sizes [21], which are usually rescaled to 224 × 224 pixels before being fed to a neural network. The main drawback of training a CNN on images of this size is that classification accuracy drops significantly when the network must infer from lower-resolution images [20]. There are many applications where obtaining high-resolution images of human faces is not feasible, for example when trying to determine the emotional state of multiple people simultaneously in large spaces such as shopping malls, parks, or airports. In these situations, each person is at a different distance from the camera, resulting in images of different resolutions. As highlighted in a survey [22], these circumstances present a variety of challenges, including occlusion, pose, low resolution, scale variations, and variations in illumination. The survey also underscores the importance of the efficiency of FER models when processing images of multiple people in real time.
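The resolution mismatch can be sketched as follows: a face captured at low resolution must be upsampled to the training size before inference, and interpolation cannot restore the lost detail. The model and image tensors below are placeholders.

```python
# Minimal sketch of the resolution mismatch: a model trained at 224x224 is fed
# a face that was captured at low resolution and upsampled back. The
# interpolation cannot restore lost detail, which is what hurts accuracy.
# `model` is a placeholder for any 224x224 FER classifier.
import torch
import torch.nn.functional as F

def infer_from_low_res(model, image_224, captured_size=32):
    """image_224: (1, 3, 224, 224) tensor standing in for the original face."""
    low = F.interpolate(image_224, size=captured_size, mode="bilinear",
                        align_corners=False)       # simulate low-res capture
    up = F.interpolate(low, size=224, mode="bilinear", align_corners=False)
    with torch.no_grad():
        return model(up).argmax(dim=1)             # predicted emotion class
```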
In these scenarios, a network trained on low-resolution images can be more robust: it depends less on fine details that are absent from low-resolution images, which increases its ability to generalize. In addition, working with smaller images reduces the computational cost and the bandwidth required to transfer the images from the cameras to the computer where they are processed. CNNs that work with low-resolution images are also lighter because their features occupy less memory, making it possible to use methods such as ensemble learning, which is not widely used in deep learning due to its high computational complexity [23]. Ensemble learning methods combine the results of multiple machine learning estimators with the goal of obtaining a model that generalizes better than the estimators of which it is composed [24]. Ensembling n CNNs multiplies both the number of trainable parameters of the model and the size of each image's features within the network by n.
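A minimal sketch of such an ensemble, using soft voting over the class probabilities of n independently trained networks, is shown below. The tiny linear models are placeholders for lightweight low-resolution backbones.

```python
# Minimal sketch of ensemble inference by soft voting: the class probabilities
# of n independently trained CNNs are averaged. The tiny linear models below
# are placeholders for lightweight low-resolution backbones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftVotingEnsemble(nn.Module):
    def __init__(self, members):
        super().__init__()
        self.members = nn.ModuleList(members)

    def forward(self, x):
        # Parameter count and feature memory grow linearly with len(members).
        probs = torch.stack([F.softmax(m(x), dim=1) for m in self.members])
        return probs.mean(dim=0)

members = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 48 * 48, 7)) for _ in range(3)]
ensemble = SoftVotingEnsemble(members)
probs = ensemble(torch.randn(1, 3, 48, 48))  # averaged 7-class probabilities
```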

2. Facial Expression Recognition in the Wild

Many of today’s FER methods use a conventional CNN as the backbone and add attention modules to it to improve classification accuracy [25][26][27]. These modules force the CNN to learn and focus on important information instead of learning useless background information. For example, an occlusion-aware FER system using a CNN with an attention mechanism has been developed [19]. The authors divided the feature vector produced by the last convolutional layer of a CNN into 24 regions of interest using a region decomposition scheme and trained an attention module that learns low weights for occluded regions and high weights for unoccluded, informative regions. In a recent paper, a multi-head cross-attention network was proposed that achieved state-of-the-art performance on three public FER in-the-wild datasets [28]. In this network, the attention module implements both spatial and channel attention, which allows it to capture higher-order interactions between local features. In another work, a CNN was trained with multiple patches of the same image, and an attention module was added to the network output [29]; this model achieved state-of-the-art results on four public FER in-the-wild datasets. The researchers' method shares with these approaches the idea of splitting each sample into multiple patches. However, in the researchers' approach, the splitting is performed within the CNN, which significantly reduces the size of the image features inside the network.
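The following is a minimal sketch of the region-attention idea (not the exact module of [19]): the feature map is split into a few regions, a small gating network assigns each region a scalar weight, and the weighted region descriptors are pooled. The region scheme and dimensions are illustrative.

```python
# Minimal sketch of region attention (not the exact module of [19]): the
# feature map is split into regions, a gating network assigns each region a
# scalar weight (low for occluded, high for informative regions), and the
# weighted region descriptors are summed. Region scheme and sizes are toy.
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    def __init__(self, channels, num_regions=4):
        super().__init__()
        self.num_regions = num_regions
        self.gate = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())

    def forward(self, feat):                                # feat: (B, C, H, W)
        strips = feat.chunk(self.num_regions, dim=2)        # horizontal strips
        pooled = [s.mean(dim=(2, 3)) for s in strips]       # each (B, C)
        weights = [self.gate(p) for p in pooled]            # each (B, 1)
        return sum(w * p for w, p in zip(weights, pooled))  # attended (B, C)

out = RegionAttention(64)(torch.randn(2, 64, 12, 12))  # -> (2, 64)
```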
Nevertheless, other approaches achieve high performance in FER in the wild without relying on attention mechanisms. For example, a multitask learning framework that exploits the dependencies between the categorical and continuous models of emotion using a graph convolutional network has recently been proposed [30]. The experiments showed that this method improves performance across different datasets and backbone architectures. In addition, one paper proposed three novel CNN models with different architectures and performed extensive evaluations on three popular datasets, demonstrating that these models are competitive and representative in the field of in-the-wild FER research [31]. A very recent paper introduced a few-shot learning model called the convolutional relation network for FER in the wild, which is trained by comparing feature similarity among the abundant samples of the known emotion categories so that it can identify new classes from only a few samples [32].
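As a rough illustration of how such a relation network can score a new sample (a simplified sketch, not the exact architecture of [32]): a query embedding is paired with each class prototype, and a small learned module outputs one relation score per class.

```python
# Simplified sketch of relation-network-style few-shot scoring (not the exact
# architecture of [32]): a query embedding is paired with each class
# prototype, and a small learned module outputs one relation score per class.
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                   nn.Linear(dim, 1))

    def forward(self, query, prototypes):
        # query: (B, D); prototypes: (K, D) -> scores: (B, K)
        b, k = query.size(0), prototypes.size(0)
        pairs = torch.cat([query.unsqueeze(1).expand(b, k, -1),
                           prototypes.unsqueeze(0).expand(b, k, -1)], dim=-1)
        return self.score(pairs).squeeze(-1)

scores = RelationHead(128)(torch.randn(4, 128), torch.randn(7, 128))  # (4, 7)
```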
Another family of methods that has proven to be state-of-the-art in FER is transformer-based. Inspired by the transformers used in natural language processing, vision transformers (ViTs) have been proposed as an alternative to CNNs for various computer vision problems, such as image generation [33] and classification [34]. In this approach, an image is divided into multiple patches, and this sequence of patches is used as the input to the model. Compared to CNNs, ViTs are more robust when classifying images that are noisy or magnified, but they are generally more computationally expensive [35].
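The patchify step can be sketched as follows, with sizes following the 16 × 16 patches of the original ViT [34]; using a strided convolution to split and project the patches in one operation is the standard implementation trick.

```python
# Minimal sketch of ViT-style patch embedding: the image is cut into 16x16
# patches, each patch is flattened and linearly projected, and the resulting
# token sequence is what the transformer encoder consumes.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size=16, channels=3, dim=768):
        super().__init__()
        # Strided convolution = split into patches + linear projection.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size,
                              stride=patch_size)

    def forward(self, x):                          # x: (B, 3, 224, 224)
        tokens = self.proj(x)                      # (B, dim, 14, 14)
        return tokens.flatten(2).transpose(1, 2)   # (B, 196, dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))  # 196 tokens of size 768
```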
However, the pure ViT architecture does not explicitly model local features, making it ill suited to detecting the subtle differences between facial expressions; consequently, the performance of these models on FER may be inferior to that of CNN-based ones. To exploit the advantages and minimize the limitations of both approaches, hybrid models combining ViTs and CNNs for FER have been developed in recent years. Huang et al. [36] proposed a framework with two attention mechanisms for CNN-based models and a token-based visual transformer for image classification of facial expressions in the wild. With this model, they achieved state-of-the-art performance on different datasets without the need for additional training data. With the same goal, Kim et al. [37] proposed a hybrid approach with a ViT as the backbone. By introducing a squeeze module, they reduced the computational complexity through fewer feature dimensions while increasing FER performance by combining global and local features. Similarly, Liu et al. [38] used a CNN to extract local image features and fed a ViT with a positional embedding generator to model correlations between different visual features from a global perspective. With this hybrid model of only 28.4 million parameters, they surpassed the state-of-the-art in FER with occlusion and outperformed other models with hundreds of millions of parameters.
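The general shape of such hybrids can be sketched as follows (sizes are illustrative; this is not the exact architecture of any of the cited papers): a small CNN extracts local feature maps, which are flattened into tokens and passed through a transformer encoder for global context.

```python
# Minimal sketch of a hybrid CNN-ViT: a small CNN extracts local feature maps,
# which are flattened into tokens and passed through a transformer encoder for
# global context. Sizes are illustrative, not any cited paper's architecture.
import torch
import torch.nn as nn

class HybridCNNTransformer(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2, num_classes=7):
        super().__init__()
        self.cnn = nn.Sequential(                  # local feature extractor
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)  # global context
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                          # x: (B, 3, 224, 224)
        feat = self.cnn(x)                         # (B, dim, 28, 28)
        tokens = feat.flatten(2).transpose(1, 2)   # (B, 784, dim)
        return self.head(self.encoder(tokens).mean(dim=1))  # emotion logits
```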
In addition, Ma et al. [39] proposed a model using two ResNet-18 networks for parallel feature extraction from the original image and from an image obtained by local binary patterns. To model the relationships between features, they used a visual transformer in which the features from both networks are merged. However, the improvement in accuracy provided by this model implies a significant increase in computational load compared to other approaches, since it uses between 51.8 and 108.5 million parameters across the different implementations described in their work. What the researchers' proposal has in common with transformer-based methods is that both divide the images into several patches to improve the accuracy of emotion recognition. However, in transformer-based models, each patch contains only a small part of the face, which forces the model to work with a large number of samples and, consequently, increases the computational complexity. In contrast, in the researchers' approach, almost the entire face is contained in each sample, which reduces the number of samples required and makes the model robust to image translations.
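A minimal sketch of this near-full-face patch idea is given below (illustrative geometry, not the exact scheme of the researchers' paper): a few large overlapping crops, each covering most of the face, are classified independently and combined by soft voting.

```python
# Minimal sketch of near-full-face patches (illustrative geometry, not the
# exact scheme of the researchers' paper): a few large overlapping crops are
# classified independently and combined by soft voting. `model` is a
# placeholder classifier.
import torch
import torch.nn.functional as F

def multi_crop_predict(model, img, crop_frac=0.9):
    """img: (1, 3, H, W). Each crop covers ~90% of the face, shifted slightly."""
    _, _, h, w = img.shape
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    offsets = [(0, 0), (0, w - cw), (h - ch, 0), (h - ch, w - cw)]  # 4 corners
    probs = []
    for top, left in offsets:
        patch = img[:, :, top:top + ch, left:left + cw]
        patch = F.interpolate(patch, size=(h, w), mode="bilinear",
                              align_corners=False)
        probs.append(F.softmax(model(patch), dim=1))
    return torch.stack(probs).mean(dim=0).argmax(dim=1)  # soft voting
```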

3. FER in the Wild from Low-Resolution Images

In real-world applications, some or all of the images may be low-resolution, and under these conditions, the accuracy of models trained on high-resolution images is significantly reduced. FER in the wild from low-resolution or variable-resolution images is a less-explored area, but several approaches have been proposed in recent years to address this problem. Yan et al. [40] proposed a filter-based subspace learning method that outperformed the state-of-the-art on posed facial expression datasets, although the results were significantly worse on in-the-wild datasets. The authors argued that this method has a learning capacity superior to some CNN-based methods; however, it requires the image to be converted to grayscale, which can result in the loss of valuable information. Another approach that has proven effective is the use of denoising techniques on low-resolution images to increase classification accuracy [41].
On the other hand, super-resolution-based methods have shown promising results in this field [42][43]. Super-resolution algorithms are generative models that reconstruct high-resolution images from low-resolution inputs. Their output can be used as the input to any CNN that works with high-resolution images, such as those described in the previous section. The main drawback of these methods is their computational cost, since adding a generative model before the classifier increases both the memory needed to process an image and the number of floating-point operations. While conventional CNNs typically require fewer than 10 giga floating-point operations (GFLOPs) to process an image, super-resolution-based models can require thousands of GFLOPs [42].
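Structurally, this is a simple two-stage pipeline, sketched below with placeholder models; the extra forward pass through the super-resolution network is where the large FLOP overhead comes from.

```python
# Minimal sketch of super-resolution as preprocessing: the low-resolution face
# is first upscaled by a generative SR model, then passed to a standard
# high-resolution classifier. Both models are placeholders.
import torch

def classify_with_sr(sr_model, classifier, low_res_img):
    """low_res_img: (1, 3, h, w) with h and w well below 224."""
    with torch.no_grad():
        high_res = sr_model(low_res_img)       # e.g. upscaled to (1, 3, 224, 224)
        return classifier(high_res).argmax(dim=1)
```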
Another approach that has shown promising results in this area is knowledge distillation [44], which consists of transferring knowledge from heavy models trained on high-resolution images to lighter models operating on low-resolution images. For example, Ma et al. [45] obtained high accuracy rates in FER on resolution-degraded images with a model that transfers multiple levels of features from a teacher network to a lighter student network. Huang et al. [46] proposed a feature-map-distillation (FMD) framework in which the feature maps of the teacher and student networks have different sizes. With this approach, they achieved better results on several recognition tasks than twenty state-of-the-art knowledge-distillation methods.
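The core of logit-based distillation can be sketched as follows (a generic formulation, not the exact losses of [45] or [46]): the low-resolution student is trained to match both the ground-truth labels and the teacher's temperature-softened output distribution. The temperature T and weighting alpha are illustrative hyperparameters.

```python
# Generic logit-based knowledge-distillation loss (not the exact losses of
# [45] or [46]): the student matches both the true labels and the teacher's
# temperature-softened distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # soft-target term
    hard = F.cross_entropy(student_logits, labels)     # hard-target term
    return alpha * soft + (1 - alpha) * hard
```

Scaling by T squared keeps the gradient magnitude of the soft-target term comparable to that of the hard-target term as the temperature changes.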

This entry is adapted from the peer-reviewed paper 10.3390/electronics12183837

References

  1. García-Martínez, B.; Fernández-Caballero, A.; Martínez-Rodrigo, A.; Novais, P. Analysis of Electroencephalographic Signals from a Brain-Computer Interface for Emotions Detection. In Proceedings of the Advances in Computational Intelligence, Berlin, Germany, 16–18 December 2021; pp. 219–229.
  2. Sánchez-Reolid, R.; García, A.S.; Vicente-Querol, M.A.; Fernández-Aguilar, L.; López, M.T.; Fernández-Caballero, A.; González, P. Artificial Neural Networks to Assess Emotional States from Brain-Computer Interface. Electronics 2018, 7, 384.
  3. Martínez, A.; Belmonte, L.M.; García, A.S.; Fernández-Caballero, A.; Morales, R. Facial Emotion Recognition from an Unmanned Flying Social Robot for Home Care of Dependent People. Electronics 2021, 10, 868.
  4. Kumfor, F.; Piguet, O. Emotion recognition in the dementias: Brain correlates and patient implications. Neurodegener. Dis. Manag. 2013, 3, 277–288.
  5. Monferrer, M.; García, A.S.; Ricarte, J.J.; Montes, M.J.; Fernández-Caballero, A.; Fernández-Sotos, P. Facial emotion recognition in patients with depression compared to healthy controls when using human avatars. Sci. Rep. 2023, 13, 6007.
  6. Monferrer, M.; García, A.S.; Ricarte, J.J.; Montes, M.J.; Fernández-Sotos, P.; Fernández-Caballero, A. Facial Affect Recognition in Depression Using Human Avatars. Appl. Sci. 2023, 13, 1609.
  7. Consoli, D. A new concept of marketing: The emotional marketing. Broad Res. Account. Negot. Distrib. 2010, 1, 52–59.
  8. Mehrabian, A.; Russell, J.A. An Approach to Environmental Psychology; The MIT Press: Cambridge, MA, USA, 1974.
  9. Ekman, P.; Friesen, W.V. Constants across cultures in the face and emotion. J. Personal. Soc. Psychol. 1971, 17, 124–129.
  10. Russell, J.A. Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies. Psychol. Bull. 1994, 115, 102–141.
  11. Upadhyay, A.; Dewangan, A.K. Facial expression recognition: A review. Int. J. Latest Trends Eng. Technol. 2016, 3, 237–243.
  12. Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2022, 13, 1195–1215.
  13. Ryumina, E.; Dresvyanskiy, D.; Karpov, A. In search of a robust facial expressions recognition model: A large-scale visual cross-corpus study. Neurocomputing 2022, 514, 435–450.
  14. Lozano-Monasor, E.; López, M.; Vigo-Bustos, F.; Fernández-Caballero, A. Facial expression recognition in ageing adults: From lab to ambient assisted living. J. Ambient Intell. Humaniz. Comput. 2017, 8, 567–578.
  15. Lozano-Monasor, E.; López, M.T.; Fernández-Caballero, A.; Vigo-Bustos, F. Facial Expression Recognition from Webcam Based on Active Shape Models and Support Vector Machines. In Proceedings of the Ambient Assisted Living and Daily Activities, Belfast, UK, 2–5 December 2014; Pecchia, L., Chen, L.L., Nugent, C., Bravo, J., Eds.; Springer: Cham, Switzerland, 2014; pp. 147–154.
  16. Revina, I.M.; Emmanuel, W.S. A Survey on Human Face Expression Recognition Techniques. J. King Saud Univ. Comput. Inf. Sci. 2021, 33, 619–628.
  17. Kandeel, A.; Rahmanian, M.; Zulkernine, F.; Abbas, H.M.; Hassanein, H. Facial Expression Recognition Using a Simplified Convolutional Neural Network Model. In Proceedings of the 2020 International Conference on Communications, Signal Processing, and their Applications, Sharjah, United Arab Emirates, 16–18 March 2021; pp. 1–6.
  18. Taee, E.J.A.; Jasim, Q.M. Blurred Facial Expression Recognition System by Using Convolution Neural Network. Webology 2020, 17, 804–816.
  19. Li, Y.; Zeng, J.; Shan, S.; Chen, X. Occlusion Aware Facial Expression Recognition Using CNN With Attention Mechanism. IEEE Trans. Image Process. 2019, 28, 2439–2450.
  20. Zhao, Z.; Liu, Q.; Wang, S. Learning Deep Global Multi-Scale and Local Attention Features for Facial Expression Recognition in the Wild. IEEE Trans. Image Process. 2021, 30, 6544–6556.
  21. Patel, K.; Mehta, D.; Mistry, C.; Gupta, R.; Tanwar, S.; Kumar, N.; Alazab, M. Facial Sentiment Analysis Using AI Techniques: State-of-the-Art, Taxonomies, and Challenges. IEEE Access 2020, 8, 90495–90519.
  22. Deshmukh, S.; Patwardhan, M.; Mahajan, A. Survey on real-time facial expression recognition techniques. IET Biom. 2016, 5, 155–163.
  23. Pham, L.; Vu, T.H.; Tran, T.A. Facial Expression Recognition Using Residual Masking Network. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: New York, NY, USA, 2021; pp. 4513–4519.
  24. Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Front. Comput. Sci. 2019, 14, 241–258.
  25. Li, J.; Jin, K.; Zhou, D.; Kubota, N.; Ju, Z. Attention mechanism-based CNN for facial expression recognition. Neurocomputing 2020, 411, 340–350.
  26. Sun, W.; Zhao, H.; Jin, Z. A visual attention based ROI detection method for facial expression recognition. Neurocomputing 2018, 296, 12–22.
  27. Wang, Z.; Zeng, F.; Liu, S.; Zeng, B. OAENet: Oriented attention ensemble for accurate facial expression recognition. Pattern Recognit. 2021, 112, 107694.
  28. Wen, Z.; Lin, W.; Wang, T.; Xu, G. Distract Your Attention: Multi-Head Cross Attention Network for Facial Expression Recognition. Biomimetics 2023, 8, 199.
  29. Wang, K.; Peng, X.; Yang, J.; Meng, D.; Qiao, Y. Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition. IEEE Trans. Image Process. 2020, 29, 4057–4069.
  30. Antoniadis, P.; Filntisis, P.P.; Maragos, P. Exploiting Emotional Dependencies with Graph Convolutional Networks for Facial Expression Recognition. In Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition, Jodhpur, India, 15–18 December 2021; IEEE: New York, NY, USA, 2021; pp. 1–8.
  31. Shao, J.; Qian, Y. Three convolutional neural network models for facial expression recognition in the wild. Neurocomputing 2019, 355, 82–92.
  32. Zhu, Q.; Mao, Q.; Jia, H.; Noi, O.E.N.; Tu, J. Convolutional relation network for facial expression recognition in the wild with few-shot learning. Expert Syst. Appl. 2022, 189, 116046.
  33. Dubey, S.R.; Singh, S.K. Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey. arXiv 2023, arXiv:2302.08641.
  34. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
  35. Maurício, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Appl. Sci. 2023, 13, 5521.
  36. Huang, Q.; Huang, C.; Wang, X.; Jiang, F. Facial expression recognition with grid-wise attention and visual transformer. Inf. Sci. 2021, 580, 35–54.
  37. Kim, S.; Nam, J.; Ko, B.C. Facial Expression Recognition Based on Squeeze Vision Transformer. Sensors 2022, 22, 3729.
  38. Liu, C.; Hirota, K.; Dai, Y. Patch attention convolutional vision transformer for facial expression recognition with occlusion. Inf. Sci. 2023, 619, 781–794.
  39. Ma, F.; Sun, B.; Li, S. Facial Expression Recognition With Visual Transformers and Attentional Selective Fusion. IEEE Trans. Affect. Comput. 2023, 14, 1236–1248.
  40. Yan, Y.; Zhang, Z.; Chen, S.; Wang, H. Low-resolution facial expression recognition: A filter learning perspective. Signal Process. 2020, 169, 107370.
  41. Bodavarapu, P.N.R.; Srinivas, P.V.V.S. Facial expression recognition for low resolution images using convolutional neural networks and denoising techniques. Indian J. Sci. Technol. 2021, 14, 971–983.
  42. Nan, F.; Jing, W.; Tian, F.; Zhang, J.; Chao, K.M.; Hong, Z.; Zheng, Q. Feature super-resolution based Facial Expression Recognition for multi-scale low-resolution images. Knowl. Based Syst. 2022, 236, 107678.
  43. Shao, J.; Cheng, Q. E-FCNN for tiny facial expression recognition. Appl. Intell. 2020, 51, 549–559.
  44. Lee, K.; Kim, S.; Lee, E.C. Fast and Accurate Facial Expression Image Classification and Regression Method Based on Knowledge Distillation. Appl. Sci. 2023, 13, 6409.
  45. Ma, T.; Tian, W.; Xie, Y. Multi-level knowledge distillation for low-resolution object detection and facial expression recognition. Knowl. Based Syst. 2022, 240, 108136.
  46. Huang, Z.; Yang, S.; Zhou, M.; Li, Z.; Gong, Z.; Chen, Y. Feature Map Distillation of Thin Nets for Low-Resolution Object Recognition. IEEE Trans. Image Process. 2022, 31, 1364–1379.