Expression-Guided Deep Joint Learning for Facial Expression Recognition

As one of the most fundamental tasks in face analysis, facial expression recognition (FER) plays an important role in understanding emotional states and intentions. FER also provides support in a variety of important societal applications, including intelligent security, fatigue surveillance, medical treatment, and consumer acceptance prediction. 

Keywords: facial expression recognition; deep joint learning; FER

1. Introduction

The face is the most commonly used characteristic for expression recognition [1] and personal identification [2]. In unobtrusive sensing, cameras are among the most commonly used sensors for capturing and recording facial information. The advantage of using a camera to obtain face information is that it can be done without attracting the attention of the monitored subjects, thus avoiding their discomfort and any interference. In addition, a camera can operate for long periods and monitor multiple scenarios and time frames, providing a large amount of facial data [3].
As one of the most fundamental tasks in face analysis, facial expression recognition (FER) plays an important role in understanding emotional states and intentions. FER also provides support in a variety of important societal applications [4][5][6], including intelligent security, fatigue surveillance, medical treatment, and consumer acceptance prediction. FER methods attempt to classify facial images based on their emotional content. For example, Ekman and Friesen [7] defined six basic emotions based on a cross-cultural study: happiness, disgust, surprise, anger, fear, and sadness. A variety of FER methods have been developed in machine learning and computer vision.
Over the past decade, image-level classification has made remarkable progress in computer vision. Much of this progress can be attributed to deep learning and the emergence of convolutional neural networks (CNNs) in 2012. The success of CNNs inspired a wave of breakthroughs in computer vision [8]. However, while deep CNN methods have become the most advanced solution for FER, they also have obvious limitations. In particular, a major disadvantage of deep CNN methods is their low sample efficiency: they require a large amount of labeled data, which excludes many applications in which the data are expensive or inherently sparse.
Moreover, annotating facial expressions is a complicated task: it is extremely time-consuming and challenging even for psychologists to annotate individual facial expressions. Therefore, several databases use crowdsourcing to perform annotation [9]. For example, datasets collected from the web in uncontrolled environments, such as FER2013 and RAF-DB, have improved their reliability through crowdsourced annotations, but the number of annotated images is only about 30,000. The FER2013 database contains 35,887 facial expression images of different subjects, but only 547 of them show disgust. In contrast, deep learning-based approaches for face recognition are typically trained on millions of reliably annotated face images. While FER datasets are growing, they are still considered small from the perspective of deep learning, which requires a large amount of labeled data. For data-driven deep learning, the accuracy of networks trained directly on such databases is low.
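Class imbalance of this kind is commonly mitigated with inverse-frequency loss weighting. The following PyTorch sketch illustrates the idea using the widely reported FER2013 class counts; the weighting scheme is a generic practice, not a method from the works cited here.

```python
import torch
import torch.nn as nn

# Approximate FER2013 class counts (angry, disgust, fear, happy, sad,
# surprise, neutral); only 547 "disgust" images out of 35,887 in total.
class_counts = torch.tensor([4953., 547., 5121., 8989., 6077., 4002., 6198.])

# Inverse-frequency weights, normalized to average 1 across classes.
weights = class_counts.sum() / (len(class_counts) * class_counts)

# Rare classes (e.g., disgust) now contribute proportionally more to the
# loss, partially offsetting the skewed label distribution.
criterion = nn.CrossEntropyLoss(weight=weights)
```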
For FER methods relying on limited labeled data, there are two important strategies: transfer learning based on a face recognition model and semi-supervised learning based on large-scale unlabeled facial image data. One research stream focuses on applying transfer learning strategies to FER, i.e., fine-tuning deep networks pre-trained on face recognition datasets to adapt them to the FER task [10]. Another research stream focuses on applying semi-supervised deep convolutional networks to recognize facial expressions [11][12]. Two points indicate the potential of semi-supervised learning strategies for FER: (1) existing large-scale face recognition databases (such as the MS-Celeb-1M dataset [13]) contain abundant facial expressions; and (2) databases such as AffectNet and EmotioNet contain large amounts of facial expression images that are not labeled.

2. Expression-Guided Deep Joint Learning for Facial Expression Recognition

2.1. Efficient Network for Facial Expression Recognition

The existing FER methods described here follow two distinct approaches, i.e., traditional FER and deep-learning-based FER. In traditional FER, handcrafted features are learned directly from a set of handcrafted filters based on prior knowledge. Traditional FER methods typically employ handcrafted features created using methods such as local phase quantization (LPQ) [14], histograms of oriented gradients (HOGs) [15], Gabor features [16], and the scale-invariant feature transform (SIFT) [17]. As an example, Ref. [14] employed robust local descriptors to account for local distortions in facial images and then deployed various machine learning algorithms, such as support vector machines, multiple kernel learning, and dictionary learning, to classify the discriminative features. However, handcrafted features are generally considered to have limited representation power, and designing appropriate handcrafted features is a challenging process.
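As an illustration of this traditional pipeline, the following sketch pairs HOG descriptors with an SVM classifier; the 48x48 crop size, hyperparameters, and dummy data are illustrative assumptions, not settings from the cited works.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def extract_hog(gray_face):
    # gray_face: 2-D grayscale face crop (here assumed 48x48 pixels).
    return hog(gray_face, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

# Dummy data standing in for aligned face crops and their emotion labels.
X_train = [np.random.rand(48, 48) for _ in range(40)]
y_train = np.random.randint(0, 7, size=40)

features = np.stack([extract_hog(img) for img in X_train])
clf = SVC(kernel='rbf', C=10.0).fit(features, y_train)
```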
Over the past decade, deep learning has proven highly effective in various fields, outperforming both handcrafted features and shallow classifiers. Deep learning has made great progress in computer vision and inspired a large number of research projects on image recognition, especially FER [18][19]. As an example, CNNs and their extensions were first applied to FER by Mollahosseini et al. [20] and Khorrami et al. [21]. Zhao et al. [22] adopted a graph convolutional network to fully explore the structural information of the facial components behind different expressions. In recent years, attention-based deep models have been proposed for FER and have achieved promising results [23][24].
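For concreteness, a minimal CNN classifier of the kind these works build upon might look as follows in PyTorch; this toy architecture is far smaller than the published models and is only a sketch.

```python
import torch
import torch.nn as nn

class SimpleFERNet(nn.Module):
    """Minimal CNN classifier for 48x48 grayscale faces, 7 emotions."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # global average pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = SimpleFERNet()(torch.randn(4, 1, 48, 48))  # -> shape (4, 7)
```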
Although CNNs have been very successful, the large number of internal parameters in CNN-based algorithms gives them high computational and memory requirements. Several efficient neural network architectures have been designed to address these problems, such as MobileNet [25] and ShuffleNet [26]; they make it possible to build highly efficient deep networks with fewer calculations and parameters, and they have been applied to FER in recent years. For instance, Hewitt and Gunes [27] designed three types of lightweight FER models for mobile devices. Barros et al. [28] proposed a lightweight FER model called FaceChannel, which consists of an inhibitory layer connected to the final layer of the network to help shape facial feature learning. Zhao et al. [29] proposed an efficient lightweight network called EfficientFace, which is efficient in both feature extraction and training and has few parameters and FLOPs.
Despite this, efficient networks are limited in terms of feature learning, because low computational budgets constrain both their depth and their width. Considering the challenges of pose variation and occlusion associated with FER in the wild, applying efficient networks directly to FER may result in poor performance in terms of both accuracy and robustness. Furthermore, in a conventional lightweight network, such as MobileNet, pointwise convolution makes up a large portion of the network's overall computation, consuming a considerable amount of memory and FLOPs, as the sketch below illustrates.
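A short sketch makes the pointwise-cost argument concrete: in a MobileNet-style depthwise separable block, the 1x1 (pointwise) convolution holds nearly all of the weights. The exact layer composition here is illustrative, not taken from any cited architecture.

```python
import torch.nn as nn

def depthwise_separable(c_in, c_out, k=3):
    """MobileNet-style block: depthwise k x k conv, then pointwise 1x1 conv."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

block = depthwise_separable(256, 256)
n_dw = 256 * 3 * 3            # 2,304 depthwise weights
n_pw = 256 * 256              # 65,536 pointwise weights
print(n_pw / (n_dw + n_pw))   # ~0.97: the pointwise stage dominates
```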

2.2. The Small-Sample Problem in Facial Expression Recognition

To mitigate the requirement for large amounts of labeled data, several different techniques have been proposed to improve recognition results. Table 1 compares representative techniques for the small-sample problem.
Table 1. The comparison of representative facial expression recognition methods for the small-sample problem.
| Method | Technique | Network | Datasets | Drawbacks and Advantages |
|--------|-----------|---------|----------|--------------------------|
| [30] | Data augmentation | GAN | CK+, Oulu-CASIA, MMI, Multi-PIE, TFD | High computational cost. |
| [31] | Ensemble learning | CNN | AffectNet, FER+ | Additional computing time and storage. |
| [10] | Fine-tuning | CNN | CK+, Oulu-CASIA, TFD, SFEW | The identity information retained in pre-trained models may negatively affect accuracy. |
| [32] | Deep domain adaptation | GAN | BU-3DFE, KDEF, MMI | Requires access to many images in both the source and target domains at training time. |
| [33] | Self-training model | ResNet-18 | KDEF, DDCF | Relies heavily on domain-specific data augmentations that are difficult to generate for most data modalities. |
| [34] | Generative model | GAN | CK+, Oulu-CASIA, BU-3DFE, BU-4DFE | Same as [33]. |
(1) Data augmentation for facial expression recognition.
A straightforward way to mitigate the problem of insufficient training data is to enlarge the database with data augmentation techniques. Data augmentation techniques are typically based on geometric transformations or oversampling augmentation (e.g., with a GAN). The geometric transformation technique generates new data by applying label-preserving transformations, such as color transformations and geometric operations (e.g., translation, rotation, scaling) [35]; a minimal sketch is given below. The oversampling augmentation technique generates facial images with a GAN algorithm [30]. Although data augmentation is effective, it has a significant drawback, namely the high computational cost of learning a large number of possible transformations for the augmented data.
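A minimal sketch of the geometric/color-transform branch using torchvision; the specific transforms and parameter values are illustrative assumptions, not the configurations of [30] or [35].

```python
import torchvision.transforms as T

# Label-preserving transforms for face crops (applied to PIL images);
# each output image keeps the expression label of its source image.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),
    T.RandomResizedCrop(size=48, scale=(0.9, 1.0)),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
])
```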
(2) Deep ensemble learning for facial expression recognition.
In deep ensemble learning, multiple classifiers or networks are integrated at the feature or decision level to combine their respective advantages; this approach has been applied in emotion recognition challenges to improve performance on small-sample problems [36]. Siqueira et al. [31] proposed an ensemble learning algorithm based on shared representations of convolutional networks and demonstrated its data processing efficiency and scalability on facial expression datasets. However, it is worth noting that the ensemble methodology incurs additional computing time and storage costs because multiple networks (rather than a single one) are used for the same task.
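A decision-level fusion sketch: member networks' softmax outputs are averaged before the final argmax. This is the generic ensemble pattern, not the shared-representation design of [31], and the stand-in linear members are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ensemble_predict(members, x):
    """Decision-level fusion: average softmax outputs, then take argmax."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=1) for m in members])
    return probs.mean(dim=0).argmax(dim=1)

# Toy usage with stand-in linear classifiers over 128-D features.
members = [nn.Linear(128, 7) for _ in range(3)]
labels = ensemble_predict(members, torch.randn(4, 128))  # -> 4 predictions
```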
(3) Deep transfer learning for facial expression recognition.
Transfer learning is an effective method for solving the small-sample problem [37]. It attempts to transfer knowledge from one domain to another.
Fine-tuning is a general method used in transfer learning. Many previous studies have employed face recognition datasets such as MS-Celeb-1M [13], VGGFACE2 [38], and CASIA WebFace [39] to pre-train networks such as ResNet [8], AlexNet [40], VGG [41], and GoogleNet [42] for expression recognition. These networks can then be fine-tuned on expression datasets such as CK+, JAFFE, SFEW, or any other FER dataset to accurately predict emotions. For example, Ding et al. presented FaceNet2ExpNet [10], in which a network pre-trained on a face recognition database is fine-tuned on facial expression data to reduce the model's reliance on face identity information. In spite of the advantages of training FER networks on face-related datasets, the identity information retained in the pre-trained models may negatively affect their accuracy.
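A generic fine-tuning sketch: the cited works pre-train on face recognition data (e.g., MS-Celeb-1M), whereas this example stands in an ImageNet-pretrained ResNet-18 for simplicity, so it only illustrates the mechanics of replacing the head and freezing early layers.

```python
import torch.nn as nn
import torchvision.models as models

# ImageNet-pretrained ResNet-18 as a stand-in for a face-pretrained net.
net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
net.fc = nn.Linear(net.fc.in_features, 7)  # new head: 7 emotion classes

# Freeze everything except the last residual stage and the new head, so
# fine-tuning mainly adapts high-level features to expressions.
for name, p in net.named_parameters():
    if not (name.startswith('layer4') or name.startswith('fc')):
        p.requires_grad = False
```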
Deep domain adaptation is another commonly used transfer learning method. It uses labeled data from one or more relevant source domains to perform new tasks in the target domain [43]. To reduce dataset bias, Li et al. [1] introduced the maximum mean discrepancy (MMD) into a deep network for the first time. Taking advantage of the strong performance of GANs, adversarial domain adaptation models [32] were rapidly popularized for domain adaptation in deep learning.
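A sketch of an RBF-kernel MMD term of the kind introduced into deep FER networks by [1]; the single-bandwidth kernel here is a simplification (multi-kernel variants are common), and the feature tensors are placeholders.

```python
import torch

def mmd_rbf(source, target, sigma=1.0):
    """Squared MMD between two feature batches, single RBF kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return (k(source, source).mean() + k(target, target).mean()
            - 2 * k(source, target).mean())

# Added to the classification loss so that source- and target-domain
# features from the shared network are pulled together.
loss_mmd = mmd_rbf(torch.randn(32, 128), torch.randn(32, 128))
```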
(4) Deep semi-supervised learning for facial expression recognition.
Semi-supervised learning (SSL) exploits labeled and unlabeled data simultaneously in order to mitigate the requirement for large amounts of labeled data. Many SSL models have shown excellent performance in FER, including self-training models [33] and generative models [34]. These SSL methods achieve high performance through regularization-based approaches; however, they rely heavily on domain-specific data augmentations that are difficult to generate for most data modalities. Building on pseudo-label-based semi-supervised learning, a deep joint learning model is proposed that alternates between learning the parameters of an efficient neural network and efficiently clustering and labeling facial expressions.
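The generic pseudo-labeling step underlying such methods can be sketched as follows; the confidence threshold and loss weight are illustrative assumptions, and the clustering component of the proposed joint model is deliberately omitted.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, optimizer, x_lab, y_lab, x_unlab,
                      threshold=0.95, lam=1.0):
    """One step of confidence-thresholded pseudo-label training (sketch)."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x_unlab), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf >= threshold          # trust only confident predictions

    model.train()
    loss = F.cross_entropy(model(x_lab), y_lab)
    if keep.any():                        # add the pseudo-labeled term
        loss = loss + lam * F.cross_entropy(model(x_unlab[keep]), pseudo[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```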

References

  1. Li, S.; Deng, W. Deep facial expression recognition: A survey. IEEE Trans. Affect. Comput. 2020, 13, 1195–1215.
  2. Tolba, A.; El-Baz, A.; El-Harby, A. Face recognition: A literature review. Int. J. Signal Process. 2006, 2, 88–103.
  3. Cai, Y.; Li, X.; Li, J. Emotion Recognition Using Different Sensors, Emotion Models, Methods and Datasets: A Comprehensive Review. Sensors 2023, 23, 2455.
  4. Sariyanidi, E.; Gunes, H.; Cavallaro, A. Learning bases of activity for facial expression recognition. IEEE Trans. Image Process. 2017, 26, 1965–1978.
  5. Álvarez-Pato, V.M.; Sánchez, C.N.; Domínguez-Soberanes, J.; Méndoza-Pérez, D.E.; Velázquez, R. A multisensor data fusion approach for predicting consumer acceptance of food products. Foods 2020, 9, 774.
  6. Jin, B.; Qu, Y.; Zhang, L.; Gao, Z. Diagnosing Parkinson disease through facial expression recognition: Video analysis. J. Med Internet Res. 2020, 22, e18697.
  7. Ekman, P. Strong evidence for universals in facial expressions: A reply to Russell’s mistaken critique. Psychol. Bull. 1994, 115, 268–287.
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
  9. Li, S.; Deng, W. Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition. IEEE Trans. Image Process. 2018, 28, 356–370.
  10. Ding, H.; Zhou, S.K.; Chellappa, R. FaceNet2ExpNet: Regularizing a Deep Face Recognition Net for Expression Recognition. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 30 May–3 June 2017; pp. 118–126.
  11. Zhang, F.; Xu, M.; Xu, C. Weakly-supervised facial expression recognition in the wild with noisy data. IEEE Trans. Multimed. 2021, 24, 1800–1814.
  12. Liu, P.; Wei, Y.; Meng, Z.; Deng, W.; Zhou, J.T.; Yang, Y. Omni-supervised facial expression recognition: A simple baseline. arXiv 2020, arXiv:2005.08551.
  13. Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Amsterdam, The Netherlands, 2016; pp. 87–102.
  14. Zhong, L.; Liu, Q.; Yang, P.; Liu, B.; Huang, J.; Metaxas, D.N. Learning active facial patches for expression analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Manhattan, NY, USA, 2012; pp. 2562–2569.
  15. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; IEEE: Manhattan, NY, USA, 2005; Volume 1, pp. 886–893.
  16. Haley, G.M.; Manjunath, B. Rotation-invariant texture classification using modified Gabor filters. In Proceedings of the International Conference on Image Processing, Washington, DC, USA, 23–26 October 1995; IEEE: Manhattan, NY, USA, 1995; Volume 1, pp. 262–265.
  17. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  18. Liu, S.; Huang, S.; Fu, W.; Lin, J.C.W. A descriptive human visual cognitive strategy using graph neural network for facial expression recognition. Int. J. Mach. Learn. Cybern. 2022, 1–17.
  19. Mukhiddinov, M.; Djuraev, O.; Akhmedov, F.; Mukhamadiyev, A.; Cho, J. Masked Face Emotion Recognition Based on Facial Landmarks and Deep Learning Approaches for Visually Impaired People. Sensors 2023, 23, 1080.
  20. Mollahosseini, A.; Chan, D.; Mahoor, M.H. Going deeper in facial expression recognition using deep neural networks. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision, Lake Placid, NY, USA, 7–10 March 2016; IEEE: Manhattan, NY, USA, 2016; pp. 1–10.
  21. Khorrami, P.; Paine, T.; Huang, T. Do deep neural networks learn facial action units when doing expression recognition? In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 19–27.
  22. Zhao, R.; Liu, T.; Huang, Z.; Lun, D.P.K.; Lam, K.K. Geometry-Aware Facial Expression Recognition via Attentive Graph Convolutional Networks. IEEE Trans. Affect. Comput. 2021, 14, 1159–1174.
  23. Wang, K.; Peng, X.; Yang, J.; Meng, D.; Qiao, Y. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Trans. Image Process. 2020, 29, 4057–4069.
  24. Li, Y.; Zeng, J.; Shan, S.; Chen, X. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Trans. Image Process. 2018, 28, 2439–2450.
  25. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
  26. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
  27. Hewitt, C.; Gunes, H. Cnn-based facial affect analysis on mobile devices. arXiv 2018, arXiv:1807.08775.
  28. Barros, P.; Churamani, N.; Sciutti, A. The FaceChannel: A Light-weight Deep Neural Network for Facial Expression Recognition. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition, Buenos Aires, Argentina, 16–20 November 2020; IEEE: Manhattan, NY, USA, 2020; pp. 652–656.
  29. Zhao, Z.; Liu, Q.; Zhou, F. Robust lightweight facial expression recognition network with label distribution training. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 3510–3519.
  30. Yan, Y.; Huang, Y.; Chen, S.; Shen, C.; Wang, H. Joint deep learning of facial expression synthesis and recognition. IEEE Trans. Multimed. 2019, 22, 2792–2807.
  31. Siqueira, H.; Magg, S.; Wermter, S. Efficient facial feature learning with wide ensemble-based convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 5800–5809.
  32. Bozorgtabar, B.; Mahapatra, D.; Thiran, J.P. Exprada: Adversarial domain adaptation for facial expression analysis. Pattern Recognit. 2020, 100, 107111.
  33. Roy, S.; Etemad, A. Self-supervised contrastive learning of multi-view facial expressions. In Proceedings of the 2021 International Conference on Multimodal Interaction, Montreal, QC, Canada, 18–22 October 2021; pp. 253–257.
  34. Yang, H.; Zhang, Z.; Yin, L. Identity-adaptive facial expression recognition through expression regeneration using conditional generative adversarial networks. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition, Xi’an, China, 15–19 May 2018; IEEE: Manhattan, NY, USA, 2018; pp. 294–301.
  35. Lin, F.; Hong, R.; Zhou, W.; Li, H. Facial expression recognition with data augmentation and compact feature learning. In Proceedings of the 2018 25th IEEE International Conference on Image Processing, Athens, Greece, 7–10 October 2018; IEEE: Manhattan, NY, USA, 2018; pp. 1957–1961.
  36. Renda, A.; Barsacchi, M.; Bechini, A.; Marcelloni, F. Comparing ensemble strategies for deep learning: An application to facial expression recognition. Expert Syst. Appl. 2019, 136, 1–11.
  37. Ng, H.W.; Nguyen, V.D.; Vonikakis, V.; Winkler, S. Deep learning for emotion recognition on small datasets using transfer learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, 9–13 November 2015; pp. 443–449.
  38. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. Vggface2: A dataset for recognising faces across pose and age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition, Xi’an, China, 15–19 May 2018; IEEE: Manhattan, NY, USA, 2018; pp. 67–74.
  39. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Learning face representation from scratch. arXiv 2014, arXiv:1411.7923.
  40. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25.
  41. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  42. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  43. Wang, M.; Deng, W. Deep visual domain adaptation: A survey. Neurocomputing 2018, 312, 135–153.