Expression-Guided Deep Joint Learning for Facial Expression Recognition

Please note this is a comparison between Version 1 by Bei Fang and Version 2 by Lindsay Dong.

facial expression recognition
deep joint learning
FER

1. Introduction

The face is the most commonly used characteristic for expression recognition [1] and personal identification [2]. In unobtrusive sensing, cameras are commonly used to capture and record facial information. The advantage of a camera is that face information can be obtained without attracting the attention of the monitored subjects, thus avoiding discomfort and interference. In addition, a camera can operate for long periods and monitor multiple scenarios, providing a large amount of face data [3].

As one of the most fundamental tasks in face analysis, facial expression recognition (FER) plays an important role in understanding emotional states and intentions. FER also supports a variety of important societal applications [4,5,6], including intelligent security, fatigue surveillance, medical treatment, and consumer acceptance prediction. FER methods attempt to classify facial images based on their emotional content. For example, Ekman and Friesen [7] defined six basic emotions based on a cross-cultural study: happiness, disgust, surprise, anger, fear, and sadness. Various FER methods have been developed in machine learning and computer vision.

Over the past decade, image-level classification has made remarkable progress in computer vision. Much of this progress can be attributed to deep learning and the emergence of convolutional neural networks (CNNs) in 2012. The success of CNNs inspired a wave of breakthroughs in computer vision [8]. However, while deep CNN methods have become the most advanced solution for FER, they also have obvious limitations. A major disadvantage is their low sample efficiency: they require a large amount of labeled data, which excludes many applications in which the data are expensive or inherently sparse.

In particular, annotating facial expressions is a complicated task. It is extremely time-consuming and challenging for psychologists to annotate individual facial expressions. Therefore, several databases use crowdsourcing to perform annotation [9]. For example, datasets collected from the Internet in uncontrolled environments, such as FER2013 and RAF-DB, have improved their reliability through crowdsourced annotations, but the number of annotated images is only about 30,000. The FER2013 database contains 35,887 facial expression images of different subjects, but only 547 of them show disgust. In contrast, deep learning-based approaches for face recognition are typically trained on millions of reliably annotated face images. While the sizes of FER datasets are growing, they are still considered small from the perspective of deep learning, which requires a large amount of labeled data. For data-driven deep learning, training directly on such databases yields low accuracy.
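The imbalance quoted above can be made concrete with a short calculation. This is only an illustrative sketch: the image counts come from the text, while the number of expression classes in FER2013 (seven) is not stated there and is introduced here as an assumption.

```python
# Figures quoted in the text above.
total_images = 35_887
disgust_images = 547

# Assumption (not stated in the text): FER2013 uses seven expression classes.
num_classes = 7

# Fraction of the dataset that shows disgust.
disgust_share = disgust_images / total_images

# Fraction each class would have if the dataset were perfectly balanced.
balanced_share = 1 / num_classes

# How many times smaller the disgust class is than a balanced class.
shortfall = balanced_share / disgust_share

print(f"disgust share: {disgust_share:.2%}")
print(f"balanced share: {balanced_share:.2%}")
print(f"under-representation factor: {shortfall:.1f}x")
```

Under this assumption, the disgust class is far smaller than it would be in a balanced dataset, which illustrates why direct training on such data is difficult.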

For FER methods that rely on limited labeled data, there are two important strategies: transfer learning based on a face recognition model and semi-supervised learning based on large-scale unlabeled facial image data. One research stream focuses on applying transfer learning strategies to FER, i.e., fine-tuning deep networks pre-trained on face recognition datasets to adapt them to the FER task [10]. Another research stream focuses on applying semi-supervised learning-based deep convolutional networks to recognize facial expressions [11,12]. Two points indicate the potential of semi-supervised learning strategies in FER: (1) existing large-scale face recognition databases (such as the MS-Celeb-1M dataset [13]) contain abundant facial expressions; and (2) databases such as AffectNet and EmotioNet contain large numbers of facial expression images that are not labeled.

2. Expression-Guided Deep Joint Learning for Facial Expression Recognition

2.1. Efficient Network for Facial Expression Recognition

The existing FER methods described here follow two distinct approaches: traditional FER and deep-learning-based FER. In traditional FER, features are extracted with handcrafted filters designed from prior knowledge. Traditional FER methods typically employ handcrafted features such as local phase quantization (LPQ) [14], histograms of oriented gradients (HOGs) [15], Gabor features [16], and the scale-invariant feature transform (SIFT) [17]. As an example, Ref. [14] employed robust local descriptors to account for local distortions in facial images and then deployed various machine learning algorithms, such as support vector machines, multiple kernel learning, and dictionary learning, to classify the discriminative features. However, handcrafted features are generally considered to have limited representation power, and designing appropriate handcrafted features is a challenging process.

Over the past decade, deep learning has proven highly effective in various fields, outperforming both handcrafted features and shallow classifiers. Deep learning has made great progress in computer vision and inspired a large number of research projects on image recognition, especially FER [18,19]. As an example, CNNs and their extensions were first applied to FER by Mollahosseini et al. [20] and Khorrami et al. [21]. Zhao et al. [22] adopted a graph convolutional network to fully explore the structural information of the facial components behind different expressions. In recent years, attention-based deep models have been proposed for FER and have achieved promising results [23,24]. Although CNNs have been very successful, CNN-based algorithms have a large number of internal parameters and therefore high computing and memory requirements.

Several efficient neural network architectures, such as MobileNet [25] and ShuffleNet [26], were designed to solve these problems. They have the potential to create highly efficient deep networks with fewer calculations and parameters, and they have been applied to FER in recent years. For instance, Hewitt and Gunes [27] designed three types of lightweight FER models for mobile devices. Barros et al. [28] proposed a lightweight FER model called FaceChannel, which includes an inhibitory layer connected to the final layer of the network to help shape facial feature learning. Zhao et al. [29] proposed an efficient lightweight network called EfficientFace, which has few parameters and FLOPs. Despite this, efficient networks are limited in terms of feature learning, because low computational budgets constrain both their depth and their width. Considering the challenges of pose variation and occlusion associated with FER in the wild, applying efficient networks directly to FER may result in poor accuracy and robustness. Furthermore, in a conventional lightweight network, such as MobileNet, pointwise convolution makes up a large portion of the overall calculations of the network, consuming a considerable amount of memory and FLOPs.
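The remark about pointwise convolution can be checked with a back-of-the-envelope count of multiply-accumulate operations (MACs) for one depthwise separable layer, the building block used in MobileNet. The sketch below uses the standard cost formulas for a stride-1 layer; the layer dimensions are hypothetical values chosen only for illustration and are not taken from any of the cited networks.

```python
def depthwise_separable_macs(k, c_in, c_out, h, w):
    """Return (depthwise, pointwise) multiply-accumulate counts for one
    stride-1 depthwise separable layer with an h x w output feature map."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution that mixes channels
    return depthwise, pointwise


# Hypothetical layer size, chosen only for illustration.
k, c_in, c_out, h, w = 3, 256, 256, 14, 14

depthwise, pointwise = depthwise_separable_macs(k, c_in, c_out, h, w)

# Cost of a standard k x k convolution with the same input and output shape.
standard = k * k * c_in * c_out * h * w

# Fraction of the separable layer's cost spent in the pointwise step.
pointwise_share = pointwise / (depthwise + pointwise)

print(f"depthwise MACs: {depthwise:,}")
print(f"pointwise MACs: {pointwise:,}")
print(f"standard conv MACs: {standard:,}")
print(f"pointwise share of separable layer: {pointwise_share:.1%}")
```

For a fixed kernel size, the pointwise share equals c_out / (k*k + c_out), so it grows with the number of output channels. This is why the 1 x 1 convolutions dominate the cost of wide layers even though the separable layer as a whole is much cheaper than a standard convolution.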

2.2. The Small-Sample Problem in Facial Expression Recognition

To mitigate the requirement for large amounts of labeled data, several different techniques have been proposed to improve the recognition results. Table 1 provides a comparison of the representative techniques for the small-sample problem.

Table 1. Comparison of representative facial expression recognition methods for the small-sample problem.

Method	Technique	Network	Datasets	Drawbacks
[30]	Data augmentation	GAN	CK+, Oulu-CASIA, MMI, Multi-PIE, TFD	High computational cost
[31]	Ensemble learning	CNN	AffectNet, FER+	Additional computing time and storage
[10]	Fine-tuning	CNN	CK+, Oulu-CASIA, TFD, SFEW	The identity information retained in the pre-trained models may negatively affect accuracy.
[32]	Deep domain adaptation	GAN	BU-3DFE, KDEF, MMI	Requires access to many images in both the source and target image domains at training time.
[33]	Self-training model	ResNet-18	KDEF, DDCF	Relies heavily on domain-specific data augmentations that are difficult to generate for most data modalities.
[34]	Generative model	GAN	CK+, Oulu-CASIA, BU-3DFE, BU-4DFE	Relies heavily on domain-specific data augmentations that are difficult to generate for most data modalities.

(1) Data augmentation for facial expression recognition. A straightforward way to mitigate the problem of insufficient training data is to enlarge the database with data augmentation techniques. Data augmentation techniques are typically based on geometric transformations or oversampling augmentation (e.g., GAN). The geometric transformation technique generates data through label-preserving transformations, such as color transformations and geometric transformations (e.g., translation, rotation, scaling) [35]. The oversampling augmentation technique generates facial images based on a GAN algorithm [30]. Although data augmentation is effective, it has a significant drawback, namely the high computational cost of learning a large number of possible transformations for augmented data.

(2) Deep ensemble learning for facial expression recognition. In deep ensemble learning, multiple networks are integrated at the feature or decision level to combine their respective advantages; such ensembles have been applied in emotion recognition challenges to improve performance on small-sample problems [36]. Siqueira et al. [31] proposed an ensemble learning algorithm based on shared representations of convolutional networks; they demonstrated its data processing efficiency and scalability for facial expression datasets. However, it is worth noting that ensemble learning requires additional computing time and storage because multiple networks (rather than a single learner) are used for the same task.

(3) Deep transfer learning for facial expression recognition. Transfer learning is an effective method for solving the small-sample problem [37]. It attempts to transfer knowledge from one domain to another. Fine-tuning is a general method used in transfer learning. Many previous studies have employed face recognition datasets like MS-Celeb-1M [13], VGGFace2 [38], and CASIA WebFace [39] to pre-train networks like ResNet [8], AlexNet [40], VGG [41], and GoogLeNet [42] for expression recognition. These networks can then be fine-tuned on expression datasets like CK+, JAFFE, SFEW, or any other FER dataset to accurately predict emotions. For example, Ding et al. presented FaceNet2ExpNet [10], which was first trained on a face recognition database and then fine-tuned on facial expression data to reduce the reliance of the model on face identity information. In spite of the advantages of training FER networks on face-related datasets, the identity information retained in the pre-trained models may negatively affect their accuracy. Deep domain adaptation is another commonly used transfer learning method. This method uses labeled data from one or more relevant source domains to perform new tasks in the target domain [43]. To reduce dataset bias, Li et al. [1] introduced the maximum mean discrepancy (MMD) into a deep network for the first time. Taking advantage of the excellent performance of the GAN, the adversarial domain adaptation model [32] was rapidly popularized in deep learning for domain adaptation.

(4) Deep semi-supervised learning for facial expression recognition. Semi-supervised learning (SSL) exploits labeled and unlabeled data simultaneously in order to mitigate the requirement for large amounts of labeled data. Many SSL models have shown excellent performance in FER, including self-training models [33] and generative models [34]. These SSL methods achieve high performance mainly through regularization; however, they rely heavily on domain-specific data augmentations that are difficult to generate for most data modalities. Building on pseudo-label-based semi-supervised learning methods, a deep joint learning model is proposed. It alternates between learning the parameters of an efficient neural network and efficiently clustering and labeling facial expressions.
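The alternation described above can be illustrated with a deliberately small toy example. The text does not specify the network, the clustering algorithm, or the loss, so everything below is an assumption made for illustration: a nearest-centroid classifier stands in for the efficient neural network, the 2-D points stand in for facial features, and the function names are hypothetical. It is a sketch of the general pseudo-label idea, not the proposed model.

```python
def centroid(points):
    """Mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def nearest_label(x, centroids):
    """Label of the centroid closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def joint_learning(labeled, unlabeled, rounds=10):
    """Alternate between refitting the model and pseudo-labeling.

    labeled:   list of (features, label) pairs
    unlabeled: list of feature tuples
    """
    pseudo = []  # current pseudo-labeled pairs, empty before the first round
    centroids = {}
    for _ in range(rounds):
        # Step A (stand-in for updating network parameters):
        # refit one centroid per class on labeled + pseudo-labeled data.
        by_class = {}
        for x, y in labeled + pseudo:
            by_class.setdefault(y, []).append(x)
        centroids = {y: centroid(xs) for y, xs in by_class.items()}

        # Step B (clustering and labeling):
        # assign each unlabeled sample the label of its nearest centroid.
        new_pseudo = [(x, nearest_label(x, centroids)) for x in unlabeled]
        if new_pseudo == pseudo:  # assignments stopped changing
            break
        pseudo = new_pseudo
    return centroids, pseudo


# Toy data: one labeled sample per class and six unlabeled samples.
labeled = [((0.0, 0.0), "happiness"), ((5.0, 5.0), "sadness")]
unlabeled = [(0.5, -0.3), (-0.4, 0.6), (1.0, 0.8),
             (4.6, 5.2), (5.5, 4.4), (4.1, 4.9)]

centroids, pseudo = joint_learning(labeled, unlabeled)
pseudo_labels = [y for _, y in pseudo]
print(pseudo_labels)
```

In this toy setting, the unlabeled samples pull each class centroid toward the bulk of its cluster, which is the effect that pseudo-labeling is meant to provide when labeled data are scarce. A real system would replace the centroid update with gradient steps on a network and would need safeguards against confirming wrong pseudo-labels.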