Facial Expression Image Classification: Comparison
Please note this is a comparison between Version 1 by Eui Chul Lee and Version 2 by Rita Xu.

As emotional states are diverse, simply classifying them through discrete facial expressions has its limitations. Therefore, to create a facial expression recognition system for practical applications, not only must facial expressions be classified, emotional changes must be measured as continuous values.

  • facial expression classification
  • facial expression regression
  • arousal
  • valence

1. Introduction

Facial expression recognition (FER), an indicator for emotion recognition, is the most widely used nonverbal means by which humans express emotions [1]. In the field of computer vision, FER involves using algorithms or deep learning operations to discriminate the facial expression in a face image. With the recent development of deep learning networks such as convolutional neural networks (CNNs) [2,3,4] and transformers [5,6] that take images as input, most algorithms displaying state-of-the-art (SOTA)-level accuracy are deep learning-based approaches. In particular, deep learning methods tend to show much better performance than handcrafted feature-based methods when subjects are wearing glasses or hats that cause facial occlusion. Because FER is applied to emotion recognition and human–computer interaction-based services rather than simple facial expression classification, the importance of FER in practical environments outside the lab has increased [7]. Consequently, deep learning-based approaches have become the main research direction. FER has reached a level of commercialization on edge devices such as mobile or embedded environments, and lightweight facial expression recognition models that minimize degradation of accuracy are being actively pursued as a major research topic. Fortunately, deep learning-based FER has progressed to the extent of being lightweight and producing fast inference on edge devices through hardware optimization and acceleration.
Interactive systems that include multi-modal approaches for emotion recognition using camera and microphone sensors are being researched for driver monitoring and human–computer interaction systems for more advanced affective computing [8]. Therefore, the necessity of fast and accurate facial expression recognition is becoming an increasingly important research topic.
Researchers introduce a knowledge distillation (KD) approach for training a student model based on a state-of-the-art (SOTA)-level teacher model requiring more than 4 G (×10⁹) multiply-accumulate operations (MAC) of computation, thereby reducing the computation to 0.3 G (×10⁹) MAC. MAC is a metric used to assess the amount of computation required for model inference, and represents the number of multiplication and accumulation operations. According to this metric, the proposed method uses more than 50 times less computation than existing models. This means that it can be applied to different types of on-device programs, making it universally applicable. Furthermore, because it can predict facial expressions in real time, it can observe instantaneous facial expressions such as micro-expressions. Researchers propose a facial expression recognition learning method using a teacher bound that can improve both classification and regression performance by learning the losses of both the classification and regression problems, rather than simple KD alone. The proposed method shows more than 10% performance improvement compared to SOTA methods, with improved performance in both valence and arousal measurements.
Multi-task EfficientNet-B2 is known to display SOTA performance with an accuracy of 63.03% [9], according to the literature [10]. The proposed method has been confirmed to provide the best results considering model size and processing speed, and can be executed in real time on mobile and embedded systems with a computation time only 10% that of other SOTA-level approaches. Therefore, researchers conducted research to improve model performance and render the model lightweight enough to be used on mobile devices in real time.

2. Emonet

Emonet proposed a fast and accurate FER model for challenging datasets without any constraints on face angle or the presence of glasses, hats, etc. Emonet is suitable for real-time applications, as it incorporates face alignment and jointly estimates categorical and continuous emotions in a single pass. Emonet was tested on three challenging datasets collected under naturalistic conditions, thereby demonstrating that it outperforms all previous methods [11]. In this study, Emonet’s representation vectors and soft labels are used as the teacher model’s output for knowledge distillation. Emonet infers facial landmarks from a network designed for face alignment, and uses emotional layers with an attention mechanism based on the landmark inference results to achieve better classification accuracy, valence, and arousal regression than previous works. In the case of the proposed student model, researchers do not design a network responsible for landmark inference, showing that it is possible to achieve a similar level of accuracy faster than the teacher model using only the existing CNN method. Regarding the KD results, the multiply-accumulate operation (MAC) count of the teacher model is 15.3 G, whereas that of the student model is 0.3 G, about 50 times less computation; this makes it possible to operate in mobile and embedded environments.
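The stated MAC counts give the computation reduction directly; a quick check of the arithmetic:

```python
# Reduction in computation from the teacher (Emonet) to the student model,
# using the MAC counts stated above.
teacher_mac = 15.3e9   # 15.3 G multiply-accumulate operations
student_mac = 0.3e9    # 0.3 G multiply-accumulate operations

ratio = teacher_mac / student_mac
print(f"The student uses about {ratio:.0f}x fewer MACs than the teacher")
```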

3. Knowledge Distillation

KD is a lightweight model development technique that transfers the knowledge of a model with good performance and a large amount of computation to a smaller model with a lighter amount of computation and faster processing speed. This technique was proposed by Hinton et al., and can be applied to various classification problems [12]. During classification, the value of the ground truth and the value predicted by the teacher are used by the student model during learning. In the case of the ground truth, the correct answer class is hard-labeled as 1 and other classes are hard-labeled as 0. In contrast, the teacher model's predictions are soft labels: a probability distribution produced by softmax in which the correct class has the largest value [13]. This is a lightweight technique first introduced in [12], with an easier-to-understand equation introduced in [14]. Equation (1) is the formula used to calculate the total loss in KD. Here, L_CE(·) is the cross-entropy loss, σ(·) is the softmax function, Z_S refers to the output logits of the student model, Z_T refers to the output logits of the teacher model, y is the ground truth, α is a balancing parameter, and T is the temperature hyperparameter.
Total Loss = (1 − α) · L_CE(y, σ(Z_S)) + 2αT² · L_CE(σ(Z_S/T), σ(Z_T/T))  (1)
where L_CE denotes the cross-entropy loss [15] and σ denotes the softmax function. Therefore, the first term on the right refers to the cross-entropy loss between the ground truth and the student model's prediction, whereas the second term refers to the cross-entropy loss between the temperature-softened predictions of the student and teacher models. The degree to which the student mimics the teacher can be controlled by adjusting α. This method is outlined in Figure 1.
Figure 1. Overall explanation of knowledge distillation.
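As a concrete illustration, Equation (1) can be sketched in plain Python; the function and variable names below are illustrative, not from the original work:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a higher temperature T gives a softer distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between a target distribution p and a predicted distribution q."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def kd_total_loss(z_s, z_t, y_onehot, alpha=0.5, T=4.0):
    """Total KD loss of Equation (1): a hard term against the ground truth plus
    a temperature-softened term against the teacher's logits."""
    hard = cross_entropy(y_onehot, softmax(z_s))            # L_CE(y, softmax(Z_S))
    soft = cross_entropy(softmax(z_s, T), softmax(z_t, T))  # L_CE(softmax(Z_S/T), softmax(Z_T/T))
    return (1 - alpha) * hard + 2 * alpha * T**2 * soft
```

With α = 0 the loss reduces to ordinary cross-entropy training on the hard labels; with α = 1 the student learns only from the teacher's softened outputs.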
KD is applied during learning, specifically to the part of the output that classifies facial expressions from the label values. A total of eight facial expressions (neutral, happiness, sadness, anger, contempt, disgust, surprise, fear) [16] expressed in Emonet were learned.

4. Teacher Bound

Knowledge distillation is a method used in classification. However, many deep learning models have various outputs, and among these outputs there are regression problems. Therefore, the method proposed by Chen et al. uses a teacher bound to learn the value of the teacher model for the regression problem [17]. This method applies a large loss, similar to a penalty, whenever the student model does not outperform the teacher model by at least a margin m, and uses the student model's own loss otherwise. The formula for this method is expressed in Equation (2), where R_s denotes the regression value predicted by the student model, y denotes the ground truth value, and R_t denotes the value predicted by the teacher model. In Equation (2), if the error of the student model, ‖R_s − y‖₂², plus the margin m exceeds the error of the teacher model, ‖R_t − y‖₂², then ‖R_s − y‖₂² is applied to L_b; otherwise, L_b is 0. Finally, as shown in Equation (3), the final loss combines the student regression loss L_sL1 with L_b applied.
L_b(R_s, R_t, y) = { ‖R_s − y‖₂²,  if ‖R_s − y‖₂² + m > ‖R_t − y‖₂²
                   { 0,            otherwise                            (2)
L_reg = L_sL1(R_s, y_reg) + ν · L_b(R_s, R_t, y_reg)  (3)
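Equations (2) and (3) can be sketched as follows. The names are illustrative; L_sL1 is sketched here as a smooth L1 loss, a common choice for regression, and the margin m and weight ν values are assumptions for illustration:

```python
def sq_err(r, y):
    """Squared L2 error between a prediction vector r and target y."""
    return sum((ri - yi) ** 2 for ri, yi in zip(r, y))

def teacher_bounded_loss(r_s, r_t, y, m=0.05):
    """L_b of Equation (2): penalize the student only while its error does not
    beat the teacher's error by at least the margin m."""
    if sq_err(r_s, y) + m > sq_err(r_t, y):
        return sq_err(r_s, y)
    return 0.0

def smooth_l1(r, y):
    """Smooth L1 regression loss, one common choice for the student loss L_sL1."""
    total = 0.0
    for ri, yi in zip(r, y):
        d = abs(ri - yi)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def regression_loss(r_s, r_t, y, m=0.05, nu=0.5):
    """Final regression loss of Equation (3): student loss plus weighted teacher bound."""
    return smooth_l1(r_s, y) + nu * teacher_bounded_loss(r_s, r_t, y, m)
```

For example, with a target of (0, 0), a teacher prediction of (0.3, 0), and a student prediction of (0.1, 0), the student beats the teacher by more than m = 0.05, so the bound term vanishes and only the student's own loss is trained.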