Facial expression recognition (FER) in the wild has attracted much attention due to its wide range of applications. Most approaches use deep learning models trained on relatively large images, which significantly reduces their accuracy when they must perform inference on low-resolution images.
1. Introduction
In recent years, the need to recognize a person’s emotions has increased, and there has been a growing interest in human emotion recognition across various fields, including brain–computer interfaces [1,2], assistance [3], medicine [4], psychology [5,6], and marketing [7]. Facial expressions are one of the primary nonverbal means of conveying emotion and play an important role in everyday human communication. According to a seminal paper [8], more than half of the messages related to feelings and attitudes are contained in facial expressions. Emotions are continuous in nature; however, it is common to measure them on a discrete scale. Ekman and Friesen [9] identified six universal emotions based on a study of people from different cultures, which showed that people, regardless of their culture, perceive some basic emotions in the same way. These basic emotions are happiness, anger, disgust, sadness, surprise, and fear. Over time, critics of this model have emerged [10], arguing that emotions are not universal and have a strong cultural component; nevertheless, the model of the six basic emotions continues to be widely used in emotion recognition [11].
In the last few decades, facial expression recognition (FER) has come a long way thanks to advances in computer vision and machine learning [12]. Traditionally, this task has been tackled with feature-extraction algorithms such as the scale-invariant feature transform or local binary patterns, combined with classifiers such as support vector machines or artificial neural networks [13,14,15]. However, the current trend is to use convolutional neural networks (CNNs), which perform feature extraction and classification at the same time [16]. FER datasets can be divided into two main categories, depending on how the samples were obtained: laboratory-controlled or in-the-wild. In laboratory-controlled datasets, all images are taken under the same conditions, so there are no variations in illumination, occlusion, or pose. Under these conditions, it is relatively easy to achieve high classification accuracy without resorting to complex models; in fact, on some datasets, such as CK+ or JAFFE, 100% of the images can be correctly classified [17,18]. In-the-wild datasets, on the other hand, contain images taken under uncontrolled conditions, such as those found in the real world. In this scenario, classification accuracy is significantly lower than on laboratory-controlled datasets [19,20].
FER in-the-wild datasets typically contain images of many different sizes [21], which are usually rescaled to 224 × 224 pixels before being fed to a neural network. The main drawback of training a CNN on images of this size is that classification accuracy drops significantly when the network has to infer from lower-resolution images [20]. There are many applications where obtaining high-resolution images of human faces is not feasible, for example when trying to determine the emotional state of multiple people simultaneously in large spaces such as shopping malls, parks, or airports. In these situations, each person is at a different distance from the camera, resulting in face images of different resolutions. As highlighted in a survey [22], these circumstances present a variety of challenges, including occlusion, pose, low resolution, scale variations, and variations in illumination levels. The survey also underscores the importance of the efficiency of FER models when processing images of multiple people in real time.
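To make this evaluation setting concrete, the following minimal PyTorch sketch shows how low-resolution capture is commonly simulated: a face crop is downsampled to a small size and then rescaled back to the resolution the network expects. The function name, the resolutions, and the commented-out model and batch are illustrative assumptions, not part of any cited work.

```python
import torch
import torch.nn.functional as F

def simulate_low_resolution(images: torch.Tensor, low_res: int,
                            input_res: int = 224) -> torch.Tensor:
    """Downsample a batch of face crops to `low_res` and rescale back to
    `input_res`, mimicking a face captured far from the camera.

    images: float tensor of shape (N, 3, H, W) in [0, 1].
    """
    small = F.interpolate(images, size=(low_res, low_res),
                          mode="bilinear", align_corners=False)
    return F.interpolate(small, size=(input_res, input_res),
                         mode="bilinear", align_corners=False)

# Hypothetical usage: evaluate a classifier trained on 224x224 inputs at
# several simulated capture resolutions (model and batch are placeholders).
# for res in (112, 56, 28, 14):
#     logits = model(simulate_low_resolution(batch, low_res=res))
```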
In these scenarios, a network trained on low-resolution images can be more robust, because it is less dependent on fine details that are absent from low-resolution images, which increases its ability to generalize. In addition, working with smaller images reduces the computational cost and the bandwidth required to transfer the images from the cameras to the computer where they are processed. CNNs that work with low-resolution images are lighter because their features occupy less memory, making it possible to use methods such as ensemble learning, which is not widely used in deep learning due to its high computational complexity [23]. Ensemble learning methods combine the results of multiple machine learning estimators with the goal of obtaining a model that generalizes better than the estimators of which it is composed [24]. Ensembling n CNNs multiplies both the number of trainable parameters of the model and the size of each image’s features within the network by n.
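As an illustration of the trade-off just described, here is a minimal PyTorch sketch of soft-voting ensemble inference; it assumes n independently trained member CNNs and simply averages their class probabilities. The class name and interface are hypothetical.

```python
import torch
import torch.nn as nn

class SoftVotingEnsemble(nn.Module):
    """Average the class probabilities of several independently trained
    classifiers. Parameter and feature memory grow linearly with the
    number of members, which is why ensembling is costly for large CNNs
    but affordable for small, low-resolution ones."""

    def __init__(self, members: list[nn.Module]):
        super().__init__()
        self.members = nn.ModuleList(members)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = [m(x).softmax(dim=1) for m in self.members]
        return torch.stack(probs, dim=0).mean(dim=0)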
2. Facial Expression Recognition in the Wild
Many of today’s FER methods use a conventional CNN as the backbone and add attention modules to it to improve classification accuracy [26,27,28]. These modules force the CNN to focus on important information instead of learning useless background information. For example, an occlusion-aware FER system using a CNN with an attention mechanism has been developed [19]. The authors divided the feature vector produced by the last convolutional layer of a CNN into 24 regions of interest using a region decomposition scheme and trained an attention module on them that learns a low weight for a blocked region and a high weight for an unblocked, informative one. In a recent paper, a multi-headed cross-attention network was proposed that achieved state-of-the-art performance on three public FER in-the-wild datasets [29]. In this network, the attention module implements spatial and channel attention, which allows it to capture higher-order interactions between local features. In another work, a CNN was trained with multiple patches of the same image, and an attention module was added to the network output [30]; this model achieved state-of-the-art results on four public FER in-the-wild datasets. Our method shares with these approaches the idea of splitting each sample into multiple patches; however, in our approach, the splitting is performed within the CNN, which significantly reduces the size of the image features inside the network.
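Since the cited attention designs differ in their details, the following PyTorch sketch shows only a generic combined channel-and-spatial attention block of the kind described above; it is not the exact module of any of the cited papers, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Generic channel + spatial attention block that reweights a CNN
    feature map so the network focuses on informative facial regions
    rather than background (a simplified sketch)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, excite channels.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)                       # reweight channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial_conv(pooled)              # reweight locations
```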
Nevertheless, there are other approaches that achieve high performance in FER in the wild without relying on attention mechanisms. For example, a novel multitask learning framework that exploits the dependencies between its tasks using a graph convolutional network has recently been proposed [31]. The experimental results showed that this method improves performance across different datasets and backbone architectures. In addition, one paper proposed three novel CNN models with different architectures and performed extensive evaluations on three popular datasets, demonstrating that these models are competitive and representative in the field of FER in-the-wild research [32]. A very recent paper introduced a few-shot learning model called the convolutional relation network for FER in the wild, trained by exploiting feature similarity comparisons among the abundant samples of known emotion categories in order to identify new classes from few samples [33].
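For intuition, the following PyTorch sketch shows the relation-scoring idea behind such few-shot models: a query embedding is compared with per-class prototype embeddings, and a small learned head outputs a similarity score per class. This is a generic simplification under assumed shapes, not the architecture of [33].

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Few-shot relation scoring: concatenate a query embedding with each
    class prototype and learn a similarity score (a generic sketch)."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, 1), nn.Sigmoid(),
        )

    def forward(self, query: torch.Tensor,
                prototypes: torch.Tensor) -> torch.Tensor:
        # query: (N, D); prototypes: (K, D) -> relation scores (N, K)
        n, k = query.size(0), prototypes.size(0)
        pairs = torch.cat([query.unsqueeze(1).expand(n, k, -1),
                           prototypes.unsqueeze(0).expand(n, k, -1)], dim=-1)
        return self.score(pairs).squeeze(-1)
```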
Another family of approaches that has proven to be state-of-the-art in FER is transformer-based methods. Inspired by the transformers used in natural language processing, vision transformers (ViTs) have been proposed as an alternative to CNNs for various computer vision problems such as image generation [34] or classification [35]. In this approach, an image is divided into multiple patches, and this sequence of patches is used as the input to the model. Compared to CNNs, ViTs are more robust when classifying images that are noisy or magnified, but these models are generally more computationally expensive [36].
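The patch-splitting step can be made concrete with a short PyTorch sketch of the standard ViT patch embedding, in which a strided convolution turns an image into a sequence of patch tokens; the sizes shown are the usual defaults and are merely illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each one
    to a token, as in a standard ViT (sizes are illustrative)."""

    def __init__(self, image_size: int = 224, patch_size: int = 16,
                 in_channels: int = 3, embed_dim: int = 768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is the usual patchify-and-project trick.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (N, 3, 224, 224) -> (N, 768, 14, 14) -> (N, 196, 768)
        return self.proj(x).flatten(2).transpose(1, 2)
```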
However, the pure structure of ViTs, which does not capture local features, is not well suited to detecting subtle changes between different facial expressions, and therefore the performance of these models for FER may be inferior to that of CNN-based ones. To exploit the advantages and minimize the limitations of both approaches, hybrid models combining ViTs and CNNs for FER have been developed in recent years. Huang et al. [37] proposed a novel framework with two attention mechanisms for CNN-based models and a token-based visual transformer technique for image classification of facial expressions in the wild. With this model, they achieved state-of-the-art performance on different datasets without the need for additional training data. With the same goal, Kim et al. [38] proposed a hybrid approach with a ViT as the backbone. By introducing a squeeze module, they were able to reduce the computational complexity by reducing the number of feature dimensions, while increasing FER performance by combining global and local features. Similarly, Liu et al. [39] used a CNN to extract local image features and fed a ViT with a positional embedding generator to model correlations between different visual features from a global perspective. With this hybrid model of only 28.4 million parameters, they surpassed the state-of-the-art in FER with occlusion and outperformed other models with hundreds of millions of parameters.
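Because the cited hybrids differ considerably in architecture, the following PyTorch sketch only illustrates the common pattern: a convolutional stem extracts local features, whose spatial positions are then treated as tokens by a transformer encoder. All layer sizes and the seven-class head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridCnnTransformer(nn.Module):
    """Generic CNN-to-transformer hybrid: a convolutional stem captures
    local texture, and a transformer encoder models global relations
    between the resulting feature positions (an illustrative sketch,
    not the exact architecture of any cited paper)."""

    def __init__(self, embed_dim: int = 256, num_classes: int = 7):
        super().__init__()
        self.stem = nn.Sequential(                       # local features
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(embed_dim, num_classes)    # e.g., 7 expressions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.stem(x)                             # (N, C, H', W')
        tokens = feats.flatten(2).transpose(1, 2)        # (N, H'*W', C)
        tokens = self.encoder(tokens)                    # global relations
        return self.head(tokens.mean(dim=1))             # average-pool tokens
```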
In addition, Ma et al. [40] proposed a model that uses two ResNet-18 networks for parallel feature extraction from the original image and from an image obtained by local binary patterns. To model the relationships between features, they used a visual transformer in which the features from both networks are merged. However, the improvement in accuracy provided by this model implies a significant increase in computational load compared to other approaches, since its different implementations use between 51.8 and 108.5 million parameters. What our proposal has in common with transformer-based methods is that both divide the images into several patches in order to improve the accuracy of emotion recognition. However, in transformer-based models, each patch contains only a small part of the face, which forces the model to work with a large number of patches and, consequently, increases the computational complexity. In contrast, in our approach, almost the entire face is contained in each patch, which reduces the number of patches required and makes the model robust to image translations.
3. FER in the Wild from Low-Resolution Images
In real-world applications, some or all of the images may be low-resolution. Under these conditions, the accuracy of models trained on high-resolution images is significantly reduced. FER in the wild from low-resolution or variable-resolution images is a less-explored area; however, several approaches have been proposed in recent years to address this problem. Yan et al. [41] proposed a filter-based subspace learning method that outperformed the state-of-the-art on posed facial expression datasets, but whose results were significantly worse on in-the-wild datasets. The authors argued that this method has a learning capacity superior to some CNN-based methods; however, it requires the image to be converted to grayscale, which can result in the loss of valuable information. Another approach that has proven effective is the use of denoising techniques on low-resolution images to increase classification accuracy [42].
On the other hand, super-resolution-based methods have shown promising results in this field [43,44]. Super-resolution algorithms are generative models used to obtain high-resolution images from small ones. The output of these models can be fed to any CNN that works with high-resolution images, such as those described in the previous section. The main drawback of these methods is their computational cost, since adding a generative model before the classifier increases both the memory needed to process an image and the number of floating-point operations. While conventional CNNs typically require less than 10 giga floating-point operations (GFLOPs) to process an image, super-resolution-based models can require thousands of GFLOPs [43].
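The cost structure of these pipelines follows directly from their two-stage design, as the following PyTorch sketch illustrates: every classification also pays for a forward pass of the super-resolution generator. Both modules are placeholders for whatever SR model and classifier are actually used.

```python
import torch
import torch.nn as nn

def classify_with_super_resolution(sr_model: nn.Module,
                                   classifier: nn.Module,
                                   low_res_faces: torch.Tensor) -> torch.Tensor:
    """Super-resolution-based FER pipeline: upscale first, classify second.
    Every inference runs both models, which is where the large FLOP
    counts reported for these methods come from."""
    with torch.no_grad():
        # e.g., (N, 3, 28, 28) -> (N, 3, 224, 224), model-dependent
        high_res = sr_model(low_res_faces)
        return classifier(high_res)
```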
Another approach that has shown promising results in this area is knowledge distillation [45], which essentially consists of transferring knowledge from heavy models trained on high-resolution images to lighter models operating on low-resolution images. For example, Ma et al. [46] obtained high accuracy in FER on resolution-degraded images with a model that transfers multiple levels of features from a teacher network to a lighter student network. O. Huang et al. [47] proposed a feature-map-distillation (FMD) framework in which the feature maps of the teacher and student networks have different sizes. With this approach, they achieved better results on several recognition tasks than twenty state-of-the-art knowledge-distillation methods.
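As a minimal illustration of the underlying mechanism, the following PyTorch sketch implements basic response-level distillation, in which a student fed low-resolution images mimics the softened predictions of a teacher fed the high-resolution originals. The feature-level transfer used in the cited works is richer than this; the hyperparameter values are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 0.7) -> torch.Tensor:
    """Response-level distillation: the student (low-resolution input)
    matches the teacher's softened class distribution (high-resolution
    input) while still fitting the ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2                     # standard T^2 gradient rescaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```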
This entry is adapted from the peer-reviewed paper 10.3390/electronics12183837