Speech emotion recognition (SER), a rapidly evolving task that aims to recognize the emotions of speakers, has become a key research area in affective computing. In natural multilingual scenarios, the variety of languages severely challenges the generalization ability of SER models and causes their performance to degrade quickly, prompting researchers to ask how multilingual SER can be improved. To address this problem, an explainable Multitask-based Shared Feature Learning (MSFL) model is proposed for multilingual SER. The introduction of multi-task learning (MTL) provides MSFL with related task information from language recognition, improves its generalization in multilingual settings, and lays the foundation for learning multilingual shared features (MSFs).
1. Introduction
Speech emotion recognition (SER), which aims to recognize the emotions of speakers by extracting acoustic features from the speech signal, started in the 1990s
[1] and is a key research area in affective computing. Although modalities such as facial expressions, text, and physiological signals have yielded important and prominent results in affective computing
[2][3][4], research shows that the speech modality, as the most convenient and natural medium of communication, has many advantages over these modalities
[5]: compared with facial expressions, speech carries a stronger temporal structure, making it easier to identify emotional changes across the entire sequence; compared with text, speech is more expressive through its intonation; and compared with physiological signals such as electroencephalogram (EEG) recordings, speech data are easy to collect with lightweight devices. In view of these advantages, SER has achieved breakthroughs in theoretical methods and key technologies over nearly three decades of development
[6][7][8] and is widely used in intelligent vehicles
[9], distance education
[10], medical treatment
[11], media retrieval systems
[12] and other fields.
However, emotional expression exhibits large intra-class and inter-class differences, and objective factors such as gender
[13], age
[14], language
[15], and speaker
[16] reduce the performance of existing methods. Since most SER studies focus on a single language, multilingual SER has been addressed in only a few studies that examine the effect of diverse languages on SER. Feraru et al.
[17] found that SER performance within the same language or language family is higher than across languages or language families, which means that a model is constrained by its training corpus owing to the single language and small sample size of that corpus. The same holds in real life: taking SER in the classroom as an example, language courses show lower emotion recognition performance than non-language courses because the model generalizes poorly in multilingual scenarios. In other words, existing models mainly focus on feature learning within a single corpus to improve SER performance, without considering the influence of different languages across multiple corpora, and thus remain far from SER in realistic, complex situations. A natural question therefore arises: how can SER be improved in multilingual scenarios? To answer this question, the connection between different languages and SER is investigated in this research, and the key challenges are summarized as follows: (1) What similarities exist in the feature representations of emotional expression across different languages? (2) How can a generalizable model be built that improves SER performance in multiple languages simultaneously?
To address the above challenges, scholars have sought breakthroughs from the perspectives of both features and models. For the first issue, researchers have explored fusing multiple features to find those that better capture multilingual emotional expression and thereby improve performance. To date, feature fusion has included fusion among traditional handcrafted features
[18][19][20][21][22][23], the fusion between traditional handcrafted features and deep features
[24][25][26][27], and the fusion between deep features
[28][29]. In addition, feature selection is also an effective way to obtain optimal acoustic features. Li et al.
[30] used a three-layer model inspired by human emotion perception, consisting of acoustic features, semantic primitives, and emotion dimensions, to map multilingual acoustic features to emotion dimensions; the acoustic features were selected with the Fisher discriminant ratio followed by sequential forward selection to develop a shared standard acoustic parameter set (a rough sketch of this two-stage selection appears after this paragraph). For the second issue, the main point is how to control the influence of language as an objective factor on SER. Most existing studies have trained language recognition classifiers for model selection
[31][32] or have improved the model to enhance its generalization
[33][34][35][36][37][38][39]. Although these approaches show good results, they still have limitations, which can be summarized as follows: (1) existing methods emphasize controlling the influence of language and ignore the intrinsic connections between languages; (2) most studies focus either on exploring multilingual features or on improving models, with little consideration given to studying models and features together.
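As noted above, a minimal sketch of a two-stage feature-selection pipeline in the spirit of Li et al. [30] is given below: a Fisher-discriminant-ratio filter followed by sequential forward selection. The feature matrix X (utterances by acoustic features), the emotion labels y, the linear SVM wrapper, and the cut-off sizes are illustrative assumptions, not settings from the cited study.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

def fisher_ratio(X, y):
    """Between-class scatter divided by within-class scatter, per feature."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)

def select_shared_features(X, y, n_filter=200, n_final=50):
    """Stage 1: keep the n_filter features with the highest Fisher ratio.
    Stage 2: sequential forward selection with a wrapper classifier."""
    keep = np.argsort(fisher_ratio(X, y))[::-1][:n_filter]
    sfs = SequentialFeatureSelector(
        SVC(kernel="linear"), n_features_to_select=n_final, direction="forward")
    sfs.fit(X[:, keep], y)
    return keep[sfs.get_support()]
```

The cheap Fisher-ratio filter discards weakly discriminative features before the more expensive wrapper search, which is the usual motivation for such a two-stage design.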
A more efficient alternative for addressing the above limitations is multi-task learning (MTL), which is inspired by the fact that humans can learn multiple tasks simultaneously and use the knowledge gained in one task to help learn the others
[40]. MTL can learn robust and generalized feature representations from multiple tasks to better enable knowledge sharing between tasks; its core idea is to reduce the risk of overfitting on each task by weighting the training information across tasks. This raises a question: if a machine learns language recognition and emotion recognition jointly, will the multilingual shared features learned during MTL training improve multilingual SER? The literature suggests that this is feasible. Lee
[41] investigated multilingual SER across English and French using MTL with gender recognition and language recognition as auxiliary tasks. Through comparative experiments, he confirmed that the MTL strategy leads to further improvements under all conditions and is effective for multilingual SER. Zhang et al.
[42] proposed a multi-task deep neural network with shared hidden layers and jointly trained several SER tasks from different corpora. This method achieved large-scale data aggregation and obtained a common feature transformation across all corpora through the shared hidden layers. Sharma
[43] combined 25 open-source datasets into a relatively large multilingual corpus and showed good performance with a multilingual, multi-task SER system based on the multilingual pre-trained wav2vec 2.0 model. In his experiments, several auxiliary tasks were used, including gender prediction, language prediction, and three regression tasks related to acoustic features. Gerczuk et al.
[44] created a novel framework based on residual adapters for multi-corpus SER from a deep transfer learning perspective, in which the multi-task transfer setting trained a shared network across all datasets while only the adapter modules and final classification layers were specific to each dataset. Experiments showed that this multi-task transfer setting improved results on 21 of the 26 databases and achieved the best overall performance. From these studies, it is clear that applying MTL to SER is beneficial for aggregating data, sharing features, and establishing emotional representations. However, previous studies have applied MTL only to improve the generalization ability of models and have not fully considered the interpretability of the model and the shared features it generates. In other words, MTL should be not only a method for improving model generalization but also an effective way to analyze and explain shared features.
To this end, considering the variability of emotional expression across languages, researchers propose an explainable Multitask-based Shared Feature Learning (MSFL) model for multilingual SER, which can improve the SER performance for each language and effectively analyze multilingual shared features (MSFs). Following the basic idea of MTL, the model is divided into a task-sharing module and a task-specific module. The task-sharing module is the key component of MSFL, as it performs the feature selection and transformation that uncover generalized, high-level discriminative representations; the task-specific module handles the classification of the emotion and language tasks. Specifically, the task-sharing module combines a long short-term memory network (LSTM) and an attention mechanism from a new perspective: the LSTM treats the global feature dimensions as time steps to capture long-term dependencies among features, and the attention layer assigns different weights so that the model can better reflect the contribution of each feature in the MSFs. The attention weights over the MSFs are essential for explaining why the MSFL model and its MSFs achieve improved validity and generalizability.
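To make this architecture concrete, a minimal PyTorch sketch is given below. It treats each global feature dimension as one LSTM time step, applies attention over those dimensions to form the weighted shared representation, and attaches two task-specific heads for emotion and language classification. The layer sizes, the single-layer attention, and the class counts are illustrative assumptions, not the published MSFL configuration.

```python
import torch
import torch.nn as nn

class MSFLSketch(nn.Module):
    """Rough sketch of a multitask shared-feature model: a shared LSTM over
    feature dimensions plus attention, followed by two task-specific heads."""

    def __init__(self, n_features, hidden=64, n_emotions=4, n_languages=2):
        super().__init__()
        # Task-sharing module: each feature dimension is one "time step",
        # so the per-step LSTM input size is 1.
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)          # one score per feature dimension
        # Task-specific module: one classification head per task.
        self.emotion_head = nn.Linear(hidden, n_emotions)
        self.language_head = nn.Linear(hidden, n_languages)

    def forward(self, x):                          # x: (batch, n_features)
        steps = x.unsqueeze(-1)                    # (batch, n_features, 1)
        h, _ = self.lstm(steps)                    # (batch, n_features, hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # per-feature attention weights
        shared = (weights * h).sum(dim=1)          # attention-weighted shared features
        return self.emotion_head(shared), self.language_head(shared), weights
```

The returned attention weights are the quantities one would inspect to explain which shared features contribute most, in line with the explainability argument above.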
2. Deep Learning for Speech Emotion Recognition
Early SER techniques relied on extensive feature engineering and performed emotion recognition with traditional machine learning models such as the Hidden Markov Model (HMM), Support Vector Machine (SVM), and Gaussian Mixture Model (GMM)
[45]. The flourishing development of deep learning has broadened the representation of acoustic features, and feature extraction is no longer limited to traditional feature engineering. Extracting deep representations with the powerful feature learning ability of deep neural networks has gradually become the mainstream approach and has laid the foundation for end-to-end models. SER has thus formally entered the era of deep learning and achieves good performance. Convolutional neural networks (CNNs)
[46] and recurrent neural networks (RNNs)
[47] have become the most common deep neural networks in SER. CNNs are designed to process data with a grid-like topology, such as time series and image data, and generally consist of convolutional layers, pooling layers, and fully connected layers. Since they overcome the scalability problem of standard neural networks by allowing multiple regions of the input to share the same weights
[48], they have been widely used in SER to learn frequency- and time-domain representations from spectrogram images
[49]. However, to enhance the interpretability of the features, researchers also use traditional handcrafted features, which are typically fed into deep neural networks (DNNs) and RNNs. Since DNNs form the basis of deep learning, only RNNs are introduced in detail here.
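As a rough illustration of the CNN pattern just described (convolutional, pooling, and fully connected layers applied to a spectrogram image), the sketch below assumes a single-channel 128x128 log-mel spectrogram and four emotion classes; all sizes are arbitrary placeholders rather than settings from any cited work.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Minimal convolution -> pooling -> fully connected pipeline for a
    single-channel spectrogram of assumed size 128x128."""

    def __init__(self, n_emotions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 32 * 32, 128), nn.ReLU(),   # 128x128 input pooled twice -> 32x32
            nn.Linear(128, n_emotions),
        )

    def forward(self, spec):                  # spec: (batch, 1, 128, 128)
        return self.classifier(self.features(spec))
```

The depth, kernel sizes, and pooling scheme vary widely across SER systems; the point here is only the convolution, pooling, and fully connected pipeline.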
The recurrent (self-connected) structure of an RNN is well suited to temporal sequence problems, but during training the gradients tend to vanish, making it difficult to handle long sequences; this motivated the proposal of the long short-term memory network
[50]. Unlike a plain RNN, the LSTM adds a cell state that retains information over time and makes previously computed values available at later steps. To protect and control the information in the cell state, the LSTM uses three gates: an input gate, a forget gate, and an output gate. Because the LSTM can learn long-term dependencies in the data and effectively alleviates the vanishing gradient problem during training, frame-level and spectral features are commonly fed into an LSTM to learn long-term contextual relationships in speech
[51]. On this basis, the bidirectional long short-term memory network (BiLSTM) was proposed to capture both past and future information in an utterance
[52]. To strengthen the capability to capture long-term dependencies in sequential data, Wang et al.
[53] combined BiLSTM with a multi-residual mechanism, which models the relationship between the current time step and more distant time steps rather than only the immediately preceding one. Additionally, the attention mechanism, which is inspired by human visual selective attention and was first introduced into SER by Mirsamadi et al.
[54], is often combined with an LSTM to weight the importance of an utterance, or of particular frame segments within it, along the time series
[55].
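The LSTM-plus-attention pattern summarized above can be illustrated with a minimal sketch: frame-level acoustic features pass through a BiLSTM, and an attention layer weights each frame before pooling to an utterance-level representation for emotion classification. The feature dimension, hidden size, attention form, and class count are assumptions for illustration, not the configuration of any cited model.

```python
import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    """Frame-level features -> BiLSTM -> attention pooling over time -> emotion logits."""

    def __init__(self, n_frame_features=40, hidden=64, n_emotions=4):
        super().__init__()
        self.bilstm = nn.LSTM(n_frame_features, hidden,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # one score per frame
        self.out = nn.Linear(2 * hidden, n_emotions)

    def forward(self, frames):                      # frames: (batch, time, n_frame_features)
        h, _ = self.bilstm(frames)                  # (batch, time, 2*hidden)
        alpha = torch.softmax(self.attn(h), dim=1)  # attention weights over frames
        utterance = (alpha * h).sum(dim=1)          # weighted pooling to utterance level
        return self.out(utterance)
```

The softmax scores play the role of the frame-level importance weights described above: frames that carry more emotional information receive larger weights in the pooled utterance representation.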
3. Multi-Task Learning for Speech Emotion Recognition
Multi-task learning (MTL), also known as joint learning, learning to learn, and learning with auxiliary tasks, was proposed by Caruana in 1997
[56]. Its successful applications in natural language processing, computer vision, speech recognition, and other fields demonstrate the distinctive advantages of this learning paradigm. By exploiting shared low-dimensional representations across multiple related tasks, MTL helps alleviate the data sparsity problem; the representations learned in the MTL setting thus become more generalized, which helps improve performance
[57]. However, a prerequisite for applying this method is that all tasks be correlated; otherwise, negative transfer occurs and the benefit of learning across tasks is reduced, so selecting strongly correlated tasks is crucial for multi-task SER. Based on previous research, the related auxiliary tasks can be summarized into four categories. The first concerns different emotion representations, such as dimensional emotion
[58], the second concerns objective factors related to speech emotion, such as gender
[59], speaker
[60], and language
[41], the third concerns different feature representations
[61], and the fourth concerns related tasks drawn from different databases
[42]. Through these tasks, a multi-task SER model can share common feature representations to improve its generalization ability and performance. To establish the link between languages and emotions for multilingual SER, language recognition is treated as the auxiliary task in this study.
Thung et al.
[62] divide MTL models into single-input multi-output (SIMO), multi-input multi-output (MIMO), and multi-input single-output (MISO) models. According to the existing literature and the characteristics of SER, multi-task SER models fall into the SIMO and MIMO categories. SIMO models usually take traditional handcrafted features
[63] or spectrograms
[64] as the model input and output multiple task targets. MIMO models are trained with multiple sources of data, such as multimodal
[65], multi-corpus
[42], and multi-domain
[66] data, as inputs, and predicting one target from one input source is defined as a task. The framework of the two model types is shown in
Figure 1. Generally, during MTL model optimization, the overall loss is the weighted sum of the individual task losses. Previous studies on multi-task SER have assigned the task weights empirically. However, the loss magnitudes of the different tasks may be inconsistent during training, so a particular task can dominate at a certain stage and the model becomes biased toward fitting that task. Therefore, scholars have started to balance the task gradients adaptively during training to improve the performance of all tasks. Inspired by this, an adaptive loss balancing method called gradient normalization is introduced to improve the performance of the two tasks in the proposed model
[67].
Figure 1. The framework of multi-task SER based on SIMO and MIMO.
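To make the loss-balancing idea concrete, the sketch below combines the emotion and language task losses with learnable weights and adjusts those weights according to the gradient norms each task induces on the last shared layer, in the spirit of gradient normalization [67]. It is a simplified illustration under stated assumptions (two tasks, a manual update step for the task weights, shared_params taken to be the parameters of the shared layer), not the exact published procedure.

```python
import torch

# Learnable weights for the two tasks (emotion, language). A task whose
# gradients dominate the shared layer, or that is already training fast,
# has its weight reduced.
task_weights = torch.ones(2, requires_grad=True)
ALPHA = 1.5   # strength of the balancing "restoring force" (assumed value)
LR_W = 1e-3   # step size for the task-weight update (assumed value)

def multitask_step(losses, initial_losses, shared_params, model_optimizer):
    """One training step. `losses` are the two current task losses (graphs
    attached), `initial_losses` a detached tensor of their values at step 0,
    `shared_params` a list of parameters of the last shared layer."""
    weighted = task_weights * torch.stack(losses)

    # Gradient norm that each weighted task loss induces on the shared layer.
    norms = torch.stack([
        torch.cat([g.flatten() for g in torch.autograd.grad(
            w, shared_params, retain_graph=True, create_graph=True)]).norm()
        for w in weighted])

    # Tasks still far from their initial loss get larger target gradient norms
    # (relative inverse training rate), measured relative to the mean norm.
    ratios = torch.stack([l.detach() for l in losses]) / initial_losses
    targets = (norms.mean() * (ratios / ratios.mean()) ** ALPHA).detach()
    balance_loss = (norms - targets).abs().sum()
    grad_w = torch.autograd.grad(balance_loss, task_weights, retain_graph=True)[0]

    # Update the model with the weighted sum of the task losses...
    model_optimizer.zero_grad()
    weighted.sum().backward()
    model_optimizer.step()
    task_weights.grad = None   # the weights are updated manually below

    # ...then take one manual step on the task weights and renormalize them
    # so they keep summing to the number of tasks.
    with torch.no_grad():
        task_weights -= LR_W * grad_w
        task_weights.clamp_(min=1e-3)
        task_weights.mul_(2.0 / task_weights.sum())
```

Renormalizing the weights so they sum to the number of tasks keeps the overall loss scale stable while the relative emphasis between emotion recognition and language recognition adapts during training.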
This entry is adapted from the peer-reviewed paper 10.3390/app122412805