Techniques Related to Chinese Speech Emotion Recognition

In recent years, the use of Artificial Intelligence for emotion recognition has attracted much attention. Emotion recognition has broad industrial applicability and strong development potential.

  • emotion recognition
  • deep neural network
  • acoustic features

1. Introduction

Language is the main way people communicate. Beyond the literal meaning of a message, language also carries emotion. Through emotion, tone, and other cues, a listener can sense a speaker's feelings even without understanding the words themselves. In recent years, the use of artificial intelligence and deep learning for emotion recognition has attracted much attention. Emotion recognition has broad industrial applicability and strong development potential. In everyday applications, human–computer interaction has gradually shifted from touch-based interfaces to voice commands and dialogue. Speech recognition is widely used in transportation, catering, customer service systems, personal health care, and leisure entertainment [1][2][3][4][5]. Automatic Speech Recognition (ASR) technology has matured and can now accurately recognize speech and convert it into text [6][7][8]. However, beyond the meaning of the words themselves, the emotions accompanying a dialogue also carry important information. Because emotions are so rich in information, Automatic Emotional Speech Recognition (AESR) is expected to be a focus of the next generation of speech technology.

The use of deep-learning techniques to recognize speech emotion has grown rapidly in recent years. Li et al. [9] combined a hybrid deep neural network with a hidden Markov chain to construct a speech recognition model, achieving significant results on the EMO-DB dataset. Mao et al. [10] and Zhang et al. [11] verified that convolutional neural networks can effectively learn emotional features in speech. Umamaheswari and Akila [12] first tried a Pattern Recognition Neural Network (PRNN) combined with a KNN algorithm, obtaining better results than the traditional HMM and GMM approaches. Mustaqeem and Soonil Kwon [13] proposed a deep stride strategy to construct spectrogram feature maps and achieved good recognition performance on the well-known IEMOCAP and RAVDESS datasets. In 2021, Li et al. [14] proposed a bi-directional LSTM model combined with a self-attention mechanism for speech emotion recognition, which achieved remarkable performance on the well-known IEMOCAP and EMO-DB corpora, including the highest recent recognition accuracy for the ‘Happiness’ and ‘Anger’ emotions. To date, most AESR research has focused on English or European languages [15][16], and research on recognizing Chinese speech emotion with deep neural networks is relatively rare.
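As an illustration of the kind of architecture described in [14], a bi-directional LSTM with self-attention can be sketched as follows. This is only an illustrative outline: the 40-dimensional frame features, hidden size, single attention head, and six output classes are assumptions, not the configuration reported by the authors.

```python
# Hypothetical sketch of a BiLSTM + self-attention emotion classifier,
# loosely following the architecture described in [14]. Feature dimension,
# layer sizes, and the number of emotion classes are illustrative assumptions.
import torch
import torch.nn as nn


class BiLSTMSelfAttention(nn.Module):
    def __init__(self, n_features=40, hidden=128, n_classes=6):
        super().__init__()
        # Bi-directional LSTM over frame-level acoustic features
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        # Single-head self-attention over the LSTM outputs
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=1,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):               # x: (batch, frames, n_features)
        h, _ = self.lstm(x)             # (batch, frames, 2 * hidden)
        a, _ = self.attn(h, h, h)       # self-attention over the time axis
        pooled = a.mean(dim=1)          # average over frames
        return self.classifier(pooled)  # (batch, n_classes) emotion logits


# Example: a batch of 4 utterances, each with 300 frames of 40-dim features
logits = BiLSTMSelfAttention()(torch.randn(4, 300, 40))
```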

2. Acoustic Features

The extraction and selection of acoustic features is an important part of speech recognition. Sound analysis usually relies on short-term analysis: the signal is cut into frames, and the signal in each frame is analyzed separately. Three main sound characteristics can be observed:

  • Volume: the amplitude of the waveform; the greater the amplitude, the louder the sound.
  • Pitch: the perceived level of the sound, expressed by frequency; the higher the fundamental frequency, the higher the pitch.
  • Timbre: the content of the sound, represented by the shape of the waveform within one fundamental period; different timbres correspond to different audio content.

There has been extensive research on emotion-related features in speech and audio. Schuller et al. [17] used short-term analysis to define a set of 6373 features, while Eyben et al. [18] proposed the Geneva Minimalistic Acoustic Parameter Set (GeMAPS), a minimalistic set of 62 features.
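To make the three characteristics concrete, the sketch below extracts frame-level volume (RMS energy), pitch (fundamental frequency), and a timbre-related representation (MFCCs) with the librosa library. The file name, sampling rate, and frame/hop sizes are placeholder choices, and the utterance-level statistics at the end only hint at how larger feature sets such as those in [17] or GeMAPS [18] are assembled.

```python
# Minimal sketch of frame-based acoustic feature extraction with librosa.
# The audio path, sampling rate, and frame/hop sizes are illustrative choices.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file

frame_length, hop_length = 400, 160            # 25 ms frames, 10 ms hop at 16 kHz

# Volume: root-mean-square energy per frame (waveform amplitude)
rms = librosa.feature.rms(y=y, frame_length=frame_length,
                          hop_length=hop_length)[0]

# Pitch: fundamental frequency (F0) per frame, NaN for unvoiced frames
f0, voiced_flag, _ = librosa.pyin(y, fmin=50, fmax=500, sr=sr,
                                  hop_length=hop_length)

# Timbre: MFCCs summarize the spectral envelope of each frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_length, hop_length=hop_length)

# Utterance-level statistics (mean/std over frames) are one common way to
# build a fixed-length feature vector from frame-level measurements
features = np.concatenate([
    [rms.mean(), rms.std()],
    [np.nanmean(f0), np.nanstd(f0)],
    mfcc.mean(axis=1), mfcc.std(axis=1),
])
```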

3. Speech Representation and Emotion Recognition

After the recent rapid development of artificial intelligence technologies such as machine learning and deep learning, affective computing began to appear in various applications, such as robot dialogue and medical care. Affective computing infers the user's emotions and responds accordingly by sensing and interpreting differences in human faces, gestures, and speech across different states. Within this field, emotion recognition from speech alone is the most challenging and the most widely applicable technology, and its development depends heavily on the construction of emotional speech datasets.

Emotional speech corpora can be roughly divided into two categories. The first type is guided recording, mostly produced in a laboratory or recording studio with high-quality microphones under the guidance of linguistic experts. Such data yield emotional corpora with high emotional expressiveness and diversity. Representative sentiment corpora include:

  • Emo-DB [19], recorded by the Technical University of Berlin, Germany, with 10 actors (5 male and 5 female) performing 10 German utterances, for a total of 800 sentences.
  • IEMOCAP [20], recorded by the University of Southern California, with 10 actors performing across 5 sessions; each utterance is assessed by at least three experts.
  • CASIA [21], a Chinese sentiment corpus recorded by the Institute of Automation of the Chinese Academy of Sciences, in which two men and two women recorded 500 different texts.

The second type is non-laboratory recording. Unlike guided recording, these corpora consist of spontaneous emotional expressions from natural scenes, such as everyday environments or theatrical performances. This relatively new type includes:

  • NNIME [22], the NTHU-NTUA Chinese Interactive Emotion Corpus, a performing-arts corpus that combines speech, drama, body language, and scene design.
  • CHEAVD [23], the CASIA Chinese Natural Emotional Audio–Visual Database, which extracts 140 min of emotional clips from movies, TV dramas, and talk shows, performed by 238 actors ranging from children to the elderly and annotated by 4 native Chinese speakers.

This study adopts the public version of the CASIA Chinese sentiment corpus. The emotional sounds are divided into six categories: ‘Happiness’, ‘Sadness’, ‘Angry’, ‘Fright’, ‘Calm’, and ‘Fear’. Mapped onto an underlying emotion–cognition dimensional model such as James Russell’s arousal–valence four-quadrant model [24], these six emotions belong to quadrants I, III, II, I, IV, and II, respectively.

In recent years, deep learning has made great progress in speech representation. Baevski, Schneider et al. [25][26] proposed wav2vec, an unsupervised speech recognition framework that can support an automatic speech recognition model with only 10 min of transcribed speech data. In 2021, Hsu et al. proposed a speech pre-training model [27] that surpasses wav2vec 2.0. The authors of [27] pointed out several problems in the unsupervised learning of speech: speech contains many pronunciation units, the units differ in length, and there is no fixed segmentation between them.
To address these problems, the idea in [27] is to generate target labels by clustering and then mask those labels as targets for unsupervised learning. Meanwhile, researchers at Microsoft Research Asia proposed UniSpeech [28], which can leverage both supervised and unsupervised data to learn a unified contextual representation. The model consists of a convolutional feature extraction network, a Transformer-based context network, and a feature quantization module for learning discrete vectors. In certain settings, UniSpeech significantly outperforms supervised transfer learning. Also in 2021, researchers from Microsoft Research Asia and the Microsoft Azure Speech Group proposed WavLM [29], a general speech pre-training model that achieved state-of-the-art performance on multiple speech datasets.

Although speech representation approaches can effectively provide text or vector representations at the coding level, they cannot by themselves judge the user's emotions at the application level. Speech emotion recognition requires an emotional speech database for training. Public emotion corpora commonly used in recent studies include the Berlin Database of Emotional Speech (EMO-DB) [19], FAU Aibo [30], and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [31]. A typical machine learning speech emotion recognition system includes speech input, feature extraction, a classification model, and recognized emotional output. Commonly used classification models include SVM [32], HMM [33], and the Gaussian Mixture Model (GMM) [34]. Lin and Wei [21] used SVM and HMM classifiers to identify emotion categories such as angry, happy, sad, surprised, and calm; 39 candidate features were extracted, and Sequential Forward Selection (SFS) was used to find the best feature subset, yielding a final average recognition accuracy of 99.5% for the HMM classifier and 88.9% for the SVM classifier. Lim et al. [35] first applied a Short-Time Fourier Transform (STFT) to convert the voice data into spectrograms and then fed them into a serial Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) model for speech emotion recognition, covering the emotions ‘Angry’, ‘Happy’, ‘Sad’, ‘Calm’, ‘Fearful’, ‘Disgust’, and ‘Bored’. Their model combines a four-layer CNN with a long short-term memory (LSTM) network, and the final emotion recognition accuracy is 88%.
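The serial CNN + LSTM pipeline described by Lim et al. [35] can be sketched roughly as follows. The layer counts, kernel sizes, pooling choices, and the seven output classes here are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of a CNN + LSTM speech emotion classifier operating on
# log-magnitude STFT spectrograms, in the spirit of [35]. All layer sizes
# and the 7 emotion classes are illustrative assumptions.
import torch
import torch.nn as nn


class CNNLSTMEmotion(nn.Module):
    def __init__(self, n_freq_bins=128, n_classes=7):
        super().__init__()
        # Convolutional front end over the (frequency, time) spectrogram
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(input_size=32 * (n_freq_bins // 4),
                            hidden_size=128, batch_first=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, spec):                   # spec: (batch, 1, freq, time)
        z = self.cnn(spec)                     # (batch, 32, freq//4, time//4)
        z = z.permute(0, 3, 1, 2).flatten(2)   # (batch, time//4, 32 * freq//4)
        _, (h, _) = self.lstm(z)               # final hidden state summarizes the utterance
        return self.out(h[-1])                 # (batch, n_classes) emotion logits


spec = torch.randn(4, 1, 128, 300)             # e.g. 128 frequency bins x 300 frames
print(CNNLSTMEmotion()(spec).shape)            # torch.Size([4, 7])
```

A usage note: in practice the spectrogram input would come from an STFT of the waveform rather than random tensors, and the class count would match the labels of the chosen corpus.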
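For the pretrained representation models discussed above (wav2vec 2.0, HuBERT, UniSpeech, WavLM), frame-level vectors can be extracted from a public checkpoint and pooled as input to a downstream emotion classifier. The sketch below uses the torchaudio wav2vec 2.0 base bundle; the input file and the mean-pooling step are illustrative assumptions rather than the setup used in the cited works.

```python
# Hedged sketch: extracting frame-level speech representations from a
# pretrained wav2vec 2.0 model with torchaudio. The pooled vector could
# feed a downstream emotion classifier; the pooling choice is illustrative.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE        # pretrained wav2vec 2.0 (base)
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("speech.wav")        # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)  # list of per-layer tensors
    last = features[-1]                             # (batch, frames, 768)
    utterance_vector = last.mean(dim=1)             # simple mean pooling over time
```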