Techniques Related to Chinese Speech Emotion Recognition

The use of artificial intelligence for emotion recognition has attracted much attention. Emotion recognition has broad industrial applicability and strong potential for further development.

  • emotion recognition
  • deep neural network
  • acoustic features

1. Introduction

Language is the main way people communicate. Beyond the literal meaning of an utterance, speech also transmits emotion: through emotional cues, tone, and other signals, a listener can sense a speaker’s feelings even without understanding the words themselves. In recent years, the use of artificial intelligence and deep learning for emotion recognition has attracted much attention, and emotion recognition has broad industrial applicability and strong development potential. In everyday applications, human–computer interaction has gradually shifted from touch-sensitive interfaces to voice commands and spoken dialogue. Speech recognition is widely used in transportation, catering, customer service systems, personal health care, and leisure entertainment [1][2][3][4][5]. Automatic Speech Recognition (ASR) technology has matured and can now accurately transcribe speech into text [6][7][8]. However, beyond the linguistic content exchanged in a dialogue, the emotions that accompany it also carry important information. Because emotion is so rich in information, Automatic Emotional Speech Recognition (AESR) is expected to be a focus of the next generation of speech technology.

The use of deep-learning-related techniques to recognize speech emotions has increased rapidly. Li et al. [9] combined a deep neural network with a Hidden Markov Model to construct a speech emotion recognition model, achieving significant results on the EMO-DB dataset. Mao et al. [10] and Zhang et al. [11] verified that convolutional neural networks can effectively learn emotional features from speech. Umamaheswari and Akila [12] combined a Pattern Recognition Neural Network (PRNN) with a KNN algorithm, obtaining better results than traditional HMM and GMM approaches. Mustaqeem and Kwon [13] proposed a deep stride strategy to construct spectrogram feature maps and achieved good recognition performance on the well-known IEMOCAP and RAVDESS datasets. In 2021, Li et al. [14] proposed a bidirectional LSTM model combined with a self-attention mechanism for speech emotion recognition, which achieved remarkable performance on the well-known IEMOCAP and EMO-DB corpora; the proposed model reached the highest recent recognition accuracy for the ‘Happiness’ and ‘Anger’ emotions. To date, most AESR research has focused on English or other European languages [15][16], and research applying deep neural networks to Chinese speech emotion recognition remains relatively rare.
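For orientation, a minimal sketch of the kind of recurrent-attention classifier mentioned above, a bidirectional LSTM with self-attention pooling over frame-level features, is given below in PyTorch; the feature dimension, layer sizes, and attention form are illustrative assumptions and do not reproduce the configuration of Li et al. [14].

```python
import torch
import torch.nn as nn

class BiLSTMAttentionSER(nn.Module):
    """Minimal BiLSTM + self-attention classifier over frame-level features.

    Sketch only: layer sizes and the attention form are illustrative
    assumptions, not the exact configuration of any cited work.
    """
    def __init__(self, feat_dim=40, hidden=128, n_emotions=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # one attention score per frame
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x):                             # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)                           # h: (batch, frames, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # normalize scores over frames
        utterance = (weights * h).sum(dim=1)          # attention-weighted pooling
        return self.classifier(utterance)             # emotion logits

# Example: a batch of 8 utterances, each with 300 frames of 40-dim features
logits = BiLSTMAttentionSER()(torch.randn(8, 300, 40))
```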

2. Acoustic Features

The extraction and selection of acoustic features is an important part of speech recognition. Sound is usually analyzed by short-term analysis: the signal is cut into frames, and the signal within each frame is analyzed. Three main sound characteristics can be observed, as follows (a brief extraction sketch is given after the list):
Volume: related to the amplitude of the waveform; the greater the amplitude, the greater the volume of the sound.
Pitch: expresses how high or low the sound is in terms of frequency; the higher the fundamental frequency, the higher the pitch.
Timbre: represents the content of the sound, reflected in the shape of the waveform within each fundamental cycle; different timbres correspond to different audio content.
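As a concrete illustration of such frame-based analysis, the sketch below estimates per-frame volume (RMS energy), pitch (fundamental frequency), and a timbre descriptor (MFCCs) for one utterance. The use of the librosa library, the file name, and the frame settings are assumptions for illustration, not part of the original entry.

```python
# Minimal sketch of short-term (frame-based) acoustic analysis with librosa.
# Library choice, file name, and frame settings are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)           # hypothetical input file

frame_length, hop_length = 400, 160                     # 25 ms frames, 10 ms hop at 16 kHz

# Volume: root-mean-square energy per frame (larger amplitude -> larger value)
rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]

# Pitch: per-frame fundamental frequency estimate (NaN for unvoiced frames)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=sr, frame_length=frame_length * 4, hop_length=hop_length)

# Timbre: MFCCs summarize the spectral shape within each frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)

print(rms.shape, np.nanmean(f0), mfcc.shape)
```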
There has been extensive research on specific features related to emotion in speech and audio. Schuller et al. [17] used short-term analysis to define a feature set of 6373 features. In addition, Eyben et al. [18] proposed a minimalistic feature set, the Geneva Minimalistic Acoustic Parameter Set (GeMAPS), consisting of 62 features. 
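For standardized sets such as GeMAPS, utterance-level functionals can be extracted with the openSMILE toolkit; the sketch below uses its Python wrapper, opensmile. The package choice and file name are assumptions; only the 62-parameter GeMAPS configuration comes from the text.

```python
# Sketch: extracting GeMAPS functionals for one utterance with the opensmile
# Python package; package availability and the file name are assumptions.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,      # 62-parameter minimalistic set
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("speech.wav")           # one row of utterance-level features
print(features.shape)
```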

3. Speech Representation and Emotion Recognition

With the recent rapid development of artificial intelligence technologies such as machine learning and deep learning, affective computing has begun to appear in various applications, such as robot dialogue and medical care. Affective computing senses and interprets differences in human faces, gestures, and speech across states in order to infer the user’s emotions and respond accordingly. Within this field, emotion recognition from speech alone is the most challenging and the most widely applied technology, and its development depends heavily on the construction of emotional speech datasets. Emotional speech corpora can be roughly divided into two categories.
The first type is guided recording, mostly carried out in a laboratory or recording studio with high-quality microphones and under the guidance of linguistic experts. Such recordings can produce emotional corpora with high emotional expressiveness and diversity. Representative corpora include: Emo-DB [19], recorded by the Technical University of Berlin, Germany, with 10 actors (5 male and 5 female) performing 10 German sentences, for a total of 800 utterances; IEMOCAP [20], recorded by the University of Southern California, with 10 actors recorded across 5 sessions, where each utterance is assessed by at least three experts; and CASIA [21], a Chinese emotion corpus recorded by the Institute of Automation of the Chinese Academy of Sciences, in which two men and two women recorded 500 different texts.
The other type is non-laboratory recording. Unlike guided recording, it consists of spontaneous emotional expressions from natural scenes, for example, everyday living environments or theatrical performance excerpts. Corpora of this type are relatively new and include: NNIME [22], the NTHU-NTUA Chinese Interactive Multimodal Emotion Corpus, a performing-arts corpus that combines speech, drama, body language, and scene design; and CHEAVD [23], the CASIA Chinese Natural Emotional Audio–Visual Database, which extracts 140 min of emotional clips from movies, TV dramas, and talk shows, covers 238 speakers ranging from children to the elderly, and is annotated by 4 native Chinese speakers.
In the public version of the CASIA Chinese emotion corpus, the emotional recordings are divided into six categories: ‘Happiness’, ‘Sadness’, ‘Angry’, ‘Fright’, ‘Calm’, and ‘Fear’. Mapped onto underlying emotion–cognitive dimensions, such as James Russell’s Arousal-Valence four-quadrant model [24], these six emotions belong to quadrants I, III, II, I, IV, and II, respectively (the mapping is written out in the sketch below).
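For reference, the category-to-quadrant assignment stated above can be written down directly; the snippet below simply encodes that mapping and introduces no information beyond the text.

```python
# Mapping of the six CASIA emotion categories to Russell Arousal-Valence
# quadrants, as stated in the text; the comments are illustrative shorthand.
CASIA_TO_QUADRANT = {
    "Happiness": "I",    # high arousal, positive valence
    "Sadness":   "III",  # low arousal, negative valence
    "Angry":     "II",   # high arousal, negative valence
    "Fright":    "I",
    "Calm":      "IV",   # low arousal, positive valence
    "Fear":      "II",
}
```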
Deep learning has made great progress in speech representation. Schneider, Baevski et al. [25][26] proposed the wav2vec models, which learn speech representations through unsupervised pre-training; the wav2vec 2.0 framework can support an automatic speech recognition model with only 10 min of transcribed speech data. In 2021, Hsu et al. proposed HuBERT [27], a speech pre-training model that surpasses wav2vec 2.0. The authors of [27] pointed out several difficulties in the unsupervised learning of speech: an utterance contains many pronunciation units, the units vary in length, and speech has no fixed segmentation into units. To address these problems, the idea of [27] is to generate labels by clustering and then to use masked prediction of those labels as the unsupervised learning target. Meanwhile, researchers at Microsoft Research Asia proposed a method called UniSpeech [28], which is able to leverage both supervised and unsupervised data to learn a unified contextual representation. The model includes a convolutional feature extraction network, a Transformer-based context network, and a feature quantization module for learning discrete vectors. In specific settings, UniSpeech is significantly better than supervised transfer learning. Further, in 2021, researchers from Microsoft Research Asia and the Microsoft Azure Speech Group proposed a general speech pre-training model, WavLM [29], which achieved state-of-the-art performance on multiple speech datasets.
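To illustrate how such pre-trained representations are typically consumed downstream, the sketch below extracts frame-level wav2vec 2.0 embeddings with the Hugging Face transformers library; the checkpoint name, library choice, and placeholder waveform are assumptions rather than details from the cited works.

```python
# Sketch: extracting contextual speech representations from a pre-trained
# wav2vec 2.0 checkpoint (library and checkpoint name are assumptions).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)                   # 1 s of 16 kHz audio as a placeholder
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768) contextual vectors

# These frame-level vectors can then be pooled and fed to an emotion classifier.
print(hidden.shape)
```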
Although speech representation approaches can effectively provide text or vector representations at the encoding level, they cannot by themselves judge the user’s emotions at the application level; speech emotion recognition still requires an emotional speech database for training. Public emotion corpora commonly used in recent studies include the Berlin Database of Emotional Speech [19], FAU Aibo [30], and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [31]. A typical machine learning speech emotion recognition system comprises speech input, feature extraction, a classification model, and emotion output. Commonly used classification models include the SVM [32], HMM [33], and Gaussian Mixture Model (GMM) [34].
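A minimal sketch of such a pipeline is given below with scikit-learn, assuming utterance-level feature vectors and emotion labels are already available (e.g., from a feature extractor such as the GeMAPS sketch above); the placeholder data, split, and SVM hyperparameters are illustrative assumptions.

```python
# Sketch of a classical SER pipeline: utterance-level features -> SVM classifier.
# X (n_utterances, n_features) and y (emotion labels) are placeholders standing
# in for features extracted from a real emotional speech corpus.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 62))                     # placeholder feature matrix
y = rng.integers(0, 6, size=200)                   # placeholder labels for 6 emotions

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```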
Lin and Wei [21] used SVM and HMM classifiers to identify emotion categories such as angry, happy, sad, surprised, and calm. In total, 39 candidate features were extracted, and Sequential Forward Selection (SFS) was used to find the best feature subset; the final average recognition accuracy was 99.5% for the HMM classifier and 88.9% for the SVM classifier. Lim et al. [35] first applied the Short-Time Fourier Transform (STFT) to convert the voice data into spectrograms and fed them into a model that places a Convolutional Neural Network (CNN) in series with a Recurrent Neural Network (RNN) for speech emotion recognition, covering the emotions ‘Angry’, ‘Happy’, ‘Sad’, ‘Calm’, ‘Fearful’, ‘Disgust’, and ‘Bored’. Their model combines a four-layer CNN with a long short-term memory (LSTM) network and achieves a final emotion recognition accuracy of 88%.
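The following is a rough sketch of this kind of spectrogram-based CNN-LSTM classifier in PyTorch; the layer counts, filter sizes, and pooling choices are illustrative assumptions and do not reproduce the exact architecture of Lim et al. [35].

```python
import torch
import torch.nn as nn

class CNNLSTM_SER(nn.Module):
    """Sketch of a spectrogram-based CNN + LSTM emotion classifier.

    Illustrative only: filter counts, kernel sizes, and the single LSTM layer
    are assumptions, not the exact architecture of the cited work.
    """
    def __init__(self, n_mels=64, n_emotions=7):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(32 * (n_mels // 4), 128, batch_first=True)
        self.out = nn.Linear(128, n_emotions)

    def forward(self, spec):                      # spec: (batch, 1, n_mels, frames)
        f = self.cnn(spec)                        # (batch, 32, n_mels/4, frames/4)
        f = f.permute(0, 3, 1, 2).flatten(2)      # (batch, frames/4, 32 * n_mels/4)
        h, _ = self.lstm(f)                       # run the LSTM over the time axis
        return self.out(h[:, -1])                 # classify from the last time step

# Example: a batch of 4 log-spectrograms with 64 mel bands and 400 frames
logits = CNNLSTM_SER()(torch.randn(4, 1, 64, 400))
```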

This entry is adapted from the peer-reviewed paper 10.3390/s22134744

References

  1. Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J.G. Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 2001, 18, 32–80.
  2. El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit. 2011, 44, 572–587.
  3. Koolagudi, S.G.; Rao, K.S. Emotion recognition from speech: A review. Int. J. Speech Technol. 2012, 15, 99–117.
  4. Song, T.; Zheng, W.; Song, P.; Cui, Z. EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Trans. Affect. Comput. 2018, 11, 532–541.
  5. Zhang, J.; Yin, Z.; Chen, P.; Nichele, S. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Inf. Fusion 2020, 59, 103–126.
  6. Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76.
  7. Kaur, J.; Singh, A.; Kadyan, V. Automatic speech recognition system for tonal languages: State-of-the-art survey. Arch. Comput. Methods Eng. 2020, 28, 1039–1068.
  8. Schuller, B.W. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99.
  9. Li, L.; Zhao, Y.; Jiang, D.; Zhang, Y.; Wang, F.; Gonzalez, I.; Sahli, H. Hybrid deep neural network—Hidden Markov model (dnn-hmm) based speech emotion recognition. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Geneva, Switzerland, 2–5 September 2013; pp. 312–317.
  10. Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimed. 2014, 16, 2203–2213.
  11. Zhang, S.; Zhang, S.; Huang, T.; Gao, W. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans. Multimed. 2017, 20, 1576–1590.
  12. Umamaheswari, J.; Akila, A. An enhanced human speech emotion recognition using hybrid of PRNN and KNN. In Proceedings of the 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 14–16 February 2019; pp. 177–183.
  13. Mustaqeem; Kwon, S. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2020, 20, 183.
  14. Li, D.; Liu, J.; Yang, Z.; Sun, L.; Wang, Z. Speech emotion recognition using recurrent neural networks with directional self-attention. Expert Syst. Appl. 2021, 173, 114683.
  15. Abbaschian, B.J.; Sierra-Sosa, D.; Elmaghraby, A. Deep learning techniques for speech emotion recognition, from databases to models. Sensors 2021, 21, 1249.
  16. Fahad, M.S.; Ranjan, A.; Yadav, J.; Deepak, A. A survey of speech emotion recognition in natural environment. Digit. Signal Processing 2021, 110, 102951.
  17. Schuller, B.; Steidl, S.; Batliner, A.; Vinciarelli, A.; Scherer, K.; Ringeval, F.; Chetouani, M.; Weninger, F.; Eyben, F.; Marchi, E.; et al. The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH 2013), Lyon, France, 25–29 August 2013.
  18. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Truong, K.P. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202.
  19. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the 9th European Conference on Speech Communication and Technology, Lisboa, Portugal, 4–8 September 2005; Volume 5, pp. 1517–1520.
  20. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
  21. Lin, Y.L.; Wei, G. Speech emotion recognition based on HMM and SVM. In Proceedings of the 2005 International Conference on Machine Learning and Cybernetics, Guangzhou, China, 18–21 August 2005; Volume 8, pp. 4898–4901.
  22. Chou, H.C.; Lin, W.C.; Chang, L.C.; Li, C.C.; Ma, H.P.; Lee, C.C. NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus. In Proceedings of the 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, TX, USA, 23–26 October 2017; pp. 292–298.
  23. Li, Y.; Tao, J.; Chao, L.; Bao, W.; Liu, Y. CHEAVD: A Chinese natural emotional audio–visual database. J. Ambient Intell. Humaniz. Comput. 2017, 8, 913–924.
  24. Russell, J.A.; Pratt, G. A description of the affective quality attributed to environments. J. Personal. Soc. Psychol. 1980, 38, 311.
  25. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv 2019, arXiv:1904.05862.
  26. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
  27. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460.
  28. Wang, C.; Wu, Y.; Qian, Y.; Kumatani, K.; Liu, S.; Wei, F.; Zeng, M.; Huang, X. UniSpeech: Unified speech representation learning with labeled and unlabeled data. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10937–10947.
  29. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X. WavLM: Large-scale self-supervised pre-training for full stack speech processing. arXiv 2021, arXiv:2110.13900.
  30. Batliner, A.; Steidl, S.; Nöth, E. Releasing a thoroughly annotated and processed spontaneous emotional database: The FAU Aibo emotion corpus. In Proceedings of the Satellite Workshop of LREC, Marrakech, Morocco, 26–27 May 2008; Volume 28.
  31. Livingstone, S.R.; Russo, F.A. The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391.
  32. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  33. Starner, T.; Pentland, A. Real-time American sign language recognition from video using hidden Markov models. In Motion-based Recognition; Springer: Dordrecht, The Netherlands, 1997; pp. 227–243.
  34. Povey, D.; Burget, L.; Agarwal, M.; Akyazi, P.; Kai, F.; Ghoshal, A.; Glembek, O.; Goel, N.; Karafiát, M.; Rastrow, A.; et al. The subspace Gaussian mixture model—A structured model for speech recognition. Comput. Speech Lang. 2011, 25, 404–439.
  35. Lim, W.; Jang, D.; Lee, T. Speech emotion recognition using convolutional and recurrent neural networks. In Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Jeju, Korea, 13–15 December 2016; pp. 1–4.