Speech Emotion Recognition Using Convolutional Neural Networks

Speech emotion recognition (SER) is a challenging task in human–computer interaction (HCI) systems. One of the key challenges in speech emotion recognition is to extract the emotional features effectively from a speech utterance.

  • speech emotion recognition
  • convolutional neural networks
  • convolutional Transformer encoder

1. Introduction

In the context of rapidly advancing Artificial Intelligence (AI), human–computer interaction (HCI) is being studied in depth. We live in a world where voice assistants such as Siri and Alexa are ever more present in daily life. Understanding human emotions paves the way toward understanding people’s needs. Speech emotion recognition (SER) systems [1] classify emotions in speech utterances and are vital in advancing HCI, healthcare, customer satisfaction, social media analysis, stress monitoring, and intelligent systems. Moreover, SER systems are useful in online tutorials, language translation, intelligent driving, and therapy sessions. In some situations, humans can be substituted by computer-generated characters with the ability to act naturally and communicate convincingly by expressing human-like emotions. Machines therefore need to interpret the emotions carried by speech utterances; only with such an ability can a fully expressive dialogue based on mutual human–machine trust and understanding be achieved.
SER is a challenging task for several reasons: (i) emotions are subjective, and their expression can vary significantly across individuals, as different people may use different patterns of speech, tone, and vocal cues to convey the same emotion; (ii) the availability of high-quality, diverse, and standardized datasets is crucial for training and evaluating SER models; and (iii) emotions are often context-dependent, and the same speech utterance can convey different emotions depending on the situational context.
Speech emotion recognition systems have gained attention due to the extensive use of deep learning. Prior to deep learning, SER systems relied on techniques such as hidden Markov models (HMM) [2], Gaussian mixture models (GMM) [3], and support vector machines (SVM) [4], along with extensive preprocessing and careful feature engineering. Comprehensive reviews of SER systems are available in [5][6], and a benchmark comparison is available in [7]. With the development of deep learning tools and processes, solutions for SER have also changed. Significant studies have proposed SER techniques to recognize and classify various emotions in speech [8][9][10][11][12][13][14]. In addition, recent developments in deep learning have produced a wave of studies on SER using long short-term memory networks, recurrent neural networks, generative adversarial networks, and autoencoders to address these challenges [15][16][17][18][19][20][21][22].
In the recent past, deep learning has significantly contributed to natural language understanding (NLU). Deep belief network (DBN)-based SER in [23][24] showed a substantial improvement over the baseline non-DL models [25]. Extreme learning machine (ELM)-based SER in [26][27] used feature representations from the probability distributions at the segment level, employing a single-hidden-layer neural network to classify speech emotions at the utterance level. Deep hierarchical models, data augmentation, and regularization-based DNNs for SER are proposed in [28], whereas deep CNNs using spectrograms are proposed in [29]. DNNs are trained for SER with acoustic features extracted from short intervals of speech using a probabilistic CTC loss function in [30]. Bidirectional LSTM-based SER in [31] is trained on feature sequences and achieves better accuracy than DNN-ELM [26]. Deep CNN+LSTM-based SER in [32] achieves even better results; the hybrid deep CNN + LSTM improves the SER accuracy but raises the overall computational complexity. Auditory–visual modality (AVM)-based SER in [33] captures emotional content from different speaking styles. The Tensor Fusion Network (TFN)-based SER in [34] learns intra- and inter-modality dynamics. Convolutional deep belief network-based SER in [35] learns multimodal feature representations linked to expressions. A single plain CNN model struggles to classify the speaker’s emotional state with the required accuracy because it loses some of the sequential information during the convolution operation. Two parallel CNN models can mitigate this loss of important information; the study in [36] builds such a parallel architecture and applies it to SER, as sketched below.
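For illustration only, the following is a minimal PyTorch sketch of the parallel-CNN idea: two convolutional branches with differently shaped kernels process the same MFCC "image", and their outputs are concatenated before classification. The branch shapes, layer sizes, and seven-emotion output are illustrative assumptions and do not reproduce the exact architecture of [36].

```python
import torch
import torch.nn as nn

class ParallelCNN(nn.Module):
    """Sketch of two parallel 2-D CNN branches over an MFCC 'image'.
    Kernel shapes and layer sizes are illustrative assumptions."""
    def __init__(self, n_emotions: int = 7):
        super().__init__()
        # Branch A: kernels wide in time (temporal patterns)
        self.branch_a = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(3, 9), padding=(1, 4)),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        # Branch B: kernels tall in frequency (spectral patterns)
        self.branch_b = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(9, 3), padding=(4, 1)),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.classifier = nn.Linear(2 * 16 * 8 * 8, n_emotions)

    def forward(self, x):  # x: (batch, 1, n_mfcc, n_frames)
        a = self.branch_a(x).flatten(1)
        b = self.branch_b(x).flatten(1)
        return self.classifier(torch.cat([a, b], dim=1))

# Usage: a batch of four hypothetical 40x200 MFCC "images"
logits = ParallelCNN()(torch.randn(4, 1, 40, 200))
print(logits.shape)  # torch.Size([4, 7])
```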
Nearly all emotions can be described in terms of dominance, pleasure, and excitement; however, implementing such a deterministic system using DL is very challenging and complex. Therefore, in DL, statistical models and the clustering of samples are used to qualitatively classify emotions such as sadness, happiness, and anger. For the classification and clustering of emotions, features must be extracted from speech, usually relying on different types of prosody, voice quality, and spectral features [37]. The prosody features usually include the fundamental frequency (F0), intensity, and speaking rate, but they cannot confidently discriminate between angry and happy emotions. The features associated with voice quality are usually the most successful in determining the emotions of the same speaker; however, these features vary from speaker to speaker, making them difficult to use in speaker-independent settings [38]. Spectral features, on the other hand, are widely used to determine emotions from speech and can confidently distinguish anger from happiness. However, the magnitudes and shifts of the formant frequencies for identical emotions change across different vowels, which increases the complexity of the speech emotion recognition system [39]. For each feature type, there are several standard representations. Prosody features are typically represented by F0 and speaking-rate measures [40], whereas spectral features are defined by cepstrum-based representations: mel-frequency cepstral coefficients (MFCC) or linear prediction cepstral coefficients (LPCC) are commonly used, and formants and other information can also be included [41]. Finally, the voice quality features usually include the normalized amplitude quotient, shimmer, and jitter [42].
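To make these feature families concrete, the snippet below computes one representative per family with librosa: an F0 contour and frame-level intensity for prosody, and MFCCs for the spectral family, summarized by simple utterance-level statistics. The file name, sampling rate, and parameter choices are assumptions for illustration; voice quality measures such as jitter and shimmer require dedicated estimators and are omitted here.

```python
import librosa
import numpy as np

# Illustrative sketch: the input file and parameter values are assumptions.
y, sr = librosa.load("utterance.wav", sr=16000)      # hypothetical mono utterance

# Prosody family: fundamental frequency (F0) contour and frame-level intensity
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)        # F0 trajectory in Hz
intensity = librosa.feature.rms(y=y)[0]              # RMS energy per frame

# Spectral family: mel-frequency cepstral coefficients (MFCCs)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)

# Simple utterance-level statistics often fed to a classifier
features = np.concatenate([
    [np.nanmean(f0), np.nanstd(f0)],
    [intensity.mean(), intensity.std()],
    mfcc.mean(axis=1), mfcc.std(axis=1),
])
print(features.shape)  # (30,)
```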
Feature extraction is a crucial step in many machine learning tasks, including speech recognition, computer vision, and natural language processing. The goal of feature extraction is to transform raw data into a representation that captures the most salient information for the task at hand. In speech recognition, features are typically extracted from the acoustic signal using techniques such as mel-frequency cepstral coefficients (MFCCs), which have been widely used in the literature due to their effectiveness in capturing the spectral envelope of a signal. Other popular techniques include perceptual linear predictive (PLP) features, gammatone features, and filterbank energies. In computer vision, features are extracted from images using techniques such as SIFT, SURF, and HOG, which are effective in capturing local visual patterns. In natural language processing, features are extracted from text using techniques such as bag-of-words, n-grams, and word embeddings, which capture the syntactic and semantic information in the text [43][44][45][46][47][48]. The study uses MFCCs as input features for several reasons. (i) The MFCCs are treated as a grayscale image and fed simultaneously to the parallel CNN and Transformer modules for spectral and temporal feature extraction. (ii) MFCCs capture the spectral envelopes of speech signals, which is crucial in characterizing different emotional states, and they are less sensitive to variations in speaker characteristics, background noise, and channel distortions, making them more robust for emotion recognition tasks. (iii) MFCCs are derived based on the human auditory system’s frequency resolution, which aligns well with how humans perceive and differentiate sounds; by focusing on perceptually relevant information, MFCCs can effectively capture the distinctive features related to emotions conveyed through speech. (iv) MFCCs provide a compact representation of speech signals by summarizing the spectral information into a smaller number of coefficients; this dimensionality reduction lowers the computational complexity and memory requirements of SER models while preserving the essential information needed for emotion classification. (v) By computing MFCCs over short time frames and applying temporal analysis techniques such as delta and delta–delta features, the dynamic changes in speech can be captured; emotions often manifest as temporal patterns in speech, and MFCCs enable the modeling of these dynamics, enhancing the discriminative power of SER models.
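A minimal sketch of the MFCC-based input described above, assuming librosa and a 16 kHz mono recording: static MFCCs plus delta and delta–delta trajectories are stacked into a multi-channel array that a CNN or Transformer front end can consume. The frame settings, file name, and normalization are illustrative choices, not the study's exact configuration.

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# 25 ms frames with 10 ms hop at 16 kHz (assumed settings)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=400, hop_length=160)
delta = librosa.feature.delta(mfcc)              # first-order temporal dynamics
delta2 = librosa.feature.delta(mfcc, order=2)    # second-order dynamics

# Stack into (3, n_mfcc, n_frames) and normalize each coefficient over time
stack = np.stack([mfcc, delta, delta2])
stack = (stack - stack.mean(axis=-1, keepdims=True)) / (
    stack.std(axis=-1, keepdims=True) + 1e-8
)
print(stack.shape)  # e.g. (3, 40, n_frames)
```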

2. Speech Emotion Recognition Using Convolutional Neural Networks

Speech emotion recognition is an attractive research field, and numerous novel techniques have been proposed to learn optimal SER solutions. A typical SER method contains two modules, namely feature representation and emotion classification; obtaining an optimal feature representation and a superior classifier for a robust SER system are both difficult tasks [9]. The MFCC feature-based SER in [49] classifies various emotions using the logistic model tree (LMT) classifier. An ensemble model using 20 SVMs with a Gaussian kernel in [50] is proposed for SER and achieves 75.79% accuracy. The 2D-CNN-based SER method in [51] recognizes emotions by extracting deep discriminative cues from spectrograms. Pre-trained CNN architectures such as AlexNet and VGG are used to construct an SER framework via transfer learning to classify emotions from spectrograms in [52]. A trained CNN model in [53] is utilized for the extraction of features from spectrograms, and speech emotions are classified using SVM. Moreover, the 1D-CNN + FCN-based SER in [54] uses prosodic and spectral features from MFCCs to classify various speech emotions. LSTMs and RNNs are used to model the long-term sequences in speech signals for SER [55]. The DNN-LSTM-based SER method in [56] uses a hybrid approach to learn spatiotemporal cues from raw speech data.
The CNN-BLSTM-based SER method in [57] learns the spatial features and temporal cues of speech signals and increases the accuracy over existing models: the system extracts spatial features and feeds them to a BLSTM to learn temporal cues for recognizing the emotional state. A DNN in [26] is used to compute the probability distributions over emotions for all segments; the DNN identifies emotions from utterance-level feature representations, and an ELM is then used to classify speech emotions from the given features. The CNN in [58] detects emotions with 66.1% accuracy when compared to a feature-based SVM, while the 1D-CNN in [59] reports 96.60% classification accuracy for negative emotions. The CNN-based SER in [60] learns deep features and employs a plain rectangular filter with a new pooling scheme to achieve more effective emotion discrimination. A novel attention-based SER utilizes a long attention process to link mel-spectrogram and INTERSPEECH-09 features and generate the attention weights for a CNN. A deep CNN-based SER in [61] builds on AlexNet, which was constructed for the ImageNet LSVRC-2010 challenge and trained with 1.2 million images; the network is fine-tuned with samples from the EMO-DB to recognize angry, sad, and happy emotions. An end-to-end context-aware SER system in [62] classifies speech emotions using CNNs followed by LSTM.
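For illustration, a minimal PyTorch sketch of the CNN-then-BLSTM pattern is given below: a small 2D CNN extracts frame-wise spatial features from the MFCC input, and a bidirectional LSTM models their temporal evolution before classification. Layer sizes and the emotion count are hypothetical and do not reproduce the exact model of [57].

```python
import torch
import torch.nn as nn

class CNNBLSTM(nn.Module):
    """Sketch of a CNN front end followed by a bidirectional LSTM.
    All sizes are illustrative assumptions."""
    def __init__(self, n_mfcc: int = 40, n_emotions: int = 7):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),                 # pool frequency only, keep all frames
        )
        self.blstm = nn.LSTM(32 * (n_mfcc // 2), 64,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 64, n_emotions)

    def forward(self, x):                          # x: (batch, 1, n_mfcc, n_frames)
        f = self.cnn(x)                            # (batch, 32, n_mfcc//2, n_frames)
        f = f.permute(0, 3, 1, 2).flatten(2)       # (batch, n_frames, 32 * n_mfcc//2)
        out, _ = self.blstm(f)                     # temporal modeling over frames
        return self.classifier(out[:, -1])         # last time step -> emotion logits

logits = CNNBLSTM()(torch.randn(4, 1, 40, 200))
print(logits.shape)  # torch.Size([4, 7])
```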
Unlike other deep learning SER frameworks, this approach does not use preselected features before network training and instead feeds raw input to the SER system. The ConvLSTM-based SER in [63] adopts convolutional LSTM layers for the state transitions so as to extract spatial cues; four local feature learning blocks (LFLBs) are used to extract the spatiotemporal cues in the hierarchical correlational form of speech signals utilizing a residual learning strategy. The BLSTM + CNN stacking-based SER in [64] matches the input formats and recognizes emotions by using logistic regression. BC-LSTM relies on context-aware utterance-level representations of features; this model captures the contextual cues from utterances using a BLSTM layer. The SVM-DBN-based SER in [65] improves emotion recognition via diverse feature representation, with gender-dependent and gender-independent results showing 80.11% accuracy. The deep-stride CNN-based SER in [66] uses raw spectrograms and learns discriminative features from them; after feature learning, a Softmax classifier is employed to classify speech emotions.
Attention mechanism-based deep learning for SER is another notable approach that has achieved considerable success; a complete review can be found in [67]. In classical DL-based SER, all features in a given utterance receive the same attention, yet emotions are not distributed uniformly over all localities of the speech samples. In attention-based DL, the classifier attends to specific localities of the samples via attention weights assigned to particular regions of the data. The SER system based on a multi-layer perceptron (MLP) and a dilated CNN in [68] uses channel and spatial attention to extract cues from input tensors. Bidirectional LSTM with a weighted-pooling scheme in [69] learns more illustrative feature representations of speech emotions; the model focuses on the main emotional aspects of an utterance while ignoring its other aspects. The self-attention and multitask learning CNN-BLSTM in [70] improves the SER accuracy by 7.7% in comparison with the multi-channel CNN [71] when applied to the IEMOCAP dataset; with speech spectrograms as input, gender classification is treated as a secondary task. The LSTM-based SER in [18] reduces computational complexity by replacing the LSTM forget gate with an attention gate, where attention is applied along the time and feature dimensions. The attention-based time-delay LSTM SER in [72] extracts high-level feature representations from raw speech waveforms to classify emotions.
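The locality-weighting idea can be summarized by a simple attention-pooling layer, sketched below in PyTorch under assumed dimensions: each frame-level feature vector receives a learned scalar weight, and the utterance representation is the attention-weighted sum of the frames.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Sketch of frame-level attention pooling: each time step gets a learned
    weight, and the utterance vector is the weighted sum. Sizes are assumptions."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # scalar relevance score per frame

    def forward(self, h):                          # h: (batch, n_frames, feat_dim)
        alpha = torch.softmax(self.score(h), dim=1)  # attention weights over frames
        return (alpha * h).sum(dim=1), alpha         # pooled: (batch, feat_dim)

# Usage: pool 300 frame-level feature vectors into one utterance vector
pooled, weights = AttentionPooling()(torch.randn(4, 300, 128))
print(pooled.shape, weights.shape)  # torch.Size([4, 128]) torch.Size([4, 300, 1])
```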
The deep RNN-based SER in [73] learns emotionally relevant acoustic features and aggregates them temporally into a compact utterance-level representation. Another deep CNN for SER is proposed in [74], together with a feature pooling strategy over time that uses local attention to focus on the emotionally prominent localities of a speech utterance. A self-attention mechanism utilizes a CNN via sequential learning to generate the attention weights. Another attention-based SER uses a fully connected neural network (FCNN), where frame- and utterance-level features are classified by applying MLP and attention processes. A multi-hop attention model for SER in [75] uses two BLSTM streams to extract the hidden cues from speech utterances, and the multi-hop attention is applied to generate the final weights for emotion classification. Other important research related to SER includes fake news and sentiment analysis, as emotions can also be found in fake news, negative sentiments, and hate speech [76][77][78][79][80][81].

This entry is adapted from the peer-reviewed paper 10.3390/s23136212

References

  1. Liu, Z.T.; Xie, Q.; Wu, M.; Cao, W.H.; Mei, Y.; Mao, J.W. Speech emotion recognition based on an improved brain emotion learning model. Neurocomputing 2018, 309, 145–156.
  2. Nwe, T.L.; Foo, S.W.; De Silva, L.C. Speech emotion recognition using hidden Markov models. Speech Commun. 2003, 41, 603–623.
  3. Patel, P.; Chaudhari, A.; Kale, R.; Pund, M. Emotion recognition from speech with gaussian mixture models via boosted gmm. Int. J. Res. Sci. Eng. 2017, 3, 294–297.
  4. Chen, L.; Mao, X.; Xue, Y.; Cheng, L.L. Speech emotion recognition: Features and classification models. Digit. Signal Process. 2012, 22, 1154–1160.
  5. Koolagudi, S.G.; Rao, K.S. Emotion recognition from speech: A review. Int. J. Speech Technol. 2012, 15, 99–117.
  6. Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76.
  7. Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Qadir, J.; Schuller, B.W. Survey of deep representation learning for speech emotion recognition. IEEE Trans. Affect. Comput. 2021, 14, 1634–1654.
  8. Fayek, H.M.; Lech, M.; Cavedon, L. Evaluating deep learning architectures for Speech Emotion Recognition. Neural Netw. 2017, 92, 60–68.
  9. Tuncer, T.; Dogan, S.; Acharya, U.R. Automated accurate speech emotion recognition system using twine shuffle pattern and iterative neighborhood component analysis techniques. Knowl.-Based Syst. 2021, 211, 106547.
  10. Singh, P.; Srivastava, R.; Rana, K.P.S.; Kumar, V. A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowl.-Based Syst. 2021, 229, 107316.
  11. Magdin, M.; Sulka, T.; Tomanová, J.; Vozár, M. Voice analysis using PRAAT software and classification of user emotional state. Int. J. Interact. Multimed. Artif. Intell. 2019, 5, 33–42.
  12. Huddar, M.G.; Sannakki, S.S.; Rajpurohit, V.S. Attention-based Multi-modal Sentiment Analysis and Emotion Detection in Conversation using RNN. Int. J. Interact. Multimed. Artif. Intell. 2021, 6, 112–121.
  13. Wang, K.; An, N.; Li, B.N.; Zhang, Y.; Li, L. Speech emotion recognition using Fourier parameters. IEEE Trans. Affect. Comput. 2015, 6, 69–75.
  14. Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimed. 2014, 16, 2203–2213.
  15. Ho, N.H.; Yang, H.J.; Kim, S.H.; Lee, G. Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network. IEEE Access 2020, 8, 61672–61686.
  16. Saleem, N.; Gao, J.; Khattak, M.I.; Rauf, H.T.; Kadry, S.; Shafi, M. Deepresgru: Residual gated recurrent neural network-augmented kalman filtering for speech enhancement and recognition. Knowl.-Based Syst. 2022, 238, 107914.
  17. Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323.
  18. Xie, Y.; Liang, R.; Liang, Z.; Huang, C.; Zou, C.; Schuller, B. Speech emotion classification using attention-based LSTM. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1675–1685.
  19. Wang, J.; Xue, M.; Culhane, R.; Diao, E.; Ding, J.; Tarokh, V. Speech emotion recognition with dual-sequence LSTM architecture. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6474–6478.
  20. Zhao, H.; Xiao, Y.; Zhang, Z. Robust semisupervised generative adversarial networks for speech emotion recognition via distribution smoothness. IEEE Access 2020, 8, 106889–106900.
  21. Shilandari, A.; Marvi, H.; Khosravi, H.; Wang, W. Speech emotion recognition using data augmentation method by cycle-generative adversarial networks. Signal Image Video Process. 2022, 16, 1955–1962.
  22. Yi, L.; Mak, M.W. Improving speech emotion recognition with adversarial data augmentation network. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 172–184.
  23. Huang, C.; Gong, W.; Fu, W.; Feng, D. A research of speech emotion recognition based on deep belief network and SVM. Math. Probl. Eng. 2014, 2014, 749604.
  24. Huang, Y.; Tian, K.; Wu, A.; Zhang, G. Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition. J. Ambient. Intell. Humaniz. Comput. 2019, 14, 1787–1798.
  25. Schuller, B.W. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99.
  26. Guo, L.; Wang, L.; Dang, J.; Liu, Z.; Guan, H. Exploration of complementary features for speech emotion recognition based on kernel extreme learning machine. IEEE Access 2019, 7, 75798–75809.
  27. Han, K.; Yu, D.; Tashev, I. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the Interspeech, Singapore, 14–18 September 2014.
  28. Tiwari, U.; Soni, M.; Chakraborty, R.; Panda, A.; Kopparapu, S.K. Multi-conditioning and data augmentation using generative noise model for speech emotion recognition in noisy conditions. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7194–7198.
  29. Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech emotion recognition from spectrograms with deep convolutional neural network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea, 13–15 February 2017; pp. 1–5.
  30. Dong, Y.; Yang, X. Affect-salient event sequence modelling for continuous speech emotion recognition. Neurocomputing 2021, 458, 246–258.
  31. Chen, Q.; Huang, G. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition. Eng. Appl. Artif. Intell. 2021, 102, 104277.
  32. Atila, O.; Şengür, A. Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition. Appl. Acoust. 2021, 182, 108260.
  33. Lambrecht, L.; Kreifelts, B.; Wildgruber, D. Gender differences in emotion recognition: Impact of sensory modality and emotional category. Cogn. Emot. 2014, 28, 452–469.
  34. Fu, C.; Liu, C.; Ishi, C.T.; Ishiguro, H. Multi-modality emotion recognition model with GAT-based multi-head inter-modality attention. Sensors 2020, 20, 4894.
  35. Liu, D.; Chen, L.; Wang, Z.; Diao, G. Speech expression multimodal emotion recognition based on deep belief network. J. Grid Comput. 2021, 19, 22.
  36. Zhao, Z.; Li, Q.; Zhang, Z.; Cummins, N.; Wang, H.; Tao, J.; Schuller, B.W. Combining a parallel 2d cnn with a self-attention dilated residual network for ctc-based discrete speech emotion recognition. Neural Netw. 2021, 141, 52–60.
  37. Gangamohan, P.; Kadiri, S.R.; Yegnanarayana, B. Analysis of emotional speech—A review. Towar. Robot. Soc. Believable Behaving Syst. 2016, 1, 205–238.
  38. Gobl, C.; Chasaide, A.N. The role of voice quality in communicating emotion, mood and attitude. Speech Commun. 2003, 40, 189–212.
  39. Vlasenko, B.; Philippou-Hübner, D.; Prylipko, D.; Böck, R.; Siegert, I.; Wendemuth, A. Vowels formants analysis allows straightforward detection of high arousal emotions. In Proceedings of the 2011 IEEE International Conference on Multimedia and Expo, Barcelona, Spain, 11–15 July 2011; pp. 1–6.
  40. Lee, C.M.; Narayanan, S.S. Toward detecting emotions in spoken dialogs. IEEE Trans. Speech Audio Process. 2005, 13, 293–303.
  41. Schuller, B.; Rigoll, G. Timing levels in segment-based speech emotion recognition. In Proceedings of the INTERSPEECH 2006, Proceedings International Conference on Spoken Language Processing ICSLP, Pittsburgh, PA, USA, 17–21 September 2006.
  42. Lugger, M.; Yang, B. The relevance of voice quality features in speaker independent emotion recognition. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Honolulu, HI, USA, 15–20 April 2007; Volume 4, p. IV-17.
  43. Mutlag, W.K.; Ali, S.K.; Aydam, Z.M.; Taher, B.H. Feature extraction methods: A review. J. Phys. Conf. Ser. 2020, 1591, 012028.
  44. Cavalcante, R.C.; Minku, L.L.; Oliveira, A.L. Fedd: Feature extraction for explicit concept drift detection in time series. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 740–747.
  45. Phinyomark, A.; Quaine, F.; Charbonnier, S.; Serviere, C.; Tarpin-Bernard, F.; Laurillau, Y. Feature extraction of the first difference of EMG time series for EMG pattern recognition. Comput. Methods Programs Biomed. 2014, 177, 247–256.
  46. Schneider, T.; Helwig, N.; Schütze, A. Automatic feature extraction and selection for classification of cyclical time series data. Tech. Mess. 2017, 84, 198–206.
  47. Salau, A.O.; Jain, S. Feature extraction: A survey of the types, techniques, applications. In Proceedings of the 2019 International Conference on Signal Processing and Communication (ICSC), Noida, India, 7–9 March 2019; pp. 158–164.
  48. Salau, A.O.; Olowoyo, T.D.; Akinola, S.O. Accent classification of the three major nigerian indigenous languages using 1d cnn lstm network model. In Advances in Computational Intelligence Techniques; Springer: Singapore, 2020; pp. 1–16.
  49. Zamil, A.A.A.; Hasan, S.; Baki, S.M.J.; Adam, J.M.; Zaman, I. Emotion detection from speech signals using voting mechanism on classified frames. In Proceedings of the 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh, 10–12 January 2019; pp. 281–285.
  50. Bhavan, A.; Chauhan, P.; Shah, R.R. Bagged support vector machines for emotion recognition from speech. Knowl.-Based Syst. 2019, 184, 104886.
  51. Huang, Z.; Dong, M.; Mao, Q.; Zhan, Y. Speech emotion recognition using CNN. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 801–804.
  52. Latif, S.; Rana, R.; Younis, S.; Qadir, J.; Epps, J. Transfer learning for improving speech emotion classification accuracy. arXiv 2018, arXiv:1801.06353.
  53. Xie, B.; Sidulova, M.; Park, C.H. Robust multimodal emotion recognition from conversation with transformer-based crossmodality fusion. Sensors 2021, 21, 4913.
  54. Ahmed, M.; Islam, S.; Islam, A.K.M.; Shatabda, S. An Ensemble 1D-CNN-LSTM-GRU Model with Data Augmentation for Speech Emotion Recognition. arXiv 2021, arXiv:2112.05666.
  55. Yu, Y.; Kim, Y.J. Attention-LSTM-attention model for speech emotion recognition and analysis of IEMOCAP database. Electronics 2020, 9, 713.
  56. Ohi, A.Q.; Mridha, M.F.; Safir, F.B.; Hamid, M.A.; Monowar, M.M. Autoembedder: A semi-supervised DNN embedding system for clustering. Knowl.-Based Syst. 2020, 204, 106190.
  57. Sajjad, M.; Kwon, S. Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access 2020, 8, 79861–79875.
  58. Bertero, D.; Fung, P. A first look into a convolutional neural network for speech emotion detection. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5115–5119.
  59. Mekruksavanich, S.; Jitpattanakul, A.; Hnoohom, N. Negative emotion recognition using deep learning for Thai language. In Proceedings of the 2020 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT & NCON), Pattaya, Thailand, 11–14 March 2020; pp. 71–74.
  60. Anvarjon, T.; Kwon, S. Deep-net: A lightweight CNN-based speech emotion recognition system using deep frequency features. Sensors 2020, 20, 5212.
  61. Zhang, S.; Zhang, S.; Huang, T.; Gao, W. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans. Multimed. 2017, 20, 1576–1590.
  62. Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204.
  63. Kwon, S. CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network. Mathematics 2020, 8, 2133.
  64. Li, D.; Sun, L.; Xu, X.; Wang, Z.; Zhang, J.; Du, W. BLSTM and CNN Stacking Architecture for Speech Emotion Recognition. Neural Process. Lett. 2021, 53, 4097–4115.
  65. Zhu, L.; Chen, L.; Zhao, D.; Zhou, J.; Zhang, W. Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN. Sensors 2017, 17, 1694.
  66. Kwon, S. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2019, 20, 183.
  67. Lieskovská, E.; Jakubec, M.; Jarina, R.; Chmulík, M. A review on speech emotion recognition using deep learning and attention mechanism. Electronics 2021, 10, 1163.
  68. Kwon, S. Att-Net: Enhanced emotion recognition system using lightweight self-attention module. Appl. Soft Comput. 2021, 102, 107101.
  69. Chen, S.; Zhang, M.; Yang, X.; Zhao, Z.; Zou, T.; Sun, X. The impact of attention mechanisms on speech emotion recognition. Sensors 2021, 21, 7530.
  70. Li, Y.; Zhao, T.; Kawahara, T. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2803–2807.
  71. Yenigalla, P.; Kumar, A.; Tripathi, S.; Singh, C.; Kar, S.; Vepa, J. Speech Emotion Recognition Using Spectrogram Phoneme Embedding. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3688–3692.
  72. Sarma, M.; Ghahremani, P.; Povey, D.; Goel, N.K.; Sarma, K.K.; Dehak, N. Emotion Identification from Raw Speech Signals Using DNNs. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3097–3101.
  73. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231.
  74. Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 2020, 59, 101894.
  75. Carta, S.; Corriga, A.; Ferreira, A.; Podda, A.S.; Recupero, D.R. A multi-layer and multi-ensemble stock trader using deep learning and deep reinforcement learning. Appl. Intell. 2021, 51, 889–905.
  76. Zhang, J.; Xing, L.; Tan, Z.; Wang, H.; Wang, K. Multi-head attention fusion networks for multi-modal speech emotion recognition. Comput. Ind. Eng. 2022, 168, 108078.
  77. Demilie, W.B.; Salau, A.O. Detection of fake news and hate speech for Ethiopian languages: A systematic review of the approaches. J. Big Data 2022, 9, 66.
  78. Bautista, J.L.; Lee, Y.K.; Shin, H.S. Speech Emotion Recognition Based on Parallel CNN-Attention Networks with Multi-Fold Data Augmentation. Electronics 2022, 11, 3935.
  79. Abeje, B.T.; Salau, A.O.; Ebabu, H.A.; Ayalew, A.M. Comparative Analysis of Deep Learning Models for Aspect Level Amharic News Sentiment Analysis. In Proceedings of the 2022 International Conference on Decision Aid Sciences and Applications (DASA), Chiangrai, Thailand, 23–25 March 2022; pp. 1628–1633.
  80. Kakuba, S.; Poulose, A.; Han, D.S. Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features. IEEE Access 2022, 10, 125538–125551.
  81. Tao, H.; Geng, L.; Shan, S.; Mai, J.; Fu, H. Multi-Stream Convolution-Recurrent Neural Networks Based on Attention Mechanism Fusion for Speech Emotion Recognition. Entropy 2022, 24, 1025.