Human Emotion Recognition System: Comparison

Emotion recognition has become an important aspect of the development of human-machine interaction (HMI) systems. Positive emotions affect our lives positively, whereas negative emotions may reduce productivity. Emotionally intelligent systems such as chatbots and artificially intelligent assistant modules help make daily routines effortless. Moreover, a system capable of assessing the human emotional state would be very helpful for monitoring a person's mental state.

  • electrocardiogram (ECG)
  • emotion classifier
  • emotion recognition system
  • human-machine interaction (HMI)
  • support vector machine (SVM)
  • random forest (RF)

1. Introduction

Emotion is an important aspect of human consciousness that shapes our mental state, often subconsciously. Emotional state influences mental well-being as well as overall health. Human emotions arise from chemical changes in the brain that affect the whole body and its expressions and actions, and the feelings they produce are an important characteristic distinguishing humans from other species. We experience a diverse range of emotions, which are often situational and triggered by external events. Prolonged sadness can lead to depression, and severe mental illness may in turn result in physical illness. When a person is angry, body temperature rises and may even lead to shivering; blood pressure fluctuates during intense happiness and sadness; intense fear produces sweating and an increase in heart rate; disgust and surprise may reduce the heart rate; and a pleasant, relaxing activity reduces stress and lowers the heart rate compared with an agitated state. In short, the rhythm of the heart changes with emotion. Researchers have therefore tried to correlate facial features, speech signals, and audiovisual features, as well as physiological signals such as the EEG (electroencephalogram), ECG, GSR (galvanic skin response), and respiration signals, with changes in emotion.

Broadly, human emotion recognition systems are categorized into non-physiological and physiological systems. Non-physiological systems use the facial expressions, speech, audio, and video of a subject exposed to an emotion-eliciting external stimulus. Because these features can be masked, for example, a happy person can feign a serious or sad expression and a sad person can feign a smile, physiological systems are also of merit. They use physiological signals such as the ECG, EEG, GSR, and breathing signals as feature sets to classify human emotions. Since these signals are generated involuntarily, they cannot be masked or controlled by the subject. Much of the reported work, however, has used non-physiological methods of emotion recognition.

2. Human Emotion Recognition Systems

With the advancement of technology, healthcare services have become more patient-oriented. IoT (Internet of Things) and AI (artificial intelligence) systems built on ML (machine learning) make it possible to provide the required preventive care, and they are widely used to develop smart healthcare systems for biomedical applications. ML techniques are now applied to detect diseases ranging from the most dangerous to the least dangerous. Machine-learning models can quickly process huge amounts of patient data, including medical histories from many hospitals, and are used in the detection and classification of diseases. For example, H. Zhu et al. presented an effective Siamese-oriented region proposal network (Siamese-ORPN) for visual tracking and proposed an efficient method of feature extraction and feature fusion [1]. W. Nie et al. described a dialogue emotion detection system in which the variation of utterances is represented as graph models; general conversational knowledge was used in addition to the dialogue itself to improve detection accuracy, and a self-supervised learning method with optimization techniques was proposed [2]. Zhiyong Xiong et al. developed a physiotherapy tool named SandplayAR and evaluated its impact on 20 participants with mild anxiety disorder, demonstrating the potential of the SandplayAR system with augmented sensory stimulation as an efficient psychotherapy tool [3]. Rehab A. Rayan et al. briefly reviewed the potential of IoT and AI with ML technologies in biomedical applications, establishing that such technology accelerates the transition from hospital-centered to patient-centered care; with the development of ML techniques, healthcare devices can handle, store, and analyze big data automatically and quickly [4]. Giacomo Peruzzi et al. developed a small, portable microcontroller-based sleep bruxism detection system that uses a CNN-based ML technique to distinguish the sound of bruxism from other audio signals and can detect the condition remotely at the patient's house [5]. Yuka Kutsumi et al. collected bowel sound (BS) data from 100 participants using a smartphone and developed a CNN model capable of classifying BSs with an accuracy of 98% to assess a person's gut health [6]. Renisha Redij et al. also illustrated the application of AI to classifying BSs, explaining the relevance and potential of AI-enabled systems to change gastrointestinal (GI) practice and patient care [7]. Rita Zgheib et al. reviewed the importance of artificial intelligence with machine learning and semantic reasoning in current digital healthcare systems and analyzed the relevance of AI and ML technologies in handling the COVID-19 pandemic [8]. Yash Jain et al. developed an ML-based healthcare management system that can act as a virtual doctor, providing a preliminary diagnosis based on information supplied by the subject; a CNN-based ML technique was used, a GUI interface was developed, and the system also includes an emotion classifier. Such technologies are surely going to contribute to digital healthcare management systems in the future [9].
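As an illustration of the kind of CNN-based audio classifiers surveyed above (e.g., in [5][6]), the sketch below shows a small 1D convolutional network that maps fixed-length audio segments to two sound classes. It is a minimal, hypothetical example: the sampling rate, layer sizes, and dummy data are assumptions, not the configurations used in the cited studies.

```python
# Minimal sketch (assumptions only): a tiny 1D CNN for binary audio
# classification, e.g., "target sound" vs. "other sound".
import numpy as np
from tensorflow.keras import layers, models

N_SAMPLES = 16000      # assumed: 1 s of audio at 16 kHz
N_CLASSES = 2          # assumed: target sound vs. background

model = models.Sequential([
    layers.Input(shape=(N_SAMPLES, 1)),
    layers.Conv1D(16, kernel_size=64, strides=8, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(32, kernel_size=32, strides=4, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(32, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy arrays stand in for labelled audio clips.
x = np.random.randn(8, N_SAMPLES, 1).astype("float32")
y = np.random.randint(0, N_CLASSES, size=(8,))
model.fit(x, y, epochs=1, verbose=0)
```

In practice such a model would be trained on labelled recordings and, as in the microcontroller-based system of [5], compressed before deployment on an embedded device.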

2.1. Non-Physiological Signal-Based Emotion Classifiers

Non-physiological emotion classification uses responses such as speech, audio, video, and facial expressions corresponding to various emotions as inputs. K. P. Seng et al. reported an audiovisual feature-extraction algorithm and emotion classification technique in which an RFA neural classifier was used to fuse the kernel and Laplacian matrices of the visual path; recognition of seven types of expression achieved an accuracy of 96.11% on the CK+ database and 86.67% on the ENTERFACE05 database [10]. Facial expressions, however, can be masked by subjects who control their reactions. T. M. Wani et al. briefly reviewed various speech emotion recognition (SER) techniques, listing the publicly available speech databases in different languages together with the models developed on them, explaining various feature-extraction algorithms, and illustrating the relevance and details of classifiers such as GMM, HMM, ANN, SVM, KNN, and DNN for speech-based emotion recognition [11]. The research gap lies in selecting robust features and machine-learning-based classification techniques to improve the accuracy of emotion recognition systems. M. S. Hossain et al. presented a real-time mobile emotion recognition system with modest computational requirements; facial video acquired with the inbuilt camera of a mobile phone was processed with a bandlet transform and the Kruskal–Wallis feature selection method, and the CK and JAFFE databases were used to achieve a maximum accuracy of more than 99% [12]. Mobile systems, however, are limited in data-handling capacity and computational efficiency. S. Hamsa et al. used a speech correlogram with an RF-classifier-based deep-learning technique to recognize human emotions in noisy and stressful environments; English and Arabic datasets (the private Arabic ESD dataset and the public English SUSAS, RAVDESS, and SAVEE datasets) were processed to extract features after noise reduction, and an average accuracy of more than 80% was achieved [13]. S. Hamsa et al. also proposed an emotionally intelligent system to identify the emotion of an unknown speaker using energy, time, and spectral features on three distinct speech datasets in two languages; random forest classifiers were used to classify six emotions with a maximum accuracy of 89.60% [14]. L. Chen et al. proposed a dynamic emotion recognition system based on facial key features using an Adaboost-KNN adaptive feature optimization technique for human–robot interaction; Adaboost, KNN, and SVM were used for classification, and a maximum accuracy of 94.28% was reported [15]. S. Thuseethan et al. proposed a deep-learning-based technique for recognizing unknown facial expressions, presenting a CNN-based architecture whose efficacy was evaluated on a benchmark emotion dataset with a maximum accuracy of 86.22% [16]. Hira Hameed et al. reported a contactless British Sign Language detection system that classified five emotions from spatiotemporal features acquired with a radar system, using deep-learning models such as InceptionV3, VGG16, and VGG19 to achieve a maximum accuracy of 93.33% [17]. In [18], the authors presented a contextual cross-modal transformer module for the fusion of textual and audio modalities, operated on the IEMOCAP and MELD datasets, achieving a maximum accuracy of 84.27%.
In [19], the authors illustrated a speech-based emotion recognition technique using frequency-domain features of an Arabic dataset with SVM, KNN, and MLP techniques, achieving a maximum recognition accuracy of 77.14%. In [20], the authors proposed a model fused both at the feature level (with an LSTM network) and at the decision level for the happy emotion, achieving a maximum accuracy of 95.97%. Non-physiological signals, however, are easily masked by the subject: facial expressions can be controlled and speech tone can be modulated intentionally.
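To make the generic SER pipeline surveyed above concrete, feature extraction followed by a conventional classifier, the following sketch computes MFCC features and trains SVM and RF classifiers. It is a minimal example under stated assumptions: the synthetic signals, label set, and parameter values are placeholders rather than the datasets or settings of the cited works.

```python
# Minimal sketch of a conventional SER pipeline: MFCC features -> SVM / RF.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def mfcc_vector(signal, sr=16000, n_mfcc=13):
    """Average MFCCs over time to obtain one fixed-length feature vector."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Synthetic stand-ins for labelled utterances (replace with real recordings).
rng = np.random.default_rng(0)
signals = [rng.standard_normal(16000).astype(np.float32) for _ in range(20)]
labels = rng.integers(0, 4, size=20)            # assumed: 4 emotion classes

X = np.array([mfcc_vector(s) for s in signals])

svm = SVC(kernel="rbf").fit(X, labels)                          # SVM classifier
rf = RandomForestClassifier(n_estimators=100).fit(X, labels)    # RF classifier
print(svm.predict(X[:3]), rf.predict(X[:3]))
```

Richer feature sets (prosodic, spectral, or correlogram-based, as in the cited studies) and a held-out test split would replace the placeholders in a real system.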

2.2. Physiological Signal-Based Emotion Classifiers

Researchers have also explored physiological methods of emotion detection, with EEG and GSR signals mainly being used to develop classifier models. Even so, unimodal ECG-based human emotion recognition systems offering high accuracy with contactless acquisition of the ECG signal remain relatively unexplored, since acquiring ECG data with minimal artefacts in contactless systems has always been a challenge for researchers.
M. R. Islam et al. conducted an extensive review of EEG-based emotion recognition techniques in two categories, deep-learning- and shallow-learning-based models. A very detailed list of features used for developing emotion classification models was reported, and the paper analyzed the relevance of features, classifier models, and publicly available datasets; the authors carefully identified the advantages and issues of each technique reported in the domain and suggested possible ways to overcome them [21]. E. P. Torres et al. applied RF and deep-learning algorithms to classify emotional states in stock-trading behavior using EEG features (five frequency bands, DE, DASM, and RASM); the relevance of each feature was identified by a chi-square test, and a maximum accuracy of 83.18% was achieved [22]. T. Song et al. developed a multimodal physiological signal database comprising EEG, ECG, GSR, and respiration signals; video clips were selected to induce emotions, SVM and KNN were used to classify them, a novel A-LSTM was proposed to obtain more distinctive features, the Spearman correlation coefficient was used to identify negatively and positively correlated emotions, and the database was shared publicly [23]. L. D. Sharma et al. used publicly available ECG and EEG databases; features were extracted by decomposition into reconstructed components using sliding-mode spectral analysis, machine-learning techniques were used for classification, and the DREAMER and AMIGOS databases were analyzed to achieve a maximum accuracy of 92.38% [24]. G. Li et al. used the EEG-based SEED dataset and experimentally evaluated batch normalization; an LR classifier was implemented on PSD features of the EEG signals, improving the recognition accuracy of the system to 89.63% [25]. A. Goshvarpour et al. examined the effectiveness of the matching pursuit algorithm for emotion recognition; they acquired ECG and GSR data from only 16 students exposed to emotional music clips and developed an emotion recognition system based on machine-learning classification tools (such as PCA-KNN) and discriminant analysis, achieving 100% accuracy and concluding that ECG is a more effective parameter than GSR for classifying emotions [26], although the sample size was small. Huanpu Yin et al. presented a contactless IoT user identification and emotion recognition technique in which a multi-scale neural network with a mmWave radar system was designed for accurate and robust sensing, achieving a maximum accuracy of 87.68% [27]. Alex Sepulveda et al. demonstrated the use of ECG signal features extracted from the AMIGOS database using the wavelet transform, classifying emotions with KNN, SVM, ensembles, etc., to achieve a maximum accuracy of 89.40% [28]. Muhammad Anas Hasnul et al. reviewed emotion recognition systems based on ECG signals and discussed the emotional aspect of various heart nodes, highlighting systems with an accuracy of more than 90% and validating the publicly available databases [29]. In [30], the authors commented on the relationship between emotional state and personality traits, in which EEG, ECG, and GSR together with facial features were used to establish a non-linear correlation.
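To make the typical unimodal ECG pipeline concrete, the sketch below derives simple time-domain HRV features (mean RR, SDNN, RMSSD) from an ECG trace and classifies them with KNN. This is an illustrative example only: the sampling rate, peak-detection threshold, synthetic signals, and emotion labels are assumptions, not the methods of the studies cited above (which used, e.g., matching pursuit or wavelet features).

```python
# Illustrative sketch: time-domain HRV features from an ECG trace -> KNN.
import numpy as np
from scipy.signal import find_peaks
from sklearn.neighbors import KNeighborsClassifier

FS = 250  # assumed ECG sampling rate (Hz)

def hrv_features(ecg):
    """Mean RR, SDNN, and RMSSD (ms) from R-peaks found by simple thresholding."""
    peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.4 * FS))
    rr = np.diff(peaks) / FS * 1000.0                  # RR intervals in ms
    return np.array([rr.mean(), rr.std(),
                     np.sqrt(np.mean(np.diff(rr) ** 2))])

def synth_ecg(rr_s, n_beats=12, noise=0.05, seed=0):
    """Toy ECG surrogate: a unit spike every rr_s seconds plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    beat = np.zeros(int(rr_s * FS))
    beat[0] = 1.0
    sig = np.tile(beat, n_beats)
    return sig + noise * rng.standard_normal(sig.size)

# Pretend a faster heart rate corresponds to one (hypothetical) emotion class
# and a slower one to another; real studies use labelled, stimulus-elicited data.
X = np.array([hrv_features(synth_ecg(rr))
              for rr in [0.6, 0.62, 0.65, 0.9, 0.92, 0.95]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict(X))
```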
Turning to multimodal approaches, in [31] the authors proposed a deep fusion multimodal model to enhance the accuracy of class separation for emotion recognition on the DEAP and DECAF datasets, achieving a maximum accuracy of 70%. An emotion recognition system with multiple modalities such as facial expression, GSR, and EEG using the LUMED-2 and DEAP datasets to classify seven emotions with a maximum accuracy of 81.2% has also been reported [32]. In [33], the authors reported a hybrid sensor fusion approach to develop a user-independent emotion recognition system; WMD-DTW (a weighted multi-dimensional DTW) and KNN were used on the E4 and MAHNOB datasets to achieve a maximum accuracy of 94%. A real-time IoT- and LSTM-based emotion recognition system utilizing physiological signals was reported to classify emotions with an F-score of up to 95% using deep-learning techniques [34]. Many of the above studies either relied on publicly available data or investigated too few subjects, and the classification accuracies achieved leave room for improvement. In particular, the number of subjects investigated in [26] is too small to support a claim of 100% accuracy, and the classifier's generalization may therefore be questionable.
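One simple way to picture the decision-level fusion used in several of the multimodal systems cited above is to average the class probabilities predicted by independent, modality-specific classifiers, as in the hypothetical sketch below. The feature dimensions, classifier choices, and labels are placeholders, not the cited architectures.

```python
# Minimal decision-level fusion sketch: average per-modality class probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 60
X_ecg = rng.standard_normal((n, 8))     # e.g., HRV-style ECG features (assumed)
X_eeg = rng.standard_normal((n, 16))    # e.g., band-power EEG features (assumed)
y = rng.integers(0, 3, size=n)          # 3 hypothetical emotion classes

clf_ecg = LogisticRegression(max_iter=1000).fit(X_ecg, y)
clf_eeg = RandomForestClassifier(n_estimators=100).fit(X_eeg, y)

# Decision-level fusion: average the class probabilities from both modalities.
proba = (clf_ecg.predict_proba(X_ecg) + clf_eeg.predict_proba(X_eeg)) / 2
fused_pred = proba.argmax(axis=1)
print(fused_pred[:10])
```

Feature-level fusion would instead concatenate the modality features (or learned embeddings, e.g., from an LSTM as in [20][34]) before a single classifier; the averaging shown here is only one of several possible decision rules.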