1. Introduction
In many treatments for medical and psychosomatic disorders, such as physical therapy and relaxation techniques, there is often a lack of consistent feedback on how well a patient responds to the therapy. This lack of feedback often leads to high drop-out rates and irregular training, which hinders the therapy’s ability to consistently yield an improved physical or psychological state. Moreover, therapies for which a patient’s psycho-physiological state is of particular importance require a human-centered design approach to effectively address subjective pain and emotions experienced by patients. For example, physiotherapy using vaginal dilators to treat genital pain and penetration disorders or re-validation treatment after genital surgery are invasive and may cause high levels of discomfort
[1]. Hence, there is an evident need for emotional feedback.
The concept of affective engineering extends beyond the scope of the vaginal dilation therapy example above and encompasses various domains where understanding and incorporating human emotions are crucial. For example, Balters and Steinert
[2] highlight the need for novel methods that account for the influence of human emotions, which often lead to irrational behavior. They propose that physiological signals and machine learning (ML) techniques could be fundamental in advancing research in this field, as these signals cannot be intentionally influenced by the subject. The underlying assumption is that our body functions are regulated by the autonomic nervous system, which is in turn influenced by emotions
[3]. Thus, emotional states can be inferred by measuring physiological signals. Consequently, the emergence of user-friendly devices capable of non-invasive measurements presents an excellent opportunity to incorporate biofeedback and emotion reactivity into human-machine interactions and to personalize medical therapies. However, the integration of such information as a form of feedback, or as an input to the control of a smart device, is currently lacking. Further, there is a clear incentive to integrate the collection and classification of physiological signals into home-based physical therapy sessions, which makes minimally invasive and user-friendly data collection methods of high importance.
2. State of the Art for Emotion Recognition
Emotion recognition using physiological signals is an extensively researched area. Using various approaches, both binary
[4,5,6,7,8,9,10] and multi-class classification
[11][12][13][11,12,13] have achieved high accuracy rates exceeding 80%. Nevertheless, cross-study results comparison poses a challenge due to varying factors such as the number of subjects, subject-dependent results, the chosen machine learning and validation methods, the signals used, emotion elicitation techniques, and emotion models. For a comprehensive review covering these aspects, please see
[14,15]. For this study, it is particularly important that the data collection method is applicable outside of a laboratory setting. The sensors and the methods for emotion elicitation play a vital role in the technological readiness level of the system. Hence, these categories are reviewed with a focus on their applicability to a minimally invasive scenario.
2.1. Signals
The autonomic nervous system (ANS) regulates “somatic homeostasis and regulates visceral activity, such as the heart rate, digestion, respiratory rate, salivation, perspiration, pupil diameter, micturition, and sexual arousal”
[2]. Hence, it is often assumed that measuring these physiological reactions can offer insights into emotional states, and there have even been attempts to establish fixed relationships between certain signals or features and emotional changes
[15]. It is important to note, however, that not all of the mediated reactions indicate emotional changes
[3], and it must be acknowledged that any such correlations are likely to be highly nonlinear.
Due to the complex relationship between emotions and body responses, most studies utilize multiple physiological signals. As per a review conducted by Bota et al.
[14], the most frequently used sensor systems and signals (listed from the most to the least often used) include electrodermal activity (EDA), electrocardiography (ECG), respiration rate (RESP), electroencephalography (EEG), electromyography (EMG), skin temperature (SKT), acceleration (ACC), blood volume pulse (BVP), and electrooculography (EOG). However, it is crucial to note that many of these signals are only viable in a laboratory setting, e.g., due to electrode placement challenges and the need for specialized equipment, which often reduces the mobility of the participants. In settings that are closer to a real-life scenario, the emphasis shifts from, e.g., EEG features to cardiovascular and EDA features.
With this objective in mind, photoplethysmography (PPG) sensors, measuring BVP, provide essential heart rate data. These non-invasive sensors, found in smartwatches and medical wearables, are not widely used in emotion recognition due to their sensitivity to movement. Nonetheless, PPG sensors have been implemented in several studies
[4,12], and sometimes in conjunction with other non-invasive sensors for EDA
[11,16]. One of the most widely used open databases incorporating PPG measurements is the database for emotion analysis using physiological signals (DEAP)
[17], as shown in studies such as
[5,9,10,18]. Yet, data transferability to other studies using a BVP signal remains challenging due to DEAP’s use of a fingertip PPG sensor and the inclusion of other measures like ECG or EEG for classification.
There are instances where researchers conducted short-term emotion recognition with a minimally invasive setup, for example,
[5] used only a 10 s BVP signal and achieved accuracy rates exceeding 80% for binary classification. Moreover, PPG sensors embedded in wearables have been applied in real-life situations by
[7,18,19]. Both
[7,18] utilized a measurement wristband developed by Empatica. While these devices offer the advantage of portability and easy setup, participants often need to remain still to get a clean PPG signal
[20] and researchers are required to deal with noisy signals, e.g., by disregarding corrupted parts
[7].
2.2. Emotion Models and Elicitation
Emotions are typically defined by two types of models: discrete emotion spaces and continuous dimensions
[14]. Discrete models assume emotions can be grouped into categories like “sad” or “happy”, but this can lead to individual interpretation differences due to factors such as cultural backgrounds
[14]. Continuous models, on the other hand, measure emotions along one or more axes, which simplifies emotion comparison
[14]. According to Bota et al.
[14], the most popular model that uses continuous dimensions is the valence-arousal model by Russell
[21], where valence represents the positivity or negativity of an emotion, and arousal indicates its intensity.
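As a concrete illustration, the valence-arousal plane is often discretized into quadrant classes for classification. The following minimal sketch assumes a 1–9 rating scale (as used, e.g., in DEAP) and a midpoint threshold of 5; both the threshold and the label names are illustrative assumptions, not taken from any cited study.

```python
# Minimal sketch: discretizing Russell's valence-arousal plane into
# four quadrant labels. Thresholds and label names are illustrative
# assumptions, not taken from any cited study.

def va_quadrant(valence: float, arousal: float, midpoint: float = 5.0) -> str:
    """Map ratings on a 1-9 scale to one of the four quadrants."""
    high_v = valence >= midpoint
    high_a = arousal >= midpoint
    if high_v and high_a:
        return "high-valence/high-arousal"   # e.g., excitement
    if high_v:
        return "high-valence/low-arousal"    # e.g., calm contentment
    if high_a:
        return "low-valence/high-arousal"    # e.g., fear or anger
    return "low-valence/low-arousal"         # e.g., sadness

print(va_quadrant(7.2, 6.5))  # -> high-valence/high-arousal
```

Such a discretization turns a continuous self-report into the binary or four-class labels that most of the studies cited above classify against.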
The majority of emotion elicitation methods occur in a laboratory setting, utilizing various stimuli such as pictures, films, sounds, words, or recall schemes
[14,22]. Just as with the emotion models, the environment of the participants, including factors like language and cultural background, will considerably impact the emotion elicitation process inherent to the stimuli of some of these methods. Though recall schemes, such as the autobiographical emotion memory task (AEMT), offer an established way for emotion elicitation without relying on external stimuli
[23], they are notably underrepresented in emotion recognition studies. As highlighted by Bota et al.
[14], a mere two out of over seventy reviewed studies employed any form of recall scheme. For instance, Picard et al.
[13] effectively employed a recall scheme over two decades ago, achieving correct rates of 81% across eight emotion classes for a single subject. Yet, these results have been infrequently reproduced. Chanel et al.
[24] worked with a group of 10 participants and achieved a correct recognition rate of up to 50% for four emotions using a subject-dependent model.
2.3. Methods for Classification and Validation
In emotion recognition using physiological signals, methods generally fall into two categories: traditional machine learning and deep learning
[14]. Traditional machine learning involves signal preprocessing, feature engineering, feature fusion, feature-dependent classification, and validation. The aim of signal preprocessing is to reduce noise while retaining relevant information. Feature engineering, the subsequent step, is intended to maximize the informative content of the preprocessed signals
[14]. After feature computation, dimensionality reduction methods are typically applied to avoid the curse of dimensionality. Selecting appropriate features and dimensionality reduction techniques is critical, and the classifier’s success heavily depends on these choices. Features are usually fused into a single vector, which is then used to train and validate the classifier. Most commonly, support vector machine (SVM) algorithms are employed as a supervised learning (SL) method
[14], as demonstrated in
[6,7,8,10,11,12,24].
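As an illustration of this classical pipeline, the sketch below chains PCA-based dimensionality reduction and an SVM, evaluated with 10-fold CV. The features and labels are synthetic stand-ins; real studies would compute engineered features from preprocessed EDA, BVP, or RESP segments.

```python
# Hedged sketch of the traditional pipeline described above:
# fused feature vector -> dimensionality reduction -> SVM classifier.
# All data here are synthetic placeholders, not from any cited study.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))       # 120 segments, 20 engineered features
X[:, :2] *= 5.0                      # pretend the first two features carry most variance
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary arousal-style label

# No feature scaling here, so PCA retains the (synthetically)
# high-variance informative features.
clf = make_pipeline(PCA(n_components=5), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=10)   # 10-fold CV, as in several studies
print(round(scores.mean(), 2))
```

Note that this plain k-fold split mixes segments from the same (synthetic) source across train and test folds, which is exactly the subject-dependency issue discussed in Section 2.4.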
In contrast to feature-dependent classifiers, deep learning (DL) methods can perform pattern recognition directly on preprocessed signals without feature engineering
[18]. For instance, convolutional neural networks (CNNs) can be used to learn feature-like representations
[5,25]. Additional representation learning algorithms like autoencoders, utilized in
[16], serve as a method for dimensionality reduction, where the autoencoder’s output feeds into the CNN. While such techniques can greatly reduce the time-consuming steps of preprocessing and feature engineering, DL methods have the limitation of behaving like a black box and require large data sets for training
[14], which are often unavailable in medical experiments.
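To illustrate what learning feature-like representations means operationally, the following minimal sketch (illustrative only, not taken from the cited studies) applies one 1D convolution, a ReLU, and max-pooling to a stand-in signal; this is the elementary operation a CNN stacks and trains end-to-end.

```python
# Illustrative sketch: a single 1D convolution + pooling stage, the kind
# of operation a CNN composes to learn feature-like representations
# directly from a preprocessed physiological signal.
import numpy as np

def conv1d_valid(signal: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' 1D convolution (cross-correlation, as in most DL libraries)."""
    n = len(signal) - len(kernel) + 1
    return np.array([signal[i:i + len(kernel)] @ kernel for i in range(n)])

signal = np.sin(np.linspace(0, 4 * np.pi, 64))   # stand-in for a BVP segment
kernel = np.array([-1.0, 0.0, 1.0])              # a trained filter would replace this
feature_map = np.maximum(conv1d_valid(signal, kernel), 0.0)  # ReLU nonlinearity
pooled = feature_map.reshape(-1, 2).max(axis=1)              # max-pooling, width 2

print(pooled.shape)  # -> (31,)
```

In a full CNN, the kernel values are learned by backpropagation rather than fixed, which is what removes the manual feature engineering step.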
2.4. Validation, Segmentation, and Subject-Dependency
After the classifier is trained and hyperparameters are learned, the model is typically validated on a test data set to measure performance on unseen data
Common cross-validation (CV) methods include
k-fold, leave-one-subject-out (LOSO), and leave-one-out (LOO) CV. In the development of patient models, since multiple experiments from the same subject are used, the CV method is of particular importance. Notably, both
k-fold and LOO CV methods include data from the same subject in both the training and validation sets. Thus, to claim subject-independent results, LOSO CV should be used, where the classifier is tested on data exclusively from one subject, with none of this subject’s data used in training
[14]. The subject-dependency in the context of
k-fold cross-validation becomes increasingly pronounced when multiple trials are conducted per emotion or class, and even more so when several samples are derived from a single trial. In this context, the use of LOSO CV for subject-independent results is absolutely crucial.
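To make the leakage argument concrete, the following sketch (with synthetic placeholder data, not from any cited study) shows how a LOSO split keeps every segment of the held-out subject out of the training set:

```python
# Sketch of leave-one-subject-out (LOSO) CV: all segments of the
# held-out subject go to the test set, none into training.
import numpy as np

subjects = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # subject id per segment
X = np.arange(9).reshape(9, 1)                    # one dummy feature per segment

for held_out in np.unique(subjects):
    train_idx = np.where(subjects != held_out)[0]
    test_idx = np.where(subjects == held_out)[0]
    # a real study would fit the classifier on X[train_idx] here
    assert not set(train_idx) & set(test_idx)     # no subject leakage
    print(held_out, test_idx.tolist())
```

By contrast, plain k-fold CV on the same data would scatter one subject's segments across folds, so every test fold would contain segments from subjects seen during training.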
Table 1 provides an overview of some studies, including their segmentation approaches, validation methods, trained models, and achieved accuracy rates. As mentioned previously, conducting a fair cross-study comparison is a complex issue. Nevertheless, it is anticipated that this table can give a qualitative perspective on the significance of segmentation and validation methodologies. The segment number indicates the total count of segments/data points utilized for training and testing and can be inferred from the other numbers: the number of subjects, the number of labels used for emotion elicitation (which could refer to various emotions or high/low arousal levels), the number of trials per label (some researchers, for instance, presented several movie clips to elicit one emotion), and, in some cases, the number of segments into which a single continuous trial was split for data analysis. However, the number of studies listed in the table is limited, as not all researchers detail their segmentation methods. It becomes apparent that, despite the overall aim of achieving subject-independent results for greater generalizability, only a few studies utilize LOSO.
Table 1.
Segmentation approaches from a selected number of studies.
Author | Number of Segments | Number of Subjects | Number of Labels | Trials per Label | Segments per Trial | Validation | Model | Accuracy
[11] | 111 | 37 | 3 | 1 | 1 | 10-fold | SVM | 97%
[26] | 360 | 3 | 4 * | 30 | 1 | LOO | EMDC | 70%
[5] | 4800 | 20 | 2 | 40 ** | 6 | CV | NN | 82.1%
[12] | 198 | 33 | 3 | 2 | 1 | LOSO | SVM | 83.8%
[27] | 477 | 101 | 5 | 75–134 *** | 1 | LOO | RF | 74%
[24] | 3000 | 10 | 3 | 100 | 1 | LOO | SVM | 50%
[13] | 160 | 1 | 8 | 20 | 1 | LOO | kNN | 81%
[6] | 192 | 32 | 2 | 3 | 1 | 20-fold | SVM | 90%
[16] | ca. 288 | 36 | 4 | 8 ** | 1 | 10-fold | NN | 75%
[19] | 176 | 4 | 8 | 44 ** | 1 | LOSO | RF | 62.1%
[9] | ca. 12,800 | 32 | 2 | 40 ** | 10 | 10-fold | CNN | 87.3%
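As a sanity check, the segment totals in Table 1 can typically be reproduced from the other columns. A minimal sketch, assuming segments = subjects × labels × trials-per-label × segments-per-trial (the helper name is hypothetical; rows whose trial counts carry footnote markers may follow different conventions and are skipped here):

```python
# Hypothetical helper (not from the reviewed studies): reproduce the
# "Number of Segments" column of Table 1 from the remaining columns.

def n_segments(subjects: int, labels: int,
               trials_per_label: int, segments_per_trial: int) -> int:
    return subjects * labels * trials_per_label * segments_per_trial

print(n_segments(37, 3, 1, 1))    # [11] -> 111
print(n_segments(10, 3, 100, 1))  # [24] -> 3000
print(n_segments(1, 8, 20, 1))    # [13] -> 160
```

This also makes visible how quickly splitting trials into multiple segments inflates the data set size, which is why the validation scheme matters so much for the reported accuracies.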