Human–computer interaction (HCI) plays a significant role in modern education, and emotion recognition is essential in the field of HCI. The potential of emotion recognition in education remains to be explored. Confusion is the primary cognitive emotion during learning and significantly affects student engagement. Recent studies show that electroencephalogram (EEG) signals, recorded through electrodes placed on the scalp, are valuable for studying brain activity and identifying emotions.
1. Introduction
In modern education, human–computer interaction (HCI) plays a crucial role, with emotion recognition being particularly significant in the field of HCI. By accurately identifying and understanding students’ emotional states, educational systems can better respond to their needs and provide personalized support. Emotion recognition technology can assist educators in determining whether students are experiencing confusion, frustration, or focus during the learning process, enabling timely adoption of appropriate teaching strategies and supportive measures
[1,2,3]. Therefore, the importance of emotion recognition in HCI and education is self-evident. It optimizes the teaching process, enhances learning outcomes, and provides students with more personalized support and guidance. Confusion is more common than other emotions in the learning process
[4,5,6]. Although confusion is an unpleasant emotion, addressing confusion during controllable periods has been shown to be beneficial for learning
[7,8,9], as it promotes active student engagement in learning activities. However, research on learning confusion is still in its early stages and requires further exploration.
Electroencephalography (EEG) records the aggregated electrical activity of neurons in the human brain’s cortex. As a physiological indicator, and in contrast to non-physiological indicators such as facial expressions and gestures, EEG offers a relatively objective assessment of emotions, making it a reliable tool for emotion recognition
[10].
Traditionally, the classification of EEG signals relies on manual feature extractors and machine learning classifiers
[11], such as Naive Bayes, SVM, and Random Forest. Deep-learning architectures are a more recent introduction and have consistently improved performance
[12]. Convolutional Neural Networks (CNNs) and Long Short-Term Memory Networks (LSTMs) are the primary architectures employed
[13]. However, CNN-based feature extraction focuses primarily on local patterns, which limits the perception of temporal information. Although LSTM-based approaches exhibit commendable performance, they also struggle with global temporal representation. Various attempts with end-to-end hybrid networks
[14] have been made. However, these endeavors have resulted in models with excessively intricate architectures, leading to sluggish convergence rates or even failures to converge. Furthermore, end-to-end methodologies lack the advantages of conventional feature extraction methods in representing EEG signals. The Transformer
[15] has showcased its formidable capabilities in natural language processing (NLP), owing to its significant advantage in comprehending global semantics. However, its application in EEG systems is still an area that requires further exploration.
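As a concrete illustration, the minimal sketch below shows one plausible way a Transformer encoder could be applied to windowed EEG signals for binary classification. It is written in PyTorch; the channel count, window length, and all hyperparameters are illustrative assumptions rather than settings taken from any cited study.

# Minimal sketch: a Transformer encoder over windowed EEG data (assumed shapes and sizes).
import torch
import torch.nn as nn

class EEGTransformerSketch(nn.Module):
    def __init__(self, n_channels=14, d_model=64, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # Project each time step (one sample across all channels) into the model dimension.
        self.input_proj = nn.Linear(n_channels, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x):
        # x: (batch, time, channels) -- raw or band-pass filtered EEG windows.
        h = self.input_proj(x)
        h = self.encoder(h)                      # self-attention captures global temporal context
        return self.classifier(h.mean(dim=1))    # average over time, then classify

# Example: a batch of 8 two-second windows sampled at 128 Hz from 14 electrodes (assumed).
logits = EEGTransformerSketch()(torch.randn(8, 256, 14))
print(logits.shape)  # torch.Size([8, 2])

Averaging the encoder output over time is only one pooling choice; a learnable classification token would serve the same purpose.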
2. Confusion Analysis in Learning Based on EEG Signals
Confusion in learning refers to feeling perplexed or uncertain while absorbing knowledge or solving problems. Because confusion shares attributes with emotions, it is a nascent area of study, concerned primarily with whether confusion should be classified as an emotion or an affective state. Confusion is regarded as a cognitive emotion, indicating a state of cognitive imbalance
[9,16]. Individuals are encouraged to introspect and deliberate on the material to redress this imbalance and make progress, enabling deeper comprehension. Consequently, when confused, individuals tend to engage deeper cognitive processes in pursuit of better learning outcomes. The investigation of confusion within the learning context remains in its preliminary stages.
Using EEG to recognize human emotions during various activities, including learning, is an area currently being explored. Recent research has used electroencephalography to study cognitive states and emotions for educational purposes, focusing on attention or engagement
[17,18], cognitive load, and some basic emotions such as happiness and fear. For example, researchers
[19] used an EEG-based brain–computer interface (BCI) to record EEG in the FP1 region and track changes in attention. Using visual and auditory cues, such as rhythmic hand raising, adaptive proxy robots can help students redirect their attention when it falls below a preset threshold. The results indicate that this BCI can improve learning performance.
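The closed-loop idea can be summarized in a few lines of Python; read_attention_index() and issue_cue() are hypothetical placeholders standing in for the BCI output and the robot's cue, and the threshold value is assumed rather than taken from the cited study.

# Hedged sketch of the attention-monitoring loop described above (placeholders, not a real BCI API).
import random
import time

ATTENTION_THRESHOLD = 0.5   # assumed value; the cited study uses its own preset threshold

def read_attention_index():
    # Placeholder for an attention estimate derived from the FP1 electrode (0 = low, 1 = high).
    return random.random()

def issue_cue():
    # Placeholder for a visual or auditory cue, e.g. the proxy robot rhythmically raising a hand.
    print("cue: attention below threshold")

for _ in range(5):          # illustrative monitoring loop, one check per second
    if read_attention_index() < ATTENTION_THRESHOLD:
        issue_cue()
    time.sleep(1)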
Most traditional EEG-based classification methods rely on two steps, feature extraction and classification, and emotion classification is no exception. Many researchers have focused on finding effective EEG features, and advances in machine learning methods have significantly contributed to these traditional approaches. There have been attempts using the Common Spatial Pattern (CSP) algorithm
[20], such as the FBCSP algorithm
[21], which filters signals through a filter bank, computes CSP energy features for each band-filtered output, and then selects and classifies these features. Despite enhancements to the original CSP method, these techniques focus solely on the CSP energy dimension and disregard temporal contextual information. Kaneshiro et al.
[11] applied Principal Component Analysis (PCA) to extract feature vectors of fixed size from minimally preprocessed EEG signals and then trained a classifier based on Linear Discriminant Analysis (LDA). Karimi-Rouzbahani et al.
[22] explored the discriminative power of many statistical and mathematical features, and their experiments on three datasets showed that multi-valued features like wavelet coefficients and the theta frequency band performed better. Zheng et al.
[23] investigated the pivotal frequency bands and channels of multi-channel EEG data in emotion recognition. Jensen & Tesche
[24] and Bashivan et al.
[25] demonstrated through experiments that cortical oscillatory activity associated with memory operations primarily exists in the theta (4–7 Hz), alpha (8–13 Hz), and beta (13–30 Hz) frequency bands. The studies above use traditional machine learning classifiers to explore critical frequency bands and channels; nevertheless, these classifiers offer no clear performance advantage. In addition, optimizing feature extraction and classification separately can lead to a suboptimal solution overall.
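A minimal sketch of this two-step pipeline, assuming band-power features in the theta, alpha, and beta bands followed by an LDA classifier, is given below; the sampling rate, band edges, and synthetic data are illustrative assumptions.

# Minimal sketch of the traditional pipeline: band-power features + a conventional classifier.
import numpy as np
from scipy.signal import welch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

FS = 128                                                   # sampling rate (Hz), assumed
BANDS = {"theta": (4, 7), "alpha": (8, 13), "beta": (13, 30)}

def band_power_features(epochs):
    # epochs: (n_trials, n_channels, n_samples) -> (n_trials, n_channels * n_bands)
    freqs, psd = welch(epochs, fs=FS, nperseg=FS)          # power spectral density per trial/channel
    feats = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs <= hi)
        feats.append(psd[..., mask].mean(axis=-1))         # mean power within the band
    return np.log(np.concatenate(feats, axis=-1))          # log band power is common practice

# Synthetic stand-in data: 200 trials, 14 channels, 2-second windows, binary labels (assumed).
rng = np.random.default_rng(0)
X = band_power_features(rng.standard_normal((200, 14, 2 * FS)))
y = rng.integers(0, 2, size=200)
print(cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean())

Swapping LinearDiscriminantAnalysis for an SVM or Random Forest changes only the final line, which illustrates why work in this direction has concentrated mainly on the feature-extraction step.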
Compared to traditional methods, end-to-end deep networks eliminate the need for manual feature extraction. For most EEG applications, it has been observed that shallow models yield good results, while deep models might lead to performance degradation
[12,13]. This is especially true for CNN-based classification: despite their shallow architectures and few parameters, CNNs have been widely utilized, including DeepConvNet
[12], EEGNet
[26], ResNet
[27], and other variants
[28]. However, because of the limitations imposed by kernel size, CNNs learn features with local receptive fields and cannot capture the long-term dependencies that are crucial for time-series analysis. Recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have therefore been introduced to capture the temporal features of EEG signals
[29,30]. However, these models cannot be trained in parallel, and the dependencies carried by hidden states fade after a few time steps, making it difficult to capture global temporal dependencies. Moreover, end-to-end methods rely on deep networks to learn from raw signals, often overlooking the advantages of manual feature extraction, and complex networks can make model convergence difficult.
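For reference, a minimal sketch of a shallow, compact CNN in the spirit of EEGNet and DeepConvNet is given below, written in PyTorch; the filter counts, kernel sizes, and input dimensions are illustrative assumptions, not the published architectures.

# Minimal sketch of a shallow CNN for EEG classification (assumed layer sizes).
import torch
import torch.nn as nn

class ShallowEEGCNN(nn.Module):
    def __init__(self, n_channels=14, n_samples=256, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            # Temporal convolution: local filters along the time axis.
            nn.Conv2d(1, 8, kernel_size=(1, 32), padding=(0, 16), bias=False),
            nn.BatchNorm2d(8),
            # Spatial convolution: mixes all electrodes at each time step.
            nn.Conv2d(8, 16, kernel_size=(n_channels, 1), bias=False),
            nn.BatchNorm2d(16),
            nn.ELU(),
            nn.AvgPool2d(kernel_size=(1, 8)),
            nn.Dropout(0.5),
        )
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 1, n_channels, n_samples)).numel()
        self.classifier = nn.Linear(n_flat, n_classes)

    def forward(self, x):
        # x: (batch, 1, channels, time)
        return self.classifier(self.features(x).flatten(start_dim=1))

logits = ShallowEEGCNN()(torch.randn(8, 1, 14, 256))
print(logits.shape)  # torch.Size([8, 2])

The temporal kernel spans only 32 samples, which makes the local-receptive-field limitation discussed above explicit: dependencies longer than the kernel and pooling window are never modeled directly.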