Deep Learning for Motor Imagery Brain–Computer Interface: Comparison
Please note this is a comparison between Version 1 by Konstantinos Karampidis and Version 2 by Catherine Yang.

The field of brain–computer interfaces (BCIs) establishes a communication pathway between the human brain and computers, with applications in both medical and nonmedical fields. Brain–computer interfaces can have a significant impact on the way humans interact with machines. In recent years, the surge in computational power has made deep learning algorithms a robust approach for building BCIs.

  • EEG
  • deep learning
  • BCI
  • motor imagery

1. Introduction

Brain–computer interfaces (BCIs) are an emerging technology that connects the brain to a computer or other external devices. BCIs have the potential to revolutionize the way humans interact with machines, opening countless possibilities in both medical and nonmedical domains. In the medical field, they can help people suffering from locked-in syndrome to communicate [1]. Moreover, BCIs show potential in neuroprosthetics, offering individuals with limb amputations or paralysis the prospect of commanding robotic limbs or exoskeletons through their brain signals [2]. In epilepsy management, BCIs are being researched for real-time seizure detection and intervention, potentially mitigating the impact of seizures [3]. In the nonmedical domain, EEG-based BCIs have applications in gaming, where players can play a game using only their thoughts [4], and in entertainment, where users can control drones or other robotic devices [5].

2. Convolutional Neural Networks

CNNs mimic the operational principles of the human visual cortex and can learn spatial hierarchies in EEG data, recognizing patterns associated with motor imagery tasks through multiple layers of transformations [6][51]. A CNN architecture begins with an input layer that accepts raw or preprocessed EEG data, as shown in Figure 19. These data can be represented in various formats, such as time–frequency images, allowing the network to process and analyze the brain signals associated with motor imagery. The input is then convolved with multiple kernels (filters), enabling the network to learn local features. Subsequently, a pooling layer reduces the dimensionality of the resulting feature maps. As data pass through successive layers, the model learns increasingly complex features. The final component is a fully connected (dense) classification layer that maps the learned high-level features to the desired output classes, such as different types of motor imagery, converting abstract representations into class decisions.
Figure 19. Typical EEG CNN architecture [7][52].
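The layer sequence described above (convolution, pooling, dense classification) can be sketched as a single forward pass. This is a minimal illustration with random, untrained weights, not any of the reviewed architectures. The channel count, kernel length, pool size, and number of classes are assumptions chosen for the example, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_temporal(x, kernels):
    """Valid cross-correlation along time (the 'convolution' used in CNNs).

    x: (channels, time); kernels: (n_filters, channels, kernel_len).
    """
    n_filters, _, k = kernels.shape
    n_out = x.shape[1] - k + 1
    out = np.empty((n_filters, n_out))
    for t in range(n_out):
        window = x[:, t:t + k]  # (channels, kernel_len)
        out[:, t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

def max_pool(x, size):
    """Non-overlapping max pooling along time; drops any trailing remainder."""
    n = x.shape[1] // size
    return x[:, :n * size].reshape(x.shape[0], n, size).max(axis=2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cnn_forward(x, kernels, w_dense, b_dense, pool=4):
    h = np.maximum(conv1d_temporal(x, kernels), 0.0)  # convolution + ReLU
    h = max_pool(h, pool)                             # dimensionality reduction
    return softmax(w_dense @ h.ravel() + b_dense)     # dense classification layer

# Toy input: 8 EEG channels, 128 time samples, 4 motor imagery classes (assumed sizes).
n_channels, n_samples, n_classes, n_filters, kernel_len = 8, 128, 4, 6, 9
x = rng.standard_normal((n_channels, n_samples))
kernels = rng.standard_normal((n_filters, n_channels, kernel_len)) * 0.1
n_features = n_filters * ((n_samples - kernel_len + 1) // 4)
w_dense = rng.standard_normal((n_classes, n_features)) * 0.01
b_dense = np.zeros(n_classes)

probs = cnn_forward(x, kernels, w_dense, b_dense)
print(probs)  # one probability per class
```

In practice the weights would be learned by backpropagation in a deep learning framework. The sketch only shows how data flow through the three stages.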
Dose et al. proposed a CNN trained on 3 s segments of EEG signals [8][53]. The method achieved accuracies of 80.10%, 69.72%, and 59.71% on two, three, and four MI classes, respectively, on the Physionet dataset. Miao et al. proposed a five-layer CNN to classify two motor imagery tasks, right hand and right foot, from the BCI Competition III-IV-a dataset, achieving a 90% accuracy [9][54]. Zhao et al. proposed a novel CNN with multiple spatial–temporal convolution (STC) blocks and fully connected layers [10][55]. Contrastive learning was used to push the negative samples away and pull the positive samples together. This method achieved accuracies of 74.10% on the BCI III-2a, 73.62% on the SMR-BCI, and 69.43% on the OpenBMI datasets. Liu et al. proposed an end-to-end compact multibranch one-dimensional CNN (CMO-CNN) for decoding MI EEG signals, achieving accuracies of 83.92% and 87.19% on the BCI Competition IV-2a and IV-2b datasets, respectively [11][56].
Han et al. proposed a parallel CNN (PCNN) to classify motor imagery signals [12][57]. The method, which achieved an average accuracy of 83.0% on the BCI Competition IV-2b dataset, first projected the raw EEG signals into a low-dimensional space using a regularized common spatial pattern (RCSP) to enhance class distinctions. Then, the short-time Fourier transform (STFT) extracted the mu and beta bands as frequency features, which were combined into 2D images for the PCNN input. The efficacy of the PCNN structure was evaluated against other methods such as a stacked autoencoder (SAE), CNN-SAE, and CNN. Ma et al. proposed an end-to-end, shallow, and lightweight CNN framework, known as Channel-Mixing-ConvNet, aimed at improving the decoding accuracy of the EEG-Motor Raw datasets [13][58]. Unlike traditional methods, the first block of the network was designed to implicitly stack temporal–spatial convolution layers to learn temporal and spatial EEG features after EEG channels were mixed. This approach integrated the feature extraction capabilities of both layers and enhanced performance, resulting in an accuracy of 74.9% on the BCI IV-2a dataset and 95.0% on the High Gamma Dataset (HGD).
Ak et al. analyzed EEG data to control a robotic arm, using spectrogram images derived from the EEG data as input to GoogLeNet [14][59]. They tested the system on imagined directional movements (up, down, left, and right). The robotic arm executed the desired movements with over 90% accuracy, and the approach achieved 92.59% accuracy on their private dataset. Musallam et al. proposed the TCNet-Fusion model, which combined multiple techniques such as temporal convolutional networks (TCNs), separable convolution, depthwise convolution, and layer fusion [15][60]. This process created an image-like representation, which was then fed into the primary TCN. During testing, the model achieved a classification accuracy of 83.73% on the four-class motor imagery task of the BCI Competition IV-2a dataset and 94.41% on the High Gamma Dataset.
Zhang et al. proposed a CNN with a 1D convolution on each channel followed by a 2D convolution to extract spatial features across all 20 channels [16][61]. To reduce the high computational cost, they used pruning, a technique that reduces the size and complexity of a neural network by removing certain connections or neurons. Specifically, a fast recursive algorithm (FRA) was applied to prune redundant parameters in the fully connected layers. The architecture achieved an accuracy of 62.7% on the OPENBCI dataset. Vishnupriya et al. took a similar approach to reduce the complexity of their architecture [17][62]. Magnitude-based weight pruning was performed on the network, which achieved an accuracy of 84.46% on two MI tasks (left hand, right hand) in Lee et al.’s dataset.
Shajil et al. proposed a CNN architecture to classify four MI tasks, applying a common spatial pattern filter to the raw EEG signal and then using spectrograms extracted from the filtered signals as input to the CNN [18][63]. The method achieved an accuracy of 86.41% on their private dataset. Korhan et al. proposed a CNN architecture with five layers [19][64]. The architecture was evaluated in three configurations: the CNN alone without any filtering, the CNN with five different filters, and common spatial patterns followed by the CNN. The last configuration achieved the highest accuracy, 93.75%, on the BCI Competition III-3a dataset. Alazrai et al. proposed a CNN-based approach in which the raw signal was transformed into the time–frequency domain with a quadratic time–frequency distribution (QTFD), and a CNN then extracted and classified the features [20][65]. The method was tested on their two private datasets with 11 MI tasks (rest, grasp-related tasks, wrist-related tasks, and finger-related tasks) and obtained accuracies of 73.7% for the able-bodied subjects and 72.8% for the transradial-amputated subjects. Table 12 summarizes the research articles that utilize CNNs along with the tasks, the datasets used, and their performance.
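Several of the methods above feed time–frequency images to a CNN. As a rough illustration of how an STFT can turn one EEG channel into a 2D image restricted to the mu and beta bands, consider the sketch below. It is not the pipeline of any reviewed paper: the sampling rate, the window and hop lengths, and the band edges (8–13 Hz for mu, 13–30 Hz for beta) are assumptions, and the function names are hypothetical.

```python
import numpy as np

def stft_magnitude(signal, fs, win_len, hop):
    """Magnitude STFT with a Hann window; returns (freqs, spectrum[freq, frame])."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    return freqs, spectrum.T

def band_image(signal, fs, bands=((8.0, 13.0), (13.0, 30.0)), win_len=128, hop=16):
    """Stack the STFT rows of each band (half-open [lo, hi)) into one 2D image."""
    freqs, spec = stft_magnitude(signal, fs, win_len, hop)
    rows = [spec[(freqs >= lo) & (freqs < hi)] for lo, hi in bands]
    return np.vstack(rows)

# Toy signal: 3 s of a 10 Hz (mu-range) sinusoid sampled at an assumed 250 Hz.
fs = 250
t = np.arange(3 * fs) / fs
signal = np.sin(2 * np.pi * 10.0 * t)

image = band_image(signal, fs)
print(image.shape)  # (frequency rows, time frames)
```

For multichannel recordings, one such image per channel could be stacked or tiled before being passed to the network. How the reviewed works combine channels is not specified here.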
Table 12. Reviewed CNN architectures, datasets and their accuracies.
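The magnitude-based weight pruning mentioned for Vishnupriya et al. can be illustrated with a generic sketch: the weights with the smallest absolute values are set to zero. This is a common textbook formulation, not the authors' exact procedure, and it does not implement the fast recursive algorithm used by Zhang et al. The sparsity level and layer shape are assumptions.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    Returns the pruned weights and the boolean keep-mask. Weights tied with the
    threshold value are pruned too, so slightly more than `sparsity` may be removed.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Toy fully connected layer (16 x 8), pruned to an assumed 75% sparsity.
rng = np.random.default_rng(0)
weights = rng.standard_normal((16, 8))
pruned, mask = magnitude_prune(weights, sparsity=0.75)
print(np.count_nonzero(pruned), "of", weights.size, "weights kept")
```

Pruned networks are usually fine-tuned afterwards so the remaining weights can compensate for the removed ones.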
79] proposed a deep neural network with four layers of 50, 30, 15, and 1 nodes, respectively, achieving a 49.5% classification accuracy on the BCI Competition IV-2b dataset with two MI tasks selected (arm and foot movement). Cheng et al. proposed a deep neural network that took as input multiple sub-bands of the raw signal extracted with a sliding-window strategy [36][80]. From these sub-bands, diverse spatial–spectral features were extracted and fed into a deep neural network for classification, achieving an accuracy of 71.5% on their private dataset. Yohanandan et al. proposed a binary classifier (relaxed state vs. right-hand MI) using a deep neural network fed with μ-rhythm (8–12 Hz) data [37][81]. The authors tested sliding windows from 1 s to 9 s to determine which window length yielded the highest accuracy. An average accuracy of ~83% was achieved on their privately collected dataset from seven human volunteers. Kumar et al. proposed a deep neural network to classify features extracted with common spatial patterns from the BCI Competition III-4a dataset, achieving an accuracy of ~85% on two MI tasks (right hand and left foot) [38][82]. Table 34 shows the performance of each of the aforementioned architectures.
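The sliding-window strategy referred to above amounts to cutting a recording into fixed-length, possibly overlapping segments. The sketch below shows only the segmentation step for window lengths from 1 s to 9 s. The sampling rate, channel count, recording length, and 1 s step are assumptions, and selecting the best window would additionally require training and evaluating a classifier per window length, which is omitted here.

```python
import numpy as np

def sliding_windows(eeg, fs, window_s, step_s):
    """Cut eeg (channels, samples) into windows of window_s seconds.

    Returns an array of shape (n_windows, channels, window_samples).
    """
    win = int(round(window_s * fs))
    step = int(round(step_s * fs))
    if win > eeg.shape[1]:
        raise ValueError("window is longer than the recording")
    starts = range(0, eeg.shape[1] - win + 1, step)
    return np.stack([eeg[:, s:s + win] for s in starts])

# Toy recording: 3 channels, 9 s at an assumed 160 Hz.
fs = 160
rng = np.random.default_rng(0)
eeg = rng.standard_normal((3, 9 * fs))

# Number of segments obtained for each candidate window length (1 s step).
counts = {w: sliding_windows(eeg, fs, w, step_s=1.0).shape[0] for w in range(1, 10)}
segments = sliding_windows(eeg, fs, window_s=3, step_s=1.0)
print(counts)
print(segments.shape)
```

Longer windows give each segment more context but yield fewer training segments per recording, which is the trade-off a window-length search explores.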
Table 34. Reviewed deep neural network architectures and their accuracies.
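Common spatial patterns (CSP) appear repeatedly above as a feature extraction step before a neural network. A minimal two-class CSP can be sketched with NumPy alone, using whitening followed by an eigendecomposition. This is the standard textbook formulation, not the regularized or filter-bank variants used in some of the reviewed works. The toy data, the number of filter pairs, and all function names are assumptions.

```python
import numpy as np

def mean_covariance(trials):
    """Average trace-normalized spatial covariance; trials: (n_trials, channels, samples)."""
    covs = []
    for x in trials:
        c = x @ x.T
        covs.append(c / np.trace(c))
    return np.mean(covs, axis=0)

def csp_filters(trials_a, trials_b, n_pairs=2):
    """Return 2 * n_pairs spatial filters (rows), ordered by ascending class-a variance."""
    ca, cb = mean_covariance(trials_a), mean_covariance(trials_b)
    evals, evecs = np.linalg.eigh(ca + cb)
    whitening = (evecs / np.sqrt(evals)).T        # P with P (ca + cb) P^T = I
    _, b = np.linalg.eigh(whitening @ ca @ whitening.T)
    w = b.T @ whitening                           # all filters, one per row
    picks = np.r_[np.arange(n_pairs), np.arange(-n_pairs, 0)]
    return w[picks]

def log_variance_features(trials, filters):
    """Normalized log-variance of each CSP component, one feature row per trial."""
    projected = filters @ trials                  # (n_trials, n_filters, samples)
    var = projected.var(axis=2)
    return np.log(var / var.sum(axis=1, keepdims=True))

# Toy data: 4 channels; class a is strong on channel 0, class b on channel 1.
rng = np.random.default_rng(0)
trials_a = rng.standard_normal((20, 4, 200))
trials_b = rng.standard_normal((20, 4, 200))
trials_a[:, 0, :] *= 3.0
trials_b[:, 1, :] *= 3.0

filters = csp_filters(trials_a, trials_b, n_pairs=2)
features_a = log_variance_features(trials_a, filters)
features_b = log_variance_features(trials_b, filters)
print(features_a.mean(axis=0), features_b.mean(axis=0))
```

The resulting log-variance features are what a downstream classifier, such as the deep neural networks described above, would receive as input.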
Autoencoder architecture [84].
Autthasan et al. proposed an end-to-end multitask autoencoder and tested it on three datasets, BCI Competition IV-2a, SMR-BCI, and OpenBMI, achieving accuracies of 70.09%, 72.95%, and 66.51%, respectively [41][85].
Capsule networks, which introduce a hierarchical structure to capture pose and viewpoint information, have also shown promising results in MI task classification [42][86]. Capsules use vector-based representations, which enables the network to capture hierarchical relationships and spatial dependencies among features. Each capsule comprises a group of neurons, with each neuron’s output representing a different property of the same feature, so that a whole entity is recognized by first identifying its parts. Ha et al. proposed a capsule network that took images extracted with the short-time Fourier transform as input [43][87]. Their method achieved a 77% accuracy on the BCI Competition IV-2b dataset (left-hand and right-hand MI tasks).
Long short-term memory (LSTM) networks [44][88], a type of recurrent neural network, have been used to model temporal dependencies in MI data, enabling effective sequence learning for classification. Leon-Urbano et al. applied an LSTM to a dataset from the MNE-Python library consisting of two MI tasks (feet, hands) and, after fine-tuning their model, achieved a 90% accuracy [45][89]. Saputra et al. also deployed an LSTM network on the BCI Competition IV-2a dataset, achieving an accuracy of 49.65% [46][90]. Hwang et al. likewise performed LSTM-based classification on the BCI Competition IV-2a dataset, with feature extraction based on an overlapping-band filter-bank common spatial pattern (FBCSP), achieving an accuracy of 97% [47][91]. Ma et al. proposed a parallel architecture comprising a temporal LSTM and a spatial bidirectional LSTM [48][92]. The method was tested on the four MI tasks (moving both feet, both fists, left fist, and right fist) of the EEGMMIDB dataset and achieved an accuracy of 68.20%.
Another approach is the restricted Boltzmann machine [49][93], a type of probabilistic graphical model that can model joint probability distributions. Xu et al. used a restricted Boltzmann machine and a support vector machine (SVM) to classify and recognize deep multiview features [50][94]. The method achieved an accuracy of 78.50% on the BCI Competition IV-2a dataset.
Moreover, metalearning [51][95] enables models to learn how to learn from a limited quantity of data. This is achieved by training the model on a diverse range of tasks, allowing it to leverage the knowledge gained from these tasks when presented with new ones. Among the various metalearning algorithms, one of the most prominent is model-agnostic metalearning (MAML) [51][95]. MAML trains the model so that its parameters can be updated efficiently, facilitating rapid adaptation to new tasks with minimal updates. Li et al. proposed a metalearning method that learned from the output of other machine learning algorithms [52][96]. The method achieved an 80% accuracy on the Physionet dataset (left fist vs. right fist and both fists vs. both feet).
Contrastive learning [53][97] is a self-supervised learning technique that aims to create meaningful representations by contrasting positive and negative pairs of data. Han et al. proposed a contrastive learning network that, when tested on the BCI Competition IV-2a dataset, achieved an accuracy of 79.54% with all the training labels used [54][98].
A deep belief network (DBN) [55][99] is an unsupervised neural network known for extracting features from raw data. It uses a two-step training process: unsupervised pretraining with a restricted Boltzmann machine followed by supervised fine-tuning. Li et al. proposed a deep belief architecture in which time–frequency information from the raw EEG signal was fed into the DBN, which performed the identification and classification [56][100]. The method achieved an accuracy of 93.57% on the BCI Competition II-3 dataset. A synopsis of the above-mentioned proposals can be found in Table 45.
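The idea of contrasting positive and negative pairs, described above, can be made concrete with an InfoNCE-style loss. This is one common formulation, chosen here as an assumption; the loss functions actually used in the reviewed works are not stated in this text. The embedding size, batch size, and temperature are illustrative.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss.

    anchors, positives: (n, d). Row i of `positives` is the positive sample for
    row i of `anchors`; every other row in the batch acts as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # positives sit on the diagonal

# Toy embeddings: positives are slightly perturbed copies of the anchors.
rng = np.random.default_rng(0)
anchors = rng.standard_normal((8, 16))
positives = anchors + 0.05 * rng.standard_normal((8, 16))

loss_aligned = info_nce_loss(anchors, positives)
loss_shuffled = info_nce_loss(anchors, positives[::-1])  # every pair mismatched
print(loss_aligned, loss_shuffled)
```

Minimizing such a loss pulls each anchor toward its positive and pushes it away from the other samples in the batch, so correctly paired embeddings give a lower loss than mismatched ones.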
Table 45. Other reviewed deep learning architectures and their accuracies.
Authors | Accuracy | Dataset | MI Tasks | Architecture
Autthasan et al. [41][85] | 70.09%, 72.95%, 66.51% | BCI IV-2a, SMR-BCI, OpenBMI | LH, RH, BL, T | Autoencoder
Ha et al. [43][87] | 77% | BCI IV-2b | LH, RH | Capsule network
Leon-Urbano et al. [45][89] | 90% | MNE dataset | Feet, hands | LSTM
Limpiti et al. [29][73] | 95.03%, 91.86% | BCI IV-2a | LH, RH, BL, T |
Yohanandan et al. [37][81] | 83% | Private | RS, RH | Deep neural network
Han et al. [12][57] | 83% | BCI IV-2b | LH, RH | CNN
Ma et al. [13][58] | 74.9%, 95.0% | BCI IV-2a, HGD | LH, RH, BL, T | CNN
Ak et al. [14][59] | 92.59% | Private | U, D, L, R | CNN
Musallam et al. [15][60] | 83.73%, 94.41% | BCI IV-2a, HGD | LH, RH, BL, T | CNN
Zhang et al. [16][61] | 62.7% | OpenBMI | LH, RH | CNN
Vishnupriya et al. [17][62] | 84.46% | Lee et al. | LH, RH | CNN
Shajil et al. [18][63] | 86.41% | Private | LH, RH, BH, BL | CNN
Korhan et al. [19][64] | 93.75% | BCI III-3a | LH, RH, BL, T | CNN
Alazrai et al. [20][65] | 73.7%, 72.8% | Private | RS, SDG, LG, ETG, RDW, EW, FI, FM, FR, FL, FT | CNN
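For the capsule networks discussed above, the vector-based representation is typically passed through a "squash" nonlinearity so that a capsule's output length lies between 0 and 1 while its direction is preserved. The sketch below follows the formulation from the original capsule network literature. Whether the reviewed MI work uses exactly this function is an assumption, and the example vectors are illustrative.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Scale vectors to length ||s||^2 / (1 + ||s||^2), keeping their direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

# Two toy capsule outputs: a short vector and a long one.
capsules = np.array([[0.1, 0.0],
                     [3.0, 4.0]])
squashed = squash(capsules)
lengths = np.linalg.norm(squashed, axis=1)
print(lengths)  # short vectors shrink toward 0, long vectors approach 1
```

The length of a squashed capsule vector can then be read as the probability that the entity it represents is present, while its orientation encodes the entity's properties.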