Reconstructing natural stimulus images from functional magnetic resonance imaging (fMRI) data is one of the most challenging problems in brain decoding and a crucial component of brain–computer interfaces.
1. Introduction
Understanding the neural mechanisms of the human visual system is a central topic in neuroscience. When the visual system receives external stimuli, the human brain encodes the information and produces specific neural responses. Human visual decoding aims to establish the mapping from a given pattern of brain activity back to the input stimulus [1,2]. Functional magnetic resonance imaging (fMRI) indirectly reflects the response of neuronal populations by measuring local variations in the blood oxygen level. Owing to its non-invasive nature and high spatial resolution, fMRI is widely used in human visual decoding [1,2]. Depending on the task, human visual decoding research can be divided into three categories: semantic classification [3], image recognition [4], and image reconstruction [2]. The aim of semantic classification is to predict the stimulus category from brain activity. Image recognition requires the model to identify the seen image from a set of candidate stimuli. Owing to the complexity and low signal-to-noise ratio (SNR) of the fMRI signal, image reconstruction is the most challenging task, as it aims to reconstruct the entire stimulus from the brain activity pattern. Developing a natural image reconstruction model can theoretically help us understand how the brain encodes stimulus information, while practically exploring potential solutions for brain–computer interfaces.
Traditional image reconstruction methods rely on machine learning tools such as linear regression [5,6], Bayesian modeling [7,8], and principal component analysis [9] to estimate pixel values or hand-crafted features from the fMRI signal. These methods can decode simple stimulus images, such as letters and numbers. However, when applied to complex natural scenes, they often fail to produce faithful reconstructions because the models are too simple.
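As a concrete illustration of this family of approaches, the sketch below fits a regularized linear regression from fMRI voxel values to flattened image pixels. It is a generic baseline with assumed shapes and random stand-in data, not the exact model of any of the cited studies.

```python
# Minimal sketch of a traditional linear decoding baseline: ridge regression
# from fMRI voxel patterns to stimulus pixel values. Illustrative only; shapes
# and data are assumptions, not those of any cited study.
import numpy as np
from sklearn.linear_model import Ridge

n_train, n_voxels, img_size = 1000, 4000, 28             # assumed dimensions
X_train = np.random.randn(n_train, n_voxels)             # fMRI patterns (stand-in data)
Y_train = np.random.rand(n_train, img_size * img_size)   # flattened stimulus images (stand-in)

decoder = Ridge(alpha=1.0)                                # L2-regularized linear map
decoder.fit(X_train, Y_train)                             # learn voxels -> pixels

X_test = np.random.randn(10, n_voxels)
Y_pred = decoder.predict(X_test).reshape(-1, img_size, img_size)
print(Y_pred.shape)                                       # (10, 28, 28) reconstructed images
```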
With the rapid development of deep neural networks, deep-learning-based image reconstruction methods have been proposed [2,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]. These approaches fall into two categories: approximation-based approaches [10,11,12,13,14,15,16,17,18] and generation-based approaches [19,20,21,22,23,24,25]. Approximation-based approaches aim to improve the pixel-level similarity between the stimulus and the reconstruction by training well-designed networks from scratch, while generation-based approaches aim to improve the semantic-level consistency by utilizing powerful pre-trained generative models such as generative adversarial networks (GANs), variational autoencoders (VAEs), and diffusion models (DMs).
Although significant progress has been made in deep-learning-based image reconstruction, there is still room for improvement in several respects. Firstly, the input fMRI signals originate from different brain regions, and the visual system involves extensive information exchange between these areas. However, previous studies simply treat the fMRI signals from different regions as a single high-dimensional input vector of the reconstruction model [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25], neglecting this inter-regional information exchange. The resulting features are limited, which degrades model performance. Secondly, the information an image contains at multiple scales has powerful expressive capability: lower-resolution scales provide the global image structure, while higher-resolution scales describe fine image details. Previous single-scale approaches are insufficient for fully exploiting image information [10,11,12,13,14,15,16,17,19,20,21,22,23,24,25], resulting in blurry reconstructions when training samples are limited.
2. Natural Image Reconstruction Methods Based on Deep Learning
2.1. Approximation-Based Methods
Approximation-based methods typically involve designing an effective network and training the model from scratch, without relying on extensive pre-trained components; they are characterized by high pixel-level similarity. Three types of implementation can be distinguished. (1) Iterative optimization. Shen et al. [10] estimated the reconstructed image by iteratively updating the image so as to minimize the distance between the features of the reconstructed image and the features decoded from fMRI. (2) Autoencoder. Beliy et al. [11] utilized an encoder–decoder structure to integrate self-supervised learning (SSL) into the decoding model: the encoder learns the mapping from image to fMRI, while the decoder learns the mapping from fMRI to image. By stacking the encoder and decoder, the model can be trained in an SSL manner on unlabeled data. Gaziv et al. [12] and Qiao et al. [13] further improved this method via multi-level feature similarity loss and alternative encoder–decoder regularization, respectively. (3) Generative adversarial network. Seeliger et al. [14] employed DCGAN as the generator and trained a conversion network to transform the fMRI activity into the generator's input latent variable. Shen et al. [15] achieved end-to-end image reconstruction training using image, feature, and adversarial losses. Fang et al. [16] proposed Shape-Semantic GAN, with a shape decoder that decodes the shape from low-level visual areas and a semantic decoder that decodes the category from high-level visual regions; the outputs of the two decoders are fed into an image generation network to reconstruct the stimulus. Ren et al. [17] utilized a dual-path VAE–GAN network structure and trained the model with a knowledge distillation paradigm. Meng et al. [18] adopted a hierarchical network for image feature extraction and reconstruction, combined with an fMRI decoder that produces intermediate features for the network, to obtain faithful reconstructions.
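As an illustration of the iterative-optimization idea in (1), the following sketch treats the image as a learnable tensor and minimizes the distance between its features and a feature vector assumed to have been decoded from fMRI. The feature extractor, the decoded target, and all shapes are placeholders rather than the networks used in the cited work.

```python
# Minimal PyTorch sketch of iterative optimization: update the image pixels so
# that their features match features decoded from fMRI. Placeholder networks
# and random targets; not the actual models of the cited studies.
import torch
import torch.nn as nn

feature_net = nn.Sequential(                 # stand-in for a pre-trained feature extractor
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(), nn.Flatten()
).eval()
for p in feature_net.parameters():
    p.requires_grad_(False)                  # keep the extractor frozen

decoded_feat = torch.randn(1, 32 * 56 * 56)  # assumed: features predicted from fMRI
image = torch.zeros(1, 3, 224, 224, requires_grad=True)  # image to be optimized
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):                      # iteratively refine the pixels
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(feature_net(image), decoded_feat)
    loss.backward()
    optimizer.step()
```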
2.2. Generation-Based Methods
Generation-based methods exploit powerful pre-trained generative models, combining them with relatively simple fMRI decoders that convert the brain activity into the input or intermediate variables of the generative model; reconstructions with high semantic-level consistency are then obtained through the strong generative ability of these models. Two types of implementation can be distinguished. (1) Generative adversarial network. Mozafari et al. [19] trained a vanilla linear regression model as an fMRI decoder to learn the mapping from fMRI to the input latent variables of the pre-trained BigBiGAN [26]; combining these components yielded reconstructions with high fidelity. Ozcelik et al. [20] further enhanced this approach with another powerful model, ICGAN [27]. Lin et al. [21] utilized the pre-trained CLIP model [28] to extract image and text features from the stimulus image and its corresponding caption; an fMRI decoder comprising convolutional operations and linear layers was trained to align the fMRI activity with the CLIP feature space via contrastive learning, and a pre-trained StyleGAN2 [29] was then adopted to reconstruct the stimulus image. (2) Diffusion model. Chen et al. [22] trained an fMRI feature extraction model using the masked signal modeling paradigm [30] and used the fMRI features as conditional inputs to fine-tune a pre-trained latent diffusion model (LDM) [31]. Ni et al. [23] optimized the implementation of masked signal modeling to further improve image quality. Meng et al. [24] developed a vanilla linear network to map the fMRI signal to features extracted by the pre-trained CLIP model [28]; the decoded features were then combined with the reverse process of an LDM [31] to produce reconstructions. Lu et al. [25] adopted an LDM [31] to obtain an initial reconstruction and then iteratively updated its input variables with the objective of structural similarity between the reconstruction and the corresponding ground truth.
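The common recipe behind these methods can be illustrated with a minimal sketch: a simple regressor maps fMRI activity to the latent (or conditioning) space of a frozen generator, which then synthesizes the reconstruction. The generator below is a placeholder module, not BigBiGAN, ICGAN, StyleGAN2, or an actual latent diffusion model, and all data are random stand-ins.

```python
# Minimal sketch of the generation-based recipe: fMRI -> latent code -> frozen
# generator -> reconstructed image. All components and shapes are assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import Ridge

latent_dim, n_voxels = 128, 4000                       # assumed dimensions
fmri = np.random.randn(800, n_voxels)                  # training fMRI (stand-in data)
latents = np.random.randn(800, latent_dim)             # latents of the seen images (stand-in)

to_latent = Ridge(alpha=10.0).fit(fmri, latents)       # simple fMRI -> latent decoder

generator = nn.Sequential(                             # placeholder for a frozen pre-trained generator
    nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh()
).eval()

z = torch.tensor(to_latent.predict(np.random.randn(1, n_voxels)), dtype=torch.float32)
with torch.no_grad():
    reconstruction = generator(z).view(1, 3, 64, 64)   # synthesized stimulus image
print(reconstruction.shape)
```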
3. Graph Neural Network
A graph is a data structure consisting of objects (nodes) and the relationships between them (edges). It is used to represent data in non-Euclidean spaces such as social networks and knowledge dependencies. A graph neural network (GNN) is a deep learning model that extracts features from the topological information of a graph via information exchange between nodes, and it has shown promising results on a variety of node-, edge-, and graph-level tasks [32]. In fMRI activity, brain regions and the functional connectivity between them exhibit an explicit graph structure, so some researchers have attempted to incorporate GNNs into the processing model. Kawahara et al. [33] proposed BrainNetCNN for neurodevelopment prediction, using novel edge-to-edge, edge-to-node, and node-to-graph layers to capture the topological relationships between brain areas. Li et al. [34] further advanced this approach by introducing multi-order path information aggregation. Meng et al. [35] developed a visual stimulus category decoding model based on a graph convolutional neural network, which extracts functional correlation features between different brain regions. Saeidi et al. [36] employed a graph neural network to decode task-fMRI data, combining a graph convolutional operation with various node embedding algorithms. However, previous approaches based on graph neural networks have focused solely on either nodes or edges, neglecting the importance of interactions between them, which restricts the expressive power of the model.
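To make the graph view of fMRI concrete, the sketch below performs one standard graph-convolution step over brain-region nodes, with edges derived from a thresholded functional connectivity matrix. The region count, feature sizes, and threshold are assumptions for illustration, not parameters from the cited studies.

```python
# Minimal sketch of one graph-convolution step over brain regions: nodes carry
# voxel-derived features, edges come from functional connectivity, and message
# passing mixes information between connected regions (symmetric GCN-style
# normalization). Shapes and the connectivity threshold are assumptions.
import torch
import torch.nn as nn

n_regions, in_dim, out_dim = 30, 64, 32
x = torch.randn(n_regions, in_dim)                       # per-region fMRI features (stand-in)
conn = torch.rand(n_regions, n_regions)                  # stand-in functional connectivity
adj = (((conn + conn.t()) / 2) > 0.7).float()            # symmetrize and threshold to a binary graph
adj = adj + torch.eye(n_regions)                         # add self-loops

deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
norm_adj = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]  # D^-1/2 (A + I) D^-1/2

gcn_layer = nn.Linear(in_dim, out_dim, bias=False)
h = torch.relu(norm_adj @ gcn_layer(x))                  # one round of message passing
print(h.shape)                                           # (30, 32) region embeddings
```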
4. Multi-Scale Constraint
In modern convolutional neural networks, the feature extraction process is typically divided into several stages that extract multi-scale features from images [37]. High-resolution features contain fine details, while low-resolution features provide coarse image structure. By adding a multi-scale constraint to the network, the model can effectively utilize information at different levels, resulting in improved performance and robustness. For instance, SSD [38] advanced object detection performance by predicting on feature maps of various scales simultaneously. DeepLabV3+ [39] boosted semantic segmentation accuracy by integrating local and global embeddings through atrous convolution and a multi-scale feature fusion module. In the field of natural image reconstruction from brain activity, Miyawaki et al. [5] reconstructed arbitrary binary contrast patterns by predicting separately on predefined multi-scale local image bases. Luo et al. [40] proposed DA-HLGN-MSFF, which combines hierarchical feature extraction with a multi-scale feature fusion block to improve reconstruction performance. Meng et al. [18] exploited a similar multi-scale encoder–decoder architecture to achieve promising natural image reconstruction.
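A multi-scale constraint of this kind can be expressed as a loss applied to the reconstruction and the target at several resolutions, so that coarse scales constrain the global structure while the finest scale constrains details. The sketch below is illustrative; the scale set and weights are assumptions rather than those of any cited model.

```python
# Minimal sketch of a multi-scale reconstruction loss: the same pixel loss is
# evaluated at several resolutions and combined with assumed weights.
import torch
import torch.nn.functional as F

def multiscale_loss(recon, target, scales=(1.0, 0.5, 0.25), weights=(1.0, 0.5, 0.25)):
    loss = 0.0
    for s, w in zip(scales, weights):
        if s == 1.0:
            r, t = recon, target
        else:
            r = F.interpolate(recon, scale_factor=s, mode="bilinear", align_corners=False)
            t = F.interpolate(target, scale_factor=s, mode="bilinear", align_corners=False)
        loss = loss + w * F.mse_loss(r, t)
    return loss

recon = torch.rand(4, 3, 128, 128, requires_grad=True)   # reconstructed batch (stand-in)
target = torch.rand(4, 3, 128, 128)                       # ground-truth stimuli (stand-in)
print(multiscale_loss(recon, target).item())
```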
This entry is adapted from the peer-reviewed paper 10.3390/brainsci14030234