Natural Image Reconstruction from fMRI: Comparison
Please note this is a comparison between Version 2 by Rita Xu and Version 1 by ZongYi Zhan.

Reconstructing natural stimulus images from functional magnetic resonance imaging (fMRI) data is one of the most challenging problems in brain decoding and a crucial component of brain-computer interfaces.

  • brain decoding
  • natural image reconstruction
  • fMRI

1. Introduction

Understanding the neural mechanisms of the human visual system is a central topic in neuroscience. When the visual system receives external stimuli, the human brain encodes the information and produces specific neural responses. Human visual decoding aims to establish a mapping from observed brain activity back to the input stimulus information [1][2]. Functional magnetic resonance imaging (fMRI) indirectly reflects the response of neuronal populations by measuring local variations in blood oxygen level. Owing to its non-invasive nature and high spatial resolution, fMRI is widely used in human visual decoding [1][2]. According to the task, human visual decoding research can be divided into three categories: semantic classification [3], image recognition [4], and image reconstruction [2]. Semantic classification predicts the stimulus category from brain activity. Image recognition requires the model to identify the seen image from a set of candidate stimuli. Image reconstruction, which aims to recover the entire stimulus from the brain activity pattern, is the most challenging task because of the complexity and low signal-to-noise ratio (SNR) of the fMRI signal. Developing a natural image reconstruction model can theoretically help us understand how the brain encodes stimulus information, while practically exploring potential solutions for brain-computer interfaces.
Traditional image reconstruction methods rely on machine learning tools such as linear regression [5][6], Bayesian modeling [7][8], and principal component analysis [9] to estimate pixel values or hand-crafted features from the fMRI signal. These methods can decode simple stimulus images, such as letters and numbers. However, when applied to complex natural scenes, they often fail to produce faithful reconstructions because the models are too simple.
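As an illustration of this classical pipeline, a minimal ridge-regression decoder that maps voxel activity directly to pixel values might look like the sketch below. All data here are synthetic, and the trial, voxel, and image dimensions are arbitrary assumptions, not taken from any cited study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 trials, 500 voxels, 16x16 stimulus images (flattened).
n_trials, n_voxels, n_pixels = 200, 500, 16 * 16
X = rng.standard_normal((n_trials, n_voxels))            # fMRI patterns
W_true = 0.1 * rng.standard_normal((n_voxels, n_pixels))
Y = X @ W_true + 0.01 * rng.standard_normal((n_trials, n_pixels))  # pixel values

# Ridge regression: W = (X^T X + lam * I)^(-1) X^T Y
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(n_voxels), X.T @ Y)

# "Reconstruct" a stimulus from one brain-activity pattern.
recon = (X[0] @ W).reshape(16, 16)
print(recon.shape)  # (16, 16)
```

With real fMRI data, the decoder would be fit on training trials and evaluated on held-out trials; the closed-form ridge solution is what makes such linear methods tractable but also what limits them on complex natural scenes.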
With the rapid development of deep neural networks, deep-learning-based image reconstruction methods have been proposed [2][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25]. These approaches fall into two categories: approximation-based approaches [10][11][12][13][14][15][16][17][18] and generation-based approaches [19][20][21][22][23][24][25]. Approximation-based approaches aim to improve the pixel-level similarity between the stimulus and the reconstruction by training well-designed networks from scratch, while generation-based approaches aim to improve semantic-level consistency by exploiting powerful pre-trained generative models such as generative adversarial networks (GANs), variational autoencoders (VAEs), and diffusion models (DMs).
Although significant progress has been made in deep-learning-based image reconstruction methods, there is still room for improvement in several respects. Firstly, the input fMRI signals originate from different brain regions, and the visual system's underlying mechanisms involve extensive information interaction between these areas. However, previous studies treat the fMRI signals from different regions as a single high-dimensional vector input to the reconstruction model [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25], neglecting this information exchange during processing. The resulting limited features lead to poor model performance. Secondly, the information an image contains at multiple scales has powerful expressive capability: lower-resolution scales provide the global image structure, and higher-resolution scales describe the fine image details. Previous single-scale approaches are insufficient for fully exploiting this information [10][11][12][13][14][15][16][17][19][20][21][22][23][24][25], resulting in blurry reconstructed images when training samples are limited.

2. Natural Image Reconstruction Methods Based on Deep Learning

2.1. Approximation–Based Methods

Approximation-based methods typically involve designing an effective network and training the model from scratch, without relying on excessive pre-trained components. These methods are characterized by high pixel-level similarity between the reconstruction and the stimulus. They can be divided into three types of implementation: (1) Iterative optimization. Shen et al. [10] estimated the reconstructed image by iteratively updating it with the objective of minimizing the distance between the features of the reconstructed image and the features decoded from fMRI. (2) Autoencoder. Beliy et al. [11] used an encoder-decoder structure to integrate self-supervised learning (SSL) into the decoding model. The encoder learns the mapping from image to fMRI, while the decoder learns the mapping from fMRI to image. By stacking the encoder and decoder, the model can be trained in an SSL manner on unlabeled data. Gaziv et al. [12] and Qiao et al. [13] further improved this method via a multi-level feature similarity loss and alternating encoder-decoder regularization, respectively. (3) Generative adversarial network. Seeliger et al. [14] employed DCGAN as the generator and trained a conversion network to map the fMRI activity to the generator's input latent variable. Shen et al. [15] achieved end-to-end image reconstruction training using image, feature, and adversarial losses. Fang et al. [16] proposed the Shape-Semantic GAN, with a shape decoder that decodes shape from low-level visual areas and a semantic decoder that decodes category from high-level visual regions; the outputs of the two decoders are fed to an image generation network to reconstruct the stimulus. Ren et al. [17] used a dual-path VAE-GAN network structure and trained the model with a knowledge distillation paradigm. Meng et al. [18] adopted a hierarchical network for image feature extraction and reconstruction, combined with an fMRI decoder that produces intermediate features for the network, to obtain faithful reconstructions.
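The iterative optimization strategy of type (1) can be sketched with a toy example. A fixed random linear map stands in for the pre-trained feature extractor (methods such as [10] use a deep CNN), and `target_feat` plays the role of the features decoded from fMRI; the dimensions and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed random linear "feature extractor" (stand-in for a pre-trained CNN).
n_pixels, n_feat = 64, 32
F = rng.standard_normal((n_feat, n_pixels)) / np.sqrt(n_pixels)

# target_feat plays the role of image features decoded from fMRI.
true_img = rng.standard_normal(n_pixels)
target_feat = F @ true_img

# Iteratively update the candidate image to minimize ||F x - target_feat||^2.
x = np.zeros(n_pixels)
lr = 0.1
for _ in range(1000):
    grad = 2.0 * F.T @ (F @ x - target_feat)  # gradient of the feature loss
    x -= lr * grad

loss = float(np.sum((F @ x - target_feat) ** 2))
print(loss < 1e-3)  # True
```

The loop never touches pixel-level supervision: only the feature-space distance drives the update, which is the essential idea behind optimizing an image to match fMRI-decoded features.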

2.2. Generation–Based Methods

Generation-based methods exploit powerful pre-trained generative models, combining them with relatively simple fMRI decoders that convert brain activity into the input or intermediate variables of the generative model, and obtain reconstructions with high semantic-level consistency thanks to the models' strong generative ability. They can be divided into two types of implementation: (1) Generative adversarial network. Mozafari et al. [19] trained a plain linear regression model as an fMRI decoder to learn the mapping from fMRI to the input latent variables of the pre-trained BigBiGAN [26]; combining these components yielded reconstructions with high fidelity. Ozcelik et al. [20] further enhanced this approach with another powerful model, ICGAN [27]. Lin et al. [21] used the pre-trained CLIP model [28] to extract image and text features from the stimulus image and its caption; an fMRI decoder comprising convolutional operations and linear layers was trained to align fMRI activity with the CLIP feature space via contrastive learning, and a pre-trained StyleGAN2 [29] was then adopted to reconstruct the stimulus image. (2) Diffusion model. Chen et al. [22] trained an fMRI feature extraction model with the masked signal modeling paradigm [30] and used the fMRI features as conditional inputs to fine-tune a pre-trained latent diffusion model (LDM) [31]. Ni et al. [23] optimized the masked signal modeling implementation to further improve image quality. Meng et al. [24] developed a plain linear network to map the fMRI signal to features extracted by the pre-trained CLIP model [28]; the decoded features were then combined with the reverse process of an LDM [31] to produce reconstructions. Lu et al. [25] adopted an LDM [31] to obtain an initial reconstruction and then iteratively updated its input variables with the objective of structural similarity between the reconstruction and the corresponding ground truth.
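The common recipe of type (1), a simple linear decoder feeding a frozen generator, can be sketched as follows. Here `pretrained_generator` is a hypothetical stand-in for a real frozen model such as a GAN generator, and all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

def pretrained_generator(z):
    # Hypothetical stand-in for a frozen pre-trained generator (e.g. a GAN):
    # a fixed random nonlinear map from a latent vector to a 256-dim "image".
    W = np.random.default_rng(42).standard_normal((256, z.shape[-1]))
    return np.tanh(z @ W.T)

# Training pairs: fMRI patterns and the latent codes of their stimuli.
n_trials, n_voxels, n_latent = 100, 300, 32
fmri = rng.standard_normal((n_trials, n_voxels))
latents = 0.05 * (fmri @ rng.standard_normal((n_voxels, n_latent)))

# Fit a ridge-regularized linear decoder: fMRI -> latent.
lam = 10.0
B = np.linalg.solve(fmri.T @ fmri + lam * np.eye(n_voxels), fmri.T @ latents)

# Decode a latent from brain activity and feed it to the frozen generator.
z_hat = fmri[0] @ B
recon = pretrained_generator(z_hat)
print(recon.shape)  # (256,)
```

Only the linear map `B` is fit to brain data; the generator's weights stay fixed, which is why these methods need far fewer fMRI training samples than training a reconstruction network from scratch.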

3. Graph Neural Network

A graph is a data structure consisting of objects (nodes) and the relationships between them (edges). It is used to represent data in non-Euclidean spaces, such as social networks and knowledge dependencies. A graph neural network (GNN) is a deep learning model that extracts features from the topological information of a graph via information exchange between nodes, and it has shown promising results in a variety of node-, edge-, and graph-level tasks [32]. In fMRI activity, brain regions and the functional connectivity between them exhibit an explicit graph structure, so some researchers have incorporated GNNs into the processing model. Kawahara et al. [33] proposed BrainNetCNN for neurodevelopment prediction, using novel edge-to-edge, edge-to-node, and node-to-graph layers to capture the topological relationships between brain areas. Li et al. [34] further advanced this approach by introducing multi-order path information aggregation. Meng et al. [35] developed a visual stimulus category decoding model based on a graph convolutional neural network, which extracts functional correlation features between different brain regions. Saeidi et al. [36] employed a graph neural network to decode task-fMRI data, combining a graph convolution operation with various node embedding algorithms. However, previous GNN-based approaches have focused solely on either nodes or edges, neglecting the importance of interactions between them, which restricts the expressive power of the model.
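As a minimal illustration of information exchange between brain-region nodes, one graph-convolution layer in the widely used symmetric-normalization form can be written as below. The tiny 5-region graph and feature sizes are illustrative assumptions, not the specific layers of [33][34][35][36].

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy brain graph: 5 regions (nodes) with functional-connectivity edges.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
H = rng.standard_normal((5, 8))   # per-region feature vectors (e.g. from fMRI)
W = rng.standard_normal((8, 4))   # learnable layer weights

# One graph-convolution layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W),
# so each region's new features mix information from its neighbours.
A_hat = A + np.eye(5)                      # add self-loops
d = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
H_next = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
print(H_next.shape)  # (5, 4)
```

Stacking such layers lets information propagate over multi-hop paths between regions, which is the property the fMRI-decoding models above exploit.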

4. Multi–Scale Constraint

In modern convolutional neural networks, the feature extraction process is typically divided into several stages that extract multi-scale features from images [37]. High-resolution features contain fine details, while low-resolution features provide the coarse image structure. By adding a multi-scale constraint to the network, the model can effectively use information at different levels, which improves performance and robustness. For instance, SSD [38] improved object detection performance by predicting on feature maps of various scales simultaneously. DeepLabV3+ [39] boosted semantic segmentation accuracy by integrating local and global embeddings through atrous convolution and a multi-scale feature fusion module. In the field of natural image reconstruction from brain activity, Miyawaki et al. [5] reconstructed arbitrary binary contrast patterns by predicting separately on predefined multi-scale local image bases. Luo et al. [40] proposed DA-HLGN-MSFF, which combines hierarchical feature extraction with a multi-scale feature fusion block to improve reconstruction performance. Meng et al. [18] exploited a similar multi-scale encoder-decoder architecture to achieve promising natural image reconstruction.
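A multi-scale constraint can be sketched as a sum of reconstruction losses computed at several resolutions. The average-pooling downsampler and the scale factors below are illustrative choices; the cited works apply analogous constraints on network feature maps rather than raw pixels.

```python
import numpy as np

def downsample(img, factor):
    # Average-pool a square image by an integer factor.
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def multiscale_loss(recon, target, factors=(1, 2, 4)):
    # Sum mean-squared errors over several resolutions: coarse scales
    # constrain global structure, fine scales constrain detail.
    return sum(
        np.mean((downsample(recon, f) - downsample(target, f)) ** 2)
        for f in factors
    )

rng = np.random.default_rng(4)
target = rng.standard_normal((16, 16))
recon = target + 0.1 * rng.standard_normal((16, 16))
print(multiscale_loss(target, target))  # 0.0
```

Because the coarse-scale terms are insensitive to pixel-level noise, they penalize errors in global layout even when fine detail is already well matched, which is the intuition behind using such losses under limited training samples.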

References

  1. Du, B.; Cheng, X.; Duan, Y.; Ning, H. fMRI Brain Decoding and Its Applications in Brain–Computer Interface: A Survey. Brain Sci. 2022, 12, 228.
  2. Rakhimberdina, Z.; Jodelet, Q.; Liu, X.; Murata, T. Natural image reconstruction from fmri using deep learning: A survey. Front. Neurosci. 2021, 15, 795488.
  3. Horikawa, T.; Kamitani, Y. Generic decoding of seen and imagined objects using hierarchical visual features. Nat. Commun. 2017, 8, 15037.
  4. Kay, K.N.; Naselaris, T.; Prenger, R.J.; Gallant, J.L. Identifying natural images from human brain activity. Nature 2008, 452, 352–355.
  5. Miyawaki, Y.; Uchida, H.; Yamashita, O.; Sato, M.-a.; Morito, Y.; Tanabe, H.C.; Sadato, N.; Kamitani, Y. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron 2008, 60, 915–929.
  6. Schoenmakers, S.; Barth, M.; Heskes, T.; Van Gerven, M. Linear reconstruction of perceived images from human brain activity. NeuroImage 2013, 83, 951–961.
  7. Naselaris, T.; Prenger, R.J.; Kay, K.N.; Oliver, M.; Gallant, J.L. Bayesian reconstruction of natural images from human brain activity. Neuron 2009, 63, 902–915.
  8. Fujiwara, Y.; Miyawaki, Y.; Kamitani, Y. Modular encoding and decoding models derived from Bayesian canonical correlation analysis. Neural Comput. 2013, 25, 979–1005.
  9. Cowen, A.S.; Chun, M.M.; Kuhl, B.A. Neural portraits of perception: Reconstructing face images from evoked brain activity. Neuroimage 2014, 94, 12–22.
  10. Shen, G.; Horikawa, T.; Majima, K.; Kamitani, Y. Deep image reconstruction from human brain activity. PLoS Comput. Biol. 2019, 15, e1006633.
  11. Beliy, R.; Gaziv, G.; Hoogi, A.; Strappini, F.; Golan, T.; Irani, M. From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI. Adv. Neural Inf. Process. Syst. 2019, 32, 6517–6527.
  12. Gaziv, G.; Beliy, R.; Granot, N.; Hoogi, A.; Strappini, F.; Golan, T.; Irani, M. Self-supervised natural image reconstruction and large-scale semantic classification from brain activity. NeuroImage 2022, 254, 119121.
  13. Qiao, K.; Chen, J.; Wang, L.; Zhang, C.; Tong, L.; Yan, B. Reconstructing natural images from human fMRI by alternating encoding and decoding with shared autoencoder regularization. Biomed. Signal Process. Control. 2022, 73, 103397.
  14. Seeliger, K.; Güçlü, U.; Ambrogioni, L.; Güçlütürk, Y.; van Gerven, M.A. Generative adversarial networks for reconstructing natural images from brain activity. NeuroImage 2018, 181, 775–785.
  15. Shen, G.; Dwivedi, K.; Majima, K.; Horikawa, T.; Kamitani, Y. End-to-end deep image reconstruction from human brain activity. Front. Comput. Neurosci. 2019, 13, 21.
  16. Fang, T.; Qi, Y.; Pan, G. Reconstructing perceptive images from brain activity by shape-semantic gan. Adv. Neural Inf. Process. Syst. 2020, 33, 13038–13048.
  17. Ren, Z.; Li, J.; Xue, X.; Li, X.; Yang, F.; Jiao, Z.; Gao, X. Reconstructing seen image from brain activity by visually-guided cognitive representation and adversarial learning. NeuroImage 2021, 228, 117602.
  18. Meng, L.; Yang, C. Semantics-guided hierarchical feature encoding generative adversarial network for natural image reconstruction from brain activities. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; pp. 1–9.
  19. Mozafari, M.; Reddy, L.; VanRullen, R. Reconstructing natural scenes from fMRI patterns using bigbigan. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
  20. Ozcelik, F.; Choksi, B.; Mozafari, M.; Reddy, L.; VanRullen, R. Reconstruction of perceived images from fmri patterns and semantic brain exploration using instance-conditioned gans. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8.
  21. Lin, S.; Sprague, T.; Singh, A.K. Mind reader: Reconstructing complex images from brain activities. Adv. Neural Inf. Process. Syst. 2022, 35, 29624–29636.
  22. Chen, Z.; Qing, J.; Xiang, T.; Yue, W.L.; Zhou, J.H. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22710–22720.
  23. Ni, P.; Zhang, Y. Natural Image Reconstruction from fMRI Based on Self-supervised Representation Learning and Latent Diffusion Model. In Proceedings of the 15th International Conference on Digital Image Processing, Nanjing, China, 19–22 May 2023; pp. 1–9.
  24. Meng, L.; Yang, C. Dual-Guided Brain Diffusion Model: Natural Image Reconstruction from Human Visual Stimulus fMRI. Bioengineering 2023, 10, 1117.
  25. Lu, Y.; Du, C.; Zhou, Q.; Wang, D.; He, H. MindDiffuser: Controlled Image Reconstruction from Human Brain Activity with Semantic and Structural Diffusion. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5899–5908.
  26. Donahue, J.; Simonyan, K. Large scale adversarial representation learning. Adv. Neural Inf. Process. Syst. 2019, 32.
  27. Casanova, A.; Careil, M.; Verbeek, J.; Drozdzal, M.; Romero Soriano, A. Instance-conditioned gan. Adv. Neural Inf. Process. Syst. 2021, 34, 27517–27529.
  28. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 8748–8763.
  29. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8110–8119.
  30. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
  31. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
  32. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81.
  33. Kawahara, J.; Brown, C.J.; Miller, S.P.; Booth, B.G.; Chau, V.; Grunau, R.E.; Zwicker, J.G.; Hamarneh, G. BrainNetCNN: Convolutional neural networks for brain networks; towards predicting neurodevelopment. NeuroImage 2017, 146, 1038–1049.
  34. Li, Y.; Zhang, X.; Nie, J.; Zhang, G.; Fang, R.; Xu, X.; Wu, Z.; Hu, D.; Wang, L.; Zhang, H.; et al. Brain connectivity based graph convolutional networks and its application to infant age prediction. IEEE Trans. Med. Imaging 2022, 41, 2764–2776.
  35. Meng, L.; Ge, K. Decoding Visual fMRI Stimuli from Human Brain Based on Graph Convolutional Neural Network. Brain Sci. 2022, 12, 1394.
  36. Saeidi, M.; Karwowski, W.; Farahani, F.V.; Fiok, K.; Hancock, P.; Sawyer, B.D.; Christov-Moore, L.; Douglas, P.K. Decoding Task-Based fMRI Data with Graph Neural Networks, Considering Individual Differences. Brain Sci. 2022, 12, 1094.
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  38. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
  39. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder–decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
  40. Luo, J.; Cui, W.; Liu, J.; Li, Y.; Guo, Y.; Xu, S.; Wang, L. Visual Image Decoding of Brain Activities using a Dual Attention Hierarchical Latent Generative Network with Multi-Scale Feature Fusion. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 761–773.