Image Fusion Methods: History

Image fusion is the generation of an informative image that contains complementary information from the original sensor images, such as texture details and attentional targets. Existing methods have designed a variety of feature extraction algorithms and fusion strategies to achieve image fusion. 

  • image fusion
  • shared feature
  • differential feature

1. Introduction

In many monitoring fields, a single sensor rarely captures enough information to meet the requirements of the monitoring task [1]. Sensors operating in different wavebands (for example, infrared and visible light) offer clear complementary advantages when observing the same scene. However, multiple sensors bring data storage challenges, and the image information provided by any single sensor is incomplete. Taking infrared and visible light images as an example, infrared sensors capture the radiation characteristics of foreground targets via thermal radiation imaging, but infrared images often lack structural and texture information. Visible light sensors describe the background details of the scene via light reflection, but they are strongly affected by changes in illumination and weather conditions [2]. Image fusion has therefore become a popular research field [3]. Its goal is to merge the input images into a single image that retains the information of all inputs and can even highlight more salient information than any individual input.
According to the application scenario, image fusion is mainly divided into multi-focus [4], multi-spectral [5], and medical image fusion [6]. The most studied multi-spectral case is the fusion of infrared and visible light images, which also includes the fusion of hyperspectral images in the field of remote sensing. A fused multi-focus image depicts the background and foreground sharply at the same time. A fused multi-spectral image contains imaging information from multiple spectra. A fused image of magnetic resonance imaging (MRI) and computed tomography (CT) shows soft tissue and bone clearly at the same time.
The two core tasks of image fusion are feature extraction and feature fusion. The original images are transformed into a feature domain, fusion rules are applied to combine the features, and the fused features are then transformed back to the original pixel space to obtain the fused image. For feature extraction, pioneering image fusion works fall into two major categories: methods built on hand-crafted transformations and methods based on feature representation learning. Fusion strategies likewise come in two types: manually designed rules and rules learned via global optimization.
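As a hedged illustration of this transform–fuse–reconstruct structure, the sketch below wires the three steps together through placeholder callables; the function names are purely illustrative and do not correspond to any specific cited method.

```python
import numpy as np


def fuse_images(img_a: np.ndarray, img_b: np.ndarray,
                transform, fuse_rule, inverse_transform) -> np.ndarray:
    """Generic fusion pipeline: pixel space -> feature space -> fuse -> pixel space.

    `transform`, `fuse_rule`, and `inverse_transform` are placeholders for any
    concrete choice (wavelets, learned encoders, ...); the names are hypothetical.
    """
    feat_a = transform(img_a)                # feature extraction for source A
    feat_b = transform(img_b)                # feature extraction for source B
    fused_feat = fuse_rule(feat_a, feat_b)   # fusion strategy in the feature space
    return inverse_transform(fused_feat)     # reconstruct the fused image
```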
Hand-crafted feature extraction methods and fusion rules at all levels are the most intensively studied area of image fusion. Since such methods require no training and are completely unsupervised, they generalize well. The main feature transformation methods include the discrete wavelet transform (DWT) [7], shearlet transform [8], nonsubsampled contourlet transform [9], low-rank representation (LRR) [10], and bilateral filter [11]. Manually designed fusion rules mainly include the maximum value, the average value, and the nuclear norm; the usual recipe is to fuse the base parts with the average rule and the detail parts with the maximum rule [12]. In the representation learning domain, typical methods are based on sparse representation (SP) [8,13,14]. SP learns a single-layer shared over-complete dictionary from the input images, represents each input sparsely over that dictionary, fuses the sparse coefficients, and reconstructs the fused image. Deep learning offers stronger representation learning than SP and has become a popular research direction in image fusion [2,4,15,16,17,18]. Deep learning based methods first train a shared encoder and decoder on a large number of images, use the encoder to extract features from each input image, fuse the resulting feature maps with a fusion rule, and finally use the decoder to reconstruct the fused image [15,19].
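A minimal sketch of the classic "average the base band, take the maximum of the detail bands" recipe is shown below, using PyWavelets for a single-level DWT. The library choice, wavelet name, and single decomposition level are assumptions for illustration, not prescriptions from the cited works.

```python
import numpy as np
import pywt  # PyWavelets


def dwt_fuse(img_a: np.ndarray, img_b: np.ndarray, wavelet: str = "db2") -> np.ndarray:
    """Single-level DWT fusion: average rule for the approximation (base) band,
    max-absolute rule for the detail bands."""
    cA_a, (cH_a, cV_a, cD_a) = pywt.dwt2(img_a.astype(np.float64), wavelet)
    cA_b, (cH_b, cV_b, cD_b) = pywt.dwt2(img_b.astype(np.float64), wavelet)

    cA_f = 0.5 * (cA_a + cA_b)  # base part: average rule
    # detail parts: keep the coefficient with the larger magnitude, band by band
    details = tuple(np.where(np.abs(a) >= np.abs(b), a, b)
                    for a, b in zip((cH_a, cV_a, cD_a), (cH_b, cV_b, cD_b)))

    return pywt.idwt2((cA_f, details), wavelet)
```

Multi-scale variants simply repeat the decomposition over several levels and apply the same pair of rules per level.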
For the fusion of representation features, manually designed fusion rules lack interpretability. Some works instead learn fusion rules by defining loss functions: global optimization methods such as particle swarm optimization [20], the grasshopper optimization algorithm [21], and membrane computing [22] have been used for fusion rule learning. Another class of methods learns the fusion decision map via deep learning [23,24].
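As a heavily simplified, hedged illustration of learning a fusion rule by optimizing a no-reference objective: the toy sketch below tunes a single weighted-average weight by random search against an entropy criterion. The entropy objective, the random search, and the assumption of 8-bit intensity inputs are all stand-ins; the cited works use swarm or evolutionary optimizers over much richer rule parameterizations.

```python
import numpy as np


def fused_entropy(img_a: np.ndarray, img_b: np.ndarray, w: float) -> float:
    """Shannon entropy of a weighted-average fusion (toy no-reference objective).
    Assumes 8-bit intensity inputs."""
    fused = w * img_a + (1.0 - w) * img_b
    hist, _ = np.histogram(fused, bins=256, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-np.sum(p * np.log2(p)))


def learn_fusion_weight(img_a: np.ndarray, img_b: np.ndarray,
                        n_trials: int = 50, seed: int = 0) -> float:
    """Random search over one fusion weight; real methods optimize many rule
    parameters with particle swarm or grasshopper optimizers."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(0.0, 1.0, n_trials)
    return float(max(candidates, key=lambda w: fused_entropy(img_a, img_b, w)))
```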
In summary, the core of image fusion is to transform the image from the original pixel space into a feature representation space in which fusion is easy; after fusing in that space, the fused image is obtained via the inverse transformation. Judging from current research trends, it is difficult to develop genuinely new methods in the traditional field of multi-scale transformation, and deep learning based methods are the current research hotspot. The general deep learning recipe is to implement feature extraction with an encoder and, after fusing the features, to reconstruct the fused image with a decoder. The core problems of these methods are weak interpretability and the lack of criteria for judging the quality of the extracted features.

2. Model-Based Feature Extraction Method

Performing pixel-level transformations on the input source images and extracting multi-scale features of the original images were hot spots in early research. The features extracted by such methods are highly interpretable: the input images are decomposed into low-frequency (base) parts and high-frequency (detail) parts, where the low-frequency part carries the basic semantic information of the scene and the high-frequency part carries the target and texture information. The nonsubsampled contourlet transform (NSCT) [9] is a pioneering work used for image fusion. To combine the advantages of multi-scale analysis and deep learning, Wang et al. [25] proposed an image fusion method based on a convolutional neural network and NSCT. MDLatLRR [26] is a baseline method in this field; it first performs a multi-level low-rank decomposition of the input images and then fuses the base and detail parts separately. Li et al. [27] applied norm optimization to the fused images of MDLatLRR to obtain more salient fused images. The difference of Gaussians has also been used for image fusion, and it is simple, efficient, and versatile [28].
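A minimal sketch of a Gaussian-based base/detail decomposition and fusion is given below; the sigma value and the two fusion rules are illustrative assumptions rather than the exact schemes of the cited works.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def base_detail_fuse(img_a: np.ndarray, img_b: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    """Decompose each source into a low-frequency base (Gaussian blur) and a
    high-frequency detail (residual), fuse the parts separately, then recombine."""
    base_a = gaussian_filter(img_a.astype(np.float64), sigma)
    base_b = gaussian_filter(img_b.astype(np.float64), sigma)
    detail_a, detail_b = img_a - base_a, img_b - base_b

    fused_base = 0.5 * (base_a + base_b)                          # average rule for base parts
    fused_detail = np.where(np.abs(detail_a) >= np.abs(detail_b),
                            detail_a, detail_b)                   # max-absolute rule for details
    return fused_base + fused_detail
```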

3. Generative-Based Methods

GAN-based methods use generative neural network models to generate fused images conditioned on the multi-source input images [29]. DDcGAN [30] drives a deep neural network to learn complementary features and reconstruct the fused image based on a defined loss function. GAN-FM [31] introduces a full-scale skip-connected generator and Markovian discriminators, and Fusion-UDCGAN [32] adopts a U-type densely connected generative adversarial network. AT-GAN [33] proposes a generative adversarial network with intensity attention modules and semantic transition modules. Depending on how the loss function is defined, this type of method can also provide additional image enhancement effects.
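A hedged sketch of the kind of content loss such generators are typically trained with is shown below: an intensity-fidelity term plus a texture-fidelity term, to which a discriminator (adversarial) term would be added. The weighting, the Laplacian texture proxy, and the specific terms are assumptions, not the published DDcGAN or GAN-FM losses.

```python
import torch
import torch.nn.functional as F


def laplacian(x: torch.Tensor) -> torch.Tensor:
    """3x3 Laplacian response as a simple texture/gradient proxy.
    Expects single-channel tensors of shape (N, 1, H, W)."""
    kernel = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]], device=x.device, dtype=x.dtype).view(1, 1, 3, 3)
    return F.conv2d(x, kernel, padding=1)


def content_loss(fused: torch.Tensor, ir: torch.Tensor, vis: torch.Tensor,
                 alpha: float = 0.5, beta: float = 1.0) -> torch.Tensor:
    """Intensity fidelity plus texture fidelity; the adversarial term from the
    discriminator(s) would be added on top of this in a full GAN-based method."""
    intensity_term = F.mse_loss(fused, alpha * ir + (1.0 - alpha) * vis)
    texture_term = F.l1_loss(laplacian(fused).abs(),
                             torch.maximum(laplacian(ir).abs(), laplacian(vis).abs()))
    return intensity_term + beta * texture_term
```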

4. Task-Driven Approach

Fusing images to improve target segmentation accuracy in low-light environments is one of the current research hotspots. These methods aim for fused images with higher brightness and more prominent target contours. SCFusion [24] enhances target saliency via a mask of the target area. SGFusion [34] achieves saliency guidance during fusion through a multi-task target segmentation branch. TIM [35] proposes a constrained strategy that incorporates information from downstream tasks to guide the unsupervised learning of image fusion. SOSMaskFuse [36] also uses a target segmentation mask to achieve target enhancement. PIAFusion [37] realizes image fusion under low-light conditions. Such methods generally enhance the original multi-source images, and the fused image has better gradients and visual saliency; however, its consistency with the original images is poor.
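A minimal sketch of the mask-guided idea follows: a saliency-weighted reconstruction loss that pulls the fused image toward the infrared intensity inside the target mask and toward the visible details outside it. The weighting scheme is an assumption for illustration, not the exact loss of SCFusion or SOSMaskFuse.

```python
import torch
import torch.nn.functional as F


def mask_guided_loss(fused: torch.Tensor, ir: torch.Tensor, vis: torch.Tensor,
                     target_mask: torch.Tensor) -> torch.Tensor:
    """Inside the segmentation mask, match the infrared target intensity;
    outside it, match the visible background details.
    `target_mask` is 1 on salient targets and 0 elsewhere."""
    fg = target_mask
    bg = 1.0 - target_mask
    loss_fg = F.l1_loss(fused * fg, ir * fg)   # foreground / target term
    loss_bg = F.l1_loss(fused * bg, vis * bg)  # background / detail term
    return loss_fg + loss_bg
```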

5. Autoencoder-Based Methods

Fusion methods based on an autoencoder (encoder plus decoder) assume that neurons respond with stronger amplitude to salient areas. Fu et al. [38] proposed a dual-branch network encoder to learn richer features. DeepFuse [39] performs feature extraction on multiple channels and is used for multi-exposure image fusion. DenseFuse [40] introduces a dense block in the encoder to extract multi-scale features. FusionDN [41] also uses a densely connected network and defines a multi-task loss function. NestFuse [42] introduces a nest connection architecture together with a spatial attention mechanism to enhance salient features. RFN-Nest [43] proposes a residual fusion network that better retains detailed features. PSFusion [44] presents a practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity constraints, and its fused images have good visual appeal. CDDFuse [45], inspired by multi-scale decomposition, uses neural networks to decompose images into base and detail parts, which are fused separately before the fused image is reconstructed; this method requires two training stages.
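A hedged sketch of the shared encoder/decoder pattern with an element-wise fusion rule applied between feature maps is given below. The layer sizes and the max rule are illustrative assumptions; DenseFuse-style methods use dense blocks and richer fusion strategies.

```python
import torch
import torch.nn as nn


class TinyFusionAE(nn.Module):
    """Shared encoder and decoder; fusion happens between the two feature maps."""

    def __init__(self, channels: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Plain reconstruction path, used during autoencoder training.
        return self.decoder(self.encoder(x))

    def fuse(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # Fusion path, used at inference: encode both sources, fuse features, decode.
        feat_ir, feat_vis = self.encoder(ir), self.encoder(vis)
        fused_feat = torch.maximum(feat_ir, feat_vis)  # element-wise max as the fusion rule
        return self.decoder(fused_feat)
```

In methods such as DenseFuse [40], the encoder/decoder pair is trained purely for image reconstruction on single images, and the fusion rule is only applied between the encoded feature maps at inference time.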

This entry is adapted from the peer-reviewed paper 10.3390/e26010057
