Infrared and visible image fusion (IVIF) aims to render fused images that maintain the merits of both modalities.
1. Introduction
Image fusion is a basic and popular topic in image processing that seeks to generate informative fused images by integrating essential information from multiple source images. Infrared and visible image fusion (IVIF) is one of the important sub-categories of image fusion
[1]. IVIF focuses on preserving detailed texture and thermal information in the input images
[2]. The fused images can mitigate the susceptibility of visible images to illumination and other environmental conditions, while also compensating for the lack of texture in infrared images.
Numerous methods have been proposed to tackle the challenge of IVIF
[3][4][5][6][7][8][9][10][11]. These methods can be mainly categorized into deep learning-based approaches and conventional methods. Deep learning methods are becoming increasingly popular in the fusion task due to their ability to extract high-level semantic features
[5][7][10][12], but there is still room for improvement in preserving complex and irregular edges within images. Since infrared and visible images capture the same scene, they inherently share co-occurring statistical information, such as background and large-scale features. Transformer-based deep learning frameworks excel at extracting global features from their inputs, which makes them well suited for fusing the main features of infrared and visible images.
Conventional methods offer better interpretability, and their rich prior knowledge enables the design of fusion techniques that effectively preserve high- and low-frequency information; however, they may suffer from high design complexity. Conventional fusion methods can be generally divided into several categories according to their adopted theories
[13], i.e., multi-scale transformation (MST), saliency-based methods, sparse representation, subspace methods, etc. One of the most active and well-established fields in image fusion is MST. It decomposes input images into a base layer containing the low-frequency main features and detail layers containing high-frequency texture and edges. Some studies have demonstrated that MST-based methods are aligned with human visual characteristics
[14][15], and this property enables fused images to have an appropriate visual effect. Within MST-based fusion schemes, many methods employ weighted averaging or maximum value rules. Simple weighted averaging may diminish the contrast of salient regions, while the pixel-wise application of the maximum value strategy may not adequately preserve the continuity of edges and textures.
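To make these two rules concrete, the following minimal Python sketch performs a simple two-scale decomposition (a Gaussian-smoothed base layer plus a residual detail layer) and fuses the base layers by weighted averaging and the detail layers by a pixel-wise maximum-absolute-value rule. The decomposition, filter size, and weights are illustrative assumptions rather than any specific published method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def two_scale_fusion(ir, vis, sigma=5.0, base_weight=0.5):
    """Toy MST-style fusion: weighted-average base layers, max-abs detail layers.

    ir, vis: float images in [0, 1] of the same shape (illustrative inputs).
    sigma, base_weight: assumed hyper-parameters, not taken from any specific paper.
    """
    # Base layers: low-frequency main structure obtained by Gaussian smoothing.
    base_ir, base_vis = gaussian_filter(ir, sigma), gaussian_filter(vis, sigma)
    # Detail layers: high-frequency texture and edges (residuals).
    det_ir, det_vis = ir - base_ir, vis - base_vis

    # Rule 1: simple weighted averaging of the base layers
    # (may reduce the contrast of salient regions).
    fused_base = base_weight * base_ir + (1.0 - base_weight) * base_vis

    # Rule 2: pixel-wise maximum-absolute-value selection of the detail layers
    # (may break the continuity of edges and textures).
    fused_detail = np.where(np.abs(det_ir) >= np.abs(det_vis), det_ir, det_vis)

    return np.clip(fused_base + fused_detail, 0.0, 1.0)
```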
2. Multi-Scale Transformation-Based Fusion Methods
Multi-scale transformation (MST) encompasses many methods, such as the wavelet transform, the contourlet transform, the nonsubsampled contourlet transform (NSCT), and the nonsubsampled shearlet transform (NSST). Various MST-based methods have been applied to image fusion
[14][16]. NSCT was proposed by Da Cunha et al.
[17], and is based on the contourlet transform
[18]. NSCT has been widely applied in infrared and visible image fusion. For instance, the entropy of the squared coefficients and the sum of the modified Laplacian have been utilized as fusion measures in the frequency domain
[16]. Easley et al. proposed NSST
[19], which is realized by a nonsubsampled Laplacian pyramid and shearing filters. Zhang et al.
[9] proposed an image fusion method with global–regional–local rules to overcome the problem of misinterpreting the source images. The statistical dependencies of the source images in the high-frequency subbands are captured by the G-CHMM, R-CHMM, and L-CHMM models. The high-pass subbands are fused by the global–regional–local CHMM design together with choose-max rules based on a local gradient measure. Finally, the fused image is reconstructed by the inverse NSST. Liu X et al.
[3] proposed a multi-modality medical image fusion algorithm that combines a moving frame-based decomposition framework (MFDF) with the NSST. The MFDF is applied to decompose the source images into texture components and approximation components. The maximum selection fusion rule is employed to fuse the texture components, aiming to transfer salient gradient information to the fused image. The approximation components are merged using the NSST. Finally, a component synthesis process is adopted to produce the fused image. Liu et al. proposed an image fusion algorithm based on the NSST and modified spatial frequency (MSF)
[4]. It compares the high-frequency and low-frequency subbands of the source images and selects the coefficients with the greater MSF for fusion. Miao et al.
[20] proposed an image fusion algorithm based on the NSST. The algorithm employs an average fusion strategy for low-frequency information and a novel method for high-frequency information.
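Several of the rules above share the same pattern: an activity measure (such as spatial frequency) is computed for corresponding subband coefficients, and the more active coefficient is kept. The sketch below illustrates this choose-max idea with a plain local spatial-frequency measure; the window size and the measure itself are simplifying assumptions and do not reproduce the exact modified spatial frequency of [4].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_spatial_frequency(band, win=7):
    """Local spatial frequency: RMS of horizontal and vertical differences in a window."""
    band = np.asarray(band, dtype=float)
    rf = np.zeros_like(band)
    cf = np.zeros_like(band)
    rf[:, 1:] = np.diff(band, axis=1)  # row (horizontal) first differences
    cf[1:, :] = np.diff(band, axis=0)  # column (vertical) first differences
    return np.sqrt(uniform_filter(rf**2 + cf**2, size=win))

def fuse_subband_by_activity(band_ir, band_vis, win=7):
    """Choose-max rule: keep the coefficient whose local activity measure is larger."""
    act_ir = local_spatial_frequency(band_ir, win)
    act_vis = local_spatial_frequency(band_vis, win)
    return np.where(act_ir >= act_vis, band_ir, band_vis)
```

Measuring activity over a window rather than at a single pixel is what distinguishes such rules from a purely pixel-wise maximum selection.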
Many MST-based fusion methods utilize weighted averaging or maximum value strategies. However, simple weighted averaging may reduce the contrast of salient regions, and the maximum value strategy is applied pixel-wise, which may not preserve the continuity of edges and textures. To tackle these limitations, researchers propose an edge-consistency fusion method. This method incorporates activity rules to preserve the brightness of salient edges and achieves texture continuity and integrity through consistency verification.
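Consistency verification is commonly realized by smoothing the binary decision map so that isolated pixel-wise choices follow their neighborhood. The sketch below shows one generic variant based on a majority vote over a local window; it only illustrates the general idea and is not the specific edge-consistency method discussed here.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def consistency_verification(decision, win=5):
    """Majority-vote smoothing of a binary decision map (1 = take IR, 0 = take VIS)."""
    vote = uniform_filter(decision.astype(float), size=win)  # fraction of IR votes per window
    return (vote >= 0.5).astype(np.uint8)

def fuse_with_consistency(band_ir, band_vis, activity_ir, activity_vis, win=5):
    """Pixel-wise choose-max fusion followed by consistency verification of the decision map."""
    decision = (activity_ir >= activity_vis).astype(np.uint8)
    decision = consistency_verification(decision, win)
    return np.where(decision == 1, band_ir, band_vis)
```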
3. Deep Learning-Based Fusion Methods
The convolutional neural network (CNN) is a commonly used deep learning network model. In STDFusionNet
[2], a salient target mask is employed to strongly inject contrast information from the infrared image into the fused image. SeAFusion
[7] is a semantic-aware framework for fusing infrared and visible images that achieves outstanding performance in both image fusion and high-level vision tasks. These methods leverage CNNs to extract features and perform fusion operations, enabling the effective integration of information from different modalities or sources. FusionGAN
[6] is a groundbreaking method that introduces the generative adversarial network (GAN) to the field of image fusion. It establishes an adversarial game between the fused image and the visible image, encouraging the fused image to acquire richer texture and structural details. Following FusionGAN, numerous GAN-inspired fusion methods have been proposed, such as TarDal
[5]. Additionally, a wide range of fusion methods based on autoencoders (AE) have been proposed. These methods commonly employ an AE to extract features from the source images and to reconstruct the image. AEs can capture relevant information and reconstruct images effectively, making them a popular choice in fusion techniques. DenseFuse
[8] uses the structural strength of DenseNet
[21], resulting in an effective fusion outcome. DIDFuse
[10] is also an AE-based image fusion method; it replaces the conventional transforms and inverse transforms with encoders and decoders.
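These AE-based methods share a common encode-fuse-decode pipeline: an encoder extracts features from each source image, the features are merged by a fusion rule, and a decoder reconstructs the fused image. The following PyTorch sketch shows that pipeline with a deliberately tiny convolutional encoder and decoder and element-wise addition as the fusion rule; the architecture and the fusion rule are illustrative assumptions, not the networks used by DenseFuse or DIDFuse.

```python
import torch
import torch.nn as nn

class TinyFusionAE(nn.Module):
    """Toy encode-fuse-decode pipeline shared by AE-based fusion methods."""

    def __init__(self, channels=16):
        super().__init__()
        # Shared encoder: maps a single-channel image to a feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        # Decoder: reconstructs an image from the (fused) features.
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, ir, vis):
        feat_ir, feat_vis = self.encoder(ir), self.encoder(vis)
        # Fusion rule: simple element-wise addition of the feature maps.
        fused_feat = feat_ir + feat_vis
        return self.decoder(fused_feat)

# Usage: fuse two single-channel images of the same size.
model = TinyFusionAE()
ir = torch.rand(1, 1, 64, 64)   # dummy infrared image
vis = torch.rand(1, 1, 64, 64)  # dummy visible image
fused = model(ir, vis)          # (1, 1, 64, 64) fused output
```

In practice such models are trained with reconstruction and similarity losses, and the hand-crafted addition rule is often replaced by learned or attention-based fusion.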
Deep learning methods excel at extracting high-level semantic features, and AE-based approaches can capture global information from images, which makes them effective at extracting the features shared between infrared and visible images, such as background and large-scale features. This advantage makes them well suited for fusing the main features of images. Therefore, researchers design a correlation-driven AE-based method for fusing the main information of the images.