Infrared and visible image fusion combines the thermal radiation information and the detailed texture information of the two source images into one informative fused image. Recently, deep learning methods have been widely applied to this task; however, these methods usually fuse multiple extracted features with the same fusion strategy, which ignores the differences in the representation of these features and results in the loss of information during the fusion process. Infrared and visible image fusion techniques can be divided into two categories: traditional methods and deep learning-based methods. In the past decades, traditional methods have been proposed for the fusion of pixel-level or fixed features.
1. Introduction
Image fusion refers to combining the images obtained by different types of sensors to generate a robust or informative image for subsequent processing and decision-making
[1,2]. The technique is important for the fields of target detection
[3], image enhancement
[4], video surveillance
[5], remote sensing
[6,7,8,9], defogging
[10], and so on. Due to differences in the imaging mechanisms of the sensors, the scene information captured by infrared and visible images differs greatly in contrast and texture. Visible images are mainly formed by reflection imaging, which is strongly dependent on lighting conditions. They usually have high spatial resolution, rich color, and texture details, which offer a good source of perception under favorable lighting conditions. However, they are vulnerable to insufficient light or bad weather conditions. Infrared images reflect the thermal radiation of objects and are almost unaffected by weather and light, but they usually have low spatial resolution and lack detailed texture information. Therefore, the fusion of the two images provides more comprehensive information than a single image, which is very useful for subsequent high-level applications
[11,12].
Currently, infrared and visible image fusion techniques can be divided into two categories: traditional methods and deep learning-based methods. In the past decades, traditional methods have been proposed for the fusion of pixel-level or fixed features. Traditional image fusion methods mainly include multi-scale transform (MST)
[13,14], sparse representation (SR)
[15,16], salience
[17,18] and low rank representation (LRR)
[19,20]. The MST methods design appropriate fusion strategies to fuse the sub-layers obtained with some transform operators, and the result is obtained through the inverse transformation. As a representative of the MST methods, Vanmali et al.
[21] employed the Laplacian pyramid as the transform operator and generated a weight map, which was used to fuse the corresponding layers by considering local entropy, contrast, and brightness, so that good results can be achieved under poor lighting conditions. Yan et al.
[22] constructed an edge-preserving filter for image decomposition, which can not only preserve edges but also attenuate the influence of the infrared background, ensuring that the fused image contains rich background information and salient features. However, the MST methods depend strongly on the choice of transformation, and inappropriate fusion rules can introduce artifacts into the results
[23]. Compared with MST, the goal of SR is to learn an over-complete dictionary to sparsely represent the source image, and the fused image can be reconstructed from the fused sparse representation coefficients. Bin et al.
[24] adopted a fixed over-complete discrete cosine transform dictionary to represent infrared and visible images. Veshki et al.
[25] used a sparse representation with identical support and Pearson correlation constraints, without causing strength decay or loss of important information. As target-oriented fusion methods, salience-based methods can maintain the integrity of the significant target area and improve the visual quality of the fused images. Ma et al.
[26] employed the rolling guidance filter and the Gaussian filter as multi-scale decomposition operators and used a visual saliency map to make the fusion result contain more visual details. Liu et al.
[27] proposed a method combining salient object extraction and low-light region enhancement to improve the overall brightness of the image and make the results more suitable for human perception. As an efficient representation method, LRR decomposes the images with low-rank representation and then fuses the sub-layers with appropriate rules. Gao et al.
[22] proposed the combination of latent low-rank representation (LatLRR) and the rolling guidance image filter (RGIF) to extract sub-layers from the images, which improved the fusion quality in terms of image contrast, sharpness, and richness of detail. Although traditional methods have achieved good performance, they still have three drawbacks: (1) the quality of the handcrafted features determines the effect of the fusion; (2) some traditional methods, such as SR, are very time-consuming; (3) specific fusion strategies need to be designed for different image datasets.
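To make the MST pipeline described above concrete, the following is a minimal Python sketch of Laplacian-pyramid fusion using OpenCV, assuming two pre-registered grayscale source images; the max-absolute rule for the detail layers, the averaging rule for the base layer, the number of levels, and the file names are illustrative assumptions, not the exact rules of the cited methods.

```python
# Minimal sketch of a multi-scale transform (MST) fusion pipeline.
import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    gauss = [img.astype(np.float32)]
    for _ in range(levels):
        gauss.append(cv2.pyrDown(gauss[-1]))
    lap = []
    for i in range(levels):
        up = cv2.pyrUp(gauss[i + 1], dstsize=(gauss[i].shape[1], gauss[i].shape[0]))
        lap.append(gauss[i] - up)           # detail (band-pass) layers
    lap.append(gauss[-1])                   # coarsest Gaussian level as the base layer
    return lap

def fuse_pyramids(lap_ir, lap_vis):
    fused = []
    for a, b in zip(lap_ir[:-1], lap_vis[:-1]):
        fused.append(np.where(np.abs(a) >= np.abs(b), a, b))   # max-abs rule for detail layers
    fused.append(0.5 * (lap_ir[-1] + lap_vis[-1]))             # average rule for the base layer
    return fused

def reconstruct(lap):
    img = lap[-1]
    for layer in reversed(lap[:-1]):
        img = cv2.pyrUp(img, dstsize=(layer.shape[1], layer.shape[0])) + layer
    return np.clip(img, 0, 255).astype(np.uint8)

# ir.png and vis.png are hypothetical, pre-registered grayscale images of the same size
ir = cv2.imread("ir.png", cv2.IMREAD_GRAYSCALE)
vis = cv2.imread("vis.png", cv2.IMREAD_GRAYSCALE)
fused = reconstruct(fuse_pyramids(laplacian_pyramid(ir), laplacian_pyramid(vis)))
```

The max-absolute rule keeps the stronger detail response from either modality at each pixel, while averaging the base layers balances the overall brightness of the two sources.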
Recently, due to the advantages of strong adaptability, fault tolerance, and anti-noise capabilities, deep learning (DL) has been widely used in image fusion and has achieved better performance than traditional methods. According to differences in network structure and output, DL-based fusion methods can be divided into two categories: non-end-to-end learning and end-to-end learning. For the former, the neural networks only extract deep features or output weights that are then used by the fusion strategy. Liu et al.
[28,29] obtained the activity level measurement of the images through a Siamese convolutional network and combined it with the Laplacian pyramid to realize efficient fusion of infrared and visible images. Jian et al.
[30] proposed a fusion framework based on a decomposition network and salience analysis (DDNSA). They combined the saliency map and bidirectional edge intensity to fuse the structural and texture features, respectively, so that the fusion result retains more details from the source images. In contrast, end-to-end methods directly produce the fusion results through the network without sophisticated and time-consuming operations. Xu et al.
[31] proposed FusionDN, which employs a densely connected network to extract features effectively and can be applied to multiple fusion tasks with the same weights. Ma et al.
[32] proposed a new end-to-end model, termed DDcGAN, which established an adversarial game between a generator and two discriminators for fusing infrared and visible images at different resolutions. In the past two years, many methods have begun to adopt the feature extraction-fusion-image reconstruction framework, which can separately maximize the capabilities of feature extraction and feature fusion and ultimately improve the quality of fusion. Yang et al.
[33] proposed a method based on a dual-channel information cross fusion block (DICFB) for cross extraction and preliminary fusion of multi-scale features, with the final image enhanced by saliency information. By considering the illumination factor in the feature extraction stage, Tang et al.
[34] proposed a progressive image fusion network termed PIAFusion, which can adaptively maintain the intensity distribution of significant targets and retain the texture information in the background.
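As a sketch of the feature extraction-fusion-image reconstruction framework mentioned above, the following PyTorch skeleton shows how the three stages compose; the layer widths, the concatenation-based fusion block, and the module names are placeholders rather than any published network.

```python
# Skeletal illustration of the extraction-fusion-reconstruction framework.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.net(x)                      # deep features of one modality

class FusionBlock(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Conv2d(2 * ch, ch, 3, padding=1)
    def forward(self, f_ir, f_vis):
        return self.net(torch.cat([f_ir, f_vis], dim=1))   # fused features

class Decoder(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())
    def forward(self, f):
        return self.net(f)                      # reconstructed fused image

encoder, fusion, decoder = Encoder(), FusionBlock(), Decoder()
ir, vis = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256)
fused = decoder(fusion(encoder(ir), encoder(vis)))          # (1, 1, 256, 256)
```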
Although the above-mentioned methods have achieved competitive performance, they still have the following disadvantages:
- The design of the multi-feature fusion strategy is simple and does not make full use of the feature information.
- CNN-based methods only consider local features in the fusion process without modeling long-range dependencies, which loses global context that is meaningful for the fusion results.
- End-to-end methods lack an explicit feature extraction step, resulting in poor fusion results.
2. Auto-Encoder-Based Methods
In CNN-based fusion methods, the last layer is often used to output features or to produce the fusion result, which loses the meaningful information contained in the middle layers. In order to solve this problem, Li et al.
[35] proposed DenseFuse for infrared and visible image fusion, which is composed of an encoder network, a fusion strategy, and a decoder network. The encoder network, comprised of convolutional layers and dense blocks, is used to extract deep features, and the decoder network is applied to reconstruct the image. In the fusion phase, the addition strategy or the
l1-norm strategy is adopted to fuse the deep features, which can preserve more details from the source images.
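For concreteness, the following is a minimal sketch of the two fusion strategies described for DenseFuse, assuming encoder feature maps phi_ir and phi_vis of shape (C, H, W); the 3 × 3 averaging window used to smooth the activity maps is an assumption.

```python
# Sketch of addition and l1-norm feature-fusion strategies.
import numpy as np
from scipy.ndimage import uniform_filter

def addition_fusion(phi_ir, phi_vis):
    # addition strategy: element-wise sum of the two deep feature maps
    return phi_ir + phi_vis

def l1_norm_fusion(phi_ir, phi_vis, win=3):
    # activity level: l1-norm over the channel axis, smoothed by local averaging
    act_ir = uniform_filter(np.abs(phi_ir).sum(axis=0), size=win)
    act_vis = uniform_filter(np.abs(phi_vis).sum(axis=0), size=win)
    w_ir = act_ir / (act_ir + act_vis + 1e-8)            # soft per-pixel weight
    return w_ir[None] * phi_ir + (1.0 - w_ir)[None] * phi_vis

# hypothetical encoder outputs
phi_ir, phi_vis = np.random.rand(64, 128, 128), np.random.rand(64, 128, 128)
fused = l1_norm_fusion(phi_ir, phi_vis)                  # shape (64, 128, 128)
```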
To improve DenseFuse, Li et al.
[36] proposed NestFuse, in which the encoder network is changed to a multi-scale network, and the nest connection architecture is selected as the decoding network. Due to the designed spatial/channel attention fusion strategies, the model can better fuse the background details and salient regions of the image. However, this handcrafted strategy cannot effectively utilize multi-modal features. Therefore, Li et al.
[37] further proposed RFN-Nest, adopting a residual fusion network to learn the fusion weights. Although these methods achieve good results to some extent, they adopt the same fusion strategy for multi-modal features, which ignores the differences between the features of different modalities. In order to improve the fusion quality, the focal transformer model is adopted in this work, and a self-adaptive fusion strategy is designed for multi-modal features.
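A learnable fusion module of the kind referred to above can be sketched as follows; this PyTorch snippet is only a minimal illustration of the idea of learning fusion weights with a residual connection, not the published RFN-Nest architecture or the self-adaptive strategy designed in this work.

```python
# Minimal learnable fusion block with a residual connection.
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, f_ir, f_vis):
        # learned correction added to a simple additive fusion of the inputs
        return (f_ir + f_vis) + self.body(torch.cat([f_ir, f_vis], dim=1))

f_ir, f_vis = torch.rand(1, 64, 64, 64), torch.rand(1, 64, 64, 64)
fused = ResidualFusion()(f_ir, f_vis)    # same shape as the inputs
```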
3. Transformer-Based Method
Transformer
[38] was first applied to natural language processing and has achieved great success. Unlike CNNs, which focus on local features, the transformer's attention mechanism helps it establish long-range dependencies so as to make better use of global information in both shallow and deep layers. The proposal of the vision transformer
[39] shows that the transformer has great potential in computer vision (CV). In recent years, more and more researchers have introduced transformers into CV tasks such as object detection, segmentation, multiple object tracking, and so on. Liu et al.
[40] proposed VST, which adopts T2T-ViT as the backbone and introduces a new multi-task decoder and a reverse T2T token upsampling method. Unlike some methods in which class tokens are directly used for image classification by applying a multilayer perceptron to the token embedding, VST performs patch-task attention between patch tokens and task tokens to predict the saliency and boundary maps.
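The long-range modeling that distinguishes transformers from CNNs rests on scaled dot-product self-attention, in which every patch token attends to every other token; the following single-head sketch uses untrained random projections and illustrative dimensions purely for demonstration.

```python
# Single-head scaled dot-product self-attention over patch tokens.
import torch
import torch.nn.functional as F

def self_attention(tokens, d_k=64):
    # tokens: (N, C) patch embeddings; every token attends to every other token
    N, C = tokens.shape
    wq, wk, wv = (torch.randn(C, d_k) for _ in range(3))   # untrained projections
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = F.softmax(q @ k.t() / d_k ** 0.5, dim=-1)       # (N, N) global affinity
    return attn @ v                                        # (N, d_k) attended features

patches = torch.randn(196, 128)   # e.g., 14 x 14 patches with 128-dim embeddings
out = self_attention(patches)
```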
Although the transformer has better representation ability, it incurs enormous computational overhead when processing high-resolution images. To alleviate the challenge of adapting the transformer from language to vision, many researchers have begun to explore transformer structures more suitable for CV. Liu et al.
[41] proposed the Swin transformer, whose key idea is the shifted window scheme, which limits the self-attention computation to non-overlapping local windows while allowing cross-window connections, thereby improving efficiency. Inspired by the Swin transformer, Li et al.
[42] proposed a multi-path transformer structure called LG-Transformer, which can carry out local-to-global reasoning at multiple granularities in each stage, solving the lack of global reasoning in the early stages of previous models. Such methods of applying coarse-grained global attention and fine-grained local attention improve the performance of the model but also weaken the modeling ability of the transformer's original self-attention mechanism. Therefore, Yang et al.
[43] proposed the focal transformer, which combines fine-grained local interaction with coarse-grained global interaction. In the focal transformer, a new mechanism called focal self-attention is introduced, in which each token attends to its nearest surrounding tokens at fine granularity and to far-away tokens at coarse granularity. This mechanism can capture both short-range and long-range visual dependencies while greatly improving computational efficiency.
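The fine-plus-coarse interaction described for focal self-attention can be illustrated with the toy sketch below, where queries in a local window attend both to the fine-grained tokens of that window and to coarse-grained tokens pooled from the whole feature map; the single head, the pooling kernel, and the omission of relative position bias are simplifications rather than the published design.

```python
# Toy illustration of fine-grained local plus coarse-grained global attention.
import torch
import torch.nn.functional as F

def focal_like_attention(x, window=8, pool=4):
    # x: (H, W, C) feature map
    H, W, C = x.shape
    scale = C ** -0.5
    # coarse-grained tokens: average-pool the whole map
    coarse = F.avg_pool2d(x.permute(2, 0, 1).unsqueeze(0), pool).squeeze(0)
    coarse = coarse.permute(1, 2, 0).reshape(-1, C)              # (Nc, C)
    out = torch.empty_like(x)
    for i in range(0, H, window):
        for j in range(0, W, window):
            q = x[i:i + window, j:j + window].reshape(-1, C)     # fine queries
            kv = torch.cat([q, coarse], dim=0)                   # fine + coarse keys/values
            attn = F.softmax((q @ kv.t()) * scale, dim=-1)
            out[i:i + window, j:j + window] = (attn @ kv).reshape(
                min(window, H - i), min(window, W - j), C)
    return out

x = torch.randn(64, 64, 32)        # hypothetical feature map
y = focal_like_attention(x)        # same shape, globally informed features
```

With a window of 8 and a pooling kernel of 4, each query in this sketch sees at most 64 fine tokens plus (H/4) x (W/4) coarse tokens rather than all H x W tokens, which illustrates the source of the efficiency gain described above.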