Infrared and Visible Image Fusion Technologies: Comparison

Infrared and visible image fusion combines the thermal radiation information and the detailed texture information of two source images into one informative fused image. Recently, deep learning methods have been widely applied to this task; however, these methods usually fuse the multiple extracted features with the same fusion strategy, which ignores the differences in the representation of these features and results in a loss of information during the fusion process. Infrared and visible image fusion techniques can be divided into two categories: traditional methods and deep learning-based methods. Over the past decades, traditional methods have been proposed for the fusion of pixel-level or fixed features.

  • infrared image
  • visible image
  • transformer

1. Introduction

Image fusion refers to combining the images obtained by different types of sensors to generate a robust or informative image for subsequent processing and decision-making [1,2]. The technique is important for fields such as target detection [3], image enhancement [4], video surveillance [5], remote sensing [6,7,8,9], and defogging [10]. Due to differences in the imaging mechanisms of the sensors, the scene information captured by infrared and visible images differs greatly in contrast and texture. Visible images are mainly produced by reflection imaging, which is strongly dependent on lighting conditions. They usually have high spatial resolution, rich color, and detailed texture, offering a good source of perception under favorable lighting conditions. However, they are vulnerable to insufficient light or bad weather. Infrared images reflect the thermal radiation of objects and are almost unaffected by weather and light, but they usually have low spatial resolution and lack detailed texture information. Therefore, fusing the two images provides more comprehensive information than either single image, which is very useful for subsequent high-level applications [11,12].
Currently, infrared and visible image fusion techniques can be divided into two categories: traditional methods and deep learning-based methods. In the past decades, traditional methods have been proposed for the fusion of pixel-level or fixed features. Traditional image fusion methods mainly include multi-scale transform (MST) [13,14], sparse representation (SR) [15,16], saliency [17,18], and low-rank representation (LRR) [19,20]. MST methods design appropriate fusion strategies to fuse the sub-layers obtained by transform operators, and the result is obtained through the inverse transformation. As a representative MST method, Vanmali et al. [21] employed the Laplacian pyramid as the transform operator and generated a weight map, used to fuse the corresponding layers, by considering local entropy, contrast, and brightness; therefore, good results can be achieved under bad lighting conditions. Yan et al. [22] constructed an edge-preserving filter for image decomposition, which can not only preserve edges but also attenuate the influence of the infrared background, ensuring that the fused image contains rich background information and salient features. However, MST methods depend strongly on the choice of transformation, and inappropriate fusion rules can introduce artifacts into the results [23]. Compared with MST, the goal of SR is to learn an over-complete dictionary to sparsely represent the source images, and the fused image is reconstructed from the fused sparse representation coefficients. Yang et al. [24] adopted a fixed over-complete discrete cosine transform dictionary to represent infrared and visible images. Veshki et al. [25] used a sparse representation with identical support and Pearson correlation constraints without causing strength decay or loss of important information. For target-oriented fusion, saliency methods can maintain the integrity of the significant target area and improve the visual quality of the fused images. Ma et al. [26] employed the rolling guidance and Gaussian filters as multi-scale decomposition operators and used a visual saliency map to make the fusion result contain more visual details. Liu et al. [27] proposed a method combining salient object extraction and low-light region enhancement to improve the overall brightness of the image and make the results more suitable for human perception. As an efficient representation method, LRR decomposes the images with a low-rank representation and then fuses the sub-layers with appropriate rules. Gao et al. [20] proposed combining latent low-rank representation (LatLRR) and the rolling guidance image filter (RGIF) to extract sub-layers from the images, which improved the fusion quality in terms of image contrast, sharpness, and richness of detail. Although traditional methods have achieved good performance, they still have three drawbacks: (1) the quality of handcrafted features determines the effect of fusion; (2) some traditional methods, such as SR, are very time-consuming; (3) specific fusion strategies need to be designed for various image datasets.
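To make the generic MST pipeline concrete, the following is a minimal sketch of Laplacian-pyramid fusion with a simple max-absolute rule for the detail layers and averaging for the base layer. It is only illustrative and much simpler than the weight-map and edge-preserving rules of [21,22]; the function names and the fusion rules are assumptions for this sketch.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    """Decompose an image into a Laplacian pyramid (band-pass layers + coarse residual)."""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)          # detail (band-pass) layer
        cur = down
    pyr.append(cur)                   # low-frequency residual (base layer)
    return pyr

def fuse_pyramids(pyr_ir, pyr_vis):
    """Fuse layer by layer: max-absolute rule for details, averaging for the base layer."""
    fused = [np.where(np.abs(a) > np.abs(b), a, b)
             for a, b in zip(pyr_ir[:-1], pyr_vis[:-1])]
    fused.append(0.5 * (pyr_ir[-1] + pyr_vis[-1]))
    return fused

def reconstruct(pyr):
    """Invert the transform: upsample the residual and add back each detail layer."""
    cur = pyr[-1]
    for detail in reversed(pyr[:-1]):
        cur = cv2.pyrUp(cur, dstsize=(detail.shape[1], detail.shape[0])) + detail
    return np.clip(cur, 0, 255).astype(np.uint8)

# Usage: both inputs are single-channel 8-bit images of the same size.
# ir  = cv2.imread("ir.png",  cv2.IMREAD_GRAYSCALE)
# vis = cv2.imread("vis.png", cv2.IMREAD_GRAYSCALE)
# fused = reconstruct(fuse_pyramids(laplacian_pyramid(ir), laplacian_pyramid(vis)))
```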
Recently, due to its strong adaptability, fault tolerance, and anti-noise capability, deep learning (DL) has been widely used in image fusion and has achieved better performance than traditional methods. According to differences in network structure and output, DL-based fusion methods can be divided into two categories: non-end-to-end learning and end-to-end learning. In the former, the neural networks only extract deep features or output weights to inform the fusion strategy. Liu et al. [28,29] obtained activity level measurements of the images through a Siamese convolutional network and combined them with the Laplacian pyramid to realize efficient fusion of infrared and visible images. Jian et al. [30] proposed a fusion framework based on a decomposition network and saliency analysis (DDNSA). They combined a saliency map and bidirectional edge intensity to fuse the structural and texture features, respectively, and the fusion result can retain more details from the source images. In contrast, end-to-end methods directly produce the fusion results through the network without sophisticated and time-consuming operations. Xu et al. [31] proposed FusionDN, employing a densely connected network to extract features effectively, which can be applied to multiple fusion tasks with the same weights. Ma et al. [32] proposed a new end-to-end model, termed DDcGAN, which establishes an adversarial game between a generator and two discriminators to fuse infrared and visible images at different resolutions. In the past two years, many methods have begun to adopt a feature extraction-fusion-image reconstruction framework. This framework can maximize the capabilities of feature extraction and feature fusion, respectively, and ultimately improve the quality of fusion. Yang et al. [33] proposed a method based on a dual-channel information cross fusion block (DICFB) for cross extraction and preliminary fusion of multi-scale features, with the final image enhanced by saliency information. By considering the illumination factor in the feature extraction stage, Tang et al. [34] proposed a progressive image fusion network termed PIAFusion, which can adaptively maintain the intensity distribution of significant targets and retain the texture information of the background.
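As a reference point for the feature extraction-fusion-reconstruction framework described above, the following PyTorch sketch wires a shared encoder, a learned fusion layer, and a decoder. All module names, channel counts, and the 1×1-convolution fusion are illustrative assumptions, not a reimplementation of DICFB or PIAFusion.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared convolutional feature extractor applied to each source image."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs the fused image from the fused deep features."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, f):
        return self.net(f)

class FusionNet(nn.Module):
    """Extraction -> fusion -> reconstruction; here fusion is a learned 1x1 convolution
    over the concatenated features (far simpler than the strategies in the cited works)."""
    def __init__(self, ch=32):
        super().__init__()
        self.encoder = Encoder(ch)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)
        self.decoder = Decoder(ch)
    def forward(self, ir, vis):
        f_ir, f_vis = self.encoder(ir), self.encoder(vis)
        fused = self.fuse(torch.cat([f_ir, f_vis], dim=1))
        return self.decoder(fused)

# Usage with dummy single-channel inputs:
# model = FusionNet()
# out = model(torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256))
```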
Although the above-mentioned methods have achieved competitive performance, they still have the following disadvantages:
  • The design of the multi-feature fusion strategy is simple and does not make full use of feature information.
  • CNN-based methods only consider local features in the fusion process without modeling long-range dependencies, which will lose global context meaningful for the fusion results.
  • End-to-end methods lack obvious feature extraction steps, resulting in poor fusion results.

2. Auto-Encoder-Based Methods

In CNN-based fusion methods, the last layer is often used to output features or produce the fusion result, which loses the meaningful information contained in the middle layers. To solve this problem, Li et al. [35] proposed DenseFuse for infrared and visible image fusion, which is composed of an encoder network, a fusion strategy, and a decoder network. The encoder network, comprising convolutional layers and dense blocks, is used to extract deep features, and the decoder network is applied to reconstruct the image. In the fusion phase, an addition strategy or an l1-norm strategy is adopted to fuse the deep features, which can preserve more details from the source images. To improve DenseFuse, Li et al. [36] proposed NestFuse, in which the encoder network is changed to a multi-scale network and the nest connection architecture is selected as the decoding network. Due to its spatial/channel attention fusion strategies, the model can better fuse the background details and salient regions of the image. However, this handcrafted strategy cannot effectively utilize multi-modal features. Therefore, Li et al. [37] further proposed RFN-Nest, adopting a residual fusion network to learn the fusion weights. Although these methods achieve good results to some extent, they adopt the same fusion strategy for multi-modal features, which ignores the differences between these features across modalities. To improve the fusion quality, the focal transformer model is adopted, and a self-adaptive fusion strategy is designed for multi-modal features.
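For reference, the following PyTorch sketch illustrates the two handcrafted strategies mentioned for DenseFuse: an element-wise addition and an l1-norm-based weighting. The smoothing window and normalization details are simplifying assumptions in the spirit of the published strategy rather than its exact formulation.

```python
import torch
import torch.nn.functional as F

def addition_fusion(f_ir, f_vis):
    """Element-wise addition of the two deep feature maps."""
    return f_ir + f_vis

def l1_norm_fusion(f_ir, f_vis, kernel_size=3):
    """Weighted fusion in the spirit of DenseFuse's l1-norm strategy:
    per-pixel activity = l1 norm over channels, smoothed by local averaging,
    then normalized into soft weights for each source's features."""
    def activity(f):
        a = f.abs().sum(dim=1, keepdim=True)                  # l1 norm across channels
        return F.avg_pool2d(a, kernel_size, stride=1,
                            padding=kernel_size // 2)         # local average smoothing
    a_ir, a_vis = activity(f_ir), activity(f_vis)
    w_ir = a_ir / (a_ir + a_vis + 1e-8)                       # soft, per-pixel weights
    return w_ir * f_ir + (1.0 - w_ir) * f_vis

# Usage on feature maps of shape (batch, channels, H, W):
# fused = l1_norm_fusion(torch.rand(1, 64, 128, 128), torch.rand(1, 64, 128, 128))
```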

3. Transformer-Based Method

Transformer [38] was first applied to natural language processing and has achieved great success. Unlike CNNs, which focus on local features, the transformer's attention mechanism helps it establish long-range dependencies and thus make better use of global information in both shallow and deep layers. The vision transformer [39] showed that transformers have great potential in computer vision (CV). In recent years, more and more researchers have introduced transformers into CV tasks such as object detection, segmentation, and multiple object tracking. Liu et al. [40] proposed VST, which adopts T2T-ViT as the backbone and introduces a new multitask decoder and a reverse T2T token upsampling method. Unlike methods in which class tokens are directly used for image classification by applying a multilayer perceptron to the token embedding, VST performs patch-task attention between patch tokens and task tokens to predict the saliency and boundary maps. Although the transformer has better representation ability, it incurs an enormous computational overhead when processing high-resolution images. To alleviate the challenge of adapting the transformer from language to vision, many researchers have explored transformer structures more suitable for CV. Liu et al. [41] proposed the Swin transformer, whose key is the shifted window scheme, which limits the self-attention calculation to non-overlapping local windows while allowing cross-window connections, thereby improving efficiency. Inspired by the Swin transformer, Li et al. [42] proposed a multi-path transformer structure called LG-Transformer, which can carry out local-to-global reasoning at multiple granularities in each stage and addresses the lack of global reasoning in the early stages of previous models. Such methods of applying coarse-grained global attention and fine-grained local attention improve the performance of the model but also weaken the modeling ability of the transformer's original self-attention mechanism. Therefore, Yang et al. [43] proposed the focal transformer, which combines fine-grained local interaction with coarse-grained global interaction. The focal transformer introduces a new mechanism called focal self-attention, in which each token attends to its nearest surrounding tokens at fine granularity and to far-away tokens at coarse granularity. This method can capture both short-range and long-range visual dependencies while greatly improving computational efficiency.
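The following PyTorch sketch illustrates, in a highly simplified form, the fine/coarse attention idea behind window-based and focal self-attention: queries from non-overlapping local windows attend both to their own window tokens and to pooled global summaries. Learned projections, multiple heads, relative position bias, and the actual focal region scheme are omitted, so this is an assumption-laden illustration rather than the published architecture.

```python
import torch
import torch.nn.functional as F

def focal_style_attention(x, window=8, pool=4):
    """Toy fine/coarse attention: each non-overlapping local window (fine tokens)
    attends to its own tokens plus pooled summaries of the whole map (coarse tokens)."""
    B, C, H, W = x.shape
    # Fine tokens: split the feature map into non-overlapping windows.
    fine = x.unfold(2, window, window).unfold(3, window, window)     # B,C,nH,nW,w,w
    nH, nW = fine.shape[2], fine.shape[3]
    fine = fine.permute(0, 2, 3, 4, 5, 1).reshape(B * nH * nW, window * window, C)
    # Coarse tokens: pool the whole map and share them with every window.
    coarse = F.adaptive_avg_pool2d(x, pool).flatten(2).transpose(1, 2)  # B, pool*pool, C
    coarse = coarse.repeat_interleave(nH * nW, dim=0)
    q = fine
    kv = torch.cat([fine, coarse], dim=1)        # local + global context for keys/values
    attn = torch.softmax(q @ kv.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ kv                              # B*nH*nW, window*window, C
    # Fold the windows back into a (B, C, H, W) feature map.
    out = out.reshape(B, nH, nW, window, window, C).permute(0, 5, 1, 3, 2, 4)
    return out.reshape(B, C, H, W)

# Usage: spatial size must be divisible by `window`.
# y = focal_style_attention(torch.rand(1, 32, 64, 64))
```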

References

  1. Hu, H.M.; Wu, J.; Li, B.; Guo, Q.; Zheng, J. An Adaptive Fusion Algorithm for Visible and Infrared Videos Based on Entropy and the Cumulative Distribution of Gray Levels. IEEE Trans. Multimed. 2017, 19, 2706–2719.
  2. Zhao, W.; Lu, H.; Wang, D. Multisensor Image Fusion and Enhancement in Spectral Total Variation Domain. IEEE Trans. Multimed. 2018, 20, 866–879.
  3. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9627–9636.
  4. Kou, F.; Wei, Z.; Chen, W.; Wu, X.; Wen, C.; Li, Z. Intelligent Detail Enhancement for Exposure Fusion. IEEE Trans. Multimed. 2018, 20, 484–495.
  5. Arroyo, S.; Bussi, U.; Safar, F.; Oliva, D. A Monocular Wide-Field Vision System for Geolocation with Uncertainties in Urban Scenes. Eng. Res. Express 2020, 2, 025041.
  6. Rajah, P.; Odindi, J.; Mutanga, O. Feature Level Image Fusion of Optical Imagery and Synthetic Aperture Radar (SAR) for Invasive Alien Plant Species Detection and Mapping. Remote Sens. Appl. Soc. Environ. 2018, 10, 198–208.
  7. Ma, J.; Yu, W.; Chen, C.; Liang, P.; Guo, X.; Jiang, J. Pan-GAN: An Unsupervised Pan-Sharpening Method for Remote Sensing Image Fusion. Inf. Fusion 2020, 62, 110–120.
  8. Liu, W.; Yang, J.; Zhao, J.; Guo, F. A Dual-Domain Super-Resolution Image Fusion Method With SIRV and GALCA Model for PolSAR and Panchromatic Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
  9. Ying, J.; Shen, H.-L.; Cao, S.-Y. Unaligned Hyperspectral Image Fusion via Registration and Interpolation Modeling. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
  10. Zhu, Z.; Wei, H.; Hu, G.; Li, Y.; Qi, G.; Mazur, N. A Novel Fast Single Image Dehazing Algorithm Based on Artificial Multiexposure Image Fusion. IEEE Trans. Instrum. Meas. 2021, 70, 1–23.
  11. Paramanandham, N.; Rajendiran, K. Infrared and Visible Image Fusion Using Discrete Cosine Transform and Swarm Intelligence for Surveillance Applications. Infrared Phys. Technol. 2018, 88, 13–22.
  12. Wang, G.; Li, W.; Gao, X.; Xiao, B.; Du, J. Functional and Anatomical Image Fusion Based on Gradient Enhanced Decomposition Model. IEEE Trans. Instrum. Meas. 2022, 71, 1–14.
  13. Li, G.; Lin, Y.; Qu, X. An Infrared and Visible Image Fusion Method Based on Multi-Scale Transformation and Norm Optimization. Inf. Fusion 2021, 71, 109–129.
  14. Jian, L.; Yang, X.; Zhou, Z.; Zhou, K.; Liu, K. Multi-Scale Image Fusion through Rolling Guidance Filter. Future Gener. Comput. Syst. 2018, 83, 310–325.
  15. Maqsood, S.; Javed, U. Multi-Modal Medical Image Fusion Based on Two-Scale Image Decomposition and Sparse Representation. Biomed. Signal Process. Control 2020, 57, 101810.
  16. Zhang, Q.; Liu, Y.; Blum, R.S.; Han, J.; Tao, D. Sparse Representation Based Multi-Sensor Image Fusion for Multi-Focus and Multi-Modality Images: A Review. Inf. Fusion 2018, 40, 57–75.
  17. Li, Q.; Han, G.; Liu, P.; Yang, H.; Wu, J.; Liu, D. An Infrared and Visible Image Fusion Method Guided by Saliency and Gradient Information. IEEE Access 2021, 9, 108942–108958.
  18. Zhang, X.; Ma, Y.; Fan, F.; Zhang, Y.; Huang, J. Infrared and Visible Image Fusion via Saliency Analysis and Local Edge-Preserving Multi-Scale Decomposition. JOSA A 2017, 34, 1400–1410.
  19. Li, H.; Wu, X.J. Infrared and Visible Image Fusion Using Latent Low-Rank Representation. arXiv 2022, arXiv:1804.08992v5.
  20. Gao, C.; Song, C.; Zhang, Y.; Qi, D.; Yu, Y. Improving the Performance of Infrared and Visible Image Fusion Based on Latent Low-Rank Representation Nested With Rolling Guided Image Filtering. IEEE Access 2021, 9, 91462–91475.
  21. Vanmali, A.V.; Gadre, V.M. Visible and NIR Image Fusion Using Weight-Map-Guided Laplacian–Gaussian Pyramid for Improving Scene Visibility. Sādhanā 2017, 42, 1063–1082.
  22. Yan, H.; Zhang, J.-X.; Zhang, X. Injected Infrared and Visible Image Fusion via ℓ1 Decomposition Model and Guided Filtering. IEEE Trans. Comput. Imaging 2022, 8, 162–173.
  23. Zhou, X.; Wang, W. Infrared and Visible Image Fusion Based on Tetrolet Transform. In Proceedings of the 2015 International Conference on Communications, Signal Processing, and Systems, Tianjin, China, 23–24 October 2015; pp. 701–708.
  24. Yang, B.; Yang, C.; Huang, G. Efficient Image Fusion with Approximate Sparse Representation. Int. J. Wavelets Multiresolution Inf. Process. 2016, 14, 1650024.
  25. Veshki, F.G.; Ouzir, N.; Vorobyov, S.A.; Ollila, E. Multimodal Image Fusion via Coupled Feature Learning. Signal Process. 2022, 200, 108637.
  26. Ma, J.; Zhou, Z.; Wang, B.; Zong, H. Infrared and Visible Image Fusion Based on Visual Saliency Map and Weighted Least Square Optimization. Infrared Phys. Technol. 2017, 82, 8–17.
  27. Liu, Y.; Dong, L.; Xu, W. Infrared and Visible Image Fusion via Salient Object Extraction and Low-Light Region Enhancement. Infrared Phys. Technol. 2022, 124, 104223.
  28. Liu, Y.; Chen, X.; Peng, H.; Wang, Z. Multi-Focus Image Fusion with a Deep Convolutional Neural Network. Inf. Fusion 2017, 36, 191–207.
  29. Liu, Y.; Chen, X.; Cheng, J.; Peng, H.; Wang, Z. Infrared and Visible Image Fusion with Convolutional Neural Networks. Int. J. Wavelets Multiresolution Inf. Process. 2017, 16, 1850018.
  30. Jian, L.; Rayhana, R.; Ma, L.; Wu, S.; Liu, Z.; Jiang, H. Infrared and Visible Image Fusion Based on Deep Decomposition Network and Saliency Analysis. IEEE Trans. Multimed. 2021, 1.
  31. Xu, H.; Ma, J.; Le, Z.; Jiang, J.; Guo, X. FusionDN: A Unified Densely Connected Network for Image Fusion. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12484–12491.
  32. Ma, J.; Xu, H.; Jiang, J.; Mei, X.; Zhang, X.-P. DDcGAN: A Dual-Discriminator Conditional Generative Adversarial Network for Multi-Resolution Image Fusion. IEEE Trans. Image Process. 2020, 29, 4980–4995.
  33. Yang, Y.; Kong, X.; Huang, S.; Wan, W.; Liu, J.; Zhang, W. Infrared and Visible Image Fusion Based on Multiscale Network with Dual-Channel Information Cross Fusion Block. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–7.
  34. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A Progressive Infrared and Visible Image Fusion Network Based on Illumination Aware. Inf. Fusion 2022, 83, 79–92.
  35. Li, H.; Wu, X.J. DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Trans. Image Process. 2019, 28, 2614–2623.
  36. Li, H.; Wu, X.J.; Durrani, T. NestFuse: An Infrared and Visible Image Fusion Architecture Based on Nest Connection and Spatial/Channel Attention Models. IEEE Trans. Instrum. Meas. 2020, 69, 9645–9656.
  37. Li, H.; Wu, X.J.; Kittler, J. RFN-Nest: An End-to-End Residual Fusion Network for Infrared and Visible Images. Inf. Fusion 2021, 73, 72–86.
  38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
  39. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929v2.
  40. Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual Saliency Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 4722–4732.
  41. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
  42. Li, J.; Yan, Y.; Liao, S.; Yang, X.; Shao, L. Local-to-Global Self-Attention in Vision Transformers. arXiv 2021, arXiv:2107.04735v1.
  43. Yang, J.; Li, C.; Zhang, P.; Dai, X.; Xiao, B.; Yuan, L.; Gao, J. Focal Self-Attention for Local-Global Interactions in Vision Transformers. arXiv 2021, arXiv:2107.00641v1.