Deep Learning-Based IVIF Approaches: Comparison

Infrared and visible image fusion (IVIF) aims to render fused images that maintain the merits of both modalities. 

  • infrared–visible image
  • image fusion
  • deep learning

1. Introduction

Image fusion is a basic and popular topic in image processing that seeks to generate informative fused images by integrating essential information from multiple source images. Infrared and visible image fusion (IVIF) is one of the important sub-categories of image fusion [1]. IVIF focuses on preserving detailed texture and thermal information in the input images [2]. The fused images can mitigate the disadvantages of visible images, being susceptible to illumination and other environmental conditions, as well as avoiding the issue of infrared images lacking texture.
Numerous methods have been proposed to tackle the challenge of IVIF [3][4][5][6][7][8][9][10][11]. These methods can be mainly categorized into deep learning-based approaches and conventional methods. Deep learning methods are becoming increasingly popular in the fusion task due to their ability to extract high-level semantic features [5][7][10][12], but there is still a need for improvement in preserving complex and irregular edges within images. Infrared and visible images, coming from the same scene, inherently share statistical co-occurrent information, such as background and large-scale features. Transformer-based deep learning frameworks are good at extracting global features from inputs, so they are well suited for fusing the main features of infrared and visible images.
Conventional methods offer better interpretability, and their rich prior knowledge enables the design of fusion techniques that effectively preserve high- and low-frequency information; however, they may suffer from high design complexity. Conventional fusion methods can be generally divided into several categories according to their adopted theories [13], i.e., multi-scale transformation (MST), saliency-based methods, sparse representation, subspace methods, etc. One of the most active and well-established fields for image fusion is MST. It decomposes input images into a base layer containing the low-frequency main features and detail layers containing high-frequency texture and edges. Some studies demonstrated that MST-based methods are aligned with human visual characteristics [14][15], and this property enables fused images to have an appropriate visual effect. Many MST-based fusion schemes employ weighted averaging or maximum value rules. Simple weighted averaging may diminish the contrast of salient regions, while the pixel-wise application of the maximum value strategy may not adequately preserve the continuity of edges and textures.
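As a toy illustration of this base/detail scheme, the sketch below uses a Gaussian low-pass filter as a stand-in for a full multi-scale transform and applies exactly the two simple rules discussed above: weighted averaging for the base layers and a pixel-wise maximum for the detail layers. The function name two_scale_fusion and the choice of Gaussian blur are illustrative assumptions, not a method from the cited works.

```python
import cv2
import numpy as np

def two_scale_fusion(ir: np.ndarray, vis: np.ndarray, ksize: int = 31) -> np.ndarray:
    """Fuse two registered grayscale images given as float32 arrays in [0, 1]."""
    # Base (low-frequency) layers from a Gaussian low-pass; details are the residuals.
    base_ir = cv2.GaussianBlur(ir, (ksize, ksize), 0)
    base_vis = cv2.GaussianBlur(vis, (ksize, ksize), 0)
    detail_ir, detail_vis = ir - base_ir, vis - base_vis

    # Weighted averaging for the base layers, pixel-wise max-absolute rule for the details.
    fused_base = 0.5 * (base_ir + base_vis)
    fused_detail = np.where(np.abs(detail_ir) >= np.abs(detail_vis), detail_ir, detail_vis)
    return np.clip(fused_base + fused_detail, 0.0, 1.0)
```

Running this on an infrared/visible pair makes the drawbacks visible: the averaged base lowers the contrast of hot targets, and the pixel-wise detail selection can break edges into fragments.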

2. Multi-Scale Transformation-Based Fusion Methods

Multi-scale transformation (MST) contains many methods, such as the wavelet transform, the contourlet transform, the nonsubsampled contourlet transform (NSCT), and the nonsubsampled shearlet transform (NSST). Various MST-based methods have been applied to image fusion [14][16]. NSCT was proposed by Da Cunha et al. [17] and is based on the contourlet transform [18]. NSCT has been widely applied in infrared and visible image fusion; for example, the entropy of the square of the coefficients and the sum of the modified Laplacian were utilized in the frequency domain [16]. Easley et al. proposed NSST [19], which is realized by a nonsubsampled Laplacian pyramid and shearing filters. Zhang et al. [9] proposed a new image fusion method based on global–regional–local rules to overcome the problem of misinterpreting the source images. The source images are statistically correlated by the G-CHMM, R-CHMM, and L-CHMM models in the high-frequency subbands. The high-pass subbands are fused by the global–regional–local CHMM design and choose-max rules based on a local gradient measure. Finally, the fused image is obtained by applying the inverse NSST. Liu X et al. [3] proposed a multi-modality medical image fusion algorithm that utilizes a moving frame-based decomposition framework (MFDF) and the NSST. The MFDF is applied to decompose the source images into texture components and approximation components. The maximum selection fusion rule is employed to fuse the texture components, aiming to transfer salient gradient information to the fused image. The approximation components are merged using NSST. Finally, a component synthesis process is adopted to produce the fused image. Liu et al. proposed an image fusion algorithm based on NSST and the modified spatial frequency (MSF) [4]. When the high-frequency and low-frequency subbands of the source images are compared, the coefficients with the greater MSF are selected to build the fused image. Miao et al. [20] proposed an image fusion algorithm based on the NSST. The algorithm employs an average fusion strategy for the low-frequency information and a novel method to fuse the high-frequency information.
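Because NSCT and NSST are not available in common image libraries, the sketch below illustrates the shared fusion logic of these methods with a Laplacian pyramid as a stand-in: high-frequency bands are merged by the maximum-selection rule and the low-frequency residual by averaging. The function names are hypothetical and the code does not reproduce the exact algorithm of any cited paper.

```python
import cv2
import numpy as np

def laplacian_pyramid(img: np.ndarray, levels: int = 4) -> list:
    """Decompose a float32 grayscale image into high-frequency bands plus a low-frequency residual."""
    pyr, current = [], img
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        pyr.append(current - up)   # high-frequency band at this scale
        current = down
    pyr.append(current)            # low-frequency residual
    return pyr

def fuse_pyramids(pyr_a: list, pyr_b: list) -> np.ndarray:
    """Maximum selection on high-frequency bands, averaging on the low-frequency residual."""
    fused = [np.where(np.abs(a) >= np.abs(b), a, b)
             for a, b in zip(pyr_a[:-1], pyr_b[:-1])]
    fused.append(0.5 * (pyr_a[-1] + pyr_b[-1]))
    # Reconstruct by upsampling and adding the bands back from coarse to fine.
    out = fused[-1]
    for band in reversed(fused[:-1]):
        out = cv2.pyrUp(out, dstsize=(band.shape[1], band.shape[0])) + band
    return out
```

Swapping the Laplacian pyramid for an NSCT or NSST implementation, and the max/average rules for the coefficient measures described above (entropy of squared coefficients, modified Laplacian, MSF, CHMM models), recovers the general structure of the cited schemes.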
Many MST-based fusion methods utilize weighted averaging or maximum value strategies. However, simple weighted averaging may reduce the contrast of salient regions, and the simple maximum value strategy is applied pixel-wise, which may not preserve the continuity of edges and textures. To tackle these limitations, the researchers propose an edge-consistency fusion method. This method incorporates activity rules to preserve the brightness of salient edges and achieves texture continuity and integrity through consistency verification.
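One common form of consistency verification is a local majority vote over the binary decision map produced by a choose-max rule, so that isolated pixels do not flip between sources. The sketch below shows only that generic idea, assuming decision value 1 means the infrared coefficient was chosen; the window size and the specific activity rule of the proposed edge-consistency method are not given here.

```python
import cv2
import numpy as np

def consistency_verification(decision: np.ndarray, win: int = 5) -> np.ndarray:
    """Relabel isolated pixels in a binary decision map by a local majority vote."""
    votes = cv2.boxFilter(decision.astype(np.float32), -1, (win, win))  # local mean of the labels
    return votes > 0.5  # a pixel keeps label 1 only if most of its neighbourhood chose 1
```

Applied to the detail layers, for example, `mask = consistency_verification(np.abs(detail_ir) >= np.abs(detail_vis))` followed by `np.where(mask, detail_ir, detail_vis)` selects coefficients region-wise rather than pixel-wise, which is what keeps edges and textures continuous.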

3. Deep Learning-Based Fusion Methods

The convolutional neural network (CNN) is a commonly used deep learning model. In STDFusionNet [2], a salient target mask is employed to enhance the contrast information from the infrared image in the fused image. This approach aims to achieve a significant injection of contrast information. SeAFusion [7] is a novel semantic-aware framework for fusing infrared and visible images, achieving outstanding performance in both image fusion and high-level vision tasks. These methods leverage CNNs to extract features and perform fusion operations, enabling the effective integration of information from different modalities or sources. FusionGAN [6] is a groundbreaking method that applies a generative adversarial network (GAN) to the field of image fusion. It establishes a generative adversarial framework between the fused image and the visible image, allowing the fused image to acquire texture and structure in a more enhanced manner. Following FusionGAN, numerous GAN-inspired fusion methods have been proposed, such as TarDal [5]. Additionally, a wide range of fusion methods based on autoencoders (AE) have been proposed. These methods commonly employ an AE to extract features from the source images and achieve image reconstruction. AEs can capture relevant information and reconstruct images effectively, making them a popular choice in fusion techniques. DenseFuse [8] uses the structural strength of DenseNet [21], resulting in an effective fusion outcome. DIDFuse [10] is also an AE-based image fusion method, replacing the transforms and inverse transforms of conventional decomposition schemes with encoders and decoders.
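To make the encode–fuse–decode pipeline of the AE-based methods concrete, the sketch below shows a minimal fusion network in PyTorch, assuming single-channel inputs in [0, 1]. The two-layer encoder/decoder, the layer widths, and the element-wise additive fusion rule are assumptions for illustration only and do not reproduce the DenseFuse or DIDFuse architectures.

```python
import torch
import torch.nn as nn

class FusionAE(nn.Module):
    """Minimal encode-fuse-decode network for two registered source images."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # A shared encoder extracts features from both modalities.
        feat_ir, feat_vis = self.encoder(ir), self.encoder(vis)
        # Simple additive fusion of the feature maps; published methods use richer rules.
        fused = feat_ir + feat_vis
        return self.decoder(fused)

# Usage: model = FusionAE(); out = model(ir_tensor, vis_tensor) for tensors of shape (N, 1, H, W).
```

In practice such networks are first trained as plain autoencoders on single images (reconstruction loss), and the fusion rule is applied between the trained encoder and decoder at test time.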
The deep learning methods excel at extracting high-level semantic features, and the AE-based approaches are capable of capturing global information from images, making the extraction of features shared between infrared and visible images, such as background and large-scale features, more effective. This advantage makes them well suited for fusing the main features of images. Therefore, the researchers design a correlation-driven AE-based method for fusing the main information of the images.

References

  1. Yin, R.; Yang, B.; Huang, Z.; Zhang, X. DSA-Net: Infrared and Visible Image Fusion via Dual-Stream Asymmetric Network. Sensors 2023, 23, 7079.
  2. Ma, J.; Tang, L.; Xu, M.; Zhang, H.; Xiao, G. STDFusionNet: An infrared and visible image fusion network based on salient target detection. IEEE Trans. Instrum. Meas. 2021, 70, 1–13.
  3. Liu, X.; Mei, W.; Du, H. Multi-modality medical image fusion based on image decomposition framework and nonsubsampled shearlet transform. Biomed. Signal Process. Control 2018, 40, 343–350.
  4. Liu, J.; Gao, M. Image Fusion by Modified Spatial Frequency and Nonsubsampled Shearlet Transform. Int. J. Signal Process. Image Process. Pattern Recognit. 2017, 10, 27–34.
  5. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811.
  6. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26.
  7. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42.
  8. Li, H.; Wu, X.J. DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Trans. Image Process. 2019, 28, 2614–2623.
  9. Zhang, Z.; Xi, X.; Luo, X.; Jiang, Y.; Dong, J.; Wu, X. Multimodal image fusion based on global-regional-local rule in NSST domain. Multimed. Tools Appl. 2021, 80, 2847–2873.
  10. Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Li, P.; Zhang, J. DIDFuse: Deep image decomposition for infrared and visible image fusion. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan, 11–17 July 2020; pp. 970–976.
  11. Yang, D.; Wang, X.; Zhu, N.; Li, S.; Hou, N. MJ-GAN: Generative Adversarial Network with Multi-Grained Feature Extraction and Joint Attention Fusion for Infrared and Visible Image Fusion. Sensors 2023, 23, 6322.
  12. Zhu, H.; Wu, H.; Wang, X.; He, D.; Liu, Z.; Pan, X. DPACFuse: Dual-Branch Progressive Learning for Infrared and Visible Image Fusion with Complementary Self-Attention and Convolution. Sensors 2023, 23, 7205.
  13. Ma, J.; Ma, Y.; Li, C. Infrared and visible image fusion methods and applications: A survey. Inf. Fusion 2019, 45, 153–178.
  14. Liu, Y.; Liu, S.; Wang, Z. A general framework for image fusion based on multi-scale transform and sparse representation. Inf. Fusion 2015, 24, 147–164.
  15. Donoho, D.L.; Flesia, A.G. Can recent innovations in harmonic analysis 'explain' key findings in natural image statistics? Netw. Comput. Neural Syst. 2001, 12, 371–393.
  16. Ganasala, P.; Kumar, V. CT and MR image fusion scheme in nonsubsampled contourlet transform domain. J. Digit. Imaging 2014, 27, 407–418.
  17. Da Cunha, A.; Zhou, J.; Do, M. The Nonsubsampled Contourlet Transform: Theory, Design, and Applications. IEEE Trans. Image Process. 2006, 15, 3089–3101.
  18. Do, M.; Vetterli, M. The Contourlet Transform: An Efficient Directional Multiresolution Image Representation. IEEE Trans. Image Process. 2005, 14, 2091–2106.
  19. Easley, G.; Labate, D.; Lim, W.Q. Sparse directional image representations using the discrete shearlet transform. Appl. Comput. Harmon. Anal. 2008, 25, 25–46.
  20. Miao, Q.G.; Shi, C.; Xu, P.F.; Yang, M.; Shi, Y.B. A novel algorithm of image fusion using shearlets. Opt. Commun. 2011, 284, 1540–1547.
  21. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.