A Unified Framework for RGB-Infrared Transfer

Infrared (IR) images (both 0.7–3 μm and 8–15 μm) offer radiation intensity texture information that visible images lack, making them particularly helpful in daytime, nighttime, and complex scenes. Many researchers are studying how to translate RGB images into infrared images for deep learning-based visual tasks such as object tracking, crowd counting, panoramic segmentation, and image fusion in urban scenarios. The use of RGB-IR datasets in these tasks can provide comprehensive multi-band fusion data for urban scenes, thereby facilitating precise modeling across different scenarios. To address the challenge of accurately generating high-radiance textures for targets in the infrared spectrum, the proposed approach aims to ensure alignment between the generated infrared images and the radiation features of ground-truth IR images.

  • infrared image
  • image-to-image translation
  • multi-modal controls
  • vector quantization
  • transformer

1. Introduction

In recent years, complex illumination conditions have adversely affected the reliability of visible-light data. These factors are largely beyond our control and significantly reduce the usefulness of the captured images, which poses a significant challenge for processing and training and limits the range of applications for such data. Infrared (IR) images (both 0.7–3 μm and 8–15 μm) offer radiation intensity texture information that visible images lack, making them particularly helpful in daytime, nighttime, and complex scenes. In low-light conditions, infrared images captured through thermal radiation (8–15 μm) provide enriched semantic information, and objects with high thermal temperatures reveal discernible features within intricate scenes. Consequently, deep learning-based cross-modal image translation has become a hot topic in remote sensing research in recent years. Many researchers are studying how to translate RGB images into infrared images for deep learning-based visual tasks such as object tracking, crowd counting, panoramic segmentation, and image fusion in urban scenarios. The use of RGB-IR datasets in these tasks can provide comprehensive multi-band fusion data for urban scenes, thereby facilitating precise modeling across different scenarios.
A large-scale neural network algorithm based on RGB features can be trained on large monomodal public datasets, such as ImageNet [1], PASCAL VOC [2], and MS COCO [3]. However, compared to RGB datasets, public infrared image datasets often suffer from limited scene diversity, a lack of diverse target categories, low data volume, and low resolution. Therefore, researchers have developed a large number of deep learning-based style transfer algorithms to achieve end-to-end translation from RGB images to infrared images, such as CNNs [4][5][6], GANs [7][8][9], and attention networks [10][11][12], which learn and fit the mapping relationship between RGB and IR images. These RGB-IR algorithms approach the task as a pixel-level conditional generation problem. IR images convert the radiation intensity field into grayscale images, so the mapping between IR and RGB images is not based on spectral physical characteristics, and as a result there is no strict pixel-level correspondence [13]. The research conducted in [14][15] indicated that, while a conditioned generative model can successfully generate customized IR images, these models primarily focus on the texture or content transformation from RGB to IR, without considering the diverse transfer mapping relationships between different visual fields. Mono-modality transformation predominantly relies on simplistic semantic matching and transferring strategies, leading to unrealistic expression of radiation information. Due to the global feature extraction and generation mechanisms of the transfer model, vehicles and pedestrians exhibit significant disparities between the generated infrared textures and the ground truth, and they may even be overlooked in some results. Consequently, this limits the flexibility and versatility of such models across scenarios and tasks. In practical applications, it is crucial for the model to accurately translate complex and diverse scenes, data, and task requirements. Therefore, designing a unified visible-infrared transfer framework suitable for multi-scene and multi-task purposes holds significant practical value.
To this end, in this RGB-IR study, the researchers propose a novel multi-modal translation approach. This method not only enhances the overall naturalness of human-computer interaction but also consolidates the information from multiple data sources to generate more comprehensive results.

2. Translation from Image to Image

The translation from image to image was first discussed in [16], to learn the mapping function between source and target domains. This style transfer work mainly faces two significant challenges. First, the imaging principles of IR and RGB sensors differ, and the radiation field in which IR operates varies significantly from the color space; consequently, traditional methods find it difficult to determine the mapping relationship between RGB and IR. Second, mainstream infrared transfer methods are based on end-to-end generative adversarial networks. Among them, cycle consistency is used to handle unpaired data [17][18], while an enhanced attribute space has been proposed to provide diversity [19]. Most algorithms for translating infrared images adopt architectures based on CycleGAN, such as DRIT++ and related methods [20][21][22][23]. In addition, other algorithms provide alternative structural solutions for this task. For example, FastCUT [24] adopts one-sided translation without cycle consistency to improve diversity [25][26], and U-GAT-IT [27] focuses explicitly on geometric transformations of content during translation. Kuang et al. [28] improved the pix2pix method and proposed TIC-CGAN, the first GAN applied to translating thermal IR (8–15 μm) images in traffic scenes. The generator in ThermalGAN [29] used a U-Net-based architecture, and the authors introduced a dedicated dataset named ThermalWorld to enhance training. In DRIT [21], the authors introduced multiple generators, each focused on learning the attributes of different scenes, and a ResNet-based classifier [30] was used to determine which generator's output was most suitable for a given input image.
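For readers unfamiliar with the cycle-consistency constraint mentioned above, the following is a minimal PyTorch sketch. The generator names (G_rgb2ir, G_ir2rgb) and the loss weight are illustrative assumptions, not the implementation of any of the cited works.

```python
import torch.nn as nn

def cycle_consistency_loss(G_rgb2ir, G_ir2rgb, rgb, ir, weight=10.0):
    """CycleGAN-style L1 cycle loss: translating an image to the other
    domain and back again should reconstruct the original image.
    G_rgb2ir and G_ir2rgb are hypothetical generator modules."""
    l1 = nn.L1Loss()
    rgb_rec = G_ir2rgb(G_rgb2ir(rgb))  # RGB -> IR -> RGB
    ir_rec = G_rgb2ir(G_ir2rgb(ir))    # IR -> RGB -> IR
    return weight * (l1(rgb_rec, rgb) + l1(ir_rec, ir))
```

In unpaired training, this term is typically added to the adversarial losses of both generators, which is what allows RGB-IR translation without pixel-aligned image pairs.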
Wang et al. [31] proposed an attention-based hierarchical thermal infrared image colorization network (AHTIC-Net) to enhance the realism and richness of texture information for small objects in translated images. It employs a multi-scale structure to extract features of objects with different sizes, thereby improving the model's focus on small objects during training. In recent years, many transfer models have leaned towards universal style transfer (UST) methods. Representative UST methods include AdaIN [32], WCT [33], and Avatar-Net [34], and these methods have been continuously extended [35][36][37]. However, they are limited in the disentanglement and reconstruction of image content during stylization. In addition, research on extracting image content structure and texture style features has matured considerably. Gatys et al. [38] found that CNN layers can extract content structure and style texture, and proposed an optimization-based iterative generation method for stylized images. Li et al. and Johnson et al. [39][40] used end-to-end models to achieve real-time style transfer for a fixed style. To enable more efficient applications, StyleBank and related methods [41][42][43] combined multiple styles in one model and achieved excellent stylization results. Chen et al. [44] proposed an internal–external style transfer algorithm (IEST) that includes two contrastive losses and can generate more natural stylized effects. However, existing encoder–transfer–decoder style transfer methods cannot handle long-range dependencies, which may result in the loss of detailed information.
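As a concrete reference point for the UST family discussed above, the sketch below shows adaptive instance normalization, the core operation behind AdaIN-style transfer: content features are re-normalized to match the per-channel statistics of the style features. Tensor shapes and the function name are assumptions for illustration, not the cited implementation.

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5):
    """Adaptive instance normalization for feature maps of shape (N, C, H, W):
    align the per-channel mean/std of the content features with those of the style."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

Because the operation only matches first- and second-order channel statistics, it is fast and style-agnostic, but it cannot by itself preserve fine structural detail, which relates to the disentanglement limitation noted above.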
Recently, the effectiveness of vector quantization (VQ) technology as an intermediate representation for generative models has been demonstrated [45][46]. Therefore, in this RGB-IR study, the researchers explore the suitability of vector quantization as an encoder for RGB-IR tasks, where the latent representation obtained through vector quantization serves as the intermediate representation for RGB-IR translation.
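To illustrate the vector-quantization step that provides this intermediate representation, the sketch below performs a nearest-codebook lookup with a straight-through gradient, in the spirit of VQ-VAE/VQGAN-style models. The codebook size, tensor shapes, and function name are illustrative assumptions rather than the exact design of the proposed framework.

```python
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Map each continuous latent vector to its nearest codebook entry.
    z: (N, C, H, W) encoder output; codebook: (K, C) learned embeddings.
    Returns the quantized features and the discrete code indices."""
    n, c, h, w = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, c)   # (N*H*W, C)
    dist = torch.cdist(flat, codebook)            # distances to all K codes
    idx = dist.argmin(dim=1)                      # nearest code per latent vector
    z_q = codebook[idx].reshape(n, h, w, c).permute(0, 3, 1, 2)
    # straight-through estimator so gradients still reach the encoder
    z_q = z + (z_q - z).detach()
    return z_q, idx.reshape(n, h, w)
```

The resulting grid of discrete indices is what a downstream transformer or decoder can consume, which is why VQ latents are attractive as a shared intermediate space between the RGB and IR domains.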