Unsupervised image-to-image translation has received considerable attention owing to recent remarkable advances in generative adversarial networks (GANs). In image-to-image translation, state-of-the-art methods use unpaired image data to learn mappings between the source and target domains. However, despite their promising results, existing approaches often fail under challenging conditions, particularly when images contain multiple target instances, when a translation task involves significant shape transitions, and when low-level information rather than high-level semantics is translated, producing visual artifacts. To tackle these problems, we propose a novel framework called Progressive Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization (PRO-U-GAT-IT) for the unsupervised image-to-image translation task. In contrast to existing attention-based models that fail to handle geometric transitions between the source and target domains, our model can translate images that require extensive and holistic changes in shape. Experimental results show the superiority of the proposed approach over existing state-of-the-art models on different datasets.
1. Introduction
In recent years, generative adversarial networks (GANs) have made significant progress in image-to-image translation. Researchers in machine learning and computer vision have given this topic considerable attention because of the wide range of practical applications available
[1][2]. These include image inpainting
[3][4], colorization
[5][6], super-resolution
[7][8], and style transfer
[9][10]. Image-to-image translation refers to a category of vision and graphics problems in which the goal is to learn the mapping between an input image (source domain) and an output image (target domain) from a set of aligned image pairs
[11]. In the case of portrait stylization, various methods have been explored, such as self-to-anime
[1] and cartoon
[12]. There are, however, many tasks for which paired training data are not available. When paired data are provided, the mapping model can be trained using a conditional generative model
[13][14][15] or a simple regression model
[5][16][17] in a supervised manner.
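As a toy illustration of this supervised setting, a mapping can be fit by minimizing a reconstruction loss over aligned pairs. The sketch below is illustrative only (the function names and the linear "generator" are assumptions, not part of any cited method):

```python
import numpy as np

def paired_l1_loss(G, pairs):
    """Supervised regression objective: mean ||G(x) - y||_1 over aligned pairs."""
    return float(np.mean([np.abs(G(x) - y).mean() for x, y in pairs]))

# Toy paired dataset where each target is a fixed transform of its source.
rng = np.random.default_rng(0)
pairs = [(x, 3.0 * x) for x in (rng.random((2, 2)) for _ in range(5))]

G_perfect = lambda x: 3.0 * x  # a "generator" that matches the true mapping
assert paired_l1_loss(G_perfect, pairs) == 0.0
```

With paired supervision the objective is a straightforward regression; the difficulty discussed next is that such pairs are often unavailable.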
Various works
[18][19][20][21][22][23][24][25] have successfully translated images in unsupervised settings, without paired data, by adopting the shared latent space
[22] and cycle consistency assumptions
[11][21]. Supervised approaches, however, require paired datasets for training, which can be laborious and expensive, if at all possible, to prepare manually. In contrast, unsupervised methods need a large volume of unpaired data and frequently struggle to reach stable training convergence and to generate high-resolution results
[26].
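The cycle-consistency assumption underlying these unpaired methods can be written as a reconstruction penalty: translating to the other domain and back should recover the input. A minimal NumPy sketch (the linear G and F below are toy stand-ins for networks, chosen so the loss is verifiably zero):

```python
import numpy as np

def cycle_consistency_loss(G, F, x, y):
    """L_cyc = ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1 for mappings G: X->Y, F: Y->X."""
    forward = np.abs(F(G(x)) - x).mean()   # x -> Y -> back to X
    backward = np.abs(G(F(y)) - y).mean()  # y -> X -> back to Y
    return forward + backward

# Toy "generators" that are exact inverses, so the cycle loss vanishes.
G = lambda x: 2.0 * x + 1.0      # hypothetical mapping X -> Y
F = lambda y: (y - 1.0) / 2.0    # its inverse, Y -> X

x = np.random.rand(4, 8)
y = np.random.rand(4, 8)
assert cycle_consistency_loss(G, F, x, y) < 1e-9
```

In practice this term is added to the adversarial losses of both domains, constraining the otherwise ill-posed unpaired mapping.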
Despite their progress and benefits, previous techniques have shortcomings and often fail on challenging tasks, especially when the target images contain multiple instances to be translated
[27] or the shape of the target instances has drastically changed
[11]. For example, they are effective for style transfer tasks that map local textures, such as photo2vangogh and photo2portrait. However, they are ineffective for image translation tasks with extensive shape transformations, such as selfie2anime and cat2dog, in wild images. Consequently, pre-processing measures such as image cropping and alignment can significantly mitigate these difficulties by limiting the complexity of the data distributions
[1][2]. Further, current methods like DRIT
[28] cannot produce the desired results for both appearance-preserving translation (such as horse2zebra) and shape-transforming translation (such as cat2dog) because of their fixed network structure and hyperparameters; the network architecture or hyperparameters must be adjusted for each dataset.
In 2014, Ian Goodfellow et al.
[29] introduced generative adversarial networks (GANs), which can solve image-to-image problems, including anime face style transfer. Pix2Pix
[13] and CycleGAN
[11], both published in 2017, are the two primary GAN-based approaches that successfully address image-to-image problems. CartoonGAN
[30] was introduced in 2018 as an upgrade of Pix2Pix, specializing in the cartoon sector. Nevertheless, all the earlier methods merely transfer textures. Junho Kim et al., therefore, presented U-GAT-IT
[1], a technique based on CycleGAN that can handle both texture and geometry transfer. However, the geometry of the generated image can still differ dramatically from that of the input human face; consequently, the output does not maintain the input's signature.
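U-GAT-IT couples its attention module with adaptive layer-instance normalization (AdaLIN), which blends instance- and layer-normalized activations through a learned ratio ρ clipped to [0, 1]. A minimal NumPy sketch (scalar γ, β, and ρ for simplicity; in the actual model these are learned, with γ and β predicted from the attention features):

```python
import numpy as np

def adalin(x, gamma, beta, rho, eps=1e-5):
    """Adaptive layer-instance normalization over a (N, C, H, W) batch."""
    # Instance normalization: statistics per sample and per channel.
    x_in = (x - x.mean(axis=(2, 3), keepdims=True)) / \
        np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    # Layer normalization: statistics per sample over all channels and pixels.
    x_ln = (x - x.mean(axis=(1, 2, 3), keepdims=True)) / \
        np.sqrt(x.var(axis=(1, 2, 3), keepdims=True) + eps)
    rho = np.clip(rho, 0.0, 1.0)  # learned blending ratio, kept in [0, 1]
    return gamma * (rho * x_in + (1.0 - rho) * x_ln) + beta

x = np.random.randn(2, 3, 8, 8)
out = adalin(x, gamma=1.0, beta=0.0, rho=0.9)
assert out.shape == x.shape and np.all(np.isfinite(out))
```

Intuitively, ρ near 1 behaves like instance normalization (suited to preserving content), while ρ near 0 behaves like layer normalization (suited to larger style and shape changes).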
This paper proposes Progressive U-GAT-IT (PRO-U-GAT-IT), a novel framework for unsupervised image-to-image translation tasks, which incorporates an attention module and a learnable normalization function in an end-to-end strategy. Based on the attention map obtained by the auxiliary classifier, our model guides the translation so that it focuses on the more essential regions and disregards minor areas by distinguishing between the source and the target domain. Furthermore, these attention maps are embedded in the generator and discriminator to emphasize relevant critical areas, thereby enabling shape transformation. For example, the generator's attention map focuses on regions that distinguish between the two domains, whereas the discriminator's attention map assists in fine-tuning by concentrating on the difference between real and fake images. Additionally, we discovered that the selection of the normalization function substantially influences the quality of the transformed outcomes for various datasets with varying degrees of shape and texture changes. Furthermore, earlier approaches have limitations, including blurry results, unstable training, low resolutions, and limited variation. Moreover, high-resolution images are difficult to generate because their higher resolution makes them easily distinguishable from training images. Finally, due to memory constraints, large resolutions also require smaller mini-batches, compromising training stability. Nevertheless, recent improvements in the resolution and quality of images produced by generative methods, particularly GANs, have been observed.
The contributions of our work are summarized as follows:
-
We propose a framework that improves the image-to-image translation model through a progressive block-training approach. This novel technique allows for the acquisition of distinct features during various training phases, leading to several notable advantages. These include reduced VRAM usage, accelerated training speed on par with or surpassing other methods when using the same device, and the ability to achieve successful image translation at higher resolutions.
-
Furthermore, we propose a novel research field that emphasizes the exploration and refinement of progressive image-to-image translation techniques. Our aim is to enhance both the quality of results and the overall efficiency of image-to-image translation models.
2. Generative Adversarial Networks (GANs)
GANs
[29] are persuasive generative models that have attained pleasing results in various computer vision applications such as super-resolution imaging [31], image translation [32], and video generation [33]. Karras et al. proposed a method based on a simple progressive growing of GANs [34] to synthesize large (for example, 256 × 256) realistic images in an unconditional environment. In a GAN framework, the goal of the generative model is to fool a discriminator by generating fake images, whereas that of the discriminative model is to differentiate between the generated samples and actual samples. Furthermore, for generating meaningful images that satisfy user needs, conditional GANs (CGANs) [35][36] add additional information, such as discrete labels [19][37], object key points [38], human skeletons [39][40], and semantic maps [36][41][42], to assist in the image generation process.
3. Image-to-Image Translation
Convolutional neural networks (CNNs) have been used to learn a translation function for image-to-image translation. The task is to find a mapping between a source and a target domain. Early methods utilize a supervised framework, where the model learns from pairs of examples, for instance, by employing a conditional GAN to determine the mapping functions [13][15][20]. Philip Isola et al. proposed Pix2pix [13], a conditional framework that uses a CGAN to determine a mapping function from input to output images. Wang et al. proposed Pix2pixHD [15], a high-resolution photo-realistic image-to-image translation method that can be applied to produce photo-realistic interpretations of semantic label maps. In addition, a similar approach has been implemented for several other tasks, including the generation of hand gestures [39]. However, many real-world tasks encounter the issue of having few or no paired input-output samples available. The problem of image-to-image translation becomes ill-posed in the absence of paired training data.
Several methods that perform unpaired image-to-image translation have recently been proposed to address this limitation, producing remarkable results. These methods are essential for applications that lack or cannot obtain paired data, as they determine the mapping function without requiring paired training data. In particular, CycleGAN [11] learns to map between two domains of images rather than between pairs of images. Besides CycleGAN, many other GAN variants have been proposed [18][21][25][43][44][45] to deal with the cross-domain problem. However, the drawback of these models is that they can be easily affected by undesired content and cannot identify the most discriminative semantic information in images during the translation phase.
Several works have employed an attention mechanism to alleviate these shortcomings. Many applications in computer vision, including depth estimation [46], have successfully implemented attention mechanisms, which allow the models to concentrate on a significant part of the input. In some recent studies, attention modules have been used in an unsupervised fashion to attend to the region of interest (ROI) in the image translation task; these studies can be divided into two categories. The first category involves providing attention using additional data. For example, Liang et al. introduced ContrastGAN [47], which utilizes the object mask annotations from every dataset as additional input data. Furthermore, Mo et al. proposed InstaGAN [2], which combines instance information (such as object segmentation masks) to enhance multi-instance transfiguration. The second category involves training a segmentation or attention model to produce attention maps and integrating them into the system. For example, Chen et al. [8] generated attention maps using an additional attention network to highlight objects of interest. Kazaniotis et al. presented ATAGAN [48], which generates attention maps using a teacher network. A new module was proposed by Yang et al. [49] that predicts an attention map to guide the translation method. Kim et al. [1] introduced the U-GAT-IT model to circumvent the challenge of geometry transfer. The key objective of the model is to pay more attention to the regions that contain distinctive anime-style representations; for this purpose, an auxiliary classifier is used to generate attention masks. In a study by Mejjati et al. [50], attention mechanisms were implemented with generators, discriminators, and two additional attention networks.
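A common way such attention maps steer translation, as in Mejjati et al.'s attention-guided formulation, is to blend the generator output and the input image through the map so that only attended regions are translated. A minimal sketch (array shapes and the hard 0/1 map below are illustrative assumptions):

```python
import numpy as np

def attention_blend(x, generated, attn):
    """Translate attended foreground, keep unattended background.
    x, generated: (C, H, W) images; attn: (1, H, W) map with values in [0, 1]."""
    return attn * generated + (1.0 - attn) * x

# Toy example: the "object" occupies the top half of the image.
x = np.zeros((3, 4, 4))          # source image (all zeros)
gen = np.ones((3, 4, 4))         # raw generator output (all ones)
attn = np.zeros((1, 4, 4))
attn[:, :2, :] = 1.0             # attention fires only on the top half

out = attention_blend(x, gen, attn)
assert out[:, :2, :].mean() == 1.0   # top half was translated
assert out[:, 2:, :].mean() == 0.0   # bottom half passes through unchanged
```

This composition is what lets attention-based translators leave background regions untouched while concentrating capacity on the discriminative foreground.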