Generative Attentional Networks for Image-to-Image Translation: Progressive U-GAT-IT

Unsupervised image-to-image translation has received considerable attention due to the recent remarkable advancements in generative adversarial networks (GANs). In image-to-image translation, state-of-the-art methods use unpaired image data to learn mappings between the source and target domains. However, despite their promising results, existing approaches often fail under challenging conditions, particularly when images contain multiple target instances, when the translation requires significant changes in shape, or when visual artifacts arise because low-level information is translated rather than high-level semantics. To tackle these problems, we propose a novel framework called Progressive Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization (PRO-U-GAT-IT) for the unsupervised image-to-image translation task. In contrast to existing attention-based models that fail to handle geometric transitions between the source and target domains, our model can translate images requiring extensive and holistic changes in shape. Experimental results show the superiority of the proposed approach compared to the existing state-of-the-art models on different datasets.

  • generative adversarial networks
  • image-to-image translation
  • style transfer
  • cartoon styles
  • anime
In recent years, generative adversarial networks (GANs) have made significant progress in image-to-image translation. Researchers in machine learning and computer vision have given this topic considerable attention because of its wide range of practical applications [1][2], including image inpainting [3][4], colorization [5][6], super-resolution [7][8], and style transfer [9][10]. Image-to-image translation refers to a category of vision and graphics problems in which the goal is to learn the mapping between an input image (source domain) and an output image (target domain) from a set of aligned image pairs [11]. In the case of portrait stylization, various methods have been explored, such as selfie-to-anime [1] and cartoon [12]. When paired data are provided, the mapping model can be trained in a supervised manner using a conditional generative model [13][14][15] or a simple regression model [5][16][17]. For many tasks, however, paired training data are not available.
Various works [18][19][20][21][22][23][24][25] have successfully translated images in unsupervised settings without paired data by assuming a shared latent space [22] or cycle consistency [11][21]. Supervised approaches, by contrast, require paired datasets for training, which can be laborious and expensive, if not impossible, to prepare manually. Unsupervised methods, on the other hand, need a large volume of unpaired data and frequently struggle to reach stable training convergence and to generate high-resolution results [26].
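For concreteness, the cycle-consistency constraint assumed by [11][21] penalizes the reconstruction error after translating an image to the other domain and back. With generators G: X → Y and F: Y → X, it is typically written as

\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\!\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\!\left[\lVert G(F(y)) - y \rVert_1\right].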
Despite their progress and benefits, previous techniques have shortcomings and often fail on challenging tasks, especially when the target images contain multiple instances to be translated [27] or the shape of the target instances changes drastically [11]. For example, they are effective for style-transfer tasks that map local textures, such as photo2vangogh and photo2portrait, but they are ineffective for image-translation tasks with extensive shape transformations, such as selfie2anime and cat2dog, on wild images. Pre-processing steps such as image cropping and alignment can mitigate these difficulties by limiting the complexity of the data distributions [1][2]. Furthermore, current methods such as DRIT [28] cannot produce the desired results for both appearance-preserving translation (such as horse2zebra) and shape-transforming translation (such as cat2dog) because of their fixed network structure and hyperparameters; the network architecture or hyperparameters must be adjusted for each dataset.
In 2014, Ian Goodfellow et al. [29] introduced generative adversarial networks (GANs), which can solve image-to-image problems, including anime-face style transfer. Pix2Pix [13] and CycleGAN [11], both published in 2017, are the two primary GAN-based approaches that successfully address image-to-image problems. CartoonGAN [30] was introduced in 2018 as an upgrade of Pix2Pix specializing in photo cartoonization. Nevertheless, all of these earlier methods merely transfer textures. Junho Kim et al. therefore presented U-GAT-IT [1], a technique based on CycleGAN that can handle both texture and geometry transfer. However, the geometry of the generated image can differ dramatically from that of the input human face, so the output does not preserve the distinctive features of the input.
This paper proposes Progressive U-GAT-IT (PRO-U-GAT-IT), a novel framework for unsupervised image-to-image translation that incorporates an attention module and a learnable normalization function in an end-to-end manner. Guided by attention maps obtained from an auxiliary classifier that distinguishes between the source and target domains, our model focuses the translation on the most essential regions and disregards minor areas. These attention maps are embedded in both the generator and the discriminator to emphasize relevant critical areas, thereby enabling shape transformation: the generator's attention map focuses on the regions that differentiate the two domains, whereas the discriminator's attention map assists fine-tuning by concentrating on the differences between real and fake images. Additionally, we found that the choice of normalization function substantially influences the quality of the transformed results across datasets with varying degrees of shape and texture change. Earlier approaches also have limitations, including blurry results, unstable training, low resolutions, and limited variation. High-resolution images are particularly difficult to generate because, at higher resolutions, generated images are easier to distinguish from training images; moreover, due to memory constraints, large resolutions require smaller mini-batches, which compromises training stability. Nevertheless, recent generative methods, particularly GANs, have shown notable improvements in the resolution and quality of the images they produce.
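To make the learnable normalization concrete, the following is a minimal PyTorch sketch of the adaptive layer-instance normalization (AdaLIN) formulation introduced in U-GAT-IT [1], on which PRO-U-GAT-IT builds. The initialization of rho, the clipping strategy, and the assumption that gamma and beta arrive as (N, C) tensors predicted by fully connected layers are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AdaLIN(nn.Module):
    """Adaptive Layer-Instance Normalization (sketch following U-GAT-IT [1]).

    A learnable ratio `rho` interpolates between instance-normalized and
    layer-normalized features; `gamma` and `beta` are supplied externally,
    e.g. predicted from attention features by fully connected layers.
    """

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        # rho is kept in [0, 1]; starting near 1 favors instance statistics.
        self.rho = nn.Parameter(torch.full((1, num_features, 1, 1), 0.9))

    def forward(self, x, gamma, beta):
        # Instance-norm statistics: per sample, per channel, over H and W.
        in_mean = x.mean(dim=(2, 3), keepdim=True)
        in_var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)

        # Layer-norm statistics: per sample, over C, H and W.
        ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)
        ln_var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)

        rho = self.rho.clamp(0.0, 1.0)
        out = rho * x_in + (1.0 - rho) * x_ln
        # gamma/beta assumed to have shape (N, C); broadcast over H and W.
        return out * gamma.unsqueeze(2).unsqueeze(3) + beta.unsqueeze(2).unsqueeze(3)
```

Intuitively, instance statistics tend to preserve local content while layer statistics favor global style, so learning rho lets each layer pick the mixture that suits the dataset's degree of shape and texture change.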
The contributions of our work are summarized as follows:
  • We propose a framework that improves the image-to-image translation model through a progressive block-training approach (a sketch of the fade-in mechanism underlying such block-wise growth is given after this list). This technique allows distinct features to be acquired during the various training phases, leading to several notable advantages: reduced VRAM usage, training speed on par with or surpassing other methods on the same device, and successful image translation at higher resolutions.
  • Furthermore, we propose a new research direction that emphasizes the exploration and refinement of progressive image-to-image translation techniques, with the aim of enhancing both the quality of the results and the overall efficiency of image-to-image translation models.
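The fade-in blending sketched below is the mechanism commonly used by progressively grown GANs to introduce a newly added high-resolution block without destabilizing training. It is shown only to illustrate the idea of block-wise progressive training; the specific layers (nearest-neighbor upsampling, a 3x3 convolution, 1x1 to-RGB projections) are assumptions rather than the exact blocks of PRO-U-GAT-IT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FadeInBlock(nn.Module):
    """Blends the output of a newly added high-resolution block with an
    upsampled copy of the previous stage's output, so the new block can be
    introduced gradually (alpha ramps from 0 to 1 during the new stage).
    Layer choices are illustrative, not the paper's exact architecture.
    """

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.new_block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_rgb_new = nn.Conv2d(out_ch, 3, kernel_size=1)
        self.to_rgb_old = nn.Conv2d(in_ch, 3, kernel_size=1)

    def forward(self, x, alpha):
        # Path through the freshly added block at the new resolution.
        new_rgb = self.to_rgb_new(self.new_block(x))
        # Skip path: previous-stage features, simply upsampled to match.
        old_rgb = F.interpolate(self.to_rgb_old(x), scale_factor=2, mode="nearest")
        # alpha = 0 reproduces the old model; alpha = 1 uses only the new block.
        return alpha * new_rgb + (1.0 - alpha) * old_rgb
```

During a new stage, alpha is typically ramped from 0 to 1 so that the network initially behaves like the smaller, already-trained model and only gradually hands responsibility to the newly added block.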

This entry is adapted from the peer-reviewed paper 10.3390/s23156858

References

  1. Kim, J.; Kim, M.; Kang, H.; Lee, K. U-gat-it: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv 2019, arXiv:1907.10830.
  2. Mo, S.; Cho, M.; Shin, J. Instagan: Instance-aware image-to-image translation. arXiv 2018, arXiv:1812.10889.
  3. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544.
  4. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. 2017, 36, 1–14.
  5. Zhang, R.; Isola, P.; Efros, A.A. Colorful image colorization. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 649–666.
  6. Zhang, R.; Zhu, J.Y.; Isola, P.; Geng, X.; Lin, A.S.; Yu, T.; Efros, A.A. Real-time user-guided image colorization with learned deep priors. arXiv 2017, arXiv:1705.02999.
  7. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
  8. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
  9. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423.
  10. Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510.
  11. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
  12. Li, J. Twin-GAN–unpaired cross-domain image translation with weight-sharing GANs. arXiv 2018, arXiv:1809.00946.
  13. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
  14. Li, C.; Liu, H.; Chen, C.; Pu, Y.; Chen, L.; Henao, R.; Carin, L. Alice: Towards understanding adversarial learning for joint distribution matching. Adv. Neural Inf. Process. Syst. 2017, 30, 5501–5509.
  15. Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8798–8807.
  16. Larsson, G.; Maire, M.; Shakhnarovich, G. Learning representations for automatic colorization. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 577–593.
  17. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
  18. Anoosheh, A.; Agustsson, E.; Timofte, R.; Van Gool, L. Combogan: Unrestrained scalability for image domain translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 783–790.
  19. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797.
  20. Huang, X.; Liu, M.Y.; Belongie, S.; Kautz, J. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 172–189.
  21. Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 1857–1865.
  22. Liu, M.Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. Adv. Neural Inf. Process. Syst. 2017, 30, 700–708.
  23. Royer, A.; Bousmalis, K.; Gouws, S.; Bertsch, F.; Mosseri, I.; Cole, F.; Murphy, K. Xgan: Unsupervised image-to-image translation for many-to-many mappings. In Domain Adaptation for Visual Understanding; Springer: Berlin/Heidelberg, Germany, 2020; pp. 33–49.
  24. Taigman, Y.; Polyak, A.; Wolf, L. Unsupervised cross-domain image generation. arXiv 2016, arXiv:1611.02200.
  25. Yi, Z.; Zhang, H.; Tan, P.; Gong, M. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2849–2857.
  26. Song, G.; Luo, L.; Liu, J.; Ma, W.C.; Lai, C.; Zheng, C.; Cham, T.J. AgileGAN: Stylizing portraits by inversion-consistent transfer learning. ACM Trans. Graph. 2021, 40, 1–13.
  27. Gokaslan, A.; Ramanujan, V.; Ritchie, D.; Kim, K.I.; Tompkin, J. Improving shape deformation in unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 649–665.
  28. Lee, H.Y.; Tseng, H.Y.; Huang, J.B.; Singh, M.; Yang, M.H. Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 35–51.
  29. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
  30. Chen, Y.; Lai, Y.K.; Liu, Y.J. Cartoongan: Generative adversarial networks for photo cartoonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9465–9474.