Computer-aided diagnosis of skin diseases has become more popular since the introduction of Inception v3 [1], which achieved a classification accuracy of 93.3% [2] across various cancerous skin conditions. A large dataset of approximately 129,450 images was used to develop a skin cancer classification model with Inception v3 [1]. However, gathering such a large amount of data is not feasible for some skin conditions, such as rosacea. Although many skin conditions can lead to fatal consequences, cancer has been considered the most serious of all and has therefore motivated the collection of the most data over time. As a result, many teledermatology [3] websites host a substantial number of skin cancer images. In contrast, very limited data exist for non-fatal chronic skin conditions such as rosacea. Deep convolutional neural networks (DCNNs), e.g., Inception v3, perform relatively well when provided with a large training dataset [4]; however, their performance degrades significantly in the absence of large amounts of data. A possible solution is to exploit the small amount of available data by leveraging generative adversarial networks (GANs) [5] to generate synthetic images. Synthetic images can expand a small dataset considerably, potentially enabling more effective training of DCNNs. Synthetic disease datasets may also help educate non-specialist populations and raise public awareness. The generation of synthetic data that mirror the characteristics of authentic data using deep generative algorithms is an innovative approach to circumventing data scarcity [6].
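For context, the adversarial objective introduced by Goodfellow et al. [5] that underlies this approach: a generator G and a discriminator D play a two-player minimax game over the real data distribution p_data and a latent prior p_z:

```latex
\min_G \max_D V(D, G) =
\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] +
\mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The generator improves by making its samples G(z) indistinguishable from real data, which is what makes GAN-generated images plausible substitutes for scarce real examples.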
The observational and analytical complexity of skin diseases makes diagnosis and treatment challenging. In most cases, skin diseases are examined visually at an early stage. Depending on the complexity of this early examination and the severity of the disease, several different clinical or pathological measures using images of the affected region may follow, including dermoscopic analysis, biopsy, and histopathological examination. Depending on whether the skin disease is acute or chronic, diagnosis and treatment may be time-consuming.
2. Rosacea Diagnosis and StyleGAN2-ADA
There have been a few noteworthy works on rosacea by Thomsen et al. [21], Zhao et al. [22], Zhu et al. [23], Binol et al. [24], and Xie et al. [25], with significant quantities of data collected from hospital dermatology departments. However, the datasets used in these studies were entirely confidential. These studies addressed the early detection of rosacea by performing image classification among different subtypes of rosacea and other common skin conditions. The classifiers were trained using data augmentation and transfer learning from weights pretrained on ImageNet. In total, over 10,000 images were used across these studies. Transfer learning works well when a significant number of images, typically over 1000, is available. Following the studies mentioned above, Mohanty et al. [13] conducted several experiments on full-face rosacea image classification using Inception v3 [1] and VGG16 [26]. In their experiments, these deep learning models tended to overfit during training and validation due to insufficient data.
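As an illustration of the transfer-learning setup described above, a minimal PyTorch/torchvision sketch; the backbone choice and num_classes are placeholders, not the exact configuration reported in these studies:

```python
import torch.nn as nn
from torchvision import models

# Load VGG16 with ImageNet-pretrained weights and freeze the
# convolutional backbone so that only the new head is trained.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in model.features.parameters():
    param.requires_grad = False

# Replace the final classifier layer with a task-specific head.
num_classes = 4  # assumption: set to the number of skin-condition classes
model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_classes)
```

With very few training images, even such a frozen-backbone setup can overfit, which is consistent with the behaviour reported by Mohanty et al. [13].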
Although there have been a few studies [27,28,29,30,31,32] on generating synthetic images of skin cancer lesions using various types of GAN architectures, the images were captured through a dermatoscope and other imaging devices that focus only on a specific locality, i.e., cancerous regions of the skin. Carrasco et al. [33] and Cho et al. [34] explored the generation of cancerous skin lesion images using the StyleGAN2-ADA architecture. Carrasco et al. [33] employed a substantial dataset comprising 37,648 images in both conditional and unconditional settings. Cho et al. [34], on the other hand, focused on creating a melanocytic lesion dataset from non-standardized Internet images, annotating approximately 500,000 photographs to develop a diverse and extensive dataset.
In the study by Carrasco et al. [33], to address scenarios where hospitals lack large datasets, a simulation involving three hospitals with varying amounts of data was proposed, using federated learning to collaboratively synthesize a complex, fair, and diverse dataset. They utilized the EfficientNetB2 model for classification tasks and conducted expert assessments on 200 images to determine whether they were real or synthetically generated by the conditionally trained StyleGAN2-ADA. The main insights of the study included recognizing the dependency of the chosen architectures on computational resources and time constraints. Unconditional GANs were noted as beneficial when there are few classes, due to the lengthy training required for each single GAN. When a large annotated dataset is available, central training of a GAN is preferable; however, for institutions with data silos, and especially for smaller institutions, the benefits of federated learning are particularly notable. The study also underscored the importance of a multifaceted inspection of the created synthetic data.
The main objective of Cho et al.'s [34] study was to explore the possibility of image generation using images scraped from various online sources where data are unstructured. They created the diverse LESION130k dataset of potential lesions and generated 5000 synthetic images with StyleGAN2-ADA, illustrating the potential of AI in diversifying medical image datasets from varied sources. They then evaluated the model's performance using an EfficientNet Lite0 classifier and a test set of 2312 images from seven well-known public datasets to identify malignant neoplasms.
3. Synthetic Facial Image Generation
The first facial image generator using generative adversarial networks (GANs) was designed by Goodfellow et al. [5] in 2014. The generated synthetic faces were very noisy and required more work to make them convincing. Later, in 2015, deep convolutional GANs (DCGANs) [35] were introduced and trained on 350,000 face images without any augmentation. DCGANs came with some notable features that resulted in better synthetic faces, such as replacing pooling layers with strided convolutions, using batch normalization in both the generator and the discriminator, and removing fully connected hidden layers. However, the DCGAN model had some limitations, noticeable in unstable training and the occasional collapse of a subset of filters to a single oscillating mode when models were trained for longer. These limitations strongly influenced the topics of future work on GANs.
The progressive growing of GANs (ProGAN), introduced by Karras et al. [36], improved the resolution of generated images with a stable and faster training process. The main idea of ProGAN is to start from a low resolution, e.g., 4 × 4, and then progressively increase the resolution, e.g., up to 1024 × 1024, by adding layers to the networks. Training is 2–6 times faster, depending on the desired output resolution. ProGAN could generate 1024 × 1024 facial images using the CelebA-HQ [36] dataset of 30,000 selected real images. The idea of ProGAN emerged from one of the GAN architectures introduced by Wang et al. [37]. Although ProGAN successfully generated facial images at high resolution, it did not perform adequately in generating realistic fine features and microstructures.
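A minimal sketch of the resolution schedule implied above, assuming the 4 × 4 starting point and 1024 × 1024 target:

```python
# Progressive growing: training starts at 4x4, and the resolution doubles
# each time new layers are faded into the generator and discriminator.
start, target = 4, 1024
resolutions = []
res = start
while res <= target:
    resolutions.append(res)
    res *= 2
print(resolutions)  # [4, 8, 16, 32, 64, 128, 256, 512, 1024]
```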
Although the generation of high-resolution images was achieved using GANs, there were still significant research gaps that needed to be addressed. The introduction of StyleGAN [7] brought further improvements that helped in understanding various characteristics and phases of synthetic image generation/image synthesis. Important improvements in the StyleGAN architecture included
- Increasing the number of trainable parameters in the style-based generator to 26.2 million, compared to 23.1 million in the ProGAN [36] architecture;
- Upgrading the baseline using upsampling and downsampling operations, longer training, and hyperparameter tuning;
- Adding a mapping network and adaptive instance normalization (AdaIN) operations (see the sketch after this list);
- Removing the traditional input layer and starting from a learned constant 4 × 4 × 512 tensor;
- Adding explicit uncorrelated Gaussian noise inputs, which improve the generator by producing stochastic details;
- Mixing regularization, which helps decorrelate neighbouring styles and gives control over fine-grained details in the synthetic images.
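A minimal PyTorch-style sketch of the AdaIN operation referenced above; the tensor shapes are assumptions, and in StyleGAN the scale/bias coefficients come from the mapping network through learned affine transforms:

```python
import torch

def adain(x: torch.Tensor, y_scale: torch.Tensor, y_bias: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization: normalize each feature map of x
    per sample, then scale and shift it with style-derived coefficients.
    Assumed shapes: x is (N, C, H, W); y_scale and y_bias are (N, C, 1, 1)."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True)
    return y_scale * (x - mu) / (sigma + eps) + y_bias
```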
In addition to these improvements in generating high-fidelity images, StyleGAN introduced a new dataset of human faces called Flickr-Faces-HQ (FFHQ). FFHQ contains 70,000 images at 1024 × 1024 resolution and covers a diverse range of ethnicities, ages, backgrounds, artifacts, make-up, lighting, viewpoints, and accessories such as eyeglasses, hats, and sunglasses. Based on these improvements, comparative outcomes were evaluated using the Fréchet inception distance (FID) metric [38] on two datasets, CelebA-HQ [36] and FFHQ. Recommended future investigations included separating high-level attributes from stochastic effects while achieving linearity of the intermediate latent space.
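For reference, FID [38] measures the distance between Gaussian fits to Inception features of real and generated images, with means μ_r, μ_g and covariances Σ_r, Σ_g; lower values indicate that the synthetic distribution is closer to the real one:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
+ \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```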
Subsequently, another variant of StyleGAN, called StyleGAN2 [39], was introduced by Karras et al., with a key focus on the analysis of the latent space W. As the output images generated by StyleGAN contained common, unnecessary blob-like artifacts, StyleGAN2 addressed the causes of these artifacts and eliminated them through changes to the generator network architecture and the training methods. The generator normalization was redesigned, and the generator regularization was redefined to improve conditioning and output image quality. The notable improvements in the StyleGAN2 architecture include
- The blob-like artifacts, such as those in Figure 1, are eliminated by removing the normalization step from the generator (generator redesign);
- Grouped convolutions are employed as part of weight demodulation, in which weights and activations are temporarily reshaped so that one convolution sees one sample with N groups instead of N samples with one group (see the sketch after this list);
- Adoption of lazy regularization, in which R1 regularization is performed only once every 16 mini-batches, reducing the total computational cost and memory usage;
- Addition of path length regularization, which improves model reliability and performance and offers scope for exploring the architecture at later stages; path length regularization helps in creating denser distributions without mode collapse;
- Revisiting the ProGAN architecture to retain its benefits while removing its drawbacks, e.g., replacing progressive growing with residual blocks in the discriminator network.
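A minimal PyTorch-style sketch of the weight (de)modulation step referenced above; shapes are assumptions, and in practice the result is applied through a grouped convolution as described in the list:

```python
import torch

def modulate_demodulate(weight: torch.Tensor, style: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """Scale the convolution weights by the per-sample style (modulation),
    then rescale so each output feature map has unit expected standard
    deviation (demodulation), replacing explicit feature-map normalization.
    Assumed shapes: weight is (out_ch, in_ch, kh, kw); style is (N, in_ch)."""
    w = weight.unsqueeze(0) * style.view(style.size(0), 1, -1, 1, 1)  # modulate
    d = torch.rsqrt(w.pow(2).sum(dim=(2, 3, 4), keepdim=True) + eps)  # demodulate
    return w * d  # (N, out_ch, in_ch, kh, kw)
```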
Figure 1. An example of blob-like artifacts in generated images, reproduced from Karras et al. [39]. The regions highlighted in red show where unintended, irregularly shaped distortions appear across different generated images; such artifacts typically stem from imperfections in the generation process, such as training deficiencies, data quality issues, or algorithmic limitations.
The LSUN [40] and FFHQ datasets were used with StyleGAN2 to obtain quantitative results through metrics such as FID [38], perceptual path length (PPL) [7], and precision and recall [41].
Another set of GAN architectures, called BigGAN and BigGAN-deep [42], expanded the variety and fidelity of the generated images. The improvements included architectural changes that increased scalability and a regularization scheme that improved conditioning and boosted performance. These modifications gave a lot of freedom to apply the “truncation trick”, a sampling method that helps control sample variety and fidelity at the image generation stage. Even though different GAN architectures produced improved results over time, model instability during training remained a common problem in large-scale GAN architectures [43]. This problem was investigated and analyzed in the BigGAN work by leveraging existing techniques and presenting novel ones. The ImageNet ILSVRC 2012 dataset [44] at resolutions of 128 × 128, 256 × 256, and 512 × 512 was used with the BigGAN and BigGAN-deep architectures to demonstrate quantitative results through metrics such as FID and inception score (IS) [45].
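A minimal sketch of the truncation trick described above, in which out-of-range latent entries are resampled; the threshold value is an assumption:

```python
import torch

def truncated_z(batch_size: int, z_dim: int, threshold: float = 1.0) -> torch.Tensor:
    """BigGAN-style truncation trick: resample any latent entry whose
    magnitude exceeds `threshold`. Lower thresholds trade sample variety
    for higher per-sample fidelity."""
    z = torch.randn(batch_size, z_dim)
    mask = z.abs() > threshold
    while mask.any():
        z[mask] = torch.randn(int(mask.sum().item()))
        mask = z.abs() > threshold
    return z
```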
The aforementioned GAN architectures were trained on large amounts of data and can generate high-resolution outputs with variety and fine-grained texture. Although a large amount of data helps GAN models learn and generate more realistic-looking synthetic images, acquiring large amounts of data is not possible in certain fields/domains. For example, in the medical/clinical imaging domain, it is hard to acquire a large number of images for each disease case. Therefore, it is important to extend the potential of GAN architectures to perform well and produce high-fidelity synthetic images even when only limited images are available.
However, the key problem with having a small number of images is that the discriminator network overfits the training examples: the training process starts to diverge, and the generator no longer produces anything meaningful. The most common strategy for tackling overfitting in deep learning models is “data augmentation”. There are instances, however, in which the generator learns to reproduce the augmented distribution, resulting in “leaking augmentations” in the generated samples: features learned from the augmentation pipeline rather than features originally present in the real dataset.
Hence, to prevent the discriminator from overfitting when only limited data are available, a variant of StyleGAN2 called StyleGAN2-ADA [20] was introduced, with a wide range of augmentations and an adaptive control scheme that prevents these augmentations from leaking into the generated images. This work produced promising results in generating high-resolution synthetic images from a few thousand training images. The significant improvements of StyleGAN2-ADA include
- Stochastic discriminator augmentation, a flexible augmentation method that prevents the discriminator from becoming overly confident by evaluating it only on augmented images; this assists in generating the desired outcomes;
- The addition of adaptive discriminator augmentation (ADA), through which the augmentation strength p is adjusted every four mini-batches (see the sketch following this list). This technique helps achieve convergence during training without overfitting, irrespective of the size of the input dataset;
- Invertible transformations, applied to leverage the full benefit of augmentation. The proposed augmentation pipeline contains 18 transformations grouped into 6 categories, namely pixel blitting, more general geometric transformations, colour transforms, image-space filtering, additive noise, and cutout;
- The capability to handle small datasets, such as subsets of 1000 and 2000 images from the FFHQ dataset, the 1336 images of METFACES [46], the 1994 overlapping crops extracted from 162 breast cancer histopathology images in BRECAHAD [47], nearly 5000 images of AFHQ, and the 50,000 images of CIFAR-10 [48].
Although small datasets are the main focus of StyleGAN2-ADA, some large datasets were divided into subsets of different sizes to monitor model performance. The FFHQ dataset was used for training, with subsets of 140,000, 70,000, 30,000, 10,000, 5000, 2000, and 1000 images used to test performance. Similarly, the LSUN CAT dataset was used with volumes ranging from 200 k down to 1 k for model evaluation. FID was used as the evaluation metric for comparative analysis and for demonstrating the performance of the StyleGAN2-ADA model.
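A minimal sketch of the ADA controller behaviour described in the list above; the overfitting heuristic and target follow the paper's description, while the batch size is an assumption:

```python
def update_ada_p(p: float, r_t: float, target: float = 0.6,
                 batch_size: int = 64, interval: int = 4,
                 ramp_imgs: int = 500_000) -> float:
    """r_t = E[sign(D(x_train))] estimates discriminator overfitting
    (it rises toward 1 as the discriminator memorizes the training set).
    Every `interval` mini-batches, the augmentation probability p is
    nudged up when r_t exceeds the target and down otherwise, with a
    step size that lets p traverse [0, 1] over `ramp_imgs` images."""
    step = (batch_size * interval) / ramp_imgs
    p += step if r_t > target else -step
    return min(max(p, 0.0), 1.0)
```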
Amongst the studies and related work on face generation using GANs discussed above and represented in Figure 2, StyleGAN2-ADA appeared to work adequately with a small volume of data. Especially in the case of small volumes of medical/clinical images, StyleGAN2-ADA is therefore a useful method to investigate.
Figure 2. Progress in synthetic face generation using various GAN models, with the maximum dataset volume available to each.