2. Major Characteristics and Essential Workabilities of SSL
As illustrated above, supervised learning (SL) needs annotated data to train ML models and enable an efficient classification process in various conventional settings. In contrast, unsupervised learning (USL) classification procedures do not require labeled data to accomplish a similar classification task. Rather, USL algorithms rely solely on identifying meaningful patterns in existing unlabeled data without the usual training, testing, or preparation steps [7].
For the previously illustrated industrial and medical practical applications, SSL is often referred to as predictive learning (or pretext learning) (PxL). Labels can be generated automatically, transforming the unsupervised problem into a flexible, supervised one that can be solved viably.
Another favorable aspect of SSL algorithms is their efficient categorization of data correlated with natural language processing (NLP). SSL allows researchers to fill in blanks in databases that are not fully complete or lack a high-quality definition. As an illustration, with the application of ML and DL models, existing video data can be utilized to reconstruct previous and future frames. Without relying on an annotation procedure, SSL takes advantage of patterns in the current video data to efficiently complete the categorization of a massive video database [8,9]. Correspondingly, the critical working principles of the SSL approach can be illustrated in the workflow shown in Figure 1.
Figure 1. The major workflow related to SSL [10].
From Figure 1, during the pre-training stage (pretext task solving), feature extraction is carried out using pseudo-labels to enable an efficient prediction process. After that, transfer learning is implemented to initiate the SSL phase, in which a small annotated dataset (with ground-truth labels) is considered. Then, fine-tuning is performed to achieve the necessary prediction task.
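To make this two-stage workflow concrete, the following is a minimal sketch, assuming PyTorch and a toy rotation-prediction pretext task; the module names (encoder, pretext_head, downstream_head), sizes, and data are illustrative assumptions, not taken from the cited works.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())

# Stage 1: pretext task solving with automatically generated pseudo-labels
# (here: predict which of 4 rotations was applied to the image).
pretext_head = nn.Linear(256, 4)
opt = torch.optim.Adam(list(encoder.parameters()) + list(pretext_head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(10):                                # pre-training steps on unlabeled data
    x = torch.rand(32, 1, 28, 28)                  # stand-in for an unlabeled batch
    k = torch.randint(0, 4, (32,))                 # pseudo-label: rotation index
    x_rot = torch.stack([torch.rot90(img, int(r), dims=(1, 2)) for img, r in zip(x, k)])
    loss = loss_fn(pretext_head(encoder(x_rot)), k)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: transfer + fine-tuning on a small annotated dataset (ground-truth labels).
downstream_head = nn.Linear(256, 2)
ft_opt = torch.optim.Adam(list(encoder.parameters()) + list(downstream_head.parameters()), lr=1e-4)
x_small, y_small = torch.rand(16, 1, 28, 28), torch.randint(0, 2, (16,))
loss = loss_fn(downstream_head(encoder(x_small)), y_small)
ft_opt.zero_grad(); loss.backward(); ft_opt.step()
```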
3. Main SSL Categories
Because it can be laborious to compile an extensively annotated dataset for a given prediction task, USL strategies have been proposed as a means of learning appropriate image identification without human guidance [11,12]. Simultaneously, SSL is an efficient approach through which a training objective can be produced from the data itself. Theoretically, a deep neural network (DNN) is trained on pretext tasks whose labels are produced automatically without human annotation, and the representations learned while solving these pretext tasks can then be reused for downstream prediction. Common SSL categories include: (A) generative, (B) predictive, (C) contrastive, and (D) non-contrastive models. The contrastive and non-contrastive tactics illustrated here can be recognized as joint-embedding strategies.
However, more types of SSL are considered in some contexts. For example, a graphical illustration in [13] was created, explaining the performance rates that can be achieved when SSL is applied, focusing mainly on further SSL categories, as shown in Figure 2.
Figure 2. Two profiles of (a) some types of SSL classification with their performance level and (b) end-to-end performance of extracted features and the corresponding number of each type [13].
It can be realized from the graphical data expressed in Figure 2a that the variation in performance between the self-prediction, combined, generative, innate, and contrastive SSL types mostly remains within about 10%. In Figure 2b, it can be noticed that the end-to-end performance corresponding to the contrastive, generative, and combined SSL algorithms varies between nearly 0.7 and 1.0, relating to an extracted-feature performance that also ranges approximately between 0.7 and 1.0.
In the following sections, more explanation is provided for some of these SSL categories.
3.1. Generative SSL Models
Using an autoencoder to recreate an input image after compression is a common pretext operation. Relying on the first component of the network, called the encoder, the model learns to compress all pertinent information from the image into a latent space of reduced dimension so as to minimize the reconstruction loss. The image is then reconstructed from the latent space by a second network component called the decoder.
Researchers in [11,12,14,15,16,17,18] reported that denoising autoencoders can also provide reliable and stable identification of images by learning to filter out noise; the added noise prevents the network from learning the identity function. By encoding the parameters of a latent distribution, variational autoencoders (VAEs) extend the autoencoder model [19,20,21,22]. During training, both the reconstruction error and an extra term, the Kullback-Leibler divergence between the encoder output and an established latent distribution (often a unit-centered Gaussian), are minimized. This regularization of the latent space makes it possible to draw samples from the resulting distribution. To rebuild entire images from only about 25 percent of visible patches, scholars in [23,24] have recently adopted vision transformers to create large masked autoencoders that work at the patch level rather than pixel-wise. After this reconstruction challenge, reliable image representations are obtained by adding a class token to the sequence of patches or by performing global mean pooling over all the patch tokens.
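As a concrete illustration of the reconstruction-based pretext described above, the following is a minimal denoising-autoencoder sketch, assuming PyTorch; the layer sizes, latent dimension, and noise level are illustrative assumptions rather than the configurations of the cited studies.

```python
import torch
import torch.nn as nn

# Encoder compresses a noisy input into a low-dimensional latent space;
# decoder reconstructs the clean image, so the identity function cannot be learned.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.ReLU())   # latent dimension 64
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
x = torch.rand(32, 1, 28, 28)                       # stand-in batch of clean images
x_noisy = (x + 0.2 * torch.randn_like(x)).clamp(0, 1)

recon = decoder(encoder(x_noisy)).view_as(x)
loss = nn.functional.mse_loss(recon, x)             # reconstruction loss against the clean image
opt.zero_grad(); loss.backward(); opt.step()
```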
A generative adversarial network (GAN) is another fundamental generative USL paradigm that has been extensively studied [25,26,27]. This architecture and its variants aim to mimic the appearance and behavior of real data by generating new data from random noise. To train a GAN, two networks compete in an adversarial minimax game, with one learning to turn random noise, Ψ_RN ≈ RN(0, 1), into synthetic data, S̃_D, which attempts to mimic the distribution of the original data. These aspects are illustrated in Figure 3.
Figure 3. The architecture employed in a GAN. Adapted from Ref. [28], used under Creative Commons CC-BY license.
In the adversarial method, a second network, termed the discriminator D(·), is trained to distinguish between generated images and authentic images from the original dataset. When the discriminator is certain that the input image comes from the true data distribution, it reports a score of 1, whereas for images produced by the generator, the score is 0. One possible estimation of this adversarial objective function, F_AO, takes the standard minimax form:

F_AO = min_G max_D { E_{x ∼ p_data(x)}[log D(x)] + E_{Ψ_RN ∼ RN(0,1)}[log(1 − D(G(Ψ_RN)))] }

where:
- G(·)—the generator network that maps the noise Ψ_RN to the synthetic data S̃_D;
- D(·)—the discriminator network;
- E[·]—the expectation over the indicated distribution.
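The following is a hedged sketch of how such an adversarial objective can be optimized in practice, assuming PyTorch and a toy fully connected generator and discriminator on two-dimensional data; it illustrates the minimax game only and is not the architecture of the cited references.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))            # generator
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # discriminator
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(64, 2) + 3.0                     # stand-in for the original data distribution
noise = torch.randn(64, 8)                          # random noise input to the generator

# Discriminator step: score real data toward 1 and generated data toward 0.
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(G(noise).detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the discriminator into scoring synthetic data as real.
g_loss = bce(D(G(noise)), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```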
3.2. Predictive SSL Paradigms
The second type of SSL technique involves models trained to estimate an artificial transformation applied to the input image. This strategy is motivated by the observation that understanding the semantic objects and regions inside an image can be essential for accurately predicting the transformation. Scholars in [29] conducted analytical research to improve model performance against random initialization and to approach the effectiveness obtained by initialization with ImageNet pre-trained weights on benchmark computer vision datasets by pre-training a paradigm to predict the relative positions of two image patches.
Some researchers have confirmed the advantages of image colorization as a pretext task [30]. In this method, the input image is first converted to grayscale. Next, an autoencoder is trained to convert the grayscale image back to its original color form by minimizing the mean squared error between the reconstructed and original images. The encoder feature representations are then used in subsequent downstream processes. The RotNet approach [31] is another well-known predictive SSL approach, in which a model is trained to predict the rotation that is randomly applied to the input image, as shown in Figure 4.
Figure 4. Flowchart configuration of the operating principles related to the RotNet algorithm relying on the SSL approach for accurate prediction results. From Ref. [28], used under Creative Commons CC-BY license.
To perform well on the rotation prediction task, the model must first extract the relevant characteristics that capture the semantic content of the image. Researchers in [32] considered a jigsaw-puzzle pretext task, forecasting the relative positions of shuffled image partitions. The Exemplar CNN was also addressed and trained in [33] to predict the augmentations applied to images, considering a wide variety of augmentation types. Cropping, rotation, color jittering, and contrast adjustment are examples of the augmentation classes learned by the Exemplar CNN model.
An SSL model can learn rich representations of the visual content by completing one of these tasks. However, depending on the pretext task and dataset, the network may not perform effectively on all subsequent tasks. Because the orientation of objects is not as well defined in remote sensing datasets as in object-centric datasets, predicting random rotations of an image would not perform particularly well on such a dataset [34].
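As an illustration of a predictive pretext objective of this kind, the following is a minimal colorization-pretext sketch, assuming PyTorch; the small convolutional network and image size are illustrative assumptions, not the setup of Ref. [30].

```python
import torch
import torch.nn as nn

# The network receives a grayscale version of the image and is trained to recover
# the original RGB image by minimizing the mean squared error.
colorizer = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
)
opt = torch.optim.Adam(colorizer.parameters(), lr=1e-3)

rgb = torch.rand(8, 3, 32, 32)                      # stand-in batch of color images
gray = rgb.mean(dim=1, keepdim=True)                # automatically generated, label-free input

loss = nn.functional.mse_loss(colorizer(gray), rgb) # reconstruction target is the color image
opt.zero_grad(); loss.backward(); opt.step()
```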
3.3. Contrastive SSL Paradigms
Forcing the features of different views of an image to be similar is another strategy that can result in accurate representations. The resulting representations are invariant to the particular augmentations used to generate the various image views. However, the network can collapse to a constant representation that satisfies the invariance condition but is unrelated to the input image.
One typical approach to achieving this goal, acquiring varied representations while avoiding the collapse problem, is the contrastive loss. This type of loss function can be utilized to train the model to distinguish between views of the same image (positives) and views of distinct images (negatives). Correspondingly, it seeks to obtain similar feature representations for positive pairs while pushing apart the features of negative pairs. The triplet loss investigated by researchers in [35] is the simplest form of this family. It requires a model to be trained such that the distance between the representations of a given anchor and its positive sample is smaller than the distance between the representations of the anchor and a random negative, as illustrated in Figure 5.
Figure 5. The architecture of the triplet loss function. From Ref. [28], used under Creative Commons CC-BY license.
In Figure 5, the triplet loss function is helpful in learning discriminative representations by training an encoder that is able to detect the difference between negative and positive samples. Under this setting, the triplet loss function, 𝔉_Loss^Triplet, can be estimated using the following relationship:

𝔉_Loss^Triplet = max(0, ‖f(x) − f(x⁺)‖² − ‖f(x) − f(x⁻)‖² + m)

where:
- x⁺—the positive sample associated with the anchor x;
- x⁻—the negative sample associated with the anchor x;
- f(·)—the embedding function;
- m—the value of the margin parameter.
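A minimal code sketch of this loss, assuming PyTorch, is given below; the margin value and embedding size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_anchor, f_pos, f_neg, margin=0.2):
    """Pull the anchor toward the positive embedding and push it away from the
    negative embedding by at least the margin m."""
    d_pos = (f_anchor - f_pos).pow(2).sum(dim=1)    # squared distance to the positive
    d_neg = (f_anchor - f_neg).pow(2).sum(dim=1)    # squared distance to the negative
    return F.relu(d_pos - d_neg + margin).mean()

# Illustrative usage with random 128-dimensional embeddings f(x), f(x+), and f(x-).
z_a, z_p, z_n = torch.randn(16, 128), torch.randn(16, 128), torch.randn(16, 128)
print(triplet_loss(z_a, z_p, z_n).item())
```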
In [36], the researchers examined the SimCLR method, which is one of the most well-known SSL strategies and formulates a type of contrastive representation learning. Two versions of each training-batch image are generated using randomly sampled augmentations. After these augmented images are fed into the representation network, a projection network maps each representation onto a hypersphere of dimension D.
The overall algorithm is trained to increase the cosine similarity between a representation, z, and its corresponding positive counterpart, z⁺ (belonging to the same original visual data), and to minimize the similarity between z and all other representations in the batch, z⁻, contributing to the following expression:

ℓ(z, z⁺) = −log [ exp(sim(z, z⁺)/τ) / Σ_{z′≠z} exp(sim(z, z′)/τ) ]

where:
- τ—the temperature variable that scales the similarity levels and the sharpness of the distribution;
- sim(·,·)—the cosine similarity between two representations;
- f(·)—the embedding function.
At the same time, the complete loss function, which evaluates a temperature-normalized cross-entropy and is therefore denoted the normalized temperature-scaled cross-entropy (NT-Xent) loss, is given by the following relation:

L_NT-Xent = (1/2N) Σ_{k=1}^{N} [ ℓ(z_k, z_k⁺) + ℓ(z_k⁺, z_k) ]

where N indicates the number of items in the training batch, such as images or textual samples.
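The per-pair loss and its batch-level NT-Xent aggregation can be sketched as follows, assuming PyTorch; the implementation treats every other embedding in the 2N batch as a negative and is an illustrative reading of the formulation above, not the reference implementation of SimCLR.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """z1[i] and z2[i] are L2-normalized embeddings of two augmented views of image i;
    every other embedding in the 2N batch acts as a negative."""
    z = torch.cat([z1, z2], dim=0)                       # (2N, D), assumed already normalized
    sim = z @ z.t() / tau                                # cosine similarities scaled by temperature
    sim.fill_diagonal_(float("-inf"))                    # a view is never its own negative
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of each positive
    return F.cross_entropy(sim, targets)

# Illustrative usage with a batch of N = 8 images and 128-dimensional projections.
z1 = F.normalize(torch.randn(8, 128), dim=1)
z2 = F.normalize(torch.randn(8, 128), dim=1)
print(nt_xent_loss(z1, z2).item())
```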
Figure 6 shows that the NT-Xent loss [37] acts solely on the direction of the features confined to the D-dimensional hypersphere because the representations are normalized before the loss value is calculated.
Figure 6. Description of the contrastive loss on the 2-dimensional unit sphere with two negative samples (z₁⁻ and z₂⁻) and one positive sample (z⁺) from the EuroSAT dataset. From Ref. [28], used under Creative Commons CC-BY license.
By maximizing the mutual information between the two views, this loss ensures that the resulting representations are both style-neutral and content-specific.
In addition to SimCLR, the momentum contrast (MoCo) technique has been suggested, which uses smaller batches to calculate the contrastive loss while maintaining the same effective number of negative samples [38]. It employs an exponential moving average (EMA)-updated momentum encoder, whose values track the main encoder’s weights, and a sample queue to increase the number of negative samples per batch, as shown in Figure 7. To make room for the newest positives, the oldest negatives from previous batches are excluded. Other techniques, such as swapping assignments between views (SwAV), assign views to consistent clusters between positive pairs by clustering representations into a shared set of prototypes [37,39,40,41]. An entropy-regularized optimal transport strategy is also used in the same context to distribute representations between clusters in a manner that prevents them from collapsing into one another [39,42,43,44,45,46]. Finally, the loss minimizes the cross-entropy between the optimal assignments in one branch and the predicted distribution in the other. To feed sufficient negative samples to the loss function and prevent representations from collapsing, contrastive approaches often need large batch sizes.
Figure 7. An illustration of the queue of samples built utilizing a momentum encoder whose weights are updated by momentum. From Ref. [28], used under Creative Commons CC-BY license.
As shown in Figure 7, at each step, only the main encoder weights are updated by backpropagation. The similarities between the queue and the encoded batch samples are then employed in the contrastive loss.
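The momentum-encoder and queue mechanics described above can be sketched as follows, assuming PyTorch; the encoder architecture, queue length, and temperature are illustrative assumptions rather than the settings of the cited MoCo work.

```python
import torch
import torch.nn as nn

encoder_q = nn.Linear(32, 16)                       # main encoder, trained by backpropagation
encoder_k = nn.Linear(32, 16)                       # momentum encoder, never backpropagated
encoder_k.load_state_dict(encoder_q.state_dict())
queue = torch.randn(16, 256)                        # feature queue acting as negative samples
m = 0.999                                           # EMA momentum coefficient

x_q, x_k = torch.randn(8, 32), torch.randn(8, 32)   # two augmented views of the same batch

with torch.no_grad():
    # EMA update: momentum-encoder weights move slowly toward the main encoder's weights.
    for p_k, p_q in zip(encoder_k.parameters(), encoder_q.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)
    k = nn.functional.normalize(encoder_k(x_k), dim=1)

q = nn.functional.normalize(encoder_q(x_q), dim=1)
l_pos = (q * k).sum(dim=1, keepdim=True)            # similarity with the positive key
l_neg = q @ queue                                   # similarities with the queued negatives
logits = torch.cat([l_pos, l_neg], dim=1) / 0.07    # temperature-scaled logits
loss = nn.functional.cross_entropy(logits, torch.zeros(8, dtype=torch.long))

# Dequeue the oldest negatives and enqueue the newest keys (FIFO update).
queue = torch.cat([queue[:, 8:], k.t()], dim=1)
print(loss.item())
```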
Compared with traditional prediction methods, joint-embedding approaches tend to generate broader representations. Nonetheless, their effectiveness in downstream activities may vary depending on the augmentations utilized. If a model consistently returns the same representations for differently cropped versions of the same image, it effectively removes any spatial information about the image and will likely perform poorly in tasks such as semantic segmentation and object detection, which rely on this spatial information. Dense contrastive learning (DCL) has been proposed and considered by various researchers to address this issue [47,48,49,50]. Rather than applying the contrastive loss to the entire image, it is applied to individual patches. This permits the contrastive model to acquire representations that are less prone to spatial shifts.
3.4. Non-Contrastive SSL Models
Alternative methods within joint-embedding learning frameworks can train self-supervised models while avoiding collapse without a contrastive loss. These are classified as non-contrastive approaches. Bootstrap Your Own Latent (BYOL) is a system based on a teacher-student (mentor-apprentice) pairing [51,52,53]. The student network in a teacher-student setup is taught to mimic the teacher network’s output (or features). This method is frequently utilized in knowledge distillation when the teacher and student models possess distinct architectures (e.g., when the student model is substantially smaller than the teacher model) [54]. The weights of the teacher network in BYOL are defined as the EMA of the student network weights. Two projector networks, g_A and g_B, are utilized after the encoders, f_A and f_B, to calculate the training loss. Subsequently, to extract representations at the image level, only the student encoder f_A is retained. Additional asymmetry is introduced between the two branches by a predictor network superimposed on the student projector, as shown in Figure 8.
Figure 8. Architecture of the non-contrastive BYOL method, considering the student (A) and teacher (B) pathways to encode the dataset. From Ref. [28], used under Creative Commons CC-BY license.
In Figure 8, the teacher’s weights are updated by the EMA technique applied to the student weights. The online branch is also supported by an additional network, p_A, which is known as the predictor [53].
SimSiam employs a pair of identical (Siamese) networks with a predictor network at the end of one branch [55,56,57]. Because the two branches share identical weights, the loss function employs an asymmetric stop-gradient to optimize the alignment between positive pairs. Relying on a student-teacher transformer design known as self-distillation, DINO (self-distillation with no labels) defines the teacher as an EMA of the weights of the student network [58]. The teacher network’s centered and sharpened outputs are then utilized to train the student network to produce matching predictions for a given positive pair.
Another non-contrastive learning model, known as Barlow Twins, is motivated by the information bottleneck theory and eliminates the need for the separate branch weights of the teacher-student schemes considered in BYOL and SimSiam [59,60]. This technique enhances the mutual information between two views by boosting the cross-correlation of the matching features provided by two identical networks and eliminating superfluous information in these representations. The Barlow Twins loss function is evaluated by the following equation:

L_BT = Σ_i (1 − C_ii)² + λ Σ_i Σ_{j≠i} C_ij²

where C is the cross-correlation matrix calculated by the following formula:

C_ij = [ Σ_b z^A_{b,i} z^B_{b,j} ] / [ √(Σ_b (z^A_{b,i})²) · √(Σ_b (z^B_{b,j})²) ]

with z^A and z^B denoting the normalized outputs of the two branches, b indexing the samples in the batch, i and j indexing the feature dimensions, and λ being a weighting parameter for the redundancy-reduction term.
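A compact sketch of this objective, assuming PyTorch, is given below; the weighting parameter lam and the embedding sizes are illustrative assumptions.

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Standardize each branch's embeddings, build the cross-correlation matrix C,
    push its diagonal toward 1 (invariance) and its off-diagonal toward 0
    (redundancy reduction)."""
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)          # per-feature standardization over the batch
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = (z_a.t() @ z_b) / n                          # D x D cross-correlation matrix
    on_diag = (1 - torch.diagonal(c)).pow(2).sum()
    off_diag = (c - torch.diag_embed(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

# Illustrative usage with 64-dimensional embeddings of two views of a batch of 32 images.
print(barlow_twins_loss(torch.randn(32, 64), torch.randn(32, 64)).item())
```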
Variance, invariance, and covariance regularization (VICReg) approaches have recently been proposed to enhance this framework [61,62,63,64]. In addition to the invariance term, which implicitly maximizes the alignment between positive pairs, the loss terms are computed independently for every branch, unlike in Barlow Twins. By using distinct regularization for each pathway, this method allows non-contrastive multimodal pre-training between text and image pairs.
Most of these techniques use a linear classifier trained on top of the frozen representations as the primary performance metric. Researchers in [63] analyzed the beneficial impacts of ImageNet, whereas scholars in [62,65] examined CIFAR’s advantages; both are object-centric visual datasets commonly addressed for the pre-training and linear-probing phases of DL. Therefore, these evaluations may not carry over to settings beyond such object-centric image classification.
Scholars are invited to examine dedicated review articles for further contributory information and essential fundamentals pertaining to SSL types [61,66].
4. Practical Applications of SSL Models
Before introducing the common applications and vital utilizations of SSL models for efficacious data classification and identification processes, their critical benefits should be identified as a whole. The commonly addressed benefits and vital advantages of SSL techniques can be expressed as follows [67,68]:
- Minimizing the massive cost connected with the data labeling phases that are essential to facilitating a high-quality classification/prediction process.
- Alleviating the time needed to classify/recognize vital information in a dataset.
- Optimizing the data preparation lifecycle, which is typically a lengthy procedure in various ML models and relies on filtering, cleaning, reviewing, annotating, and reconstructing processes through the training phases.
- Enhancing the effectiveness of AI models, since SSL paradigms can be recognized as functional tools that allow flexible involvement of innovative human thinking and machine cognition.
Building on these practical benefits, further workable possibilities and effective prediction and recognition capabilities are explained in the following paragraphs, which focus mainly on medical and engineering contexts.
4.1. SSL Models for Medical Predictions
Krishnan et al. (2022) [69] analyzed the application of SSL models in medical data classification, highlighting the critical challenges of manually annotating vast medical databases. They addressed SSL’s potential for enhancing disease diagnosis, particularly in electronic health records (EHRs) and other visual clinical datasets. Huang et al. (2023) [13] conducted a systematic review affirming SSL’s benefits in supporting medical professionals with precise classification and therapy identification from visual data, reducing the need for extensive manual labeling.
Figure 9 shows the number of DL, ML, and SSL research articles published between 2016 and 2021.
Figure 9. The number of articles on SSL, ML, and DL models utilized for medical data classification [13].
It can be concluded from the statistical data in Figure 9 that the number of research publications addressing the importance and relevance of ML and DL models in medical classification has been increasing each year. A similar increasing trend holds for the overall number of academic articles investigating SSL, ML, and DL algorithms for high-performance identification of problems in patient images.
Besides these numeric figures, the SSL pre-training and fine-tuning processes are illustrated in Figure 10.
Figure 10. The two stages of pre-training and fine-tuning considered in the classification of visual data [13].
It can be inferred from Figure 10 that the SSL pre-training process takes into account four critical types: (a) innate relationship, (b) generative, (c) contrastive, and (d) self-prediction. At the same time, the fine-tuning process comprises two categories: end-to-end and feature extraction procedures.
Before the classification process is carried out, SSL models are first pre-trained. This step is followed by the encoding of image features and then by the adoption of a classifier, which is important to enable precise prediction of the medical problem in the image.
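The distinction between the two fine-tuning categories can be sketched as follows, assuming PyTorch; the encoder, classifier, and class count are illustrative stand-ins rather than any specific medical model.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())   # stand-in for an SSL pre-trained backbone
classifier = nn.Linear(256, 3)                                          # e.g., three hypothetical diagnostic classes
x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 3, (8,))              # stand-in labeled images

# (1) Feature extraction: freeze the encoder and train only the classifier.
for p in encoder.parameters():
    p.requires_grad = False
probe_opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(classifier(encoder(x)), y)
probe_opt.zero_grad(); loss.backward(); probe_opt.step()

# (2) End-to-end fine-tuning: unfreeze everything and update encoder and classifier jointly.
for p in encoder.parameters():
    p.requires_grad = True
e2e_opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-4)
loss = nn.functional.cross_entropy(classifier(encoder(x)), y)
e2e_opt.zero_grad(); loss.backward(); e2e_opt.step()
```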
In their overview [13], the scholars identified a collection of medical disciplines in which SSL models can be advantageous for conducting the classification process flexibly, as illustrated in Figure 11.
Figure 11. The major categories of medical classification that can be handled by SSL models [13].
From the data expressed in Figure 11, it can be inferred that the medical classification types and dataset categories to which SSL models can be applied reliably are numerous. As a result, SSL models are practical and feasible for carrying out robust predictions of problems in clinical datasets.
Various studies have explored the application of SSL models in medical data classification, showcasing their efficacy in improving diagnostic accuracy and efficiency. Azizi et al. (2021) [70] demonstrated the effectiveness of SSL algorithms in classifying medical disorders within visual datasets, particularly highlighting advancements in dermatological and chest X-ray recognition. Zhang et al. (2022) [71] utilized numerical simulations to classify patient illnesses on X-rays, emphasizing the importance of understanding medical images for clinical knowledge. Bozorgtabar et al. (2020) [72] addressed the challenges of data annotation in medical databases by employing SSL methods for anomaly classification in X-ray images. Tian et al. (2021) [73] identified clinical anomalies in fundus and colonoscopy datasets using SSL models, emphasizing the benefits of unsupervised anomaly detection in large-scale health screening programs. Ouyang et al. (2021) [74] introduced longitudinal neighborhood embedding SSL models for classifying Alzheimer’s disease-related neurological problems, enhancing the understanding of brain disorders. Liu et al. (2021) [75] proposed an SSMT-SiSL hybrid model for chest X-ray data classification, highlighting the potential of SSL techniques to expedite data annotation and improve model performance. Li et al. (2021) [76] addressed data imbalances in medical datasets with an SSL approach, enhancing lung cancer and brain tumor detection. Manna et al. (2021) [77] demonstrated the practicality of SSL pre-training in improving downstream operations in medical data classification. Zhao and Yang (2021) [78] utilized radiomics-based SSL approaches for precise cancer diagnosis, showcasing SSL’s vital role in medical classification tasks.
4.2. SSL Models for Engineering Contexts
In the field of engineering, SSL models may provide practical contributions, especially when prediction tasks in mechanical, industrial, electrical, or other engineering domains must be accomplished accurately and flexibly without the massive data annotation needed to train and test conventional models.
In this context, Esrafilian and Haghighat (2022) [79] explored the critical workabilities of SSL models in providing sufficient control systems and intelligent monitoring frameworks for heating, ventilation, and air-conditioning (HVAC) systems. Typically, ML and DL models may not offer noteworthy advantages in this setting, since the complicated relationships, patterns, and energy-consumption behaviors are not directly and clearly provided. The controller was created by employing a model-free reinforcement learning technique known as the double-deep Q-network (DDQN). Long et al. (2023) [80] proposed an SSL-based defect prognostics-trained DL model, SSDL, addressing the challenges of costly data annotation in industrial health prognostics. SSDL dynamically updates a sparse auto-encoder classifier with reliable pseudo-labels from unlabeled data, enhancing prediction accuracy compared with static SSL frameworks. Yang et al. (2023) [81] developed an SSL-based fault identification model for machine health prognostics, leveraging vibration signals and one-class classifiers. Their SSL model, utilizing contrastive learning to derive intrinsic representations, outperformed recent numerical models in fault prediction accuracy during simulations. Wei et al. (2021) [82] utilized SSL models for rotary machine failure diagnosis, employing 1-D SimCLR to efficiently encode patterns from a few unlabeled samples. Their DTC-SimCLR model combined data transformation combinations with a fixed feature encoder, demonstrating effectiveness in diagnosing cutting-tooth and bearing faults with minimal labeled data. Overall, DTC-SimCLR achieved improved diagnostic accuracy with fewer samples.
Figure 12 depicts a machine failure diagnosis approach that requires very few samples.
Figure 12. The formulated system for machine failure diagnosis with very few samples [82].
Furthermore, the SSL procedure in SimCLR is expressed in Figure 13.
Figure 13. The procedure related to SSL in SimCLR [82].
Simultaneously, Table 1 indicates the critical variables correlated with the 1D SimCLR.
Table 1. The major variables linked to the 1D SimCLR [82].
No. | Variable Category | Magnitude
1 | Input Data | A length of 1024 data points
2 | Temperature | 10
3 | Feature Encoder | Sixteen convolutional layers
4 | Output Size | 128
5 | Training Epoch | 200
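The magnitudes in Table 1 can be mirrored in a small configuration and encoder sketch, assuming PyTorch; the channel widths and kernel sizes are illustrative assumptions and do not reproduce the exact architecture of Ref. [82].

```python
import torch
import torch.nn as nn

config = {"input_length": 1024, "temperature": 10, "conv_layers": 16,
          "output_size": 128, "training_epochs": 200}

# Build a 1-D convolutional feature encoder with the number of layers listed in Table 1.
layers, channels = [], 1
for _ in range(config["conv_layers"]):
    layers += [nn.Conv1d(channels, 8, kernel_size=3, padding=1), nn.ReLU()]
    channels = 8
encoder = nn.Sequential(*layers, nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                        nn.Linear(8, config["output_size"]))

signal = torch.randn(4, 1, config["input_length"])   # stand-in batch of vibration segments
print(encoder(signal).shape)                          # torch.Size([4, 128])
```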
Beyond these examples, Lei et al. (2022) [83] addressed SSL models for predicting the aluminum temperature in industrial engineering applications. Through their numerical analysis, they examined how changing the temperature of the pot or electrolyte could affect the overall yield of aluminum during the reduction process using their proposed deep long short-term memory (D-LSTM) model.
On the other hand, Xu et al. (2022) [84] identified the rationale for functional SSL models as alternative solutions to conventional human defect inspection methods that have become insufficient. Bharti et al. (2023) [85] remarked that deep SSL (DSSL) has gained significant relevance in industry owing to its potency in reducing the time and effort required by humans for data annotation by manipulating operational procedures carried out by robotic systems, taking into account the CIFAR-10 dataset. Hannan et al. (2021) [86] implemented SSL prediction to precisely estimate the state of charge (SOC) of lithium-ion (Li-ion) batteries in electric vehicles (EVs) to ensure maximum cell lifespan.
4.3. Patch Localization
Regarding the critical advantages of SSL models in patch localization, several authors have confirmed the significant effectiveness of innovative SSL schemes in accomplishing recognition and detection over a defined dataset of patches. For instance, Li et al. (2021) [87] estimated the substantial contributions of SSL in identifying visual defects or irregularities in an image without relying on abnormal training data. The patch localization of visual defects involves categories such as grids, wood, screws, metal nuts, hazelnuts, and bottles.
Although SSL has made great strides in image classification, its effectiveness in precise object detection remains moderate. Through their analysis, Yang et al. (2021) [88] aimed to improve self-supervised pre-trained models for object detection. They proposed a novel self-supervised pretext algorithm called instance localization, together with an augmentation strategy for image bounding boxes. Their results confirmed that their pre-trained algorithm improved object detection but became less effective in ImageNet semantic classification and more effective in image patch localization. Object detection experiments on the PASCAL VOC and MSCOCO datasets revealed that their method achieved state-of-the-art transfer learning outcomes.
The red box in their result, shown in Figure 14, indicates the ground-truth bounding box linked to the foreground image. The right-hand photo shows a group of anchor boxes positioned at a single spatial location in the central area. By generating multiple anchors with varied scales, positions, and aspect ratios, the ground truth pertaining to the blue boxes can be augmented, offering an intersection over union (IoU) level greater than 0.5.
Figure 14. Bounding boxes addressed for spatial modeling [88].
To train an end-to-end model for anomaly identification and localization using only normal training data, Schlüter et al. (2022) [89] created a flexible self-supervised patch categorization model called natural synthetic anomalies (NSA). NSA harnesses Poisson image editing to blend scaled patches of varying sizes from multiple photographs into a single coherent image. Compared with other data augmentation methods for unsupervised anomaly identification, this helped generate a wider variety of synthetic anomalies that are more akin to natural sub-image inconsistencies. Natural and medical images, including the MVTec AD dataset, were employed to test the proposed technique, indicating an efficient capability of identifying various unknown manufacturing defects in real-world scenarios.
4.4. Context-Aware Pixel Prediction
Learning visual representations from unlabeled photographs has recently witnessed a rapid evolution owing to self-supervised instance discrimination techniques. Nevertheless, the success of instance-based objectives in medical imaging is unknown because of the large variations in new patients’ cases compared with previous medical data. Context-aware pixel prediction focuses on understanding the most discriminative global elements in an image (such as the wheels of a bicycle). According to the research investigation conducted by Taher et al. (2022) [90], instance discrimination algorithms have poor effectiveness in downstream medical applications because the global anatomical similarity of medical images is excessively high, resulting in complicated identification tasks. To address this shortcoming, the scholars introduced context-aware instance discrimination (CAiD), a lightweight but powerful self-supervised system, considering: (a) generalizability and transferability; (b) separability in embedding space; and (c) reusability. The authors addressed the Dice similarity coefficient (DSC) as a measure of the similarity between two datasets, which are often represented as binary arrays. Similarly, the authors in [91] proposed a teacher-student strategy for representation learning, wherein a perturbed version of an image serves as input for training a neural net to reconstruct a bag-of-visual-words (BoW) representation of the original image. The BoW targets are generated by the teacher network, and the student network learns representations while simultaneously receiving online training and an updated visual-word vocabulary.
Liu et al. (2018) [50] highlighted beneficial yields of SSL models in identifying information from context-aware pixel datasets. To train the CNN models necessary for depth estimation from monocular endoscopic data without a priori modeling of the anatomy or coloring, the authors implemented the SSL technique, considering a multiview stereo reconstruction technique.
4.5. Natural Language Processing
Fang et al. (2020) [8] considered SSL to classify essential information in defined datasets related to natural language processing. The scholars explained that pre-trained linguistic models, such as bidirectional encoder representations from transformers (BERT) and generative pre-trained transformers (GPT), have proved considerably effective in executing linguistic classification tasks. Existing pre-training techniques rely on auxiliary token-based prediction tasks, which may not be effective for capturing sentence-level semantics. Thus, they proposed a new approach known as contrastive self-supervised encoder representations from transformers (CERT). Baevski et al. (2023) [92] highlighted the relevance of SSL models to high-performance data identification in NLP. They explained that currently available unsupervised learning techniques tend to rely on resource-intensive and modality-specific aspects. They added that the data2vec model expresses a practical learning paradigm that can be generalized and broadened across several modalities. Their study aimed to improve the training efficiency of this model to help handle the precise classification of NLP problems. Park and Ahn (2019) [93] inspected the vital gains of SSL in leading to efficient NLP detection. The researchers proposed a new data augmentation approach that considers the intended context of the data. They suggested a label-masked language model (LMLM), which can effectively employ the masked language model (MLM) on data with label information by including label data for the mask tokens adopted in the MLM. Several text classification benchmark datasets were examined in their work, including the Stanford sentiment treebank-2 (SST2), multi-perspective question answering (MPQA), text retrieval conference (TREC), Stanford sentiment treebank-5 (SST5), subjectivity (Subj), and movie reviews (MR).
4.6. Auto-Regressive Language Modeling
Elnaggar et al. (2022) [94] published a paper shedding light on valuable SSL roles in handling the active classification of datasets connected to auto-regressive language modeling. The scholars trained six models, four auto-encoders (BERT, Albert, Electra, and T5) and two auto-regressive models (Transformer-XL and XLNet), on up to 393 billion amino acids from UniRef and BFD. The Summit supercomputer was utilized to train the protein LMs (pLMs), which required 5616 GPUs and a TPU Pod with up to 1024 cores. Lin et al. (2021) [95] performed numerical simulations exploring the added value of three SSL models, notably (I) autoregressive predictive coding (APC), (II) contrastive predictive coding (CPC), and (III) wav2vec 2.0, in performing flexible classification and reliable recognition of datasets engaged in auto-regressive language modeling. Several any-to-any voice conversion (VC) methods have been proposed, such as AUTOVC, AdaINVC, and FragmentVC. To separate the content features from the speaker information, AUTOVC and AdaINVC utilize source and target encoders. The authors proposed a new model, known as S2VC, which harnesses SSL by considering multiple features of the source and target linked to the VC model. Chung et al. (2019) [96] proposed an unsupervised auto-regressive neural model for learning generalized representations of speech. Their speech representation learning approach was developed to maintain information for various downstream applications rather than to remove noise or speaker variability.
5. Commonly-Utilized Feature Indicators of SSL Models’ Performance
Specific formulas in [97,98] were investigated to examine different SSL paradigms in carrying out their classification tasks, particularly the prediction and identification of faults and errors in machines, which can support maintenance specialists in selecting the most appropriate repair approach. These formulas define practical feature indicators for monitored signals that are prevalently utilized by maintenance engineers to identify the health state of machines. Twenty-four typical feature indicators were addressed, referring to Zhang et al. (2022) [99]. These indices can enable maintenance practitioners to identify optimum maintenance strategies to apply to industrial machinery, helping to handle current failure issues flexibly.
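A few of the most common time-domain indicators of this kind (RMS, peak-to-peak, crest factor, and kurtosis) can be computed as in the following sketch, assuming NumPy; these are generic examples and not the specific set of twenty-four indicators addressed in Ref. [99].

```python
import numpy as np

def feature_indicators(x: np.ndarray) -> dict:
    """Compute a few widely used time-domain health indicators of a vibration signal."""
    rms = np.sqrt(np.mean(x ** 2))
    peak_to_peak = x.max() - x.min()
    crest_factor = np.max(np.abs(x)) / rms
    kurtosis = np.mean((x - x.mean()) ** 4) / (np.var(x) ** 2)
    return {"rms": rms, "peak_to_peak": peak_to_peak,
            "crest_factor": crest_factor, "kurtosis": kurtosis}

# Illustrative usage on a synthetic vibration signal with an added impulsive component,
# as a localized bearing defect might produce.
t = np.linspace(0, 1, 2048)
signal = np.sin(2 * np.pi * 50 * t) + 0.05 * np.random.randn(t.size)
signal[::256] += 3.0
print(feature_indicators(signal))
```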