The Rise of Adversarial Machine Learning: History
Please note this is an old version of this entry, which may differ significantly from the current revision.

Internet of Things (IoT) technologies serve as a backbone of cutting-edge intelligent systems. Machine Learning (ML) paradigms have been adopted within IoT environments to exploit their capabilities to mine complex patterns. Despite the reported promising results, ML-based solutions exhibit several security vulnerabilities and threats. Specifically, Adversarial Machine Learning (AML) attacks can drastically impact the performance of ML models. It also represents a promising research field that typically promotes novel techniques to generate and/or defend against Adversarial Examples (AE) attacks.

  • Internet of Things (IoT)
  • Cybersecurity
  • intrusion detection
  • adversarial machine learning (AML)

1. Introduction

Recently, researchers have reported the potential vulnerability of ML models to adversarial attacks [1,3,4]. This has promoted researchers’ efforts toward AML challenges. In fact, AML is concerned with the intersection of computer security and machine learning fields. It analyzes the attacks that aim to degrade the performance of ML-based models. It also investigates the process of generating and detecting the crafted adversarial examples and how eventually incorporate possible defensive mechanisms. The area has been extensively associated with applications dealing with images as primary data modality. On the other hand, it is still growing within the fields of network traffic analysis and IoT [6,7].

2. Understanding Adversarial Examples

2.1. Adversarial Examples Causes and Characteristics

The criticality of adversarial examples has driven many questions about the causes behind them and how they are constructed. The analysis of such problems supports the mitigation efforts provided by researchers to manage this vulnerability. One of the causes can be the inability of the model to generalize and predict accurately the pattern of unseen data. Moreover, Goodfellow et al. [26] investigated the effect of adding perturbations to a regularized model for enhancing prediction performance. However, the reported results did not confirm the expected improvement. Other researchers [27] investigated the non-linearity of ML models in increasing the chance of constructing adversarial examples. They claimed that both linear and non-linear models can be considered for constructing adversarial examples by injecting inputs with small perturbations. According to Goodfellow et al. [26], the linear behavior of the model, in which each individual input feature is normalized, can also yield adversarial examples. Moreover, perturbing one dimension of each input will not affect the classification accuracy as effectively as perturbing all the dimensions of the inputs [28]. When dealing with adversarial examples, there are three main characteristics to be considered as follows [28]:
  • Transferability. Adversarial examples can be constructed and used across several architectures and parameters of ML models which perform the same tasks. This characteristic shows the capability of those examples to be constructed by a known substitute model and then used to attack relevant unknown target models. Transferability can be categorized into two types [29]:
  • Cross-data transferability: This happens when the training of both substitute and target models uses similar machine-learning techniques but with different data.
  • Cross-technique transferability: This happens when the training of both substitute and target models uses the same data but with different machine-learning techniques.
  • Regularization Effect. Adversarial examples can be used to enhance model robustness using adversarial training. Adversarial training solutions is adopted as defense mechanisms by several researchers. However, constructing large adversarial examples is costly in terms of computational power compared to other regularization mechanisms such as dropout [30].
  • Adversarial Instability. Adversarial examples can lose the adversarial characteristics when physical effects are applied including rotation, translation, rescaling, and lighting [31]. This leads to the classification of these examples correctly which motivates attackers to enhance the robustness of adversarial examples construction methods.
However, some limitations can be faced when dealing with adversarial examples. This is inferred from the restriction in the perturbation numbers added where it is preferred to keep it at a low scale. Moreover, there might also be more optimization constraints on crafting the perturbations itself such as original content preserving, non-distinguishable perturbed input sample, and payload-constrained input [32].

2.2. Adversarial Examples Magnitude Measurement

For crafting adversarial examples, gradient-based methods are widely adopted for adding perturbation using specific distance metrics [28,33,34]:
  • The L0 norm of perturbations—measures the number of mismatched (non-zero) elements between original and adversarial samples in the vector where the features’ perturbed number is minimized.
  • The L1 norm of perturbations—measures the total number of absolute values of the differences between original and adversarial samples in which the features’ perturbed number is minimized.
  • The L2 norm of perturbations—measures the Euclidean distance between original and adversarial samples in which the Euclidean distance between those data points is minimized.
  • L∞ norm of perturbations—measures the maximum difference between the original and adversarial samples in which the maximum amount of perturbation is applied on any feature.

2.3. Adversarial Examples Crafting Methods

The crafting methods of adversarial examples have been widely investigated and a framework might be proposed for further clarification. It is worth noting that the efficiency of crafting adversarial examples comes from minimizing the total perturbation as much as possible to avoid being easily detected. Accordingly, there are two possible steps that are repeated iteratively where datapoint X is replaced with X + δX until the adversarial goal is achieved and the perturbation δX is applied.
Direction Sensitivity Estimation.
In this step, the sensitivity of applying changes to inputs features is measured by analyzing the data distributions around specific datapoint where the model decision boundary is possibly affected and changed. Accordingly, there are several techniques for performing direction sensitivity estimation such as the following:
  • Limited-memory Broy-den, Fletcher, Goldforb, Shanno (L-BFGS). This method has been proposed by Szegedy et al. [6] for crafting adversarial examples through a minimization problem. In such a scenario, the adversary constructs with L2-norm an image X′ similar to the original image X where X′ can be labeled as a different class. This is considered a complex problem to be solved due to the use of nonlinear and non-convex functions. They tried to search for an adversarial sample by finding the minimum loss function additions to L2-norm according to the following formula [28]:

min c . X X 2 + l o s s F , l ( X )

where c is a hyper-parameter that is randomly initialized by linear search; ∥∥XX∥∥2 is L2-norm and lossF,l(*)

  • is the loss function. Thus, the problem is transformed into a convex optimization process but with complicated and expensive calculations [28,35].
  • Fast Gradient Sign Method (FGSM). This method has been proposed by Goodfellow et al. [26] for crafting adversarial examples where the cost function is calculated with regard to the gradient direction. FGSM is different from LBFGS since it uses the L1-norm and does not perform iterative processes which makes it an excellent choice when it comes to computational cost and time. In such a scenario, misclassification can occur by adding perturbations according to the following formula [28]:
    X a d v = X + ϵ s i g n ( X J X , Y t r u e )

where Xadv is the adversarial example of X, X is the original sample and ϵ is a hyper-parameter that is randomly initialized to control the amplitude of the disturbance. Additionally, sign (_) is a sign function, and J (_) is the cost function with respect to the original sample with the correct label Ytrue and XJ is the gradient of X. However, this method is subjected to label leaking where other researchers suggest replacing the correct label Ytrue

  • with the predicted label [28,35,36].
  • Iterative Gradient Sign Method (IGSM). This method has been proposed by Kurakin et al. [36] for crafting adversarial examples by optimizing the FGSM method. Perturbations are iteratively applied into several smaller steps followed by clipping the results which guarantees that these perturbations are close to the original samples. It is worth noting that the non-linearity of IGSM is in the gradient direction where multiple iterations are required. This reflects the simplicity of this method compared to L-BFGS and its higher success rate of the resulting adversarial samples compared to FGSM. For each iteration, the following formula is used where ClipX,ϵ(*)

denotes [X − ϵ, X + ϵ] [28]:

X 0 a d v = X , X N + 1 a d v = C l i p X , ϵ { X N a d v + α s i g n ( X J X N a d v , Y t r u e ) }
Moreover, IGSM has two distinct types of adversarial goals: (1) minimizing the confidence of the original prediction and its belongingness to the original class, or (2) maximizing the confidence of the prediction and its belongingness to the class with the lowest probability instead of the correct class [28,35].
  • Iterative Least-Likely Class Method (ILCM). This method has been proposed by Kurakin et al. [37] for crafting adversarial examples by perturbing the target class and replacing it with the least-likely probability class for the dataset disturbance. It leads to a degradation in the classifier performance with significant errors such as misclassifying a dog as a car. The ILCM differs from FGSM and L-BFGS by identifying the exact wrong class for the adversarial examples. Moreover, it is suitable to be used when handling datasets with a considerable number of distinct classes such as ImageNet. In such a scenario, perturbations can be added according to the following formula [28]:

X 0 a d v = X , X N + 1 a d v = C l i p X , ϵ { X N a d v + α s i g n ( X J X N a d v , Y L L ) }

where the least-likely probability class is represented by YLL

  • . There is another option where the least-likely class is replaced with a random class as the target class which is thereby called an iteration random class method [28,35].
  • Jacobian Based Saliency Map (JSMA). This method has been proposed by Papernot et al. [38] for crafting adversarial examples using the model’s Jacobian matrix. It works by using the gradients of relative output and input components to construct a saliency map and build the gradients based on the impact of each pixel. The L0 distance norm is utilized where a limited number of the image pixels are modified, and they represent the most important pixels based on the saliency map. Therefore, gradients are significantly important in perturbing the pixel and making the prediction of the image towards the target classes. It can be performed as follows [28]:
    Firstly: Calculate the forward derivative F(X)

according to the following formula:

F X = F ( x ) X = [ F j ( x ) X i ] i 1 M , j 1 N

  • II.
    Secondly: Construct the saliency map S based on the calculated forward derivative.
    Thirdly: Select the pixel with the highest importance using the saliency map in an iterative manner until either classifying the output as the target class or maximum perturbation is achieved.
It is worth noting that JSMA is used for targeting misclassification attacks with several strengths such as a high success rate, and a high transfer rate, however, it is its main disadvantage related to its high computational cost [28,35].
Perturbation Selection.
In this step, the knowledge of sensitivity is used by the adversary to select the most suitable perturbation for exploiting the model. This includes two methods as below:
Perturb all the input dimensions. Some researchers investigated the manipulation of all input dimensions where direct sensitivity estimation methods are used. In the [26] experiment, FGSM is utilized to evaluate the gradient sign direction for each input dimension and thereby minimizes the Euclidean distance between the original inputs and the related adversarial samples. However, applying such a method can be detected easily since the number of perturbations is large.
Perturb the selected input dimension. Some researchers investigated the perturbation of a selected number of input dimensions with the use of the saliency map [38]. This method contributes to limiting the number of perturbations effectively but at the price of higher computation cost.

3. Modelling the Adversary

The crucial use of ML models has encouraged the specifications of threat models which in turn highlights possible adversarial attack scenarios and conditions. This contributes to enhancing defense mechanisms properly by tailoring them towards specific attacks followed by measuring their performance. Huang et al. [7] introduced the AML concept where adversarial attacks are presented through a taxonomy modeling the adversarial threats according to the following aspects: goals, knowledge, and capabilities.

3.1. Adversarial Capabilities

Adversarial capabilities represent the potential impact of an adversary when attacking the ML models which can be grouped into two categories: Influence, and Specificity [7,29,39].
  • Influence:
This category focuses on the adversary’s influence on certain classification elements such as changing the dataset or the algorithms when running attacks on the target model. Such attacks include causative, evasion, or exploratory attacks where there is an influence over either the training dataset or the testing dataset, or both. Consideration of training and testing phases is used to clarify more about the adversary’s influence according to his/her capabilities in those phases, as follows [29,39]:
  • Training Phase Influence: In this phase, attacks take place by influencing or corrupting the model performance in which the datasets alteration is performed, and can be summarized as follows:
    • Data Injection: The adversary can affect the target model by injecting adversarial samples and inserting them into the training dataset. This can happen with some control over the training dataset but not over the learning algorithm.
    • Label manipulation: The adversary can modify the training labels only and gain the most vulnerable label to degrade the model performance. The label perturbations can happen with some control over the training dataset and can be applied in a random manner to the distribution of training data. An experiment indicates that a random perturbation of the training labels can degrade the performance of shallow ML models significantly [40].
    • Data Manipulation: The adversary can poison the training dataset before it has been used for training the target model. The adversary can modify both the labels and input features of the training data and affect the decision boundary. The training data can be accessed but without the need to access the learning algorithm.
    • Logic Manipulation: The adversary can manipulate the learning algorithm and affect its workflow logic which thereby makes the ML model under his/her control.
  • Testing Phase Influence: In this phase, attacks take place to force the target model to produce incorrect outputs without influencing it. These types of attacks use other techniques to extract useful information rather than influencing the training phases, and can be summarized as follows:
    • Model Evasion: The adversary can evade the target model by crafting adversarial samples during the testing phase.
    • Model Exploratory: The adversary can gain various levels of knowledge about the target model in terms of the learning algorithm and training dataset distribution pattern, as follows:
      Data Distribution Access: The adversary can access the training dataset distribution of the target models. The substitute local model is built to imitate the target model in classifying a set of distribution samples. This helps in generating adversarial samples where they are used on the target model for misclassification purposes.
      Model Oracle: The adversary can only query the target model by inputting a set of samples and checking the related output labels. This access is carried out as an oracle and followed by creating a substitute local model to be used on the obtained results from the query. Then the adversary uses the adversarial samples from the substitute model to affect the target model.
      Input–Output Collection: The adversary can collect from the target model the input—output pairs to analyze the possible patterns. This is carried out without accessing the training dataset distribution.
This category focuses on attack specificity in which the determination of attack effects is clarified. It considers the attacks with multiple vectors alongside a specific vector against the target model. Attacks within this category can be further classified as follows [4,29,39]:
  • Targeted: The adversary defines specific targets when performing attacks causing model misclassification into certain classes.
  • Indiscriminate: The adversary has no defined targets where performing attacks causes general misclassification without specifications.

3.2. Adversarial Knowledge

There are various levels of knowledge about the target model where an adversary can perform the attacks. Possible knowledge elements include training data, learning algorithms, feature space, cost function, and tuned parameters. Accordingly, the adversary knowledge about the target model can fall into three types of categories [39]:
  • Complete Knowledge: It is called White-Box Attack where the adversary has access to the whole learning process including data collection, feature extraction, feature selection, learning algorithm, and model-tuned parameters. In such a scenario, the target model is open source and access to the training dataset may be available or not to the adversary.
  • Partial Knowledge: It is called Grey-Box Attack where an adversary does not have access to the training dataset and is equipped with partial knowledge about the learning process in terms of learning algorithms and the feature space. However, the adversary is not aware of either the training dataset or the tuned parameters.
  • Zero Knowledge: It is called Black-Box Attack where an adversary does not have any knowledge about the majority of learning process elements including the training dataset, learning algorithm, and feature space. In such a scenario, the adversary queries the target model in which feedback on crafted query adversarial samples is used to enhance other substitute models.
The adversary can evolve from a black box to a white box through an iterative learning process with the use of an inference mechanism to reach the required level of knowledge [4,39].

3.3. Adversarial Goals

The adversarial impact on the target ML model is used to clarify the main objectives behind the adversarial attacks. Accordingly, the adversarial goals can be categorized based on the incorrectness of the model, as follows [29]:
  • Confidence Reduction: The adversary reduces the confidence of the target model classification process. This can be seen in an example of an image recognition task where a “stop” sign is recognized with a lower confidence value with regard to the correct class belongingness.
  • Misclassification: The adversary modifies the prediction of an input example and is misclassified on the decision boundary to a different class. This can be seen in an example of an image recognition task where a “stop” sign is recognized in another class that is different from the “stop” sign class.
  • Targeted Misclassification: The adversary works on crafting adversarial examples and modifying the input point to be misclassified by the target model into another specific class. This can be seen in an example of an image recognition task where the “stop” sign is recognized into another specific class like the “go” sign.
  • Source/target Misclassification: The adversary works on crafting adversarial examples and modifying specific input points to be misclassified by the target model into another specific class. This can be seen in an example of an image recognition task where the “stop” sign is recognized into another specific class like the “go” sign.

4. Defenses against Adversarial Examples

Since there are many methods used for crafting adversarial examples, it is essential to ensure the proper robustness of ML solutions against those vulnerabilities. According to the literature, defenses towards adversarial examples can be classified using distinct categories [32].

4.1. Data-Based Modification

This category covers the techniques of modifying the data and its related features during either the training phase or the testing phase based on the attacker’s capabilities. Such techniques include:
  • Adversarial Training:
Adversarial training has been widely adopted in many research studies to enhance the robustness of ML solutions against adversarial attacks and show their defects [26,41]. The main notion behind this technique is reducing the potential misclassification results when data perturbation is fed to the ML solutions. It works on adding adversarial examples to the training data alongside generating new adversarial examples during the training epochs. Through these epochs, the characteristics of adversarial examples are controlled by the loss function where the hyper-parameters are tuned accordingly. By equalizing the number of both original examples and adversarial examples during training, the models can give better adversarial training results.
Adversarial training is used also to handle the regularization problems and thereby avoid overfitting. Since adversarial training is correlated to the training phase where white-box attacks take place, it is not robust to black-box attacks where new adversarial examples are presented [28,32]. Therefore, a study investigated the use of the ensemble adversarial training concept where different training data of pre-trained models are combined and then allow the model to generalize well on unseen inputs [42].
Blocking the Transferability:
Since transferability is a unique characteristic of adversarial examples, defenses in this context prevent the adversarial examples’ transferability and thereby prevent black-box attacks. As mentioned previously, transferability can happen to models with different architectures or trained on different training datasets. A labeling method has been proposed to prevent the transferability between models [44]. It relies on adding a NULL label to the dataset to mitigate the adversarial examples effects where the model can detect them more efficiently. For this purpose, three steps are performed which start with training the target model as an initial step. It is followed by finding the NULL probabilities within the examined dataset and ends with adversarial training. A higher probability is assigned to the NULL label when there is more perturbation, and the probabilities of other labels are also increased for the original labeling. Thus, the method eases the detection of adversarial examples by annotating the perturbation as a NULL label instead of classifying it into one of the ranges of original labels. It also does not affect the model accuracy for the classification process of original datasets [35].
Input Transformation:
Some research studies have found the data transformation technique useful for enhancing the model’s robustness against adversarial attacks such as FGSM [26], DeepFool [79], and universal disturbance attacks [45,46,47]. Such transformation includes data compression, variance minimization, bit-depth reduction, and input reconstruction which are utilized to prevent adversarial perturbations. In terms of compression, JPEG and Display Compression Technology (DCT) compression methods are used in mitigating attacks by performing compression over different image data formats such as JPEG. After training the model on those inputs, the overall accuracy is enhanced, and the attack disturbance effect can be controlled. However, the compression techniques are not well effective against more powerful and advanced attacks like Carlini & Wagner attacks [80,81]. Moreover, the increase of compression amount affects the original accuracy of the models while the decrease of its amount is not sufficient in eliminating the adversarial examples impact.
In terms of input reconstruction, a cleaning process is applied to the adversarial examples to transform them into legitimate examples in a reverse way. After the transformation takes place, the adversarial effect on the models’ classification will be removed. An example of such work is presented using a deep contractive autoencoder technique, where a denoising autoencoder is used for cleaning adversarial examples [50]. However, the transformation techniques can be also used by the adversary to make the adversarial examples further stronger in the face of defense mechanisms [35,82].
Data Randomization:
Research in this category deals with applying different modification operations on adversarial examples such as random resizing and padding. This means that random sequences are added to the adversarial examples which reduces their effectiveness significantly. Some researchers use the two randomization operations including random resizing and random padding in the test phase. Accordingly, both non-iterative and iterative adversarial attacks can be reduced in an effective manner [48]. Other researchers utilize a separate module for conversing data where several operations of randomization are performed such as Gaussian randomization during the training phase which strengthens the model’s capability [49].
Adversarial Robust Features Development:
Some studies focus on utilizing feature space in defending adversarial attempts against the classification process. They investigate developing adversarial robust features by studying the underlying data in terms of natural spectral geometric aspect and its relationship with the metric of interest. Their results reflect the effectiveness of the proposed approach and guarantee that any function of the dataset can have a lower bound of robustness while ensuring variation in outputs [51].

4.2. Model-Based Modification

This category covers the techniques of modifying the model through the methods of parameter tuning, feature selection, and so on. Such techniques include the following:
  • Feature Squeezing:
The typical features space is generally large with several least essential elements that can facilitate exploiting potential vulnerabilities of ML solutions. These vulnerabilities lead to a large interference where the adversaries can craft adversarial examples and benefit from the model’s high sensitivity. Feature squeezing is achieved by reducing the feature and minimizing the complexity of data representation. The first feature squeezing method works on reducing the color-bit depth by having encoded color with fewer values. The second one applies a smoothing filter with the use of local and non-local techniques where mapping several inputs to a single value is performed. This enhances the model performance by reducing its complexity and makes it more robust but at the cost of classification accuracy. Note that some weaknesses of this technique have been recently reported [43].
Feature Masking:
Deep learning models can incorporate several layers when performing classification tasks. There are some sensitive features within the model inputs that need to be hidden from the adversary. Researchers found that adding a masking layer before processing the classified model can avoid potential exploitation. The masking layer works on both the original and the adversarial examples of images and calculates the most sensitive features which have the highest weight. Thus, the technique changes the corresponding weights of this additional layer to zero which leads to more privacy [52].
Gradient Hiding:
Some ML techniques such as decision tree, random forest, and K-Nearest Neighbor (K-NN), yield non-differentiable models which make gradient-based attacks ineffective. This inspires researchers to study the effectiveness of hiding the gradient information of models so they cannot be used to craft adversarial examples. This technique is utilized specifically for mitigating gradient-based attacks by causing numerical instabilities and restraining the adversary attempt in gradient estimation [53]. However, both black-box and white-box scenarios can be applied successfully and defeat this defense mechanism. It happens by training a surrogate classifier which is used for gradient estimation and adversarial examples generation due to the transferability characteristic of those examples [42].
Gradient Regularization:
Another defense solution related to gradient is proposed by the researchers in [57,58]. In particular, they used a regularization of input gradient which can be defined as a robustification method for training differentiable Deep Neural Network (DNN) models that integrate a penalty term with the gradient of loss function. In this scenario, the output variation is due to a change in the input in which a small perturbation does not affect the model output. The authors showed also that the interpretability of adversarial perturbations is increased through the adopted regularization. However, it increases the complexity in terms of computational power by a factor of two.
Defensive Distillation:
The basic idea of distillation was proposed for knowledge transfer from large to small networks [83]. This idea is adopted also as a defense mechanism by using the probability distribution vector produced by the first model as input to the second model. The model with large and intensive computations is simulated by a small model without changing the neural network architecture and degrading the accuracy of model performance. This helps in smoothing the training process and enhancing the model’s ability of generalization to become more resilient against adversarial examples. The steps associated with the defensive distillation experiment can be summarized as follows:
  • First Step: Datasets are labeled using the probability vectors produced by the first DNN. The newly produced labels are soft labels which are different from hard labels.
  • Second Step: the second DNN model is trained using either the soft labels or both hard and soft labels.
Due to the knowledge transfer between models, the second model is simplified in terms of size, computational power, and training overhead keeping at the same time the needed robustness [54]. However, the defensive distillation method is not effective against some stacks such as Carlini and Wagner attack [80,81].
Model Verification:
ML models need a verification step to validate any processed input. As such, the model input is assessed to ensure its adherence to the model’s properties and criteria which in turn supports accurate detection of new unseen adversarial examples. One research study demonstrates that this verification is an NP-complete problem and uses the Satisfiability Modulo Theory (SMT) solver to address it. For robustness, it employed Rectified Linear Unit (ReLU) activation function where neural networks are used with its whole architecture without any simplification. This means not considering only limited input regions and avoiding verification of only a problem approximation [55]. Moreover, another research work [56] investigates the local adversarial robustness of DNNs by means of discretization. Their proposed system indicates the consistency of the output label within a specific neighborhood and is then applied through all the network layers.
Model Masking:
The researchers in [76] investigated other defense mechanisms by presenting a noise-augmented classifier to perform masking and ensure robust classification. Accordingly, they employed a very small noise within the logit output of the DNN model. The authors showed that their incorrect logits mislead the attack. It was effective in mitigating the low-distortion attacks and preserving the accuracy of the model.
Universal Perturbation defense method:
The universal perturbation defense method is used in defeating adversarial examples where the ML model architecture includes a perturbation rectifying network (PRN). This network is located before the input layer as a preprocessing layer. Both clean images and images with perturbations are used for training the network. Another perturbations detector is additionally trained to denoise the inputs and extract features used to differentiate between the PRN inputs and outputs [63]. Although the method gives satisfactory results in identifying adversarial samples, the detector can be evaded using some attacks [84].

4.3. Other Defense Techniques

This category covers the techniques of using other supportive models in addition to the main model for robustness reasons. Such techniques include the following:
  • Ensemble Defenses
This technique combines multiple defense strategies to overcome adversarial attacks. The combination can be made either in a parallel or sequential manner for extra protection [42]. In such context, PixelDefend [59] was introduced through the combination of two defensive mechanisms including adversarial detection and input reconstruction [59]. However, the technique exhibits a considerable weakness against attacks. In particular, the transferability property of adversarial examples impacts the effectiveness of PixelDefend [85].
GAN–based Defenses
The idea of a Generative Adversarial Network (GAN) was first coined by Goodfellow et al. [86]. Two neural networks including generator and discriminator were coupled in a competitive manner in a defensive context against both white-box and black-box attacks. A research study suggests the use of GAN to relocate the input image within the generator range through the reduction of reconstruction error before being fed to the classifier. This leads to reducing the possibility of adversarial examples by having the benign data points closer to the generator range [60]. Another study uses a trained GAN in the cleaning process of adversarial examples within the original dataset. Using different attacks and datasets, the adversarial examples are scored lower than the original data points by the GAN’s discriminator [61]. However, it is notable that the ability of interpretation and generation plays a crucial role in the success of GAN. In addition, GAN requires a sufficient level of training to ensure proper performance.
MagNet [62] can be introduced as a framework that relies on two components to defend against adversarial examples. Namely, these two components are a network of detectors and a network of reformers. The first one is used to detect adversarial examples from the original ones. On the other hand, the second component projects the adversarial examples with small perturbations and transforms them into the original ones using an automatic encoder. This technique shows effectiveness against black-box and gray-box attacks while it shows a performance degradation in white-box attacks. Accordingly, using more several encoders with a random selection might mitigate the adversary’s ability in the scenario of white-box attacks.

5. Evaluation Metrics

5.1. Statistical Measures

The first perspective is evaluating the classifiers’ performance using statistical measures widely used in the field of cyber security. This includes the confusion matrix with its related metrics including accuracy, precision, recall, F-measure, ROC (receiver operating characteristic) curve, and AUC (the area under the ROC curve). In terms of the confusion matrix, it contains several values for classification results, either they are correctly or incorrectly classified based on predefined classes. Most pre-mentioned metrics are inferred from the confusion matrix values with different combinations. The model’s accuracy is mainly concerned with its robustness where models with less vulnerability to adversarial examples are preferred. The robustness of a model can be characterized based on two essential elements:
  • High accuracy results are reached when the model is used on training and test datasets.
  • Input’s classification is consistently predicted the same for a given example.
In the context of adversarial examples, the models’ robustness is affected by the perturbation size in which the model with high robustness means it requires a higher possibility of minimum perturbation to cause misclassification.
As such, these metrics cover the required information about the model and ease the process of comparison between related research works [28].

5.2. Security Evaluation Curves

The second perspective is evaluating the security of classifiers using security evaluation curves. Those curves can be used to investigate the classifiers’ performances when there are multiple attacks with different attackers’ levels of strength and knowledge. As such, those graph-based metrics are quite useful for comparing the different defense mechanisms and reflecting their performance level [39].

5.3. Adversarial Examples-Related Measures

The third perspective is evaluating the security of classifiers using adversarial-related measures. Those measures include:
  • Success Rate: It is associated with the process of generating adversarial examples where the increment of success rate relates to a decrease in the perturbation size. This can be seen when a comparison is made between the generative methods of adversarial examples where the iterative gradient sign method (IGSM) and the Jacobian-based Saliency Map Attack (JSMA) method have a higher success rate than the fast gradient sign method (FGSM). The first two methods generate adversarial examples with lower or specific perturbations while the latter one performs large perturbations with the chance of label leaking. Nevertheless, having adversarial examples with a 100% success rate is quite difficult [28].
  • Transfer Rate: It is associated with the transferability characteristic of adversarial examples where those examples can be transferred across different models. As such, the transfer rate is used for measuring which is the ratio of transferred adversarial examples number to the total adversarial example number generated by the main model. Transferability can be classified into targeted or non-targeted transferability where it is measured by matching rate and accuracy rate, respectively. It depends on two factors where the first one is the model parameters that contain its architecture, capacity, and test accuracy. A better transfer rate when it comes to the first factor can be achieved with similar architecture, small capacity, and high accuracy. The second factor is the adversarial perturbation magnitude where the higher perturbation to the original examples leads to a higher transfer rate [28].

This entry is adapted from the peer-reviewed paper 10.3390/app13106001

This entry is offline, you can click here to edit this entry!