Blackbox Attack: In blackbox attacks, the attacker knows nothing about the internals of the ML system and has access to only two types of information. The first is the hard label, where the adversary obtains only the classifier's predicted label; the second is confidence, where the adversary obtains the predicted label along with its confidence score. The attacker uses information gathered from past inputs to understand the vulnerabilities of the model [17]. Some blackbox attacks are discussed in later sections. Blackbox attacks can be further divided into three categories:
- Non-Adaptive Blackbox Attack: In this category of blackbox attack, the adversary knows the distribution of the training data of the target model, T. The adversary chooses a training procedure, P, for a selected local model, T′, and trains T′ on data from the known distribution so that it approximates the already learned T, in order to trigger misclassification using whitebox strategies.
- Adaptive Blackbox Attack: In an adaptive blackbox attack, the adversary has no knowledge of the training data distribution or the model architecture. Instead, the attacker treats the target model, T, as an oracle. The attacker generates a dataset whose labels are obtained by adaptively querying the oracle, then chooses a training procedure, P, and a local model, T′, to be trained on this oracle-labelled dataset. Adversarial instances crafted against T′ with whitebox attacks are then used to trigger misclassification by the target model T [17][24].
- Strict Blackbox Attack: In this category, the adversary has no access to the training data distribution but may have a labelled dataset (x, y) collected from the target model, T. The adversary can perturb the inputs and observe the corresponding changes in the output. This attack succeeds only if the adversary has collected a large set of (x, y) pairs [17][18].
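The adaptive strategy above can be sketched end to end: query the oracle for labels, train a local substitute, then craft perturbations on the substitute and hope they transfer. This is a minimal illustration; the dataset, model choices, query budget, and the single-step linear perturbation are assumptions for the sketch, not details from the cited works.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Target model T: visible to the adversary only as a label oracle.
X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)
target = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                       random_state=0).fit(X, y)

# Step 1: query the oracle to label a synthetic dataset.
X_query = rng.uniform(low=[-2.0, -1.5], high=[3.0, 2.0], size=(500, 2))
y_query = target.predict(X_query)  # hard labels only

# Step 2: train a local substitute model T' on the oracle-labelled data.
substitute = LogisticRegression().fit(X_query, y_query)

# Step 3: whitebox-style step on the substitute -- push each point across
# the substitute's linear boundary, hoping the perturbation transfers to T.
w = substitute.coef_[0]
eps = 0.5
direction = np.where(substitute.predict(X) == 1, -1.0, 1.0)[:, None]
X_adv = X + eps * np.sign(w) * direction

acc_clean = target.score(X, y)
acc_adv = target.score(X_adv, y)
```

Even with a crude linear substitute, the perturbations computed against T′ degrade the target model's accuracy, which is the essence of the transfer-based attack.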
Graybox attacks: In whitebox attacks, the adversary is fully informed about the target model, i.e., it has access to the model framework, data distribution, training procedure, and model parameters, while in blackbox attacks the adversary has no knowledge of the model. The graybox attack is an extended version of either the whitebox or the blackbox attack. In the extended whitebox attack, the adversary has partial knowledge of the target model setup; e.g., the model architecture, T, and the training procedure, P, are known, while the data distribution and parameters are unknown. In the extended blackbox attack, on the other hand, the adversarial model is only partially trained and has a different model architecture and, hence, different parameters [25].
2.2. Anatomy of Cyberattacks
To build any machine learning model, data must be collected and processed, and the model trained and tested before it can be used to classify new data. The system that carries out this sequence of data collection, processing, training, and testing can be thought of as a generic AI/ML pipeline, termed the attack surface [17]. An attack surface subjected to adversarial intrusion may face poisoning, evasion, and exploratory attacks. These attacks exploit the three pillars of information security, i.e., Confidentiality, Integrity, and Availability, known as the CIA triad [26]. The integrity of a system is compromised by poisoning and evasion attacks, confidentiality is subject to intrusion by extraction attacks, while availability is vulnerable to poisoning attacks. The entire AI pipeline, along with the possible attacks at each step, is shown in Figure 1.
Figure 1. ML Pipeline with Cyberattacks Layout.
2.3. Poisoning Attack
A poisoning attack occurs when the adversary contaminates the training data. ML systems, such as intrusion detection systems, are often retrained on newly collected data. In this type of attack, the adversary cannot access the existing training dataset directly, but poisons the data by injecting new data instances during model training [27][28][29]. In general, the objective of the adversary is to compromise the AI system so that it misclassifies objects.
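A minimal sketch of injection-based poisoning, under the hypothetical setup that the adversary can only add new (mislabelled) instances before retraining, not read or edit the existing training set; the dataset and one-sided label-flipping scheme are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The defender's model, trained on the clean data.
clean_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# One-sided label flipping: inject copies of class-0 points labelled as class 1.
zeros = X_tr[y_tr == 0]
X_poison = np.vstack([X_tr, zeros])
y_poison = np.concatenate([y_tr, np.ones(len(zeros), dtype=int)])
poisoned_model = LogisticRegression(max_iter=1000).fit(X_poison, y_poison)

# The poisoned model is biased towards predicting class 1, so its
# accuracy on genuine test data degrades.
rate_clean = (clean_model.predict(X_te) == 1).mean()
rate_poisoned = (poisoned_model.predict(X_te) == 1).mean()
acc_clean = clean_model.score(X_te, y_te)
acc_poisoned = poisoned_model.score(X_te, y_te)
```

The one-sided flip deliberately skews the learned decision boundary; symmetric label noise would be less effective, since it mostly shrinks confidence without moving the boundary.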
Poisoning attacks can result from poisoning either the training dataset or the trained model [1]. Adversaries can attack at the data source, i.e., the platform from which a defender extracts its data, or can compromise the defender's database. They can also substitute a genuine model with a tainted one. Poisoning attacks can further exploit the limitations of the underlying learning algorithms. This happens in federated learning scenarios, where the privacy of each individual user's dataset is maintained [30]. The adversary takes advantage of this weakness of federated learning and may take control of both the data and the algorithm on an individual user's device to degrade the performance of the model on that device [31].
2.4. Model Inversion Attack
The model inversion attack is a way to reconstruct the training data, given the model parameters. This type of attack is a privacy concern because of the growing number of online model repositories. Several studies of this attack have been conducted under both blackbox and whitebox settings. Yang et al. [32] discussed the model inversion attack in the blackbox setting, where the attacker aims to reconstruct an input sample from the confidence score vector produced by the target model. They demonstrated that it is possible to reconstruct specific input samples from a given model. They trained an inversion model on an auxiliary dataset, which functioned as the inverse of the given target model; it took the confidence scores of the target model as input and tried to reconstruct the original input data. They also demonstrated that their inversion model showed substantial improvement over previously proposed models. In the whitebox setting, on the other hand, Fredrikson et al. [33] proposed a model inversion attack that produces only a representative sample of a training class, instead of reconstructing a specific input sample, using the confidence score vector produced by the target model. Several related studies proposed inferring sensitive attributes [33][34][35][36] or statistical information [37] about the training data by developing an inversion model. Hitaj et al. [18] explored inversion attacks in federated learning, where the attacker has whitebox access to the model.
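The blackbox inversion idea can be illustrated with a toy pipeline in the spirit of the approach described above: train an inversion regressor on an auxiliary dataset that maps confidence vectors back to inputs, then apply it to confidence vectors of private samples. The dataset split, model families, and sizes here are assumptions for the sketch, not the cited authors' actual architecture.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Target classifier whose confidence scores are the only thing exposed.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixels to [0, 1]
X_priv, X_aux, y_priv, y_aux = train_test_split(X, y, test_size=0.5,
                                                random_state=0)
target = LogisticRegression(max_iter=2000).fit(X_priv, y_priv)

# Attacker trains an inversion model on auxiliary data:
# 10 confidence scores -> reconstructed 8x8 image (64 pixels).
conf_aux = target.predict_proba(X_aux)
inverter = MLPRegressor(hidden_layer_sizes=(128,), max_iter=500,
                        random_state=0).fit(conf_aux, X_aux)

# Reconstruct private samples from their confidence vectors alone.
conf_priv = target.predict_proba(X_priv)
X_rec = inverter.predict(conf_priv)

mse = np.mean((X_rec - X_priv) ** 2)
# Baseline: always predicting the global mean image.
baseline = np.mean((X_priv - X_priv.mean(axis=0)) ** 2)
```

Reconstructions driven only by confidence scores beat the predict-the-mean baseline, which is exactly the leakage the attack exploits: the score vector carries recoverable information about the input.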
Several defense strategies against the model inversion attack have been explored, including the L2 regularizer [38], dropout and model stacking [39], MemGuard [40], and differential privacy [41]. These defense mechanisms are also well known for reducing overfitting in the training of deep neural network models.
2.5. Model Extraction Attack
A machine learning model extraction attack arises when an attacker obtains blackbox access to the target model and is successful in learning another model that closely resembles. or is exactly the same as, the target model. Reith et al.
[42][54] discussed model extraction against the support vector regression model. Juuti et al.
[43][127] explored neural networks and showed an attack, in which an adversary generates queries for DNNs with simple architectures. Wang et al., in
[44][128], proposed model extraction attacks for stealing hyperparameters against a simple architecture similar to a neural network with three layers. The most elegant attack, in comparison to the others, was shown in
[45][129]. They showed that it is possible to extract a model with higher accuracy than the original model. Using distillation, which is a technique for model compression, the authors in
[46][47][130,131], executed model extraction attacks against DNNs and CNNs for image classification.
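The core query-and-imitate loop common to these attacks can be sketched in a few lines: label random queries with the victim, fit a surrogate on the (query, label) pairs, and measure how often the two models agree ("fidelity"). The victim/surrogate model families and the query distribution here are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=1)
victim = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                       random_state=1).fit(X, y)

# The attacker only needs query access: label synthetic points with the victim.
rng = np.random.default_rng(1)
X_query = rng.normal(size=(5000, 5)) * X.std(axis=0) + X.mean(axis=0)
y_query = victim.predict(X_query)

# Fit a surrogate ("stolen") model on the victim-labelled queries.
stolen = DecisionTreeClassifier(random_state=1).fit(X_query, y_query)

# Fidelity: agreement between stolen and victim models on fresh inputs.
X_fresh = rng.normal(size=(2000, 5)) * X.std(axis=0) + X.mean(axis=0)
fidelity = (stolen.predict(X_fresh) == victim.predict(X_fresh)).mean()
```

Note that the surrogate need not share the victim's architecture: here a decision tree imitates a neural network, yet reaches high agreement with enough queries.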
To defend against model extraction attacks, the authors in [22][48][49] proposed either hiding or adding noise to the output probabilities while keeping the class labels of the instances intact. However, such approaches are not very effective against label-based extraction attacks. Several others have proposed monitoring the queries and differentiating suspicious queries from the rest by analyzing the input distribution or the output entropy [43][50].
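A minimal sketch of the "perturb the probabilities, keep the label" defense family; the specific multiplicative-noise scheme and the argmax-swap repair below are assumptions for illustration, not the exact mechanisms of the cited works:

```python
import numpy as np

def defended_scores(probs, scale=0.3, rng=None):
    """Perturb a confidence vector while preserving the predicted label.

    Hypothetical sketch: multiply each score by lognormal noise,
    renormalize, and restore the original argmax if the noise moved it.
    """
    rng = rng or np.random.default_rng(0)
    top = int(np.argmax(probs))
    noisy = probs * np.exp(rng.normal(0.0, scale, size=probs.shape))
    noisy /= noisy.sum()
    new_top = int(np.argmax(noisy))
    if new_top != top:
        # Swap the two entries so the served label is unchanged.
        noisy[top], noisy[new_top] = noisy[new_top], noisy[top]
    return noisy

p = np.array([0.6, 0.3, 0.1])
q = defended_scores(p)
```

The returned vector still sums to one and yields the same predicted class, but its exact values no longer reveal the model's true confidence geometry. This is precisely why such defenses fail against label-only extraction: the argmax, which is all a label-based attacker uses, is untouched by design.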
2.6. Inference Attack
Machine learning models have a tendency to leak information about the individual data records on which they were trained. Shokri et al. [38] discussed the membership inference attack, in which, given a data record and blackbox access to a model, one can determine whether the record was part of the model's training dataset. According to them, this is a privacy concern: if the adversary can learn from the model whether a record was used in training, the model is considered to be leaking information. The concern is paramount, as such a privacy breach affects not only a single observation but the entire population, due to the high correlation between the covered and the uncovered datasets [51]. This is particularly true when the model is based on statistical facts about the population.
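The intuition behind membership inference is that an overfit model is systematically more confident on its training members than on non-members. A simple confidence-thresholding sketch (a deliberately simplified stand-in for the shadow-model attack of the cited work; the dataset and threshold rule are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Members are training records; non-members are held out.
X, y = make_classification(n_samples=600, n_features=20, random_state=2)
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5,
                                            random_state=2)
model = RandomForestClassifier(n_estimators=50, random_state=2).fit(X_in, y_in)

conf_in = model.predict_proba(X_in).max(axis=1)    # members
conf_out = model.predict_proba(X_out).max(axis=1)  # non-members

# Thresholding attack: guess "member" when the top-class confidence is high.
thresh = np.concatenate([conf_in, conf_out]).mean()
attack_acc = 0.5 * ((conf_in > thresh).mean() + (conf_out <= thresh).mean())
```

Anything above 0.5 balanced accuracy means the model leaks membership information; the more the model overfits, the larger the confidence gap and the stronger the attack.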
Studies in [52][53][54] focused on attribute inference attacks, where an attacker gains access to a set of mostly public data about a target user and aims to infer the user's private information. The attacker first collects information from users who are willing to disclose it publicly, and then uses this information as a training dataset to learn a machine learning classifier that takes a user's public data as input and predicts the user's private attribute values.
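The attack reduces to ordinary supervised learning on the disclosing users. A toy sketch with synthetic data; the correlation structure between public features and the private attribute is a fabricated assumption purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 2000

# Hypothetical users: a binary private attribute that shifts the
# distribution of four publicly visible features.
private = rng.integers(0, 2, n)
public = rng.normal(size=(n, 4)) + 1.5 * private[:, None]

# Attacker trains on users who disclosed both public and private data...
attack = LogisticRegression().fit(public[:1500], private[:1500])

# ...then infers the private attribute of targets from public data alone.
inferred = attack.predict(public[1500:])
acc = (inferred == private[1500:]).mean()
```

The attack works exactly because public behaviour correlates with the private attribute; no access to the targets' private data is ever needed.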
In terms of potential defense mechanisms, the methods proposed in [55][56] leveraged heuristic correlations between the entries of the public data and the attribute values to defend against attribute inference attacks. They proposed modifying, for any given target user, the k identified entries that have the largest correlations with the attribute values, where k is used to control the privacy-utility trade-off. This also addresses the membership inference attack.
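A blunt sketch of the correlation-based idea: identify the k public entries most correlated with the private attribute and suppress them (here, by replacing them with their column means, which is a simplifying assumption rather than the cited papers' exact modification rule).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 2000
private = rng.integers(0, 2, n)
public = rng.normal(size=(n, 4))
# Two public features strongly correlated with the private attribute, two not.
public[:, 0] += 2.0 * private
public[:, 1] += 2.0 * private

def sanitize(pub, priv, k=2):
    """Replace the k entries most correlated with the private attribute
    by their column means; k controls the privacy-utility trade-off."""
    corr = np.abs([np.corrcoef(pub[:, j], priv)[0, 1]
                   for j in range(pub.shape[1])])
    top_k = np.argsort(corr)[-k:]
    out = pub.copy()
    out[:, top_k] = out[:, top_k].mean(axis=0)
    return out

def attack_accuracy(pub):
    clf = LogisticRegression().fit(pub[:1500], private[:1500])
    return clf.score(pub[1500:], private[1500:])

acc_raw = attack_accuracy(public)
acc_sanitized = attack_accuracy(sanitize(public, private))
```

After sanitization, the inference classifier is left with only the uncorrelated features and falls to near-chance accuracy, while the untouched entries preserve the data's utility for other purposes.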