1. Introduction
Machine learning (ML) is typically described as a branch of “Artificial Intelligence” that is closely related to data mining, computational statistics and analytics and data science, particularly focusing on allowing systems to learn from historical data
[1][2][67,68]. As a result, machine learning models are often made up of a set of rules, procedures or complex functions and equations. These features can be used to uncover intriguing data patterns, recognize sequences or anticipate behavior
[3][69]. As a result, ML could be useful in the field of cybersecurity.
Machine learning algorithms—in this case, shallow models—are further classified into and supervised learning and unsupervised learning. In supervised learning, the models usually do not have a dependent variable and mostly rely on the internal patterns available in the dataset to group the data into different categories. This can be achieved using different algorithms, such as K-means, Sequential Pattern Mining, DB scan
[4][70] and the a priori algorithm
[5][71]. In supervised learning, the models usually have class labels to verify the predictions. Naïve Bayes, for example, uses probabilistic distribution to identify to which category a class label belongs. Decision trees create a tree-like structure based on a training set. For prediction, once the tree is built, any unknown record can be sorted based on the tree structure. Random forest
[6][72] uses a similar approach, but instead of building one decision tree, it builds multiple decision trees and then uses a voting scheme to classify a record. Because of the collective nature of the decision-making process, random forest usually has higher classification accuracy. Support vector machine (SVM)
[7][73] works by creating a linear decision boundary from the dataset. This can be compared to a binary classification. SVMs are also capable of transforming the data using a kernel trick. This allows SVMs to classify nonlinear datasets as well.
Deep learning models can also classify or cluster algorithms. However, their approach is quite different from machine learning models. Unlike their counterparts, they do not have a fixed algorithm for prediction. Hence, they are also known as black box models because they analyze the data, identify patterns and use the patterns for production. Deep learning models use artificial neural networks, which are built using several perceptrons. These perceptrons are connected in a randomized fashion during the onset of model training. By looking at the data and training over a given time, the perceptrons gain values, also known as weights, that are better suited for classifying the dataset at hand. There are different varieties of deep learning models. Convolutional neural networks find application in classifying image data. They have also been used to classify cybersecurity datasets by transforming the data into a format that resembles an image. Recurrent neural networks (RNN) find application in classifying data that have a temporal aspect. Several improved versions of RNN include LSTM or long short-term memory, as well as Bi-LSTM. Unsupervised learning with deep learning includes autoencoders as well as generative adversarial networks. Autoencoders mostly use feature reduction, where information is transformed into a compressed form before further processing. They are well adapted to compressing information in a meaningful way, thereby increasing the prediction accuracy.
Reinforcement learning explores a different approach of training a model to differentiate between long-term and short-term goals. Agents interact with their environment and, based on their actions, they are rewarded or penalized. The reward is variable, thereby teaching the model to become better. A popular example is DQN
[8][74] or Deep Q Networks. In DQN, the mapping function between states and actions is accomplished using deep learning, thereby reducing the requirement for a large table of Q-learning (TQL). A variant of DQN is QR-DQN
[9][75], which uses quantile regression to model the possible distributions, instead of returning the mean distribution. This can be compared to the difference between decision trees and random forest, summarized above.
2. Stages of a Cyber-Attack
Organizations can assess the cybersecurity risk to them and can identify certain security threats. They can then implement security controls or measures against these threats. They can utilize the National Institute of Standards and Technology (NIST) Special Publications, although they may not be a US federal agency or related contractor
[10][76]. NIST Special Publications provide step-by-step guidance for applying a risk management framework to federal information systems. In this guidance, a set of security issues are identified and common controls or measures against these security issues/threats are listed. In a recent study, machine learning tools were suggested as efficient controls or measures
[11][77]. Such measures can be applicable to all five phases of a cyber-attack.
There are five phases of a cyber-attack. They are reconnaissance, scan, attack (denial-of-service attacks, gain access using application and operating system attacks, network attacks), maintain access (using Trojans, backdoors, rootkits, etc.) and cover tracks and hiding. An interruption at any phase can either interrupt or halt the entire process of attack. Machine learning algorithms can be used in all of these phases to help fight against cyber-attacks by disrupting the attacker’s workflow.
During the reconnaissance or preparation phase of the attack, an adversary uses techniques such as a social engineering attack (phishing, malicious call, etc.). Machine learning algorithms can look for email signatures and detect malicious or phishing email signatures and block them. There are cases when an attacker calls the target organization and impersonates a third party to obtain valuable information (known as “voice phishing” or “vishing”). Call source analysis using machine learning algorithms can flag and block such calls. Another example use of machine learning is scanning any external devices connected to the organization’s property, e.g., a USB device. Such a scan prevents malicious software from propagating through such devices. Another example is when the adversary wishes to guess the access password to obtain unauthorized access (violating confidentiality)
[12][78]. Rule-based machine learning algorithms can detect the most common passwords that are used by the organization’s employees and can recommend a list of unrecommended passwords. This will hinder the reconnaissance step. Such machine learning algorithms can be placed in strategic locations, e.g., key machines and networks.
During the scan phase, sometimes called “Weaponization”, the cyber-attacker or adversary exploits the vulnerabilities of the target system. The attacker uses automated tools, such as Metasploit, AutoSploit and Fuzzers
[12][78]. Machine learning algorithms can be used to automatically scan and find the vulnerabilities by an ethical hacker before the adversary can. For example, a machine learning-based penetration test can be implemented, specifically by integrating the algorithms into the penetration testing tools, e.g., Metasploit. Such algorithms, upon being used by a pen tester, can find novel weaknesses.
Machine learning algorithms are a strong measure against the attacks (phase 3 of cyber-attack). Machine learning algorithms that can be used to provide cybersecurity are linear regression, polynomial regression, logistic regression, naïve Bayes classifier, support vector machine, decision tree, nearest neighbor, clustering, dimensionality reduction, linear discriminant analysis and boosting
[13][4]. The applications of these algorithms as a measure against cybersecurity problems are spam detection (includes phishing), malware detection, denial-of-service attacks (including DDoS) and network anomaly detection. Other forms of attacks are associated with biometric recognition, authentication, identity theft detection and social media analytics. Information leakage detection, APT or advanced persistent threat detection, hidden channel detection and software vulnerability detection are also some modern threats that need addressing.
During phase four of a cyber-attack, malware is used to maintain access by the attacker, e.g., Trojans, backdoors or rootkits. Machine learning algorithms can detect such malware traffic packets when the malware contacts the attacker and vice versa. For example, for malware detection, support vector machines (SVM) are an efficient option
[14][79]. SVM was implemented using Weka to detect Android OS malware (260 samples), using static features analysis. Here, a black box method was used by analyzing the malware’s behavior rather than executing the malware. In the first step, a Python code was used to extract Android application packages’ (APK) features, one package at a time. Both malicious (201 samples) and benign (59 samples) APKs were selected. In the second step, an SVM classifier (Weka and LibSVM classifier) was trained by these features, to identify malware from these APKs. In the testing phase, the used APKs were downloaded from the repositories: Android, Malware Dataset, Kaggle, Drebin, Android Malshare and APKPure. The receiver operating characteristic or ROC curve was used to present the result. An enhancement of this application used the dynamic features of malware, as malware keeps changing its features. An SVM model can be trained to perform binary classification from a set of features of network packets. The trained classifier can detect a DDoS attack by identifying normal vs. abnormal network traffic, especially for IoT devices. Examples of the features used to train machine learning algorithms include the destination IP address, sequence number and minimum, maximum and average packets for each destination IP address, received signal strength indication, network allocation vector, value injection rate and inter-arrival time between consecutive frames, etc. The traffic information was collected by placing sensors at significant points of the network, e.g., at the gateway level, for a traffic session of 15–20 min. This classifier can be used as an extra security layer in IDS.
Another example is the application of various clustering techniques (K-means, DBSCAN and Hierarchical)
[14][79]. Clustering is useful for malware detection, phishing attack detection, spam filtering and detecting the larger family of software bugs known as side-channel attack detection. In
[14][79], both malware and goodware Android APKs were installed in an Android emulator. Then, their resource usage statistics (CPU and RAM) were recorded for all three clustering techniques. For all three clustering algorithms, a total of 217 data instances were used, with 145 for training and 72 for testing. The conclusion was that CPU-RAM usage statistics are not an efficient feature for clustering malware and goodware. The nearest neighborhood (NN) search is used in access control. For example, an NN is used to identify actual vs. forged biometrics (e.g., fingerprints) through classification based on their patterns
[14][79]. The CSD200 model was used as the fingerprint scanner to take 100 samples (10 people, total 100 fingers). MATLAB was used to convert these images into an array or matrix. Such a machine learning algorithm can automatically make decisions as to whether a biometric is forged or original. In
[14][79], decision trees were used, e.g., Iterative dichotomizer 3 (ID 3) and its successor, C4.5, to identify malware efficiently. The dataset used was from the Cardiff University Research Portal. In another research work
[15][80], anomalous services running into the computer systems, both offline and online, were identified using a neural network-based model (NARX-RNN), AI-based multi-perspective SVM, principal component analysis (PCA) and hierarchical process tree-based reinforcement learning techniques.
During phase five or the covering tracks phase, the attacker wishes to confirm that their identification is not being tracked. They employ different techniques, including corrupting machine learning tools’ training data to misidentify their data. The machine learning algorithms themselves can be robust but their training data may not be. Deceptive training data make the algorithm inefficient. This process of forging training data is called adversarial machine learning (AML). The severity is serious for cybersecurity applications. The countermeasures against such polluted data include game theory (non-cooperative game/Nash equilibrium, Zero-Sum Versus Non-Zero Sum Game, simultaneous move vs. sequential game or Bayesian Game)
[16][81]. An example of AML is network traffic classification. Performing deep packet inspection is hard when the traffic payload is encrypted
[17][82]. For such traffic, it is possible that the machine learning classifier (e.g., network scanning detector) is deceived by an adversary, to tag malware or botnet communications or NMap network scanning traffic as benign. It is possible that the adversary can mimic the features of benign traffic and can infer the classification output. What happens if the adversary’s traffic is classified as malicious? The adversary does not obtain any feedback but their traffic will probably be blocked. This will give them an indication that their traffic has been classified as malicious and prompt the adversary to change the traffic signature. Improved machine learning techniques exist that can work as a measure against adversarial attacks
[18][83]. For example, an activation clustering method was introduced that identifies the hidden layer of a deep neural network where an adversarial trigger lies. Using empirical learning algorithms, poisonous data points can be identified when poisoning attacks happen against an SVM.
3. Supervised Learning
Supervised learning relies on useful information in historical labeled data. Supervised learning is performed when targets are predefined to reach from a certain set of inputs, i.e., task-driven approach. Classification and regression methods are the most popular supervised learning techniques
[19][84]. These methods are well known for classifying or predicting the target variable for a particular security threat. For example, classification techniques can be used in cybersecurity to indicate a denial-of-service (DoS) attack (yes, no) or identify distinct labels of network risks, such as scanning and spoofing. Naive Bayes
[20][85], support vector machines (SVM)
[21][86], decision tree
[22][23][87,88], K-nearest neighbors
[24][89], adaptive boosting
[25][90] and logistic regression
[26][91] are some of the most well-known classification techniques in shallow models.
Naive Bayes finds a good amount of use in cybersecurity. The
resea
rcheuthors in
[27][92] used the naive Bayes classifier from the Weka package and KDD’99 data for training and testing. Data were grouped into the four attack types (probe and scan, DoS, U2R and R2L) and their classifier achieved 96%, 99%, 90% and 90% testing accuracy, respectively. The cumulative false positive rate was 3%. The
resea
rcheuthors in
[28][93] developed a framework using a simple form of Bayesian network using the KDD’99 data and used categories to depict different attack scenarios. Solving an anomaly detection problem, the reported results were 97%, 96%, 9%, 12% and 88% accuracy for normal, DoS, R2L, U2R and probe or scan categories, respectively. The false positive rate was not reported but can be inferred to be less than 3%. Naive Bayes was also used as one of the methods in
[29][94] to solve a DoS problem, which attempted to resolve the botnet traffic in filtered Internet Relay Chat (IRC), therefore determining the botnet’s existence and origin. The study conducted used TCP-level data that were collected from 18 different locations on the Dartmouth University campus’ wireless network. This data collection occurred over a span of four months. A filter layer was used to extract IRC data from the network data. Labeling was a challenge so the study utilized simulated data for the experiments. The performance of the Bayesian network showed 93% precision with a false positive rate of 1.39%. C4.5 decision trees were also used for comparison and achieved 97% precision, but the false positive rates were higher, at 1.47% and 8.05%, respectively.
In
[30][95], the
resea
rcheuthors used an SVM classifier to detect DDoS attacks in a software-defined network. Experiments were conducted on the DARPA dataset, comparing the SVM classifier with other standard classification techniques. Although the classifier had higher accuracy, the SVM took more time, which is an obvious flaw. In
[31][96], the
resea
rcheuthors used a least-squares SVM to decrease the training time on large datasets. Using three different feature extraction algorithms, they reduced the number of features from 41 to 19. The data were resampled to have around 7000 instances for each of the five classes of the KDD’99 dataset. Overall, the classification was reported at 99% for DoS, probe or scan, R2L and normal classes and 93% for U2R. Research in
[32][97] utilized a robust SVM, which is a variation of SVM where the discriminating hyperplane is averaged to be smoother and the regularization parameter is automatically determined. Preprocessing training and testing of data was done on the Basic Security Module from the DARPA 1998 dataset. It showed 75% accuracy with no false positives and 100% accuracy with 3% false positives.
In
[33][98], the
resea
rcheuthors utilized decision trees to generate detection rules against denial-of-service and command injection attacks on robotic vehicles. Results showed that different attacks had different impacts on robotic behavior. A decision tree-based intrusion detection system that may change after intrusion by analyzing the behavior data through a decision tree was implemented in
[34][99]. The model was used to prevent advanced persistent threat (APT) attacks, which use social engineering to launch various forms of intrusion attacks. The detection accuracy in their experiments was 84.7%, which is very high for this experiment. Decision trees were also used in
[35][100], where the
resea
rcheuthors replaced the misuse detection engine of SNORT with decision trees. SNORT is a known tool that follows the signature-based approach. The
researcheauthors utilized clustering of rules and then derived a decision tree using a version of an ID3 algorithm. This rule clustering reduced the number of comparisons to determine which rules were triggered by given input data, and the decision tree located the most discriminating features of the data, thereby achieving parallel feature evaluation. This method achieved superior performance to that of SNORT when applied to the 1999 DARPA intrusion detection dataset. The results varied drastically depending on traffic type. The fastest were up 105%, with an average of 40.4% and minimum of 5% above the normal detection speed of SNORT. The number of rules was also increased from 150 to 1581, which resulted in a pronounced speed-up versus SNORT.
Ensemble learning techniques such as random forest (RF) have also been explored in cybersecurity research. Random forest uses multiple decision trees to draw a conclusion and can be more accurate than a single decision tree. In
[36][101], the
resea
rcheuthors employed a random forest algorithm on the KDD dataset for misuse, anomaly and hybrid-network-based intrusion detection. Patterns were created by the random forest and matched with the network for misuse detection. To detect anomalies, the random forest detected novel intrusions via outliers. Using the patterns built by the model, new outliers were also discovered. The study implemented a full system solution including anomaly detection. Data were grouped into majority attacks and minority attacks. This hybrid system achieved superior performance for misuse detection, where the error rate on the original dataset was 1.92%, and it was 0.05% for the balanced dataset.
Regression algorithms are useful for forecasting a continuous target variable or numeric values, such as total phishing attacks over a period of time or network packet properties. In addition to detecting the core causes of cybercrime, regression analysis can be utilized for various risk analysis forms
[37][102]. Linear regression
[1][67], support vector regression
[21][86] and random forest regressor are some of the popular regression techniques. The main distinction between classification and regression is that, in regression, the output variable is numerical or continuous, but in classification, the projected output is categorical or discrete. Ensemble learning is an extension of supervised learning that mixes different shallow models, e.g., XGBoost and random forest learning
[6][72], to complete a specific security task.
4. Unsupervised Learning
The main goal of unsupervised learning, also known as data-driven learning, is to uncover patterns, structures or relevant information in unlabeled data
[38][121]. Risks such as malware can be disguised in a variety of ways in cybersecurity, including changing their behavior dynamically to escape detection. Clustering techniques, which are a form of unsupervised learning, can aid in the discovery of hidden patterns and insights in datasets, as well as the detection of indicators of sophisticated attacks. In addition, clustering algorithms can effectively spot anomalies and policy violations, recognizing and eliminating noisy occurrences in data, for example. K-means
[39][122] and K-medoids
[40][123] are clustering algorithms used for partitioning, while single linkage
[41][124] and complete linkage
[42][125] are some of the most widely utilized hierarchical clustering methods in various application sectors. Furthermore, well-known dimensionality reduction techniques such as linear discriminant analysis (LDA), principal component analysis (PCA) and Pearson correlation, as well as positive matrix factorization, can be used to handle such problems
[1][67]. Another example is association rule mining, in which machine learning-based policy rules can learn to avert cyber incidents. The rules and logic of an expert system are normally manually documented and implemented by a knowledge engineer working with a domain expert
[38][43][44][30,121,126]. In contrast, association rule learning identifies the rules or associations among a set of security aspects or variables in a dataset
[45][127]. Different statistical analyses are performed to quantify the strength of associations
[37][102]. In the domain of machine learning and data mining, various association rule mining methods have been presented, such as tree-based
[46][128], logic-based
[47][129], frequent pattern-based
[48][49][50][130,131,132], etc. Moreover, a priori
[48][130], a priori-TID and a priori-Hybrid
[48][130], AIS
[45][127], FP-Tree
[46][128], RARM
[51][133] and Eclat
[52][134] have been used extensively for association rule learning. These algorithms are able to resolve such difficulties by creating a set of cybersecurity policy rules.
The
resea
rcheuthors in
[53][104] applied sequential pattern mining on the DARPA 1999 and 2000 data to reduce the redundancy of alerts and minimize false positives. Using the a priori algorithm, they discovered attack patterns and created a pattern visualization for the users with a support threshold of 40%. They were able to detect 93% of the attacks in twenty seconds. The study reported a real-time scenario with 84% attack detection, with the main variation coming from the support threshold. In
[54][105], DBSCAN was used as a clustering method to group normal versus anomalous network packets in the KDD’99 dataset. The dataset was preprocessed and features were selected using correlation analysis.
TIn this study, the performance was shown at 98% for attack or no-attack detection. The
researcheauthors in
[55][106] took a different data mining approach to create an intrusion detection system called ADMIT. This system does not rely on labeled data; rather, it builds user profiles incrementally using a dynamic clustering technique. It uses sequences, which are a list of tokens collected from users’ data, to build a profile. These sequences are classified as normal or anomalous based on using a modified form of K-means clustering. They set a value of 20 for the maximum sequence length, which resulted in 80% performance accuracy with a 15% false positive rate. A unique algorithm based on the signature a priori algorithm was used to find new signatures of attacks from existing attack signatures, proposed in
[56][107]. Here, the
resea
rcheuthors compared their algorithm processing time to that of the a priori and found that their algorithm was much faster.
5. Artificial Neural Networks (ANN)
ANN is a segment of machine learning, in the area of Artificial Intelligence, that is a computationally complex model inspired by the biological neural networks in the human brain
[1][67]. The basic idea behind ANN was first introduced in the 1940s, with ANNs becoming a popular idea and technology in the late 1970s and continuing into the 1990s. Although SVMs were prominent in the 1990s, overshadowing the ANNs, they have received popularity recently and are steadily increasing in use. ANNs consist of multiple neurons that work together in layers to extract information from the dataset. The primary difference is the performance of ANN versus shallow machine learning as the amount of security data grows. The number of studies of ANN-based intrusion detection systems has increased rapidly from 2015 to the present. In
[57][135], the
resea
rcheuthors utilized ANNs to detect misuse. Using data generated by a RealSecure network monitor, ten thousand events were collected, of which 3000 were simulated attacks by programs. Preprocessing of the data was performed and ten percent of the data were selected randomly for testing, with the rest used for training the ANN. The error rates for training and testing were 0.058 and 0.070, respectively. Each packet was categorized as normal or attack. A combination of keyword selection and ANN was proposed in
[58][136] to improve IDS. Keyword selection was performed on Telnet sessions and statistics were derived from the number of times that the keywords occurred (from a predefined list). These statistics were given as input to the ANN and the output identified the probability of an attack. The
resea
rcheuthors in
[59][137] compared fuzzy logic and artificial neural networks to develop comprehensive intrusion detection systems and tested them using the KDD’99 dataset. They presented a detailed discussion on the preprocessing, training and validation of the proposed approach. The “class” characteristic in this dataset, which is made up of around 5 million data instances with 42 properties, determines whether a particular instance is a typical connection instance or one of the four types of attacks that need to be recognized. Five different machine learning approaches were compared, among which the FC-ANN-based approach
[60][138] and the hierarchical SOM-based approach
[61][139] were the best detectors.
Deep learning models, which are a form of ANN, learn feature representations directly from the original data, such as photos and texts, without the need for extensive feature engineering. As a result, deep learning algorithms are more effective and need less data processing. For large datasets, deep learning methods have a significant advantage over classical machine learning models. Some of the widely used deep learning techniques in cybersecurity include deep belief networks (DBNs), convolutional neural networks (CNNs) and recurrent neural networks (RNNs) as supervised learning models. Several variants of autoencoders, restricted Boltzmann machines (RBMs) and generative adversarial networks (GANs) have been used with success for unsupervised learning. The
resea
rcheuthors in
[62][140] used DBNs to detect malware with an accuracy of 96.1%. The DBNs used unsupervised learning to discover layers of features and then used a feed-forward neural network to optimize discrimination. DBNs can learn from unlabeled data, so, in the experiments, DBNs provided a better classification result than SVM, KNN and decision tree. DBNs were also used in
[63][141] to create an IDS. The proposed four-layer model was used on the KDD’99 Cup dataset, where the
resea
rcheuthors reported accuracy, precision and false acceptance rate (FAR) of 93.49%, 92.33% and 0.76%, respectively. In
[64][142], a DBN-based ad hoc network intrusion detection model was developed with an experiment on the Network Simulator (NS2) platform. The experiment showed that this method can be added to the ad hoc network intrusion detection technology. Accuracy and FAR were reported as 97.6% and 0.9%, respectively. A deep learning approach called DeepFlow was proposed in
[65][143] to directly detect malware in Android applications. The architecture consisted of three components for feature extraction, feature coarse granularity and classification. Two modules were used to assess malware sources from the Google Play Store. Experiments showed that DeepFlow outperformed SVM, ML-based algorithms and multi-layer perceptron (MLP).