Cyberspace has become indispensable to every area of modern life, and the world depends increasingly on the internet for everyday activities. This growing dependency has widened exposure to malicious threats, making cybersecurity a pivotal means of countering cyber threats, attacks, and fraud. The objective of this survey is to provide a brief review of machine learning (ML) techniques and of the developments made in detection methods for potential cybersecurity risks. These detection methods mainly comprise fraud detection, intrusion detection, spam detection, and malware detection. In this review paper, we build upon the existing literature on applications of ML models in cybersecurity and provide a comprehensive review of ML techniques in this field. To the best of our knowledge, this is the first attempt to compare the time complexity of commonly used ML models in cybersecurity. We comprehensively compare each classifier's performance on frequently used datasets and sub-domains of cyber threats. This work also provides a brief introduction to machine learning models and to commonly used security datasets. Despite its importance, cybersecurity has its constraints, compromises, and challenges. This work therefore also discusses the current challenges and limitations faced when applying machine learning techniques in cybersecurity.
Researchers are investigating machine learning techniques to detect different cybercrimes. Section 2 provides a detailed discussion of various cyber threats and a brief overview of frequently used security datasets. This section provides a comprehensive survey of each ML model applied to different cyber threats. The following paragraphs describe each column of Tables 1 to 6. The ML technique column names the machine learning model considered. We consider six ML models in this study: random forest, support vector machine, naïve Bayes, decision tree, artificial neural network, and deep belief network.
Table 1. Evaluation of SVM in Cybersecurity.

Table 2. Evaluation of Decision Tree in Cybersecurity.

Table 3. Evaluation of DBN in Cybersecurity.

Table 4. Evaluation of ANN in Cybersecurity.

Table 5. Evaluation of Random Forest in Cybersecurity.

Table 6. Evaluation of Naïve Bayes in Cybersecurity.

Tables 1 to 6 share the columns below. [Table rows are not recoverable from the extracted text.]

ML Technique | Domain | Dataset | Reference | Year | Approach/Sub-domain | Accuracy | Precision | Recall |
---|---|---|---|---|---|---|---|---|
We focus on three critical cyber threats, namely intrusion detection, spam detection, and malware detection. The domain column states which of these threats a study addresses. The reference and year columns give the citation number of each article and its year of publication, respectively. The values of the approach or sub-domain column differ for each cyber threat. The IDS domain has three values: anomaly-based, signature/misuse-based, and hybrid-based. Malware has three sub-classifications: static, dynamic, and hybrid. For spam, the sub-domain corresponds to the medium in which the authors tried to identify spam, such as image, video, email, SMS, and tweets. A description of each sub-domain/approach is provided in Section 2. Finally, the results columns present the evaluation (accuracy, precision, and recall) of each classifier applied to a particular sub-domain of a cyber threat on a specific dataset, as reported in the paper cited in the reference column.
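The three result measures can be made concrete with a short sketch. The function name `evaluate` and the toy label vectors below are illustrative assumptions and do not come from any of the surveyed papers; the formulas are the standard definitions based on true/false positives and negatives.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary detector."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted/labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 1 = attack, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
acc, prec, rec = evaluate(y_true, y_pred)
print(acc, prec, rec)
```

A detector can score high accuracy on an imbalanced dataset while missing most attacks, which is why the surveyed tables report precision and recall alongside accuracy.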
The principal advantage of the support vector machine (SVM) is that it produces the most successful results for cybersecurity tasks. SVM places the data classes on opposite sides of a hyperplane and separates them based on the notion of the margin. Support vectors are the points that lie on the boundary of the margin around the hyperplane. The major drawback of SVM is that it consumes an immense amount of space and time. SVM also requires training on data from different time intervals to produce better results on a dynamic dataset [88].
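The hyperplane-and-margin idea can be illustrated with a minimal linear SVM. This is a sketch under stated assumptions, not the method of any cited paper: it trains with plain sub-gradient descent on the regularized hinge loss (a technique chosen here for brevity), and the function names, hyperparameters, and toy points are all hypothetical.

```python
import random


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100, seed=0):
    """Sub-gradient descent on the L2-regularized hinge loss; labels are -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Functional margin of sample i with respect to the current hyperplane.
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Regularization step: shrink w, which favors a wider margin.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Only samples inside the margin (or misclassified) move the hyperplane.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def predict(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data (+1 = attack, -1 = benign).
X = [(2, 2), (3, 3), (2, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3), (-3, -2)]
y = [1, 1, 1, 1, -1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print(w, b, predict(w, b, (4, 4)))
```

Kernel SVMs, which the space and time drawback mainly concerns, additionally store support vectors and evaluate a kernel against each of them at prediction time; that part is omitted here.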
SVM showed an accuracy of 99.30% with the KDD Cup 99 dataset for IDS [8]. The best reported accuracy is 96.92% for malware detection using the Enron dataset [16] and 96.90% for classifying spam emails with Spambase [19]. The best reported recall for SVM is 82% for intrusion detection [3], 100% for malware [15], and 98.60% for spam [17]. The best reported precision for SVM is 74% for intrusion detection [24], 96.16% for malware [11], and 98.60% for spam [17]. A detailed performance comparison of SVM across cyber threats on frequently used datasets is presented in Table 1.
Decision tree (DT) belongs to the category of supervised machine learning. A DT consists of paths and two kinds of nodes: root/intermediate nodes and leaf nodes. A root or intermediate node represents an attribute, and each path leaving it corresponds to a possible value of that attribute. A leaf node represents the final decision or classification class. A decision tree finds the best next node by following if-then rules [89].

The reported accuracy of DT for anomaly-based IDS with the KDD dataset is 99.96% [23]. With the standard SMOTE dataset, DT shows an accuracy of 96.62% for malware detection [38]. With the Enron dataset, DT correctly classified ham emails with an accuracy of 96% [15]. The best reported recall for DT is 98.10% for intrusion detection [24], 96.70% for malware [34], and 96.60% for spam [17]. The best reported precision for DT is 99.70% for intrusion detection [24], 99.40% for malware [32], and 98% for spam [15]. A detailed performance comparison of DT across cyber threats on frequently used datasets is presented in Table 2.
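The node-and-path structure described for decision trees can be sketched as a nested dictionary that is walked with if-then lookups. The attributes (`protocol`, `failed_logins`), their values, and the class labels are hypothetical, and the sketch covers only classification with an already built tree, not how the tree is learned from data.

```python
# Intermediate nodes are dicts naming an attribute; leaves are class labels.
tree = {
    "attribute": "protocol",
    "branches": {
        "tcp": {
            "attribute": "failed_logins",
            "branches": {"high": "attack", "low": "normal"},
        },
        "udp": "normal",
        "icmp": "attack",
    },
}


def classify(node, record):
    """Follow the path matching the record's attribute values until a leaf is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]  # if attribute == value ...
        node = node["branches"][value]     # ... then follow that path
    return node


print(classify(tree, {"protocol": "tcp", "failed_logins": "high"}))
```

Each root-to-leaf path reads as one if-then rule, for example "if protocol is tcp and failed_logins is high, then attack".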
A deep belief network (DBN) consists of several stacked layers of restricted Boltzmann machines (RBMs) that are trained greedily, one layer at a time. Every layer communicates with the layer before it and the layer after it, and there is no lateral communication between the nodes within a layer. Every layer except the first and the last serves as both an input layer and an output layer. The last layer functions as a classifier. DBNs are primarily used for image clustering and image recognition, and they have also been applied to motion capture data.

DBN has shown an accuracy of 97.50% for IDS [42], 91.40% for malware detection [90], and 97.43% for spam detection [91] with the KDD, KDD CUP99, and Spambase datasets, respectively. The best reported recall for DBN is 99.70% for intrusion detection [45], 98.80% for malware [47], and 98.02% for spam [50]. The best reported precision for DBN is 99.20% for intrusion detection [45], 95.77% for malware [48], and 98.39% for spam [50]. A detailed performance comparison of DBN across cyber threats on frequently used datasets is presented in Table 3.
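The layer-to-layer connectivity of a DBN can be sketched as a deterministic upward pass in which each layer's output becomes the next layer's input. This is only an illustration of the structure: the greedy RBM pre-training is deliberately omitted, and the function names, layer sizes, and weights below are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """weights[j][i] links input unit i to output unit j.

    There are no weights between units of the same layer,
    matching the absence of lateral connections.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def dbn_forward(visible, layers):
    """Pass activations upward; each layer's output is the next layer's input."""
    activation = visible
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation


# Hypothetical 3-2-1 stack with made-up weights (not trained).
layers = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1]),
    ([[1.2, -0.7]], [0.05]),
]
output = dbn_forward([1.0, 0.0, 1.0], layers)
print(output)
```

In an actual DBN, each `(weights, biases)` pair would first be fitted as an RBM on the activations of the layer below before the top layer is trained as a classifier.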
An artificial neural network (ANN) classifier consists of input, hidden, and output layers of neurons and operates in two stages. The first stage is called feedforward. In this stage, each hidden layer receives the outputs of the preceding layer and applies its activation function, and the error is calculated at the output. In the second stage, the feedback stage, the error is sent back toward the input layer, and the process is repeated iteratively until the correct result is obtained [60].

ANN showed an accuracy of 97.53% for IDS [53], 92.19% for malware detection [58], and 92.41% for spam detection with the NSL-KDD, VX Heavens, and Spambase datasets, respectively. The best reported recall for ANN is 98.94% for intrusion detection [55] and 94% for spam [62]. The best reported precision for ANN is 97.89% for intrusion detection [55], 88.89% for malware [57], and 95% for spam [65]. A detailed performance comparison of ANN across cyber threats on frequently used datasets is presented in Table 4.
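The two stages can be sketched with a tiny network that has one hidden layer. Everything below is an illustrative assumption rather than the setup of a cited study: the function name `train_ann`, the sigmoid activations, the squared-error measure, the hyperparameters, and the toy OR-style data.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_ann(samples, hidden=3, lr=0.5, epochs=2000, seed=1):
    """Train a one-hidden-layer network; returns (predict, per-epoch mean squared error)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b_h = [0.0] * hidden
    w_o = [rng.uniform(-1, 1) for _ in range(hidden)]
    b_o = 0.0

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w_h, b_h)]
        out = sigmoid(sum(w * hj for w, hj in zip(w_o, h)) + b_o)
        return h, out

    history = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            # Stage 1 (feedforward): compute the output and its error.
            h, out = forward(x)
            err = out - target
            total += err * err
            # Stage 2 (feedback): send the error back and adjust the weights.
            delta_o = err * out * (1.0 - out)
            delta_h = [delta_o * w_o[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                w_o[j] -= lr * delta_o * h[j]
                b_h[j] -= lr * delta_h[j]
                for i in range(n_in):
                    w_h[j][i] -= lr * delta_h[j] * x[i]
            b_o -= lr * delta_o
        history.append(total / len(samples))
    return (lambda x: forward(x)[1]), history


# Hypothetical toy task: flag a record if either indicator is set (logical OR).
OR_SAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
predict, history = train_ann(OR_SAMPLES)
print(history[0], history[-1])
```

In practice, training stops when the error falls below a tolerance or a maximum number of iterations is reached, rather than when the result is exactly correct.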
Random forest (RF) performs its task by combining the predictions generated by multiple decision trees. RF forms a hypothesis to obtain a result [91]. RF falls under the category of ensemble learning and is also termed random decision forest. RF is considered an improved version of CART, which is a type of decision tree.
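The combination step can be sketched as a majority vote over individual trees. The three hand-written "trees", their features (`links`, `caps_ratio`, `has_attachment`), and the thresholds are hypothetical; a full random forest would also train each tree on a bootstrap sample with a random subset of features, which is omitted here.

```python
from collections import Counter


# Three hypothetical, already trained trees, each reduced to a single rule.
def tree_a(record):
    return "spam" if record["links"] > 3 else "ham"


def tree_b(record):
    return "spam" if record["caps_ratio"] > 0.5 else "ham"


def tree_c(record):
    return "spam" if record["links"] > 1 and record["has_attachment"] else "ham"


def forest_predict(trees, record):
    """Combine the individual tree predictions by majority vote."""
    votes = Counter(tree(record) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_a, tree_b, tree_c]
email = {"links": 5, "caps_ratio": 0.2, "has_attachment": True}
print(forest_predict(trees, email))
```

Using an odd number of trees on a two-class problem avoids tied votes; the individual trees can disagree (here `tree_b` votes "ham") while the ensemble still returns a single label.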
RF has shown an accuracy of 99.95% for IDS [66], 95.60% for malware detection [74], and 99.54% for spam detection [75] with the KDD, VirusShare, and Spambase datasets, respectively. The best reported recall for RF is 99.95% for intrusion detection [66], 97.30% for malware [34], and 97.20% for spam [17]. The best reported precision for RF is 99.80% for intrusion detection [68], 98.58% for malware [9], and 98.60% for spam [78]. A detailed performance comparison of RF across cyber threats on frequently used datasets is presented in Table 5.
The major limitation of the Naïve Bayes (NB) classifier is that it assumes every attribute is independent of every other attribute. This state of independence is technically impossible in cyberspace. Hidden NB is an advanced form of Naïve Bayes, and it gives 99.6% accuracy [92].

Naïve Bayes showed an accuracy of 99.90% with the DARPA dataset for IDS [80]. The best reported accuracy for malware detection is 99.50%, using the NSL-KDD dataset [86]. With the Spambase dataset, Naïve Bayes showed an accuracy of 96.46% when classifying emails as spam or ham [19]. The best reported recall for NB is 100% for intrusion detection [79], 95.90% for malware [86], and 98.46% for spam [19]. The best reported precision for NB is 99.04% for intrusion detection [80], 97.50% for malware [34], and 99.66% for spam [19]. A detailed performance comparison of NB across cyber threats on frequently used datasets is presented in Table 6.
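The independence assumption is what lets NB score a class as the class prior multiplied by one per-attribute likelihood at a time. The sketch below shows this for word features, assuming Laplace (add-one) smoothing and log-probabilities for numerical stability; the function names and the four toy messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict_nb(model, words):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Each word contributes independently of the others (the naive assumption).
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus.
docs = [["win", "money", "now"], ["free", "money"],
        ["meeting", "tomorrow"], ["project", "meeting", "notes"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
print(predict_nb(model, ["free", "money", "now"]))
```

Because each word is scored separately, correlated attributes (for example, words that nearly always appear together) are effectively counted twice, which is the practical cost of the independence assumption.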