IDS Using Feature Extraction with ML in IoT

IDS Using Feature Extraction with ML in IoT: Comparison

Please note this is a comparison between Version 1 by Atta ur Rahman and Version 2 by Catherine Yang.

With the continuous increase in Internet of Things (IoT) device usage, more interest has been shown in internet security, specifically focusing on protecting these vulnerable devices from malicious traffic. Such threats are difficult to distinguish, so an advanced intrusion detection system (IDS) is becoming necessary. Machine learning (ML) is one of the promising techniques as a smart IDS in different areas, including IoT. However, the input to ML models should be extracted from the IoT environment by feature extraction models, which play a significant role in the detection rate and accuracy.

intrusion detection system
Internet of Things
feature extractors
machine learning

1. Introduction

The Internet of Things has recently been one of the most important research topics. The IoT is a new technological paradigm defined as a global network of connected electronic devices. It aims to improve daily life by automating normal daily operations in all aspects of life without human intervention. The number of devices connected to IoT has been raised significantly, and an increase in attacks against IoT devices has accompanied this growth. Security concerns about the impact of these attacks on connected devices have naturally increased. In addition to the sensitivity of the information available on IoT devices, it was necessary to find solutions to detect and respond to these attacks [1].

Because of its weaknesses, the Internet of Things is vulnerable to assaults and security threats ^{[2][3][4][5][6][7][8][9][10][11]}[2,3,4,5,6,7,8,9,10,11]. Researchers attempted to categorize attacks, vulnerabilities, and security concerns on the Internet of Things so that researchers could more easily identify answers. For example, according to the layers of the IoT architecture, the researchers categorized the vulnerabilities, and physical security hardening is lacking. Unconfident data storage and transfer, shortage of clarity and device management, botnets, insecure passcodes, ecosystem interfaces, and AI-based assaults have all been concerns for devices on the IoT. While some academics emphasized IoT’s vulnerabilities and security risks, others did not. Among these issues, the researchers pointed out that because IoT employs traditional network architecture, it inherits its flaws [6]. In addition, the rise in terminal devices (end nodes) with limited processing capabilities was one of the most significant and powerful vulnerabilities exploited by attackers [7]. End node manufacturers design them to work without paying attention to security concerns. As a result, these devices must be monitored and managed to protect networks from threats and reach a higher degree of IoT security [2]. The Open Web Application Security Project (OWASP) produced a comprehensive record of IoT attack areas and locations in IoT systems as a section of its Internet of Things plan. In addition, it recorded applications where alleged unauthorized actions could be encountered [12]. Following is a summary of the IoT attack areas:

Devices are probably the principal mechanism from which attacks could be launched, such as memory, firmware, physical interface, web interface, or network services. In addition, weak settings, out-of-date components, and insufficient update procedures are critical parts of any device attackers could utilize.
Communication channels could also be points of attack in IoT systems. Usually, protocols contain security failings that impact the whole system, such as denial of service (DoS) attacks and spoofing.
Applications and software, generally, could be hacked due to a shortcoming in web applications or software systems. This approach is usually utilized to steal user passwords or spread malware files or programs.

Intrusion detection systems (IDSs) are essential security techniques to conserve network security, and they are installed at a fatal location in the network ^[12][13][12,13]. Traditional systems contain source and preliminary processing of data and a decision-making technique. This process contains the collection of raw data from host or network traffic. By analyzing the network data traffic, an intrusion detection system can classify the network behavior as normal or abnormal [12] and then process the features passed by the decision-making method to recognize threats [13]. Three main ways to detect intrusions are signature-based IDS, anomaly-based IDS, and a hybrid of signature- and anomaly-based IDS [13]. Dynamic anomaly-based network detection systems are flexible and superior to static signature-based network intrusion systems because the former can detect new attacks [14]. They use artificial intelligence (AI) algorithms that are made of both machine learning (ML) and deep learning (DL) architectures. On the other hand, IDSs detect signatures and patterns and then match them with the predefined signature of misuses, which could be worthless with unknown attacks [13]. The three significant categories of intrusion detection systems are host intrusion detection systems (HIDSs), network intrusion detection systems (NIDSs), and network node intrusion detection systems (NNIDSs) [13]. The HIDS is installed on the entire network of machines and other parts of the physical and virtual networks and protocols. The NIDS protects vulnerable network parts where the attack opportunities are high. IDSs consider network or host-based methods to recognize and distract attacks. These methods search for attack signatures with patterns that indicate malignant action or suspicious activity. Based on where an IDS is searching for the pattern, either in network traffic or log files, it is classified as network- or host-based [15].

Machine learning methods are extensively used to build network intrusion detection systems because of their capability to grasp new intrusions [16]. To develop accurate algorithms that can cluster, classify, and predict, it is vital to utilize considerable-size data sets using supervised machine learning techniques such as SVM and naïve Bayes. In addition, decision trees demonstrate their simplicity, rapid adaptability, and accuracy. In addition, neural networks have been widely used to characterize anomaly and misuse patterns ^[12][16][12,16]. Accuracy and interpretability are essential factors of artificial intelligence models. To achieve accuracy and interpretability, machine learning and deep learning techniques must be considered. For example, black-box algorithms provide higher accuracy, while white-box algorithms provide feature engineering [14].

2. Machine Learning Techniques used in IoT Traffic

Rose et al. [17] generated a dataset and developed a model to detect and investigate the possibilities of utilizing network profiling and machine learning to protect IoT against cyber-attacks. The authors suggested anomaly-based intrusion detection system profiles and monitoring all networked devices constantly and aggressively to identify IoT device tampering attempts and suspicious network transactions. They evaluated the suggested methodology’s performance using regular and malicious network traffic on the Cyber-Trust testbed. The experimental findings reveal that the suggested anomaly detection system produces good results, with a 98.35% accuracy and 98.35% false-positive alerts. Ali et al. [18] present a general machine learning strategy for identifying IoT devices and evaluating the trained models against four publicly available datasets. NFStream extracted 85 attributes from packet capture (.pcap) files to better identify IoT devices in the network using machine learning models. The authors used the information gain approach to choose 20 characteristics and trained six machine learning models in the tests. In the training phase, the authors achieved high accuracy, reaching 99% for IoT device identification using random forest and naïve Bayes classifiers. El-Sayed et al. [19] examined and compared seven different supervised learning algorithms with various difficulty levels to pick the best one. The seven algorithms were separated into two groups: The category of CNN classifiers included two-layer CNN, four-layer CNN, VGG16 and logistic regression, support vector machine, and K-nearest neighbors, and the category of ordinary classifiers included logistic regression, support vector machine, and K-nearest neighbors. Experimental findings reveal that the SVM algorithm obtains the maximum performance of 94% on MobileNetv2 features because of its rapid and steady training performance with fewer resources compared with other models. Le K-H et al. [20] present IMIDS, an intelligent intrusion detection system (IDS) for IoT devices. IMIDS’s core is a lightweight convolutional neural network model that can categorize numerous cyber threats and surpasses its competitors with an average F-measure of 97.22%. Furthermore, after being further educated by the data supplied by the assault data generator, IMIDS’s detection performance significantly increased. These findings show that IMIDS may be used as an IDS in IoT. Joo et al. [21] proposed a deep learning-based IoT intrusion detection system. The categorization was performed with a CNN; the best score was 86.2%. Second, machine learning classifiers were employed for the hybrid technique instead of ultimately linked layers from the vanilla CNN, which delivered roughly 87% with the additional tree classifier. Finally, the Xception model was merged with the bidirectional GRU, yielding the best accuracy at 95.6%. For quicker identification and classification of new malware, Bendiab et al. [22] propose a unique IoT technique that analyzes malware traffic based on DL and visual representation (zero-day malware). The suggested technique detects fraudulent network traffic at the package level, lowering detection time and optimistic outcomes thanks to the deployed deep learning. To test the efficacy of the proposed technique, the authors created a dataset of 1000 .pcap files of benign and virus traffic obtained from several network traffic sources. The Residual Neural Network (ResNet50) trial findings are quite encouraging, with a detection rate of 94.50% for malware traffic. Six machine learning (ML) approaches were tested for their ability to identify MQTT-based attacks [23]. Packet-based, unidirectional, and bidirectional flow characteristics were evaluated at three abstraction levels. An MQTT simulated dataset was created and used for the training and assessment operations. The experimental findings showed that the suggested ML models were sufficient for the IDS needs of MQTT-based networks. Furthermore, the findings highlight the significance of employing flow-based characteristics to distinguish MQTT-based attacks from innocuous traffic, whereas packet-based features are sufficient for typical networking assaults. The results reveal that the model has the highest accuracy of 99.04%. Sapre et al. [24] employed the KDDCup99 and the NSLKDD, two widely used intrusion detection datasets, in their study. Their major objective was to thoroughly compare both datasets by analyzing the performance of multiple machine learning (ML) classifiers trained on them using a more extensive range of classification criteria than prior studies. Because the classifiers trained on the KDDCup99 dataset were 20.18% less accurate on average, the authors concluded that the NSL-KDD dataset is of better quality than the KDDCup99 dataset. This is because classifiers trained on the KDDCup99 dataset were biased toward redundancy, allowing them to attain a higher accuracy of 96.83%. Liu et al. [25] looked at assaults that might affect sensors and networks in IoT scenarios using the NSL-KDD dataset. Moreover, the authors investigated eleven machine learning techniques and provided the findings to identify the introduced assaults. They showed that tree-based approaches and ensemble methods surpass the other machine learning methods evaluated through numerical analysis. With 97% accuracy, 90.5% Matthews correlation coefficient (MCC), and 99.6% area under the curve (AUC), XGBoost is the best of the supervised algorithms. Furthermore, the expectation-maximization (EM) technique, which is an unsupervised approach, performs exceptionally well in identifying assaults in the NSL-KDD dataset and beats the naïve Bayes classifier by 22.0% in terms of accuracy. To distinguish benign from malicious nodes, Amouri et al. [26] used a methodology that consists of two stages: in the first stage, the data are collected by dedicated sniffers (DSs), and then the CCI is generated and is regularly sent to the super node (SN). After that, in the second stage, the SN processes a linear regression method on the collected CCIs from different DSs to distinguish benign from malicious nodes. Using two mobility models, namely random waypoint (RWP) and Gauss Markov, the detection characterization is shown for several extreme cases in the network (GM). The black hole and distributed denial of service (DDoS) assaults are two harmful activities utilized at work. Nodes with high-velocity situations showed detection rates of over 98%, while nodes with low-velocity scenarios showed detection rates of approximately 90%. Fenanir et al. [27] created a lightweight intrusion detection system (IDS) using two machine learning techniques: the filter-based method was used to pick features due to its cheap computational cost. A comparison of logistic regression (LR), naïve Bayes (NB), decision tree (DT), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and multilayer perceptron yielded the feature classification approach to the system (MLP). Finally, the DT method was chosen for the system due to its excellent performance across various datasets. The study’s outcomes might help choose the optimum feature selection approach for machine learning; the data suggest that the best results are 98% accuracy. Islam et al. [28] pointed out numerous types of IoT threats and discussed shallow IDSs in the IoT environment (such as decision tree (DT), random forest (RF), and support vector machine (SVM)), as well as DL (deep neural network (DNN), deep belief network (DBN), long short-term memory (LSTM), stacked LSTM, and bidirectional LSTM (Bi-LSTM))-based IDSs. The models’ execution was assessed using five standard datasets: NSL-KDD, IoTDevNet, DS2OS, IoTID20, and the IoT Botnet dataset. The performance of shallow/deep machine learning-based IDSs was evaluated using several performance indicators such as accuracy, precision, recall, and F1-score. According to the research, a machine learning IDS surpasses shallow machine learning in detecting IoT threats; the most remarkable outcome of the studies is the accuracy of 98.79%. Using characteristics from the UNSW-NB15 dataset, Ahmad et al. [29] suggest feature clusters regarding its flow, Message Queuing Telemetry Transport (MQTT), and Transmission Control Protocol (TCP). Overfitting, the curse of dimensionality, and an unbalanced dataset are no longer issues. The proposed method used supervised machine learning (ML) methods such as random forest (RF), support vector machine, and artificial neural networks on the clusters. The model reaches 98.67% and 97.37% accuracy using RF in binary and multiclass classification. Utilizing RF on flow and MQTT features, TCP features, and top features from both clusters, classification accuracies of 96.96%, 91.4%, and 97.54% were obtained using cluster-based approaches. A two-stage hybrid technique was proposed by Saba et al. in [30]. To increase the accuracy of the suggested system, the genetic algorithm (GA) is first used to pick acceptable characteristics. The support vector machine (SVM), ensemble classifier, decision tree, and other well-known machine learning (ML) algorithms are then used. Using the NSL-KDD database, they attained a 99.8% accuracy using 10-fold cross-validation. Based on a hybrid convolutional neural network model, Smys et al. [31] suggested an intrusion detection system for IoT networks that can identify many forms of assaults. The proposed paradigm may be used in a variety of IoT scenarios. The proposed study is validated and compared to machine learning and deep learning models. The suggested hybrid model is more sensitive to threats in the IoT network, with a 98.6% accuracy rate. Papafotikas et al. [32] propose a digital system incorporating a machine learning (ML)-based clustering method for identifying suspected activities while using current supply characteristic dissipation. The K-means clustering algorithm accompanied by supervised training is used in this prototype system. This research demonstrated the successful identification of suspicious activity in intelligent IoT devices. Similarly, a study in [33] proposed an IDS approach using a fused machine learning model. Three datasets, namely KDD, CUP-99, and NetML-2020, were fused under a novel-built machine learning-based architecture. The trained model was promising in terms of accuracy of 95.18%. Further, several researchers in the literature have comprehensively surveyed and emphasized the significance of machine learning and deep learning models in the IDSs involving IoT networks ^{[34][35][36][37]}[34,35,36,37], especially in conjunction with cloud computing, namely the Cloud of Things security aspect [38]. This is mainly because it involves several intermediate public networks and stakeholders, making it more vulnerable to attacks. Table 1 summarizes related work approaches, including the techniques used, dataset type, and the respective study’s advantages and disadvantages.

Table 1.

Summary of the research.

Ref.	Year	Dataset Type	Algorithms	Key Results	Advantages	Disadvantages
[17]	2021	Image dataset	A novel IDS in the ML framework	98.35%	A new approach to detect malware and intrusion using image processing for IoT	Need resources and computing time to achieve results and to understand what each image means
[18]	2021	Numerical dataset	Random forest and naïve Bayes algorithms in the ML model	99%	Using different types of machine learning to understand the problem	Too expensive in terms of resources and computing time, and the authors did not propose any base model for the future
[19]	2021	Image dataset	Two-layer CNN, four-layer CNN, VGG16, CNN classifiers, SVM, and K-NN in the ML model	94%	Using a hybrid technique to extract the best features from the images using VGG and CNN	Too heavy classification network and no improvement in key results
[20]	2021	Numerical dataset	Novel convolutional neural network in ML model	97.22%	A faster IDS	Despite the fast performance, the model is light with the heavy traffic amount
[21]	2021	Image dataset	Convolutional neural network (CNN)	95.6%	Connected layers and faster learning	The model uses heavy resources and without achieving higher accuracy
[22]	2021	Image dataset	Residual Neural Network (ResNet50)	94.50%	The model used a hybrid approach to detect intrusions	The authors used a small amount of the data to achieve the findings without testing on the larger networks
[23]	2020	Numerical dataset	LR, NB, k-NN, SVM, DT, and RF	99.04%	Using different ML algorithms	The model was focused on flow-based detection, not packet-based detection
[24]	2019	Numerical dataset	ANN, SVM, NBC, and random forest	91.5%	The authors used different dataset types	The model was trained separately on the dataset, and without finding a model that is trained on different intrusion features
[25]	2020	Numerical dataset	Tree-based XGBoost, MCC, AUC, ME, and naïve Bayes classifier	97%	The model used an XGBoost algorithm	The model was applied to a couple of scenarios without generating a model that could handle several scenarios
[26]	2020	Numerical dataset	Tree-based; among the supervised algorithms, XGBoost ranks first, followed by Matthew’s correlation coefficient (MCC), area under the curve (AUC), expectation-maximization (EM) algorithm, naïve Bayes classifier	90%	The model used different approaches to tackle the IDS problems	The model was focused on DDOS and DOS attacks.
[27]	2019	Numerical dataset	LR, NB, DT, RF, KNN, SVM, and MLP	98%	Applied different machine learning algorithms	More feature selection processes are needed to achieve better findings
[28]	2021	Numerical dataset	DT, RF, SVM, (DNN), deep belief network (DBN), long short-term memory (LSTM), stacked LSTM, bidirectional LSTM (Bi-LSTM))	98.2%	Applied different machine learning algorithms using different datasets	The proposed model did not recommend specific datasets or algorithms to be used as a base model
[29]	2021	Numerical dataset	RF, SVM, and ANN	96.96%	Applied different machine learning algorithms	More feature selection processes are needed to achieve better findings
[30]	2021	Numerical dataset	GA, SVM, ensemble classifier, and DT	99.8%	Applied hyperparameter and K-fold	The model proposed a multiclass without presenting the features or the class features
[31]	2020	Numerical dataset	Different ML algorithms in hybrid convolutional neural network module	98%	The model takes advantage of deep learning feature extraction	The model was applied to a couple of scenarios without generating a model that could handle several scenarios
[32]	2019	Numerical dataset	K-means algorithms	-	Using clustering to tackle the problem	The model still needs to be trained from supervised algorithms to achieve findings
[33]	2023	Numerical dataset	Fused machine learning	95.18%	Used machine learning fusion with three datasets (KDD, CUP-99, NetML-2020)	Accuracy can be further fine-tuned