Intrusion detection systems and intrusion prevention systems are designed to protect networks against attacks by intruders and other malicious activity. The networking community has made benchmark datasets available so that researchers can improve the accuracy of intrusion detection systems. The scientific community has presented data mining and machine learning-based mechanisms that detect intrusions with high classification accuracy.
1. Introduction
The rapid development of network and communication technologies has led to an immense amount of network data generated by a wide range of services. For instance, pervasive computing networks such as the Internet of Things (IoT) generate enormous volumes of data
[1][2][3]. A wide range of network applications is developed in every domain of life, including business, healthcare, smart homes, and smart cities, to name a few
[4][5][6][7]. The plethora of high-dimensional data increases the need for analysis tools based on advanced data mining and statistical methods
[8][9]. There is a pressing need to tune contemporary data mining and statistical methods to address the challenges of growing internet applications, such as bandwidth handling, network intrusion detection, and scalability. Securing network applications and resources with intrusion detection systems, intrusion prevention systems, and hybrid systems is becoming more challenging because of the enormous number of diverse networking applications. However, the rule-based approach to analyzing such large volumes of data has many limitations. Existing state-of-the-art intrusion detection systems focus on increasing the reliability of these applications
[10]. An efficient intrusion detection system can strengthen the defense of such applications against anomalies and network intrusion attacks. The intrusion detection system also provides real-time analysis of critical reconnaissance data collected during attacks. Intrusion detection systems based on artificial intelligence (AI) hold significant potential to enhance the performance of detection mechanisms by learning from historical data and real-time data patterns.
The scientific community has presented various machine learning-based intrusion detection systems using techniques such as support vector machines (SVM)
[11], naive Bayes (NB)
[12], clustering
[13], artificial neural networks (ANN), and deep neural networks (DNN)
[14]. Conventional machine learning algorithms classify small, low-dimensional datasets well. However, their classification accuracy deteriorates on problems involving high dimensionality and nonlinearity. Hence, the need for intrusion detection models that address the classification accuracy problem increases as AI advances. For example, convolutional neural networks (CNN)
[15] and long short-term memory (LSTM)
[16] have been applied in natural language processing (NLP) and computer vision applications. The difficulty with deep learning techniques such as CNN and LSTM is their adaptability to nonlinear and high-dimensional data. The issue of nonlinearity has been addressed by using CNN and LSTM to model nonlinear systems
[17][18][19][20][21][22]. In the literature, high-dimensional data are handled in CNN and LSTM using a deep learning paradigm
[23][24][25][26]. Automated machine learning (autoML) is a newly emerged subfield of machine learning and data science. Its adaptability makes it equally useful for machine learning trainees, data scientists, and machine learning engineers. Research articles demonstrate that autoML can change how machine learning models are constructed, because it requires neither machine learning expertise nor knowledge of technical specifications. AutoML architectures produce a code pipeline by suggesting and selecting a model from a list of machine learning models based on the input dataset
[27]. The selection is based on the accuracy of these machine learning models. AutoML outputs the coded pipeline of the best-performing model, which would be very difficult to find by manually configuring the models' parameters.
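The accuracy-based selection step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the architecture of any particular autoML tool: the two candidate models (a majority-class baseline and a nearest-centroid classifier), the function names, and the toy traffic features are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]        # (feature vector, label)
Model = Callable[[Sequence[float]], int]    # maps features to a predicted label


def fit_majority(train: List[Sample]) -> Model:
    """Baseline candidate: always predict the most frequent training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_nearest_centroid(train: List[Sample]) -> Model:
    """Candidate that predicts the label whose mean feature vector is closest."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x: Sequence[float]) -> int:
        # Squared Euclidean distance to each class centroid.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def accuracy(model: Model, data: List[Sample]) -> float:
    """Fraction of samples for which the model predicts the true label."""
    return sum(model(x) == y for x, y in data) / len(data)


def select_best(candidates, train: List[Sample], valid: List[Sample]):
    """Fit every candidate on train and keep the one with the best validation accuracy."""
    scores = {name: accuracy(fit(train), valid) for name, fit in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical candidate list; a real autoML system searches a far larger space.
CANDIDATES = {"majority": fit_majority, "nearest_centroid": fit_nearest_centroid}

# Toy data (assumed): features are (packets per second, mean packet size), label 1 = attack.
TRAIN = [((10.0, 500.0), 0), ((12.0, 480.0), 0), ((11.0, 510.0), 0),
         ((900.0, 60.0), 1), ((950.0, 64.0), 1)]
VALID = [((9.0, 495.0), 0), ((920.0, 62.0), 1),
         ((13.0, 505.0), 0), ((880.0, 58.0), 1)]

best_name, scores = select_best(CANDIDATES, TRAIN, VALID)
print(best_name, scores)
```

On this toy split the nearest-centroid candidate wins because the majority baseline misclassifies every attack sample. Real autoML systems also tune hyperparameters and preprocessing steps, which this sketch leaves out.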
2. Anomaly Detection in Network Intrusion Environments
Artificial intelligence is reshaping the current era. Data analysis, predictive analytics, and optimization models are used in many real-life applications
[28][29][30]. Anomaly detection is a type of data analysis used to identify irregular and abnormal data in a given data set. It is used in data mining applications to discover patterns inside the data
[31]. It is also used as a standalone module in many studies related to machine learning and statistics applications. Deviation detection, outlier detection, and exception mining are related terms used for anomaly detection
[32]. Narayana et al. defined an anomaly as an observation that deviates so much from other observations that it appears to have been generated by a different mechanism
[33]. Anomaly detection is used in several scientific domains such as healthcare, intrusion detection, sensor networks, and fraud detection, to name a few. Its applications include detecting irregularities in networks, identifying anomalies in financial transactions, detecting fraudulent activities, and detecting anomalies in medical images
[34]. In networks, anomalies can be identified by classifying packet data that contain abnormal patterns.
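As a rough illustration of identifying irregular values in a data set, the following Python sketch flags observations whose z-score against a baseline of normal traffic exceeds a threshold. The z-score rule, the threshold of 3.0, the function names, and the packet-size values are all assumptions made for illustration and are not taken from the surveyed studies.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def fit_baseline(normal_values: Sequence[float]) -> Tuple[float, float]:
    """Estimate the mean and sample standard deviation of known-normal data."""
    return mean(normal_values), stdev(normal_values)


def find_anomalies(
    values: Sequence[float],
    baseline: Tuple[float, float],
    threshold: float = 3.0,
) -> List[int]:
    """Return the indices of values whose absolute z-score exceeds the threshold."""
    mu, sigma = baseline
    if sigma == 0:
        raise ValueError("baseline has zero variance; z-scores are undefined")
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical packet sizes (bytes) observed during normal operation.
BASELINE = fit_baseline([500, 510, 495, 505, 498, 502])

# New observations to score against the baseline.
OBSERVED = [503, 5000, 497, 40]

anomalous_indices = find_anomalies(OBSERVED, BASELINE)
print(anomalous_indices)
```

The baseline is fitted on known-normal data rather than on the observations being scored, because a large outlier inflates the standard deviation of a small sample and can hide itself from the threshold.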
Xie et al. published a survey of intrusion detection in wireless sensor networks
[35]. According to most of the studies, intrusion detection depends on the communication medium; for example, techniques designed for wired connections cannot be applied to a wireless communication medium. The survey emphasizes the need for standard anomaly detection techniques for all types of networks. One challenge in detecting network anomalies is the lack of a comprehensive dataset. Most current anomaly detection systems are based on supervised approaches that rely on labeled data. During the past few years, research in network intrusion detection has been categorized by audit source, network behavior, detection method, location, and frequency of usage. In
[36], Debar et al. presented a standard technique based on an extension of the transaction-based detection paradigm. Axelsson et al.
[37] proposed a study based on the detection principle that focuses on operational aspects. Furnell et al.
[38] proposed an intrusion matrix based on the data scale and output type. Estevez-Tapiador et al. presented an anomaly detection-based intrusion detection approach for wired networks
[39]. Boukerche et al. presented an outlier-based detection approach classified into unsupervised and supervised models
[40]. Under the supervised category, a proximity-based technique has been used recently
[41].
Chandola et al. presented another detailed survey of anomaly detection
[42]. Their study presents different techniques related to intrusion detection. Several studies have proposed anomaly detection techniques based on supervised, unsupervised, and clustering methods
[43][44][45][46][47]. The lack of discussion of the problems in the available datasets is one of the research gaps that needs to be addressed. The most widely used datasets for network anomaly detection are the DARPA/KDD datasets, which date from the late 1990s. Several variants have been derived from them to address the causes of data errors and inconsistency. Because network anomaly detection based on these datasets has shown no significant performance improvement, more anomaly detection datasets have been introduced recently to improve intrusion detection system efficiency. Some research surveys have focused on these dataset issues and challenges in order to develop an efficient intrusion detection system
[48]. Network attack profiling relies on classification-based techniques and on the size of the data
[49]. The intrusion detection process is based on the signature of the attack and on the capability of the intrusion detection system to detect the attack from data patterns
[50]. The intrusion detection engine can also strengthen the defense system by using intelligent mechanisms to handle variants of attacks. This process is quite expensive when a new attack must be created in case of loss or replacement
[51]. Furthermore, regular traffic does not contain the attacks stored in the knowledge base, so the system may raise false alarms.
In summary, anomaly detection mechanisms are costly in terms of time and rely on existing network traffic datasets. Furthermore, keeping the standard profile up to date is very difficult in today's networks. Network traffic datasets are not easily accessible because of privacy limitations. Examples of benchmark datasets for intrusion detection are DARPA/KDD, UNSW-NB15, CICIDS2017, and CSE-CIC-IDS2018
[52]. The main challenge that needs to be addressed is improving the accuracy of intrusion detection systems on these benchmark datasets.
Table 1 presents a summary of existing intrusion detection and prevention systems organized as applications, datasets, models, and relative demerits.
Table 1. Summary of existing intrusion detection and prevention systems.
| Application | Datasets | Model | Relative Demerits |
| --- | --- | --- | --- |
| Anomaly detection [53] | InSDN | TRW-CB algorithm | Offers standardized programmability and can predict anomalies in SOHO networks. |
| DoS attack detection [54] | KDD-99 | Self-organizing maps, ANN | Lightweight detection of DDoS flooding attacks, but does not have any flow rules installed. |
| Anomaly detection [55] | NSL-KDD | DNN approach | Does not scale well for commercial products but is a good alternative to signature-based intrusion detection systems. |
| DDoS detection system [56] | Simulated data | Stacked auto-encoder and DNN | Detects all DDoS attacks, but has a controller bottleneck in wide networks. |
| Intrusion detection [57] | Simulated data | Self-organizing map and learning vector quantization | Detects U2R attacks but is limited to the deep packet inspection technique. |
| Traffic flow monitoring [58] | Simulated data | Flow analysis tool | Improves flow computation time but is difficult to handle due to batch processing. Flow analysis tools are not compatible with the MapReduce interface. |
| P2P botnet detection [59] | CAIDA, simulated data | Random forest | Processes high bandwidth and efficiently analyzes malicious traffic data. However, the high packet drop rate and the delay in detection make it inefficient for new complex threats. |
| Intrusion detection [60] | NSL-KDD 99 | NB tree, random forest | Improved accuracy and a reduced false-positive rate for hybrid approaches, but the false-positive rate is high for non-hybrid approaches. |
| Phishing-based attack detection [61] | Simulated data | Collaborative mechanism | Practical method that generalizes to any attack, but has no validation with real datasets. |
| Intrusion detection [62] | KDD 99, CMDC 2012 | OneR algorithm, KNN, SVM | Faster, but the feature reduction and training mechanism add real overhead. |
| Malware detection [63] | Simulated data | Choi–Williams distribution | Effective for Kelihos injection but not tested with real datasets. |
| Intrusion detection system [64] | Simulated data | RSFSA, fuzzy logic-based SVM | Faster mechanism for decision attributes and log data reduction, though not tested with real datasets. |
| Network traffic monitoring [65] | CAIDA | IP Trace Analysis System | Useful for passive analysis but does not provide a fine-grained analysis. |
3. Conclusions
The rapid development of network applications has led to an immense amount of network data generated by a wide range of services for large user groups. Safeguarding network applications and things connected to the internet has always been a point of interest for researchers. Many studies propose solutions for intrusion detection systems and intrusion prevention systems. Nevertheless, there is a pressing need to tune contemporary data mining and statistical methods to address the challenges of growing internet applications, such as bandwidth handling, network intrusion detection, and scalability. We present an intrusion detection system based on an ensemble of prediction and learning mechanisms to improve anomaly detection accuracy in a network intrusion environment. Case studies of intrusion detection are implemented using the publicly available benchmark intrusion detection datasets UNSW-NB15 and CICIDS2017. The performance of the proposed model is compared with that of contemporary models, including DNN, autoML, and other algorithms from the literature, on these benchmark datasets. Performance is evaluated in terms of accuracy, precision, recall, and F1 score. The accuracy of the proposed model is 98.801 percent on the UNSW-NB15 dataset and 97.02 percent on the CICIDS2017 dataset. The performance comparison shows significant improvements in intrusion detection accuracy, detection rate, and F1 score. As part of future work, the proposed intrusion detection model will be leveraged in IoT-cloud applications to detect anomalies in sensing data.
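For reference, the four evaluation measures named above can be computed from binary predictions as in the following Python sketch. It uses the standard definitions of accuracy, precision, recall, and F1 score; the function name and the example labels are assumptions for illustration and do not reproduce the reported results.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = attack)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Confusion-matrix counts.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    accuracy = (tp + tn) / len(y_true)
    # Report 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions for eight traffic records.
Y_TRUE = [1, 1, 1, 0, 0, 0, 0, 1]
Y_PRED = [1, 1, 0, 0, 0, 1, 0, 1]

metrics = classification_metrics(Y_TRUE, Y_PRED)
print(metrics)
```

Accuracy alone can be misleading on intrusion datasets where attack records are rare, which is one reason precision, recall, and F1 score are usually reported alongside it.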