With the significant increase in cyber-attacks and attempts to gain unauthorised access to systems and information, Network Intrusion-Detection Systems (NIDSs) have become essential detection tools. Anomaly-based systems use machine learning techniques to distinguish between normal and anomalous traffic. They do this by using training datasets that have been previously gathered and labelled, allowing them to learn to detect anomalies in future data. However, such datasets can be accidentally or deliberately contaminated, compromising the performance of the NIDS.
1. Introduction
Network Intrusion-Detection Systems (NIDSs) represent a primary cybersecurity mechanism for identifying potential attacks on a communication network. To accomplish this goal, they analyse the network traffic passing through the system, regardless of whether it is internally generated or originates from external entities targeting the network. Detecting intrusions allows network administrators to become aware of system vulnerabilities and to make quick decisions to abort or mitigate attacks. Additionally, NIDSs allow them to implement measures to strengthen the system in the future [1].
NIDSs can be categorised into various types based on two fundamental principles: the architecture and the detection technique employed. In terms of architecture, NIDSs can be classified as host-based, network-based, or collaborative approaches between different components. According to the detection technique, the classification distinguishes signature-based, Stateful-Protocol-Analysis-based, and anomaly-detection-based NIDSs [2].
Signature-based NIDSs possess a repository of network patterns representing prevalent network attacks. They operate by matching the network sequences they examine against this knowledge base to detect potential attacks [3].
Alternatively, Stateful-Protocol-Analysis-based NIDSs rely on a comprehensive understanding of the monitored protocol. They analyse all interactions to identify sequences of actions that might result in a vulnerability or an insecure state [3].
In contrast, anomaly-detection-based NIDSs employ mechanisms to detect abnormal network traffic behaviour. These anomalous activities typically correspond to network traffic patterns that have a significantly low likelihood of occurring or are markedly misaligned with normal traffic. Crucially, anomaly detection makes it possible to handle novel or previously unknown (zero-day) attacks, since such attacks generate traffic patterns that have not been seen before. This type of NIDS often relies on machine learning techniques to carry out anomaly detection; when this approach is followed, the subjective evaluation of attacks is effectively circumvented.
Different strategies have been employed to detect anomalies in NIDSs through various machine learning techniques [4,5], including statistical techniques like Principal Component Analysis (PCA) [6] or Markov models [7,8]; classification techniques like Artificial Neural Networks (ANNs) [9,10,11,12], Support Vector Machines (SVMs) [6], deep learning models [13,14] including Autoencoders [9,15], or Decision Trees including Random Forest [16]; and clustering techniques like outlier detection [17]. Applying these techniques requires approaching the problem from several perspectives, which can be categorised as supervised, semi-supervised, or unsupervised, depending on the specific technique chosen [18].
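Several of the cited techniques work by modelling normal traffic and scoring deviations from it. The following is a minimal sketch of the PCA-based idea only, assuming NumPy is available and using entirely synthetic, hypothetical flow features: normal traffic is modelled as lying near a low-dimensional subspace, and flows far from that subspace receive a high anomaly score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-feature flow records: "normal" traffic is generated to
# lie near a 2-dimensional subspace, plus a little measurement noise.
latent = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 2.0, 0.5],
                   [0.3, 1.0, 2.0]])
normal = latent @ mixing + 0.05 * rng.normal(size=(500, 3))

# "Train" PCA on traffic assumed to be attack-free: centre the data and
# keep the top-2 principal directions.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]            # top-2 principal subspace

def anomaly_score(x):
    """Reconstruction error: distance from the normal-traffic subspace."""
    centred = x - mean
    projected = centred @ components.T @ components
    return np.linalg.norm(centred - projected, axis=-1)

# Threshold chosen from the training scores; unseen flows far off the
# subspace score well above it and are flagged as anomalous.
threshold = np.percentile(anomaly_score(normal), 99)
test_flow = np.array([[50.0, -40.0, 30.0]])   # clearly off-subspace
print(anomaly_score(test_flow)[0] > threshold)
```

In a real NIDS the feature matrix would come from captured (and presumed clean) training traffic, and the threshold would be tuned against a validation set rather than fixed at a percentile.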
Regardless of the technique used for anomaly detection in NIDSs, the underlying models must be trained to distinguish normal traffic from anomalous traffic. This training process uses datasets comprising real network traffic, synthetic traffic, or a combination of both. More specifically:
- Synthetic traffic datasets are created by generating traffic in a controlled environment that emulates a real-world setting. The generated traffic may include traffic related to known attacks, providing enough samples for machine learning models to learn to identify and detect such anomalies. This enables optimisation of the dataset in terms of its size and of the balance between regular and irregular traffic samples. It also ensures the correct labelling of each observation, since the traffic has been intentionally and deliberately generated. Such observations can be, for instance, the traffic flows seen in the network. However, a potential issue is that synthetic traffic may not accurately reflect the network traffic patterns observed in a genuine environment.
- Real traffic datasets capture all network communications within a real production environment. This provides access to the patterns of network traffic consumption and usage that occur in an actual scenario, and potentially to any cyber-attacks that take place. Unlike synthetic datasets, real traffic samples may be biased or imbalanced, with anomalous traffic often minimal or completely absent. A subsequent process is needed to assign a normal or attack label to each flow before it can be used to train machine learning models.
- Composite datasets are generated by combining real environment data with synthetic traffic that introduces attack patterns.
Regardless of the AI model used in a NIDS, the dataset’s labelling accuracy is crucial to maintaining high model performance. This principle applies equally to supervised and unsupervised learning. In supervised learning, labelling is necessary to enable models to learn how to identify anomalous traffic. In contrast, unsupervised learning generally assumes that the training dataset consists of normal traffic only and is, therefore, free of anomalies.
2. Datasets for Network Security Purposes
To effectively train any AI model, especially those constituting NIDSs based on anomaly detection, a prerequisite is a comprehensive dataset. This dataset should encompass a sufficient number of samples representing all the various classes or patterns, whether benign or malicious, enabling the model to learn and predict accurately during subsequent training phases. In the specific case of NIDSs, a large and correctly labelled dataset is assumed [19]. The quality of the trained models depends to some extent on the quality of the data on which they were trained [20], so it is important to make a thorough analysis of the typology of datasets available in the NIDS domain.
Before reviewing the different datasets available in the field of cybersecurity, it is necessary to define the criteria according to which these datasets will be analysed:
- Availability: Whether the dataset is freely accessible (Public) or, on the contrary, access is reserved, by means of payment or explicit request (Protected).
- Collected data: Some datasets collect traffic packet by packet (e.g., PCAP files), others collect information associated with traffic flows between devices (e.g., NetFlow), and others extract features from the flows, combining them with data extracted from the packets.
- Labelling: Whether each observation in the dataset has been identified as normal, anomalous, or even as belonging to a known attack. Conversely, no labelling may be available, in which case the dataset is intended for unsupervised learning models.
- Type: The nature of a dataset may be synthetic, where the process and environment in which the dataset is generated are controlled, or it may be the result of capturing traffic in a real environment.
- Duration: Network traffic datasets consist of network traffic recorded over a specific time interval, which may range from hours to days, months, or even years.
- Size: The depth of the dataset in terms of the number of records or the physical size, and their distribution across the different classes.
- Freshness: It is also important to consider the year in which the dataset was created, as the evolution of attacks and network usage patterns may not be reflected in older datasets, thus compromising their validity in addressing current issues.
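To make these comparison criteria concrete, they can be encoded as a small record type and queried programmatically. The sketch below is purely illustrative: the class and field names are our own, and the few entries are transcribed (abridged) from Table 1.

```python
from dataclasses import dataclass

@dataclass
class DatasetInfo:
    name: str
    availability: str   # "Public" or "Protected"
    collected: str      # "packets", "flows" or "features"
    labelled: bool
    synthetic: bool     # False means captured in a real environment
    year: int

# A few entries transcribed from Table 1 (abridged).
catalogue = [
    DatasetInfo("UGR'16", "Public", "flows", True, False, 2016),
    DatasetInfo("UNSW-NB15", "Public", "features", True, True, 2015),
    DatasetInfo("CICIDS2017", "Protected", "flows", True, True, 2017),
]

# Example query: public, labelled, flow-level datasets.
hits = [d.name for d in catalogue
        if d.availability == "Public" and d.labelled and d.collected == "flows"]
print(hits)
```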
A summary of the datasets analysed according to the characteristics described above is shown in Table 1.
Table 1. Overview of available network datasets.

| Dataset | Availability | Collected Data | Labeled | Type | Duration * | Size ** | Year | Freshness | Balanced |
|---|---|---|---|---|---|---|---|---|---|
| DARPA [21] | Public | packets | yes | synthetic | 7 weeks | 6.5 TB | 1998–1999 | questioned | no |
| NSL-KDD [22] | Public | features | yes | synthetic | N.S. | 5M o. | 1998–1999 | questioned | yes |
| Kyoto 2006+ [23] | Public | features | yes | real | 9 years | 93M o. | 2006–2015 | yes | yes |
| Botnet [24] | Public | packets | yes | synthetic | N.S. | 14 GB p. | 2010–2014 | yes | yes |
| UNSW-NB15 [25] | Public | features | yes | synthetic | 31 hours | 2.5M o. | 2015 | yes | no |
| UGR’16 [26] | Public | flows | yes | real | 6 months | 17B f. | 2016 | yes | no |
| CICIDS2017 [27] | Protected | flows | yes | synthetic | 5 days | 3.1M f. | 2017 | yes | no |
| IDS2018 [28] | Protected | features | yes | synthetic | 10 days | 1M o. | 2018 | yes | no |
| NF-UQ-NIDS [29] | Public | flows | yes | synthetic | N.S. | 12M f. | 2021 | yes | no |

[…] [39] that ensures the cohesion of the resulting data. The result is a dataset of tagged network packets with a total of almost 14 GB of information and a fairly balanced split between normal and anomalous traffic of almost 55% and 45%, respectively.
2.6. UNSW-NB15
The Cyber Range Lab at the Australian Centre for Cyber Security generated the synthetic UNSW-NB15 dataset [25] in 2015 using the IXIA Perfect Storm traffic generator. The simulation environment used to generate the samples consists of three servers, two of which generate benign traffic, while the third generates traffic associated with various attacks such as DoS, exploits, and rootkits. The dataset is comparatively small, covering a total of 31 h in two subsets of 16 and 15 h, respectively, with just under 2.5 million observations, 12% of which correspond to anomalies or attacks. Labels are available for each flow, indicating whether it is normal or not, as well as the attack category to which it belongs. Finally, the data are available in packet format (PCAP) and as a version with 49 features extracted from the captured flows.
2.7. UGR’16
The UGR’16 dataset [26] was created by the University of Granada in 2016 by capturing the real network traffic of a medium-sized ISP between March and June 2016. Subsequently, during July and August, different attacks such as DoS, botnet, or port scanning were deliberately launched on the same ISP and all the traffic was captured, so that this subset could be used as a test set. The dataset consists of NetFlow traffic flows with almost 17 billion different connections, of which more than 98% correspond to normal traffic, making it very imbalanced. After the traffic was captured, state-of-the-art anomaly detection and network attack identification techniques were employed to label the dataset, assigning each record a label indicating the type of attack to which it belonged. Given its size and recency, it is an up-to-date dataset for building or training AI and NIDS models.
As can be seen in Section 3.3, this dataset presents some labelling problems.
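Because more than 98% of UGR’16 flows are normal, models trained on it typically need an imbalance countermeasure. One common, simple option is inverse-frequency class weighting; the sketch below uses only the standard library, with hypothetical label counts that mimic that imbalance.

```python
from collections import Counter

# Hypothetical per-flow labels with a UGR'16-like skew
# (>98% of the traffic is normal/background).
labels = ["normal"] * 981 + ["dos"] * 12 + ["scan"] * 7

counts = Counter(labels)
total = sum(counts.values())

# Inverse-frequency weights: rare attack classes get proportionally
# larger weights, so the training loss does not ignore them.
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}
print(weights)
```

Many learning libraries accept such weights directly (e.g., as per-class weights in a loss function), making this a cheap first step before resorting to resampling.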
2.8. CIC Datasets
The Canadian Institute for Cybersecurity (CIC) has generated several datasets to validate the performance of NIDS or to train the models underlying these NIDS. Among the various datasets available, the following should be highlighted:
- CICIDS2017 [27]: Generated in 2017, this is a synthetic network traffic dataset captured in a controlled environment over a total of 5 days, available on request (protected). The captured data are in packet and flow formats, and are also available in extracted-feature format with a total of 80 different features. The captured traffic is labelled, with each record tagged with the attack it corresponds to, including DoS, SSH, and botnet attacks.
- CSE-CIC-IDS2018 [28]: This is a synthetic dataset generated in 2018 specifically based on network traffic intrusion criteria. It includes DoS attacks, web attacks, and network infiltration, among others, recorded on more than 400 different hosts. As with CICIDS2017, the data are in packet and flow formats, with a version containing 80 extracted features, and access requires a prior request (protected). Unlike CICIDS2017, it is modifiable and extensible.