With the significant increase in cyber-attacks and attempts to gain unauthorised access to systems and information, Network Intrusion-Detection Systems (NIDSs) have become essential detection tools. Anomaly-based systems use machine learning techniques to distinguish between normal and anomalous traffic. They do this by using training datasets that have been previously gathered and labelled, allowing them to learn to detect anomalies in future data. However, such datasets can be accidentally or deliberately contaminated, compromising the performance of the NIDS.
1. Introduction
Network Intrusion-Detection Systems (NIDSs) represent a primary cybersecurity mechanism for identifying potential attacks on a communication network. To accomplish this goal, they analyse the network traffic passing through the system, regardless of whether it is internally generated or originates from external entities targeting the network. Detecting intrusions allows network administrators to become aware of system vulnerabilities and to make quick decisions to abort or mitigate attacks. Additionally, NIDSs allow them to implement measures to strengthen the system in the future [1].
NIDSs can be categorised into various types based on two fundamental principles: the architecture and the detection technique employed. In terms of architecture, NIDSs can be classified as host-based, network-based, or collaborative approaches between different components. According to the detection technique, the classification distinguishes signature-based, Stateful-Protocol-Analysis-based, and anomaly-detection-based NIDSs [2].
Signature-based NIDSs possess a repository of network patterns representing prevalent network attacks. They operate by matching the network sequences they examine against this knowledge base to detect potential attacks [3].
Alternatively, Stateful-Protocol-Analysis-based NIDSs rely on a comprehensive understanding of the monitored protocol. They analyse all interactions to identify sequences of actions that might result in a vulnerability or an insecure state [3].
In contrast, anomaly-detection-based NIDSs employ mechanisms to detect abnormal network traffic behaviour. These anomalous activities typically correspond to network traffic patterns that have a significantly low likelihood of occurring or are markedly misaligned with normal traffic. Crucially, anomaly detection makes it possible to handle novel or previously unknown (zero-day) attacks, since such attacks generate traffic patterns that have not been seen before. This type of NIDS often relies on machine learning techniques to carry out anomaly detection; when this approach is followed, the subjective evaluation of attacks is effectively circumvented.
Different strategies have been employed to detect anomalies in NIDSs through various machine learning techniques [4,5], including statistical techniques like Principal Component Analysis (PCA) [6] or Markov models [7,8]; classification techniques like Artificial Neural Networks (ANNs) [9,10,11,12], Support Vector Machines (SVMs) [6], deep learning models [13,14] including Autoencoders [9,15], or Decision Trees including Random Forest [16]; and clustering techniques like outlier detection [17]. Applying these techniques requires approaching the problem from several perspectives, which can be categorised as supervised, semi-supervised, or unsupervised, depending on the specific technique chosen [18].
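Several of the cited techniques work by modelling normal traffic and scoring deviations from it. The following is a minimal sketch of the PCA-based idea only, assuming NumPy is available and using entirely synthetic, hypothetical flow features: normal traffic is modelled as lying near a low-dimensional subspace, and flows far from that subspace receive a high anomaly score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-feature flow records: "normal" traffic is generated to
# lie near a 2-dimensional subspace, plus a little measurement noise.
latent = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 2.0, 0.5],
                   [0.3, 1.0, 2.0]])
normal = latent @ mixing + 0.05 * rng.normal(size=(500, 3))

# "Train" PCA on traffic assumed to be attack-free: centre the data and
# keep the top-2 principal directions.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]            # top-2 principal subspace

def anomaly_score(x):
    """Reconstruction error: distance from the normal-traffic subspace."""
    centred = x - mean
    projected = centred @ components.T @ components
    return np.linalg.norm(centred - projected, axis=-1)

# Threshold chosen from the training scores; unseen flows far off the
# subspace score well above it and are flagged as anomalous.
threshold = np.percentile(anomaly_score(normal), 99)
test_flow = np.array([[50.0, -40.0, 30.0]])   # clearly off-subspace
print(anomaly_score(test_flow)[0] > threshold)
```

In a real NIDS the feature matrix would come from captured (and presumed clean) training traffic, and the threshold would be tuned against a validation set rather than fixed at a percentile.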
Regardless of the technique used for anomaly detection in NIDSs, the underlying models must be trained to distinguish normal traffic from anomalous traffic. This training process uses datasets comprising real network traffic, synthetic traffic, or a combination of both. More specifically:
- Synthetic traffic datasets are created by generating traffic in a controlled environment that emulates a real-world setting. The generated traffic may include traffic related to known attacks, providing enough samples for machine learning models to learn to identify and detect such anomalies. This enables optimisation of the dataset in terms of its size and of the balance between regular and irregular traffic samples. It also ensures the correct labelling of each observation, since the traffic has been intentionally and deliberately generated. Such observations can be, for instance, the traffic flows seen in the network. However, a potential issue is that synthetic traffic may not accurately reflect the network traffic patterns observed in a genuine environment.
- Real traffic datasets capture all network communications within a real production environment. This provides access to the patterns of network traffic consumption and usage that occur in an actual scenario, and potentially to any cyber-attacks that take place. Unlike synthetic datasets, real traffic samples may be biased or imbalanced, with anomalous traffic often minimal or completely absent. A subsequent process is needed to assign a normal or attack label to each flow before it can be used to train machine learning models.
- Composite datasets are generated by combining real environment data with synthetic traffic that introduces attack patterns.
Regardless of the AI model used in a NIDS, the dataset’s labelling accuracy is crucial to maintaining high model performance. This principle applies equally to supervised and unsupervised learning. In supervised learning, labelling is necessary to enable models to learn how to identify anomalous traffic. In contrast, unsupervised learning generally assumes that the training dataset consists of normal traffic only and is, therefore, free of anomalies.
2. Datasets for Network Security Purposes
To effectively train any AI model, especially those constituting NIDSs based on anomaly detection, a prerequisite is a comprehensive dataset. This dataset should encompass a sufficient number of samples representing all the various classes or patterns, whether benign or malicious, enabling the model to learn and predict accurately during subsequent training phases. In the specific case of NIDSs, a large and correctly labelled dataset is assumed [19]. The quality of the trained models depends to some extent on the quality of the data on which they were trained [20], so it is important to make a thorough analysis of the typology of datasets available in the NIDS domain.
Before reviewing the different datasets available in the field of cybersecurity, it is necessary to define the criteria according to which these datasets will be analysed:
- Availability: Whether the dataset is freely accessible (Public) or, on the contrary, access is reserved, by means of payment or explicit request (Protected).
- Collected data: Some datasets collect traffic packet by packet (e.g., PCAP files), others collect information associated with traffic flows between devices (e.g., NetFlow), and others extract features from the flows, combining them with data extracted from the packets.
- Labelling: Whether each observation in the dataset has been identified as normal, anomalous, or even as belonging to a known attack. Conversely, no labelling may be available, in which case the dataset is intended for unsupervised learning models.
- Type: The nature of a dataset may be synthetic, where the process and environment in which the dataset is generated are controlled, or it may be the result of capturing traffic in a real environment.
- Duration: Network traffic datasets consist of network traffic recorded over a specific time interval, which may range from hours to days, months, or even years.
- Size: The depth of the dataset in terms of the number of records or the physical size, and their distribution across the different classes.
- Freshness: It is also important to consider the year in which the dataset was created, as the evolution of attacks and network usage patterns may not be reflected in older datasets, thus compromising their validity in addressing current issues.
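To make these comparison criteria concrete, they can be encoded as a small record type and queried programmatically. The sketch below is purely illustrative: the class and field names are our own, and the few entries are transcribed (abridged) from Table 1.

```python
from dataclasses import dataclass

@dataclass
class DatasetInfo:
    name: str
    availability: str   # "Public" or "Protected"
    collected: str      # "packets", "flows" or "features"
    labelled: bool
    synthetic: bool     # False means captured in a real environment
    year: int

# A few entries transcribed from Table 1 (abridged).
catalogue = [
    DatasetInfo("UGR'16", "Public", "flows", True, False, 2016),
    DatasetInfo("UNSW-NB15", "Public", "features", True, True, 2015),
    DatasetInfo("CICIDS2017", "Protected", "flows", True, True, 2017),
]

# Example query: public, labelled, flow-level datasets.
hits = [d.name for d in catalogue
        if d.availability == "Public" and d.labelled and d.collected == "flows"]
print(hits)
```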
A summary of the datasets analysed according to the characteristics described above is shown in Table 1.
Table 1. Overview of available network datasets.

| Dataset | Availability | Collected Data | Labeled | Type | Duration * | Size ** | Year | Freshness | Balanced |
|---|---|---|---|---|---|---|---|---|---|
| DARPA [21] | Public | packets | yes | synthetic | 7 weeks | 6.5 TB | 1998–1999 | questioned | no |
| NSL-KDD [22] | Public | features | yes | synthetic | N.S. | 5M o. | 1998–1999 | questioned | yes |
| Kyoto 2006+ [23] | Public | features | yes | real | 9 years | 93M o. | 2006–2015 | yes | yes |
| Botnet [24] | Public | packets | yes | synthetic | N.S. | 14 GB p. | 2010–2014 | yes | yes |
| UNSW-NB15 [25] | Public | features | yes | synthetic | 31 hours | 2.5M o. | 2015 | yes | no |
| UGR’16 [26] | Public | flows | yes | real | 6 months | 17B f. | 2016 | yes | no |
| CICIDS2017 [27] | Protected | flows | yes | synthetic | 5 days | 3.1M f. | 2017 | yes | no |
| IDS2018 [28] | Protected | features | yes | synthetic | 10 days | 1M o. | 2018 | yes | no |
| NF-UQ-NIDS [29] | Public | flows | yes | synthetic | N.S. | 12M f. | 2021 | yes | no |

[…] [39] that ensures the cohesion of the resulting data. The result is a dataset of tagged network packets with a total of almost 14 GB of information and a fairly balanced split between normal and anomalous traffic of almost 55% and 45%, respectively.
2.6. UNSW-NB15
The Cyber Range Lab at the Australian Centre for Cyber Security generated the synthetic UNSW-NB15 dataset [25] in 2015 using the IXIA Perfect Storm traffic generator. The simulation environment used to generate the samples consists of three servers, two of which generate benign traffic, while the third generates traffic associated with various attacks such as DoS, exploits, and rootkits. The dataset is comparatively small, covering a total of 31 h in two subsets of 16 and 15 h, respectively, with just under 2.5 million observations, 12% of which correspond to anomalies or attacks. Labels are available for each flow, indicating whether it is normal or not, as well as the attack category to which it belongs. Finally, the data are available in packet format (PCAP) and as a version with 49 features extracted from the captured flows.
2.7. UGR’16
The UGR’16 dataset [26] was created by the University of Granada in 2016 by capturing the real network traffic of a medium-sized ISP between March and June 2016. Subsequently, during July and August, different attacks such as DoS, botnet, or port scanning were deliberately launched on the same ISP and all the traffic was captured, so that this subset could be used as a test set. The dataset consists of NetFlow traffic flows with almost 17 billion different connections, of which more than 98% correspond to normal traffic, making it very imbalanced. After the traffic was captured, state-of-the-art anomaly detection and network attack identification techniques were employed to label the dataset, assigning each record a label indicating the type of attack to which it belonged. Given its size and recency, it is an up-to-date dataset for building or training AI and NIDS models.
As can be seen in Section 3.3, this dataset presents some labelling problems.
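Because more than 98% of UGR’16 flows are normal, models trained on it typically need an imbalance countermeasure. One common, simple option is inverse-frequency class weighting; the sketch below uses only the standard library, with hypothetical label counts that mimic that imbalance.

```python
from collections import Counter

# Hypothetical per-flow labels with a UGR'16-like skew
# (>98% of the traffic is normal/background).
labels = ["normal"] * 981 + ["dos"] * 12 + ["scan"] * 7

counts = Counter(labels)
total = sum(counts.values())

# Inverse-frequency weights: rare attack classes get proportionally
# larger weights, so the training loss does not ignore them.
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}
print(weights)
```

Many learning libraries accept such weights directly (e.g., as per-class weights in a loss function), making this a cheap first step before resorting to resampling.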
2.8. CIC Datasets
The Canadian Institute for Cybersecurity (CIC) has generated several datasets to validate the performance of NIDS or to train the models underlying these NIDS. Among the various datasets available, the following should be highlighted:
- CICIDS2017 [27]: Generated in 2017, this is a synthetic network traffic dataset captured in a controlled environment over a total of 5 days, available on request (protected). The captured data are in packet and flow formats, and are also available in extracted-feature format with a total of 80 different features. The captured traffic is labelled, with each record tagged with the attack it corresponds to, including DoS, SSH, and botnet attacks.
- CSE-CIC-IDS2018 [28]: This is a synthetic dataset generated in 2018 specifically based on network traffic intrusion criteria. It includes DoS attacks, web attacks, and network infiltration, among others, recorded on more than 400 different hosts. As with CICIDS2017, the data are in packet and flow formats, with a version containing 80 extracted features, and access requires a prior request (protected). Unlike CICIDS2017, it is modifiable and extensible.