With the significant increase in cyber-attacks and attempts to gain unauthorised access to systems and information, Network Intrusion-Detection Systems (NIDSs) have become essential detection tools. Anomaly-based systems use machine learning techniques to distinguish between normal and anomalous traffic, relying on previously gathered and labelled training datasets from which they learn to detect anomalies in future data. However, such datasets can be accidentally or deliberately contaminated, compromising the performance of the NIDS.
1. Introduction
Network Intrusion-Detection Systems (NIDSs) represent a primary cybersecurity mechanism for identifying potential attacks on a communication network. To accomplish this goal, they analyse the network traffic passing through the system, regardless of whether it is generated internally or originates from external entities targeting the network. Detecting intrusions allows network administrators to become aware of system vulnerabilities and to make quick decisions to abort or mitigate attacks. Additionally, NIDSs allow them to implement measures to strengthen the system in the future [1].
NIDSs can be categorised into various typologies based on two fundamental principles: the architecture and the techniques employed. Focusing on the architecture, NIDSs can be classified as host-based, network-based, or collaborative approaches between different components. According to the detection technique, the classification may be signature-based, Stateful Protocol Analysis-based, or anomaly-detection-based NIDSs [2].
Signature-based NIDSs possess a repository of network patterns representing prevalent network attacks. They operate by matching the network sequences they examine against this knowledge base to detect potential attacks [3].
Alternatively, Stateful Protocol Analysis-based NIDSs rely on a comprehensive understanding of the monitored protocol. They analyse all interactions to identify sequences of actions that might result in a vulnerability or an insecure state [3].
In contrast, anomaly-detection-based NIDSs employ mechanisms to detect abnormal network traffic behaviour. These anomalous activities typically correspond to network traffic patterns that have a significantly low likelihood of occurring or are markedly misaligned with normal traffic. Crucially, anomaly detection can handle novel or previously unknown (zero-day) attacks, because such attacks generate traffic patterns that have not been observed before. This type of NIDS often relies on machine learning techniques to carry out anomaly detection, an approach that effectively circumvents the subjective evaluation of attacks.
Different strategies have been employed to detect anomalies in NIDSs through various machine learning techniques [4][5], including statistical techniques like Principal Component Analysis (PCA) [6] or Markov models [7][8]; classification techniques like Artificial Neural Networks (ANNs) [9][10][11][12], Support Vector Machines (SVMs) [6], deep learning models [13][14] including Autoencoders [9][15], or Decision Trees including Random Forest [16]; and clustering techniques like outlier detection [17]. These approaches to the problem can be categorised as supervised, semi-supervised, or unsupervised, depending on the specific technique chosen [18].
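To make these techniques more concrete, the following is a minimal sketch of PCA-based anomaly scoring, one of the statistical techniques cited above: flows are projected onto the principal components of (assumed-clean) training traffic, and flows that reconstruct poorly are flagged. The synthetic data, the four stand-in features, and the 99th-percentile threshold are illustrative assumptions, not details taken from the cited works.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for numeric flow features (e.g., duration, bytes, packets, ports).
X_train = rng.normal(size=(1000, 4))                   # assumed mostly-normal traffic
X_new = np.vstack([rng.normal(size=(95, 4)),           # normal-looking flows
                   rng.normal(loc=5.0, size=(5, 4))])  # shifted, anomalous flows

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

def reconstruction_error(X):
    """Mean squared error between each flow and its PCA reconstruction."""
    Z = scaler.transform(X)
    return np.mean((Z - pca.inverse_transform(pca.transform(Z))) ** 2, axis=1)

# Flag flows whose error exceeds the 99th percentile observed on training data.
threshold = np.percentile(reconstruction_error(X_train), 99)
flags = reconstruction_error(X_new) > threshold
print(f"{flags.sum()} of {len(X_new)} flows flagged as anomalous")
```

The same score-and-threshold pattern underlies autoencoder-based detectors, with the autoencoder replacing the linear PCA projection.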
Regardless of the technique used for anomaly detection in a NIDS, the underlying models must be trained to distinguish normal traffic from anomalous traffic. This training process utilises datasets comprising real network traffic, synthetic network traffic, or a combination of both. More specifically:
- Synthetic traffic datasets are created by generating traffic in a controlled environment that emulates a real-world setting. The generated traffic may include traffic related to known attacks, providing enough samples for machine learning models to competently identify and detect such anomalies. This enables the optimisation of the dataset in terms of size and of the balance between regular and irregular traffic samples. It also ensures the correct labelling of each observation (for instance, each traffic flow seen in the network), since the traffic has been intentionally and deliberately generated. However, a potential drawback is that such datasets may not accurately reflect the traffic patterns observed in a genuine environment.
- Real traffic datasets capture all network communications within a real production environment. This provides access to the patterns of network traffic consumption and usage that occur in an actual scenario, potentially including any cyber-attacks that take place. Unlike synthetic datasets, real traffic samples may be biased or imbalanced, with anomalous traffic often being minimal or completely absent. A subsequent labelling process is needed to assign a normality or attack label to each flow before it can be used to train machine learning models.
- Composite datasets are generated by combining real environment data with synthetic traffic that introduces attack patterns.
Regardless of the AI model used in a NIDS, the dataset’s labelling accuracy is crucial to maintaining high model performance. This principle applies equally to supervised and unsupervised learning. In supervised learning, labelling is necessary to enable models to learn how to identify anomalous traffic. In contrast, unsupervised learning generally assumes that the training dataset consists of normal traffic only and is, therefore, free of anomalies.
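The two labelling assumptions can be illustrated with a short, self-contained sketch: a supervised classifier that learns directly from normal/attack labels, and a one-class model trained only on flows labelled as normal. All data, features, and parameters here are hypothetical, and evaluation on a held-out set is omitted for brevity; the point is that both paths inherit whatever errors the labels contain.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))              # stand-in flow features
y = (rng.random(2000) < 0.05).astype(int)   # 1 = attack, 0 = normal (trusted labels)
X[y == 1] += 4.0                            # make attack flows separable

# Supervised: the boundary is learned directly from the labels, so any
# mislabelled flow distorts the decision function.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Unsupervised/one-class: trained only on flows *labelled* normal; attacks that
# slip into this subset (contamination) silently shift the learned boundary.
ocsvm = OneClassSVM(nu=0.01).fit(X[y == 0])

print("supervised training accuracy:", clf.score(X, y))
print("one-class flags:", int((ocsvm.predict(X) == -1).sum()), "of", len(X))
```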
2. Datasets for Network Security Purposes
To effectively train any AI model, especially those constituting NIDSs based on anomaly detection, a prerequisite is a comprehensive dataset. This dataset should encompass a sufficient number of samples that represent all the various classes or patterns, whether benign or malicious. This foundational dataset enables the model to learn and predict accurately during subsequent training phases. In the specific case of NIDSs, a large and correctly labelled dataset is assumed [19]. The quality of the trained models depends to some extent on the quality of the data on which they were trained [20], so it is important to conduct a thorough analysis of the typology of datasets available in the NIDS domain.
Before reviewing the different datasets available in the field of cybersecurity, it is necessary to define the criteria according to which these datasets will be analysed (a short sketch after the list illustrates how some of these characteristics can be computed):
- Availability: Whether the dataset is freely accessible (Public) or, on the contrary, access is restricted by payment or explicit request (Protected).
- Collected data: Some datasets collect traffic packet by packet (e.g., PCAP files), others collect information associated with traffic flows between devices (e.g., NetFlow), and others extract features from the flows, combining them with data extracted from the packets.
- Labelling: Whether each observation in the dataset has been identified as normal, anomalous, or even as belonging to a known attack; or, conversely, whether no labelling is available, in which case the dataset is intended for unsupervised learning models.
- Type: The nature of a dataset may be synthetic, where the process and environment in which the dataset is generated are controlled, or it may be the result of capturing traffic in a real environment.
- Duration: Network traffic datasets consist of traffic recorded over a specific time interval, which may range from hours to days, months, or even years.
- Size: The depth of the dataset in terms of the number of records or its physical size, and the distribution of records across the different classes.
- Freshness: The year in which the dataset was created, as the evolution of attacks and network usage patterns may not be reflected in older datasets, thus compromising their validity in addressing current issues.
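As a concrete illustration of the last few criteria, the sketch below computes the size, class balance, and duration of a hypothetical NetFlow-style dataset stored as CSV. The file name and the 'timestamp' and 'label' column names are assumptions for this example and will differ between the datasets in Table 1.

```python
import pandas as pd

# Hypothetical flow export with one row per flow.
flows = pd.read_csv("flows.csv", parse_dates=["timestamp"])

size = len(flows)                                      # number of records
balance = flows["label"].value_counts(normalize=True)  # class distribution
duration = flows["timestamp"].max() - flows["timestamp"].min()  # capture window

print(f"Size: {size} flows")
print(f"Duration: {duration}")
print("Class balance:")
print(balance)
```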
A summary of the datasets analysed according to the characteristics described above is shown in Table 1.
Table 1. Overview of available network datasets.
| Dataset | Availability | Collected Data | Labelled | Type | Duration * | Size ** | Year | Freshness | Balanced |
|---|---|---|---|---|---|---|---|---|---|
| DARPA [21] | Public | packets | yes | synthetic | 7 weeks | 6.5 TB | 1998–1999 | questioned | no |
| NSL-KDD [22] | Public | features | yes | synthetic | N.S. | 5M o. | 1998–1999 | questioned | yes |
| Kyoto 2006+ [23] | Public | features | yes | real | 9 years | 93M o. | 2006–2015 | yes | yes |
| Botnet [24] | Public | packets | yes | synthetic | N.S. | 14 GB p. | 2010–2014 | yes | yes |
| UNSW-NB15 [25] | Public | features | yes | synthetic | 31 hours | 2.5M o. | 2015 | yes | no |
| UGR’16 [26] | Public | flows | yes | real | 6 months | 17B f. | 2016 | yes | no |
| CICIDS2017 [27] | Protected | flows | yes | synthetic | 5 days | 3.1M f. | 2017 | yes | no |
| IDS2018 [28] | Protected | features | yes | synthetic | 10 days | 1M o. | 2018 | yes | no |
| NF-UQ-NIDS [29] | Public | flows | yes | synthetic | N.S. | 12M f. | 2021 | yes | no |

* N.S. = not specified. ** o. = observations, f. = flows, p. = packets.
This entry is adapted from the peer-reviewed paper 10.3390/s24020479