BGP Dataset-Based Malicious User Activity Detection

Please note this is a comparison between Version 1 by Dongkyoo Shin and Version 2 by Rita Xu.

Recent advances in the Internet and digital technology have brought a wide variety of activities into cyberspace, but they have also brought a surge in cyberattacks, making it more important than ever to detect and prevent cyberattacks.

anomaly detection
machine learning
BGP dataset preprocessing

1. Introduction

In recent decades, advances in computer and network technology have moved a wide range of activities into cyberspace, and most interactions now take place there. At the same time, cyberattacks have increased at an alarming rate: from 2021 to 2022, cross-border cyberattacks increased by 28% [1]. The US Department of Defense recognizes the gravity of the situation and has designated cyberspace as the fifth battlefield, investing substantial sums in preparing for and detecting cyberattacks [2]. However, because defending against every cyberattack is unrealistic and new attack methods appear every day, the information protection systems of some organizations achieve an attack detection rate of 60% to 70%, and approximately 30% of systems show false positives [3]. To address these problems, a group of cyberattacks is selected, and the associated BGP data are collected and analyzed with machine learning (ML) to detect anomalies in the IPs and Autonomous Systems (AS) involved. BGP data are challenging for ML models because they mix text and numerical data, which makes direct model training difficult. In addition, only a limited number of abnormal samples are available for anomaly detection. Both factors can cause difficulties during model training and during anomaly detection. To overcome them, BGP data, which encompass routing information from global networks, are collected and preprocessed to enable smooth model training. The preprocessed data are then fed into the ML models, and model performance is evaluated by comparing several quantitative metrics, including those derived from the confusion matrix.
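To make the preprocessing problem concrete, the following minimal sketch shows one way a BGP-like record that mixes text and numbers could be turned into a purely numeric feature vector. The record layout (`type`, `prefix`, `as_path`), the chosen features, and the function name `extract_features` are assumptions made for illustration only; they are not the preprocessing pipeline used in the study, and the sketch assumes AS paths are plain space-separated AS numbers.

```python
from collections import Counter


def extract_features(record):
    """Convert one BGP-like record (hypothetical layout) into numeric features."""
    # The AS path arrives as text, e.g. "64500 64501 64501 64502".
    # Withdrawals carry no path, so an empty string yields an empty list.
    as_path = [int(asn) for asn in record.get("as_path", "").split()]
    repeats = Counter(as_path)
    return {
        # Message type is categorical text: encode it as a 0/1 flag.
        "is_withdrawal": 1 if record["type"] == "W" else 0,
        # "192.0.2.0/24" -> 24
        "prefix_len": int(record["prefix"].split("/")[1]),
        "path_len": len(as_path),
        "unique_as": len(repeats),
        # Largest number of times one AS appears in the path (prepending).
        "max_prepend": max(repeats.values(), default=0),
        # The origin AS is the last hop of the path; 0 marks "no path".
        "origin_as": as_path[-1] if as_path else 0,
    }


# Illustrative records using documentation prefixes and private-use AS numbers.
records = [
    {"type": "A", "prefix": "192.0.2.0/24", "as_path": "64500 64501 64501 64502"},
    {"type": "W", "prefix": "198.51.100.0/24", "as_path": ""},
]
features = [extract_features(r) for r in records]
```

Each resulting dictionary contains only integers, so it can be passed to a numeric ML model after any scaling the model requires.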

2. Border Gateway Protocol (BGP) Data

Much research has been conducted on detecting anomalies in cyberspace. BGP is one of the main routing protocols on the Internet. It is used to exchange routing and reachability information between multiple AS, and this information is used to determine the best route between AS to a destination. BGP operates in a transitive, self-replicating fashion and provides a variety of attributes and mechanisms for updating routing tables and responding to a wide range of network changes [4]. Because routing information is exchanged between AS around the world, BGP data contain large amounts of routing information and play a critical role in network behavior. However, the complexity and size of BGP data make analyzing them effectively and detecting anomalies a challenging task, and it remains an active research topic.
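As a rough illustration of how routing information leads to a "best route", the sketch below ranks candidate routes to the same prefix by three widely used BGP criteria: highest LOCAL_PREF, then shortest AS path, then lowest origin type. This is a deliberately simplified sketch; real BGP implementations apply further tie-breakers (for example MED, eBGP over iBGP, and router ID) as well as vendor-specific steps. The `Route` class and `best_route` function are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Lower rank is preferred: IGP < EGP < INCOMPLETE.
ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: tuple          # AS numbers from the neighbor to the origin AS
    local_pref: int = 100   # a common default value, assumed here
    origin: str = "IGP"


def best_route(candidates):
    """Pick one route among candidates for the same prefix (simplified)."""
    return min(
        candidates,
        key=lambda r: (-r.local_pref, len(r.as_path), ORIGIN_RANK[r.origin]),
    )


routes = [
    Route("203.0.113.0/24", (64500, 64510, 64520)),
    Route("203.0.113.0/24", (64501, 64520)),
]
chosen = best_route(routes)  # the two-hop path wins on AS path length
```

Because the AS path feeds directly into this decision, an unexpected change in the path or in the origin AS can redirect traffic, which is why path attributes are a natural input for anomaly detection.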

3. Research on Cyber Anomaly Detection

Machine learning has been used to detect anomalies in cyberspace with BGP data and with other types of cyber data. Lad M. et al. [5] collect BGP routing data to detect possible hijacks in real time and notify the owner. For anomaly detection, AS with known cyberattack cases are selected, and the paths of their data are continuously tracked. If a new path pattern consistently appears alongside the existing path patterns, it is identified as an anomaly, and a notification is promptly issued. The study found that anomalies can be detected solely from changes in AS by continuously tracking AS that have been involved in cyberattacks. However, it is limited in that it reports no performance figures for detecting anomalous behavior in cyberspace. Comarela G. et al. [6] analyze BGP data to identify anomalous AS based on anomalous relationships. Because BGP data contain missing values, inferring the precise relationships between AS is challenging, so a preprocessing step was introduced to allow anomalies to be detected despite noise. In addition, the concept of a “(λ, ν)-event” was used to extract data exhibiting abrupt changes through tensor analysis, given information on prefixes, AS, and time. This study demonstrates that anomaly detection using AS and time data is feasible and that rapidly changing data are well suited to anomaly detection. McGlynn K. et al. [7] studied a model that detects anomalies by applying an Autoencoder (AE) [8] to AS paths from BGP routing data. The experimental results, expressed as F1-Scores, showed good performance of 82% and 75%; however, performance tended to decrease as the amount of data increased. Copstein R. et al. [9] compared three different temporal representations of BGP data, using Naïve Bayes (NB) [9] and decision trees to detect anomalies. The evaluation showed that a high accuracy of 84% and recall of 85% were achieved by using redundant packet buffer data in BGP data.

Other studies rely on network or sensor data rather than BGP data. Choudhary S. et al. [10] proposed extracting key features from multiple network datasets to form training data, which were then input to a Deep Neural Network (DNN) [11]. The study consistently showed good detection rates of 95% on different datasets, such as UNSW-NB15 [12], NSL-KDD [13], and KDD-Cup’99 [14]. Ji Y. et al. [15] conducted an experiment to distinguish normal from abnormal data using sensor data from vehicle control functions instead of cyber data. Sensor data from the vehicle were collected and preprocessed to extract key features related to control unit malfunctions, forming the training and experimental data. The data were fed into a One-Class Support Vector Machine (One-SVM) [16] and classified as normal or abnormal, achieving excellent results of TPR 0.81 and TNR 1.0. This experiment demonstrates that One-SVM can separate anomalous from normal data, and the performance of the algorithm was validated with AUROC. Halbouni A. et al. [17] preprocessed the CIC-IDS2017 [18], UNSW-NB15, and WSN-DS [19] datasets to construct training and experimental data. The data were input into a CNN–LSTM [20], a fusion of a CNN and an LSTM in which the LSTM handles temporal information and the CNN handles spatial information. The experiments indicate that better performance can be achieved by combining LSTMs and CNNs and leveraging the strengths of Logistic Regression (LR) [21] and decision tree (DT) [22]. Anton S.D.D. et al. [23] conducted experiments to detect network attacks through time series analysis of network data. Datasets DS1 [24], based on Modbus, and DS2 [25], based on OPC UA, were preprocessed to retain only the core features of the data. A Random Forest (RF) [26] and a Support Vector Machine (SVM) [27] were trained on these data, reaching an accuracy of 0.92 for the SVM and 0.99 for the RF. Both models performed well, with RF reaching the highest detection accuracy of 0.99.

Related studies have proposed methods for detecting cyberattacks and anomalies, as shown in Table 1. The ML models used to detect anomalies include CNN–LSTM, RF, and One-SVM. However, most of these studies use historical data rather than the latest updated data, which means that they cannot keep up with rapidly changing cyberattack methods. In addition, studies on BGP data, which are real-time data, are limited by a lack of diversity in ML models and of detailed evaluation indicators.
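The path-tracking idea described for Lad M. et al. [5] can be illustrated with a small sketch: remember which origin AS are normal for each monitored prefix, and raise an alert only when an unfamiliar origin is seen repeatedly. This is a simplified illustration under stated assumptions, not the authors' actual system; the class name `OriginTracker`, the `min_repeats` threshold, and the use of a simple counter to model "consistently detected" are all hypothetical.

```python
from collections import defaultdict


class OriginTracker:
    """Flag prefixes whose origin AS changes and keeps reappearing (sketch)."""

    def __init__(self, min_repeats=3):
        self.known = defaultdict(set)    # prefix -> origin AS seen as normal
        self.pending = defaultdict(int)  # (prefix, origin) -> unknown sightings
        self.min_repeats = min_repeats   # assumed threshold for "consistent"

    def learn(self, prefix, as_path):
        """Record the origin (last AS in the path) as normal for this prefix."""
        self.known[prefix].add(as_path[-1])

    def observe(self, prefix, as_path):
        """Return True once an unknown origin has been seen min_repeats times."""
        origin = as_path[-1]
        if origin in self.known[prefix]:
            return False
        self.pending[(prefix, origin)] += 1
        return self.pending[(prefix, origin)] >= self.min_repeats


tracker = OriginTracker(min_repeats=2)
tracker.learn("192.0.2.0/24", [64500, 64501])

# Same origin through a different upstream: not an origin change.
same_origin = tracker.observe("192.0.2.0/24", [64510, 64501])
# A new origin AS appears twice: the second sighting triggers the alert.
first_alert = tracker.observe("192.0.2.0/24", [64510, 64666])
second_alert = tracker.observe("192.0.2.0/24", [64511, 64666])
```

Requiring repeated sightings trades detection speed for fewer alerts on transient routing changes; the right threshold would depend on the monitored network.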

Table 1. Anomaly detection algorithms.

Year	Study	Data	Detection Technique	Performance
2006	Lad M. et al. [5]	BGP Data [4]	No technique	Not reported
2014	Comarela G. et al. [6]	BGP Data [4]	No technique	Not reported
2019	McGlynn K. et al. [7]	BGP Data [4]	AE [8]	F1-Score: 0.82
2020	Copstein R. et al. [9]	BGP Data [4]	NB [28]	Accuracy: 0.84, Recall: 0.85
2020	Choudhary S. et al. [11]	UNSW-NB15 [13], NSL-KDD [14], KDD-Cup’99 [15]	DNN [12]	Accuracy: 0.96, AUROC: 0.96
2022	Ji Y. et al. [16]	Sensor Data [16]	One-SVM [17]	TPR: 0.81, TNR: 1.0
2022	Halbouni A. et al. [18]	UNSW-NB15 [13], CIC-IDS2017 [19], WSN-DS [20]	CNN–LSTM [21], NB [10], LR [22], DT [23]	Accuracy: 0.98
2019	Anton S.D.D. et al. [24]	DS1 [25], DS2 [26]	RF [27], SVM [29]	SVM Accuracy: 0.92, RF Accuracy: 0.99
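
Most of the metrics reported in Table 1 (accuracy, recall or TPR, TNR, and F1-Score) can be computed directly from the four counts of a binary confusion matrix; AUROC is the exception, since it requires scores across many thresholds. The sketch below shows the standard definitions. The function name `confusion_metrics` and the example counts are made up for illustration and do not come from any of the cited studies.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute common detection metrics from binary confusion-matrix counts.

    tp: anomalies correctly flagged     fp: normal samples wrongly flagged
    tn: normal samples correctly passed fn: anomalies that were missed
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # also called TPR
    tnr = tn / (tn + fp) if (tn + fp) else 0.0      # true negative rate
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "tnr": tnr,
        "f1": f1,
    }


# Hypothetical counts: 50 anomalies (40 caught) and 50 normal samples (45 passed).
metrics = confusion_metrics(tp=40, fp=5, tn=45, fn=10)
```

With these counts, accuracy is 0.85, recall is 0.80, and TNR is 0.90. When abnormal samples are scarce, accuracy alone can look high even if most anomalies are missed, so recall and TNR are worth reporting together.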