The history of intrusion detection systems (IDSs) and intrusion prevention systems (IPSs) can be traced back to an academic paper written in 1986
[15]. The Stanford Research Institute developed the Intrusion Detection Expert System (IDES) using statistical anomaly detection, signatures, and profiles to detect malicious network behaviors. In the early 2000s, IDSs became a security best practice, with few organizations adopting IPSs due to concerns about blocking harmless traffic. The focus was on detecting exploits rather than vulnerabilities. The latter part of 2005 saw the growth of IPS adoption, with vendors creating signatures for vulnerabilities rather than individual exploits
[16]. The capacity of IPSs also increased, allowing more of the network to be monitored.
Next-generation intrusion prevention systems (NGIPSs), which include capabilities such as application and user control, were developed during this period, marking a significant turning point. Sandboxing and emulation features were added to defend against zero-day malware. By 2016, most businesses had deployed next-generation firewalls (NGFWs), which contain IDS/IPS functionality. High-fidelity machine learning is the current focus for tackling threat detection and file analysis
[17].
The groundbreaking academic publication “An Intrusion-Detection Model” by Dorothy E. Denning, which inspired the creation of IDES, is one of the earliest studies to address intrusion detection in networks. Building on this model, the Stanford Research Institute combined statistical anomaly detection, signatures, and profiles to identify hostile network behaviors. Since then, significant turning points in the development of IPS technology have been reached, such as the shift to NGIPSs and NGFWs
[18].
2. Convolutional Neural Networks
CNNs are a specialized subclass of artificial neural networks (ANNs) that are particularly well suited for analyzing visual data. CNNs are designed to automatically and adaptively learn spatial hierarchies of features, which is particularly beneficial for tasks such as image recognition, object detection, and even medical image analysis. The concept of residual learning, as introduced by Kaiming He et al., further enhances the capabilities of CNNs by allowing them to benefit from deeper architectures without suffering from degradation or vanishing gradients
[19].
Unlike traditional ANNs, in which each neuron is connected to all neurons in the preceding and following layers, CNNs employ local connectivity, linking each neuron to a localized region of the input space. Yann LeCun’s paper emphasizes that this local connectivity is crucial for the efficient recognition of localized features in images
[20]. Furthermore, CNNs share parameters across different regions of the input, which significantly reduces the number of trainable parameters; in traditional ANNs, each weight is unique, leading to far more parameters and higher computational costs. CNNs are therefore inherently designed to recognize the same feature regardless of its location in the input space, a form of spatial invariance that traditional ANNs lack. Notably, they often employ deeper architectures, which are made computationally feasible through techniques such as residual learning, as discussed in Kaiming He et al.’s paper.
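To make the effect of local connectivity and weight sharing concrete, the following minimal PyTorch sketch (an illustrative example with arbitrarily chosen layer sizes, not drawn from the cited works) compares the number of trainable parameters in a small convolutional layer with that of a fully connected layer mapping the same input to an output of the same size:

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    """Count the trainable parameters of a module."""
    return sum(p.numel() for p in module.parameters())

# A 3x3 convolution with 8 shared filters vs. a fully connected layer,
# both mapping a 1x28x28 input to an 8x28x28 output.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
dense = nn.Linear(in_features=28 * 28, out_features=8 * 28 * 28)

print(f"convolutional layer: {n_params(conv):,} parameters")    # 80
print(f"fully connected layer: {n_params(dense):,} parameters")  # 4,923,520
```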
The Inception architecture, introduced by Christian Szegedy et al., is another example of a deep yet computationally efficient network
[21]. CNNs are designed to be computationally efficient, particularly when dealing with high-dimensional data, since the architecture leverages local connectivity and parameter sharing to reduce computational requirements. The concept of residual learning, as discussed in the paper by Kaiming He et al., allows CNNs to be trained efficiently even when the network is very deep.
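As a brief sketch of the residual idea referred to above (a minimal simplification, not the exact building block used by He et al.), a residual block adds the input of a small stack of layers back to its output, so the stack only has to learn a residual correction to the identity mapping:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the stacked layers learn F(x) and the
    block outputs F(x) + x, easing the training of very deep networks."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + x)  # identity shortcut connection

x = torch.randn(1, 16, 32, 32)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 32, 32])
```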
Notably, several distinctive architectural elements are associated with CNNs, including filters (kernels) and pooling layers. Filters are learnable weight matrices that are crucial for feature extraction; they slide, or convolve, across the input image to produce feature maps. Yann LeCun’s paper highlights the effectiveness of gradient-based learning techniques in training these filters
[22]. Pooling layers reduce the spatial dimensions of the input, thereby decreasing computational complexity and increasing the network’s tolerance to small variations in the input. They are also particularly useful in making the network less prone to overfitting.
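The following short PyTorch sketch (purely illustrative; the layer sizes are assumptions) shows how a set of convolutional filters produces feature maps and how a pooling layer halves their spatial dimensions:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)                      # one single-channel 28x28 image
conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # 8 learnable 3x3 filters
pool = nn.MaxPool2d(kernel_size=2)                 # 2x2 max pooling

features = conv(x)        # -> (1, 8, 28, 28): one feature map per filter
pooled = pool(features)   # -> (1, 8, 14, 14): spatial dimensions halved
print(features.shape, pooled.shape)
```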
CNNs can be effectively combined with other types of neural networks, like recurrent neural networks (RNNs), for sequential data processing tasks such as video analysis and natural language processing. Additionally, CNNs can be integrated with traditional machine learning algorithms, like support vector machines (SVMs), for tasks like classification, thereby creating a hybrid model that leverages the strengths of both methodologies. In summary, CNNs offer a robust, adaptable, and computationally efficient approach to a wide range of machine learning tasks. Their unique architecture, as validated by seminal research papers, makes them highly effective for tasks involving spatial hierarchies and structured grid data.
3. Extreme Gradient Boosting
XGBoost is an optimized distributed gradient boosting approach designed to be highly efficient and flexible. It has gained immense popularity in machine learning competitions and is widely regarded as the “go-to” algorithm for structured data. XGBoost has been optimized for both computational speed and model performance, making it highly desirable for real-world applications
[23]. Decision-tree-based techniques offer several advantages [24].
One of the most significant advantages of decision trees is their ease of interpretation. They can be visualized, and the decision-making process can be easily understood, even by non-experts. Decision trees are computationally inexpensive to build, evaluate, and interpret compared to algorithms like support vector machines (SVMs)
[25] or ANNs. Unlike other algorithms that require extensive pre-processing, decision trees can handle missing values without imputation, making them more robust. Decision trees can also capture complex non-linear relationships in the data, which linear models may not capture effectively. Further, this approach can be used for both classification and regression tasks, making it very versatile.
Gini impurity is a metric used to quantify the disorder, or impurity, of a set of items, and it is a common choice for the “criterion” parameter in decision tree implementations. Lower Gini impurity values indicate “purer” nodes, and the metric is used to decide the optimal feature to split on at each node of the tree.
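In its standard form, for a node whose items belong to $C$ classes with class proportions $p_i$, the Gini impurity is

\[
G = 1 - \sum_{i=1}^{C} p_i^{2},
\]

so a perfectly pure node (all items from a single class) has $G = 0$, while an even mixture of classes yields the maximum impurity.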
Further advantages of XGBoost stem from its use of ensemble learning [26]. Ensemble methods, particularly boosting algorithms such as XGBoost, are less susceptible to overfitting than single estimators because each new model is fitted to the errors of the current ensemble. By combining several models, ensemble methods can average out biases and reduce variance, thus minimizing the risk of overfitting. Ensemble methods often achieve higher predictive accuracy than individual models. XGBoost, in particular, has been shown to outperform deep learning models on certain types of data sets, especially when the data are tabular.
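As a brief, hedged illustration of how XGBoost is typically applied to tabular data (a generic usage sketch assuming the xgboost and scikit-learn packages, with illustrative hyperparameters rather than the configuration used in this work):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Synthetic tabular data standing in for a real (e.g., network-traffic) data set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A small gradient-boosted tree ensemble; hyperparameter values are illustrative only.
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      subsample=0.8, eval_metric="logloss")
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```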
The objective function optimized by XGBoost includes both a loss term and a regularization term, making it adaptable to different problems:
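In the standard formulation from the XGBoost literature, for a model composed of $K$ regression trees $f_k$, the regularized objective is

\[
\mathcal{L}(\phi) = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2},
\]

where $l$ is a differentiable loss function measuring the difference between the prediction $\hat{y}_i$ and the target $y_i$, $T$ is the number of leaves in a tree, $w$ denotes the leaf weights, and $\gamma$ and $\lambda$ are regularization coefficients that penalize model complexity.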
4. Metaheuristic Optimization
Metaheuristic optimization algorithms have gained significant attention in the field of computational intelligence for their ability to solve complex optimization problems that are often NP-hard. Traditional optimization algorithms, such as gradient-based methods, often get stuck in local optima and are not well suited for solving problems with large, complex search spaces. In contrast, metaheuristics offer several advantages
[27].
Additionally, addressing the challenges of multi-objective optimization problems has been a focal point for many works, leading to the development of various multi-objective evolutionary algorithms
[28]. However, a common hurdle in these algorithms is the delicate balance required between diversity and convergence. This balance critically impacts the quality of solutions derived from the algorithms
[29].
Designed to explore the entire solution space, metaheuristics often find a near-optimal solution within a reasonable amount of time. They are problem-independent, meaning they can be applied to a wide range of optimization problems without requiring problem-specific modifications. Metaheuristics are highly scalable and can handle problems with a large number of variables and constraints. They are less sensitive to initial conditions and can adapt to changes in the problem environment. Metaheuristics can find near-optimal solutions to NP-hard problems in polynomial time, which is a significant advantage over traditional methods that often fail to find feasible solutions within a reasonable time frame.
Algorithms in this family often draw inspiration from natural phenomena, social behaviors, and physical processes. Notable examples include genetic algorithms (GAs) [30], inspired by the process of natural selection and genetics; particle swarm optimization (PSO) [31], based on the social behavior of bird flocking and fish schooling; ant colony optimization (ACO) [32], which mimics the foraging behavior of ants searching for the shortest path; and the firefly algorithm (FA) [33], which draws inspiration from the courting rituals of fireflies. More recent examples include the salp swarm algorithm (SSA) [34], the whale optimization algorithm (WOA) [35], and the COLSHADE [36] optimization algorithm.
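To illustrate the general structure shared by these population-based methods, the following is a minimal particle swarm optimization sketch in Python (a simplified textbook-style variant with assumed, illustrative parameter values, not one of the cited implementations); it minimizes a simple benchmark function by letting particles move toward their personal and global best positions:

```python
import numpy as np

def sphere(x):
    """Benchmark objective: the sphere function, minimized at the origin."""
    return np.sum(x ** 2)

def pso(objective, dim=5, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5.0, 5.0, size=(n_particles, dim))   # particle positions
    vel = np.zeros_like(pos)                                 # particle velocities
    pbest = pos.copy()                                       # personal best positions
    pbest_val = np.array([objective(p) for p in pos])        # personal best scores
    gbest = pbest[np.argmin(pbest_val)].copy()               # global best position

    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Velocity update: inertia + cognitive (personal) + social (global) pulls.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, objective(gbest)

best, best_val = pso(sphere)
print("best value found:", best_val)
```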
Metaheuristics are a popular approach among researchers for improving hyperparameter selection. Many examples exist in the literature, with some interesting ones originating from medical applications
[37]. Further applications include time-series forecasting
[38,39] and computer security [40,41,42]. Hybridization techniques have also shown great promise when applied to metaheuristic algorithms, often producing algorithms that demonstrate performance improvements on given tasks
[43].