Intrusion detection systems (IDS) are widely used to improve security posture in an IT infrastructure. An IDS is considered a suitable and practical approach to detect attacks and assure network security by safeguarding against intrusive hackers. Anomaly-based IDS approaches can efficiently detect zero-day (unknown) attacks. An intrusion can be defined as a sequence of unexpected activities, local or global, that harm network confidentiality, integrity, and/or availability (i.e., the CIA triad). Network traffic consists of packets, each described by its packet header fields. Features derived from those packets are essential for detecting anomalies. The purpose of an IDS is to detect and/or prevent abnormal behavior (i.e., unauthorized use), covering both passive and active network intruder activities, and thus improve CIA.
In recent times, machine learning (ML)-based approaches have been employed for intrusion detection in IoT IDSs. Existing IDSs assume that IoT devices have the same feature pattern and packet types. However, IoT devices vary in several respects, such as hardware characteristics and functionality, computational capability, and the ability to generate various features. The features become sparse when data from different nodes are aggregated, because features (attributes) that are irrelevant to a given node are set to either nulls or zeros. Data sparsity is one of the disadvantages that affect the accuracy and efficiency of data modeling. Feature selection, an important part of a machine learning-based solution, plays an important role in increasing detection accuracy and the speed of the training phase. Several feature selection techniques have been proposed to improve the detection of anomalous behavior variants, such as Flexible Mutual Information-based Feature Selection (FMIFS), Modified Mutual Information-Based Feature Selection (MMIFS) with Support Vector Machine (SVM), and SVM with Neural Networks (NN). Those approaches/models and other recent state-of-the-art studies are presented in the related work section. Detection accuracy of anomaly-based IDSs is considered the main challenge in the IoT ecosystem due to the constantly evolving nature of the IoT environment.
2. Feature Selection
IoT datasets are intrinsically high-dimensional, represented by $N$ instances and $M$ columns (features). The data matrix is $X \in \mathbb{R}^{N \times M}$, and $Y$ is the target variable(s) (class(es)). A target instance (class) may be either discrete or continuous, and the model can also be dynamic or static. Feature selection (FS) enhances model performance by reducing dimensionality. FS can be defined as choosing a subset of $P \ll M$ features, i.e., $X_{FS} \in \mathbb{R}^{N \times P}$, where the $P$ selected features are relevant to the target class. In this research, we endeavored to find an optimal method to detect security violations in the IoT ecosystem: one that is efficient, accurate, and general. What follows provides the rationale we used to find what we claim is optimal.
Feature selection endeavors to eliminate irrelevant and redundant features and to choose the most pertinent and important ones. Furthermore, the FS process usually improves general performance and reduces data dimensionality, lowering the cost of classification and prediction by reducing the time complexity of building the model. In contrast, using all features in the IDS model has several drawbacks: (i) the computational overhead increases, and training and testing become slower; (ii) storage requirements increase due to the large number of features; (iii) the error rate of the model increases because irrelevant features diminish the discriminating power of the relevant features and reduce accuracy.
FS approaches can be grouped into five categories: (i) filter-based, (ii) wrapper-based, (iii) embedded-based, (iv) hybrid-based, and (v) learning-based. The filter method assigns a weight to each feature (i.e., dimension), sorts the features by these weights, and then uses the resulting feature subset to train the model for either classification or prediction. Therefore, the process of feature selection is independent of the classification/prediction technique. Numerous statistical measures are used in filtering methods to obtain feature subsets.
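The weight-sort-select procedure of the filter method can be sketched as follows. This is only a minimal illustration, not the procedure evaluated in this work: the absolute Pearson correlation is assumed as the weighting measure, and the function names (`pearson`, `filter_select`) and the toy feature names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no signal
    return cov / (sx * sy)

def filter_select(columns, target, k):
    """Weight each feature, sort by weight, and keep the top k.

    columns: dict mapping a feature name to its list of values.
    The weight is |correlation with the target| (an assumed measure).
    No classifier is involved, so the selection is model-independent.
    """
    weights = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k], weights

# Toy data with hypothetical feature names (not taken from any dataset).
columns = {
    "pkt_len":  [10, 20, 30, 40, 50, 60],
    "ttl":      [64, 64, 64, 64, 64, 64],  # constant, hence irrelevant
    "flag_cnt": [0, 1, 0, 1, 0, 1],
}
target = [0, 0, 0, 1, 1, 1]
selected, weights = filter_select(columns, target, k=2)
print(selected)  # ['pkt_len', 'flag_cnt']
```

Because the ranking never consults a classifier, the same selected subset can be handed to any downstream model.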
The model, using a particular FS method, initially uses all features but subsequently omits unrelated features to address the curse of dimensionality problem. This refinement is designed to acquire the best subset of features based on statistical measures such as information gain (IG), gain ratio (GR), Pearson’s correlation (PC), chi-square ($\chi^2$), and mutual information (MI).
The wrapper method is considered a black box technique. Inductive algorithms are used to select feature subsets in the wrapper method, whereas filter methods are independent of the inductive algorithm. In addition, wrapper methods are more complex and computationally expensive than filter methods because they rely on iterating the learning systems (i.e., ML-derived models) several times until a subset of relevant features is reached. Moreover, the wrapper method accounts for the influence of the feature subsets on model performance and strives to achieve high classification accuracy.
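To illustrate why wrapper methods are costlier than filter methods, the sketch below runs a greedy forward search that retrains and re-evaluates a learner for every candidate subset. It is a minimal example under stated assumptions: the learner is a nearest-centroid classifier scored by leave-one-out accuracy, and both that choice and the function names are hypothetical, not the models used in this work.

```python
def loo_accuracy(rows, labels, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on `feats`."""
    correct = 0
    for i in range(len(rows)):
        # Build one centroid per class from every row except row i.
        sums, counts = {}, {}
        for j, (row, lab) in enumerate(zip(rows, labels)):
            if j == i:
                continue
            acc = sums.setdefault(lab, [0.0] * len(feats))
            for k, f in enumerate(feats):
                acc[k] += row[f]
            counts[lab] = counts.get(lab, 0) + 1

        def dist(lab):
            # Squared distance from the held-out row to a class centroid.
            return sum((rows[i][f] - sums[lab][k] / counts[lab]) ** 2
                       for k, f in enumerate(feats))

        correct += min(sums, key=dist) == labels[i]
    return correct / len(rows)

def forward_select(rows, labels, n_features):
    """Greedy forward search: add the feature that most improves accuracy."""
    selected, best_score = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        # Every candidate subset costs one full retrain-and-evaluate pass.
        scored = [(loo_accuracy(rows, labels, selected + [f]), f)
                  for f in remaining]
        score, feat = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break  # no candidate improves the model, so stop
        selected.append(feat)
        remaining.remove(feat)
        best_score = score
    return selected, best_score

# Toy data: column 0 separates the classes, column 1 is noise, column 2 is constant.
rows = [[1.0, 5.0, 0.0], [1.2, 1.0, 0.0], [0.9, 3.0, 0.0],
        [3.0, 4.9, 0.0], [3.1, 1.1, 0.0], [2.9, 3.0, 0.0]]
labels = [0, 0, 0, 1, 1, 1]
selected, best_score = forward_select(rows, labels, n_features=3)
print(selected, best_score)  # [0] 1.0
```

The selected subset here depends on the learner used for scoring, which is the coupling to the inductive algorithm that filter methods avoid.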
Embedded methods are incorporated with ML algorithms to select a feature subset during the learning process. Blending feature selection into the learning process yields advantages in classification accuracy and computational cost. Embedded methods can avoid retraining the model when a new feature needs to be added to the subset. Concerning the structure of the embedded approach, the feature selection process is integrated with the classification algorithm and performed simultaneously with it, as in random forest, LASSO (Least Absolute Shrinkage and Selection Operator), and L1 regularization. Embedded methods are computationally less intensive than wrapper methods. However, they still have high computational complexity. Furthermore, the selected feature subset depends on the chosen learning algorithm. Thus, embedded methods endeavor to find the best feature subset during model building by selecting each feature individually. They also offer significant advantages over the previous approaches in terms of model interaction, accuracy, fewer variables, and computational cost.
Information gain (IG) is one of the most widely used approaches for ranking features in a filter-based approach. That is, IG provides a ranking of all attributes (features) by their relevance to the target (class). Then a threshold is applied to select several features according to the order obtained. Accordingly, a feature that strongly correlates with the target is considered relevant, and irrelevant (or redundant) otherwise. However, a weakness of the IG criterion is a bias favoring features with more values, even when they are not more informative. The IG between a feature $X$ and the target variable $Y$ is given in Equation (1):

$$IG(X \mid Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) \tag{1}$$

where $H(X)$ is the entropy of $X$. The entropy of $Y$ is defined by Equation (2):

$$H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y) \tag{2}$$

where $p(y)$ is the marginal probability of $y$ over all values of $Y$. Note that $Y$ is a finite set. Moreover, the conditional entropy of $Y$ given the random variable $X$ is shown in Equation (3):

$$H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \tag{3}$$

where $p(y \mid x)$ is the conditional probability of $y$ given $x$. IG is a symmetrical measure, i.e., $IG(X \mid Y) = IG(Y \mid X)$, as shown in Equation (1). The information gained about $Y$ after observing $X$ is equal to the information gained about $X$ after observing $Y$.

Gain ratio (GR) is the non-symmetrical measure introduced to compensate for the bias of the IG attribute evaluation. The GR formula is given in Equation (4):

$$GR = \frac{IG(X \mid Y)}{H(X)} \tag{4}$$
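A small worked example can make the IG bias and the GR correction concrete. The sketch below uses the standard definitions of entropy, conditional entropy, IG as H(Y) - H(Y|X), and GR as IG divided by H(X). The toy features (`proto`, `flow_id`) are hypothetical and chosen only to show that a many-valued, identifier-like feature obtains the same IG as a genuinely informative one but a lower GR.

```python
import math
from collections import Counter

def entropy(values):
    """H(V) = -sum_v p(v) * log2 p(v) over the observed values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(ys, xs):
    """H(Y|X) = sum_x p(x) * H(Y | X = x)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def information_gain(xs, ys):
    """IG = H(Y) - H(Y|X); symmetric in its two arguments."""
    return entropy(ys) - conditional_entropy(ys, xs)

def gain_ratio(xs, ys):
    """GR = IG / H(X); penalizes features that take many distinct values."""
    hx = entropy(xs)
    return information_gain(xs, ys) / hx if hx > 0 else 0.0

label   = [0, 0, 1, 1]
proto   = ["tcp", "tcp", "udp", "udp"]  # two values, fully informative
flow_id = [1, 2, 3, 4]                  # unique per row, identifier-like

print(information_gain(proto, label), information_gain(flow_id, label))  # 1.0 1.0
print(gain_ratio(proto, label), gain_ratio(flow_id, label))              # 1.0 0.5
```

Both features remove all uncertainty about the label, so IG cannot distinguish them. Dividing by the feature's own entropy (1 bit for `proto`, 2 bits for `flow_id`) is what lets GR rank the identifier-like feature lower.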
3. IoT Intrusion Detection Using Feature Selection Method
Significant and fruitful efforts have been made in recent years to address the security concerns of the IoT ecosystem. Several new IoT security technologies were established by pairing artificial intelligence techniques and cybersecurity virtues. Several promising state-of-the-art studies have been conducted for IoT security using machine learning (ML) and deep learning (DL) techniques. However, only a few investigated the impact of using different feature selection approaches to improve prediction and classification accuracy. For instance, Albulayhi et al. proposed and implemented a new minimized redundancy discriminative feature selection (MRD-FS) technique to resolve the issue of redundant features. The discriminating features were selected based on two criteria, i.e., representativeness and redundancy. Their model was evaluated using the BoT-IoT dataset. Ambusaidi et al. presented a flexible mutual information-based feature selection technique (FMIFS) that chooses the best features to enhance the classification algorithm. The proposed model was evaluated using three datasets (NSL-KDD, KDD Cup 99, and Kyoto 2006). The Least Square Support Vector Machine-based IDS (LSSVM-IDS) was used to measure performance. Ambusaidi et al. reported 99.79% accuracy, a 99.46% detection rate (DR), and a 0.13% FPR on the KDD99 dataset. However, the datasets they employed are not up to date (dating back to 2009, 1999, and 2006 for the NSL-KDD, KDD-Cup99, and Kyoto datasets, respectively) and do not fully represent IoT cyberattacks.
Similarly, Amiri et al. proposed a modified mutual information-based feature selection technique (MMIFS) applied with the SVM to improve classification accuracy and to detect the various attack types efficiently. They demonstrated how the problems of high data dimensionality could be mitigated using the feature selection technique. Note that high dimensionality, even when paired with a high-quality ML approach, produces a poor detection rate and accuracy. MMIFS can reduce the feature set to only eight features (out of 41). For instance, MMIFS with SVM, using only eight features, achieved a DR of 86.46%. In the first phase, data normalization and reduction are applied by dividing every attribute (feature) value by its maximum value. In the next phase, feature selection is applied based on the imported training data. MMIFS initially takes the selected feature set to be the empty set. It then calculates the mutual information of each feature with respect to the class target and picks the first feature with the maximum mutual information value.
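The steps described so far (max-normalization, an initially empty selected set, and picking the feature with the highest mutual information with the class) can be sketched as follows. This is only an illustration of those steps, not the full MMIFS algorithm: it assumes discrete, non-negative feature values, and the function names are hypothetical.

```python
import math
from collections import Counter

def max_normalize(column):
    """Divide every value by the column maximum (assumes non-negative data)."""
    m = max(column)
    return [v / m if m else 0.0 for v in column]

def mutual_information(xs, ys):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) ), discrete data."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def first_feature(columns, labels):
    """Return the selected set after the first pick, plus all MI scores."""
    selected = []  # the selected feature set starts empty
    normalized = [max_normalize(col) for col in columns]
    scores = [mutual_information(col, labels) for col in normalized]
    best = max(range(len(scores)), key=scores.__getitem__)
    selected.append(best)  # feature with the maximum MI is picked first
    return selected, scores

# Toy data: three feature columns (hypothetical) and a binary class label.
columns = [[2, 2, 4, 4], [1, 3, 1, 3], [5, 5, 5, 5]]
labels = [0, 0, 1, 1]
selected, scores = first_feature(columns, labels)
print(selected, scores)  # [0] [1.0, 0.0, 0.0]
```

Scaling a discrete column by its maximum does not change its mutual information with the class, so the normalization here matters for the downstream classifier rather than for the ranking itself.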