The efficiency and effectiveness of a machine learning (ML) model are greatly influenced by feature selection (FS), a crucial preprocessing step that seeks the set of features yielding the highest possible accuracy. Because metaheuristic (or evolutionary) algorithms often outperform traditional optimization techniques, researchers are concentrating on them and proposing cutting-edge hybrid techniques to handle FS problems. The use of hybrid metaheuristic approaches for FS has thus been the subject of numerous research works.
1. Introduction
Feature selection (FS) is a method that aims to choose the minimum set of features that can represent a dataset, by selecting those features that contribute the most to the estimation variable of interest to the user ^{[1]}. The volume of available data has risen significantly in recent years due to advances in data-gathering techniques across different fields, increasing the processing time and space complexity required to implement machine learning (ML) architectures. The data collected in many domains are typically of high dimensionality, which makes it difficult to select an optimal subset of features and exclude unnecessary ones. Inappropriate features in a dataset force the employed ML models to learn from insignificant information, which leads to a poor recognition rate and a large drop in performance. By removing unnecessary and redundant features, FS reduces the dimensionality and improves the quality of the resulting attribute vector ^{[2]}^{[3]}^{[4]}. FS has been used for various purposes, including cancer classification (e.g., to improve the diagnosis of breast cancer and diabetes ^{[5]}), speech recognition ^{[6]}, gene prediction ^{[7]}, gait analysis ^{[8]}, and text mining ^{[9]}.
FS has a pair of essential opposing goals, namely, reducing the number of retained features and maximizing classification performance to overcome the curse of dimensionality. The three principal kinds of FS strategy are filter, wrapper, and embedded methods, the last of which integrates aspects of both filters and wrappers ^{[10]}^{[11]}. A filter technique is independent of any ML algorithm. It is appropriate for datasets containing fewer features, and it often requires only low-performance computing capabilities. Filtering approaches do not consider the association between classifiers and attributes, and thus filters often fail to detect the samples correctly during the learning process.
Many studies have used wrappers to address these problems. A wrapper technique interacts with the training process and uses a classifier as its assessment mechanism. Wrapper techniques for FS therefore often affect the training algorithm and produce more precise results than filters. Wrappers train the employed ML algorithm using only a subset of the features, and the same subset is used to determine the performance of the trained model. Depending on the selection accuracy determined in each preceding phase, a wrapper algorithm considers either adding a feature to or removing a feature from the selected subset. As a result, wrapper methods are usually more computationally complex and more expensive than most filtering techniques.
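The wrapper loop described above can be sketched in a few lines. The example below is a minimal illustration rather than any specific surveyed method: it uses leave-one-out 1-NN accuracy as the assessment mechanism and greedily adds features while accuracy improves; the synthetic dataset and all values are invented for demonstration.

```python
import random

random.seed(0)

# Toy dataset: feature 0 separates the classes, features 1-2 are noise.
def make_data(n=40):
    data = []
    for i in range(n):
        label = i % 2
        informative = label + random.gauss(0, 0.2)
        noise1, noise2 = random.random(), random.random()
        data.append(([informative, noise1, noise2], label))
    return data

def loo_1nn_accuracy(data, subset):
    """Leave-one-out accuracy of a 1-NN classifier on the chosen features."""
    if not subset:
        return 0.0
    correct = 0
    for i, (x, y) in enumerate(data):
        best, pred = float("inf"), None
        for j, (x2, y2) in enumerate(data):
            if i == j:
                continue
            d = sum((x[f] - x2[f]) ** 2 for f in subset)
            if d < best:
                best, pred = d, y2
        correct += (pred == y)
    return correct / len(data)

def greedy_wrapper(data, n_features):
    """Forward wrapper: repeatedly add the feature that most improves accuracy."""
    selected, best_acc = [], 0.0
    improved = True
    while improved:
        improved = False
        for f in range(n_features):
            if f in selected:
                continue
            acc = loo_1nn_accuracy(data, selected + [f])
            if acc > best_acc:
                best_acc, best_f, improved = acc, f, True
        if improved:
            selected.append(best_f)
    return selected, best_acc

data = make_data()
subset, acc = greedy_wrapper(data, 3)
```

Because the classifier is consulted for every candidate subset, even this tiny example makes the computational cost of wrappers apparent.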
Conventional wrapper approaches ^{[12]} take a set of attributes and require the user to supply arguments as parameters, after which the most informative attributes are chosen from a set of features proportional to the arguments provided by the user. The limitation of such techniques is that the selected feature vector is evaluated recursively, in which case certain characteristics are not included at the first level of assessment. In addition, because the arguments are specified by the user, certain feature combinations cannot be taken into account even when they would yield more precision. These issues may cause searching overhead along with overfitting. Evolutionary wrapper approaches, which are preferable when the search space is very broad, have been created to address the drawbacks of classic wrapper methods. They have many benefits over conventional wrapper methods, including the fact that they need fewer domain details. Evolutionary optimization techniques are population-based metaheuristic strategies that tackle a problem with multiple candidate solutions, described by a group of individuals. In FS tasks, each individual encodes a candidate subset of the feature vector. An objective (target) function is employed to evaluate the quality of every candidate solution. The chosen individuals are then subjected to genetic operators in order to produce the new individuals that comprise the next generation ^{[13]}.
A plethora of variants of metaheuristic methods has already been developed to support FS tasks. When designing a metaheuristic approach, exploration and exploitation are two opposing aspects to take into account, and establishing a good balance between them is essential to the effectiveness of these algorithms, which otherwise perform well in some situations but poorly in others. Every nature-inspired approach has advantages and disadvantages of its own; hence, it is not always practical to predict which algorithm is best for a given situation ^{[14]}.
2. Hybrid Evolutionary Approaches for Feature Selection
It is unusual for all attributes in a dataset to be useful when designing an ML model for a real-life problem. The inclusion of unwanted and redundant attributes lessens the model’s classification capability and accuracy. As more factors are added to an ML framework, its complexity increases ^{[15]}^{[16]}. By finding and assembling the ideal set of features, FS in ML aims to produce useful models of the problem under study ^{[17]}. Some important advantages of FS are ^{[10]}^{[12]}:

reducing overfitting and eliminating redundant data,

improving accuracy and reducing misleading results, and

reducing the ML algorithm training time, dropping the algorithm complexity, and speeding up the training process.
The prime components of an FS process are presented in Figure 1, and they are as follows ^{[18]}.
Figure 1. Key factors of feature selection.
1. Searching Techniques: To obtain the best features with the highest accuracy, searching approaches need to be applied in an FS process. Exhaustive search, heuristic search, and evolutionary computation are a few popular searching methods. Exhaustive search is explored in a few works ^{[15]}^{[16]}. Numerous heuristic and greedy strategies, such as sequential forward selection (SFS) ^{[19]} and sequential backward selection (SBS) ^{[20]}, have therefore been used for FS. However, both SFS and SBS suffer from the “nesting effect” problem: features selected by SFS cannot be discarded later, while features discarded by SBS cannot be selected again, so earlier choices cannot be revised in later parts of the FS process. The two approaches can be combined by applying SFS l times and then applying SBS r times ^{[21]}. Such a method can reduce the nesting effect, but the correct values of l and r must be determined carefully. Sequential backward and forward floating methods were presented to avoid this problem ^{[19]}. A two-layer cutting plane approach was recently suggested in ^{[20]} to evaluate the best subsets of characteristics. In ^{[21]}, an exhaustive FS search with backtracking and a heuristic search was proposed.
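The “plus-l take-away-r” combination of SFS and SBS can be sketched with a toy subset score (in practice the score would come from a classifier or a filter measure; the weights and the redundancy penalty below are invented for illustration):

```python
def plus_l_take_away_r(score, n_features, k, l=2, r=1):
    """Alternate l SFS (add) steps with r SBS (remove) steps until
    k features are selected; l > r is required for termination."""
    selected = set()
    while len(selected) < k:
        for _ in range(l):            # SFS: add the feature that helps most
            if len(selected) == n_features:
                break
            best = max((f for f in range(n_features) if f not in selected),
                       key=lambda f: score(selected | {f}))
            selected.add(best)
        for _ in range(r):            # SBS: drop the feature whose removal
            if len(selected) <= 1:    # leaves the highest-scoring remainder
                break
            drop = max(selected, key=lambda f: score(selected - {f}))
            selected.remove(drop)
    return selected

# Toy score: additive feature weights minus a penalty when two
# redundant features are chosen together (values are illustrative).
weights = [3.0, 2.5, 2.4, 0.5]

def score(subset):
    s = sum(weights[f] for f in subset)
    if 1 in subset and 2 in subset:   # features 1 and 2 overlap
        s -= 2.0
    return s

chosen = plus_l_take_away_r(score, n_features=4, k=2)
```

Unlike plain SFS, the removal steps give the search a chance to undo an earlier addition, which is exactly how the scheme mitigates the nesting effect.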
Various EC approaches have been proposed in recent years to tackle the challenges of FS problems successfully. Some of them are differential evolution (DE) ^{[22]}, genetic algorithms (GAs) ^{[23]}, grey wolf optimization (GWO) ^{[24]}^{[25]}, ant colony optimization (ACO) ^{[26]}^{[27]}^{[28]}, binary Harris hawks optimization (BHHO) ^{[29]}^{[30]} and improved BHHO (IBHHO) ^{[31]}, binary ant lion optimization (BALO) ^{[32]}^{[33]}, the salp swarm algorithm (SSA) ^{[34]}, the dragonfly algorithm (DA) ^{[35]}, the multiverse algorithm (MVA) ^{[36]}, Jaya optimization algorithms such as FS based on the Jaya optimization algorithm (FSJaya) ^{[37]} and FS based on the adaptive Jaya algorithm (AJA) ^{[38]}, the grasshopper optimization algorithm (GOA) and its binary versions ^{[39]}, binary teaching–learning-based optimization (BTLBO) ^{[40]}, harmony search (HS) ^{[41]}, and the vortex search algorithm (VSA) ^{[42]}. All these techniques have been applied to perform FS on various types of datasets, and they have been demonstrated to achieve high optimization rates and to increase the classification accuracy (CA). EC techniques require no domain knowledge and do not presume whether the training dataset is linearly separable or not. Another valuable aspect of EC methods is that their population-based process can deliver several solutions in one cycle. However, EC approaches often entail considerable computational costs because they typically include a wide range of assessments. The stability of an EC approach is also a critical concern, as the respective algorithms often pick different features in different runs. Further research is required, as the growing number of characteristics in large-scale datasets also raises computational costs and decreases the consistency of EC algorithms ^{[13]} in certain real-world FS activities. A high-level description of the most used EC algorithms is given below.

Genetic Algorithm (GA): A GA ^{[43]} is a metaheuristic influenced by natural selection that belongs to the larger class of evolutionary algorithms in computer science and operations research. A GA relies on biologically inspired operators, such as mutation, crossover, and selection, to develop high-quality solutions to optimization and search challenges. It mimics the mechanism that governs biological evolution, can tackle both constrained and unconstrained optimization problems, and repeatedly adjusts a population of candidate solutions.
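As an illustration of how a GA can drive feature selection, the sketch below evolves bit masks over a toy additive fitness function; the weights, the size penalty, and all GA parameters are invented for demonstration and stand in for a real classifier-based fitness:

```python
import random

random.seed(1)

# Toy fitness: reward informative features, penalize subset size
# (weights are illustrative, not from the surveyed works).
WEIGHTS = [4.0, 3.0, 0.1, 0.1, 0.1, 0.1]
N = len(WEIGHTS)

def fitness(mask):
    return sum(w for w, bit in zip(WEIGHTS, mask) if bit) - 0.5 * sum(mask)

def crossover(a, b):
    """One-point crossover of two parent bit masks."""
    cut = random.randrange(1, N)
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.1):
    """Flip each bit independently with the given probability."""
    return [bit ^ (random.random() < rate) for bit in mask]

def genetic_algorithm(pop_size=20, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection (elitist)
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = genetic_algorithm()
```

Each bit of an individual marks a feature as selected or not, so selection, crossover, and mutation directly explore the space of feature subsets.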

Particle Swarm Optimization (PSO): PSO is a bio-inspired algorithm that is straightforward to apply when searching for the best candidate in the solution space. It differs from other optimization techniques in that it requires only the objective function and is unaffected by the gradient or any differential form of the objective. It also has a small number of hyperparameters. Kennedy and Eberhart proposed PSO in 1995 ^{[44]}. Sociobiologists think that a school of fish or a flock of birds moving in a group “may profit from the experience of all other members”, as stated in the original publication. In other words, while a bird is flying around looking for food at random, all of the birds in the flock can share what they find, helping the entire flock achieve the best hunt possible.
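For FS, a binary PSO variant is common: a sigmoid transfer function maps each velocity component to the probability of the corresponding bit being set. The sketch below is illustrative only; the toy fitness and all constants are assumptions rather than values from the surveyed works:

```python
import math
import random

random.seed(2)

WEIGHTS = [4.0, 3.0, 0.1, 0.1]        # toy relevance scores (illustrative)
N = len(WEIGHTS)

def fitness(bits):
    return sum(w for w, b in zip(WEIGHTS, bits) if b) - 0.5 * sum(bits)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def binary_pso(n_particles=15, iters=60, w=0.7, c1=1.5, c2=1.5):
    pos = [[random.randint(0, 1) for _ in range(N)] for _ in range(n_particles)]
    vel = [[0.0] * N for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # each particle's best
    gbest = max(pbest, key=fitness)[:]           # swarm's best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(N):
                vel[i][d] = (w * vel[i][d]
                             + c1 * random.random() * (pbest[i][d] - pos[i][d])
                             + c2 * random.random() * (gbest[d] - pos[i][d]))
                # sigmoid transfer turns the velocity into a bit probability
                pos[i][d] = 1 if random.random() < sigmoid(vel[i][d]) else 0
            if fitness(pos[i]) > fitness(pbest[i]):
                pbest[i] = pos[i][:]
                if fitness(pbest[i]) > fitness(gbest):
                    gbest = pbest[i][:]
    return gbest

best = binary_pso()
```

The velocity update is the standard inertia plus cognitive (pbest) and social (gbest) terms; only the position update changes in the binary variant.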

Grey Wolf Optimizer (GWO): Mirjalili et al. ^{[45]} presented GWO as a new metaheuristic in 2014. The grey wolf’s social order and hunting mechanisms inspired the algorithm. Four wolves, or degrees of the social hierarchy, are considered when creating GWO:
–the α wolf: the solution having the best fitness value;
–the β wolf: the solution having the second-best fitness value;
–the δ wolf: the solution having the third-best fitness value; and
–the ω wolves: all other solutions.
As a result, the algorithm’s hunting mechanism is guided by the first three appropriate wolves, α, β, and δ. The remaining wolves are regarded as ω and follow them. Grey wolves follow a set of welldefined steps during hunting: encircling, hunting, and attacking.
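A simplified sketch of the GWO update follows. For brevity it keeps the three leaders fixed within an iteration and moves only the ω wolves, and it minimizes a toy sphere function rather than an FS fitness; all parameters are illustrative:

```python
import random

random.seed(3)

def sphere(x):
    """Toy objective (illustrative): minimize the sum of squares."""
    return sum(v * v for v in x)

def gwo(dim=3, n_wolves=12, iters=100):
    wolves = [[random.uniform(-5, 5) for _ in range(dim)]
              for _ in range(n_wolves)]
    for t in range(iters):
        wolves.sort(key=sphere)
        alpha, beta, delta = wolves[0], wolves[1], wolves[2]
        a = 2 * (1 - t / iters)          # coefficient decays from 2 to 0
        for i in range(3, n_wolves):     # leaders guide the omega wolves
            new = []
            for d in range(dim):
                guided = []
                for leader in (alpha, beta, delta):
                    A = a * (2 * random.random() - 1)
                    C = 2 * random.random()
                    D = abs(C * leader[d] - wolves[i][d])   # encircling term
                    guided.append(leader[d] - A * D)
                new.append(sum(guided) / 3)  # average of the three guides
            wolves[i] = new
    return min(wolves, key=sphere)

best = gwo()
```

The decaying coefficient `a` realizes the encircling-to-attacking transition: large `|A|` early on encourages exploration, while `|A| < 1` later pulls the pack toward the three leaders.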

Harris Hawk Optimization (HHO): Heidari and his team introduced HHO as a new metaheuristic algorithm in 2019 ^{[46]}. HHO models the way Harris hawks investigate prey, the surprise pounce, and the diverse assault techniques the hawks use in nature. In HHO, hawks represent candidate solutions, whereas the prey represents the best solution. The Harris hawks use their keen vision to track the target and then conduct a surprise pounce to seize the prey they have spotted. In general, HHO is divided into two phases: exploration and exploitation. The algorithm switches from exploration to exploitation, and different exploitation behaviours are then chosen, depending on the fleeing prey’s energy.
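The energy-controlled switch between the two phases can be sketched as below. This is a deliberately reduced, one-dimensional illustration: real HHO distinguishes several besiege and dive strategies, whereas here exploitation is collapsed into a single move toward the prey, and all values are invented:

```python
import random

random.seed(4)

def objective(x):
    return (x - 3.0) ** 2            # toy 1-D prey: minimum at x = 3

def hho_sketch(n_hawks=10, iters=200):
    hawks = [random.uniform(-10, 10) for _ in range(n_hawks)]
    rabbit = min(hawks, key=objective)        # best solution = the prey
    for t in range(iters):
        E0 = random.uniform(-1, 1)
        E = 2 * E0 * (1 - t / iters)          # escaping energy decays over time
        for i in range(n_hawks):
            if abs(E) >= 1:                   # exploration: random perch
                hawks[i] = random.uniform(-10, 10)
            else:                             # exploitation: close in on prey
                hawks[i] = rabbit + E * (rabbit - hawks[i]) * random.random()
            if objective(hawks[i]) < objective(rabbit):
                rabbit = hawks[i]
        # (real HHO further distinguishes soft/hard besiege and rapid dives)
    return rabbit

best = hho_sketch()
```

The key mechanism survives the simplification: while the prey's energy `|E|` is high the hawks scatter, and as it drops they converge on the best solution found so far.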
2. Criteria for Evaluation: The common evaluation criterion for wrapper FS techniques is the classification performance achieved using the selected attributes. Decision trees (DTs), support vector machines (SVMs), naive Bayes (NB), k-nearest neighbor (KNN), artificial neural networks (ANNs), and linear discriminant analysis (LDA) are just a few examples of common classifiers that have been used as wrappers in FS applications ^{[47]}^{[48]}^{[49]}. In the domain of filter approaches, measurements from a variety of disciplines have been incorporated, particularly information theory, correlation estimates, distance metrics, and consistency criteria ^{[50]}. Individual feature evaluation, relying on a particular criterion, is a basic filter approach in which only the top-tier features are selected ^{[47]}. Relief ^{[51]} is a distinctive case in which a distance metric is applied to assess the significance of features. Filter methods are often computationally inexpensive, but they do not consider attribute relationships, which often leads to complications with redundant feature sets, such as microarray gene data, where the genes are intrinsically correlated ^{[17]}^{[50]}. To overcome these issues, it is necessary to use filter measurements that choose a suitable subset of relevant features by evaluating the feature set as a whole. Wang et al. ^{[52]} recently published a distance measure that assesses the difference between the chosen feature space and the space spanned by all features in order to locate a subset of features that approximates the full set. Peng et al. ^{[53]} introduced the minimum redundancy maximum relevance (MRMR) approach based on mutual information, and the recommended measures were incorporated into EC methods because of their powerful exploration capability ^{[54]}^{[55]}.
A unified selection approach was proposed by Mao and Tsang ^{[20]}, which optimizes multivariate performance measures but also results in an enormous search space for high-dimensional data, a problem that requires strong heuristic search methods for finding the best output. There are several relatively straightforward statistical methods, such as t-tests, logistic regression (LR), hierarchical clustering, and classification and regression trees (CART), which can be applied jointly to produce better classification results ^{[56]}. Recently, the authors of ^{[57]} applied sparse LR to FS problems involving millions of features. Min et al. ^{[21]} developed a rough set procedure to solve FS tasks under budgetary and schedule constraints. Many experiments show that most filter mechanisms are inefficient for cases with vast numbers of features ^{[58]}.
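The MRMR idea mentioned above can be sketched as a greedy ranking that trades relevance to the label against redundancy with the already-selected features, both measured via mutual information. This is a simplified rendering of the scheme, and the tiny discrete dataset is invented for illustration:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) for two discrete sequences, in nats."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_rank(features, labels, k):
    """Greedy MRMR: maximize relevance to the label minus
    mean redundancy with the already-selected features."""
    selected = []
    while len(selected) < k:
        best, best_score = None, -float("inf")
        for f in range(len(features)):
            if f in selected:
                continue
            relevance = mutual_information(features[f], labels)
            redundancy = (sum(mutual_information(features[f], features[s])
                              for s in selected) / len(selected)
                          if selected else 0.0)
            if relevance - redundancy > best_score:
                best, best_score = f, relevance - redundancy
        selected.append(best)
    return selected

# Toy data (illustrative): f1 duplicates f0, f2 is weakly informative.
labels   = [0, 0, 1, 1, 0, 1, 0, 1]
features = [
    [0, 0, 1, 1, 0, 1, 1, 1],   # f0: strongly informative
    [0, 0, 1, 1, 0, 1, 1, 1],   # f1: redundant copy of f0
    [0, 1, 0, 1, 0, 1, 0, 1],   # f2: weakly related to the label
]
order = mrmr_rank(features, labels, k=3)
```

Note how the redundancy term demotes the duplicate feature: although f1 is as relevant as f0, it is ranked after the weaker but complementary f2.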
3. Number of Objectives: Single-objective (SO) optimization frameworks are techniques that combine the classifier’s accuracy and the number of features into a single optimization function. In contrast, multi-objective (MO) optimization approaches are designed to find and balance the trade-offs among alternatives. In an SO setting, a solution’s superiority over other solutions is determined by comparing the resulting fitness values, while in MO optimization, the dominance notion is employed to identify the best results ^{[59]}. In particular, to determine the significance of the derived feature sets in an MO setting, multiple criteria need to be optimized by considering different parameters. MO strategies may thus be used to solve challenging problems involving multiple conflicting goals ^{[60]}, and MO optimization comprises fitness functions that minimize or maximize multiple conflicting objectives.
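The dominance notion used in MO optimization can be sketched directly: with classification error and subset size both to be minimized, one solution dominates another if it is no worse in both objectives and strictly better in at least one. The candidate values below are illustrative, not from the surveyed works:

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (both objectives are minimized here)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(solutions):
    """Keep the solutions not dominated by any other."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Candidate subsets as (classification error, number of features) pairs.
candidates = [(0.08, 12), (0.10, 5), (0.08, 20), (0.15, 3), (0.25, 2)]
front = pareto_front(candidates)
```

Here (0.08, 20) drops out because (0.08, 12) matches its error with fewer features, while the remaining solutions form the Pareto front of accuracy-versus-size trade-offs from which a decision maker picks.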