Synthetic Datasets: Comparison
Please note this is a comparison between Version 2 by Wendy Huang and Version 1 by Rohan Mitra.

With the steady growth in the importance of machine learning and big data analysis, feature selection has become one of the most relevant techniques in the field. It is used across many disciplines, including medical applications, cybersecurity, and DNA microarray data analysis. Machine learning models can benefit significantly from an accurate selection of feature subsets, which increases the speed of learning and improves generalization. Feature selection can considerably simplify a dataset, so that models trained on it are faster and less prone to overfitting. Synthetic datasets have been presented as a valuable benchmarking technique for the evaluation of feature selection algorithms.

  • synthetic datasets
  • feature selection algorithms
  • variable
  • unsupervised
  • filter
  • wrapper
  • embedded

1. Introduction

A Feature Selection Algorithm (FSA) can be described as a computational solution that produces a subset of features such that the reduced subset yields prediction accuracy comparable to that of the full set of features. In its general form, an FSA moves algorithmically through the set of features until a “best” subset is found [1].
The existence of irrelevant and/or redundant features motivates the need for a feature selection process. An irrelevant feature is defined as a feature that does not contribute to the prediction of the target variable. On the other hand, a redundant feature is defined as a feature that is correlated with another relevant feature, meaning that it can contribute to the prediction of a target variable whilst not improving the discriminatory ability of the general set of features. FSAs are generally designed for the purpose of removing irrelevant and redundant features from the selected feature subset.
In real-life datasets, knowledge of the full extent of the relevance of the features in predicting the target variable is absent; hence, obtaining an optimal subset of features is nearly impossible. The most common way to evaluate FSAs in such scenarios is to use the feature subsets in a learning algorithm and measure the resulting prediction accuracy [2]. However, this can be disadvantageous, since the outcome is sensitive to the learning algorithm itself as well as to the feature subset(s) [2].
Consequently, the production of controlled data environments for evaluating FSAs has become necessary for the development of novel and robust FSAs. One way of standardizing this is through the use of synthetic datasets. The performance of FSAs depends on the extent of relevance and irrelevance within the dataset, so an artificially controlled environment in which the relevance is known can be a significant advantage in evaluating their performance. Results obtained this way can be more conclusive for researchers, given that the optimal solutions are known and the evaluation does not rely on external measures. Moreover, researchers can easily indicate which algorithms are more accurate based on the number of relevant features selected [3]. In addition, synthetic datasets provide a standardized platform on which FSAs with different underlying architectures can be compared in a model-agnostic manner. The existing literature in the field lacks a systematic evaluation of FSAs based on common benchmark datasets with controlled experimental conditions.
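As a rough illustration of scoring an FSA against known ground truth, the sketch below counts how many of the known relevant features a selected subset recovers. The function name `relevance_score`, the feature indices, and the two example subsets are hypothetical and serve only to show the idea; published benchmarks may use different measures.

```python
def relevance_score(selected, relevant):
    """Return (fraction of known relevant features recovered,
    number of selected features that are not relevant)."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    false_picks = len(selected - relevant)
    return hits / len(relevant), false_picks


# Hypothetical ground truth: features 0..3 are relevant by construction.
relevant = {0, 1, 2, 3}

# Hypothetical outputs of two feature selection algorithms.
fsa_a = [0, 1, 2, 5]     # misses feature 3, picks one non-relevant feature
fsa_b = [0, 1, 2, 3, 4]  # recovers all relevant features, plus one extra

print(relevance_score(fsa_a, relevant))  # (0.75, 1)
print(relevance_score(fsa_b, relevant))  # (1.0, 1)
```

Because the score depends only on the selected indices and the known relevant set, it involves no learning algorithm, which is what makes such a comparison model-agnostic.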

2. The Importance of Synthetic Data

Synthetic datasets were popularized by Friedman et al. in the early 1990s, where continuous-valued features were developed for the purposes of regression modeling in high-dimensional data [4,5]. Friedman’s 1991 paper continues to be widely cited in the feature selection literature because it addresses the complex feature selection problem through an application of synthetically generated data. In 2020, synthetically generated adaptive regression splines were used to develop a solution for feature selection in Engineering Process Control (EPC) [6]. Although that work was set specifically in the context of recurrent neural networks, it shows the importance of synthetic data in the development of reliable feature selection techniques. In [7], Yamada et al. highlighted the relevance of using synthetic data for the development of novel feature selection techniques. The researchers discussed the challenge of feature selection when considering nonlinear functions and proposed a solution using stochastic gates. This approach outperforms prior regression analysis methods (such as the LASSO method for variable selection) and is also more generalizable towards nonlinear models. Examples of applications of these nonlinear models were discussed, including neural networks, in which the proposed approach was able to record higher levels of sparsity. The stochastic gate algorithm was subsequently tested on both real-life and synthetically generated data to further validate its performance. The general use of synthetic datasets appears to be for the purposes of validating feature selection algorithms, as is similarly done in [8] for a feature selection framework for datasets with missing data. Ref. [9] explained that the lack of available real data is a challenge faced when considering unsupervised learning in waveform data and suggested the use of synthetically generated datasets to produce real data applications.
Other applications of synthetic data for unsupervised feature selection have proven effective in the literature, as in the case of [10]. The researchers presented two novel unsupervised FSAs, experimentally tested using synthetic data, and recommended the study of the impact of noisy features within the data as an area of further work. Unsupervised feature selection has been growing in relevance, as it removes the need for class labels in producing feature subsets. Synthetic data have also been used for evaluating dynamic feature selection algorithms [11], which dynamically adjust the feature subsets based on the learning algorithm used [12]. Synthetic datasets have also been used for comparatively studying causality-based feature selection algorithms [13]. Most recently, synthetic datasets were presented as a valuable benchmarking technique for the evaluation of feature selection algorithms [14]. That paper presented six discrete synthetically generated datasets that drew inspiration from digital logic circuits. In particular, the generated datasets include an OR-AND circuit, an AND-OR circuit, an Adder, a 16-segment LED display, a comparator, and finally, a parallel resistor circuit (PRC). These datasets were then used for the purposes of testing some popular feature selection algorithms. Similar work with discrete-valued synthetic datasets was presented in [15], where the researchers produced a Boolean dataset based on the XOR function. The CorrAL dataset was proposed in that paper, containing six Boolean features $x_1, x_2, \ldots, x_6$, with the target variable being determined by the Boolean function $(x_1 \land x_2) \lor (x_3 \land x_4)$. Features $x_1, \ldots, x_4$ were the relevant features, $x_5$ was irrelevant, and finally, $x_6$ was redundant (correlated with the target variable).
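The CorrAL construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the original generator: the function name `make_corral`, the enumeration of all 32 combinations of $x_1, \ldots, x_5$, and the `agreement` rate of 0.75 between $x_6$ and the target are assumptions that the text above does not specify.

```python
import itertools
import random


def make_corral(agreement=0.75, seed=0):
    """Build a CorrAL-style Boolean dataset as a list of (features, target).

    x1..x4 are relevant, x5 is irrelevant, and x6 is a noisy copy of the
    target. The `agreement` rate is an assumed parameter.
    """
    rng = random.Random(seed)
    rows = []
    # Enumerate every combination of the five free Boolean features.
    for x1, x2, x3, x4, x5 in itertools.product([0, 1], repeat=5):
        y = int((x1 and x2) or (x3 and x4))
        # x6 copies the target with probability `agreement`, else flips it.
        x6 = y if rng.random() < agreement else 1 - y
        rows.append(((x1, x2, x3, x4, x5, x6), y))
    return rows


data = make_corral()
print(len(data))                # 32 samples
print(sum(y for _, y in data))  # 14 positive targets
```

Since the relevant, irrelevant, and correlated features are fixed by construction, the output of any FSA run on such data can be checked directly against the known roles of $x_1, \ldots, x_6$.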
CorrAL was later extended to 100 features, allowing researchers to consider higher-dimensional data than the original synthetically generated dataset [16]. In [17], the researchers developed synthetic data that mimic microarray data. This was based on an earlier study conducted on hybrid evolutionary approaches to feature selection, namely, memetic algorithms that combine wrapper and filter feature evaluation metrics [18]. Initially, the researchers presented a feature ranking method based on a memetic framework that improved the efficiency and accuracy of non-memetic algorithm frameworks [19]. Another well-known synthetic dataset is the LED dataset, developed in 1984 by Breiman et al. [20]. This is a classification problem with 10 possible classes, described by seven binary attributes (0 indicating that an LED strip is off and 1 indicating that the LED strip is on). Two versions of this dataset were presented in the literature, one with 17 irrelevant features and another with 92 irrelevant features, both containing 50 samples. Different levels of noise were also incorporated into the dataset (2, 6, 10, 15, and 20%), allowing the evaluation of FSAs’ tolerance to the extent of noise in the dataset. These synthetic datasets were then used to test different feature selection algorithms, as indicated in [21]. A similar discrete synthetic dataset is the Madelon dataset [22], where relevant features are on the vertices of a five-dimensional hypercube. The researchers included 5 redundant features and 480 irrelevant features randomly generated from a Gaussian distribution. In [23], the researchers tested ensemble feature selection for microarray data by creating five synthetic datasets. It was demonstrated empirically that the feature selection algorithms tested were able to find the (labeled) relevant features, which helped in the evaluation of the stability of these proposed feature selection methods.
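The LED dataset described above lends itself to a short generator. The sketch below is an illustration under stated assumptions, not the original procedure: the segment order (a to g of a standard seven-segment display), the uniform sampling of digits, and the noise model (each relevant bit flipped independently with probability `noise`) are all assumptions, as is the function name `make_led`.

```python
import random

# Assumed segment order (a, b, c, d, e, f, g) of a standard
# seven-segment display; 1 means the segment is on.
SEGMENTS = {
    0: (1, 1, 1, 1, 1, 1, 0),
    1: (0, 1, 1, 0, 0, 0, 0),
    2: (1, 1, 0, 1, 1, 0, 1),
    3: (1, 1, 1, 1, 0, 0, 1),
    4: (0, 1, 1, 0, 0, 1, 1),
    5: (1, 0, 1, 1, 0, 1, 1),
    6: (1, 0, 1, 1, 1, 1, 1),
    7: (1, 1, 1, 0, 0, 0, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}


def make_led(n_samples=50, n_irrelevant=17, noise=0.0, seed=0):
    """Return a list of (features, digit) pairs.

    The first seven features are the (possibly noisy) segments; the
    remaining `n_irrelevant` features are random bits.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        digit = rng.randrange(10)
        # Flip each relevant bit with probability `noise` (assumed model).
        relevant = [bit ^ 1 if rng.random() < noise else bit
                    for bit in SEGMENTS[digit]]
        irrelevant = [rng.randint(0, 1) for _ in range(n_irrelevant)]
        data.append((relevant + irrelevant, digit))
    return data


small = make_led(n_irrelevant=17, noise=0.02)  # 7 + 17 = 24 features
large = make_led(n_irrelevant=92, noise=0.20)  # 7 + 92 = 99 features
print(len(small[0][0]), len(large[0][0]))
```

Varying `noise` over the levels listed above while keeping the relevant indices fixed at 0 to 6 gives a simple way to observe how an FSA's selections degrade as the data become noisier.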
Synthetic datasets with continuous variables have also been presented in the literature. In [24], the researchers presented a framework for global redundancy minimization and subsequently tested this framework on synthetically generated data. The dataset contained a total of 400 samples across 100 features, with each sample being broken up into 10 groups of highly correlated values. These points were randomly assigned using the Gaussian distribution. This dataset, along with other existing datasets, was used as the testing framework for the algorithms proposed. Synthetic data have also been used in applications such as medical imaging, where Generative Adversarial Networks (GANs) are employed to produce image-based synthetic data [25].
However, the limitations of synthetic data must also be noted, as they often pose restrictions when it comes to the various challenges encountered in the feature selection process [14,26,27]. More specifically, synthetic data often lack “realism”, meaning that the data generated are not as chaotic as what could be expected in the real world. Many real-world applications come with a tolerance for outliers and randomness, which cannot be accurately modeled with synthetic data [27]. Furthermore, synthetic datasets are often generated due to the lack of available real-world data, which poses an obstacle in itself. In many cases, the limited nature of the available data restricts researchers from being able to model (and thus synthesize) the data accurately. This can potentially lead to synthetically generated data that are less nuanced than their real-world counterparts. However, this is more often the case for high-dimensional, information-dense applications, such as financial data [28,29].
Feature selection methods are categorized into three distinct types: filter methods, wrapper methods, and embedded methods.
Filter methods are considered a preprocessing step to determine the best subset of features without employing any learning algorithms [30]. Although filter methods are computationally less expensive than wrapper methods, they come with a slight deficiency in that they do not employ a predetermined algorithm for the training of the data [31]. In wrapper methods, a subset is first generated, and a learning algorithm is applied to the selected subset so that the metrics pertaining to the performance of this specific subset are recorded. The subsets are algorithmically exhausted until an optimal solution is found. Embedded methods, on the other hand, combine the qualities of both filter and wrapper methods [32]. Embedded feature selection techniques have risen in popularity due to their improved accuracy and performance. They combine filters and classifiers and have the advantages of different feature selection methods to produce the optimal selection on a given dataset.
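The difference between filter and wrapper methods can be made concrete with a small Python sketch on a Boolean dataset whose relevant features are known. Everything here is illustrative and assumed rather than taken from the cited works: the filter score (difference in positive rate between a feature being 1 and 0), the toy lookup-table learner, the exhaustive search over subsets of a fixed size, and the function names.

```python
import itertools
from collections import Counter, defaultdict


def filter_rank(X, y):
    """Filter: score each feature on its own; no learning algorithm is used.

    Assumes every feature takes both values 0 and 1 in the data.
    """
    scores = []
    for j in range(len(X[0])):
        on = [t for row, t in zip(X, y) if row[j] == 1]
        off = [t for row, t in zip(X, y) if row[j] == 0]
        # Assumed score: gap in positive rate between x_j = 1 and x_j = 0.
        scores.append(abs(sum(on) / len(on) - sum(off) / len(off)))
    ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
    return ranking, scores


def table_accuracy(X, y, subset):
    """Toy learner: majority-vote lookup table on the chosen columns,
    scored by its accuracy on the training data."""
    votes = defaultdict(Counter)
    for row, t in zip(X, y):
        votes[tuple(row[j] for j in subset)][t] += 1
    correct = sum(c.most_common(1)[0][1] for c in votes.values())
    return correct / len(y)


def wrapper_search(X, y, k):
    """Wrapper: evaluate every subset of size k with the learner and
    keep the best-scoring one."""
    candidates = itertools.combinations(range(len(X[0])), k)
    return max(candidates, key=lambda s: table_accuracy(X, y, s))


# Five Boolean features; the target depends only on the first four.
X = [list(bits) for bits in itertools.product([0, 1], repeat=5)]
y = [int((r[0] and r[1]) or (r[2] and r[3])) for r in X]

ranking, scores = filter_rank(X, y)
print(ranking[:4])              # [0, 1, 2, 3]
print(wrapper_search(X, y, 4))  # (0, 1, 2, 3)
```

The filter makes one pass per feature, whereas the wrapper trains the learner once per candidate subset, which is why its cost grows quickly with the number of features.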

References

  1. Sulieman, H.; Alzaatreh, A. A Supervised Feature Selection Approach Based on Global Sensitivity. Arch. Data Sci. Ser. (Online First) 2018, 5, 3.
  2. Pudjihartono, N.; Fadason, T.; Kempa-Liehr, A.W.; O’Sullivan, J.M. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front. Bioinform. 2022, 2, 927312.
  3. Mitra, R.; Varam, D.; Ali, E.; Sulieman, H.; Kamalov, F. Development of Synthetic Data Benchmarks for Evaluating Feature Selection Algorithms. In Proceedings of the 2022 2nd International Seminar on Machine Learning, Optimization, and Data Science (ISMODE), Virtual, 22–23 December 2022; pp. 47–52.
  4. Friedman, J.H. Multivariate Adaptive Regression Splines. Ann. Stat. 1991, 19, 1–67.
  5. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
  6. Kao, L.J.; Chiu, C.C. Application of integrated recurrent neural network with multivariate adaptive regression splines on SPC-EPC process. J. Manuf. Syst. 2020, 57, 109–118.
  7. Yamada, Y.; Lindenbaum, O.; Negahban, S.; Kluger, Y. Deep supervised feature selection using Stochastic Gates. arXiv 2018, arXiv:1810.04247.
  8. Yu, K.; Yang, Y.; Ding, W. Causal Feature Selection with Missing Data. ACM Trans. Knowl. Discov. Data 2022, 16, 1–24.
  9. Alkhalifah, T.; Wang, H.; Ovcharenko, O. MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning. arXiv 2021, arXiv:2109.05294.
  10. Panday, D.; Cordeiro de Amorim, R.; Lane, P. Feature weighting as a tool for unsupervised feature selection. Inf. Process. Lett. 2018, 129, 44–52.
  11. Kaya, S.K.; Navarro-Arribas, G.; Torra, V. Dynamic Features Spaces and Machine Learning: Open Problems and Synthetic Data Sets. In Integrated Uncertainty in Knowledge Modelling and Decision Making; Huynh, V.N., Entani, T., Jeenanunta, C., Inuiguchi, M., Yenradee, P., Eds.; Springer: Cham, Switzerland, 2020; pp. 125–136.
  12. Rughetti, D.; Sanzo, P.D.; Ciciani, B.; Quaglia, F. Dynamic Feature Selection for Machine-Learning Based Concurrency Regulation in STM. In Proceedings of the 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Torino, Italy, 12–14 February 2014; pp. 68–75.
  13. Yu, K.; Guo, X.; Liu, L.; Li, J.; Wang, H.; Ling, Z.; Wu, X. Causality-based Feature Selection: Methods and Evaluations. arXiv 2019, arXiv:1911.07147.
  14. Kamalov, F.; Sulieman, H.; Cherukuri, A.K. Synthetic Data for Feature Selection. arXiv 2022, arXiv:2211.03035.
  15. John, G.H.; Kohavi, R.; Pfleger, K. Irrelevant Features and the Subset Selection Problem. In Machine Learning Proceedings 1994; Cohen, W.W., Hirsh, H., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 1994; pp. 121–129.
  16. Kim, G.; Kim, Y.; Lim, H.; Kim, H. An MLP-based feature subset selection for HIV-1 protease cleavage site analysis. Artif. Intell. Med. 2010, 48, 83–89.
  17. Zhu, Z.; Ong, Y.S.; Zurada, J.M. Identification of Full and Partial Class Relevant Genes. IEEE/ACM Trans. Comput. Biol. Bioinform. 2010, 7, 263–277.
  18. Liu, X.Y.; Liang, Y.; Wang, S.; Yang, Z.Y.; Ye, H.S. A Hybrid Genetic Algorithm with Wrapper-Embedded Approaches for Feature Selection. IEEE Access 2018, 6, 22863–22874.
  19. Zhu, Z.; Ong, Y.S.; Dash, M. Wrapper–Filter Feature Selection Algorithm Using a Memetic Framework. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2007, 37, 70–76.
  20. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth International Group: Belmont, CA, USA, 1984.
  21. Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. A review of feature selection methods on synthetic data. Knowl. Inf. Syst. 2012, 34, 483–519.
  22. Guyon, I.; Li, J.; Mader, T.; Pletscher, P.A.; Schneider, G.; Uhr, M. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognit. Lett. 2007, 28, 1438–1444.
  23. Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. An ensemble of filters and classifiers for microarray data classification. Pattern Recognit. 2012, 45, 531–539.
  24. Wang, D.; Nie, F.; Huang, H. Feature Selection via Global Redundancy Minimization. IEEE Trans. Knowl. Data Eng. 2015, 27, 2743–2755.
  25. Figueira, A.; Vaz, B. Survey on Synthetic Data Generation, Evaluation Methods and GANs. Mathematics 2022, 10, 2733.
  26. Varol, G.; Romero, J.; Martin, X.; Mahmood, N.; Black, M.J.; Laptev, I.; Schmid, C. Learning from Synthetic Humans. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017.
  27. Ward, C.M.; Harguess, J.; Hilton, C. Ship Classification from Overhead Imagery using Synthetic Data and Domain Adaptation. In Proceedings of the OCEANS 2018 MTS/IEEE Charleston, Charleston, SC, USA, 22–25 October 2018; pp. 1–5.
  28. Assefa, S.A.; Dervovic, D.; Mahfouz, M.; Tillman, R.E.; Reddy, P.; Veloso, M. Generating Synthetic Data in Finance: Opportunities, Challenges and Pitfalls. In Proceedings of the First ACM International Conference on AI in Finance, New York, NY, USA, 15–16 October 2021.
  29. Bonnéry, D.; Feng, Y.; Henneberger, A.K.; Johnson, T.L.; Lachowicz, M.; Rose, B.A.; Shaw, T.; Stapleton, L.M.; Woolley, M.E.; Zheng, Y. The Promise and Limitations of Synthetic Data as a Strategy to Expand Access to State-Level Multi-Agency Longitudinal Data. J. Res. Educ. Eff. 2019, 12, 616–647.
  30. Chen, G.; Chen, J. A novel wrapper method for feature selection and its applications. Neurocomputing 2015, 159, 219–226.
  31. Sánchez-Maroño, N.; Alonso-Betanzos, A.; Tombilla-Sanromán, M. Filter Methods for Feature Selection—A Comparative Study. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2007, Birmingham, UK, 16–19 December 2007; Yin, H., Tino, P., Corchado, E., Byrne, W., Yao, X., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 178–187.
  32. Xiao, Z.; Dellandrea, E.; Dou, W.; Chen, L. ESFS: A new embedded feature selection method based on SFS. In Ecole Centrale Lyon; Université de Lyon; LIRIS UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/École Centrale de Lyon; Research Report; Tsinghua University: Beijing, China, 2008.