Text Mining and Software Metrics in Vulnerability Prediction: Comparison

Vulnerability prediction is a mechanism that facilitates the identification (and, in turn, the mitigation) of vulnerabilities early enough in the software development cycle. The scientific community has recently focused a lot of attention on developing Deep Learning models that use text mining techniques and software metrics to predict the existence of vulnerabilities in software components. However, limited attention has been given to the comparison and the combination of text mining-based and software metrics-based vulnerability prediction models.

  • vulnerability prediction
  • software metrics
  • text mining
  • machine learning
  • ensemble learning

1. Introduction

Modern software programs are typically large, complicated, and interconnected. To design secure software, it is vital to follow secure and good programming practices. As a result, strategies and approaches are needed that can provide developers with indicative information about how secure their software is and thus help them improve its security level. Vulnerability prediction techniques may provide reliable information regarding a software product’s vulnerable hotspots and assist developers in prioritizing testing and inspection efforts by assigning limited testing resources to potentially vulnerable areas. Vulnerability Prediction Models (VPMs) are often created using Machine Learning (ML) approaches that utilize software features as input to differentiate between vulnerable and clean (or neutral) software components. Several VPMs have been developed throughout the years, each of which uses a different set of software features as inputs to anticipate the presence of vulnerable components (e.g., software metrics [1][2][3], text features [4][5], static analysis alerts [6][7], etc.).
More specifically, the initial attempts in the field of software vulnerability prediction investigated the ability of software metrics to indicate the existence of vulnerabilities in software, focusing mainly on cohesion, coupling, and complexity metrics [1][2][3]. They utilized ML algorithms to classify software components as vulnerable or not. Text mining approaches, in which researchers tried to extract text patterns from the source code utilizing Deep Learning (DL) models, were also examined [4][5][8][9] and demonstrated promising results in vulnerability prediction. Although both approaches have been studied individually, and there are several claims that text mining-based approaches lead to better vulnerability prediction models, to the best of our knowledge, apart from [10][11], there is a lack of studies that directly compare text mining-based with software metrics-based vulnerability models or that examine the combination of text features and software metrics as indicators of vulnerability.

2. Vulnerability Prediction

The purpose of Vulnerability Prediction is to identify software hotspots (i.e., software artefacts) that are more likely to contain software vulnerabilities. These hotspots are parts of the source code that require closer attention from software developers and engineers from a security viewpoint. Vulnerability Prediction Models (VPMs) are models able to detect software components that are likely to contain vulnerabilities. These models are normally built based on Machine Learning (ML) and are used in practice for prioritizing testing and inspection efforts by allocating limited test resources to potentially vulnerable parts. For better understanding, the general structure of a Vulnerability Prediction Model is depicted in Figure 1.
Figure 1. The basic concept of vulnerability prediction.
As can be seen in Figure 1, the core element of vulnerability prediction is the vulnerability predictor, a model that decides whether a given source code file (i.e., software component) is potentially vulnerable or not. The first step of the process is the construction of the vulnerability predictor. To this end, a repository of clean and vulnerable software components (e.g., classes, functions, etc.) is initially constructed. Subsequently, appropriate mechanisms are employed to extract attributes from the source code (e.g., software metrics, static analysis alerts, text features, etc.), which are collected to construct the dataset used for training and evaluating vulnerability prediction models. Several VPMs are then generated, and the one demonstrating the best predictive performance is selected as the final vulnerability predictor. During the execution of the model in practice, when a new source code file arrives at the system, its attributes are extracted and provided as input to the vulnerability predictor, which, in turn, evaluates whether it is vulnerable or not.
The selection of the type of attributes that is provided as input to the generated VPMs is an important design decision in vulnerability prediction. The main VPMs found in the literature are based on software attributes extracted from the source code either through static analysis (e.g., software metrics) [1][2][3] or through text mining (e.g., bag of words, sequences of tokens, etc.) [4][5][9].
Software metrics-based VPMs: When VPMs utilize software metrics, they are trained on numerical features that describe characteristics of the source code (e.g., complexity, lines of code, etc.). These metrics are commonly extracted through static analysis and can provide quantitative information about quality attributes of the source code, such as the number of function calls and the number of linearly independent paths through a program’s source code. Popular metric suites used in practice are the Chidamber & Kemerer (CK) [12] and Quality Model for Object Oriented Design (QMOOD) [13] metric suites. Several open- and closed-source tools are available for their calculation, such as the CKJM Extended (Chidamber & Kemerer Java Metrics) (http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/, accessed on 2 January 2022) and Understand (https://en.wikipedia.org/wiki/Understand_(software)) tools.
Text mining-based VPMs: On the other hand, text mining-based VPMs are trained on datasets made up of text tokens retrieved from the source code. The simplest text mining approach is Bag of Words (BoW). In BoW, the code is separated into text tokens, each of which is accompanied by a count of how many times it appears in the source code. As a result, each word represents a feature, and the frequency of that feature in a component is the feature’s value for that component. Apart from BoW, a more complex text mining approach involves the transformation of the source code into a list of token sequences that can be fed into Deep Learning (DL) models able to parse sequential data (e.g., recurrent neural networks). The token sequences are the input to the DL models, which try to capture the syntactic information in the source code during the training phase and anticipate the presence of vulnerabilities in software components during the execution phase. A minimal sketch of these two representations is given below.
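The following Python snippet is a minimal, hedged sketch of the two text mining representations just described: BoW token counts and ordered token sequences. The tokenizer, the example components, and all names (tokenize_code, vocab, etc.) are illustrative assumptions, not the exact pipeline used in the studies cited above.

```python
# Illustrative sketch only: a crude code tokenizer plus the two text mining
# representations (BoW counts and ordered token sequences) described above.
import re
from sklearn.feature_extraction.text import CountVectorizer

def tokenize_code(source: str):
    """Split source code into simple text tokens (identifiers, numbers, operators)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)

# Hypothetical "software components" (e.g., functions) for demonstration.
components = [
    "int add(int a, int b) { return a + b; }",
    "char *copy(char *dst, char *src) { return strcpy(dst, src); }",
]

# 1) Bag of Words: each token is a feature whose value is its frequency.
vectorizer = CountVectorizer(tokenizer=tokenize_code, lowercase=False, token_pattern=None)
bow_matrix = vectorizer.fit_transform(components)  # shape: (n_components, vocabulary_size)

# 2) Token sequences: keep the token order and map each token to an integer id,
#    so the result can feed a sequential DL model (e.g., an RNN or 1D CNN).
vocab = {token: idx + 1 for idx, token in enumerate(sorted(vectorizer.vocabulary_))}
sequences = [[vocab[t] for t in tokenize_code(src)] for src in components]
```

Note how the BoW matrix discards token order, whereas the integer sequences preserve it; this difference is exactly what the sequence-based models discussed later exploit.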
To extract semantic information from tokens, text mining-based methods also employ Natural Language Processing (NLP) techniques, including token encoding with word2vec (https://radimrehurek.com/gensim/models/word2vec.html, accessed on 10 December 2021) embedding vectors. Word embedding methods learn a real-valued vector representation for a predetermined fixed-sized vocabulary from a corpus of text [14]. For a given natural language processing task, such as document classification, an embedding layer is a word embedding that is trained jointly with a neural network. This requires cleaning and preparing the document text so that each word can be encoded as a one-hot vector. The size of the vector space is determined by the model, and the vectors are initialized with small random numbers. The embedding layer is placed at the front end of a neural network and is fitted in a supervised way using the backpropagation algorithm. A small illustration with gensim follows.
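The sketch below shows how word2vec embeddings could be learned for source code tokens with gensim, the library linked above. The corpus and all hyperparameter values are illustrative assumptions, not those of the studies discussed here.

```python
# Hedged sketch: training word2vec on token sequences of software components.
from gensim.models import Word2Vec

# Each training "sentence" is the token sequence of one (hypothetical) component.
token_corpus = [
    ["int", "add", "(", "int", "a", ",", "int", "b", ")", "{", "return", "a", "+", "b", ";", "}"],
    ["char", "*", "copy", "(", "char", "*", "dst", ",", "char", "*", "src", ")",
     "{", "return", "strcpy", "(", "dst", ",", "src", ")", ";", "}"],
]

model = Word2Vec(
    sentences=token_corpus,
    vector_size=50,   # dimensionality of the learned embedding space (assumed)
    window=5,         # context window around each token (assumed)
    min_count=1,      # keep every token, even rare ones
    sg=1,             # skip-gram variant
)

# Each token now maps to a real-valued vector; such vectors can initialize the
# embedding layer at the front end of a downstream neural network.
vector = model.wv["strcpy"]  # shape: (50,)
```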

3. Ensemble Learning

Ensemble learning [15] is a machine learning meta-approach that aims to improve predictive performance by integrating the predictions of several models. It is, in essence, an ML technique that combines numerous base models to build a single best-performing model. The core premise of ensemble learning is that, by merging many models, the faults of a single model will most likely be compensated by the other models, so that the ensemble’s overall prediction performance is better than that of any single model. The most common ensemble methods are divided into three categories, namely bagging, boosting, and stacking. Bagging [16][17] is a technique used to reduce prediction variance by fitting each base classifier on a random subset of the original dataset and subsequently combining their individual predictions (by voting or averaging) to generate a final prediction. Boosting [17] is an ensemble modeling strategy that aims to create a strong classifier out of a large number of weak ones. It is accomplished by constructing a model from a sequence of weak models: first, a model is created using the training data; then a second model is created, which attempts to correct the errors of the first. This procedure is repeated until either the entire training set is predicted correctly or the maximum number of models has been added. In this study, the stacking classifier is employed. Stacking (https://towardsdatascience.com/stacking-classifiers-for-higher-predictive-performance-566f963e4840, accessed on 2 January 2022) is a technique for bringing models together. It is made up of two layers of estimators. The first layer consists of the baseline models that are used to forecast the outcomes on the validation datasets, while the second layer is a meta-classifier that takes all of the baseline model predictions as input and generates the final predictions, as can be seen in Figure 2. A minimal sketch of such a stacking classifier is given after Figure 2.
Figure 2. The architecture of the Stacking classifier.
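As a hedged illustration of the two-layer architecture in Figure 2, the snippet below builds a stacking classifier with scikit-learn. The base estimators, the meta-classifier, and the synthetic data are illustrative choices, not the exact configuration used in the study.

```python
# Minimal sketch of a two-layer stacking ensemble (cf. Figure 2).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for a vulnerability dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[                           # first layer: the baseline models
        ("svm", SVC(probability=True)),
        ("rf", RandomForestClassifier()),
    ],
    final_estimator=LogisticRegression(),  # second layer: the meta-classifier
    cv=5,                                  # out-of-fold predictions feed the meta-classifier
    stack_method="predict_proba",          # pass predicted probabilities upward
)
stack.fit(X, y)
```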

4. Comparison between Software Metrics-Based and Text Mining-Based Models

In comparison with the software metrics approach, it can be seen (Table 1) that the sequence-based CNN models outperform the software metrics-based models. In particular, the best CNN model achieves an F1-score of 85.73% and an F2-score of 85.62%, which are 8% and 14% higher than the F1-score and F2-score, respectively, of the best software metrics-based model. In comparison with the Bag of Words (BoW) approach, the sequence-based models still demonstrate better predictive performance; however, the difference in performance is much smaller than for the metrics-based models, at least with respect to the F1-score and F2-score. This could be expected given that the two approaches are similar in nature (i.e., they are both text mining approaches) and differ only in the way the text tokens are represented. In fact, the improvement that the sequence-based models introduce is that, instead of taking as input the occurrences of the tokens in the code, they take as input the order of the tokens inside the source code, potentially allowing them to detect more complex code patterns; the improvement in predictive performance can thus be attributed to those patterns. In general, from the above analysis one can notice that text mining-based models (whether based on BoW or on sequences of tokens) provide better results in vulnerability prediction than software metrics-based models. For clarity, the scores quoted above can be computed as sketched below.
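For readers unfamiliar with the F2-score: both scores are instances of the F-beta measure, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 2 weights recall more heavily than precision. The snippet below is a hedged illustration with made-up labels, not data from the study.

```python
# Illustration of F1 vs. F2 on toy labels (not the study's data).
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

f1 = f1_score(y_true, y_pred)             # beta = 1: precision and recall weighted equally
f2 = fbeta_score(y_true, y_pred, beta=2)  # beta = 2: recall weighted four times as heavily
```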

Table 1. Evaluation scores of both text mining-based and software metrics-based models.

 

5. Hybrid Model Combining Software Metrics-Based and Text Mining-Based Models

Four classifiers were repeatedly trained on nine folds of the dataset: two based on software metrics (a Support Vector Machine and a Random Forest) and two based on text mining (i.e., BoW and sequences of tokens). Predictions were then made with each classifier, and the predicted probabilities were saved. These probabilities constituted the input of a Random Forest meta-classifier, which was trained on the outputs of the base classifiers and evaluated in a second cross-validation loop. Figure 3 illustrates the overview of this approach, and a hedged sketch of the scheme follows the figure.

Figure 3. The overview of the stacking approach between text mining and software metrics.
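The sketch below approximates this nested scheme under stated assumptions: the feature matrices (X_metrics, X_bow), the random data, and the use of three rather than four base classifiers (the token-sequence DL model is omitted for brevity) are all illustrative, not the study's actual setup.

```python
# Hedged sketch of the hybrid scheme: out-of-fold predicted probabilities from
# metrics-based and text-based classifiers become the features of a Random
# Forest meta-classifier, itself evaluated in a second cross-validation loop.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.svm import SVC

# Hypothetical inputs: X_metrics holds software metrics, X_bow holds BoW counts.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X_metrics = rng.random((200, 10))
X_bow = rng.random((200, 50))

base_models = [
    (SVC(probability=True), X_metrics),     # metrics-based classifier 1
    (RandomForestClassifier(), X_metrics),  # metrics-based classifier 2
    (RandomForestClassifier(), X_bow),      # text mining-based classifier (BoW)
]

# First loop: 9-fold out-of-fold class-1 probabilities from each base classifier.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=9, method="predict_proba")[:, 1]
    for model, X in base_models
])

# Second loop: evaluate the Random Forest meta-classifier on those probabilities.
meta_clf = RandomForestClassifier()
scores = cross_val_score(meta_clf, meta_features, y, cv=9, scoring="f1")
```

Using out-of-fold probabilities for the meta-classifier avoids leaking the base classifiers' training labels into the second layer, which is the usual motivation for the nested cross-validation design described above.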

Table 2 presents the results.

Table 2. Stacking classifier evaluation.

Based on Table 1 and Table 2, the combination of statically extracted code metrics and text features (either BoW or sequences of tokens) did not manage to surpass the pure text mining approach, at least on this specific dataset. The fact that the ensemble learning classifiers did not produce better results suggests that almost all of the correct predictions of the software metrics-based models are already included in the correct predictions of the text mining-based models, leaving few errors for the ensemble to compensate.

6. Conclusion

This analysis led to the conclusion that text mining is an effective solution for vulnerability prediction and that it is superior to the utilization of software metrics. More specifically, both the Bag of Words and the token sequences approaches provided better results than the software metrics-based models. Another interesting observation made in this analysis is that the combination of software metrics with text features did not lead to more accurate vulnerability prediction models. Although the predictive performance of the combined models was found to be sufficient, it did not manage to surpass that of the already strong text mining-based vulnerability prediction models.

References

  1. Shin, Y.; Williams, L. Is complexity really the enemy of software security? In Proceedings of the 4th ACM Workshop on Quality of Protection, Alexandria, VA, USA, 27 October 2008; pp. 47–50.
  2. Shin, Y.; Williams, L. An empirical model to predict security vulnerabilities using code complexity metrics. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, Kaiserslautern, Germany, 9 October 2008; pp. 315–317.
  3. Chowdhury, I.; Zulkernine, M. Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities. J. Syst. Archit. 2011, 57, 294–313.
  4. Pang, Y.; Xue, X.; Wang, H. Predicting vulnerable software components through deep neural network. In Proceedings of the 2017 International Conference on Deep Learning Technologies, Chengdu, China, 2 June 2017; pp. 6–10.
  5. Li, Z.; Zou, D.; Xu, S.; Ou, X.; Jin, H.; Wang, S.; Deng, Z.; Zhong, Y. VulDeePecker: A deep learning-based system for vulnerability detection. arXiv 2018, arXiv:1801.01681.
  6. Zheng, J.; Williams, L.; Nagappan, N.; Snipes, W.; Hudepohl, J.P.; Vouk, M.A. On the value of static analysis for fault detection in software. IEEE Trans. Softw. Eng. 2006, 32, 240–253.
  7. Gegick, M.; Williams, L. Toward the use of automated static analysis alerts for early identification of vulnerability- and attack-prone components. In Proceedings of the Second International Conference on Internet Monitoring and Protection (ICIMP 2007), San Jose, CA, USA, 1–5 July 2007; pp. 18–18.
  8. Neuhaus, S.; Zimmermann, T.; Holler, C.; Zeller, A. Predicting vulnerable software components. In Proceedings of the 14th ACM Conference on Computer and Communications Security, Alexandria, VA, USA, 2 November 2007; pp. 529–540.
  9. Hovsepyan, A.; Scandariato, R.; Joosen, W.; Walden, J. Software vulnerability prediction using text analysis techniques. In Proceedings of the 4th International Workshop on Security Measurements and Metrics, Lund, Sweden, 21 September 2012; pp. 7–10.
  10. Walden, J.; Stuckman, J.; Scandariato, R. Predicting vulnerable components: Software metrics vs text mining. In Proceedings of the 2014 IEEE 25th International Symposium on Software Reliability Engineering, Naples, Italy, 3–6 November 2014; pp. 23–33.
  11. Zhang, Y.; Lo, D.; Xia, X.; Xu, B.; Sun, J.; Li, S. Combining software metrics and text features for vulnerable file prediction. In Proceedings of the 2015 20th International Conference on Engineering of Complex Computer Systems (ICECCS), Gold Coast, Australia, 9–12 December 2015; pp. 40–49.
  12. Ferenc, R.; Hegedűs, P.; Gyimesi, P.; Antal, G.; Bán, D.; Gyimóthy, T. Challenging machine learning algorithms in predicting vulnerable JavaScript functions. In Proceedings of the 2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE), Montreal, QC, Canada, 28 May 2019; pp. 8–14.
  13. Sagi, O.; Rokach, L. Ensemble learning: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1249.
  14. Subramanyam, R.; Krishnan, M.S. Empirical analysis of CK metrics for object-oriented design complexity: Implications for software defects. IEEE Trans. Softw. Eng. 2003, 29, 297–310.
  15. Goyal, P.K.; Joshi, G. QMOOD metric sets to assess quality of Java program. In Proceedings of the 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), Ghaziabad, India, 7–8 February 2014; pp. 520–533.
  16. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
  17. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.