Commit Representations for JIT Vulnerability Prediction

Commit Representations for JIT Vulnerability Prediction: History

Please note this is an old version of this entry, which may differ significantly from the current revision.

Subjects: Computer Science, Software Engineering

Contributor:

Rudolf Ferenc

Software systems are becoming increasingly complex and interdependent, with thousands of lines of code being added and modified daily. As a result, software vulnerabilities are becoming more prevalent and pose a significant threat to the security and reliability of software systems. With the evolution of software systems, their size and complexity are rising rapidly. Identifying vulnerabilities as early as possible is crucial for ensuring high software quality and security. Just-in-time (JIT) vulnerability prediction, which aims to find vulnerabilities at the time of commit, has increasingly become a focus of attention.

commit representation
vulnerability prediction
just-in-time
software
machine learning

1. Introduction

Software systems are becoming increasingly complex and interdependent, with thousands of lines of code being added and modified daily. As a result, software vulnerabilities are becoming more prevalent and pose a significant threat to the security and reliability of software systems. This is clearly apparent from cybersecurity firm Tenable’s report of 2021 [1]. According to them, from 2016 to 2021, the number of reported CVEs increased at an average annual growth rate of 28.3%, accumulating to a 241% increase. Many of these are zero-day vulnerabilities that were disclosed in 2021 across a variety of popular software applications, leaving software vendors a short time to prevent exploitation.

In order to mitigate these vulnerabilities, it is crucial to identify and address them as early as possible during the software development process. A fitting time for identification is during code addition to the codebase in the version control system, i.e., at commit time. This process is usually referred to as just-in-time (JIT) vulnerability prediction [2]. Commits contain data on bug fixes, feature additions, code refactoring, and additional metadata in the form of commit messages and author information. By analyzing commits based on the contained data, software engineers can identify patterns that may indicate the presence of vulnerabilities [3].

Identifying and analyzing these vulnerable commits manually is a daunting task, especially in large software projects with thousands of commits. It is not feasible for human analysts to examine each commit, and the task may be error-prone due to the sheer volume of changes [4]. Manual efforts are also found to be lacking in terms of adaptiveness, as vulnerabilities appear very rapidly in different forms and human resources are expensive. To address this challenge, researchers have developed machine-learning-based approaches to automatically identify and predict vulnerable commits [5].

One critical component of these approaches is the representation of commits. Commit representations capture the information contained in commits in a way that can be processed by machine learning algorithms, typically resulting in variable-length vectors. Representations are designed based on data contained in the commits such as the commit messages [6], the changed lines of code (patches) [7], or commit metadata and patches together [2,8]. Some approaches also use code metrics to supplement the commit metadata to predict potential vulnerability-contributing commits, such as VCCFinder [9].

Commit representations that use source code as input also vary in what form they take the source code and what granularity they parse it. Source code can be used in its raw form as raw text and processed by popular NLP methods, intermediate representation forms can be used such as the abstract syntax tree (AST) [7] or code metrics derived from the source code can also be used [9]. The granularity is also an essential factor, as it directly affects the usability of the representation and it varies a lot: the commit can be taken as a whole by aggregating the patches of all the changed files [8], or it can be trained at the file, class, method, or even line level. A finer granularity will contain fewer data but provide a more precise location of a suspected vulnerable piece of code than a coarser-grained one.

2. Existing Commit Representations for Vulnerability Prediction

The identification and prediction of vulnerable commits have been studied extensively in recent years. In this section, the existing literature on commit representations for vulnerability prediction is reviewed.

An earlier line of work entails using different attributes derived from the commit to reason about the code change’s effects. Mockus et al. used predictors such as the size of lines of code added, deleted, and unmodified, measures of developer experience, the type of change, etc., to predict the quality of change in regard to inspection, testing, and delivery [12].

Kamei et al. uses source-code metrics and Kim et al. uses features extracted from the revision history of software projects [13,14]. Similarly, other works extract features from metrics and/or commit metadata to enhance the procedure of code quality assessment and ultimately prevent vulnerabilities [9,15]. Usually, in these methods, the extracted features are fed into a machine learning model, such as an SVM or a random forest.

Even though metric-based approaches were an improvement over manual procedures, they still could not capture the semantic and syntactic relations between the code elements. Lomio et al. found that existing metrics were not sufficient enough and additional work was needed to find more appropriate ways to facilitate JIT vulnerability prediction [16]. A possible route for improvement was based on the insight that Hindle et al. provided in their work: source code is similar to texts written in natural languages, and as such, NLP methods can be leveraged to improve the related models’ effectiveness [17].

Agreeing on this insight, many works in the field use the source code itself as input to find an improvement over metric-based approaches. Minh Le et al. proposed DeepCVA, a convolutional neural network that uses code changes (patches) and their contexts [18] for commit-level vulnerability assessment.

Hoang et al. describe DeepJIT, an end-to-end deep learning framework also based on the convolutional neural network architecture, which automatically extracts features from commit messages and code changes and uses them to identify defects [2]. In their follow-up work, Hoang et al. propose CC2Vec, a framework that also uses commit messages and code changes as input but uses a hierarchical attention neural network to produce generally usable vectors [8] They evaluated their method on log-message generation, bug-fixing patch identification, and just-in-time defect prediction and found that they outperformed previous works. Our study specifically focused on DeepJIT and CC2Vec due to their innovative use of recent advances of machine learning approaches, such as CNNs and HANs. These models, at the time of our investigation, were contending with each other as state-of-the-art approaches for JIT vulnerability prediction. Their emerging prominence in commit analysis suggested their strong potential, making them particularly relevant to our research objectives in the dynamic field of defect prediction.

Even though these methods are promising due to their relatively high predictive power, they take the source code as (preprocessed) text and they do not use the strictly structured nature of source code. The available intermediate representations, such as the abstract syntax tree (AST) and control flow graphs (CFG) are promising for providing more structural information, as shown in works related to source code representation and repair [19,20,21,22].

Change trees are specifically designed to keep the structural differences in changed codes [23]. They are constructed by using the ASTs of the pre- and postcommit states and discarding the AST paths that are the same in the two states. The resulting AST structure is called a change tree, as it only contains the relations that are changed as part of the code change. We selected this method as our third candidate due to its customizability. It can be tailored for any source code element with an associated AST, potentially capturing more nuanced information than standard feature extraction methods. This versatility makes change trees a valuable addition to our comparative analysis in exploring the landscape of vulnerability prediction.

This entry is adapted from the peer-reviewed paper 10.3390/computers13010022

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.