Multi-View Code Representations for Function Vulnerability Detection

Multi-View Code Representations for Function Vulnerability Detection: Comparison

Please note this is a comparison between Version 1 by Zhenzhou Tian and Version 2 by Rita Xu.

The explosive growth of vulnerabilities poses a significant threat to the security of software systems. While various deep-learning-based vulnerability detection methods have emerged, they primarily rely on semantic features extracted from a single code representation structure, which limits their ability to detect vulnerabilities hidden deep within the code.

vulnerability detection
representation fusion
attributed control flow graph

1. Introduction

In recent times, the incidence of network attacks has witnessed a significant upsurge. These attacks are primarily driven by the ubiquitous presence of software vulnerabilities. To date, over 200,000 such vulnerabilities have been recorded on the Common Vulnerabilities and Exposures (CVE) website [1]. Given the pervasive exploitation of vulnerabilities and the significant security threats they pose, it is critical for developers to proactively detect vulnerabilities in their code. whether written by themselves or reused from open-source software.

However, identifying multi-faceted vulnerabilities requires security-related domain knowledge that goes beyond the expertise of most developers. This presents significant challenges for vulnerability detection. In light of the ever-expanding scale and complexity of modern software systems, it has become increasingly impractical, even for security professionals, to manually detect potential vulnerabilities within millions of lines of code, given the tremendous efforts and time required.

Inspired by its impressive performance in diverse domains such as NLP [2] and program analysis ^[3][4][5][6][3,4,5,6], deep learning has also been harnessed to develop a range of approaches ^[7][8][9][7,8,9] for the detection of vulnerabilities. These approaches utilize labeled training samples and extract semantic-aware features from them to construct classifiers that map the target code snippets onto a class space that indicates the absence or presence of vulnerabilities, or specific vulnerability types. Typically, prevailing deep-learning-based vulnerability detection methods rely on a single code representation structure to identify vulnerabilities, which, however, may fail to comprehensively capture vulnerability-indicative patterns and detect those vulnerabilities that are well-hidden within the code. This is due to the fact that these vulnerability-indicative patterns may require different perspectives on the code reflected by different code representation structures.

To address the aforementioned limitation, reswearchers propose S

^{2}

FVD, which entails a novel approach that leverages fused semantic vectors that are learned from three essential code representations, including token sequence, attribute control flow graph (ACFG), and abstract syntax tree (AST). These code representations provide distinct perspectives on the code, thereby allowing the model to more comprehensively capture vulnerability-indicative features from the code. The main contributions of this rpapesearchr are summarized as follows.

A novel DL-based vulnerability detection method called S $^{2}$ FVD is presented. To accommodate the distinct representations of the code, an adaptive learning model has been devised to capture the multi-faceted aspects of function semantics and fuse them together to ensure the extraction of comprehensive semantic features. This strategy effectively prevents the loss of critical features that are indicative of vulnerability patterns.

An extended-tree-structured neural network called ERvNN has been designed, which can effectively encode the semantics implied in the abstract syntax tree. With a GRU-style aggregation optimization on the tree nodes, it supports the straightforward and efficient encoding of multi-way tree structures, which otherwise should be firstly converted to the binary tree form.
Extensive experiments were conducted to evaluate the performance of S $^{2}$ FVD. The results demonstrated that S $^{2}$ FVD outperformed existing state-of-the-art DL-based methods in terms of accuracy, F $_{1}$ score, precision, and recall when detecting the presence of vulnerabilities and pinpointing the specific vulnerability types. Moreover, ablation studies confirmed the effectiveness of the devised ERvNN for encoding AST and the strategy of representation fusion for enhancing the performance of S $^{2}$ FVD.

A new dataset has been constructed to facilitate vulnerability detection research. The dataset consists of 25,333 C functions, each of which is well labeled with either a specific CWE ID indicating a vulnerability or a non-vulnerable ground truth. The source implementation of the S $^{2}$ FVD has also been made publicly available at https://github.com/lv-jiajun/S2FVD (accessed on 22 May 2023) to facilitate future benchmarking and comparisons.

2. Multi-View Code Representations for Function Vulnerability Detection

The closely related vulnerability detection methods, which broadly fall into the following three categories, including code similarity-based methods ^[10][11][10,11], static-rule-based methods [12], and learning-based methods ^[13][14][13,14], are mainly discussed. Also, in introducing these methods, rwesearchers focus more on the deep-learning-based ones. It should be noted that this is not a survey paper. Thus, the other types of vulnerability detection methods, which focus on binary code ^[15][16][15,16], examine executing dynamic analysis [17], or review formal semantic analysis ^[18][19][18,19] (e.g., model checking and symbolic execution), are not delved into.

2.1. Code-Similarity-Based Methods

Code-similarity-based vulnerability detection relies on the core idea that source code exhibiting high similarity is likely to share vulnerabilities ^[20][21][20,21]. However, while this approach can effectively identify vulnerabilities introduced through code cloning, it suffers from high rates of false negatives when it is used to detect other types of vulnerabilities not resulting from code cloning ^[22][23][22,23].

2.2. Rule-Based Methods

The static-rule-based methods involve scanning the target source code using a multitude of meticulously defined vulnerability rules or patterns. Prominent examples of typical static analyzers in this category include Infer [24], CodeChecker [25], and Checkmarx [26]. One of the main issues is that the vulnerability rules defined by human experts are often subjective, making it challenging to consider all possible scenarios that distinguish between vulnerabilities and non-vulnerabilities ^[27][28][27,28]. As a result, this approach may lead to a high rate of false positives and false negatives.

2.3. Learning-Based Methods

These methods can be broadly categorized as traditional machine- or deep-learning-based, depending on whether expert-defined features are required.

2.3.1. Conventional Machine Learning-Based Methods

Early works ^[29][30][29,30] typically utilized traditional machine learning algorithms for training detection models. These models rely on representative features that are engineered by experts such as code complexity metrics, code churns, imports and calls, and developer activities [31]. Nevertheless, these engineered features are often inadequate in indicating the presence of vulnerabilities. Additionally, most existing methods are restricted to in-project vulnerability detection, rather than providing general-purpose solutions.

2.3.2. Deep-Learning-Based Methods

Deep-learning-based methods, on the other hand, leverage the powerful feature learning capabilities of deep neural networks to automatically extract vulnerability patterns or features without requiring manual definition from experts [32]. The majority of deep-learning-based detection research concentrates on sequence-based code representation learning. For example, Russell et al. [7] developed a lexical analyzer to transform C/C++ functions into corresponding token sequences. These sequences were subsequently input into CNN and RNN models for training and then applied to detect code vulnerability. Li et al. created VulDeePecker [33], which is a vulnerability detection system based on deep learning. This system generates code gadgets (i.e., sets of control or data-dependent statements) that are lexically analyzed to establish token sequences, which are then fed into neural networks for vulnerability detection purposes. Later, Li et al. proposed SySeVR [8], which is a system framework for detecting vulnerabilities in C/C++ source code. This framework is primarily focused on obtaining code sequences that capture both syntactic and semantic information to achieve vulnerability detection.

Since sequence-based code representation overlooks the syntactic structure and control flow information inherent in source code, some research on code vulnerability detection has resorted to trees or graphs as code representations, as well as employing corresponding neural network models to learn semantic information within the code. For instance, Dam et al. [34] parsed a source code file into an abstract syntax tree and employed the Tree-LSTM model to detect vulnerabilities within files. Zhou et al. [9] introduced Devign, which is a graph neural network (GNN) model that bases its composite code representation on the abstract syntax tree. Devign encodes various data and control dependencies to create a joint graph, which is subsequently input into GNNs to detect source code vulnerability. Li et al. [35] deployed a program dependence graph as the code representation and used a FA-GCN (graph convolution network with feature attention) to classify the graph, thereby achieving the successful detection of code vulnerabilities.

However, the single-representation-based method has difficulty in capturing the complete semantic information in the code, thus leading to higher rates of both false positives and false negatives. To address this issue, multiple distinct code representations are extracted, while adaptive deep neural network models are selected or devised to encode the different aspects of the function semantics. By retrieving the deeply implied semantic features and fusing them organically, more comprehensive vulnerability indicative features are obtained that lead to enhanced code vulnerability detection performance.