Multi-View Code Representations for Function Vulnerability Detection

Version	Summary	Created by	Modification	Content Size	Created at	Operation
1		Zhenzhou Tian	--	1356	2023-08-19 10:02:24	\|
2	update references and layout	Rita Xu	Meta information modification	1356	2023-08-21 05:13:59	\|

This entry is adapted from the peer-reviewed paper 10.3390/electronics12112495

The explosive growth of vulnerabilities poses a significant threat to the security of software systems. While various deep-learning-based vulnerability detection methods have emerged, they primarily rely on semantic features extracted from a single code representation structure, which limits their ability to detect vulnerabilities hidden deep within the code.

vulnerability detection representation fusion attributed control flow graph

1. Introduction

In recent times, the incidence of network attacks has witnessed a significant upsurge. These attacks are primarily driven by the ubiquitous presence of software vulnerabilities. To date, over 200,000 such vulnerabilities have been recorded on the Common Vulnerabilities and Exposures (CVE) website ^[1]. Given the pervasive exploitation of vulnerabilities and the significant security threats they pose, it is critical for developers to proactively detect vulnerabilities in their code. whether written by themselves or reused from open-source software.

However, identifying multi-faceted vulnerabilities requires security-related domain knowledge that goes beyond the expertise of most developers. This presents significant challenges for vulnerability detection. In light of the ever-expanding scale and complexity of modern software systems, it has become increasingly impractical, even for security professionals, to manually detect potential vulnerabilities within millions of lines of code, given the tremendous efforts and time required.

Inspired by its impressive performance in diverse domains such as NLP ^[2] and program analysis ^[3]^[4]^[5]^[6], deep learning has also been harnessed to develop a range of approaches ^[7]^[8]^[9] for the detection of vulnerabilities. These approaches utilize labeled training samples and extract semantic-aware features from them to construct classifiers that map the target code snippets onto a class space that indicates the absence or presence of vulnerabilities, or specific vulnerability types. Typically, prevailing deep-learning-based vulnerability detection methods rely on a single code representation structure to identify vulnerabilities, which, however, may fail to comprehensively capture vulnerability-indicative patterns and detect those vulnerabilities that are well-hidden within the code. This is due to the fact that these vulnerability-indicative patterns may require different perspectives on the code reflected by different code representation structures.

To address the aforementioned limitation, researchers propose S

^{2}

FVD, which entails a novel approach that leverages fused semantic vectors that are learned from three essential code representations, including token sequence, attribute control flow graph (ACFG), and abstract syntax tree (AST). These code representations provide distinct perspectives on the code, thereby allowing the model to more comprehensively capture vulnerability-indicative features from the code. The main contributions of this research are summarized as follows.

A novel DL-based vulnerability detection method called S $^{2}$ FVD is presented. To accommodate the distinct representations of the code, an adaptive learning model has been devised to capture the multi-faceted aspects of function semantics and fuse them together to ensure the extraction of comprehensive semantic features. This strategy effectively prevents the loss of critical features that are indicative of vulnerability patterns.

An extended-tree-structured neural network called ERvNN has been designed, which can effectively encode the semantics implied in the abstract syntax tree. With a GRU-style aggregation optimization on the tree nodes, it supports the straightforward and efficient encoding of multi-way tree structures, which otherwise should be firstly converted to the binary tree form.
Extensive experiments were conducted to evaluate the performance of S $^{2}$ FVD. The results demonstrated that S $^{2}$ FVD outperformed existing state-of-the-art DL-based methods in terms of accuracy, F $_{1}$ score, precision, and recall when detecting the presence of vulnerabilities and pinpointing the specific vulnerability types. Moreover, ablation studies confirmed the effectiveness of the devised ERvNN for encoding AST and the strategy of representation fusion for enhancing the performance of S $^{2}$ FVD.

A new dataset has been constructed to facilitate vulnerability detection research. The dataset consists of 25,333 C functions, each of which is well labeled with either a specific CWE ID indicating a vulnerability or a non-vulnerable ground truth. The source implementation of the S $^{2}$ FVD has also been made publicly available at https://github.com/lv-jiajun/S2FVD (accessed on 22 May 2023) to facilitate future benchmarking and comparisons.

2. Multi-View Code Representations for Function Vulnerability Detection

The closely related vulnerability detection methods, which broadly fall into the following three categories, including code similarity-based methods ^[10]^[11], static-rule-based methods ^[12], and learning-based methods ^[13]^[14], are mainly discussed. Also, in introducing these methods, researchers focus more on the deep-learning-based ones. It should be noted that this is not a survey paper. Thus, the other types of vulnerability detection methods, which focus on binary code ^[15]^[16], examine executing dynamic analysis ^[17], or review formal semantic analysis ^[18]^[19] (e.g., model checking and symbolic execution), are not delved into.

2.1. Code-Similarity-Based Methods

Code-similarity-based vulnerability detection relies on the core idea that source code exhibiting high similarity is likely to share vulnerabilities ^[20]^[21]. However, while this approach can effectively identify vulnerabilities introduced through code cloning, it suffers from high rates of false negatives when it is used to detect other types of vulnerabilities not resulting from code cloning ^[22]^[23].

2.2. Rule-Based Methods

The static-rule-based methods involve scanning the target source code using a multitude of meticulously defined vulnerability rules or patterns. Prominent examples of typical static analyzers in this category include Infer ^[24], CodeChecker ^[25], and Checkmarx ^[26]. One of the main issues is that the vulnerability rules defined by human experts are often subjective, making it challenging to consider all possible scenarios that distinguish between vulnerabilities and non-vulnerabilities ^[27]^[28]. As a result, this approach may lead to a high rate of false positives and false negatives.

2.3. Learning-Based Methods

These methods can be broadly categorized as traditional machine- or deep-learning-based, depending on whether expert-defined features are required.

2.3.1. Conventional Machine Learning-Based Methods

Early works ^[29]^[30] typically utilized traditional machine learning algorithms for training detection models. These models rely on representative features that are engineered by experts such as code complexity metrics, code churns, imports and calls, and developer activities ^[31]. Nevertheless, these engineered features are often inadequate in indicating the presence of vulnerabilities. Additionally, most existing methods are restricted to in-project vulnerability detection, rather than providing general-purpose solutions.

2.3.2. Deep-Learning-Based Methods

Deep-learning-based methods, on the other hand, leverage the powerful feature learning capabilities of deep neural networks to automatically extract vulnerability patterns or features without requiring manual definition from experts ^[32]. The majority of deep-learning-based detection research concentrates on sequence-based code representation learning. For example, Russell et al. ^[7] developed a lexical analyzer to transform C/C++ functions into corresponding token sequences. These sequences were subsequently input into CNN and RNN models for training and then applied to detect code vulnerability. Li et al. created VulDeePecker ^[33], which is a vulnerability detection system based on deep learning. This system generates code gadgets (i.e., sets of control or data-dependent statements) that are lexically analyzed to establish token sequences, which are then fed into neural networks for vulnerability detection purposes. Later, Li et al. proposed SySeVR ^[8], which is a system framework for detecting vulnerabilities in C/C++ source code. This framework is primarily focused on obtaining code sequences that capture both syntactic and semantic information to achieve vulnerability detection.

Since sequence-based code representation overlooks the syntactic structure and control flow information inherent in source code, some research on code vulnerability detection has resorted to trees or graphs as code representations, as well as employing corresponding neural network models to learn semantic information within the code. For instance, Dam et al. ^[34] parsed a source code file into an abstract syntax tree and employed the Tree-LSTM model to detect vulnerabilities within files. Zhou et al. ^[9] introduced Devign, which is a graph neural network (GNN) model that bases its composite code representation on the abstract syntax tree. Devign encodes various data and control dependencies to create a joint graph, which is subsequently input into GNNs to detect source code vulnerability. Li et al. ^[35] deployed a program dependence graph as the code representation and used a FA-GCN (graph convolution network with feature attention) to classify the graph, thereby achieving the successful detection of code vulnerabilities.

However, the single-representation-based method has difficulty in capturing the complete semantic information in the code, thus leading to higher rates of both false positives and false negatives. To address this issue, multiple distinct code representations are extracted, while adaptive deep neural network models are selected or devised to encode the different aspects of the function semantics. By retrieving the deeply implied semantic features and fusing them organically, more comprehensive vulnerability indicative features are obtained that lead to enhanced code vulnerability detection performance.

References

CVE. Available online: https://cve.mitre.org/ (accessed on 24 April 2023).
Scandariato, R.; Walden, J.; Hovsepyan, A. Predicting vulnerable software components via text mining. IEEE Trans. Softw. Eng. 2014, 40, 987–1001.
Shaukat, K.; Luo, S.; Varadharajan, V. A novel deep learning-based approach for malware detection. Eng. Appl. Artif. Intell. 2023, 122, 106030.
Tian, Z.; Wang, Q.; Gao, C.; Chen, L.; Wu, D. Plagiarism detection of multi-threaded programs via siamese neural networks. IEEE Access 2020, 8, 160802–160814.
Tian, Z.; Huang, Y.; Xie, B.; Chen, Y.; Chen, L.; Wu, D. Fine-grained compiler identification with sequence-oriented neural modeling. IEEE Access 2021, 9, 49160–49175.
Tian, Z.; Tian, J.; Wang, Z.; Chen, Y.; Xia, H.; Chen, L. Landscape estimation of solidity version usage on ethereum via version identification. Int. J. Intell. Syst. 2021, 37, 450–477.
Russel, R.; Kim, L.; Hamilton, L. Automated vulnerability detection in source code using deep representation learning. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications, Orlando, FL, USA, 17–20 December 2018; pp. 757–762.
Li, Z.; Zou, D.; Xu, S. SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities. IEEE Trans. Dependable Secur. Comput. 2021, 19, 2244–2258.
Zhou, Y.; Liu, S.; Siow, J. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Adv. Neural Inf. Process. Syst. 2019, 32, 1–11.
Sun, H.; Cui, L.; Li, L. VDSimilar: Vulnerability detection based on code similarity of vulnerabilities and patches. Comput. Secur. 2021, 110, 102417.
Jang, J.; Agrawal, A.; Brumley, D. ReDeBug: Finding Unpatched Code Clones in Entire OS Distributions. In Proceedings of the 2012 IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 20–23 May 2012; pp. 48–62.
FlawFinder. Available online: https://dwheeler.com/flawfinder/ (accessed on 10 April 2023).
Younis, A.; Malaiya, Y.; Anderson, C. To fear or not to fear that is the question: Code characteristics of a vulnerable functionwith an existing exploit. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, New Orleans, IL, USA, 9–11 March 2016; pp. 97–104.
Hin, D.; Kan, A.; Chen, H. LineVD: Statement-level vulnerability detection using graph neural networks. In Proceedings of the 19th International Conference on Mining Software Repositories, Pittsburgh, PA, USA, 23–24 May 2022; pp. 596–607.
Yang, S.; Cheng, L.; Zeng, Y. Asteria: Deep Learning-based AST-Encoding for Cross-platform Binary Code Similarity Detection. In Proceedings of the 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Taipei, China, 21–24 June 2021; pp. 154–196.
Vadayath, J.; Eckert, M.; Zeng, K.; Weideman, N.; Menon, G.P.; Fratantonio, Y.; Balzarotti, D.; Doupé, A.; Bao, T.; Wang, R.; et al. Arbiter: Bridging the static and dynamic divide in vulnerability discovery on binary programs. In 31st USENIX Security Symposium (USENIX Security 22); USENIX Association: Boston, MA, USA, 2022; pp. 413–430.
Beaman, C.; Redbourne, M.; Mummery, J.D.; Hakak, S. Fuzzing vulnerability discovery techniques: Survey, challenges and future directions. Comput. Secur. 2022, 120, 102813.
Zheng, P.; Zheng, Z.; Luo, X. Park: Accelerating smart contract vulnerability detection via parallel-fork symbolic execution. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis; Ser. ISSTA 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 740–751.
D’Silva, V.; Kroening, D.; Weissenbacher, G. A survey of automated techniques for formal software verification. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2008, 27, 1165–1178.
Li, Z.; Zou, D.Q.; Xu, S.H. VulPecker: An automated vulnerability detection system based on code similarity analysis. In Proceedings of the 32nd Annual Conference on Computer Security Applications (ACSAC ’16). Association for Computing Machinery, New York, NY, USA, 5–8 December 2016; pp. 201–213.
Cui, L.; Hao, Z.; Jiao, Y. Vuldetector: Detecting vulnerabilities using weighted feature graph comparison. IEEE Trans. Inf. Forensics Secur. 2020, 16, 2004–2017.
Li, Z. Survey on static software vulnerability detection for source code. Chin. J. Netw. Inf. Secur. 2019, 5, 1–14.
Kim, S.; Woo, S.; Lee, H.; Oh, H. Vuddy: A scalable approach for vulnerable code clone discovery. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2017; pp. 595–614.
Infer, Infer: A Tool to Detect Bugs in Java and c/c++/objective-c Code before It Ships. 2013. Available online: https://fbinfer.com (accessed on 12 April 2023).
CodeChecker. 2013. Available online: https://codechecker.readthedocs.io/en/latest (accessed on 20 April 2023).
Checkmarx, Checkmarx. 2022. Available online: https://www.checkmarx.com (accessed on 28 April 2023).
Stephan, L.; Sebastian, B.; Alexander, P. An Empirical Study on the Effectiveness of Static C Code Analyzers for Vulnerability Detection. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual, Republic of Korea, 18–22 July 2022; pp. 544–555.
Goseva-Popstojanova, K.; Perhinschi, A. On the capability of static code analysis to detect security vulnerabilities. Inf. Softw. Tech. 2015, 68, 18–33.
Ghaffarian, S.M.; Shahriari, H.R. Software vulnerability analysis and discovery using machine-learning and data-mining techniques: A survey. ACM Comput. Surv. 2017, 50, 1–36.
Perl, H.; Dechand, S.; Smith, M. Vccfinder: Finding potential vulnerabilities in open-source projects to assist code audits. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, 12–16 October 2015; pp. 426–437.
Bosu, A.; Carver, J.C.; Hafiz, M. Identifying the characteristics of vulnerable code changes: An empirical study. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, Hong Kong, China, 16–21 November 2014; pp. 257–268.
Lin, G.; Wen, S.; Han, Q.L. Software Vulnerability Detection Using Deep Neural Networks: A Survey. Proc. IEEE 2020, 108, 1825–1848.
Li, Z.; Zou, D.; Xu, S. Vuldeepecker: A deep learning-based system for vulnerability detection. In Proceedings of the 2018 25th Annual Network and Distributed System Security Symposium (NDSS’18), San Diego, CA, USA, 18–21 February 2018; pp. 1–15.
Dam, H.K.; Pham, T.; Ng, S.W. A deep tree-based model for software defect prediction. arXiv 2018, arXiv:1802.00921.
Li, Y.; Wang, S.; Nguyen, T.N. Vulnerability detection with fine-grained interpretations. In Proceedings of the 2021 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, 23–28 August 2021; pp. 292–303.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.

Upload a video for this entry

Information

Subjects: Computer Science, Software Engineering

Contributors MDPI registered users' name will be linked to their SciProfiles pages. To register with us, please refer to https://encyclopedia.pub/register :

Zhenzhou Tian

Binhui Tian

Jiajun Lv

Lingwei Chen

View Times: 184

Update Date: 21 Aug 2023

Table of Contents

Video Upload Options

Confirm