Basic Process of Binary Code Similarity Analysis: Comparison
Please note this is a comparison between Version 1 by Jiang Du and Version 3 by Wendy Huang.

With software engineering now highly developed, code reuse is widely recognized as an effective strategy for reducing development effort and improving productivity. However, improper code reuse can lead to security risks and license issues. Because the source code of many software products is difficult to obtain, binary code similarity analysis (BCSA) has been widely applied in fields such as bug search, code clone detection, and patch analysis.

  • binary code similarity analysis
  • source code
  • pre-compilation
  • compilation
  • assembly
  • linking
  • feature extraction
  • feature representation
  • feature comparison

1. Introduction

With the progress of science and technology, electronic devices, software, and the Internet have become integral parts of daily life. The continual improvement of Internet technology, the frequent updates of software applications, and the increasing ease of use of software have brought immense convenience to people. At the same time, rapid release cycles and an increasingly complex network environment present significant development and maintenance challenges to software developers. Using open source software reduces the workload of developers and transfers maintenance responsibilities to third-party developers. Despite these benefits, including improved development efficiency, the extensive use of open source software also entails several risks. For instance, incorporating open source software that contains vulnerabilities introduces those vulnerabilities into the project code. Additionally, the unauthorized use of open source software in project code may result in license compliance issues.
The “2022 Open Source Security and Analysis Report” [1] released by Synopsys highlights the prevalence of open source code across industries. Among the 17 industries studied, every audited codebase in four of them (computer hardware and semiconductors, network security, energy and clean technology, and the Internet of Things) contained open source code. In the remaining industries, between 93% and 99% of codebases contained open source code. The report also indicates that this extensive use has brought both benefits and risks. For example, in the Internet of Things sector, 100% of codebases use open source software, and 64% of those codebases contain vulnerabilities. Similarly, in the aerospace, automotive, transportation, and logistics industries, 97% of codebases contain open source code and 60% of those codebases have security vulnerabilities.
In late 2021, a zero-day vulnerability was identified in Apache Log4j, a widely used Java logging library. This vulnerability, known as Log4Shell (CVE-2021-44228) [2], enables an attacker to execute arbitrary code on an affected server. The first documented attacks using this vulnerability occurred on December 9, initially targeting Java Edition 1.18 of Microsoft’s Minecraft game. According to the attack cases documented in the GitHub repository YfryTchsGD/Log4jAttackSurface, this vulnerability affects a range of popular services and platforms, including Apple iCloud, QQ Mailbox, Steam Store, Twitter, and Baidu search. This highlights the potentially far-reaching consequences of vulnerabilities in widely used open source codebases.
In terms of license security, creative works (including software) are protected by exclusive copyright by default. Any use, copying, distribution, or modification of the software without the express permission of the creator or author, granted in the form of a license, is legally prohibited. Even the most permissive open source licenses impose obligations on users of the software. License risk arises when the license of open source code present in a codebase conflicts with the overarching license of that codebase. For instance, the GNU General Public License (GPL) generally regulates the use of open source code in commercial software, but commercial software vendors may neglect the requirements of the GPL, which may result in license conflicts. Across industries, computer hardware and semiconductors have the highest percentage of codebases with open source license conflicts at 93%, followed by the Internet of Things at 83%. Conversely, healthcare, health tech, and life sciences have the lowest percentage at 41%.
BCSA is an approach for addressing security vulnerabilities that arise from code reuse when source code is unavailable. By measuring the similarity between a target binary function and known vulnerable functions, a preliminary assessment of the potential vulnerability of the target function can be made. This comparison framework can be used for a single function-to-function match, and it can also be extended to multiple matches, that is, searching for the target function in a global vulnerability database. Analogously, this methodology can help reveal covert code plagiarism as well as potential licensing risks.
The reuse of open source code and the associated licensing issues can threaten both network security and copyright protection, and source code is often difficult to obtain during program analysis. Dynamic analysis tools that are stable and suitable for embedded devices are also in limited supply. As a result, researchers have started to investigate the detection of code reuse using BCSA techniques and have achieved significant progress. However, comprehensive literature covering recent advances in BCSA techniques inspired by technologies such as natural language processing (NLP) and graph neural networks (GNN) is lacking. A literature review by Haq et al. [3] summarizes the development of BCSA technology in the two decades prior to 2019 and systematically analyzes the technical details of BCSA methods. Kim et al. [4] analyzed 43 BCSA papers from 2014 to 2020, outlined the problems in current research, and offered solutions. Yu et al. [5] evaluated 34 works, focusing specifically on their performance in searching for vulnerabilities in embedded device firmware.
Research on software similarity analysis comprises both source code similarity analysis and binary code similarity analysis (BCSA). Source code similarity analysis is performed when the source code is readily available, in order to detect the reuse of vulnerable code segments or the unauthorized use of code, particularly in interpreted languages such as Java or Python. In most circumstances, however, the target programs are in binary format, and obtaining the source code is a formidable challenge. Consequently, BCSA plays a pivotal role in code similarity analysis research. The following sections provide an overview of BCSA from two perspectives: the transformation from source code to binary code, and the basic procedures involved in BCSA.

2. Compile Preprocessing

Binary code is the machine code that results from compiling source code and can be executed directly by the central processing unit. This code consists of a series of binary digits (0s and 1s) and is not easily readable by humans. To facilitate the analysis of binary code, reverse engineering techniques are employed to translate the machine code into assembly language, and tools such as debuggers are used to simplify manual examination. As shown in Figure 1, the typical process of transforming source code into binary code encompasses four stages: pre-compilation, compilation, assembly, and linking. The pre-compilation stage handles operations such as the expansion of header files, the substitution of macros, and the removal of comments. The compilation stage performs lexical, syntax, and semantic analysis on the code, optimizes it, and transforms it into assembly code. The assembly stage transforms assembly code into machine code. Finally, the linking stage combines the compiled object files to generate the final executable file.
Figure 1. Compilation Process from Source Code to Binary.
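The four stages can be made concrete with a small sketch. It assumes a GCC-style compiler driver, where `-E`, `-S`, and `-c` stop after pre-compilation, compilation, and assembly respectively; the helper name `stage_commands` and the file name `hello.c` are illustrative and not part of the original text. The sketch only builds the command lines and does not run them.

```python
from pathlib import Path


def stage_commands(source, compiler="gcc"):
    """Return one command line per stage of the source-to-binary pipeline."""
    stem = Path(source).stem
    return [
        # Pre-compilation: expand headers and macros, strip comments.
        [compiler, "-E", source, "-o", f"{stem}.i"],
        # Compilation: translate preprocessed source into assembly code.
        [compiler, "-S", f"{stem}.i", "-o", f"{stem}.s"],
        # Assembly: translate assembly code into an object file.
        [compiler, "-c", f"{stem}.s", "-o", f"{stem}.o"],
        # Linking: combine object files into the final executable.
        [compiler, f"{stem}.o", "-o", stem],
    ]


for command in stage_commands("hello.c"):
    print(" ".join(command))
```

Passing each list to `subprocess.run` on a machine with the compiler installed would produce the intermediate files `hello.i`, `hello.s`, and `hello.o` before the executable.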
The compilation process is responsible for accurately converting the source code into a binary format that the CPU can execute directly. However, the outcome of this process is not fixed: the choice of compiler, the optimization options, the target CPU architecture, and the operating system all affect the final machine code. Consequently, the same source code can yield different binary code through different compilation paths, which presents a challenge for BCSA.
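To give a sense of scale, the sketch below enumerates build settings for a single source file. The specific compilers, optimization levels, and architecture labels are assumptions chosen for illustration; the architecture entries are labels only, since real cross-compilation needs a separate toolchain per target.

```python
from itertools import product

# Illustrative values, not taken from the original text.
COMPILERS = ["gcc", "clang"]
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
ARCHITECTURES = ["x86_64", "arm", "mips"]


def build_variants(compilers, opt_levels, architectures):
    """Enumerate every (compiler, optimization level, architecture) setting."""
    return [
        {"compiler": c, "opt": o, "arch": a}
        for c, o, a in product(compilers, opt_levels, architectures)
    ]


variants = build_variants(COMPILERS, OPT_LEVELS, ARCHITECTURES)
print(len(variants))  # 2 compilers * 5 levels * 3 architectures = 30
```

Each of these settings can yield a different binary for the same function, and a BCSA method is expected to recognize all of them as similar.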

3. Basic Process of BCSA

The central objective of BCSA is to establish whether two binary functions share the same provenance by analyzing their similarity. This pairwise analysis is the foundation of the technique. In certain circumstances, the one-to-one comparison of binary functions can be expanded: in vulnerability search it becomes a one-to-many function comparison, and in code clone detection it becomes a many-to-many function comparison. This research describes the technical characteristics of BCSA by organizing the work into three stages, as depicted in Figure 2: the feature extraction stage, the feature representation stage, and the feature comparison stage.
Figure 2. The overall process of the method.
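The one-to-one, one-to-many, and many-to-many settings described above differ only in how a pairwise similarity function is applied. The sketch below uses Jaccard similarity over sets of instruction mnemonics as a stand-in measure; that choice, the function names, and the sample data are all assumptions made for illustration and are not a method proposed in the text.

```python
from itertools import combinations


def jaccard(a, b):
    """One-to-one comparison: shared features divided by all features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def one_to_many(target, database):
    """Vulnerability search: rank database functions by similarity to the target."""
    scores = [(name, jaccard(target, feats)) for name, feats in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def many_to_many(functions, threshold):
    """Clone detection: report every pair at or above the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(functions), 2)
        if jaccard(functions[x], functions[y]) >= threshold
    ]


# Hypothetical feature sets: mnemonics observed in each function.
database = {
    "vuln_memcpy_wrapper": {"mov", "cmp", "jle", "call", "ret"},
    "vuln_parse_header": {"mov", "add", "jmp", "ret"},
    "safe_checksum": {"xor", "shl", "add", "ret"},
}
target = {"mov", "cmp", "jle", "call", "ret", "push"}

ranking = one_to_many(target, database)
print(ranking[0][0])  # vuln_memcpy_wrapper
print(many_to_many(database, 0.3))
```

Real BCSA systems replace the stand-in measure with the output of the three stages described next, but the search and clone-detection loops keep the same shape.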
Phase 1: Feature Extraction. The primary task in this stage is to obtain the inherent features of the binary function using analysis tools such as IDA Pro [6], BAP [7], Angr [8], and Valgrind [9]. Inherent features are those obtained directly from the analysis tools without any additional processing, such as program control flow graphs and call graphs. The input to this stage is raw binary code, such as binary files, and the output is the raw features of each binary function. These features are then processed further in the feature representation stage before being compared in the feature comparison stage. As an illustration, the feature extraction stage in Gemini [10] extracts function control flow graphs (CFGs) and basic block information.
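To illustrate what a control flow graph and its basic blocks look like as data, here is a minimal sketch that builds them from an already-disassembled toy instruction list. In practice the tools listed above perform the disassembly and graph recovery; the instruction format `(address, mnemonic, jump_target)`, the small set of branch mnemonics, and the function name `build_cfg` are assumptions for this example.

```python
BRANCHES = {"jmp", "jle", "jne", "je"}


def build_cfg(instructions):
    """Split instructions into basic blocks and connect them with edges.

    instructions: list of (address, mnemonic, jump_target or None), sorted
    by address. Returns (blocks, edges), where blocks maps a block's first
    address to its mnemonics and edges is a set of (source, destination).
    """
    addrs = [addr for addr, _, _ in instructions]
    # A leader starts a block: the entry, branch targets, and the
    # instruction following any branch or return.
    leaders = {addrs[0]}
    for i, (_, mnem, target) in enumerate(instructions):
        if mnem in BRANCHES or mnem == "ret":
            if target is not None:
                leaders.add(target)
            if i + 1 < len(instructions):
                leaders.add(addrs[i + 1])

    blocks, edges, current = {}, set(), None
    for i, (addr, mnem, target) in enumerate(instructions):
        if addr in leaders:
            current = addr
            blocks[current] = []
        blocks[current].append(mnem)
        at_end = i + 1 == len(instructions)
        if not at_end and addrs[i + 1] not in leaders:
            continue  # still inside the current block
        if mnem in BRANCHES and target is not None:
            edges.add((current, target))  # taken branch
        if mnem not in ("jmp", "ret") and not at_end:
            edges.add((current, addrs[i + 1]))  # fall-through
    return blocks, edges


# A toy if/else function.
toy = [
    (0, "cmp", None), (1, "jle", 4),
    (2, "mov", None), (3, "jmp", 5),
    (4, "add", None),
    (5, "ret", None),
]
blocks, edges = build_cfg(toy)
print(blocks)
print(sorted(edges))
```

The toy function yields four basic blocks arranged in a diamond, which is the typical shape of an if/else statement.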
Phase 2: Feature Representation. The main objective of this stage is to process the inherent features obtained in the feature extraction stage according to the needs of the particular method. The input of this stage is the inherent features of the function produced by the feature extraction stage, and the output is data in a form that can be used directly for similarity calculation in the feature comparison stage. As an example, the feature representation stage in Gemini [10] encompasses two main tasks. First, the basic block information is transformed into numeric vectors that serve as node attributes in the control flow graph (CFG), producing an attributed control flow graph (ACFG). Second, the ACFG is converted into a vector by an end-to-end neural network, yielding a representation of the function that captures both its structural and semantic information. This vector representation is used to directly calculate the similarity of functions in the feature comparison stage.
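The two tasks can be sketched as follows. This is a deliberately simplified illustration and not Gemini's network: the three block attributes are hand-picked, and the graph embedding uses fixed, untrained neighbor aggregation where a real system would learn weight matrices end to end. The names `block_attributes` and `embed_graph` and the sample graph are hypothetical.

```python
import math

ARITHMETIC = {"add", "sub", "mul", "xor", "shl"}
TRANSFER = {"jmp", "jle", "jne", "je", "call", "ret"}


def block_attributes(mnemonics):
    """Task 1: turn a basic block into a numeric attribute vector."""
    return [
        float(len(mnemonics)),                           # instruction count
        float(sum(m in ARITHMETIC for m in mnemonics)),  # arithmetic instructions
        float(sum(m in TRANSFER for m in mnemonics)),    # control transfers
    ]


def embed_graph(attributes, edges, rounds=2):
    """Task 2: collapse an attributed graph into one vector.

    Each round sets a node's state to tanh(own attributes + sum of the
    neighbors' previous states); the graph vector is the sum of all states.
    """
    neighbors = {node: [] for node in attributes}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    dim = len(next(iter(attributes.values())))
    state = {node: [0.0] * dim for node in attributes}
    for _ in range(rounds):
        state = {
            node: [
                math.tanh(x[k] + sum(state[n][k] for n in neighbors[node]))
                for k in range(dim)
            ]
            for node, x in attributes.items()
        }
    return [sum(vec[k] for vec in state.values()) for k in range(dim)]


# Hypothetical CFG of a small if/else function.
blocks = {0: ["cmp", "jle"], 2: ["mov", "jmp"], 4: ["add"], 5: ["ret"]}
edges = {(0, 2), (0, 4), (2, 5), (4, 5)}

attributes = {addr: block_attributes(m) for addr, m in blocks.items()}
embedding = embed_graph(attributes, edges)
print(embedding)
```

Because the final step sums over nodes, the resulting vector does not depend on how the basic blocks are labeled, only on their attributes and connections.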
Phase 3: Feature Comparison. The primary task of this stage is to apply an appropriate method to calculate the similarity between pairs of function representations generated in the feature representation stage. The input of this stage is the function representations produced by the feature representation stage, and the output is the similarity score between the two functions. As an illustration, in the case of Gemini [10], the feature comparison stage uses cosine similarity to measure how close the two feature vectors representing the two functions are.
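As a minimal sketch of this last step, the cosine measure between two function vectors can be computed as below. The function name and the sample vectors are illustrative; treating a zero vector as having similarity 0.0 is a convention assumed here, not something the text specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical function embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

A score near 1 indicates that the two functions are likely to be similar, while a score near 0 indicates unrelated vectors; a threshold or a ranking over many candidates turns the score into a decision.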

References

  1. 2022 Open Source Security and Analysis Report. analyst-reports. Retrieved 2023-11-23
  2. CVE-2021-44228. CVE - Common Vulnerabilities and Exposures. Retrieved 2023-11-23
  3. Irfan Ul Haq; Juan Caballero; A Survey of Binary Code Similarity. ACM Comput. Surv. 2021, 54, 1-38.
  4. Dongkwan Kim; Eunsoo Kim; Sang Kil Cha; Sooel Son; Yongdae Kim; Revisiting Binary Code Similarity Analysis Using Interpretable Feature Engineering and Lessons Learned. IEEE Trans. Softw. Eng. 2022, 49, 1661-1682.
  5. Yu, Y.; Gan, S.; Qiu, J.; et al. Binary Code Similarity Analysis and Its Applications on Embedded Device Firmware Vulnerability Search. Journal of Software 2022, 33, 4137-4172.
  6. IDA Pro. hex-rays. Retrieved 2023-11-23
  7. David Brumley; Ivan Jager; Thanassis Avgerinos; Edward J. Schwartz. BAP: A Binary Analysis Platform; Springer Science and Business Media LLC: Dordrecht, GX, Netherlands, 2011; pp. 463-469.
  8. Fish Wang; Yan Shoshitaishvili. Angr - The Next Generation of Binary Analysis; Institute of Electrical and Electronics Engineers (IEEE): Piscataway, NJ, United States, 2017; pp. 8-9.
  9. Nicholas Nethercote; Julian Seward. Valgrind; Association for Computing Machinery (ACM): New York, NY, United States, 2007; pp. 89-100.
  10. Xiaojun Xu; Chang Liu; Qian Feng; Heng Yin; Le Song; Dawn Song. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection; Association for Computing Machinery (ACM): New York, NY, United States, 2017; pp. 363-376.