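Binary Code Similarity Detection (BCSD) plays an important role in applications such as reverse engineering, vulnerability detection, malware detection, software plagiarism detection, and patch analysis. Most existing work focuses on control-flow embeddings of binary functions and on low-level code embeddings based on Natural Language Processing (NLP) techniques. At run time, however, a function carries not only control-flow information but also data-flow semantics, so abstracting its semantic features comprehensively is critical. To this end, this paper proposes BS-DD, a binary function similarity framework that combines control flow with data dependencies. It extracts semantic information by emulating the execution of binary code and applies a simplification algorithm to build a data dependency graph; a graph neural network then performs the similarity judgment. Seven widely used software packages from the open-source community were compiled in different combinations, and on that basis three task scenarios and a real-world vulnerability detection experiment were designed to compare BS-DD with the latest data-flow-based BCSD methods. The experimental results show that the model significantly improves recall and MRR (Mean Reciprocal Rank) scores. In real-world vulnerability detection, the model also consistently outperforms the other methods.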
Cross-architecture Binary Code Similarity Analysis Based on Data Dependencies
Binary Code Similarity Detection (BCSD) plays a pivotal role in applications such as reverse engineering, vulnerability detection, malware analysis, software plagiarism detection, and patch analysis. Most research has focused on control-flow embeddings of binary functions and on low-level code embedding techniques based on Natural Language Processing (NLP). However, during execution a function carries not only control-flow information but also data-flow semantic information. Consequently, a comprehensive abstraction of the semantic features of functions is crucial. In light of this, we introduce BS-DD, a framework for assessing binary function similarity that integrates both control flow and data dependency relationships. We extract semantic information by simulating the execution of binary code and employ a simplification algorithm to construct a data dependency graph. Finally, we leverage graph neural networks for similarity assessment. We compile seven widely used software packages from the open-source community in various combinations and design three distinct task scenarios, as well as real-world vulnerability detection experiments, to compare the performance of BS-DD with the latest data-flow-based BCSD methods. Experimental results demonstrate significant improvements in recall and Mean Reciprocal Rank (MRR) scores for the model. In real-world vulnerability detection scenarios, the model consistently outperforms the other methods.
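The two reported metrics, recall and MRR, can be illustrated with a minimal sketch. It assumes the common BCSD retrieval setup, in which each query function has exactly one true match in a ranked candidate pool; the abstract does not state the exact evaluation protocol, so the setup, the function names, and the toy data below are all hypothetical and are not the BS-DD evaluation code.

```python
from typing import Sequence


def reciprocal_rank(ranked: Sequence[str], target: str) -> float:
    """Return 1/rank of `target` in `ranked` (1-based), or 0.0 if it is absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == target:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         targets: Sequence[str]) -> float:
    """Average the reciprocal rank of the true match over all queries."""
    total = sum(reciprocal_rank(r, t) for r, t in zip(rankings, targets))
    return total / len(targets)


def recall_at_k(rankings: Sequence[Sequence[str]],
                targets: Sequence[str], k: int) -> float:
    """Fraction of queries whose true match appears in the top-k candidates."""
    hits = sum(1 for r, t in zip(rankings, targets) if t in r[:k])
    return hits / len(targets)


# Hypothetical toy data: three queries, each with a ranked candidate list
# (most similar first) and the name of its single true match.
rankings = [
    ["f_a", "f_b", "f_c"],  # true match ranked 1st
    ["f_b", "f_a", "f_c"],  # true match ranked 2nd
    ["f_a", "f_b", "f_c"],  # true match ranked 3rd
]
targets = ["f_a", "f_a", "f_c"]

mrr = mean_reciprocal_rank(rankings, targets)  # (1 + 1/2 + 1/3) / 3
r1 = recall_at_k(rankings, targets, k=1)       # 1 of 3 queries hit at rank 1
print(f"MRR={mrr:.4f}  Recall@1={r1:.4f}")
```

Under this setup, MRR rewards placing the true match near the top of the ranking, while Recall@K only checks whether it falls within the first K candidates, which is why the two scores are usually reported together.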