
Research and Progress on Learning-Based Source Code Vulnerability Detection

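Automatic detection of source code vulnerabilities is the prerequisite and foundation for repairing them, and it is of great significance for ensuring software security. Traditional methods usually detect vulnerabilities using rules manually crafted by security experts, but such rules are difficult to craft, and the types of vulnerabilities that can be detected depend on the rules the experts have predefined. In recent years, the rapid development of artificial intelligence has created opportunities for learning-based automatic source code vulnerability detection. Learning-based vulnerability detection methods are those that use machine learning or deep learning techniques to detect vulnerabilities. Among them, deep learning-based methods can automatically extract vulnerability-related syntactic and semantic features from code and thereby avoid feature engineering; they have shown great potential in vulnerability detection and have become a research hotspot in recent years. This paper reviews and summarizes existing learning-based source code vulnerability detection techniques and provides a systematic analysis and survey of their research and progress, focusing on five areas: vulnerability data mining and dataset construction, program representation methods for vulnerability detection tasks, machine learning and deep learning-based source code vulnerability detection methods, interpretable methods for source code vulnerability detection, and fine-grained source code vulnerability detection methods. On this basis, it presents a reference framework for vulnerability detection that combines hierarchical semantic awareness, multi-granularity vulnerability classification, and assisted vulnerability understanding. Finally, it outlines future research directions for learning-based source code vulnerability detection.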
Research and Progress on Learning-Based Source Code Vulnerability Detection
Automatic detection of source code vulnerabilities is the precondition and foundation of source code vulnerability repair, and it is of great significance for ensuring software security. Traditional approaches usually detect vulnerabilities based on rules predefined by security experts. However, defining detection rules manually is difficult, and the types of vulnerabilities that can be detected depend on those predefined rules. In recent years, the rapid development of artificial intelligence technology has provided opportunities to realize learning-based automatic source code vulnerability detection. Learning-based vulnerability detection methods are data-driven methods that use machine learning or deep learning techniques to detect vulnerabilities. Among them, deep learning-based methods can automatically extract vulnerability-related syntactic and semantic features from source code and thereby avoid feature engineering; they have shown great potential in vulnerability detection and have become a research hotspot in recent years. This paper reviews and summarizes existing learning-based source code vulnerability detection techniques and provides a systematic analysis and overview of their research and progress, focusing on five aspects of the research work: vulnerability data mining and dataset construction, program representation methods for vulnerability detection tasks, traditional machine learning and deep learning-based source code vulnerability detection approaches, interpretable methods for source code vulnerability detection, and fine-grained methods for source code vulnerability detection. Specifically, in the first part, we catalog existing publicly available vulnerability datasets, including their sources and sizes, and describe the challenges faced in building vulnerability datasets, as well as how to address these challenges. In the second part, we briefly introduce intermediate code representations and divide existing code representations applied in the field of vulnerability detection into four categories: metric-based, sequence-based, syntax tree-based, and graph-based code representations. For each type of code representation, we list some representative methods and analyze their advantages and disadvantages. In the third part, we introduce commonly used vulnerability detection tools and review coarse-grained vulnerability detection methods, including rule-based, machine learning-based, and deep learning-based methods, and then analyze and discuss the characteristics, strengths, and weaknesses of each type. In the fourth part, we introduce interpretable methods that can further explain vulnerability detection results, briefly describe model self-interpretation methods, model approximation methods, and sample feedback methods one by one, summarize their characteristics, and discuss their strengths and weaknesses. In the fifth part, we first elucidate the problems and challenges posed by fine-grained vulnerability detection, and then provide a detailed description of existing representative methods for fine-grained vulnerability detection and their approaches to alleviating these challenges. Finally, we propose a source code vulnerability detection framework that combines hierarchical semantic awareness, multi-granularity vulnerability classification, and assisted vulnerability understanding, and analyze its feasibility. We also outline future research directions for learning-based source code vulnerability detection techniques, such as the construction of large-scale, high-quality vulnerability datasets, techniques for detecting vulnerabilities with small or imbalanced samples, accurate and efficient vulnerability detection models, and early detection techniques for vulnerabilities.

Keywords: software security; source code vulnerability detection; vulnerability data mining; vulnerability feature extraction; code representation learning; deep learning; model interpretability; vulnerability detection

苏小红、郑伟宁、蒋远、魏宏巍、万佳元、魏子越


Faculty of Computing, Harbin Institute of Technology, Harbin 150001


Funding: National Natural Science Foundation of China (62272132)

2024

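Chinese Journal of Computers (计算机学报)
China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences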


Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 3.18
ISSN:0254-4164
Year, volume (issue): 2024, 47(2)