Research and Progress on Learning-Based Source Code Vulnerability Detection
Automatic detection of source code vulnerabilities is the precondition and foundation of source code vulnerability repair, and it is of great significance for ensuring software security. Traditional approaches usually detect vulnerabilities using rules predefined by security experts. However, such rules are difficult to define manually, and the types of vulnerabilities that can be detected are limited to those the rules cover. In recent years, the rapid development of artificial intelligence technology has provided opportunities to realize learning-based automatic source code vulnerability detection. Learning-based vulnerability detection methods are data-driven methods that use machine learning or deep learning techniques to detect vulnerabilities. Among them, deep learning-based methods have shown great potential and have become a research hotspot in recent years, because they automatically extract vulnerability-related syntactic and semantic features from source code and thus avoid manual feature engineering. This paper reviews and summarizes existing learning-based source code vulnerability detection techniques and provides a systematic analysis and overview of their research and progress, focusing on five aspects: vulnerability data mining and dataset construction; program representation methods for vulnerability detection tasks; traditional machine learning and deep learning-based source code vulnerability detection approaches; interpretable methods for source code vulnerability detection; and fine-grained methods for source code vulnerability detection. Specifically, in the first part, we survey existing publicly available vulnerability datasets, including their sources and sizes, and describe the challenges faced in building vulnerability datasets, as well as how to address these challenges. In the second part, we briefly introduce intermediate code representations and
divide existing code representations applied in the field of vulnerability detection into four categories: metric-based, sequence-based, syntax tree-based, and graph-based code representations. For each type of code representation, we list some representative methods and analyze their advantages and disadvantages. In the third part, we introduce commonly used vulnerability detection tools and review coarse-grained vulnerability detection methods, including rule-based, machine learning-based, and deep learning-based methods, and then analyze and discuss the characteristics, strengths, and weaknesses of each type. In the fourth part, we introduce interpretable methods that can further explain vulnerability detection results, describe model self-interpretation methods, model approximation methods, and sample feedback methods in turn, summarize their characteristics, and discuss their strengths and weaknesses. In the fifth part, we first elucidate the problems and challenges posed by fine-grained vulnerability detection, and then describe in detail existing representative methods for fine-grained vulnerability detection and how they alleviate these challenges. Finally, we propose a source code vulnerability detection framework that combines hierarchical semantic awareness, multi-granularity vulnerability classification, and assisted vulnerability understanding, and analyze its feasibility. We also discuss future research directions for learning-based source code vulnerability detection techniques, such as the construction of large-scale, high-quality vulnerability datasets, techniques for detecting vulnerabilities with small or imbalanced samples, accurate and efficient vulnerability detection models, and early detection techniques for vulnerabilities.