Journal of Software (软件学报), 2025, Vol. 36, Issue 1: 1-26. DOI: 10.13328/j.cnki.jos.007143

Research Progress of Pre-trained Model in Software Engineering

Gong Lina (宫丽娜) 1, Zhou Yiren (周易人) 1, Qiao Yu (乔羽) 1, Jiang Shujuan (姜淑娟) 2, Wei Mingqiang (魏明强) 3, Huang Zhiqiu (黄志球) 1

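Author information

  • 1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China; Key Laboratory of Safety-Critical Software Development and Verification (Nanjing University of Aeronautics and Astronautics), Ministry of Industry and Information Technology, Nanjing 211106, Jiangsu, China
  • 2. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China
  • 3. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China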

Abstract

In recent years, deep learning has achieved excellent performance on software engineering (SE) tasks. That performance, however, depends on large-scale training sets, and collecting and labeling such sets is costly and resource-intensive, which limits the wide application of deep learning in practice. Since the release of pre-trained models (PTMs) in the deep learning field, SE researchers in China and abroad have paid close attention to PTMs and introduced them into SE tasks, where they have brought a qualitative leap and ushered intelligent software engineering into a new era. However, no existing study has distilled the successes, failures, and opportunities of pre-trained models in SE. To clarify the work in this cross-disciplinary field (pre-trained models for software engineering, PTM4SE), this study systematically reviews current PTM4SE research. It first presents a framework for PTM-based intelligent software engineering methods and then analyzes the pre-trained model techniques commonly used in SE. It next describes in detail the downstream SE tasks that use pre-trained models and compares and analyzes the performance of pre-trained model techniques on these tasks. The study then presents the SE datasets commonly used for training and fine-tuning PTMs. Finally, it discusses the challenges and opportunities facing PTM4SE. The collated SE PTMs and commonly used datasets are published at https://github.com/OpenSELab/PTM4SE.

Key words

software repository mining / pre-trained model (PTM) / programming language model


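Publication year: 2025
Journal: Journal of Software (软件学报)
Institute of Software, Chinese Academy of Sciences; China Computer Federation
Indexed in: CSCD; Peking University Core Journals
Impact factor: 2.833
ISSN: 1000-9825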