Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Models
QIAN Lianghong 1, WANG Fude 2, SUN Xiaohai 3
Author Information
- 1. Data Science Department, Yishu Software Technology (Shanghai) Co., Ltd., Shanghai 200233, China
- 2. Technology Department, Jilin Haicheng Technology Co., Ltd., Changchun 130119, China; Institute of Smart Agriculture, Jilin Agricultural University, Changchun 130118, China
- 3. Technology Department, Jilin Haicheng Technology Co., Ltd., Changchun 130119, China
Abstract
To address the problem of source code plagiarism detection, and the limitations of existing methods that require large amounts of training data and are restricted to specific programming languages, we propose a source code plagiarism detection method based on pre-trained Transformer language models, combining word embeddings, similarity computation, and classification models. The proposed method supports multiple programming languages and achieves good detection performance without requiring any training samples labeled as plagiarism. Experimental results show that the proposed method achieves state-of-the-art detection performance on multiple public datasets. In addition, for scenarios where a small number of labeled plagiarism training samples can be obtained, we also propose a method that incorporates a supervised learning classification model to further improve detection performance. The method is widely applicable to source code plagiarism detection scenarios where training data is scarce, computational resources are limited, and the programming languages are diverse.
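The pipeline summarized in the abstract (embed each program with a pretrained encoder, compare the two embeddings by similarity, then decide by a threshold or classifier) can be sketched as follows. Note this is a minimal illustrative sketch only: the hashing "encoder" below is a deterministic toy stand-in for a pretrained Transformer model, and the 0.8 threshold is an assumed value, not a setting from the paper.

```python
import hashlib
import math

def embed(code: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a pretrained Transformer encoder: hash each
    # token into a bucket of a fixed-size vector, then mean-pool.
    # A real system would use a pretrained code model instead.
    tokens = code.split()
    vec = [0.0] * dim
    for tok in tokens:
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    n = max(len(tokens), 1)
    return [v / n for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def is_plagiarism(code_a: str, code_b: str, threshold: float = 0.8) -> bool:
    # Unsupervised decision rule: flag a pair as plagiarism when
    # embedding similarity exceeds a fixed threshold. In the
    # supervised variant, this rule would be replaced by a
    # classifier trained on labeled plagiarism pairs.
    return cosine(embed(code_a), embed(code_b)) >= threshold
```

The threshold-based rule requires no labeled training data, which matches the unsupervised setting described in the abstract; swapping the rule for a trained classifier corresponds to the supervised variant.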
Keywords
source code plagiarism detection / Transformer model / pre-trained model / machine learning / deep learning
Funding
Industrialization Cultivation Fund Project of the Department of Education of Jilin Province (JJKH20240274CY)
Publication Year
2024