Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Models
QIAN Lianghong 1, WANG Fude 2, SUN Xiaohai 3
Author Information
- 1. Data Science Department, Yishu Software Technology (Shanghai) Co., Ltd., Shanghai 200233, China
- 2. Technology Department, Jilin Haicheng Technology Co., Ltd., Changchun 130119, China; Institute of Smart Agriculture, Jilin Agricultural University, Changchun 130118, China
- 3. Technology Department, Jilin Haicheng Technology Co., Ltd., Changchun 130119, China
Abstract
To address the problem of source code plagiarism detection, and the limitations of existing methods that require large amounts of training data and are restricted to specific programming languages, we propose a source code plagiarism detection method based on pre-trained Transformer language models, combining word embeddings, similarity computation, and classification models. The proposed method supports multiple programming languages and achieves good detection performance without requiring any training samples labeled as plagiarism. Experimental results show that the proposed method achieves state-of-the-art detection performance on multiple public datasets. In addition, for scenarios where a small number of labeled plagiarism training samples can be obtained, we also propose a method that incorporates a supervised learning classification model to further improve detection performance. The method is widely applicable to source code plagiarism detection scenarios where training data is scarce, computational resources are limited, and the programming languages are diverse.
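The pipeline summarized in the abstract (embed each program with a pretrained encoder, compare the two embeddings by similarity, then decide by a threshold or classifier) can be sketched as follows. Note this is a minimal illustrative sketch only: the hashing "encoder" below is a deterministic toy stand-in for a pretrained Transformer model, and the 0.8 threshold is an assumed value, not a setting from the paper.

```python
import hashlib
import math

def embed(code: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a pretrained Transformer encoder: hash each
    # token into a bucket of a fixed-size vector, then mean-pool.
    # A real system would use a pretrained code model instead.
    tokens = code.split()
    vec = [0.0] * dim
    for tok in tokens:
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    n = max(len(tokens), 1)
    return [v / n for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def is_plagiarism(code_a: str, code_b: str, threshold: float = 0.8) -> bool:
    # Unsupervised decision rule: flag a pair as plagiarism when
    # embedding similarity exceeds a fixed threshold. In the
    # supervised variant, this rule would be replaced by a
    # classifier trained on labeled plagiarism pairs.
    return cosine(embed(code_a), embed(code_b)) >= threshold
```

The threshold-based rule requires no labeled training data, which matches the unsupervised setting described in the abstract; swapping the rule for a trained classifier corresponds to the supervised variant.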
Keywords
source code plagiarism detection / Transformer model / pre-trained model / machine learning / deep learning
Funding
Industrialization Cultivation Fund Project of the Department of Education of Jilin Province (JJKH20240274CY)
Publication Year
2024