Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Model
To address the problem of source code plagiarism detection and the limitations of existing methods, which require large amounts of training data and are restricted to specific languages, we propose a source code plagiarism detection method based on pre-trained Transformer language models, combined with word embeddings, similarity measures, and classification models. The proposed method supports multiple programming languages and requires no training samples labeled as plagiarism to achieve good detection performance. Experimental results show that the proposed method achieves state-of-the-art detection performance on multiple public datasets. In addition, for scenarios where only a few labeled plagiarism training samples are available, this paper also proposes a method that incorporates supervised classification models to further improve detection performance. The method can be widely applied in source code plagiarism detection scenarios where training data is scarce, computational resources are limited, and the programming languages involved are diverse.
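The detection pipeline sketched in the abstract (embed two code fragments, compare them with a similarity measure, and flag pairs above a threshold, with no labeled plagiarism samples needed) can be illustrated with a minimal, self-contained sketch. Note the assumptions: in the actual method the embeddings come from a pre-trained Transformer language model, whereas here a simple token-frequency vector stands in so the example runs without any model download; the function names and the threshold value are illustrative only.

```python
import math
import re
from collections import Counter

def tokenize(code: str) -> list[str]:
    # Language-agnostic lexical tokenization: identifiers, numbers,
    # and individual operator/punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

def embed(code: str) -> Counter:
    # Placeholder embedding: a bag-of-tokens frequency vector.
    # A real system would instead mean-pool contextual token
    # embeddings produced by a pre-trained Transformer.
    return Counter(tokenize(code))

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_plagiarism(code1: str, code2: str, threshold: float = 0.4) -> bool:
    # Unsupervised decision rule: similarity above a fixed threshold.
    # The threshold here is arbitrary; a deployed system would tune it.
    return cosine_similarity(embed(code1), embed(code2)) >= threshold

original = "def add(a, b):\n    return a + b"
renamed = "def add(x, y):\n    return x + y"
unrelated = "print('hello world')"

# A renamed copy scores higher than an unrelated fragment.
print(cosine_similarity(embed(original), embed(renamed)))    # 0.5
print(cosine_similarity(embed(original), embed(unrelated)))  # ~0.167
print(is_plagiarism(original, renamed))    # True
print(is_plagiarism(original, unrelated))  # False
```

With Transformer embeddings in place of the bag-of-tokens vector, semantically equivalent but superficially rewritten code (renamed identifiers, reordered statements) would also score high, which is what makes the language-model-based approach effective without labeled training data.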