Automatic Naming of Source Code Based on Information Retrieval (基于信息检索的源代码自动命名)

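Abstract (translated from the Chinese): Automatic naming of source code means assigning a given method body a meaningful name that reflects its functionality; good names make code easier to read and understand and improve software development efficiency. Traditional automatic naming methods use only a single kind of information, such as the lexical or the syntactic information of the code, while deep learning-based methods usually ignore similar code in the corpus; both limitations reduce naming accuracy. To address these problems, an information retrieval-based method for automatic naming of source code is proposed. First, a pre-trained model and the BERT-whitening method are used to extract effective features of the input code and of the code in the corpus, and the Euclidean distance is used to compute the semantic similarity between them. Second, the corpus code with relatively high semantic similarity to the input code is selected to form a candidate library, and the Jaccard coefficient and the longest common subsequence are used to compute, respectively, the lexical and syntactic similarity between the input code and each candidate. Finally, a weighted sum is used to match the code snippet in the candidate library that is most similar to the input code, and that snippet's method name is reused as the method name of the input code. Experimental results show that, on the public Java-small dataset, the F1 score of the method is 6.93 and 1.22 percentage points higher than those of naming methods based on the vector space model (VSM) and on the deep learning model Code2Vec, respectively, indicating relatively good predictive performance.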
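The semantic-similarity step can be written out as follows. This is a sketch of the whitening transform as BERT-whitening is commonly formulated, not necessarily the exact variant used in the paper; the abstract names neither the pre-trained model nor any dimensionality reduction. The symbols are assumptions introduced here: $x_i$ is the row-vector embedding of the $i$-th code snippet, $N$ is the number of snippets, and $q$ and $c$ index the input code and a corpus code.

```latex
\begin{aligned}
\mu &= \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^{\top}(x_i-\mu) \\
\Sigma &= U \Lambda U^{\top}, \qquad W = U \Lambda^{-1/2} \\
\tilde{x}_i &= (x_i - \mu)\, W \\
d(q, c) &= \lVert \tilde{x}_q - \tilde{x}_c \rVert_2
\end{aligned}
```

After the transform the embeddings have zero mean and identity covariance, so no single direction dominates the Euclidean distance. A smaller $d(q, c)$ corresponds to a higher semantic similarity, and the corpus snippets with the smallest distances would form the candidate library.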

English title: Automatic Naming of Source Code Based on Information Retrieval

English abstract: Automatic naming of source code entails predicting a descriptive name that reflects the function of the code within a given method body. This practice can improve code readability and comprehension, thus enhancing software development efficiency. Traditional naming approaches use only a single kind of information, such as the lexical or syntactic information of the code, whereas deep learning-based naming approaches usually ignore similar examples in the corpus; both limitations reduce naming accuracy. To address these problems, this paper proposes an approach for automatic naming of source code based on information retrieval. The proposed approach uses a pre-trained model and the Bidirectional Encoder Representations from Transformers (BERT)-whitening method to extract effective features of the input code and of the code in the corpus, and calculates the semantic similarity between them on the basis of the Euclidean distance. Subsequently, the corpus code that ranks highest in semantic similarity to the input code is selected to form a candidate library. The lexical and syntactic similarities between the input code and the candidate library code are calculated using the Jaccard index and the Longest Common Subsequence (LCS) method, respectively. Finally, the lexical and syntactic similarities are fused to match the code fragment in the candidate library with the highest similarity to the input code. The method name of that code snippet is then reused as the method name of the input code. Experimental results show that the F1 value of the proposed approach on the public Java-small dataset is 6.93 and 1.22 percentage points higher than that of the Vector Space Model (VSM) and the Code2Vec model, respectively, indicating excellent predictive performance.
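The re-ranking stage described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say how code is tokenized, which sequences the LCS is computed over, how the LCS is normalized, or what weights are used. Here the lexical similarity is the Jaccard coefficient over token sets, the syntactic similarity is the LCS length over a sequence of syntax node labels divided by the longer sequence length, and `alpha`, `Candidate`, and `suggest_name` are hypothetical names introduced for illustration.

```python
from typing import List, Sequence, Tuple


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard coefficient between the token sets of two snippets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty snippets are treated as identical
    return len(sa & sb) / len(sa | sb)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer length (this normalization is an assumption)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


# Hypothetical candidate record: (method_name, lexical_tokens, syntax_labels).
Candidate = Tuple[str, List[str], List[str]]


def suggest_name(
    query_tokens: Sequence[str],
    query_syntax: Sequence[str],
    candidates: List[Candidate],
    alpha: float = 0.5,  # assumed weight; the abstract does not report the weights
) -> str:
    """Return the method name of the candidate with the highest weighted-sum score."""

    def score(candidate: Candidate) -> float:
        _, tokens, syntax = candidate
        lexical = jaccard(query_tokens, tokens)
        syntactic = lcs_similarity(query_syntax, syntax)
        return alpha * lexical + (1 - alpha) * syntactic

    return max(candidates, key=score)[0]


if __name__ == "__main__":
    library: List[Candidate] = [
        ("getName", ["return", "this", "name"], ["ReturnStmt", "FieldAccess"]),
        ("setName", ["this", "name", "=", "name"], ["ExprStmt", "Assign", "FieldAccess"]),
    ]
    print(suggest_name(["return", "this", "title"], ["ReturnStmt", "FieldAccess"], library))
```

In this sketch the candidate library is assumed to have been produced already by the semantic retrieval step, so only the lexical and syntactic scores are fused; a variant that also folds the semantic score into the weighted sum would only need one more term in `score`.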

English keywords:

automatic naming; information retrieval; deep learning; BERT-whitening method; semantic similarity

Authors:

李雪 (Li Xue), 王雅文 (Wang Yawen), 张前进 (Zhang Qianjin)

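Affiliation:

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876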

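Keywords (Chinese, with English glosses):

自动命名 (automatic naming); 信息检索 (information retrieval); 深度学习 (deep learning); BERT-whitening方法 (BERT-whitening method); 语义相似度 (semantic similarity)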

Funding:

National Natural Science Foundation of China

Grant number:

U1736110

Publication year:

2024

DOI:

10.19678/j.issn.1000-3428.0068041

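Journal: Computer Engineering (计算机工程)

Sponsors: East China Institute of Computing Technology (华东计算技术研究所); Shanghai Computer Society (上海市计算机学会)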

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.581

ISSN: 1000-3428

Year, volume (issue): 2024, 50(6)