Computer Engineering, 2024, Vol. 50, Issue 6: 304-310. DOI: 10.19678/j.issn.1000-3428.0068041

Automatic Naming of Source Code Based on Information Retrieval

Li Xue (李雪)¹, Wang Yawen (王雅文)¹, Zhang Qianjin (张前进)¹

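Author information

  • 1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China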

Abstract

Automatic naming of source code means predicting, for a given method body, a meaningful name that reflects what the code does. Good names make code easier to read and understand and improve software development efficiency. Traditional automatic naming approaches use only a single kind of information, such as the lexical or the syntactic information of the code, whereas deep learning-based approaches usually ignore similar code in the corpus; both limitations reduce naming accuracy. To address these problems, this paper proposes an automatic source code naming approach based on information retrieval. First, a pre-trained model and the BERT-whitening method (BERT: Bidirectional Encoder Representations from Transformers) are used to extract effective features of the input code and of the code in the corpus, and the semantic similarity between the two is computed using the Euclidean distance. Second, the corpus code with relatively high semantic similarity to the input code is selected to form a candidate library, and the lexical and syntactic similarities between the input code and each candidate are computed using the Jaccard coefficient and the Longest Common Subsequence (LCS), respectively. Finally, a weighted sum of the similarities is used to find the code fragment in the candidate library that is most similar to the input code, and the method name of that fragment is reused as the method name of the input code. Experimental results on the public Java-small dataset show that, compared with automatic naming approaches based on the Vector Space Model (VSM) and on the deep learning model Code2Vec, the F1 score of the proposed approach is higher by 6.93 and 1.22 percentage points, respectively, indicating comparatively good predictive performance.
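The retrieve-then-rerank idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding stage (pre-trained model plus BERT-whitening) is skipped and the vectors are assumed to be precomputed, the syntactic sequence is assumed to be a list of syntax-node labels, and the function names, the `Entry` layout, the candidate count `k`, the LCS normalization, and the equal weights are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical corpus entry layout (an assumption, not from the paper):
# (method name, precomputed embedding, lexical tokens, syntax-node sequence)
Entry = Tuple[str, Sequence[float], Sequence[str], Sequence[str]]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard coefficient of the two token sets (lexical similarity)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence, b: Sequence) -> float:
    """LCS length divided by the longer sequence length (assumed normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def suggest_name(query: Entry, corpus: List[Entry], k: int = 2,
                 w_lex: float = 0.5, w_syn: float = 0.5) -> str:
    """Pick k nearest entries by embedding distance, rerank them, reuse the best name."""
    _, q_vec, q_tokens, q_nodes = query
    # Stage 1: candidate library = the k entries closest in embedding space.
    candidates = sorted(corpus, key=lambda e: euclidean(q_vec, e[1]))[:k]
    # Stage 2: weighted sum of lexical and syntactic similarity (weights are assumed).
    best = max(
        candidates,
        key=lambda e: w_lex * jaccard(q_tokens, e[2]) + w_syn * lcs_similarity(q_nodes, e[3]),
    )
    return best[0]


# Toy data, invented purely for illustration.
corpus = [
    ("getName", [0.1, 0.9], ["return", "this", "name"], ["Return", "FieldAccess"]),
    ("setName", [0.2, 0.8], ["this", "name", "=", "name"], ["Assign", "FieldAccess", "Name"]),
    ("computeSum", [0.9, 0.1], ["for", "sum", "+=", "x", "return", "sum"], ["For", "Assign", "Return"]),
]
query = ("", [0.15, 0.85], ["return", "this", "title"], ["Return", "FieldAccess"])

print(suggest_name(query, corpus))  # prints "getName" for this toy input
```

In this toy run the embedding distance keeps `getName` and `setName` as candidates, and the reranking step prefers `getName` because it shares more tokens and an identical node sequence with the query. A real system would tune `k` and the weights on held-out data.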

Key words

automatic naming/information retrieval/deep learning/BERT-whitening method/semantic similarity

Funding

National Natural Science Foundation of China (U1736110)

Publication year

2024
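Computer Engineering
East China Institute of Computing Technology (华东计算技术研究所); Shanghai Computer Society (上海市计算机学会)

Indexed in: CSTPCD, Peking University Core Journals (北大核心)
Impact factor: 0.581
ISSN: 1000-3428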