MMCUP: An Automatic Code Comment Updating Approach Integrating Multi-Modal Information

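Abstract (translated from Chinese): Good code comments are highly valuable for program maintenance. In practice, however, developers often neglect to update the corresponding comments after changing code, which leaves the updated code and its comments inconsistent and harms software maintainability. Existing comment updating approaches usually treat code as plain text and ignore its structural information. To address this, this paper proposes MMCUP (Multi-Modal Comment UPdating), a code comment updating approach that integrates multi-modal information. MMCUP uses three modalities of information, namely the old code comment, the code edit sequence, and the AST difference sequence, to train a Transformer-based model that updates comments. Experimental results show that, on metrics such as Accuracy and Recall@5, MMCUP improves on approaches such as CUP and HatCUP by at least 5.8% and 4.4%.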
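The abstract reports Accuracy and Recall@5 without defining them. The sketch below assumes the definitions commonly used for comment updating: Accuracy is the share of samples whose top-ranked generated comment exactly matches the reference, and Recall@k is the share whose reference appears among the top k candidates. The function names and the data layout are hypothetical and are not taken from the paper.

```python
def recall_at_k(candidates, references, k=5):
    """Fraction of samples whose reference comment is among the top-k candidates.

    candidates: one ranked list of generated comments per sample (best first).
    references: the ground-truth updated comment for each sample.
    Assumption: exact string match counts as a hit.
    """
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    if not references:
        raise ValueError("at least one sample is required")
    hits = sum(ref in cands[:k] for cands, ref in zip(candidates, references))
    return hits / len(references)


def accuracy(candidates, references):
    """Accuracy under this assumption is simply Recall@1."""
    return recall_at_k(candidates, references, k=1)
```

Under these assumed definitions Recall@5 can never be lower than Accuracy, since every top-1 hit is also a top-5 hit.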

English title: MMCUP: Updating Code Comments Based on Multi-Modal Information

English abstract: As the complexity of software continues to increase, program comprehension has become increasingly important in software development. Code comments are among the most important documents for program comprehension, and high-quality code comments are of great value for program maintenance. However, developers often neglect to update the corresponding comments after changing code. This can introduce inconsistent comments, which not only cause confusion in software development and maintenance but also harm the robustness of the system. To address this problem, some research has attempted to update the corresponding comments automatically when code changes occur. Although code contains abundant and explicit structural information, existing approaches often treat code as plain text and ignore its structure when updating comments, which can lead to many failed comment updates. To address this issue, this paper proposes a code comment updating approach called MMCUP (Multi-Modal Comment UPdating) that integrates multi-modal information. MMCUP uses three modalities of information: the old code comment sequence, the code edit sequence, and the AST difference sequence. First, data processing is performed to construct comment sequences from the original comment information. Then, code edit sequences and AST difference sequences are constructed from the code before and after the change. These sequences are combined with the old code comments to form the input sequences that are fed into the model. After that, the Transformer encoder encodes each token in the input sequence separately. During training, the multi-modal features are fused through a multi-head attention mechanism. Finally, the Transformer decoder decodes the encoded multi-modal features and produces the updated comment. Experimental results show that MMCUP improves Accuracy by 5.8% compared to HatCUP and by 4.4% compared to HebCUP. Its Recall@5 is 3.6% higher than that of HatCUP, which achieved the previous best performance. To determine whether the code edit sequence and the AST difference sequence used in MMCUP help improve comment updating, we conducted ablation experiments. The results show that both sequences improve comment update performance. In addition, we analyzed how MMCUP behaves in more complex code change scenarios. The results show that, on complex samples, MMCUP performs best among all the comment update approaches compared. This indicates that MMCUP can learn different comment update situations and cope with more complex scenarios. To further validate the effectiveness of our method, we conducted a manual evaluation comparing MMCUP with HatCUP. The manual evaluation also showed that the comments updated by MMCUP were more in line with developers' expectations. We also discuss the reasons for MMCUP's failure cases and analyze threats to validity. In future research, we will further optimize the structural features extracted from code. For example, we will use control flow and data flow when extracting code features to increase the information available to the model. Additionally, we plan to explore other modalities of information that could be integrated into our model to further improve its performance.
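The abstract does not say how the code edit sequence is built. As an illustration only, the sketch below aligns the token sequences of the old and new code with Python's standard `difflib.SequenceMatcher` and emits one (old token, new token, action) triple per aligned position. The `PAD` placeholder, the action labels, and the function name `code_edit_sequence` are hypothetical; the paper may use a different alignment and encoding.

```python
from difflib import SequenceMatcher

PAD = "<pad>"  # hypothetical placeholder for a missing token on one side


def code_edit_sequence(old_tokens, new_tokens):
    """Align two token lists and return (old_token, new_token, action) triples.

    Actions are "equal", "replace", "insert", or "delete". This is a sketch of
    one plausible encoding, not the construction used by MMCUP.
    """
    edits = []
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_span, new_span = old_tokens[i1:i2], new_tokens[j1:j2]
        # Pad the shorter side so every position has an explicit pair.
        for k in range(max(len(old_span), len(new_span))):
            old_tok = old_span[k] if k < len(old_span) else PAD
            new_tok = new_span[k] if k < len(new_span) else PAD
            if old_tok == PAD:
                action = "insert"
            elif new_tok == PAD:
                action = "delete"
            else:
                action = tag  # "equal" or "replace"
            edits.append((old_tok, new_tok, action))
    return edits


if __name__ == "__main__":
    old = ["int", "getSize", "(", ")"]
    new = ["long", "getLength", "(", ")"]
    for triple in code_edit_sequence(old, new):
        print(triple)
```

An AST difference sequence could be produced the same way by aligning node-type sequences from traversals of the old and new syntax trees, again as an assumption rather than the paper's stated procedure.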

English keywords:

code comment updating; program comprehension; code-comment co-evolution; deep learning; sequence-to-sequence model

Authors:

Liu Shifan (刘诗凡), Cui Zhanqi (崔展齐), Chen Xiang (陈翔), Li Li (李莉)

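Author affiliations:

School of Computer Science, Beijing Information Science and Technology University, Beijing 100101

School of Information Science and Technology, Nantong University, Nantong, Jiangsu 226019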

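Chinese keywords (translated):

code comment updating; program comprehension; code-comment co-evolution; deep learning; sequence-to-sequence model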

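Funding:

Jiangsu Provincial Frontier Leading Technology Basic Research Project; National Natural Science Foundation of China; Beijing Information Science and Technology University "Qin Xin Talents" Cultivation Program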

Project numbers:

BK202002001; 61702041; QXTCP C201906

Publication year:

2024

DOI:

10.11897/SP.J.1016.2024.00172

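Chinese Journal of Computers (计算机学报)

China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences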

CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 3.18

ISSN: 0254-4164

Year, volume (issue): 2024, 47(1)

Number of references: 2