A Method for Detecting Meaningless Variable Names Based on Lexical Features and Data Mining

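Abstract: Identifiers are an important part of source code and one of the key elements people rely on to understand its semantics. Variable names are among the most common identifiers, and their quality matters greatly for code readability and comprehensibility. However, for various reasons, programmers often use meaningless variable names such as "a" and "var". Such names severely reduce code comprehensibility and need to be detected and refactored (renamed). To this end, an automated approach based on lexical features and data mining is proposed to detect meaningless variable names in code. First, an empirical analysis of meaningless variable names in open-source code shows that they are usually short and contain no meaningful words, so lexical features can be used to select suspicious variable names that are short and contain no meaningful words. If a suspicious variable name contains abbreviations, an abbreviation expansion algorithm is applied to obtain the full variable name. Next, a data mining algorithm determines whether the suspicious name is a conventional, commonly used variable name. Some common names, such as "i" and "e", carry no explicit literal meaning, but programmers understand them through established convention; they are therefore not considered meaningless and do not need refactoring. If a suspicious name is not a conventional common name, it is judged to be meaningless and the programmer is prompted to rename it. Experiments on open-source datasets show that the approach achieves high accuracy, with an average precision of 85% and an average recall of 91.5%.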
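The lexical screening step described in the abstract can be illustrated with a minimal sketch. The length threshold, the tiny vocabulary, and the helper names `split_identifier` and `is_suspicious` are assumptions made for illustration only; the abstract does not specify the paper's actual thresholds or dictionary.

```python
import re

# Hypothetical toy vocabulary; a real system would load a full dictionary.
VOCABULARY = {"count", "index", "name", "user", "total", "file", "path", "value"}

# Assumed cut-off for "short" names; not taken from the paper.
MAX_SUSPICIOUS_LENGTH = 3


def split_identifier(name: str) -> list[str]:
    """Split camelCase and snake_case identifiers into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def is_suspicious(name: str) -> bool:
    """Flag names that are short and contain no vocabulary word."""
    tokens = split_identifier(name)
    has_word = any(t in VOCABULARY for t in tokens)
    return len(name) <= MAX_SUSPICIOUS_LENGTH and not has_word
```

Under these assumptions, `a`, `var`, and `i` are all flagged as suspicious, while `userCount` is not. A name such as `i` is only a candidate at this stage; whether it is an accepted convention is decided by the later data mining step.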

English title: Nonsense Variable Names Detection Method Based on Lexical Features and Data Mining

English abstract: Identifiers are an important part of code, and they are also one of the key elements for people to understand the semantics of code. Variables are widely used to represent objects in programs. Names of such variables can serve as a major clue to the responsibility of the variables if they are carefully and properly chosen. However, unqualified variable names (e.g., "a", "var") are frequently constructed by developers. Such nonsense variable names have a severe negative impact on the readability and maintainability of software applications. Automated identification of bad smells is therefore one of the hot topics in the field of software refactoring. To identify such nonsense names automatically, we conduct an empirical study to figure out the key features that could be exploited to distinguish nonsense names from well-constructed meaningful ones. Results of the study suggest that nonsense variable names are often short and rarely contain meaningful words. To this end, in this paper, we propose a heuristics and data mining-based approach to identifying nonsense variable names. It first retrieves suspicious variable names based on lexical analysis. On the resulting suspicious names, it conducts abbreviation expansion-based filtering to exclude variable names that are carefully constructed to represent the abbreviations of meaningful words. Finally, it conducts data mining-based filtering to further exclude well-known symbols (e.g., "i", "e"). Experimental results on open-source datasets show that the proposed method has high accuracy. Its average precision and recall are 85% and 91.5%, respectively.
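The data mining-based filtering of well-known symbols can be sketched as a simple cross-project frequency count. This is an illustrative stand-in, not the paper's algorithm: the abstract does not say which mining technique is used, and the `min_support` threshold, the toy corpus, and the function names `mine_conventional_names` and `flag_nonsense` are all assumptions.

```python
from collections import Counter


def mine_conventional_names(projects, min_support=0.5):
    """Return names that occur in at least `min_support` of the projects."""
    project_counts = Counter()
    for names in projects:
        project_counts.update(set(names))  # count each name once per project
    threshold = min_support * len(projects)
    return {name for name, count in project_counts.items() if count >= threshold}


def flag_nonsense(suspicious_names, conventional):
    """Keep only the suspicious names that are not accepted conventions."""
    return [name for name in suspicious_names if name not in conventional]


# Hypothetical corpus: short variable names collected from four projects.
corpus = [
    ["i", "e", "a", "idx"],
    ["i", "e", "tmp"],
    ["i", "j", "var"],
    ["i", "e", "j"],
]
conventional = mine_conventional_names(corpus, min_support=0.5)
flagged = flag_nonsense(["i", "e", "a", "var"], conventional)
```

With this toy corpus, `i`, `e`, and `j` each appear in at least half of the projects and are treated as conventional, so only `a` and `var` remain flagged for renaming.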

English keywords:

Software refactoring; Code quality; Data mining; Nonsense variable names; Lexical features

Authors:

姜艳杰 (Jiang Yanjie), 东春浩 (Dong Chunhao), 刘辉 (Liu Hui)

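Affiliations:

School of Computer Science, Peking University, Beijing 100871

School of Computer Science, Beijing Institute of Technology, Beijing 100081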

Keywords:

software refactoring; code quality; data mining; meaningless variable names; lexical features

Funding:

Key Program of the National Natural Science Foundation of China

Project number:

62232003

Year of publication:

2024

DOI:

10.11896/jsjkx.231100030

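Computer Science (计算机科学)

Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)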

CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.944

ISSN: 1002-137X

Year, volume (issue): 2024, 51(6)

Number of references: 53