Research on software classification based on the fusion of code and descriptive text

Third-party software systems play an important role in modern software development. Developers build software by retrieving suitable dependency libraries from third-party repositories according to their requirements, which avoids much duplicated work and speeds up development. However, retrieving third-party dependency libraries can be difficult. Repositories typically provide preset tags (categories) for developers to search by, but when a software system's preset tags are wrong, developers cannot find the libraries they need, which inevitably slows development. This study proposes a software classification model to address this challenge. The model combines method vectors, method importance, and text vectors to assign software of unknown category to known categories. Because no public dataset existed for this problem, we built one and made it publicly available; it contains 120 software systems in 30 categories from the Maven repository. Tested on this self-built dataset, the model achieved a prediction accuracy of 70% with one candidate (top-1) and 90% with three candidates (top-3). The experimental results show that the proposed model can effectively classify software systems in open-source repositories and help developers quickly locate third-party libraries.
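
The abstract names the model's three inputs (method vectors, method importance, text vectors) and its top-1 and top-3 metrics, but not how the inputs are fused or which classifier is used. The sketch below is therefore only an illustration under stated assumptions: the importance-weighted mean, the concatenation with the text vector, the cosine nearest-centroid ranking, and every function name and toy value are hypothetical and are not taken from the paper.

```python
import math


def fuse(method_vectors, importances, text_vector):
    """Assumed fusion: importance-weighted mean of method vectors,
    concatenated with the description-text vector."""
    total = sum(importances)
    dim = len(method_vectors[0])
    code_vector = [
        sum(w * v[i] for w, v in zip(importances, method_vectors)) / total
        for i in range(dim)
    ]
    return code_vector + list(text_vector)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_categories(vector, centroids, k):
    """Rank known categories by similarity to the fused vector (assumed classifier)."""
    ranked = sorted(centroids, key=lambda c: cosine(vector, centroids[c]), reverse=True)
    return ranked[:k]


def top_k_accuracy(ranked_predictions, labels, k):
    """Fraction of samples whose true label is among the first k candidates."""
    hits = sum(1 for ranked, label in zip(ranked_predictions, labels) if label in ranked[:k])
    return hits / len(labels)


# Toy data, invented for illustration only.
centroids = {
    "logging": [1.0, 0.0, 1.0, 0.0],
    "json": [0.0, 1.0, 0.0, 1.0],
    "testing": [1.0, 1.0, 0.0, 0.0],
}
fused = fuse(
    method_vectors=[[0.9, 0.1], [0.2, 0.8]],
    importances=[3.0, 1.0],
    text_vector=[1.0, 0.0],
)
print(top_k_categories(fused, centroids, k=3))
```

Under this reading, top-1 accuracy counts a prediction as correct only when the highest-ranked category matches the true label, while top-3 accuracy accepts the true label anywhere among the first three candidates, which is why the reported top-3 figure (90%) is higher than the top-1 figure (70%).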

software classification; third-party software system; method importance score; code2vec

Chen Yuhang, Wang Shizhou, Tang Zhengting, Chen Liangyu, Jiang Ningkang

Software Engineering Institute, East China Normal University, Shanghai 200062, China


2025

Journal of East China Normal University (Natural Science)
East China Normal University


Peking University (PKU) Core Journal
Impact factor: 0.55
ISSN: 1000-5641
Year, volume (issue): 2025, (1)