A Software Data Clustering Method Combining Code Snippets and Hybrid Topic Models

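Chinese abstract (translated): Clustering documents with topic models is common practice in text mining. Many studies apply topic-model clustering to data from software Q&A websites to analyze how different fields develop within the community. However, such software-related data often contains code snippets, and its text lengths are unevenly distributed, so modeling it with a single traditional topic model tends to produce unstable clustering results. This paper proposes a clustering method that combines code snippets with hybrid topic models. Using Stack Overflow as the data source, it constructs a dataset of the 60 Python third-party libraries with the most questions asked on the platform. After modeling, the dataset is divided into six fields: network security, data analysis, artificial intelligence, text processing, software development, and system terminal. Experimental results show that, on both automatic and manual evaluation metrics, topic modeling that combines code snippets with text yields good-quality cluster partitions, and that combining multiple models improves the stability and accuracy of the clustering results to some extent.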

English title: Software Data Clustering Method Combining Code Snippets and Hybrid Topic Models

English abstract: Using topic models to cluster documents is a common practice in many text mining tasks. Many studies use topic models to cluster data from software Q&A websites in order to analyze how different fields develop within the community. However, because this software-related data often contains code snippets and its text lengths are unevenly distributed, handling it with a single traditional topic model easily produces unstable clustering results. This paper proposes a clustering method combining code snippets and hybrid topic models, and uses Stack Overflow as the data source to construct a dataset of the top 60 Python third-party libraries by number of questions asked on the platform. After modeling, the dataset is divided into the following six areas: network security, data analysis, artificial intelligence, text processing, software development, and system terminal. Experimental results show that, in terms of both automatic and manual evaluation indicators, topic modeling that combines code snippets with text performs well on the quality of the resulting cluster partition, while combining multiple models improves the stability and accuracy of the clustering results to a certain extent.
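
The abstract does not say which topic models are combined or how code snippets are merged with the question text, so the following is only a rough sketch under stated assumptions, not the authors' method. It assumes that (1) each post is turned into one bag of words in which code identifiers are kept apart from prose words by a hypothetical `code:` prefix, and (2) cluster labels from several models are merged by a simple co-association vote, a generic consensus-clustering technique swapped in here for illustration. All function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def build_document(text, code):
    """Merge prose words and code identifiers into one bag of words."""
    words = re.findall(r"[a-z]+", text.lower())
    # Assumption: prefix code identifiers so they stay distinct from prose.
    idents = ["code:" + tok.lower() for tok in IDENTIFIER.findall(code)]
    return Counter(words + idents)

def co_association(labelings):
    """For each document pair, the fraction of models that co-cluster it."""
    n = len(labelings[0])
    return {
        (i, j): sum(labels[i] == labels[j] for labels in labelings) / len(labelings)
        for i, j in combinations(range(n), 2)
    }

def consensus_clusters(labelings, threshold=0.5):
    """Link documents whose agreement exceeds the threshold (union-find)."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (i, j), score in co_association(labelings).items():
        if score > threshold:
            parent[find(i)] = find(j)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

In this sketch, each inner list passed to `consensus_clusters` would hold the cluster labels that one model assigned to the same sequence of documents; label values need not match across models, since only pairwise agreement is counted.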

English keywords:

Code snippets; Topic model; Stack Overflow; Python; Cluster

Authors:

Wei Linlin (魏林林), Shen Guohua (沈国华), Huang Zhiqiu (黄志球), Cai Mengnan (蔡梦男), Guo Feifei (郭菲菲)

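Author affiliations:

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106

Key Laboratory of Safety-Critical Software Development and Verification, Ministry of Industry and Information Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106

Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093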

Keywords:

code snippets; topic model; Stack Overflow; Python; clustering

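Funding:

National Natural Science Foundation of China; National Natural Science Foundation of China; Open Fund of the Key Laboratory of Civil Aviation Emergency Science and Technology

Project numbers:

61772270; U2241216; NJ2022022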

Publication year:

2024

DOI:

10.11896/jsjkx.230300091

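Journal: Computer Science (计算机科学)

Publisher: Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)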

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.944

ISSN: 1002-137X

Year, volume (issue): 2024, 51(6)

Number of references: 25