Software Data Clustering Method Combining Code Snippets and Hybrid Topic Models
Clustering documents with topic models is common practice in text mining, and many studies apply topic models to data from software websites to analyze how communities in different fields develop. However, software-related data often contain code snippets, and text length is unevenly distributed, so a traditional single topic model tends to produce unstable clustering results on such text. This paper proposes a clustering method that combines code snippets with hybrid topic models. Using Stack Overflow as the data source, it constructs a dataset of Python third-party libraries from the top 60 questions on the platform. The analysis divides the dataset into six areas: network security, data analysis, artificial intelligence, text processing, software development, and system terminal. Experimental results show that, under both automatic and manual evaluation metrics, topic modeling on code snippets combined with text yields a good-quality partition of the data, and that combining multiple models improves the stability and accuracy of the clustering results to a certain extent.
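
The abstract does not specify how code snippets are merged with question text, so the following is only a minimal sketch of one plausible preprocessing step, not the paper's implementation. It assumes a post body arrives as HTML with snippets wrapped in `<code>` tags, and it reduces each snippet to the names of the modules it imports. The function names (`split_post`, `imported_modules`, `combined_tokens`) and the `lib_` token prefix are hypothetical choices made for illustration.

```python
import re
from html import unescape

# Assumption: snippets are wrapped in <code>...</code> inside the post body.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
# Matches the top-level module name in "import x" / "from x import y" lines.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)
# Word-like tokens of at least two characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w+")


def split_post(body_html):
    """Separate a post body into prose text and a list of code snippets."""
    snippets = [unescape(s) for s in CODE_RE.findall(body_html)]
    prose = CODE_RE.sub(" ", body_html)        # drop the snippets
    prose = unescape(TAG_RE.sub(" ", prose))   # drop remaining HTML tags
    return prose, snippets


def imported_modules(snippets):
    """Collect imported module names in order of first appearance."""
    modules = []
    for snippet in snippets:
        for name in IMPORT_RE.findall(snippet):
            if name not in modules:
                modules.append(name)
    return modules


def combined_tokens(body_html):
    """Build one token list from prose words plus code-derived library tokens."""
    prose, snippets = split_post(body_html)
    tokens = [t.lower() for t in TOKEN_RE.findall(prose)]
    # The "lib_" prefix (hypothetical) keeps code-derived tokens distinct
    # from ordinary words that happen to share a library's name.
    tokens += ["lib_" + m.lower() for m in imported_modules(snippets)]
    return tokens


if __name__ == "__main__":
    example = (
        "<p>How do I parse a page?</p>"
        "<pre><code>import requests\nfrom bs4 import BeautifulSoup</code></pre>"
    )
    print(combined_tokens(example))
```

The resulting token list can be passed to any bag-of-words topic model. Short questions then still contribute library-level signal from their code, which is one way the uneven text length mentioned above might be mitigated.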