K-means Text Clustering Algorithm Based on the Center Point of Subject Word Vector

K-means has long been one of the most popular clustering algorithms because of its low time complexity and fast running speed. However, the algorithm requires the number of clusters and the initial cluster centers to be specified in advance, and how well they are chosen directly affects the final clustering result. This paper studies the selection of both the initial and the iterative cluster centers: the initial cluster centers are selected according to a decision graph, and the topic (subject) word vector of each cluster replaces the mean as the iterative cluster center. Experiments show that the proposed method selects initial points accurately, that using topic word vectors as iterative cluster centers largely avoids the influence of noise points and noise features, and that the approach substantially improves the performance of K-means.
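The two ideas in the abstract can be illustrated with a minimal sketch. The abstract does not define either step, so everything below is an assumption rather than the authors' method: the decision graph is taken to be the density-peaks one (local density `rho` against distance `delta` to the nearest denser point), the topic word vector is approximated by keeping only the `m` highest-weight features of the cluster mean, and distances are Euclidean. The function names, the cutoff `dc`, the parameter `m`, and the toy term-weight matrix `X` are all hypothetical.

```python
import numpy as np

def decision_graph_centers(X, k, dc):
    """Pick k initial centers from a density-peaks style decision graph (assumed)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Gaussian-kernel local density; subtract 1 to drop each point's self term
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        # Distance to the nearest denser point; the densest point gets its max distance
        delta[i] = d[i, higher].min() if len(higher) else d[i].max()
    # Points with both large rho and large delta stand out in the decision graph
    return np.argsort(rho * delta)[::-1][:k]

def topic_word_center(members, m):
    """Cluster mean restricted to its m highest-weight features (assumed definition)."""
    mean = members.mean(axis=0)
    center = np.zeros_like(mean)
    top = np.argsort(mean)[::-1][:m]
    center[top] = mean[top]
    return center

def kmeans_topic_centers(X, k, dc, m, n_iter=20):
    """K-means loop that uses topic_word_center instead of the plain mean."""
    centers = X[decision_graph_centers(X, k, dc)].astype(float)
    for _ in range(n_iter):
        # Assign every document to its nearest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster becomes empty
                new_centers[j] = topic_word_center(members, m)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy term-weight matrix: rows are documents, columns are terms (hypothetical data)
X = np.array([
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
labels, centers = kmeans_topic_centers(X, k=2, dc=0.5, m=2)
print(labels)
print(centers)
```

Zeroing the low-weight features in each center is one simple way to keep rarely used terms from pulling the center around, which is the kind of robustness to noise features the abstract reports; the paper's actual construction may differ.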

K-means; Initial point; Decision graph; Iterative class center; Topic word vector

Ji Duo (季铎), Liu Yunzhao (刘云钊), Peng Ruxiang (彭如香), Kong Huafeng (孔华锋)

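Criminal Investigation Police University of China, Shenyang, Liaoning 110854

The Third Research Institute of the Ministry of Public Security, Shanghai 201204

Wuhan Business University, Wuhan, Hubei 430056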

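National Key Research and Development Program of China; Open Project of the Liaoning Collaborative Innovation Center for Network Security Law Enforcement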

2018YFC0830401

2024

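Computer Applications and Software
Shanghai Institute of Computing Technology; Shanghai Development Center of Computer Software Technology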

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.615
ISSN: 1000-386X
Year, volume (issue): 2024, 41(10)