Text Clustering Algorithm Combining Density and Partition (融合密度和划分的文本聚类算法)

Abstract: Document clustering is a classic application of clustering: it groups similar documents into the same category, which helps organize, summarize, and navigate text information and can also be used to improve classification performance. This paper uses the BERT model to vectorize documents, representing each document as a high-dimensional vector. Traditional density-based clustering algorithms are not suitable for high-dimensional datasets. Among partition-based clustering algorithms, K-means can cluster documents effectively, but its performance depends heavily on the choice of initial center points. This paper proposes a new text clustering algorithm that combines density and partition. First, a suitable set of cluster center points is selected by density; then the initial cluster centers are chosen one by one using the farthest-distance idea; finally, a partition method is used to cluster the dataset. Experiments show that the algorithm produces stable, good clustering results.
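The three steps named in the abstract (density-based selection of center points, farthest-distance choice of initial centers, then partition clustering) can be illustrated with a minimal sketch. The abstract does not say how density is measured, which distance is used, or how many points are kept, so everything specific here is an assumption: density is taken as the inverse of the distance to the k-th nearest neighbor, distance is Euclidean, the densest `top_fraction` of points form the candidate set, and the partition step is plain K-means. The names `select_initial_centers`, `kmeans`, `k`, and `top_fraction` are hypothetical, and random vectors stand in for BERT document embeddings. It is not the authors' implementation.

```python
import numpy as np


def pairwise_distances(X):
    # Euclidean distance between every pair of rows (an assumed metric).
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)


def select_initial_centers(X, n_clusters, k=5, top_fraction=0.5):
    d = pairwise_distances(X)
    # Assumed density: inverse distance to the k-th nearest neighbor.
    # Column 0 of each sorted row is the point itself (distance 0).
    kth_dist = np.sort(d, axis=1)[:, k]
    density = 1.0 / (kth_dist + 1e-12)
    # Keep the densest points as the candidate center set.
    n_candidates = max(n_clusters, int(len(X) * top_fraction))
    candidates = np.argsort(-density)[:n_candidates]
    # Start from the densest candidate, then repeatedly add the candidate
    # farthest from its nearest already-chosen center.
    chosen = [int(candidates[0])]
    while len(chosen) < n_clusters:
        gap = d[np.ix_(candidates, chosen)].min(axis=1)
        chosen.append(int(candidates[np.argmax(gap)]))
    return X[chosen]


def kmeans(X, centers, n_iter=100):
    # Plain K-means started from the supplied centers.
    centers = centers.copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Synthetic stand-in for document vectors: three well-separated groups.
rng = np.random.default_rng(0)
group_means = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 8])
X = np.vstack([m + rng.normal(scale=0.5, size=(30, 8)) for m in group_means])

init = select_initial_centers(X, n_clusters=3)
labels, centers = kmeans(X, init)
```

Because the initial centers are chosen deterministically rather than at random, repeated runs on the same vectors give the same partition, which is one plausible reading of the stability claim in the abstract.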

Keywords:

document clustering; BERT; K-means algorithm; density; farthest distance

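Authors:

刘龙, 刘新, 蔡林杰, 唐朝

Affiliation:

School of Computer Science · School of Cyberspace Security, Xiangtan University (湘潭大学计算机学院·网络空间安全学院), Xiangtan 411105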

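Funding:

Open Fund of the Key Laboratory of Network Crime Investigation of Hunan Provincial Colleges (网络犯罪侦查湖南省普通高等学校重点实验室开放基金)

Project number:

2018WLFZZC003

Publication year:

2024

DOI: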

10.3969/j.issn.1672-9722.2024.01.029

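Computer & Digital Engineering (计算机与数字工程)

709th Research Institute, China Shipbuilding Industry Corporation (中国船舶重工集团公司第七0九研究所)

CSTPCD

Impact factor: 0.355

ISSN: 1672-9722

Year, volume (issue): 2024, 52(1)

Number of references: 20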