
SCCL Text Deep Clustering with Increased Cluster-Level Comparison

[Objective] This paper proposes ISCCL, a new deep text clustering model based on SCCL, to improve SCCL's performance on deep text clustering tasks. [Methods] ISCCL uses a pre-trained sentence-embedding language model to augment and encode the input texts, yielding two sets of augmented representations. On top of the SCCL model, it adds a two-layer nonlinear network that reduces the augmented representations to a cluster feature space whose dimension equals the number of clusters. Positive and negative cluster pairs are then constructed from the column-space perspective for contrastive learning, which guides the model toward features that are useful for clustering and reduces the impact of false positive samples. [Results] On five benchmark datasets (AgNews, Biomedical, StackOverflow, 20NewsGroups, and zh10), ISCCL reaches clustering accuracies of 88.89%, 48.74%, 78.17%, 56.97%, and 86.42%, respectively, an improvement of 0.69% to 2.67% over the SCCL model. [Limitations] The dimension of the cluster feature space must be set in advance (equal to the number of clusters K). In practice, the true number of clusters in the raw data is often hard to determine, so K should be adjusted to suit the dataset. [Conclusions] ISCCL effectively extracts cluster features and improves deep text clustering performance over the SCCL model.
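The cluster-level contrast described in the Methods can be illustrated with a small sketch. The abstract does not give the exact loss, so the code below is an assumption rather than the authors' implementation: it takes the N x K outputs of the two-layer head for two augmented views, applies a row-wise softmax, treats each column as one cluster's representation, and scores the pairs with an NT-Xent-style loss over cosine similarity. The function name `cluster_contrastive_loss` and the `temperature` parameter are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one sample's K cluster scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def cluster_contrastive_loss(logits_a, logits_b, temperature=1.0):
    """Cluster-level contrastive loss (assumed NT-Xent form, not from the paper).

    logits_a, logits_b: N x K nested lists, the cluster-space outputs for the
    two augmented views of the same batch of N texts.
    """
    # Soft cluster assignments: one probability row per sample.
    p_a = [softmax(r) for r in logits_a]
    p_b = [softmax(r) for r in logits_b]

    # Column view: each column describes one cluster across the whole batch.
    cols = [list(c) for c in zip(*p_a)] + [list(c) for c in zip(*p_b)]
    k = len(cols) // 2

    total = 0.0
    for i in range(2 * k):
        pos = (i + k) % (2 * k)  # same cluster index in the other view
        # Similarities to every other column: 1 positive, 2K - 2 negatives.
        sims = [cosine(cols[i], cols[j]) / temperature
                for j in range(2 * k) if j != i]
        m = max(sims)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_sum_exp - cosine(cols[i], cols[pos]) / temperature
    return total / (2 * k)


if __name__ == "__main__":
    view_a = [[5.0, -5.0], [5.0, -5.0], [-5.0, 5.0], [-5.0, 5.0]]
    view_b_swapped = [[r[1], r[0]] for r in view_a]
    print("consistent views:", cluster_contrastive_loss(view_a, view_a))
    print("swapped clusters:", cluster_contrastive_loss(view_a, view_b_swapped))
```

Because the contrast is taken over columns rather than rows, two texts from the same cluster are never pushed apart as an instance-level negative pair, which is one plausible reading of how a cluster-level term limits the effect of mislabeled pairs. In this sketch the loss is lower when the two views agree on cluster assignments than when the cluster columns are swapped.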

Keywords: Contrastive Learning; Deep Clustering; SCCL; Cluster Feature Learning; Representation Learning

Li Jie (李婕), Zhang Zhixiong (张智雄), Wang Yufei (王宇飞)


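National Science Library, Chinese Academy of Sciences, Beijing 100190

Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190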


Funding: Special Project of the National Science and Technology Library (NSTL), No. 2023XM42

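Data Analysis and Knowledge Discovery (数据分析与知识发现)
National Science Library, Chinese Academy of Sciences

Indexed in: CSTPCD, CSSCI, CHSSCD, PKU Core, EI
Impact factor: 1.452
ISSN: 2096-3467
Year, volume (issue): 2024, 8(3)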
  • 43