Research on an SCCL Text Deep Clustering Method with Added Cluster-Level Contrast

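Abstract (translated from the Chinese): [Objective] To improve the performance of the SCCL model on text deep clustering, this paper proposes ISCCL, a new SCCL-based text deep clustering model. [Methods] ISCCL uses a pre-trained sentence-embedding language model to augment and encode the input texts, yielding two sets of augmented representations. On top of SCCL, it adds a two-layer nonlinear network that projects the augmented representations into a cluster feature space whose dimension equals the number of clusters. Positive and negative cluster pairs are then constructed from the column-space perspective for contrastive learning, which guides the model to mine features useful for clustering and reduces the influence of false positive samples. [Results] On five benchmark datasets (AgNews, Biomedical, StackOverflow, 20NewsGroups, and zh10), ISCCL reaches clustering accuracies of 88.89%, 48.74%, 78.17%, 56.97%, and 86.42%, respectively, an improvement of 0.69% to 2.67% over SCCL. [Limitations] The dimension of the cluster feature space must be set in advance (equal to the number of clusters K); in practice the true number of clusters in the raw data is often hard to determine, so the value should be adjusted to the data. [Conclusions] ISCCL effectively extracts cluster features and improves text deep clustering over SCCL.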

English title: SCCL Text Deep Clustering with Increased Cluster-Level Comparison

English abstract: [Objective] This paper proposes a new deep clustering model (ISCCL) for texts based on SCCL, aiming to improve its performance in clustering tasks. [Methods] First, the ISCCL model uses a pre-trained sentence-vector model to perform data augmentation and encoding, obtaining two sets of augmented representations of the input texts. Second, two layers of nonlinear networks are added to the SCCL model, reducing the augmented representations to a cluster feature space whose dimension equals the number of clusters. Third, positive and negative cluster pairs are constructed from the perspective of column space for contrastive learning, which guides the model to explore features valuable for clustering tasks and reduces the impact of false positive samples. [Results] On five benchmark datasets (AgNews, Biomedical, StackOverflow, 20NewsGroups, and zh10), the clustering accuracy of the ISCCL model reached 88.89%, 48.74%, 78.17%, 56.97%, and 86.42%, respectively, an improvement of 0.69% to 2.67% over the SCCL model. [Limitations] The dimension of the cluster feature space needs to be pre-set (the same as the number of clusters K). However, it is often difficult to determine the specific number of clusters in the original data, and adjustments should be made according to the dataset. [Conclusions] The ISCCL model can effectively extract cluster features and improves deep clustering performance on texts.
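The cluster-level step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no loss formula, so the code assumes a common cluster-level contrastive formulation in which each column of the N×K soft-assignment matrix represents one cluster, the same column in the other augmented view is its positive, and the remaining 2K−2 columns are negatives. The function names (`mlp_head`, `cluster_contrastive_loss`), the ReLU and softmax choices, the cosine similarity, and the `temperature` parameter are all assumptions made for illustration, and random vectors stand in for the sentence embeddings.

```python
import math
import random


def mlp_head(x, w1, b1, w2, b2):
    """Hypothetical two-layer head: linear -> ReLU -> linear -> softmax over K clusters."""
    hidden = [max(0.0, sum(xi * wi for xi, wi in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    logits = [sum(hi * wi for hi, wi in zip(hidden, row)) + b
              for row, b in zip(w2, b2)]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def cluster_contrastive_loss(p_a, p_b, temperature=1.0):
    """Assumed cluster-level contrastive loss over the columns of two N x K matrices.

    p_a and p_b hold the soft cluster assignments of the two augmented views.
    Column k of one view is pulled towards column k of the other view and
    pushed away from every other column of both views.
    """
    k = len(p_a[0])
    # Transpose both matrices: 2K column vectors, each of length N.
    cols = [list(c) for c in zip(*p_a)] + [list(c) for c in zip(*p_b)]
    n_cols = 2 * k
    loss = 0.0
    for i in range(n_cols):
        pos = (i + k) % n_cols  # same cluster index in the other view
        sims = [cosine(cols[i], cols[j]) / temperature
                for j in range(n_cols) if j != i]
        pos_sim = cosine(cols[i], cols[pos]) / temperature
        peak = max(sims)
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims))
        loss += log_denominator - pos_sim  # cross-entropy with the positive as target
    return loss / n_cols


if __name__ == "__main__":
    random.seed(0)
    dim, hidden_dim, k, n = 8, 6, 3, 5  # toy sizes; k plays the role of K
    w1 = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(k)]
    b2 = [0.0] * k
    texts = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    # Stand-in "augmentations": small Gaussian perturbations of each embedding.
    view_a = [[v + random.gauss(0, 0.1) for v in t] for t in texts]
    view_b = [[v + random.gauss(0, 0.1) for v in t] for t in texts]
    p_a = [mlp_head(x, w1, b1, w2, b2) for x in view_a]
    p_b = [mlp_head(x, w1, b1, w2, b2) for x in view_b]
    print(cluster_contrastive_loss(p_a, p_b, temperature=0.5))
```

Because the head outputs exactly K values per text, K has to be fixed before training, which is the limitation the abstract notes. In this sketch the loss is lower when the two views assign texts to the same cluster columns than when the cluster columns are swapped between views.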

English keywords:

Contrastive Learning; Deep Clustering; SCCL; Cluster Feature Learning; Representative Learning

Authors:

Li Jie (李婕), Zhang Zhixiong (张智雄), Wang Yufei (王宇飞)

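Affiliations:

National Science Library, Chinese Academy of Sciences, Beijing 100190

Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190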

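Keywords (translated from the Chinese):

contrastive learning; deep clustering; SCCL; cluster feature learning; representation learning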

Funding:

Special project of the National Science and Technology Library (NSTL)

Project number:

2023XM42

Publication year:

2024

DOI:

10.11925/infotech.2096-3467.2023.0156

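Journal: Data Analysis and Knowledge Discovery (数据分析与知识发现)

Publisher: National Science Library, Chinese Academy of Sciences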

Indexed in: CSTPCD, CSSCI, CHSSCD, Peking University Core (北大核心), EI

Impact factor: 1.452

ISSN: 2096-3467

Year, volume (issue): 2024, 8(3)

Number of references: 43