Deep Text Clustering Based on SCCL with Added Cluster-Level Contrastive Learning
[Objective] This paper proposes ISCCL, a new deep clustering model for texts based on SCCL, aiming to improve SCCL's performance on clustering tasks. [Methods] First, the ISCCL model uses pre-trained sentence vector models to perform data augmentation and encoding, obtaining two sets of augmented representations of the input texts. Second, two nonlinear network layers are added to the SCCL model to reduce the augmented representations to a cluster feature space whose dimension equals the number of clusters. Third, positive and negative cluster pairs are constructed from the perspective of the column space for contrastive learning, which guides the model to explore features that are valuable for clustering and reduces the impact of false positive samples. [Results] On five benchmark datasets (AgNews, Biomedical, StackOverflow, 20NewsGroups, and zh10), the clustering accuracy of the ISCCL model reached 88.89%, 48.74%, 78.17%, 56.97%, and 86.42%, respectively, an improvement of 0.69% to 2.67% over the SCCL model. [Limitations] The dimension of the cluster feature space must be set in advance to the number of clusters K. However, the actual number of clusters in the original data is often difficult to determine, so K has to be adjusted for each dataset. [Conclusions] The ISCCL model can effectively extract cluster features and improve deep clustering performance on texts.
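The column-space contrastive step described in [Methods] can be illustrated with a minimal, dependency-free sketch. It assumes the two augmented views have already been projected to N x K cluster feature matrices (N texts, K clusters), treats each column as one cluster's representation, pairs the same column across the two views as a positive, and uses all other columns as negatives. The NT-Xent-style loss form, the cosine similarity, the `temperature` value, and the function names are assumptions made for illustration; the abstract does not specify them, and a real implementation would use a deep learning framework with batched tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def columns(matrix):
    """Return the columns of an N x K matrix as K vectors of length N."""
    return [list(col) for col in zip(*matrix)]


def cluster_contrastive_loss(view_a, view_b, temperature=0.5):
    """Cluster-level contrastive loss over the columns of two N x K matrices.

    Hypothetical sketch: column i of view_a and column i of view_b form a
    positive cluster pair; every other column in either view is a negative.
    """
    k = len(view_a[0])
    cols = columns(view_a) + columns(view_b)  # 2K cluster vectors
    total = 0.0
    for i in range(2 * k):
        positive = (i + k) % (2 * k)  # same cluster in the other view
        logits = {
            j: cosine(cols[i], cols[j]) / temperature
            for j in range(2 * k)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(x) for x in logits.values()))
        total += log_denominator - logits[positive]
    return total / (2 * k)


# Toy soft assignments for 4 texts and K = 2 clusters (illustrative values).
view_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
view_b = [[0.85, 0.15], [0.9, 0.1], [0.15, 0.85], [0.1, 0.9]]
# Same data with the cluster columns of the second view swapped.
view_b_swapped = [[row[1], row[0]] for row in view_b]

loss_aligned = cluster_contrastive_loss(view_a, view_b)
loss_swapped = cluster_contrastive_loss(view_a, view_b_swapped)
```

In this toy setup the loss is lower when the cluster columns of the two views agree than when they are swapped, which is the behavior a cluster-level objective of this kind is meant to reward.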