Computer Applications and Software, 2024, Vol. 41, Issue 3: 226-232. DOI: 10.3969/j.issn.1000-386x.2024.03.035

STOCHASTIC NEIGHBOR EMBEDDING SHORT TEXT CLUSTERING IMPROVED BY EXPONENTIAL FUNCTION

Wang Xiaochen¹, Song Shuni²

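Author information

  • 1. College of Sciences, Northeastern University, Shenyang 110819, Liaoning, China
  • 2. Guangdong Peizheng College, Guangzhou 510830, Guangdong, China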

Abstract

In recent years, deep learning has played a major role in short text clustering, and the recently proposed Short Text Clustering (STC) algorithm has achieved good results on this task. To further improve clustering accuracy and optimize algorithm performance, an improved stochastic neighbor embedding algorithm based on the exponential function (e-STC) is proposed. The algorithm uses an exponential function to measure the gap between sample points and cluster centers, which magnifies the differences between features, and in the later stage it uses the k-means++ algorithm to determine the cluster centers and the number of clusters in advance. Experiments on the Stackoverflow dataset show that e-STC outperforms the original STC model in both accuracy and normalized mutual information (NMI): accuracy improves by 3.2% and NMI by 2.9%, both in relative terms.
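
The two ingredients named in the abstract, an exponential measure of the gap between sample points and cluster centers and k-means++ seeding of the centers, can be illustrated with a minimal NumPy sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the exponential kernel `exp(-||z - mu||^2)`, the function names, and the Student's t kernel of the kind used in DEC-style deep clustering, which is included only as a point of comparison. The number of clusters `k` is simply passed in as an input.

```python
import numpy as np


def kmeanspp_init(X, k, rng):
    """k-means++ seeding: each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(1, k):
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)  # distance to nearest center
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.asarray(centers)


def exp_soft_assign(Z, centers):
    """Hypothetical exponential soft assignment:
    q_ij proportional to exp(-||z_i - mu_j||^2)."""
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    # Shift by the row minimum so the exponentials cannot all underflow.
    q = np.exp(-(d2 - d2.min(axis=1, keepdims=True)))
    return q / q.sum(axis=1, keepdims=True)


def student_t_soft_assign(Z, centers, alpha=1.0):
    """Student's t soft assignment (DEC-style), shown for comparison only."""
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 2-D "embeddings": two small, well-separated groups.
    Z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
    centers = kmeanspp_init(Z, k=2, rng=rng)
    print("exponential:\n", exp_soft_assign(Z, centers))
    print("Student's t:\n", student_t_soft_assign(Z, centers))
```

Under these assumptions the exponential kernel decays faster with distance than the heavy-tailed Student's t kernel, so each point's soft assignment concentrates more strongly on its nearest center. This is one plausible reading of "magnifies the differences between features", not a statement of the authors' exact formulation.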

Key words

Short text clustering/Deep clustering/Stochastic neighbor embedding/Feature extraction

Funding

National Natural Science Foundation of China (11801065)

Publication year

2024
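Computer Applications and Software
Shanghai Institute of Computing Technology; Shanghai Development Center of Computer Software Technology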

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.615
ISSN: 1000-386X
Number of references: 28