Research on Fault Text Clustering Method of Signal Equipment Based on IBTM-TMW

Yang Ni¹, Zhang Youpeng¹, Zuo Jing¹, Zhao Bin¹

Author information

1. School of Automation and Electrical Engineering, Lanzhou Jiaotong University, Lanzhou, Gansu 730070, China

Abstract

Fault text data for signal equipment are short, highly specialized, and difficult to reuse intelligently. To address these problems, a fault text clustering method for signal equipment is proposed, based on an improved biterm topic model fused with word vectors (IBTM-TMW). First, to reduce noise and improve data quality, a custom-built dictionary and gerund-retention processing are introduced in data preprocessing. Second, during Gibbs sampling over biterms (word pairs), the differential importance of words is introduced as a weighting factor, and the resulting Improved Biterm Topic Model (IBTM) is used to strengthen the learning of text topic features. The Term Frequency-Modified Inverse Document Frequency (TF-MIDF) weight is embedded into the generation of Word2vec word vectors, so that each word's importance in the text is fused with its Word2vec vector and the word-level feature representation of the text is refined. Finally, the text topic feature vector and the word feature vector are fused to enhance the representation of text features, and the K-means++ algorithm is used for fault clustering analysis. The results show that, on the same experimental data set, the text feature vectors generated by the IBTM-TMW model are of significantly higher quality than those of traditional models such as LDA and Label-LDA, and the diagnostic accuracy, measured as Correct Classification Rate (CCR), reaches 89.9%, exceeding that of the K-means, GMM, AGNES and BIRCH clustering models (78.3%, 68.1%, 87.9% and 81.7%, respectively). The proposed method improves the ability to identify correlations between fault text features and fault categories, and offers a reference for text-data-driven fault diagnosis.
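
The feature-fusion and clustering steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: standard smoothed TF-IDF stands in for TF-MIDF (whose definition is not given here), the topic vectors are hand-written placeholders for IBTM output, the 2-D word vectors are toy values rather than trained Word2vec embeddings, and all function names and sample tokens are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_weights(docs):
    """Per-document smoothed TF-IDF weights (a stand-in for TF-MIDF)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        result.append({
            w: (c / len(doc)) * (math.log((1 + n) / (1 + df[w])) + 1.0)
            for w, c in tf.items()
        })
    return result


def weighted_doc_vector(weights, word_vecs, dim):
    """Weighted average of word vectors, using the given term weights."""
    acc, total = [0.0] * dim, 0.0
    for w, wt in weights.items():
        if w in word_vecs:
            acc = [a + wt * v for a, v in zip(acc, word_vecs[w])]
            total += wt
    return [a / total for a in acc] if total > 0 else acc


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: sample each new center proportionally to D^2."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # all points coincide with existing centers
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans(points, k, iters=20, seed=0):
    """Lloyd's iterations started from K-means++ seeds."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy, hypothetical fault records (already tokenized).
docs = [
    ["track", "circuit", "red", "band"],
    ["track", "circuit", "shunt", "fault"],
    ["switch", "machine", "jam"],
    ["switch", "machine", "motor", "fault"],
]
# Toy 2-D word vectors; a real pipeline would train Word2vec instead.
word_vecs = {
    "track": [1.0, 0.0], "circuit": [0.9, 0.1], "red": [0.8, 0.0],
    "band": [0.8, 0.1], "shunt": [0.9, 0.0], "fault": [0.5, 0.5],
    "switch": [0.0, 1.0], "machine": [0.1, 0.9], "jam": [0.0, 0.8],
    "motor": [0.1, 1.0],
}
# Placeholder document-topic distributions standing in for IBTM output.
topic_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

weights = tfidf_weights(docs)
# Fusion here is plain concatenation of topic and weighted word features.
features = [
    topic + weighted_doc_vector(w, word_vecs, dim=2)
    for topic, w in zip(topic_vecs, weights)
]
labels = kmeans(features, k=2)
print(labels)
```

Concatenation is only one possible way to fuse the two feature vectors; it is used here because it keeps the topic and word components separately inspectable. The cluster indices themselves are arbitrary, so only the grouping of records is meaningful.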

Key words

Fault diagnosis/Topic model/Word vector/Weight/Text clustering

Publication year

2024

China Railway Science

China Academy of Railway Sciences

Indexed in: CSTPCD, CSCD, Peking University Core Journals

Impact factor: 1.191

ISSN: 1001-4632
