Study on Tibetan Short Text Classification Based on DAN and FastText

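Chinese abstract (translated): As Tibetan-language information becomes increasingly integrated into social life, a growing volume of Tibetan short-text data is accumulating on online platforms. To address the low performance of traditional classification methods on Tibetan short texts, this paper proposes a Tibetan short text classification model based on DAN-FastText. The model first trains a FastText network without supervision on a relatively large Tibetan corpus to obtain a set of pre-trained Tibetan syllable vectors, and uses that set to convert each Tibetan short text into syllable vectors. The syllable vectors are fed into a DAN (Deep Averaging Network); at the output stage, the DAN features are fused with sentence-vector features trained by the FastText network; finally, a fully connected layer and a softmax layer perform the classification. On the news headline dataset of the public TNCC (Tibetan News Classification Corpus), the proposed model reaches a Macro-F1 of 64.53%, which is 2.81% higher than the Macro-F1 of the TiBERT model, the best previously reported result, and 6.14% higher than the Macro-F1 of the GCN model. The fusion model therefore classifies Tibetan short texts effectively.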
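The pipeline described in the abstract (syllable vectors, averaging, fusion with a sentence vector, fully connected layer, softmax) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation. The function names, the single hidden layer, the fusion by concatenation, the naive syllable split on the tsheg mark, and the random stand-in vectors are all assumptions made for illustration; the record does not specify these details.

```python
import math
import random

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)
SHAD = "\u0f0d"   # Tibetan clause/sentence delimiter (shad)


def split_syllables(text):
    """Naive syllable split on tsheg; shad is treated as a delimiter too."""
    parts = (p.strip() for p in text.replace(SHAD, TSHEG).split(TSHEG))
    return [p for p in parts if p]


def dense(x, weights, bias):
    """Affine layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average(vectors):
    """Element-wise mean of equally sized vectors (the 'averaging' in DAN)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def init_params(rng, dim, hidden, sent_dim, classes):
    """Small random weights; a trained model would learn these instead."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "dim": dim,
        "w_h": matrix(hidden, dim), "b_h": [0.0] * hidden,
        "w_out": matrix(classes, hidden + sent_dim), "b_out": [0.0] * classes,
    }


def dan_fasttext_forward(syllables, syllable_vectors, sentence_vector, params):
    """Hypothetical DAN branch fused with a sentence vector, then softmax."""
    known = [syllable_vectors[s] for s in syllables if s in syllable_vectors]
    if not known:  # no known syllable: fall back to a zero vector
        known = [[0.0] * params["dim"]]
    hidden = relu(dense(average(known), params["w_h"], params["b_h"]))
    fused = hidden + sentence_vector  # ASSUMPTION: fusion by concatenation
    return softmax(dense(fused, params["w_out"], params["b_out"]))


rng = random.Random(0)
params = init_params(rng, dim=4, hidden=3, sent_dim=4, classes=3)
syllables = split_syllables("བཀྲ་ཤིས་བདེ་ལེགས།")
# Random stand-ins for pre-trained FastText syllable and sentence vectors.
syllable_vectors = {s: [rng.uniform(-1, 1) for _ in range(4)] for s in syllables}
sentence_vector = [rng.uniform(-1, 1) for _ in range(4)]
probs = dan_fasttext_forward(syllables, syllable_vectors, sentence_vector, params)
print(syllables, [round(p, 3) for p in probs])
```

In a real setting the syllable and sentence vectors would come from a FastText model trained on a Tibetan corpus, and the weights would be learned on labeled headlines; the toy dimensions and three classes here only keep the example small.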

English title: Study on Tibetan Short Text Classification Based on DAN and FastText

English abstract: As Tibetan-language information becomes increasingly integrated into social life, more and more Tibetan short-text data is available on online platforms. To address the low classification performance of traditional methods on Tibetan short texts, a Tibetan short text classification model based on DAN-FastText is proposed. The model uses the FastText network for unsupervised training on a large-scale Tibetan corpus to obtain a set of pre-trained Tibetan syllable vectors, and uses this set to convert Tibetan short texts into syllable vectors. The syllable vectors are fed into a deep averaging network (DAN), the sentence-vector features trained by the FastText network are fused in at the output stage, and classification is completed by a fully connected layer and a softmax layer. On the news headline dataset of the publicly available Tibetan News Classification Corpus (TNCC), the proposed model achieves a Macro-F1 of 64.53%, which is 2.81% higher than that of the TiBERT model and 6.14% higher than that of the GCN model, showing that the fusion model classifies Tibetan short texts well.
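The results above are reported as Macro-F1, the unweighted mean of the per-class F1 scores, so that rare classes count as much as frequent ones. The sketch below shows one common way to compute it. The function name, the toy labels, and the choice to average over every label seen in either the gold or the predicted sequence are assumptions; the record does not state the exact evaluation script used.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with three hypothetical classes.
gold = ["a", "a", "b", "c"]
pred = ["a", "b", "b", "c"]
print(round(macro_f1(gold, pred), 4))
```

For the toy labels, the per-class F1 scores are 2/3, 2/3 and 1, so the macro average is 7/9.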

English keywords:

Tibetan short text classification; Feature fusion; Deep averaging networks; FastText

Authors:

Li Guo (李果), Chen Chen (陈晨), Yang Jin (杨进), Qun Nuo (群诺)

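Author affiliations:

School of Information Science and Technology, Tibet University, Lhasa 850000

Engineering Research Center of Tibetan Information Technology, Ministry of Education, Lhasa 850000

School of Cyber Science and Engineering, Sichuan University, Chengdu 610000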

Keywords:

Tibetan short text classification; feature fusion; deep averaging network; FastText

Funding:

National Natural Science Foundation of China; National Natural Science Foundation of China

Project numbers:

61872254; 62162057

Year of publication:

2024

DOI:

10.11896/jsjkx.230700064

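Computer Science (计算机科学)

Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)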

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.944

ISSN: 1002-137X

Year, volume (issue): 2024, 51(z1)

Number of references: 19