Research on Tibetan Short Text Classification Based on GraphSAGE Network

Original title: 基于GraphSAGE网络的藏文短文本分类研究

Abstract: Text classification is an important research direction in natural language processing. Progress on Tibetan text classification has been slow because of data scarcity, the complexity of extracting linguistic features, and the diversity of discourse structures. This paper therefore improves on a graph neural network as the base model. First, building on "syllable-syllable" and "syllable-document" modeling, it incorporates document features and uses a binary classification model to dynamically construct "document-document" edges, so as to fully mine the global features of short texts; a sliding window is added to reduce the model's computational complexity, and the optimal window size is searched for. Second, to address the syllable sparsity of Tibetan short texts, GraphSAGE is introduced as the base model for the first time, and the performance differences among aggregation functions on Tibetan short text classification are examined. Finally, to capture the heterogeneity of relationships between nodes, the features of neighbor nodes are weighted before average pooling to strengthen the model's feature extraction. On the TNCC title dataset, the model reaches a classification accuracy of 62.50%, which is 2.56%, 1%, and 2.4% higher than the conventional GCN, the original GraphSAGE, and the pre-trained language model CINO, respectively.
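The graph construction and weighted pooling summarized in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: syllables are obtained by splitting on the Tibetan tsheg mark (U+0F0B), "syllable-syllable" edge weights are raw co-occurrence counts inside a sliding window, and the aggregation step is a weight-normalized mean of neighbor features concatenated with the node's own feature, in the style of a GraphSAGE mean aggregator but without learned parameters. The function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations

TSHEG = "\u0f0b"  # Tibetan syllable delimiter


def split_syllables(text):
    """Split a Tibetan string into syllables on the tsheg mark."""
    return [s for s in text.split(TSHEG) if s]


def cooccurrence_edges(docs, window=3):
    """Count syllable pairs that co-occur inside a sliding window.

    docs is a list of syllable lists. The result is a Counter keyed by
    a sorted (a, b) pair. Raw counts are an assumption; the paper may
    use a different edge weight.
    """
    edges = Counter()
    for syllables in docs:
        # A document shorter than the window is treated as one window.
        n_windows = max(1, len(syllables) - window + 1)
        for start in range(n_windows):
            span = sorted(set(syllables[start:start + window]))
            for a, b in combinations(span, 2):
                edges[(a, b)] += 1
    return edges


def weighted_mean_aggregate(node, neighbors, features):
    """Concatenate a node's feature with the weighted mean of its neighbors.

    neighbors maps a node to {neighbor: edge_weight}; features maps a
    node to a list of floats. Isolated nodes get a zero neighbor vector.
    """
    self_vec = features[node]
    nbrs = neighbors.get(node, {})
    total = sum(nbrs.values())
    if total == 0:
        pooled = [0.0] * len(self_vec)
    else:
        pooled = [
            sum(w * features[n][i] for n, w in nbrs.items()) / total
            for i in range(len(self_vec))
        ]
    return list(self_vec) + pooled


if __name__ == "__main__":
    doc = split_syllables("བཀྲ་ཤིས་བདེ་ལེགས")
    print(len(doc), "syllables")
    print(cooccurrence_edges([doc], window=3))
```

A smaller window produces fewer syllable pairs per document, which is one way a sliding window can reduce the size of the graph; the window size is the value the abstract describes tuning.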

Keywords:

graph neural network; Tibetan text classification; TNCC dataset

Authors:

Jing Rong (敬容), Yang Yimin (杨逸民), Wan Fucheng (万福成), Guo Qi (国旗), Yu Hongzhi (于洪志), Ma Ning (马宁)

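Author affiliations:

Key Laboratory of Linguistic and Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou, Gansu 730030

Gansu Provincial Key Laboratory of Ethnic Language Intelligent Processing, Northwest Minzu University, Lanzhou, Gansu 730030

Dalian Meteorological Information Center, Dalian Meteorological Bureau, Dalian, Liaoning 116000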

Funding:

National Natural Science Foundation of China

Grant number:

62366046

Publication year:

2024

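Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences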

Indexed in: CSTPCD; CHSSCD; Peking University Core Journals (北大核心)

Impact factor: 0.8

ISSN: 1003-0077

Year, volume (issue): 2024, 38(9)

Number of references: 6