Cross-Border Ethnic Text Clustering Method Based on Domain Knowledge Graph
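Chen Chunji (陈春吉) 1, Mao Cunli (毛存礼) 1, Zhang Yongbing (张勇丙) 1, Huang Yuxin (黄于欣) 1, Gao Shengxiang (高盛祥) 1, Hao Pengpeng (郝鹏鹏) 1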
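Author information
- 1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650000, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650000, China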
Abstract
Cross-border ethnic text clustering aims to establish associations among different texts about cross-border ethnic groups, supporting cross-border ethnic text retrieval and event correlation analysis. However, cultural expressions differ substantially across the texts of cross-border ethnic groups, and the cultural background behind those expressions is often missing, which makes clustering difficult. This paper therefore proposes a cross-border ethnic text clustering method that incorporates a domain knowledge graph. First, a cross-border ethnic domain knowledge graph is integrated to supplement the texts with cultural background knowledge and to link entities semantically, yielding enhanced local semantic representations of the texts. Second, because global semantic information is also important in cross-border ethnic text data, a heterogeneous graph attention network is used to extract global features among texts, topics, and domain keywords. Finally, a variational autoencoder network fuses the local and global information, and the learned latent feature representations are used for clustering. Experiments show that, compared with the baseline method, the proposed method improves Acc by 11.4%, NMI by 1%, and ARI by 9.4%.
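The abstract reports Acc, NMI, and ARI without defining them. The sketch below shows how these three clustering metrics are commonly computed, under stated assumptions: Acc is taken as accuracy under the best one-to-one mapping between clusters and gold classes (found here by brute force over permutations, which is only practical for a small number of clusters), NMI is normalized by the arithmetic mean of the two entropies, and ARI is the standard adjusted Rand index. The function names are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from collections import Counter
from itertools import permutations
from math import comb, log


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one cluster-to-class mapping (brute force)."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    pairs = Counter(zip(y_pred, y_true))  # (cluster, class) -> count
    swap = len(clusters) > len(classes)
    small, large = (classes, clusters) if swap else (clusters, classes)
    best = 0
    for perm in permutations(large, len(small)):
        # Keys are always (cluster, class), whichever side is being permuted.
        hits = sum(pairs[(p, s) if swap else (s, p)] for s, p in zip(small, perm))
        best = max(best, hits)
    return best / len(y_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(y_true, y_pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    ct, cp = Counter(y_true), Counter(y_pred)
    mi = sum(c / n * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    denom = (entropy(y_true) + entropy(y_pred)) / 2
    return mi / denom if denom > 0 else 1.0


def ari(y_true, y_pred):
    """Adjusted Rand index computed from pair counts."""
    n = len(y_true)
    if n < 2:
        return 1.0
    sum_joint = sum(comb(c, 2) for c in Counter(zip(y_true, y_pred)).values())
    sum_t = sum(comb(c, 2) for c in Counter(y_true).values())
    sum_p = sum(comb(c, 2) for c in Counter(y_pred).values())
    expected = sum_t * sum_p / comb(n, 2)
    max_index = (sum_t + sum_p) / 2
    if max_index == expected:
        return 1.0
    return (sum_joint - expected) / (max_index - expected)
```

For example, with gold labels `[0, 0, 1, 1]` and predicted clusters `[0, 0, 0, 1]`, `clustering_accuracy` returns 0.75 and `ari` returns 0.0. For larger numbers of clusters, the permutation search would normally be replaced by an assignment solver such as the Hungarian algorithm.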
Key words
cross-border ethnicity/knowledge graph/text clustering/heterogeneous graph attention network
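Funding
National Natural Science Foundation of China (62166023)
National Natural Science Foundation of China (61866019)
Natural Science Foundation of Yunnan Province (2019FA023)
Yunnan Provincial Major Science and Technology Special Project (202103AA080015)
Yunnan Provincial Major Science and Technology Special Project (202002AD080001)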
Publication year
2024