Chinese-Thai cross-lingual sensitive information recognition incorporating a sensitive dictionary and a heterogeneous graph
Zhu Xuran (朱栩冉) 1, Yu Zhengtao (余正涛) 1, Zhang Yongbing (张勇丙) 1
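Author information
- 1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan, China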
Abstract
General-purpose cross-lingual text classification models recognize sensitive information such as drugs, violence, and natural disasters inaccurately, and Chinese and Thai sensitive words have diverse representations that are hard to align, which weakens the aggregation of information across the two languages. To address these problems, a Chinese-Thai cross-lingual sensitive information recognition method incorporating a sensitive dictionary and a heterogeneous graph is proposed. A Chinese-Thai sensitive dictionary is used to build a cross-lingual heterogeneous graph structure with both document alignment and word alignment: documents and the keywords and sensitive words they contain serve as nodes, while bilingual alignment, similarity relations, and different parts of speech serve as edges. Document nodes and word nodes are represented with a multilingual pre-trained model. Input documents are encoded by a multi-layer graph convolutional neural network, and a sensitive information classifier predicts the class of each document. Experimental results show that the accuracy of the proposed method is 5.83% higher than that of the baseline model.
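The pipeline described in the abstract (build a graph from documents and dictionary words, then propagate node features with graph convolution) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy documents, the one-entry dictionary, the function names, the one-hot node features (standing in for multilingual pre-trained embeddings), and the fixed weight matrix are all assumptions made for illustration. The sketch also omits keyword nodes, similarity edges, and part-of-speech edges, and the convolution ignores edge types.

```python
import math

# Hypothetical toy inputs: tokenized documents and a bilingual sensitive dictionary.
docs = {
    "zh_doc": ["警方", "查获", "毒品"],
    "th_doc": ["ตำรวจ", "ยึด", "ยาเสพติด"],
}
sensitive_pairs = [("毒品", "ยาเสพติด")]  # aligned Chinese-Thai sensitive words
aligned_docs = [("zh_doc", "th_doc")]     # document-level alignment


def build_graph(docs, sensitive_pairs, aligned_docs):
    """Return (nodes, edges) with document nodes, sensitive-word nodes, typed edges."""
    sensitive = {w for pair in sensitive_pairs for w in pair}
    nodes = list(docs)
    edges = []
    for doc_id, tokens in docs.items():
        for tok in tokens:
            if tok in sensitive:
                if tok not in nodes:
                    nodes.append(tok)
                edges.append((doc_id, tok, "contains"))
    for zh, th in sensitive_pairs:
        if zh in nodes and th in nodes:
            edges.append((zh, th, "word_align"))
    for a, b in aligned_docs:
        edges.append((a, b, "doc_align"))
    return nodes, edges


def normalized_adjacency(nodes, edges):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2; edge types are ignored here."""
    n = len(nodes)
    index = {name: i for i, name in enumerate(nodes)}
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v, _ in edges:
        a[index[u]][index[v]] = a[index[v]][index[u]] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_hat, h, w):
    """One graph convolution step, ReLU(A_hat @ H @ W), on plain Python lists."""
    hw = [[sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
          for row in h]
    return [[max(0.0, sum(a_hat[i][k] * hw[k][j] for k in range(len(hw))))
             for j in range(len(hw[0]))] for i in range(len(a_hat))]


nodes, edges = build_graph(docs, sensitive_pairs, aligned_docs)
a_hat = normalized_adjacency(nodes, edges)

# Placeholder features: one-hot vectors instead of pre-trained model embeddings.
h = [[1.0 if i == j else 0.0 for j in range(len(nodes))] for i in range(len(nodes))]
# Arbitrary fixed weights (assumption); a real model would learn these.
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6], [0.2, 0.2]]
out = gcn_layer(a_hat, h, w)

for name, vec in zip(nodes, out):
    print(name, [round(x, 3) for x in vec])
```

After one layer, each document node's vector mixes in the features of its sensitive word and of its aligned counterpart in the other language, which is the cross-lingual aggregation effect the abstract refers to. A classifier over the document-node vectors would follow in a full model.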
Key words
sensitive dictionary/cross-lingual/heterogeneous graph/graph convolutional neural network/sensitive information identification/multi-lingual pre-trained model/bilingual alignment
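Funding
National Natural Science Foundation of China (U21B2027)
National Natural Science Foundation of China (61972186)
National Natural Science Foundation of China (62266028)
Yunnan Province Major Science and Technology Special Program (202202AD080003)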
Year of publication
2024