A Text Semantic Matching Model with Chinese Characters' Glyph, Pinyin and Sense-based Multi-knowledge and Label Embedding

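Zhao Yunxiao (赵云肖)¹, Li Ru (李茹)², Li Xinjie (李欣杰)³, Su Xuefeng (苏雪峰)⁴, Shi Yanrui (施艳蕊)³, Qiao Xueni (乔雪妮)¹, Hu Zhiwei (胡志伟)¹, Yan Zhichao (闫智超)¹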

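Author information

1. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006
2. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006; Key Laboratory of Computational Intelligence and Chinese Information Processing of the Ministry of Education, Shanxi University, Taiyuan, Shanxi 030006
3. Global Tone Communication Technology Co., Ltd. (中译语通科技股份有限公司), Beijing 100043
4. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006; School of Modern Logistics, Shanxi Vocational University of Engineering Science and Technology, Jinzhong, Shanxi 030609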
Abstract

Text semantic matching aims to determine the semantic relationship between given texts. Existing models do not exploit latent semantic information beyond the Chinese characters themselves when encoding, and they do not consider the effect of label information on classification performance. This paper therefore proposes a language-model-based text semantic matching method that uses multi-knowledge of Chinese characters' glyph, pinyin and sense together with label embedding. First, an information encoding layer encodes the glyph, pinyin and sense knowledge of the Chinese characters. Next, an information integration layer produces a joint representation that fuses this multi-knowledge. Then, a label embedding layer uses the encoded classification labels and the joint glyph-pinyin-sense representation to generate supervision-signal labels. Finally, a label prediction layer obtains a joint representation at both the text level and the label level and makes the final prediction of the semantic relationship. Experimental results on multiple widely used datasets show that the proposed model outperforms several baseline models, which supports its effectiveness.
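The four layers named in the abstract can be pictured with a small, untrained toy pipeline. The sketch below is only an illustration under stated assumptions and is not the authors' implementation: the hash-seeded `embed` function stands in for a trained encoder, fusion by averaging and the label attention step are simple placeholders, and `LEXICON`, `LABELS`, `DIM`, `encode_text` and `predict` are hypothetical names introduced here.

```python
import math
import random

DIM = 8                          # hypothetical embedding size
LABELS = ["match", "no_match"]   # hypothetical label set

# Hypothetical lexicon: character -> (glyph components, pinyin, sense gloss).
LEXICON = {
    "你": ("亻尔", "ni3", "you"),
    "您": ("你心", "nin2", "you (polite)"),
    "好": ("女子", "hao3", "good"),
}

def embed(key, dim=DIM):
    """Toy deterministic embedding; a stand-in for a trained encoder."""
    rng = random.Random(key)  # string seeds are hashed deterministically
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def encode_text(text, lexicon):
    """Encoding + integration: fuse the views of each character, then pool."""
    fused = []
    for ch in text:
        glyph, pinyin, sense = lexicon.get(ch, ("", "", ""))
        views = [embed("char:" + ch), embed("glyph:" + glyph),
                 embed("pinyin:" + pinyin), embed("sense:" + sense)]
        fused.append(mean(views))  # assumption: fusion by simple averaging
    return mean(fused)

def predict(text_a, text_b, lexicon, labels=LABELS):
    """Return one probability per label for the text pair."""
    pair = mean([encode_text(text_a, lexicon), encode_text(text_b, lexicon)])
    label_vecs = [embed("label:" + name) for name in labels]
    # Label embedding step (assumed): attend from the pair to the labels.
    attention = softmax([dot(pair, lv) for lv in label_vecs])
    label_context = [sum(w * lv[i] for w, lv in zip(attention, label_vecs))
                     for i in range(len(pair))]
    # Label prediction step (assumed): combine text-level and label-level views.
    joint = [p + c for p, c in zip(pair, label_context)]
    return softmax([dot(joint, lv) for lv in label_vecs])

probs = predict("你好", "您好", LEXICON)
print(dict(zip(LABELS, probs)))
```

Because nothing here is trained, the printed probabilities carry no meaning; the sketch only shows how glyph, pinyin and sense views could flow into a joint representation that is then compared against label embeddings.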

Keywords

Chinese characters' glyph, pinyin and sense-based multi-knowledge / label embedding / text semantic matching

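Funding

National Natural Science Foundation of China (61936012)

Key Research and Development Program of Shanxi Province (202102020101008)

Shanxi Province "Four Batches" Science and Technology Innovation Program for Medical Development (2022XM01)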

Publication year

2024

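Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences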

Indexed in: CSTPCD, CSCD, CHSSCD, Peking University Core Journals (北大核心)

Impact factor: 0.8

ISSN: 1003-0077

Number of references: 36
