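Can Character Pronunciation and Glyphs Effectively Enhance Chinese Character Representation? A Validation Based on the Named Entity Recognition Task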

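Abstract (translated from Chinese): [Objective] To verify whether the pronunciation and glyphs of Chinese characters effectively enhance character representation. [Methods] Based on the named entity recognition task, a general embedding module, a bidirectional LSTM module, and a fully connected network module with Softmax activation serve as the model's baseline character embedding layer, context encoding layer, and decoding layer, respectively. On datasets including MSRA, PeopleDaily, CCKS2017, Resume, and E-Commerce, the study compares the changes in Micro-F1 and per-entity F1 after enhancing the character embeddings with pinyin, character images, Wubi codes, Four-Corner codes, Cangjie codes, and radicals. [Results] Enhancing the character embeddings with pronunciation and glyph features lowered model performance by nearly 0.010 on the MSRA and PeopleDaily datasets, while the performance changes on the CCKS2017, Resume, and E-Commerce datasets were not statistically significant. [Limitations] Only 32×32-pixel images of simplified characters were used, which may affect the extraction of glyph features. [Conclusions] Pronunciation and glyph features enhance character representation but also introduce noise, and their effects differ across corpora and entity types.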

English title: Can Phonetics and Orthography Effectively Enhance Chinese Character Representation?

English abstract: [Objective] This study investigates the effectiveness of using phonetic and orthographic features to enhance the representation of Chinese characters. [Methods] Based on the Named Entity Recognition (NER) task, we used a general embedding module, a bidirectional LSTM module, and a fully connected network with Softmax activation as the baseline embedding, context encoding, and decoding layers, respectively. We then compared the changes in Micro-F1 scores and entity-specific F1 scores after enhancing character embeddings with Chinese pinyin, character images, Wubi input codes, Four-Corner codes, Cangjie codes, and radicals, using datasets such as MSRA, PeopleDaily, CCKS2017, Resume, and E-Commerce. [Results] Using phonetically and orthographically enhanced embeddings led to a performance decrease of nearly 0.01 on the MSRA and PeopleDaily datasets. There was no statistically significant change in performance on the CCKS2017, Resume, and E-Commerce datasets. [Limitations] Using only 32×32-pixel images of simplified Chinese characters may affect the extraction of orthographic features. [Conclusions] While phonetic and orthographic features can enhance the representation of Chinese characters, they also introduce noise. Their impact on model performance varies across corpora and entity types.
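The abstract reports results as Micro-F1 scores. As a rough illustration of that metric, the sketch below computes an entity-level Micro-F1 from BIO tag sequences. The BIO tagging scheme, the exact-span matching rule, and the function names `extract_entities` and `micro_f1` are assumptions made for this example; the record does not state how the authors scored their models.

```python
from typing import List, Set, Tuple

# (entity type, start index, exclusive end index); hypothetical representation
Entity = Tuple[str, int, int]


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect (type, start, end) spans from one BIO tag sequence."""
    entities: Set[Entity] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes the last span
        # A tag continues the open span only if it is I-<same type>.
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Pool true/false positives and false negatives over all sentences."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_entities(g_tags), extract_entities(p_tags)
        tp += len(g & p)   # spans predicted with correct type and boundaries
        fp += len(p - g)   # predicted spans not in the gold annotation
        fn += len(g - p)   # gold spans the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: one of two gold entities is recovered.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = micro_f1(gold, pred)
```

Because counts are pooled before precision and recall are computed, frequent entity types dominate the score, which is one reason to report per-entity F1 alongside Micro-F1.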

English keywords:

Glyph Embedding; Feature Fusion; Character Pronunciation; Named Entity Recognition; Character Glyph

Authors:

Duan Yufeng (段宇锋), Zhang Meicong (张美聪), Liu Yanzuo (刘宴佐), He Guoxiu (贺国秀)

Author affiliation:

Faculty of Economics and Management, East China Normal University, Shanghai 200062

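Keywords (translated from Chinese):

Character embedding; feature fusion; character pronunciation; named entity recognition; character glyph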

Publication year:

2024

DOI:

10.11925/infotech.2096-3467.2023.0665

Data Analysis and Knowledge Discovery (数据分析与知识发现)

National Science Library, Chinese Academy of Sciences

Indexed in: CSTPCD; CSSCI; CHSSCD; PKU Core (北大核心); EI

Impact factor: 1.452

ISSN: 2096-3467

Year, volume (issue): 2024, 8(10)