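Can Phonetics and Orthography Effectively Enhance Chinese Character Representation? A Validation Based on the Named Entity Recognition Task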

[Objective] This study tests whether the pronunciation (phonetics) and glyph shape (orthography) of Chinese characters effectively enhance character representations. [Methods] Using the Named Entity Recognition (NER) task, we built a baseline model with a general embedding module as the character embedding layer, a bidirectional LSTM module as the context encoding layer, and a fully connected network with Softmax activation as the decoding layer. On datasets including MSRA, PeopleDaily, CCKS2017, Resume, and E-Commerce, we compared the changes in Micro-F1 and entity-specific F1 scores after enhancing the character embeddings with pinyin, character images, Wubi codes, Four-Corner codes, Cangjie codes, and radicals. [Results] Enhancing character embeddings with phonetic and orthographic features lowered model performance by nearly 0.010 on the MSRA and PeopleDaily datasets and produced no statistically significant change on the CCKS2017, Resume, and E-Commerce datasets. [Limitations] Only 32×32-pixel images of simplified Chinese characters were used, which may have affected the extraction of glyph features. [Conclusions] Phonetic and orthographic features enhance character representations but also introduce noise, and their effect varies across corpora and entity types.
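The abstract reports Micro-F1 and entity-specific F1 scores but does not say how they were computed. The following is a minimal sketch of one common way to compute both for NER, assuming BIO-style tags and exact-span matching. The tagging scheme, the function names (`extract_entities`, `f1`, `ner_scores`), and the toy tag sequences are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def extract_entities(tags):
    """Return the set of (entity_type, start, end) spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.add((etype, start, i))
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
            else:
                start, etype = None, None  # "O" or a stray "I-" tag
    return spans


def f1(tp, n_pred, n_gold):
    """F1 from true positives, predicted count, and gold count."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def ner_scores(gold_seqs, pred_seqs):
    """Return (micro_f1, {entity_type: f1}) using exact-span matching."""
    tp, n_pred, n_gold = Counter(), Counter(), Counter()
    for gold_tags, pred_tags in zip(gold_seqs, pred_seqs):
        gold = extract_entities(gold_tags)
        pred = extract_entities(pred_tags)
        for etype, _, _ in gold:
            n_gold[etype] += 1
        for etype, _, _ in pred:
            n_pred[etype] += 1
        for etype, _, _ in gold & pred:
            tp[etype] += 1
    types = set(n_gold) | set(n_pred)
    per_type = {t: f1(tp[t], n_pred[t], n_gold[t]) for t in types}
    # Micro-averaging pools the counts over all entity types before computing F1.
    micro = f1(sum(tp.values()), sum(n_pred.values()), sum(n_gold.values()))
    return micro, per_type


# Toy example (hypothetical tags): the PER span matches, the LOC span is cut short.
gold = [["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
micro, per_type = ner_scores(gold, pred)
```

Because micro-averaging pools counts across entity types, frequent types dominate the Micro-F1 score. Reporting per-entity F1 alongside it is what allows differences across entity types to be observed.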

Glyph Embedding; Feature Fusion; Character Pronunciation; Named Entity Recognition; Character Glyph

Duan Yufeng (段宇锋), Zhang Meicong (张美聪), Liu Yanzuo (刘宴佐), He Guoxiu (贺国秀)

School of Economics and Management, East China Normal University, Shanghai 200062

2024

Data Analysis and Knowledge Discovery
National Science Library, Chinese Academy of Sciences

Indexed in: CSTPCD; CSSCI; CHSSCD; Peking University Core Journals (北大核心); EI
Impact factor: 1.452
ISSN: 2096-3467
Year, volume (issue): 2024, 8(10)