Chinese Medical New Word Detection by Chinese Character's Multi-Semantic Word Vector and Statistical Text Features

王巍洁¹, 任慧玲¹, 李晓瑛¹, 王勖¹, 张颖¹

Author information

1. Institute of Medical Information/Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100020

Abstract

[Purpose/Significance] To improve machine understanding of medical texts, improve downstream tasks such as medical natural language processing, and keep medical knowledge content timely and complete, this paper proposes a medical new word detection method that integrates the multi-semantic information of Chinese characters with statistical text features. [Method/Process] Using medical literature abstracts with standardized wording as the source for new word detection, the method extracts N-gram strings with an N-gram model and stores them in a dictionary tree (trie). For each N-gram string it simultaneously computes three indicators, one for each of three perspectives: correlation confidence for the word's internal cohesion, left-right adjacency entropy for its degree of freedom, and multi-semantic similarity (combining fine-grained character-level semantic information of Chinese characters with BERT word vector information) for its semantic similarity. It then traverses the thresholds of these indicators to evaluate how likely each N-gram string is to be a new medical word. [Result/Conclusion] From the latest 1 000 abstracts indexed by the Chinese Medical Association as of October 20, 2022, the method detected 3 263 new medical words, of which 764 remained after duplicates were removed. The proposed method shows some improvement over existing methods and, in application, effectively improves medical word segmentation: after segmentation, noun categories are clearer, concepts are more explicit, and connotations are richer. Combining the intrinsic multi-semantic information of Chinese characters with the external statistical features of characters and words not only improves a computer's ability to detect new words but also improves natural language processing of specialized and complex medical texts, which helps keep domain knowledge content up to date.
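The statistical part of the pipeline described in the abstract (N-gram extraction, an internal cohesion score, and left-right adjacency entropy, each checked against a threshold) can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' implementation: the abstract does not give the formula for correlation confidence, so minimum pointwise mutual information over binary splits is swapped in as a stand-in; a plain `Counter` replaces the dictionary tree; the semantic similarity step (character-level semantics and BERT vectors) is omitted; and all function names and threshold values are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(text, max_n=4):
    """Count every character n-gram (n = 1..max_n) and its left/right neighbours."""
    counts = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + n < len(text):
                right[gram][text[i + n]] += 1
    return counts, left, right


def entropy(neighbours):
    """Shannon entropy (in nats) of a neighbour-character distribution."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbours.values())


def cohesion(gram, counts, total_chars):
    """Minimum PMI over all binary splits of the candidate.

    Assumption: a stand-in for the paper's correlation confidence,
    whose exact definition the abstract does not state.
    """
    p_gram = counts[gram] / total_chars
    scores = []
    for k in range(1, len(gram)):
        p_a = counts[gram[:k]] / total_chars
        p_b = counts[gram[k:]] / total_chars
        scores.append(math.log(p_gram / (p_a * p_b)))
    return min(scores)


def candidates(text, max_n=4, min_count=2, min_cohesion=1.0, min_entropy=0.5):
    """Return n-grams that pass the (illustrative) cohesion and freedom thresholds."""
    counts, left, right = ngram_stats(text, max_n)
    total = len(text)
    found = []
    for gram, c in counts.items():
        if len(gram) < 2 or c < min_count:
            continue
        # Degree of freedom: the weaker of the left and right adjacency entropies.
        freedom = min(entropy(left[gram]), entropy(right[gram]))
        if cohesion(gram, counts, total) >= min_cohesion and freedom >= min_entropy:
            found.append(gram)
    return found
```

Calling `candidates(text)` on a cleaned string of abstract text returns the surviving strings. In a fuller version, these would still need the semantic similarity check and a comparison against an existing medical lexicon before being counted as new words.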

Key words

medical new word discovery / N-gram / multi-semantic word vector / correlation confidence / left-right entropy

Funding

Science and Technology Innovation 2030 Major Project on New Generation Artificial Intelligence (2020AAA0104901)

Publication year

2024

图书情报工作 (Library and Information Service)

National Science Library, Chinese Academy of Sciences

Indexed in: CSTPCD, CSSCI, CHSSCD, 北大核心 (Peking University Core Journals)

Impact factor: 2.203

ISSN: 0252-3116

Number of references: 49
