Chinese Medical New Word Detection by Chinese Character's Multi-Semantic Word Vector and Statistical Text Features
[Purpose/Significance] To improve machine understanding of medical texts and the performance of downstream medical natural language processing tasks, and to keep medical knowledge content timely and complete, this paper proposes a medical new word detection algorithm that integrates the multi-semantic information of Chinese characters with statistical text features. [Method/Process] Taking abstracts of medical literature, whose wording is standardized, as the source for new word detection, the paper extracts N-gram word strings with an N-gram model and stores them in a dictionary tree. It then computes three indicators for each string: correlation confidence, which measures the word's internal solidification degree; left-right adjacency entropy, which measures its degree of freedom; and multi-semantic similarity, which combines fine-grained semantic information of Chinese characters with BERT word vector information. The thresholds of these indicators are traversed to evaluate how likely each N-gram word string is to be a new medical word. [Result/Conclusion] From the latest 1,000 abstracts from the Chinese Medical Association as of October 20, 2022, the method identified 3,263 new words, of which 764 were retained after removing duplicates. The method incorporating multi-semantic information of Chinese characters shows some improvement over existing methods and can effectively improve medical word segmentation. After medical word segmentation, noun categories are clearer, concepts are more explicit, and connotations are richer. The algorithm improves not only the computer's ability to detect new words but also its natural language processing performance on specialized and complex medical texts, which is important for updating domain knowledge content in a timely manner.
medical new word discovery; N-gram; multi-semantic word vector; correlation confidence; left-right entropy
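
The statistical part of the pipeline described in the abstract (N-gram extraction, internal solidification, and left-right adjacency entropy) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The abstract does not define "correlation confidence", so the `cohesion` function below substitutes a minimum pointwise mutual information score as a stand-in. The function names, the `max_n` parameter, and the use of a flat counter instead of a dictionary tree are all assumptions made for illustration, and the semantic-similarity component is omitted.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(text, max_n=4):
    """Count every character n-gram (1..max_n) and its left/right neighbours.

    A flat Counter is used here for brevity; the paper stores candidates
    in a dictionary tree instead.
    """
    counts = Counter()
    left = defaultdict(Counter)   # gram -> counts of the character before it
    right = defaultdict(Counter)  # gram -> counts of the character after it
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + n < len(text):
                right[gram][text[i + n]] += 1
    return counts, left, right


def entropy(neighbours):
    """Shannon entropy (natural log) of a neighbour-frequency Counter."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbours.values())


def adjacency_entropy(gram, left, right):
    """Degree of freedom: the smaller of the left and right entropies."""
    return min(entropy(left.get(gram, Counter())),
               entropy(right.get(gram, Counter())))


def cohesion(gram, counts, total):
    """Stand-in for internal solidification (NOT the paper's formula).

    Minimum pointwise mutual information over all two-way splits of a
    gram that has at least two characters and occurs in the text.
    """
    p_gram = counts[gram] / total
    return min(
        math.log(p_gram / ((counts[gram[:k]] / total) * (counts[gram[k:]] / total)))
        for k in range(1, len(gram))
    )
```

A candidate string would then be kept only if both `cohesion` and `adjacency_entropy` exceed chosen thresholds, which corresponds to the threshold traversal mentioned in the abstract; the threshold values themselves are not given there.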