Purpose/Significance To construct a named entity corpus of traditional Chinese medicine (TCM) ancient records, and to improve the recognition accuracy and applicability of general-domain named entity recognition (NER) models in the field of TCM ancient records. Method/Process Annotation standards for entities in TCM ancient records are formulated, and 2,384 Xin'an medical records are annotated. A RoBERTa-BiLSTM-CRF model is developed: the RoBERTa pre-trained language model generates word vectors with semantic features, and the BiLSTM-CRF model learns the global semantic features of the sequences and decodes the optimal label sequence. Dictionary and rule features are incorporated to enhance the model's ability to recognize entity boundaries and categories. Result/Conclusion The model performs well on the named entity corpus of Xin'an medical cases. Integrating domain terminology dictionaries and rule-based features raises the overall F1 score to 72.8%.
Key words
traditional Chinese medicine (TCM) ancient records/named entity recognition (NER)/corpus/dictionary/natural language processing (NLP)
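The abstract states that dictionary features are incorporated to help the model find entity boundaries and categories, but it does not say how those features are computed. The following is a minimal sketch of one common option, assuming character-level tagging and forward maximum matching against a terminology dictionary. The dictionary entries, the category names (`SYMPTOM`, `HERB`, `FORMULA`), and the function name `dictionary_features` are all hypothetical illustrations, not the authors' resources or method.

```python
from typing import Dict, List

# Hypothetical mini-dictionary: term -> entity category.
# The actual terminology dictionaries and category names used in the
# paper are not given in the abstract; these entries are illustrative.
TOY_DICTIONARY: Dict[str, str] = {
    "头痛": "SYMPTOM",
    "桂枝": "HERB",
    "桂枝汤": "FORMULA",
}


def dictionary_features(text: str, dictionary: Dict[str, str]) -> List[str]:
    """Return one B-/I-/O dictionary tag per character of `text`.

    Uses forward maximum matching (an assumed strategy): at each
    position, the longest dictionary term starting there is taken.
    """
    max_len = max((len(term) for term in dictionary), default=0)
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        matched = False
        # Try the longest candidate first so "桂枝汤" wins over "桂枝".
        for length in range(min(max_len, len(text) - i), 0, -1):
            category = dictionary.get(text[i:i + length])
            if category is not None:
                tags[i] = f"B-{category}"
                for j in range(i + 1, i + length):
                    tags[j] = f"I-{category}"
                i += length
                matched = True
                break
        if not matched:
            i += 1  # no term starts here; leave the "O" tag
    return tags


if __name__ == "__main__":
    sentence = "头痛服桂枝汤"
    for char, tag in zip(sentence, dictionary_features(sentence, TOY_DICTIONARY)):
        print(char, tag)
```

In a pipeline like the one described, such per-character tags could be mapped to small embeddings and concatenated with the RoBERTa output before the BiLSTM layer. That wiring is likewise an assumption here, since the abstract does not specify where the features enter the model.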