To address the scarcity of text corpora on cotton pests and diseases, the lack of a Chinese named entity recognition corpus for this domain, and the complexity, diversity, and uneven distribution of cotton pest and disease entities, a Chinese entity recognition corpus, CDIPNER, covering 11 categories of cotton pest and disease entities was constructed, and a named entity recognition model based on RoBERTa multi-feature fusion was proposed. The model used the RoBERTa pre-trained model, which has a stronger masked-learning ability, to convert characters into embedding vectors; extracted feature vectors jointly with BiLSTM and IDCNN models to capture the temporal and spatial features of the text, respectively; fused the extracted feature vectors with a multi-head self-attention mechanism; and finally generated the predicted sequences with the CRF algorithm. The results showed that the model achieved a precision of 96.60%, a recall of 95.76%, and an F1 value of 96.18% for named entities in cotton pest and disease text; it also performed well on public datasets such as ResumeNER. These results indicate that the model can effectively identify named entities of cotton pests and diseases and has a certain generalisation ability.
Key words
Cotton/Pests and diseases/RoBERTa model/Named entity recognition/Multi-feature fusion/Multi-head attention mechanism
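As a quick consistency check on the reported metrics, the F1 value can be recomputed as the harmonic mean of precision and recall. The sketch below is illustrative only: it assumes that the reported 96.60% is precision (the figure that pairs with recall in the F1 formula), and the helper name `f1_score` is hypothetical and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, converted from percentages to fractions.
# Assumption: the 96.60% figure is precision.
precision = 0.9660
recall = 0.9576

f1 = f1_score(precision, recall)
print(f"F1 = {f1 * 100:.2f}%")
```

Under this assumption, the recomputed value agrees with the reported F1 of 96.18% to two decimal places.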