Chinese Scientific Literature Annotation Method Based on a Large Language Model
High-quality annotated data are crucial for Natural Language Processing (NLP) tasks in the field of Chinese scientific literature. An annotation method based on a Large Language Model (LLM) was proposed to address the lack of high-quality annotated corpora and the inconsistency and inefficiency of manual annotation in Chinese scientific literature. First, a fine-grained annotation specification suitable for multi-domain Chinese scientific literature was established to clarify entity types and annotation granularity. Second, a structured text annotation prompt template and a generation parser were designed. The annotation task was formulated as a single-stage, single-round question-and-answer process: the annotation specification and the text to be annotated were filled into the corresponding slots of the prompt template to construct the task prompt, which was then fed to the LLM to generate output text containing the annotation information. Finally, the structured annotation data were obtained by the parser. Using this LLM-based prompt learning approach, the Annotated Chinese Scientific Literature (ACSL) entity dataset was constructed, containing 10 000 annotated documents and 72 536 annotated entities distributed across 48 disciplines. For ACSL, three baseline models based on RoBERTa-wwm-ext, a configuration of the Robustly Optimized Bidirectional Encoder Representations from Transformers (RoBERTa) approach, were proposed. The experimental results demonstrate that the BERT+Span model performs best on long-span entity recognition in Chinese scientific literature, achieving an F1 score of 0.335. These results serve as benchmarks for future research.
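As a minimal illustration of the slot-filling and parsing pipeline described above, the Python sketch below shows one plausible implementation. The template wording, the JSON output contract, and the function names (build_prompt, parse_annotations) are illustrative assumptions, not the paper's actual template or parser.

```python
import json
import re

# Hypothetical slot-based prompt template approximating the paper's
# single-stage, single-round question-and-answer setup. The actual
# template wording and output format used for ACSL are assumptions here.
PROMPT_TEMPLATE = (
    "You are an annotator of Chinese scientific literature.\n"
    "Annotation specification:\n{specification}\n\n"
    "Extract all entities from the text below and return them as a JSON "
    'array of objects with keys "entity" and "type".\n'
    "Text:\n{text}\n"
)

def build_prompt(specification: str, text: str) -> str:
    """Fill the annotation specification and the text to be annotated
    into the corresponding slots of the prompt template."""
    return PROMPT_TEMPLATE.format(specification=specification, text=text)

def parse_annotations(llm_output: str) -> list[dict]:
    """Minimal generation parser: recover structured annotation records
    by locating the first JSON array in the model's free-form output."""
    match = re.search(r"\[.*\]", llm_output, re.DOTALL)
    if match is None:
        return []  # no parsable annotation block in this response
    try:
        records = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []  # malformed generation; caller may retry
    # Keep only well-formed records carrying both required fields.
    return [r for r in records
            if isinstance(r, dict) and "entity" in r and "type" in r]

# Example single-round use (the LLM client itself is assumed):
#   prompt = build_prompt(spec_text, document_text)
#   annotations = parse_annotations(llm.generate(prompt))
```

In such a single-round setup, build_prompt produces the task prompt sent to the LLM in one call, and parse_annotations recovers the structured records from its single response.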
text annotation method; Chinese scientific literature; Large Language Model (LLM); prompt learning; information extraction