Exploration and Practice of Classification Indexing Combined with Large Language Models
[Purpose/Significance] Document classification is one of the fundamental tasks of information service institutions such as libraries. Limited human resources make it difficult to categorize the vast number of documents, and current automatic indexing technologies are not yet fully integrated into the indexing process. Large language models (LLMs), with their strong natural language understanding and processing capabilities, have been applied to natural language processing tasks such as text generation, automatic summarization, and text classification, and can be integrated into the entire classification process. [Method/Process] Drawing on the long-term practical experience of the National Newspaper Index, this study examines how to introduce LLMs into the classification and indexing process from three aspects: reducing the reading pressure on indexers, using LLMs directly for classification, and combining LLMs with automatic indexing models. A prompt-assisted topic classification model is designed that uses the LLM to analyze and extract document content and guides it to output concise summaries. This allows indexers to quickly understand the basics of the research, grasp the key concepts and their interrelationships, and thus determine quickly and accurately how to classify the collections. [Results/Conclusions] When the LLM cannot be used directly for text classification based on the Chinese Library Classification (CLC), it is combined with existing automatic models to produce the ACBKSY model. The overall classification accuracy of the model improved by 2.16%, and the non-rejection accuracy increased by 3.77%. On this basis, the actual indexing workflow is optimized to make the indexing work more systematic and coherent, so that every step from document input to final classification is more efficient and accurate. The optimized workflow has been put into use for the R and F categories of the collection, where it can improve indexing efficiency by 1.1 to 1.4 times. The study still has limitations. The LLM was not given sufficient training to fully understand the category settings of the CLC and some simple division rules. In addition, classification based on the CLC is essentially hierarchical classification, and how to guide the LLM to output classification results step by step through multiple rounds of dialogue needs further study.
automatic indexing; large language model (LLM); ERNIE Bot; GPT-4
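The abstract reports two figures, overall classification accuracy and non-rejection accuracy, without defining them. The sketch below shows one common way such metrics are computed for a classifier that may reject (abstain on) a document. It is an assumption, not the paper's stated definition: overall accuracy is taken as correct predictions over all documents, and non-rejection accuracy as correct predictions over only the documents the model did not reject. The `Prediction` type, the function names, and the sample class labels are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Prediction:
    gold: str                 # class assigned by a human indexer (illustrative label)
    predicted: Optional[str]  # None means the model rejected the document


def overall_accuracy(preds: Sequence[Prediction]) -> float:
    """Correct predictions divided by ALL documents (assumed definition).

    A rejected document counts as not correct.
    """
    if not preds:
        return 0.0
    correct = sum(1 for p in preds if p.predicted is not None and p.predicted == p.gold)
    return correct / len(preds)


def non_rejection_accuracy(preds: Sequence[Prediction]) -> float:
    """Correct predictions divided by documents the model did NOT reject
    (assumed definition)."""
    answered = [p for p in preds if p.predicted is not None]
    if not answered:
        return 0.0
    correct = sum(1 for p in answered if p.predicted == p.gold)
    return correct / len(answered)


# Made-up sample: two correct, one rejected, one wrong.
sample = [
    Prediction(gold="R5", predicted="R5"),
    Prediction(gold="F2", predicted="F2"),
    Prediction(gold="R1", predicted=None),
    Prediction(gold="F8", predicted="F1"),
]

if __name__ == "__main__":
    print(f"overall accuracy:       {overall_accuracy(sample):.3f}")
    print(f"non-rejection accuracy: {non_rejection_accuracy(sample):.3f}")
```

Under these assumed definitions, non-rejection accuracy is never lower than overall accuracy, because rejected documents are removed from the denominator rather than counted as errors.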