A Study of Automated Deep Classification of Literature Based on Chinese Library Classification
Deep classification of literature based on the Chinese Library Classification (CLC) involves two classic natural language processing problems: Extreme Multi-label Text Classification (XMC) and Hierarchical Text Classification (HTC). However, existing research on CLC-based literature classification generally treats the task as ordinary text classification. Because it does not exploit the core characteristics of the problem, this work generally performs poorly on deep classification, and in some cases is not feasible at all. In contrast to such studies, this paper starts from an in-depth analysis of the characteristics and difficulties of CLC-based literature classification, examines deep classification and related solutions from the perspectives of XMC and HTC, and adapts and extends those solutions to the CLC setting. The result improves classification accuracy and also extends the depth and breadth of classification. The proposed model first uses a lightweight deep learning model suited to the XMC problem to extract semantic features of the text as the primary basis for classification. Then, to address the HTC problem in CLC classification, it uses a Learning to Rank (LTR) framework to incorporate multiple features, including hierarchical structure information, as an auxiliary basis for classification, making fuller use of the information and knowledge contained in the text semantics and in the classification system. The model combines the strong semantic understanding of deep learning models with the interpretability of machine learning models. It is also extensible: new expert-designed features can be incorporated later with relatively little effort. Moreover, the model is relatively lightweight and can handle tens of thousands of classification labels with limited computational resources, laying a good foundation for full-depth classification based on the CLC.
Keywords: Extreme Multi-label Text Classification (XMC); Hierarchical Text Classification (HTC); Deep learning; Chinese Library Classification
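The two-stage idea described in the abstract (semantic scores first, then re-ranking with hierarchy-aware features) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's implementation: the function names `ancestors` and `rerank`, the single "parent support" feature, the hand-set weights standing in for a trained LTR model, and the simplification that every proper prefix of a CLC code is one of its ancestors.

```python
from typing import Dict, List, Tuple


def ancestors(code: str) -> List[str]:
    """Return all proper prefixes of a code.

    Assumption: prefixes are treated as ancestors in the hierarchy,
    which only approximates the real CLC structure.
    """
    return [code[:i] for i in range(1, len(code))]


def rerank(
    semantic: Dict[str, float],
    weights: Tuple[float, float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Re-rank candidate labels using a semantic score plus one hierarchy feature.

    semantic: candidate label -> score from a first-stage model (hypothetical values).
    weights:  (w_semantic, w_hierarchy); hand-set here, whereas an LTR model
              would learn how to combine the features from training data.
    """
    w_sem, w_hier = weights
    scored = []
    for label, score in semantic.items():
        # Hierarchy feature: the best first-stage score among candidates
        # that are ancestors of this label (0.0 if none is a candidate).
        support = max(
            (semantic[a] for a in ancestors(label) if a in semantic),
            default=0.0,
        )
        scored.append((label, w_sem * score + w_hier * support))
    # Highest combined score first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical first-stage scores for four candidate CLC codes.
    candidates = {"TP391": 0.40, "TP391.1": 0.35, "G25": 0.38, "TP18": 0.20}
    for label, score in rerank(candidates, weights=(1.0, 0.5)):
        print(f"{label}\t{score:.2f}")
```

In this toy input the deeper code `TP391.1` moves ahead of `TP391` and `G25` because its ancestor `TP391` is itself a strong candidate. That is the kind of signal a hierarchy feature can contribute; a real ranker would combine many such features with learned weights.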