A Survey of Text Classification Algorithms and Application Scenarios

刘晓明¹, 李丞正旭¹, 吴少聪¹, 张宇辰¹, 白红艳¹, 程泽华¹, 陈卓¹, 李永峰¹, 兰钰¹, 沈超¹


Author information

1. Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049

Abstract

With the advent of the big data era, the volume of text on the internet has grown explosively. As one of the most important technologies in natural language processing, text classification is widely applied in areas such as sentiment analysis, news categorization, natural language inference, topic labeling, extractive question answering, and fake news detection. From the maturing theory of traditional machine learning methods to the rise of deep learning methods, research models and ideas have evolved continuously, and new methods, datasets, and evaluation metrics keep emerging, enriching the field and yielding excellent theoretical and practical results. Nevertheless, the continued development of new technologies and the growing diversity of business application scenarios have introduced new problems and challenges, such as text representation learning on imbalanced data in data-constrained scenarios and text classification in few-shot scenarios. In response to these problems and challenges, this paper presents a systematic survey of text classification methods and discusses the technical challenges that current methods face in practical application scenarios, together with future research directions. Specifically, the paper covers seven parts: (1) it introduces the fundamentals of text classification, including common notation, computational paradigms, and text preprocessing techniques; (2) it summarizes text classification methods based on traditional machine learning and, to help readers choose a suitable model for a given application scenario, summarizes the kinds of text classification problems each classifier handles well along with its advantages and disadvantages; (3) it reviews text classification methods based on deep learning, grouping them by the core ideas of the field's representative techniques, describing the main methods in each category, and summarizing their advantages and disadvantages; (4) to help readers validate text classification models, it systematically summarizes the relevant datasets for the seven most widely used application scenarios; (5) it describes in detail the model evaluation methods commonly used under different task objectives, so that model performance can be assessed quantitatively and reasonably; (6) building on the above, it summarizes and compares the performance of different types of text classification algorithms in typical application scenarios; (7) it summarizes the challenges facing current text classification technology and the important future research directions from two perspectives, data constraints and model computation. By tracing the development of text classification research, the paper provides a detailed summary and comparative analysis of the representative techniques involved, addresses the gap in application-oriented surveys of frontier techniques in the field, and offers a comprehensive reference for researchers getting started on related problems.

Keywords

text classification / machine learning / deep learning / evaluation metrics / limited data


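Funding

National Key Research and Development Program of China (2020YFB1406900)

National Natural Science Foundation of China (62272371)

National Natural Science Foundation of China (61902308)

National Natural Science Foundation of China (U21B2018)

National Natural Science Foundation of China (62103323)

National Natural Science Foundation of China (62161160337)

National Natural Science Foundation of China (61822309)

National Natural Science Foundation of China (61773310)

National Postdoctoral Program for Innovative Talents (BX20190275)

National Postdoctoral Program for Innovative Talents (BX20200270)

China Postdoctoral Science Foundation, General Program (2019M663723)

China Postdoctoral Science Foundation, General Program (2021M692565)

Fundamental Research Funds for the Central Universities (xzy012024144)

Key Industry Innovation Program of Shaanxi Province (2021ZDLGY01-02)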

Publication year

2024

Chinese Journal of Computers (计算机学报)

China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences

CSTPCD / CSCD / Peking University Core Journals

Impact factor: 3.18

ISSN: 0254-4164

Citations: 2

References: 9
