A Survey of Text Classification Algorithms and Application Scenarios
With the advent of the big data era, the volume of text on the internet has grown explosively. As one of the most important technologies in natural language processing, text classification has a wide range of applications, such as sentiment analysis, news categorization, natural language inference, topic labeling, extractive question answering, and fake news detection. From the maturing of traditional machine learning methods to the rise of deep learning methods, text classification models and ideas have evolved continuously, and new methods, datasets, and evaluation metrics keep emerging, enriching research in the field and yielding strong theoretical and practical results. Nevertheless, with the rapid development of new technologies, rich and diverse business application scenarios have introduced many complex technical challenges to this field, such as text representation learning with imbalanced data and text classification in few-shot learning scenarios. In response to these challenges, this paper presents an overall survey of text classification methods and comprehensively discusses the technical challenges faced by current methods and future research directions. More specifically, this paper consists of seven parts. (1) It introduces the basic knowledge of text classification, including the definitions of common symbols, computational paradigms, and text preprocessing techniques. (2) It summarizes text classification methods based on traditional machine learning. To help readers select appropriate models for different application scenarios, it also summarizes the advantages and disadvantages of the different classifiers, i.e., what kinds of text classification problems each handles well. (3) It reviews text classification methods based on deep learning, categorized according to the key ideas of the representative technologies in the field; the main methods in each category are then described and their advantages and disadvantages summarized. (4) To help readers verify the validity of text classification models, it systematically summarizes the relevant datasets for the seven most widely used application scenarios of text classification. (5) It introduces in detail the commonly used model evaluation methods under different task objectives, so that model performance can be evaluated quantitatively and reasonably. (6) Based on the above, it summarizes and compares the performance of different types of text classification algorithms in typical application scenarios. (7) It summarizes the challenges faced by existing text classification technology and important future research directions from two aspects, i.e., data limitations and model computational performance. By tracing the development of text classification research, this paper provides a detailed summary and comparative analysis of the representative technologies involved, which addresses the gap in application-oriented overviews of innovative technologies in the field and offers a comprehensive reference for researchers to quickly get started on related problems.
text classification; machine learning; deep learning; evaluation metrics; limited data