Hierarchical Document Classification Method Based on Improved Self-attention Mechanism and Representation Learning
An essential task of document classification is learning effective representations of input features; sentence and document vector representations support downstream natural language processing tasks such as text sentiment analysis and data leakage prevention. Feature representation is also increasingly one of the keys to the performance bottleneck and interpretability of document classification. A hierarchical document classification model is proposed to address the extensive repetitive computation and lack of interpretability of existing hierarchical models, and the effects of sentence and document representations on classification performance are investigated. The proposed model integrates a sentence encoder and a document encoder, each of which fuses input feature vectors with an improved self-attention mechanism; stacking them into a hierarchy enables hierarchical processing of document-level data, simplifying computation while improving the interpretability of the model. Compared with a model that uses only the special token vector of a pre-trained model as the sentence representation, the proposed model achieves an average performance improvement of 4% on five public document classification datasets, and about 2% on average over a model that uses the mean of the attention outputs over the word vector matrix.
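The two-level architecture described above can be illustrated with a minimal sketch: a sentence encoder pools word vectors into sentence vectors via self-attention, and a document encoder applies the same pooling to the sentence vectors. This is a hypothetical toy implementation in plain Python (standard scaled dot-product self-attention with mean pooling), not the paper's improved self-attention mechanism, and all function names and dimensions are illustrative assumptions.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention_pool(vectors):
    """Fuse a list of input vectors into one vector: scaled dot-product
    self-attention over the sequence, then mean pooling (a simplified
    stand-in for the paper's improved self-attention)."""
    d = len(vectors[0])
    attended = []
    for q in vectors:
        scores = [dot(q, k) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        attended.append([sum(w * v[i] for w, v in zip(weights, vectors))
                         for i in range(d)])
    # mean-pool the attended vectors into a single representation
    return [sum(a[i] for a in attended) / len(attended) for i in range(d)]

def encode_document(document):
    """Hierarchy: the sentence encoder pools each sentence's word vectors
    into a sentence vector; the document encoder pools the sentence
    vectors into a document vector (a classifier head would follow)."""
    sentence_vecs = [self_attention_pool(words) for words in document]
    return self_attention_pool(sentence_vecs)

# toy document: 2 sentences, each a list of 4-dimensional "word vectors"
doc = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
]
doc_vec = encode_document(doc)
print(len(doc_vec))  # 4-dimensional document representation
```

Processing a document sentence-by-sentence like this avoids attending over every word pair in the whole document at once, which is the source of the computational savings the abstract refers to.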