Hierarchical Document Classification Method Based on Improved Self-attention Mechanism and Representation Learning

A fundamental problem in document classification is how to represent input features efficiently. Sentence and document vector representations can also support downstream natural language processing tasks such as text sentiment analysis and data leakage prevention. Feature representation has increasingly become both a performance bottleneck for document classification and a key to model interpretability. To address the heavy repeated computation and the lack of interpretability in existing hierarchical models, a hierarchical document classification model is proposed, and the effect of sentence and document representation methods on classification performance is studied. The proposed model integrates a sentence encoder and a document encoder, each of which fuses its input feature vectors with an improved self-attention mechanism. Together they form a hierarchy that processes document-level data level by level, which simplifies computation and improves the interpretability of the model. Compared with a model that uses only the special token vector of a pre-trained language model as the sentence representation, the proposed model achieves an average performance improvement of 4% on five public document classification datasets, and it is about 2% better on average than a model that uses the mean of the attention outputs over the word vector matrix.
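The two-level structure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the improved self-attention mechanism, so the code below substitutes plain scaled dot-product attention pooling with a fixed query vector; the names `attention_pool`, `encode_document`, `sentence_query`, and `document_query` are hypothetical and chosen only for illustration. It shows word vectors being fused into sentence vectors, and sentence vectors into one document vector, with the attention weights kept at each level so they can be inspected.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Fuse equal-length vectors into one vector by attention pooling.

    Assumption: a single query vector scores each input with a scaled
    dot product. This stands in for the paper's improved self-attention,
    which the abstract does not describe.
    """
    dim = len(query)
    scores = [
        sum(v_i * q_i for v_i, q_i in zip(vec, query)) / math.sqrt(dim)
        for vec in vectors
    ]
    weights = softmax(scores)
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, vectors))
        for i in range(dim)
    ]
    return fused, weights


def encode_document(document, sentence_query, document_query):
    """Encode a document given as a list of sentences of word vectors.

    Level 1 pools the word vectors of each sentence into a sentence vector.
    Level 2 pools the sentence vectors into a document vector.
    The weights from both levels are returned for inspection.
    """
    sentence_vectors = []
    word_weights = []
    for sentence in document:
        vec, weights = attention_pool(sentence, sentence_query)
        sentence_vectors.append(vec)
        word_weights.append(weights)
    doc_vector, sentence_weights = attention_pool(sentence_vectors, document_query)
    return doc_vector, sentence_weights, word_weights


# Toy input: two sentences with 2-dimensional word vectors (made-up values).
toy_document = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]],
]
doc_vector, sentence_weights, word_weights = encode_document(
    toy_document, sentence_query=[1.0, 0.0], document_query=[0.0, 1.0]
)
```

In this sketch each sentence is pooled once and only the resulting sentence vectors reach the document level, and the per-level weights indicate which words and sentences contributed most. A trained model would learn the query vectors and feed the document vector to a classifier; both steps are omitted here.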

Keywords: Sentence representation; Document representation; Attention mechanism; Document classification; Model interpretability

Liao Xingbin (廖兴滨), Qian Yangge (钱杨舸), Wang Qianlei (王乾垒), Qin Xiaolin (秦小林)

Laboratory of Automated Reasoning, Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610213

School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100080

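Funding: Sichuan Science and Technology Program (2019ZDZX0006); Sichuan Science and Technology Program (2020YFQ0056); Chinese Academy of Sciences STS Program, Regional Key Project, Class A (KFJ-STS-QYZD-2021-21-001)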

2024

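计算机科学 (Computer Science)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)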

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, volume (issue): 2024, 51(2)