Variational information bottleneck and multi-task learning for multi-domain text classification
Ma Yi (马儀) 1, Shao Yubin (邵玉斌) 1, Du Qingzhi (杜庆治) 1, Long Hua (龙华) 2, Ma Dinan (马迪南) 3
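Author information
- 1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500
- 2. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; Yunnan Key Laboratory of Media Convergence, Kunming 650032
- 3. Yunnan Key Laboratory of Media Convergence, Kunming 650032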
Abstract
Multi-domain text classification suffers from domain differences and vocabulary differences, which lower classification accuracy and generalization, and traditional methods do not perform well on it. To address this problem, this paper proposes a multi-domain text classification method based on a variational information bottleneck multi-task algorithm. Each task is formulated as a hierarchical representation learning problem in which task-specific features are extracted from comprehensive features. First, following the information bottleneck principle, the redundant information between the comprehensive features and the task-specific features is modeled as additive noise with zero mean and a diagonal covariance matrix, and the reparameterization method lets this noise take part in model training. Second, the model loss function is constructed from the variational bound of the information bottleneck to limit the information flow in the model, so that the comprehensive features with additive noise are decoupled into task-specific features. Finally, the classifier in the decoder processes the task-specific features to produce the text classification result. Experiments show that the model achieves an average classification accuracy of 92.17% on the FDU-MTL multi-domain text classification dataset, a clear improvement over several compared models, and that it offers better interpretability.
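The abstract names the ingredients of the method (zero-mean diagonal additive noise, reparameterization, a loss built from the variational bound) but gives no formulas. The sketch below is one common way such a loss is written and is not taken from the paper. It rests on several assumptions: the noisy feature is z = h + sigma * eps with eps drawn from a standard normal, the bound uses a standard normal prior so that the compression term is a closed-form KL divergence, and the trade-off weight beta, the function names, and the toy two-class classifier are all hypothetical.

```python
import math
import random


def add_noise(features, log_var, rng):
    """Reparameterized additive noise: z = h + sigma * eps, eps ~ N(0, 1).

    sigma = exp(0.5 * log_var) is per-dimension (diagonal covariance).
    """
    return [h + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for h, lv in zip(features, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


def cross_entropy(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_norm - logits[label]


def vib_loss(logits, label, mean, log_var, beta=1e-3):
    """Assumed VIB-style objective: prediction term + beta * compression term."""
    return cross_entropy(logits, label) + beta * kl_to_standard_normal(mean, log_var)


# Toy usage with made-up numbers (not data from the paper).
rng = random.Random(0)
shared = [0.5, -1.0, 2.0]      # stand-in for a comprehensive feature vector
log_var = [-2.0, 0.0, -4.0]    # stand-in for learned log-variances of the noise
z = add_noise(shared, log_var, rng)
logits = [sum(z), -sum(z)]     # hypothetical linear two-class classifier
loss = vib_loss(logits, 1, shared, log_var)
```

In a real model the mean and log-variance would be produced by trainable layers and the loss would be minimized with automatic differentiation. The reparameterized form matters because it keeps the sampled feature differentiable with respect to those parameters.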
Keywords
Information bottleneck/Multi-task model/Multi-domain/Variational boundary/Interpretability
Funding
Yunnan Key Laboratory of Media Convergence Project (320225403)
Publication year
2024