Hierarchical Classification of Very Short Texts Combining Semantic Interpretation and DeBERTa


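Chinese abstract (translated): Hierarchical text classification has important applications in scenarios such as topic classification of social comments and search-term classification. The data in these scenarios are often very short texts whose information is sparse and sensitive, which poses a serious challenge for feature representation and classification performance; the complexity and interdependence of the hierarchical label space make the task harder still. To address this, a method that fuses semantic interpretation with the DeBERTa model is proposed. Its core idea is twofold: introduce the semantic interpretation of each word or phrase in its specific context to supplement and refine the content information available to the model, and use DeBERTa's disentangled attention mechanism and enhanced mask decoder to better capture positional information and improve feature extraction. The method first tokenizes and part-of-speech tags the training text, then builds a GlossDeBERTa model to perform high-accuracy semantic disambiguation and obtain a sequence of semantic interpretations. Next, the SimCSE framework is used to vectorize the interpretation sequence so that its sentence-level information is better represented. Finally, the training text is passed through the DeBERTa network to obtain a feature vector for the original text, which is added to the corresponding feature vector of the interpretation sequence and fed to a multi-class classifier. The experiments take the very short text portion of TREC, a short-text hierarchical classification dataset, and augment it; the resulting dataset has an average length of 12 words. Several sets of comparison experiments show that the proposed DeBERTa model with fused semantic interpretation performs best: its Accuracy, F1-micro, and F1-macro on the validation and test sets are substantially higher than those of the other models compared, and it handles hierarchical classification of very short texts well.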
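The final fusion step described in the abstract (adding the text feature vector to the interpretation feature vector and passing the sum to a multi-class classifier) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names `fuse` and `classify`, the three-dimensional toy vectors, and the linear-plus-softmax classifier are all assumptions, and the vectors merely stand in for the DeBERTa and SimCSE encoder outputs.

```python
import math

def fuse(text_vec, gloss_vec):
    """Element-wise sum of the text feature vector and the interpretation feature vector."""
    if len(text_vec) != len(gloss_vec):
        raise ValueError("both vectors must have the same dimension")
    return [t + g for t, g in zip(text_vec, gloss_vec)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(fused, weights, bias):
    """Hypothetical classifier head: one linear layer (one weight row per class) plus softmax."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Toy values standing in for encoder outputs (assumed, not from the paper).
text_vec = [0.2, -0.1, 0.5]    # would come from the DeBERTa encoder
gloss_vec = [0.1, 0.3, -0.2]   # would come from the SimCSE encoder
weights = [[1.0, 0.0, 0.0],    # class 0
           [0.0, 1.0, 0.0]]    # class 1
bias = [0.0, 0.0]

fused = fuse(text_vec, gloss_vec)
probs = classify(fused, weights, bias)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Summation keeps the fused vector at the same dimension as each input, which requires the two encoders to produce vectors of equal size; the abstract does not say how that alignment is achieved.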

English title: Very Short Texts Hierarchical Classification Combining Semantic Interpretation and DeBERTa

English abstract: Hierarchical text classification has important applications in scenarios such as social comment topic classification and search term classification. The data in these scenarios often consist of very short texts, which is reflected in the sparsity and sensitivity of the information. This poses great challenges for model feature representation and classification performance, and the complexity and interdependence of the hierarchical label space further increase the difficulty. In view of this, a method fusing semantic interpretation and the DeBERTa model is proposed. The core idea of the method is as follows: introduce the semantic interpretation of individual words or phrases in specific contexts to supplement and optimize the content information acquired by the model, and combine the disentangled attention and enhanced mask decoder of the DeBERTa model to better capture positional information and improve feature extraction. The method first performs tokenization and part-of-speech tagging on the training text, and then constructs the GlossDeBERTa model to perform semantic disambiguation with high accuracy and obtain the semantic interpretation sequence. The SimCSE framework is then used to vectorize the interpretation sequence so as to better represent the sentence information it contains. Finally, the training text passes through the DeBERTa network to produce the feature vector representation of the original text, which is summed with the corresponding feature vector of the interpretation sequence and passed to the multi-class classifier. The experiments select the very short text portion of the short text hierarchical classification dataset TREC and expand the data, resulting in a dataset with an average length of 12 words. Multiple sets of comparison experiments show that the proposed DeBERTa model with fused semantic interpretation has the best performance: its Accuracy, F1-micro, and F1-macro values on the validation and test sets are considerably better than those of the other models, and it copes well with the task of hierarchical classification of very short texts.
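The three reported metrics can be computed as in the minimal sketch below, assuming single-label multi-class predictions. The function name `evaluate` and the toy label lists are illustrative assumptions and are not results from the paper.

```python
from collections import Counter

def evaluate(y_true, y_pred):
    """Return (accuracy, micro-F1, macro-F1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    # Micro-F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages the per-class F1 scores with equal weight per class.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return accuracy, micro, macro

# Toy labels for illustration only.
y_true = ["LOC", "NUM", "HUM", "LOC"]
y_pred = ["LOC", "NUM", "LOC", "LOC"]
accuracy, micro_f1, macro_f1 = evaluate(y_true, y_pred)
```

In the single-label setting micro-F1 coincides with accuracy, so macro-F1 is the metric that reveals weak performance on rare classes.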

English keywords:

Very short text; Hierarchical classification; Semantic interpretation; DeBERTa; GlossDeBERTa; SimCSE

Authors:

陈昊飏 (Chen Haoyang), 张雷 (Zhang Lei)


Affiliation:

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023

Keywords:

very short text; hierarchical classification; semantic interpretation; DeBERTa; GlossDeBERTa; SimCSE

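Funding:

National Natural Science Foundation of China; National Natural Science Foundation of China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University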

Grant numbers:

62192783; 62376117

Publication year:

2024

DOI:

10.11896/jsjkx.231100134

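Computer Science (计算机科学)

Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)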


CSTPCD; Peking University Core Journal

Impact factor: 0.944

ISSN: 1002-137X

Year, volume (issue): 2024, 51(5)

Number of references: 28