Research on Data Augmentation for Extractive Reading Comprehension


Abstract: In extractive reading comprehension, language models perform poorly when training data is scarce. The EDA method can effectively increase the amount of data, but it causes a loss of semantic information in the data, which degrades model training. To address these problems, this work builds on EDA and proposes a data augmentation method for extractive reading comprehension in few-sample settings, augmenting the data at the word level and the sentence level while preserving the correct answers to the questions. To apply augmentation to the word that has the least impact on sentence semantics, the data augmentation method based on semantic similarity (DASS) computes the semantic similarity of a sentence before and after a given word is deleted, uses it to judge that word's impact on the sentence's semantics, and selects the word with the least impact for augmentation. This improves the quality of the training data and, in turn, the robustness of the language model. Experimental results on the HotpotQA dataset show that DASS can address the problem of insufficient semantic information when the number of samples is small: with 500 samples, the F1 score of the model's predictions improves by 23.94%, and when the method is applied to the whole dataset, the F1 score improves by 2.54%.
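The word-selection step described in the abstract (delete each word in turn, compare the sentence before and after, pick the word whose removal changes the meaning least) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which similarity model DASS uses, so `bow_cosine` is a toy bag-of-words stand-in, and the names `least_impact_index` and `protected` (the answer tokens that must be kept) are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, FrozenSet, List, Optional


def bow_cosine(a: List[str], b: List[str]) -> float:
    """Toy similarity (assumption): cosine between bag-of-words count vectors.

    A real system would more likely compare sentence embeddings; the
    abstract does not specify the similarity model.
    """
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def least_impact_index(
    tokens: List[str],
    protected: FrozenSet[str] = frozenset(),
    similarity: Callable[[List[str], List[str]], float] = bow_cosine,
) -> Optional[int]:
    """Return the index of the word whose deletion keeps the sentence
    most similar to the original, skipping protected (answer) tokens.

    Returns None if every token is protected. Ties go to the earliest index.
    """
    best_idx, best_sim = None, -1.0
    for i, tok in enumerate(tokens):
        if tok in protected:
            continue  # never touch tokens that belong to the answer
        reduced = tokens[:i] + tokens[i + 1:]  # sentence with word i deleted
        sim = similarity(tokens, reduced)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx


sentence = ["the", "cat", "sat", "on", "the", "mat"]
idx = least_impact_index(sentence, protected=frozenset({"mat"}))
augmented = sentence[:idx] + sentence[idx + 1:]
```

With this toy similarity, a repeated word such as "the" is the cheapest to delete, because the reduced sentence still contains every word type. Passing a different `similarity` callable swaps in another scoring model without changing the selection loop.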

Keywords:

extractive reading comprehension; EDA; data augmentation; semantic similarity; machine reading comprehension

Authors:

Hu Xinrong (胡新荣), Xu Wei (徐伟), Luo Ruiqi (罗瑞奇), Liu Junping (刘军平), Zhu Qiang (朱强), Yang Jie (杨捷), Li Lijun (李立军)


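Author affiliations:

School of Computer Science and Artificial Intelligence, Wuhan Textile University (武汉纺织大学计算机与人工智能学院)

Hubei Provincial Engineering Research Center for Garment Informatization (湖北省服装信息化工程技术研究中心), Wuhan, Hubei 430200

School of Computing and Information Technology, University of Wollongong, Wollongong 2522

Ningbo Cixing Co., Ltd. (宁波慈星股份有限公司), Ningbo, Zhejiang 315000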



Publication year:

2024

DOI:

10.11907/rjdk.231137

Journal: Software Guide (软件导刊)

Publisher: Hubei Information Society (湖北省信息学会)


Impact factor: 0.524

ISSN: 1672-7800

Year, volume (issue): 2024, 23(6)