Open-Domain Question Answering Task Enhanced by Multiple Documents Refinement and Fusion

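Abstract (translated from the Chinese): Open-domain question answering (OpenQA) is a challenging task in natural language processing. Conventional machine learning and deep learning techniques are typically used to retrieve question-related candidate document fragments from a raw corpus for answer extraction. However, the candidate document fragments retrieved by current methods often contain substantial noise and information irrelevant to the question, and mainstream OpenQA models fall short in accurately answering questions that require multiple document fragments as supporting evidence. To address this, the paper proposes RFMD, a method that enhances open-domain question answering through the refinement and fusion of multiple documents. In the retrieval stage, RFMD uses a Transformer-based document refinement module to reduce noise in the candidate documents. In the reading comprehension stage, RFMD uses a question answering module centered on text generation, which builds a global attention mechanism across document fragments to integrate information from multiple relevant fragments and accurately answer questions that need several fragments as supporting evidence. RFMD achieves exact match (EM) scores of 45.8% and 63.4% on the NaturalQuestions and TriviaQA datasets, respectively, demonstrating the effectiveness and superiority of the model on OpenQA tasks.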

English title: Open-Domain Question Answering Task Enhanced by Multiple Documents Refinement and Fusion

English abstract: Open-domain question answering (OpenQA) is a challenging task in natural language processing. Conventional machine learning and deep learning techniques are commonly used to retrieve candidate document fragments related to the question from the raw corpus for answer extraction. However, the candidate document fragments retrieved by current methods tend to include considerable noise and information irrelevant to the question, and previous OpenQA models fall short in accurately responding to questions that require multiple document fragments as corroborating evidence. Therefore, this paper proposes an open-domain question answering method based on refinement and fusion of multiple documents (RFMD). Specifically, RFMD designs a Transformer-based document refinement module in the retrieval stage to reduce noise in the candidate documents. In the reading comprehension stage, RFMD employs a question answering module centered on text generation. By constructing a global attention mechanism across document fragments, it integrates information from multiple relevant document fragments to accurately answer questions that require multiple document fragments as supporting evidence. RFMD achieves EM scores of 45.8% and 63.4% on the NaturalQuestions and TriviaQA datasets, respectively, verifying the effectiveness and superiority of the model in OpenQA tasks.
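
The EM (exact match) scores quoted in the abstract are typically computed by normalizing the predicted and gold answer strings and checking for equality. The abstract does not say which normalization RFMD uses, so the sketch below assumes the common SQuAD-style convention (lowercasing, stripping punctuation and English articles, collapsing whitespace). The function names `normalize_answer`, `exact_match`, and `em_score` are illustrative and are not taken from the paper.

```python
import re
import string

# Assumption: SQuAD-style normalization; the paper's exact rules are not stated.
_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    norm_pred = normalize_answer(prediction)
    return any(norm_pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of questions whose prediction matches a gold answer."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this convention, a reported score of 45.8% would mean that roughly 46 of every 100 test questions received a prediction that matches one of the gold answers after normalization.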

English keywords:

open-domain question answering; pre-training model; generative model; similarity score; prompt design

Authors:

李博, 朱天佑, 刘俊健, 吕宏伟, 陈振宇

Affiliation:

Big Data Center of State Grid Corporation of China, Beijing 100053, China

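Keywords (translated from the Chinese):

open-domain question answering; pre-trained model; generative model; similarity score; prompt design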

Publication year:

2024

DOI: