Frontiers of Information Technology & Electronic Engineering, 2024, Vol. 25, Issue 1: 121-134. DOI: 10.1631/FITEE.2300296

Enhancing low-resource cross-lingual summarization from noisy data with fine-grained reinforcement learning

Yuxin HUANG 1, Huailing GU 1, Zhengtao YU 1, Yumeng GAO 1, Tong PAN 1, Jialong XU 1

Author information

  • 1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650504, China


Abstract

Cross-lingual summarization (CLS) is the task of generating a summary in a target language from a document in a source language. Recently, end-to-end CLS models have achieved impressive results using large-scale, high-quality datasets typically constructed by translating monolingual summary corpora into CLS corpora. However, due to the limited performance of low-resource language translation models, translation noise can seriously degrade the performance of these models. In this paper, we propose a fine-grained reinforcement learning approach to address low-resource CLS based on noisy data. We introduce the source language summary as a gold signal to alleviate the impact of the translated noisy target summary. Specifically, we design a reinforcement reward by calculating the word correlation and word missing degree between the source language summary and the generated target language summary, and combine it with cross-entropy loss to optimize the CLS model. To validate the performance of our proposed model, we construct Chinese-Vietnamese and Vietnamese-Chinese CLS datasets. Experimental results show that our proposed model outperforms the baselines in terms of both the ROUGE score and BERTScore.
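The reward described above can be illustrated with a minimal, hypothetical sketch. The names below (word_correlation, word_missing_degree, fine_grained_reward, mixed_loss), the bilingual alignment table align_prob, and the mixing weight lam are illustrative assumptions rather than the paper's exact formulation; the sketch only shows how a word-level reward computed between the source-language summary and the generated target-language summary could be combined with cross-entropy loss.

```python
from typing import Dict, List, Tuple

# Hypothetical bilingual alignment table: probability that a source-language
# word aligns to a target-language word (e.g., from a statistical word aligner).
AlignTable = Dict[Tuple[str, str], float]

def word_correlation(src_summary: List[str], gen_summary: List[str],
                     align_prob: AlignTable) -> float:
    """Average best alignment score of each generated word against the
    source-language summary (higher = generated words are better grounded)."""
    if not gen_summary:
        return 0.0
    scores = [max((align_prob.get((s, t), 0.0) for s in src_summary), default=0.0)
              for t in gen_summary]
    return sum(scores) / len(scores)

def word_missing_degree(src_summary: List[str], gen_summary: List[str],
                        align_prob: AlignTable, threshold: float = 0.1) -> float:
    """Fraction of source-summary words with no sufficiently aligned word in the
    generated summary (higher = more source content is missing)."""
    if not src_summary:
        return 0.0
    missing = sum(
        1 for s in src_summary
        if max((align_prob.get((s, t), 0.0) for t in gen_summary), default=0.0) < threshold
    )
    return missing / len(src_summary)

def fine_grained_reward(src_summary: List[str], gen_summary: List[str],
                        align_prob: AlignTable) -> float:
    """Reward correlation with the source summary and penalize missing content."""
    return (word_correlation(src_summary, gen_summary, align_prob)
            - word_missing_degree(src_summary, gen_summary, align_prob))

def mixed_loss(ce_loss: float, rl_loss: float, lam: float = 0.5) -> float:
    """Training loss: weighted sum of cross-entropy and reward-based RL losses."""
    return lam * rl_loss + (1.0 - lam) * ce_loss

if __name__ == "__main__":
    # Toy Chinese-Vietnamese example with a made-up alignment table.
    align = {("新闻", "tin tức"): 0.9, ("摘要", "tóm tắt"): 0.8}
    src = ["新闻", "摘要"]
    gen = ["tin tức", "tóm tắt"]
    print(fine_grained_reward(src, gen, align))  # 0.85: well grounded, nothing missing
```

In a typical setup the RL loss would be a policy-gradient term in which this reward weights the log-likelihood of a sampled target summary; the weight lam controls how much the noisy translated reference (via cross-entropy) versus the source-language gold signal (via the reward) drives training.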


Key words

Cross-lingual summarization/Low-resource language/Noisy data/Fine-grained reinforcement learning/Word correlation/Word missing degree


Funding

  • National Natural Science Foundation of China (U21B2027, 62266027, 61972186, 62241604)
  • Yunnan Provincial Major Science and Technology Special Plan Projects, China (202302AD080003, 202103AA080015, 202202AD080003)
  • General Projects of Basic Research in Yunnan Province, China (202301AT070471, 202301AT070393)
  • Kunming University of Science and Technology "Double First-Class" Joint Project, China (202201BE070001-021)

Publication year

2024
Frontiers of Information Technology & Electronic Engineering
Zhejiang University
CSTPCD
Impact factor: 0.371
ISSN: 2095-9184
Number of references: 3