CINOSUM: An Extractive Summarization Model for Low-resource Multi-ethnic Language
To address the issue that existing models cannot perform abstractive summarization for low-resource multi-ethnic languages, this paper proposes an extractive summarization model, CINOSUM, based on CINO (a pre-trained language model for Chinese minority languages). We construct a multi-ethnic language summarization dataset, MESUM, to extend the linguistic scope of text summarization. To overcome the poor performance of previous models on low-resource languages, we employ a unified sentence-extraction framework for extractive summarization across the various ethnic languages. In addition, we introduce a joint training strategy over the multilingual datasets that effectively broadens applications to low-resource languages, greatly improving the model's adaptability and flexibility. Finally, we conduct an extensive experimental study on the MESUM dataset; the results show that CINOSUM achieves superior performance in low-resource multilingual settings, including Tibetan and Uyghur, with significant improvements on the ROUGE evaluation metric.
Keywords: Extractive summarization · Multilingual pre-trained model · Low-resource language processing · Knowledge transfer
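The abstract reports gains on the ROUGE metric. As background, ROUGE-N recall measures the fraction of reference n-grams that also appear in a candidate summary; the sketch below is a generic, minimal illustration of that definition, not the paper's evaluation code (which likely uses a standard ROUGE toolkit and language-specific tokenization for Tibetan and Uyghur).

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap divided by reference n-gram count.

    Whitespace tokenization is a simplifying assumption; real evaluations
    of Tibetan or Uyghur text would need proper segmentation.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total
```

For example, `rouge_n_recall("the cat sat", "the cat sat on the mat", n=1)` matches 3 of the 6 reference unigrams, giving a recall of 0.5.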