
CINOSUM: An Extractive Summarization Model for Multi-ethnic Low-resource Languages

To address the inability of existing models to perform automatic summarization for multi-ethnic low-resource languages, this paper proposes CINOSUM, an extractive summarization model based on CINO (a pre-trained language model for Chinese minority languages). To extend the linguistic scope of text summarization, we first construct MESUM, a summarization dataset covering multiple ethnic languages. To overcome the poor performance of previous models on low-resource languages, we build a framework with a unified sentence extractor that performs extractive summarization across different ethnic languages. In addition, we introduce a joint training strategy over the multilingual datasets that compensates for the shortage of available knowledge, extends the model's applicability to low-resource languages, and markedly improves its adaptability and flexibility. Finally, we conduct an extensive experimental study on the MESUM dataset; the results show that CINOSUM performs strongly in multi-ethnic low-resource settings, including Tibetan and Uyghur, and achieves significant gains under the ROUGE evaluation metric.
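The abstract describes the extract-then-evaluate pattern: a scorer ranks sentences, the top-ranked ones form the summary, and the result is compared against a reference with ROUGE. The sketch below is a minimal, model-free stand-in, not the authors' method: CINOSUM scores sentences with a CINO encoder, whereas here a simple word-frequency heuristic plays the scorer's role, and `rouge1_f1` implements only unigram-overlap ROUGE-1 F1.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: unigram overlap between candidate and reference token lists."""
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def extract_summary(sentences, k=2):
    """Pick the k highest-scoring sentences, preserving document order.

    Stand-in scorer: average document-frequency of a sentence's words
    (a neural extractor would assign these scores instead).
    """
    doc_counts = Counter(w for s in sentences for w in s.split())
    def score(s):
        words = s.split()
        return sum(doc_counts[w] for w in words) / max(len(words), 1)
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                    reverse=True)
    chosen = sorted(ranked[:k])  # restore original sentence order
    return [sentences[i] for i in chosen]
```

A multilingual joint-training setup would feed sentence batches from several languages through one shared extractor; the selection and ROUGE logic above stays language-agnostic as long as tokenization is handled per language.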

Extractive summarization; Multilingual pre-trained model; Low-resource language processing; Knowledge transfer

翁彧、罗皓予、超木日力格、刘轩、董俊、刘征


Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance, Ministry of Education, Minzu University of China, Beijing 100081, China

School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing 100081, China


Funding: National Key R&D Program of China (2020YFB1406702-3); National Natural Science Foundation of China (61772575, 62006257)

2024

Computer Science (计算机科学)
Publisher: Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, Volume (Issue): 2024, 51(7)