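AIGC-Driven Research on Automatic Summarization of Ancient Classics: From Natural Language Understanding to Generation (AIGC驱动古籍自动摘要研究:从自然语言理解到生成)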

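Chinese abstract (translated): Automatic text summarization, a key task in natural language processing, aims to compress long texts and relieve information overload. Using the biographies (列传) in the Twenty-Four Histories (《二十四史》) as a corpus, this article approaches the problem from both extractive and generative methods to explore feasible paths for applying AIGC-driven automatic summarization to ancient texts, offering a reference for the creative transformation and innovative development of ancient-text resources and supporting the realization of their content value within the digital humanities. First, the GujiBERT, SikuBERT, and BERT-ancient-Chinese models are used for semantic representation, and the LexRank algorithm ranks sentences by importance to extract summaries. Next, GPT-3.5-turbo, GPT-4, and ChatGLM3 generate summaries, and fine-tuned versions of ChatGLM3 and GPT-3.5-turbo are built. Finally, extractive summaries are evaluated with information coverage and information diversity metrics, and generative summaries with the ROUGE and MAUVE metrics. The results show that SikuBERT has comparatively strong semantic representation and comprehension of classical Chinese in the extractive task; that general-purpose large language models each show distinct strengths in summarizing ancient texts but fall short in distilling the main theme; and that fine-tuning GPT-3.5-turbo and ChatGLM3 on a small-sample dataset effectively improves their summary generation.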

English title: AIGC-Driven Research on Automatic Summarization of Ancient Classics: From Natural Language Understanding to Natural Language Generation

English abstract: As a key task in natural language processing, automatic text summarization aims to compress long texts and address the problem of information overload. Taking the biography corpus in the Twenty-Four Histories as an example, this article explores feasible approaches to AIGC-driven automatic summarization of ancient texts from both extractive and generative perspectives, provides a reference for the creative transformation and innovative development of ancient classics resources, and helps to realize the value of ancient classics under the concept of digital humanities. First, semantic representations are created with the GujiBERT, SikuBERT, and BERT-ancient-Chinese models, and importance is ranked by the LexRank algorithm to extract summaries. Second, summaries are generated using the GPT-3.5-turbo, GPT-4, and ChatGLM3 models, and fine-tuned ChatGLM3 and GPT-3.5-turbo models are developed. Finally, the extractive summaries are evaluated using information coverage and information diversity metrics, while the generated summaries are evaluated using the ROUGE and MAUVE metrics. The study shows that SikuBERT has stronger semantic representation and comprehension of ancient texts in the extractive summarization task; that general-purpose large language models each have distinctive abilities in summarizing ancient classics but are weak at distilling the main ideas; and that fine-tuning on small-sample datasets can effectively improve the summarization capability of the GPT-3.5-turbo and ChatGLM3 models.
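
The extractive step described in the abstract (sentence embeddings ranked by LexRank) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 2-D vectors stand in for sentence embeddings that a model such as SikuBERT would produce, and the function names, similarity threshold, and damping factor are hypothetical choices made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lexrank(embeddings, threshold=0.1, damping=0.85, iters=100, tol=1e-8):
    """Score sentences by PageRank-style centrality on a cosine-similarity graph."""
    n = len(embeddings)
    # Pairwise similarities; edges below the (assumed) threshold are dropped.
    adj = [
        [s if s >= threshold else 0.0
         for s in (cosine(embeddings[i], embeddings[j]) for j in range(n))]
        for i in range(n)
    ]
    # Row-normalise into transition probabilities; empty rows fall back to uniform.
    trans = []
    for row in adj:
        total = sum(row)
        trans.append([x / total for x in row] if total else [1.0 / n] * n)
    # Power iteration with damping until the scores stop changing.
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = [
            (1 - damping) / n
            + damping * sum(scores[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores


def extract_summary(sentences, embeddings, k=2):
    """Return the k highest-scoring sentences, kept in their original order."""
    scores = lexrank(embeddings)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Toy data: three mutually similar "sentences" and one outlier (placeholder labels).
toy_sentences = ["s0", "s1", "s2", "s3"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
print(extract_summary(toy_sentences, toy_embeddings, k=3))
```

In a real pipeline the toy vectors would be replaced by pooled model outputs for each sentence of a biography; the ranking logic itself does not depend on which encoder supplies them.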

English keywords:

re-valuing ancient classics; automatic summarization; SikuBERT; large language models

Authors:

Wu Na (吴娜), Liu Chang (刘畅), Liu Jiangfeng (刘江峰), Wang Dongbo (王东波)

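Author affiliations:

College of Information Management, Nanjing Agricultural University

School of Information Management, Nanjing University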

Keywords:

revaluing ancient texts; automatic summarization; SikuBERT; large language models

Funding:

Major Program of the National Social Science Fund of China

Project number:

21&ZD331

Publication year:

2024

Library Tribune (图书馆论坛)

Sun Yat-sen Library of Guangdong Province (广东省立中山图书馆)

Indexed in: CSTPCD; CSSCI; CHSSCD; PKU Core (北大核心)

Impact factor: 1.864

ISSN: 1002-1167

Year, volume (issue): 2024, 44(9)