Title Generation for Chinese Academic Texts Based on Open-Source LLMs: The Case of the Humanities and Social Sciences

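Abstract (translated from the Chinese): [Purpose/significance] As a compressed representation of a paper and the essence of its main idea, the title plays an important role in retrieval, indexing, and related tasks. Taking title generation for academic texts in the humanities and social sciences as an example, this study provides a reference for applying large language models to academic text mining. [Method/process] From an empirical perspective, the study explores the effectiveness of the current open-source Chinese large language model Qwen-7B on the academic text title generation task, and the feasibility of injecting knowledge from humanities and social sciences academic text data into an open-source base large language model. ROUGE and BLEU are used to score lexical-level recall and precision, and the ChatGPT conversational system is used to score sentence fluency and semantic relevance. [Result/conclusion] The study finds that injecting knowledge from Chinese humanities and social sciences academic texts into the Qwen-7B base model does not effectively improve its title generation ability, and that the feature and semantic learning ability of the open-source base model Qwen-7B on Chinese needs further strengthening; the LLaMA2-7B model performs better than Qwen-7B at generating titles for Chinese academic texts. [Innovation/limitation] Based on the Qwen-7B model and full-text academic data from the humanities and social sciences, the study demonstrates the capability of, and application paths for, a current mainstream Chinese open-source large language model in academic text title generation, providing a theoretical and practical reference for mining and organizing academic full texts. Because of resource constraints, the comparison models and training methods used are limited in variety and should be extended to explore more fully the boundaries of large language models in academic full-text knowledge mining and organization.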

English title: Chinese Academic Text Title Generation Based on Open-Source Large Language Models: Taking the Field of Humanities and Social Sciences as an Example

English abstract: [Purpose/significance] As a compressed representation of a paper and the essence of its main idea, the title plays an important role in retrieval and indexing. Taking the task of academic text title generation in the humanities and social sciences as an example, this paper provides a reference for the application of large language models in academic text mining. [Method/process] From an empirical perspective, we explore the effectiveness of the current open-source Chinese large language model Qwen-7B in the task of academic text title generation, and the feasibility of injecting knowledge from academic text data in the humanities and social sciences into an open-source base large language model. Lexical-level recall and precision are scored using the ROUGE and BLEU metrics, while sentence fluency and semantic relevance are scored using the ChatGPT dialog system. [Result/conclusion] We find that injecting academic text knowledge from the Chinese humanities and social sciences into the Qwen-7B base model does not effectively improve the model's ability in the title generation task, and that the feature and semantic learning ability of the open-source base model Qwen-7B on Chinese needs to be further enhanced; the LLaMA2-7B model outperforms the Qwen-7B model in generating Chinese academic text titles. [Innovation/limitation] Based on the Qwen-7B model and full-text academic data in the humanities and social sciences, this paper demonstrates the capability and application paths of a current mainstream Chinese open-source large language model in academic text title generation, which provides theoretical and practical references for academic full-text mining and organization. The comparison models and training methods used in this paper are limited in variety due to resource constraints, and need to be extended to fully explore the boundaries of large language models in academic full-text knowledge mining and organization.
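The abstract reports lexical-level recall and precision via ROUGE and BLEU but does not describe the tokenization or tooling used. As a minimal sketch of what those two quantities measure, the following computes ROUGE-N recall and the clipped n-gram precision that BLEU is built on. Character-level tokenization, the function names, and the two example titles are assumptions made for illustration; this is not the authors' evaluation code, and full BLEU additionally combines several n-gram orders with a brevity penalty.

```python
from collections import Counter


def char_tokens(text: str) -> list[str]:
    """Split a title into characters, dropping whitespace.

    Character-level tokenization is an assumption made for this sketch;
    the abstract does not say how titles were segmented.
    """
    return [ch for ch in text if not ch.isspace()]


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_overlap(candidate: str, reference: str, n: int) -> int:
    """Number of candidate n-grams also found in the reference.

    Each n-gram is counted at most as often as it occurs in the reference.
    """
    cand = ngram_counts(char_tokens(candidate), n)
    ref = ngram_counts(char_tokens(reference), n)
    return sum((cand & ref).values())


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """ROUGE-N recall: overlapping n-grams divided by reference n-grams."""
    total = sum(ngram_counts(char_tokens(reference), n).values())
    return clipped_overlap(candidate, reference, n) / total if total else 0.0


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """BLEU-style modified precision: overlap divided by candidate n-grams."""
    total = sum(ngram_counts(char_tokens(candidate), n).values())
    return clipped_overlap(candidate, reference, n) / total if total else 0.0


if __name__ == "__main__":
    reference_title = "学术文本标题生成研究"  # hypothetical human-written title
    generated_title = "学术标题生成方法"      # hypothetical model output
    print(f"ROUGE-1 recall: {rouge_n_recall(generated_title, reference_title):.2f}")
    print(f"1-gram precision: {clipped_precision(generated_title, reference_title):.2f}")
```

Recall rewards a generated title for covering the reference title's wording, while precision penalizes extra wording that the reference lacks, which is why the two are reported together.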

English keywords:

natural language processing; automatic title generation; academic text; large language models; ChatGPT

Authors:

吴娜、沈思、王东波

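Author affiliations:

College of Information Management, Nanjing Agricultural University, Nanjing, Jiangsu 210031

School of Economics and Management, Nanjing University of Science and Technology, Nanjing, Jiangsu 210094

Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing, Jiangsu 210031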

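Keywords (translated from the Chinese):

natural language processing; automatic title generation; academic text; large language models; ChatGPT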

Publication year:

2024

DOI:

10.13833/j.issn.1007-7634.2024.07.015

情报科学 (Information Science)

China Society for Scientific and Technical Information; Jilin University

Indexed in: CSTPCD; CSSCI; CHSSCD; Peking University Core Journals (北大核心)

Impact factor: 2.275

ISSN: 1007-7634

Year, volume (issue): 2024, 42(7)