Research on Literature Classification Methods Based on Multiple Feature Correlation and a Graph Attention Network Model: A Case Study of Chinese Medical Literature

In recent years, scientific literature has grown rapidly while becoming more complex in content and more fine-grained in topic, which makes literature classification harder. Advancing automatic literature classification, so that scientific literature is correctly classified under the Chinese Library Classification (《中国图书馆分类法》), matters for the intelligent management of information resources and for efficient retrieval in scientific research. This paper proposes the Hierarchical Text Classification Network based on Multiple feature Correlation and Graph Attention Network (HTCN-MCGAT). The model comprises three modules. (1) The literature representation and enhancement module adapts the model to literature classification with a two-stage process of representation followed by enhancement: it redesigns the fine-tuning stage of the Bidirectional Encoder Representations from Transformers (BERT) pre-trained model so that the representation of the current document is enhanced at two levels, namely the internal character correlations within its abstract, title, and keywords, and its external correlations with other documents. (2) The label association modeling module uses a graph attention network to model the relationships among label semantics and the label hierarchy. (3) The hierarchical interaction classification module first builds a hierarchical fusion attention mechanism between documents and labels, which associates the document semantics in the feature space with the hierarchical label information in the symbol space; it then classifies documents, from a multi-task learning perspective, with a hierarchical classification network that fuses global and local information. Using Chinese medical literature as the research object, a series of experiments shows that HTCN-MCGAT improves the F1-score by 4.34% to 13.21% over layer-by-layer and flat multi-class classification methods. Case analysis further supports the effectiveness of the model. The work optimizes literature classification from two directions, richer feature correlation and hierarchical relationship modeling, and could be extended to other classification tasks with hierarchical label structures.
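The label association modeling idea, attention-weighted aggregation over a label hierarchy, can be illustrated with a minimal single-head graph attention step. This is only a sketch under stated assumptions, not the authors' implementation: the three labels, their 2-dimensional feature vectors, and the attention vectors `a_src` and `a_dst` are hypothetical stand-ins for learned parameters, and the learned projection matrix and multi-head attention of a full graph attention network are omitted for brevity.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gat_layer(features, neighbors, a_src, a_dst):
    """One single-head graph attention step over a label graph.

    features:  label -> feature vector (hypothetical label embeddings)
    neighbors: label -> list of adjacent labels in the hierarchy
    a_src, a_dst: attention vectors (stand-ins for learned weights)
    """
    updated = {}
    for node, h_i in features.items():
        # Each label attends to itself and to its hierarchy neighbours.
        nbrs = [node] + neighbors.get(node, [])
        scores = [
            leaky_relu(dot(a_src, h_i) + dot(a_dst, features[j]))
            for j in nbrs
        ]
        alphas = softmax(scores)
        # New representation: attention-weighted sum of neighbour features.
        updated[node] = [
            sum(alpha * features[j][k] for alpha, j in zip(alphas, nbrs))
            for k in range(len(h_i))
        ]
    return updated


# Toy parent-child hierarchy (illustrative labels and values only).
features = {"R": [1.0, 0.0], "R5": [0.0, 1.0], "R6": [1.0, 1.0]}
neighbors = {"R": ["R5", "R6"], "R5": ["R"], "R6": ["R"]}
a_src, a_dst = [0.5, -0.5], [0.3, 0.7]

updated = gat_layer(features, neighbors, a_src, a_dst)
for label, vec in updated.items():
    print(label, [round(x, 3) for x in vec])
```

Because the attention weights for each label sum to one, every updated vector is a convex combination of the label's own features and those of its parent or children, which is how hierarchy information flows into each label representation in this simplified setting.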

literature classification; pre-training model; graph attention network; attention mechanism

陈帅朴、钱宇星、钱志强、刘政昊、张志剑

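School of Information Management, Wuhan University, Wuhan 430072

Big Data Institute, Wuhan University, Wuhan 430072

Center for Studies of Information Resources, Wuhan University, Wuhan 430072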

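National Social Science Fund of China, special research project on "Accelerating the Construction of the Disciplinary, Academic, and Discourse Systems of Philosophy and Social Sciences with Chinese Characteristics" (19VXK09)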

2024

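Journal of the China Society for Scientific and Technical Information (情报学报)
China Society for Scientific and Technical Information; Institute of Scientific and Technical Information of China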

CSTPCD; CSSCI; CHSSCD; PKU Core (北大核心)
Impact factor: 1.296
ISSN: 1000-0135
Year, volume (issue): 2024, 43(4)