Vision-enhanced Multimodal Named Entity Recognition Based on Contrastive Learning
Multimodal named entity recognition (MNER) aims to detect entity spans in a given image-text pair and classify them into the corresponding entity types. Although existing MNER methods have achieved success, they all rely on an image encoder to extract visual features, which are fed directly into a cross-modal interaction mechanism without enhancement or filtering. Moreover, since the text and image representations come from different encoders, it is difficult to bridge the semantic gap between the two modalities. Therefore, a vision-enhanced multimodal named entity recognition model based on contrastive learning (MCLAug) is proposed. First, ResNet is used to collect image features. On this basis, a pyramid bidirectional fusion strategy is proposed, combining low-level high-resolution image information with high-level strongly semantic image information to enhance the visual features. Second, following the idea of multimodal contrastive learning in the CLIP model, the contrastive loss is computed and minimized to make the representations of the two modalities more consistent. Finally, the fused image and text representations are obtained through a cross-modal attention mechanism and a gated fusion mechanism, and a CRF decoder is used to perform the MNER task. Comparative experiments, ablation studies, and case studies on two public datasets demonstrate the effectiveness of the proposed model.
Keywords: Multimodal named entity recognition; CLIP; Multimodal contrastive learning; Feature pyramid; Transformer; Gated fusion mechanism
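The CLIP-style objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation, only a generic symmetric InfoNCE contrastive loss over a batch of paired text/image embeddings, written in NumPy; the function name, the temperature value, and the embedding shapes are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss (CLIP-style sketch, not the paper's exact code).

    text_emb, image_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize rows so dot products become cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = t @ v.T / temperature

    def cross_entropy_diag(l):
        # Cross-entropy where the matched pair (the diagonal) is the target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-image and image-to-text directions, as in CLIP.
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2
```

Minimizing this loss pulls each text embedding toward its paired image embedding and pushes it away from the other images in the batch, which is the mechanism the abstract invokes for making the two modalities' representations more consistent.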