Vision-enhanced Multimodal Named Entity Recognition Based on Contrastive Learning
Multimodal named entity recognition (MNER) aims to detect entity spans in a given image-text pair and classify them into the corresponding entity types. Although existing MNER methods have achieved success, they all rely on an image encoder to extract visual features, which are fed directly into a cross-modal interaction mechanism without enhancement or filtering. Moreover, since the text and image representations come from different encoders, it is difficult to bridge the semantic gap between the two modalities. Therefore, a vision-enhanced multimodal named entity recognition model based on contrastive learning (MCLAug) is proposed. First, ResNet is used to collect image features. On this basis, a pyramid bidirectional fusion strategy is proposed, combining low-level high-resolution image information with high-level strongly semantic image information to enhance the visual features. Second, following the idea of multimodal contrastive learning in the CLIP model, the contrastive loss is computed and minimized to make the representations of the two modalities more consistent. Finally, the fused image and text representations are obtained through a cross-modal attention mechanism and a gated fusion mechanism, and a CRF decoder is used to perform the MNER task. Comparative experiments, ablation studies, and case studies on two public datasets demonstrate the effectiveness of the proposed model.
Keywords: Multimodal named entity recognition; CLIP; Multimodal contrastive learning; Feature pyramid; Transformer; Gated fusion mechanism
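The CLIP-style objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation, only a generic symmetric InfoNCE contrastive loss over a batch of paired text/image embeddings, written in NumPy; the function name, the temperature value, and the embedding shapes are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss (CLIP-style sketch, not the paper's exact code).

    text_emb, image_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize rows so dot products become cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = t @ v.T / temperature

    def cross_entropy_diag(l):
        # Cross-entropy where the matched pair (the diagonal) is the target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-image and image-to-text directions, as in CLIP.
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2
```

Minimizing this loss pulls each text embedding toward its paired image embedding and pushes it away from the other images in the batch, which is the mechanism the abstract invokes for making the two modalities' representations more consistent.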