To address the visual semantic understanding bias and multimodal semantic bias in multimodal named entity recognition, a confidence-learning-guided label fusion (CLGLF) method is proposed. The method invokes the BLIP-2 pre-trained model to generate image captions, concatenates them with the input text, and performs joint encoding to achieve multimodal feature fusion. Candidate labels and text labels are obtained by decoding the multimodal representations and the text representations, respectively. After aligning the two groups of labels with a KL-divergence loss, a confidence score is computed to evaluate the quality of the multimodal representation. A confidence threshold is then set to screen out biased candidate labels, which are replaced by the text labels at the corresponding positions; this achieves label fusion and completes multimodal named entity recognition. To verify the proposed method, experiments are carried out on the Twitter-2015 and Twitter-2017 multimodal datasets, and the results are compared with seven mainstream methods, such as MSB and UMT. The experimental results demonstrate the effectiveness of CLGLF.
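The confidence-guided label fusion step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fuse_labels`, the threshold value `tau`, and the use of the maximum class probability as the confidence score are assumptions for demonstration.

```python
import torch
import torch.nn.functional as F


def fuse_labels(mm_logits: torch.Tensor, txt_logits: torch.Tensor, tau: float = 0.7):
    """Sketch of confidence-guided label fusion (hypothetical interface).

    mm_logits:  [seq_len, num_labels] logits from the multimodal branch
    txt_logits: [seq_len, num_labels] logits from the text-only branch
    tau:        confidence threshold (illustrative value, not from the paper)
    """
    mm_probs = F.softmax(mm_logits, dim=-1)
    txt_probs = F.softmax(txt_logits, dim=-1)

    # KL-divergence loss used during training to align the two label distributions
    kl_loss = F.kl_div(mm_probs.log(), txt_probs, reduction="batchmean")

    # Confidence of each candidate (multimodal) label: its maximum class probability
    confidence = mm_probs.max(dim=-1).values
    mm_labels = mm_probs.argmax(dim=-1)
    txt_labels = txt_probs.argmax(dim=-1)

    # Replace low-confidence (biased) candidate labels with the text labels
    fused_labels = torch.where(confidence >= tau, mm_labels, txt_labels)
    return fused_labels, kl_loss
```

A token whose multimodal prediction is confident keeps its candidate label; a token with a flat, uncertain multimodal distribution falls back to the text-only label, which is the screening-and-replacement behavior the abstract describes.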
Key words
multimodal named entity recognition/image caption/confidence learning/multimodal semantic bias/information extraction