A Multi-Modal Named Entity Recognition Method Based on Visual and Textual Semantic Enhancement
To address the partial semantic loss that occurs when visual and textual features are fused, which causes visual information to deviate significantly when it supplements textual information, a multimodal named entity recognition method based on visual and textual semantic enhancement is proposed. A feature interaction unit based on a collaborative cross-attention mechanism is designed to strengthen the semantic relationship between visual and textual information by integrating BERT text feature extraction with CLIP (contrastive language-image pre-training) visual feature extraction. CLIP is pre-trained under a contrastive learning framework that optimizes the model to correctly match images with their corresponding text descriptions, maximizing the similarity of positive samples (matched image-text pairs) while minimizing the similarity of negative samples (mismatched image-text pairs). The general-domain datasets TWITTER-2015 and TWITTER-2017 are adopted as experimental datasets. Experimental results show that, compared with traditional methods, the proposed model achieves significantly improved precision, recall, and F1 score on multimodal named entity recognition tasks.
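To make the collaborative cross-attention fusion concrete, the following is a minimal sketch of such a feature interaction unit in PyTorch. The paper does not publish its exact architecture, so the module structure, dimensions, and gating scheme here are illustrative assumptions: two cross-attention passes (text queries attend over image patches, image queries attend over text tokens) followed by a simple gated fusion, operating on features of the shape produced by bert-base and the CLIP ViT-B/32 vision tower.

```python
# Illustrative sketch of a collaborative (bidirectional) cross-attention
# fusion unit; shapes and module choices are assumptions, not the paper's
# published design.
import torch
import torch.nn as nn

class CollaborativeCrossAttention(nn.Module):
    """Fuses BERT token features with CLIP patch features via two
    cross-attention passes: text attends to vision, vision attends to text."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.gate = nn.Linear(2 * dim, dim)  # fuses the two enhanced streams

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, L_t, dim), e.g. BERT last_hidden_state
        # image_feats: (B, L_v, dim), e.g. CLIP vision last_hidden_state
        # Text queries attend over image patches (vision supplements text).
        t_enh, _ = self.txt2img(text_feats, image_feats, image_feats)
        t_enh = self.norm_t(text_feats + t_enh)  # residual + layer norm
        # Image queries attend over text tokens (text grounds vision).
        v_enh, _ = self.img2txt(image_feats, text_feats, text_feats)
        v_enh = self.norm_v(image_feats + v_enh)
        # Pool the vision stream and gate it into every text position,
        # yielding per-token features for a downstream NER tagging layer.
        v_ctx = v_enh.mean(dim=1, keepdim=True).expand_as(t_enh)
        return self.gate(torch.cat([t_enh, v_ctx], dim=-1))  # (B, L_t, dim)

# Usage with random stand-ins for BERT/CLIP outputs (both 768-dim for
# bert-base and the CLIP ViT-B/32 vision tower):
unit = CollaborativeCrossAttention()
text = torch.randn(2, 32, 768)   # 32 subword tokens
image = torch.randn(2, 50, 768)  # 49 patches + [CLS]
print(unit(text, image).shape)   # torch.Size([2, 32, 768])
```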
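The contrastive objective the abstract describes for CLIP can likewise be sketched in a few lines: with a batch of paired embeddings, matched image-text pairs lie on the diagonal of the similarity matrix (positives) and all off-diagonal pairs are negatives. The temperature value and embedding size below are the conventional CLIP settings, used here only for illustration.

```python
# Minimal sketch of CLIP's symmetric contrastive (InfoNCE) loss.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0))           # positives: the diagonal
    # Symmetric cross-entropy over the image->text and text->image directions.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy batch of 4 paired 512-dim embeddings (CLIP ViT-B/32's projection size):
loss = clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```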