Analysis and mining method for multi-level image-text relations guided by local-global features
Text and image data with semantic relevance complement each other and can enhance semantic understanding from different perspectives. The key to making full use of image and text data therefore lies in mining the semantic relations between them. To address the insufficient mining of deep image-text semantic relations and the inaccurate predictions in the retrieval stage, this paper proposes an analysis and mining method for multi-level image-text relations guided by local-global features. A Transformer with a multi-head self-attention mechanism is used to model relations among image regions. An image-guided text attention module is constructed to explore the fine-grained relationship between image regions and the global text. Furthermore, local and global features are fused to effectively strengthen the semantic relationship between image and text data. To verify the proposed method, experiments were carried out on the Flickr30K, MSCOCO-1K, and MSCOCO-3K datasets. Compared with 12 other methods such as VSM and SGRAF, the proposed method improves the recall of text-to-image retrieval by 0.62% on average and the recall of image-to-text retrieval by 0.5% on average. The experimental results verify the effectiveness of the method.
Keywords: image-text relation mining; multi-head self-attention mechanism; local-global features
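The abstract describes three components: a Transformer with multi-head self-attention that models relations among image regions, an image-guided text attention module that relates image regions to the global text, and a fusion of local and global features before scoring. The following PyTorch code is a minimal sketch of that pipeline, not the authors' implementation; the module names, dimensions, mean-pooling, concatenation-based fusion, and cosine-similarity scoring are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalMatcher(nn.Module):
    """Illustrative sketch of the abstract's pipeline (assumptions throughout):
    (1) a Transformer encoder with multi-head self-attention models
        relations among image region features,
    (2) an image-guided attention step relates each image region to the
        text tokens, and
    (3) local (region-level) and global features are fused before the
        image-text similarity score is computed.
    """
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.region_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        # Image-guided text attention: image regions act as queries over text tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fuse local (attended) and global image features; fusion choice is assumed.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, regions, text_tokens):
        # regions:     (B, R, dim) pre-extracted image region features
        # text_tokens: (B, T, dim) pre-encoded word features
        regions = self.region_encoder(regions)             # relation-aware regions
        # Each region attends to the text to expose fine-grained alignments.
        attended, _ = self.cross_attn(regions, text_tokens, text_tokens)
        local = attended.mean(dim=1)                       # pooled local evidence
        global_img = regions.mean(dim=1)                   # global image feature
        fused = self.fuse(torch.cat([local, global_img], dim=-1))
        text_global = text_tokens.mean(dim=1)              # global text feature
        # Cosine similarity as the image-text matching score.
        return F.cosine_similarity(fused, text_global, dim=-1)

# Toy usage: a batch of 2 images with 36 regions and 12-token captions.
scores = LocalGlobalMatcher()(torch.randn(2, 36, 512), torch.randn(2, 12, 512))
print(scores.shape)  # torch.Size([2])
```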