Image-Text Retrieval Algorithm with Dynamic Multi-View Reasoning over Hierarchical Similarity
Cross-modal image-text retrieval usually concerns visible-light images and ordinary text. Scalar-based image-text similarity is limited: a single score cannot fully represent cross-modal alignment. Moreover, local region-word correlations and global image-text dependencies interact in complex ways, so the modules that reason over the two modalities' features carry a degree of uncertainty. To address these problems, this paper proposes a dynamic multi-view reasoning method for image-text matching based on a hierarchical similarity network. First, the method computes both global and local similarities in scalar and vector form. Second, four types of units are designed as basic building blocks to explore global-local similarity interaction. Finally, a learnable selection confidence mechanism is introduced. Experiments on the Flickr30K and MSCOCO datasets demonstrate the excellent performance of the algorithm.
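The scalar-versus-vector distinction can be illustrated with a minimal sketch. This is not the paper's implementation; it only assumes that the scalar view is a cosine score between feature vectors and that the vector view keeps the per-dimension contributions to that score instead of collapsing them, so later reasoning modules can weigh each dimension:

```python
import numpy as np

def scalar_similarity(img, txt):
    # Scalar view: one cosine score between a global image feature
    # and a global text feature -- compact but lossy.
    return float(img @ txt / (np.linalg.norm(img) * np.linalg.norm(txt)))

def vector_similarity(img, txt):
    # Vector view (illustrative assumption): keep the element-wise
    # contributions to the cosine score as a similarity vector, so
    # per-dimension alignment information is preserved.
    return (img * txt) / (np.linalg.norm(img) * np.linalg.norm(txt))

rng = np.random.default_rng(0)
img_feat = rng.standard_normal(8)   # hypothetical global image feature
txt_feat = rng.standard_normal(8)   # hypothetical global text feature

s = scalar_similarity(img_feat, txt_feat)
v = vector_similarity(img_feat, txt_feat)
# The vector view refines the scalar view: its entries sum to the cosine score.
assert np.isclose(v.sum(), s)
```

In this reading, the hierarchical network would compute such similarities at both the global (image-sentence) and local (region-word) levels before the reasoning units aggregate them.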