Cross-modal image and text retrieval based on graph convolution and multi-head attention
To address the difficulty that existing cross-modal retrieval methods have in measuring the weight of the data at each node, and their limitations in mining local consistency within modalities, a cross-modal image and text retrieval method based on graph convolution and multi-head attention is proposed. First, each image or text sample serves as an independent node when constructing the modal graph, and graph convolution is used to extract the interaction information between samples, improving local consistency within each modality. Then, an attention mechanism is introduced into the graph convolution to adaptively learn the weight coefficient of each neighboring node, thereby distinguishing the influence of different neighbors on the central node. Finally, a multi-head attention layer with learnable weight parameters is constructed to fully learn multiple sets of related features between nodes. Compared with eight existing methods, the proposed method improves mAP by 2.6% to 42.5% on the Wikipedia dataset and by 3.3% to 54.3% on the Pascal Sentence dataset.
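A minimal sketch of such an attention-weighted, multi-head graph convolution layer is given below. It follows the standard GAT-style formulation (per-head attention coefficients over neighbors, followed by head concatenation); this is an illustrative re-implementation under assumptions, not the authors' released code, and all names (MultiHeadGraphAttention, num_heads, the toy shapes in the usage lines) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadGraphAttention(nn.Module):
    """GAT-style layer: each head adaptively learns attention weights
    over neighboring nodes, distinguishing their influence on the
    central node; head outputs are concatenated."""
    def __init__(self, in_dim, out_dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.out_dim = out_dim
        # One linear projection per head, packed into a single matrix.
        self.W = nn.Linear(in_dim, num_heads * out_dim, bias=False)
        # Per-head attention parameters, split into source/destination halves.
        self.a_src = nn.Parameter(torch.empty(num_heads, out_dim))
        self.a_dst = nn.Parameter(torch.empty(num_heads, out_dim))
        nn.init.xavier_uniform_(self.W.weight)
        nn.init.xavier_uniform_(self.a_src)
        nn.init.xavier_uniform_(self.a_dst)

    def forward(self, x, adj):
        # x:   (N, in_dim) node features, one node per image/text sample
        # adj: (N, N) adjacency of the modal graph; assumed to include
        #      self-loops so every row of the softmax is well defined
        N = x.size(0)
        h = self.W(x).view(N, self.num_heads, self.out_dim)        # (N, H, D)
        # Unnormalized score e_ij = LeakyReLU(a_src.h_i + a_dst.h_j) per head.
        src = (h * self.a_src).sum(-1)                             # (N, H)
        dst = (h * self.a_dst).sum(-1)                             # (N, H)
        e = F.leaky_relu(src.unsqueeze(1) + dst.unsqueeze(0), 0.2) # (N, N, H)
        # Mask out non-neighbors so the softmax runs over neighbors only.
        e = e.masked_fill(adj.unsqueeze(-1) == 0, float('-inf'))
        alpha = torch.softmax(e, dim=1)                            # neighbor weights
        # Attention-weighted aggregation per head, then concatenate heads.
        out = torch.einsum('ijh,jhd->ihd', alpha, h)               # (N, H, D)
        return out.reshape(N, self.num_heads * self.out_dim)

# Toy usage: 5 nodes, 16-d features, self-loops added to the adjacency.
x = torch.randn(5, 16)
adj = (torch.eye(5) + (torch.rand(5, 5) > 0.5).float()).clamp(max=1)
layer = MultiHeadGraphAttention(16, 8, num_heads=4)
print(layer(x, adj).shape)  # torch.Size([5, 32])
```

Concatenating the per-head outputs is one way to realize the "multiple sets of related features between nodes" described above; averaging the heads in the final layer is an equally common variant.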