Text-Information-Guided Attention Mechanism for Fine-Grained Image Classification
Scene text in natural images carries explicit semantic information that can provide important clues for solving related computer vision problems. This paper focuses on using multimodal content, in the form of visual and textual cues, to solve fine-grained image classification and retrieval tasks. Specifically, it employs graph convolutional networks to perform multi-modal reasoning and obtains relation-enhanced features by learning a common semantic space between the explicit objects and the text found in images. With this enhanced set of visual and textual features, the proposed model outperforms the state of the art by a large margin on two different tasks (fine-grained classification and image retrieval in contextual text).
fine-grained analysis of images; multimodal reasoning; GCN
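
The multi-modal reasoning step described in the abstract can be illustrated with a minimal sketch of one graph convolution over a joint set of visual and textual nodes. This is not the paper's implementation: the feature dimensions, the random projection matrices `w_vis` and `w_txt`, the fully connected adjacency, and the helper names `normalize_adjacency` and `gcn_layer` are all assumptions chosen for illustration. The sketch uses the standard symmetric-normalized GCN propagation rule.

```python
import numpy as np


def normalize_adjacency(adj):
    """Add self-loops and apply symmetric normalization D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(features, adj, weight):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    return np.maximum(normalize_adjacency(adj) @ features @ weight, 0.0)


rng = np.random.default_rng(0)

# Hypothetical inputs: 3 object-region features (dim 8), 2 scene-text embeddings (dim 6).
visual = rng.normal(size=(3, 8))
textual = rng.normal(size=(2, 6))

# Assumed linear projections of both modalities into a common 4-dimensional space.
w_vis = rng.normal(size=(8, 4))
w_txt = rng.normal(size=(6, 4))
nodes = np.vstack([visual @ w_vis, textual @ w_txt])  # shape (5, 4)

# Assumed graph: every node is connected to every other node.
adj = np.ones((5, 5)) - np.eye(5)

# Each output row mixes information from both the visual and the textual nodes.
w_gcn = rng.normal(size=(4, 4))
enhanced = gcn_layer(nodes, adj, w_gcn)
```

In a trained model the projection and GCN weights would be learned, and the adjacency could itself be learned or derived from the image rather than fixed as fully connected; the rows of `enhanced` stand in for the relation-enhanced features that a classifier or retrieval head would consume.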