Multi-Band Image Caption Generation Method Based on Feature Fusion
This study proposes a multi-band detection image caption generation method based on feature fusion to address a common weakness of existing image caption generation methods: poor performance on nighttime scenes, occluded targets, and blurred captured images. Incorporating infrared detection imaging into image captioning proceeds in stages. First, multi-layer Convolutional Neural Networks (CNNs) independently extract features from the visible-light and infrared images. Then, to exploit the complementary nature of the two detection bands, a spatial attention module built around a multi-head attention mechanism fuses the features of each band. Next, a channel attention mechanism consolidates information across the spatial domain, supporting the generation of the diverse word types needed to describe the captured images. Finally, building on the traditional additive attention mechanism, an attention enhancement module computes correlation weight coefficients between the attention result map and the query vector, suppressing interference from irrelevant variables and completing caption generation. Multiple experiments on a visible-infrared image caption dataset show that the method effectively fuses the semantic features of the two bands: the Bilingual Evaluation Understudy 4 (BLEU4) and Consensus-based Image Description Evaluation (CIDEr) scores reach 58.3% and 136.1%, respectively, a substantial improvement in caption accuracy. These gains strengthen the utility of the technology for complex scene analysis tasks such as security monitoring and military reconnaissance.
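The dual-band fusion step can be illustrated with a minimal PyTorch sketch. The framework, module structure, dimensions, and the choice of visible-light features as queries are illustrative assumptions, not the authors' implementation; the sketch only shows how each spatial location of one band's feature map can attend to the other band through multi-head attention.

```python
# Minimal sketch of multi-head spatial attention fusing visible-light and
# infrared CNN features. Names and sizes are hypothetical (PyTorch assumed).
import torch
import torch.nn as nn


class SpatialFusion(nn.Module):
    """Fuse per-band CNN feature maps with multi-head cross-attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Visible features act as queries over the infrared features, so each
        # spatial location draws complementary information from the other band.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # vis, ir: (B, C, H, W) feature maps from the two CNN branches.
        b, c, h, w = vis.shape
        vis_seq = vis.flatten(2).transpose(1, 2)  # (B, H*W, C)
        ir_seq = ir.flatten(2).transpose(1, 2)    # (B, H*W, C)
        fused, _ = self.cross_attn(query=vis_seq, key=ir_seq, value=ir_seq)
        fused = self.norm(fused + vis_seq)        # residual keeps visible detail
        return fused.transpose(1, 2).reshape(b, c, h, w)


vis = torch.randn(2, 512, 7, 7)  # visible-light CNN features
ir = torch.randn(2, 512, 7, 7)   # infrared CNN features
print(SpatialFusion()(vis, ir).shape)  # torch.Size([2, 512, 7, 7])
```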
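The channel attention step is described only at a high level; the sketch below assumes a squeeze-and-excitation style design in which spatial information is pooled into per-channel statistics that re-weight the fused feature map. The reduction ratio and layer layout are assumptions.

```python
# Sketch of channel attention consolidating the spatial domain into
# per-channel gates (squeeze-and-excitation style; sizes are illustrative).
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, dim: int = 512, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze the spatial domain
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),                    # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.mlp(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                         # re-weight the fused channels
```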
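The attention enhancement module can likewise be sketched under assumptions: standard additive (Bahdanau-style) attention produces a context vector, and a learned correlation weight between that result and the query then gates the output, damping content that does not correlate with the current decoding state. The gating layer and its exact form are hypothetical, not the paper's code.

```python
# Sketch of additive attention followed by a correlation-based gate between
# the attention result and the query vector (layer names are assumptions).
import torch
import torch.nn as nn


class EnhancedAdditiveAttention(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, 1)
        self.corr = nn.Linear(2 * dim, 1)  # correlation of result and query

    def forward(self, query: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # query: (B, dim) decoder state; feats: (B, N, dim) image regions.
        scores = self.v(torch.tanh(self.w_q(query).unsqueeze(1) + self.w_k(feats)))
        alpha = torch.softmax(scores, dim=1)  # (B, N, 1) additive attention
        context = (alpha * feats).sum(dim=1)  # attention result vector
        beta = torch.sigmoid(self.corr(torch.cat([context, query], dim=-1)))
        return beta * context                 # suppress uncorrelated content
```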