Image captioning is a cross-modal task combining computer vision and natural language processing that aims to understand image content and generate appropriate sentences. Existing image captioning methods typically use self-attention to capture long-range dependencies within a sample, but this approach not only ignores the potential correlations between samples but also makes no use of prior knowledge, so the generated content deviates from the reference captions. To address these problems, this paper proposes an image captioning method based on External Prior and Self-prior Attention (EPSPA). The external prior module implicitly accounts for the potential correlations between samples, thereby reducing interference from other samples. Meanwhile, self-prior attention fully exploits the attention weights of the previous layer to simulate prior knowledge, which guides the model's feature extraction. EPSPA is evaluated on public datasets with multiple metrics, and the experimental results show that it outperforms existing methods while maintaining a low parameter count.
Image Captioning Method Based on External Prior and Self-prior Attention
Image captioning, a multimodal task that combines computer vision and natural language processing, aims to comprehend the content of images and generate appropriate textual captions. Existing image captioning methods often employ self-attention mechanisms to capture long-range dependencies within samples. However, this approach overlooks the potential correlations among different samples and fails to utilize prior knowledge, resulting in discrepancies between the generated content and reference captions. To address these issues, this paper proposes an image captioning approach based on External Prior and Self-prior Attention (EPSPA). The external prior module implicitly considers the potential correlations among samples and removes interference from other samples. Meanwhile, self-prior attention effectively utilizes attention weights from previous layers to simulate prior knowledge and guide the model in feature extraction. Evaluation results of EPSPA on publicly available datasets using multiple metrics demonstrate its superior performance compared to existing methods while maintaining a low parameter count.
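The abstract describes two mechanisms only at a high level. The sketch below is one plausible reading, not the paper's actual implementation: the external prior is modeled as attention over small learnable memory units shared across all samples (so the memories can implicitly capture inter-sample correlations), and self-prior attention is modeled as blending the current layer's attention map with the previous layer's map, treating the latter as a prior. The function names, the blending coefficient `alpha`, and the double-normalization step are all assumptions introduced here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def external_prior(X, M_k, M_v):
    """Hypothetical external-prior sketch: attend over learnable memories
    M_k, M_v shared by all samples instead of the sample's own keys/values.

    X: (n, d) features of one sample; M_k, M_v: (s, d) shared memories.
    """
    attn = softmax(X @ M_k.T)                               # (n, s)
    attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-9)  # normalize over tokens too
    return attn @ M_v                                       # (n, d)

def self_prior_attention(Q, K, V, prev_attn=None, alpha=0.5):
    """Hypothetical self-prior sketch: mix this layer's attention map with
    the previous layer's map (the 'prior') before aggregating values."""
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))                    # (n, n)
    if prev_attn is not None:
        attn = (1.0 - alpha) * attn + alpha * prev_attn     # prior guidance
    return attn @ V, attn
```

A stack of such layers would pass each layer's returned `attn` into the next layer's `prev_attn`; because the memories `M_k`, `M_v` are tiny relative to per-sample key/value projections, this reading is also consistent with the abstract's claim of a low parameter count.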