传感技术学报 (Chinese Journal of Sensors and Actuators), 2024, Vol. 37, Issue 7: 1193-1201. DOI: 10.3969/j.issn.1004-1699.2024.07.012

Referring Image Segmentation Based on Language and Visual Fusion Transformer

Duan Yong (段勇) 1, Liu Tie (刘铁) 1

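Author information

  • 1. School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110870, Liaoning, China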

Abstract

Referring image segmentation suffers from ambiguous language expressions, insufficient alignment of multimodal features, and incomplete understanding of the image as a whole. To address these problems, a multimodal deep learning model based on Transformer feature fusion and alignment is proposed. The model uses an optimized Darknet53 backbone for image feature extraction, which strengthens its understanding of global features. For language feature extraction, it combines a convolutional neural network, a bidirectional gated recurrent unit (Bi-GRU), and a self-attention mechanism to mine deep semantic features and eliminate the ambiguity of language expressions. A Transformer-based feature alignment structure is then constructed to improve the segmentation detail and segmentation accuracy of the model. Finally, the mean intersection over union (mIoU) and the recognition precision at different thresholds are used as evaluation metrics. Experiments show that the proposed model fully fuses multimodal features, captures their deep semantic information, and produces more accurate recognition results.
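The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so the following is an assumption based on common practice: mIoU is taken as the mean of per-sample IoU values, and precision at a threshold is the fraction of samples whose IoU exceeds that threshold. The function names `iou` and `evaluate`, the threshold values 0.5 to 0.9, the strict `>` comparison, and the handling of two empty masks are all hypothetical choices, not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# Binary mask as nested lists: 1 = referred object, 0 = background.
Mask = List[List[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def evaluate(
    preds: Sequence[Mask],
    targets: Sequence[Mask],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),  # assumed values
) -> Tuple[float, Dict[float, float]]:
    """Return (mIoU, {threshold: precision}) over a set of samples."""
    ious = [iou(p, t) for p, t in zip(preds, targets)]
    miou = sum(ious) / len(ious)
    # Precision at a threshold: share of samples with IoU above it.
    precision = {th: sum(v > th for v in ious) / len(ious) for th in thresholds}
    return miou, precision


# Toy data: the first prediction is exact (IoU = 1), the second covers
# one of three target pixels (IoU = 1/3).
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 1], [0, 0]], [[1, 1], [1, 0]]]
miou, precision = evaluate(preds, targets)
print(f"mIoU = {miou:.4f}")
for th, value in precision.items():
    print(f"Prec@{th} = {value:.2f}")
```

Some papers in this area report an overall IoU (total intersection divided by total union across the dataset) instead of the per-sample mean; the sketch above would need a small change to match that convention.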

Key words

deep learning / referring image segmentation / natural language processing / attention mechanism / Transformer model

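Funding

Liaoning Province Higher Education Institutions Outstanding Scientific and Technological Talents Support Program (LR15045)

General Project of Scientific Research Funds of the Education Department of Liaoning Province (LJKZ0139)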

Publication year

2024
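Journal: 传感技术学报 (Chinese Journal of Sensors and Actuators)
Sponsors: Southeast University; Chinese Society of Micro-Nano Technology
Indexed in: CSTPCD, CSCD, Peking University Core Journals
Impact factor: 1.276
ISSN: 1004-1699
Number of references: 24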