
Referring Image Segmentation Based on Language and Visual Fusion Transformer

To address the problems of ambiguous language expressions, insufficient multimodal feature alignment, and incomplete understanding of the image as a whole in referring image segmentation, a multimodal deep learning model based on Transformer feature fusion and alignment is proposed. The model uses an optimized Darknet53 backbone for image feature extraction, strengthening its understanding of global features. For language feature extraction, it combines a convolutional neural network, a bidirectional gated recurrent unit (Bi-GRU), and a self-attention mechanism to mine deep semantic features and resolve the ambiguity of linguistic expressions. A Transformer-based feature alignment structure is then constructed to improve the segmentation detail and accuracy of the model. Finally, the mean intersection over union (mIoU) and the recognition precision at different IoU thresholds are used as evaluation metrics. Experiments show that the proposed model fully fuses multimodal features, understands their deep semantic information, and produces more accurate recognition results.
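The abstract only outlines the architecture, so the following PyTorch sketch illustrates the general idea rather than the authors' implementation: a language branch that chains word embeddings, a 1-D convolution, a Bi-GRU, and self-attention, and a Transformer encoder that fuses the resulting language tokens with flattened visual features from a CNN backbone. All module names, dimensions, and hyper-parameters (e.g. LanguageEncoder, CrossModalFusion, d_model=256) are illustrative assumptions.

```python
# Minimal sketch, assuming a PyTorch implementation; not the paper's actual code.
import torch
import torch.nn as nn


class LanguageEncoder(nn.Module):
    """Word embedding -> 1-D conv -> Bi-GRU -> self-attention (hypothetical sizes)."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=256, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
        self.bigru = nn.GRU(hidden_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, tokens):                            # tokens: (B, L) word indices
        x = self.embed(tokens)                            # (B, L, E)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local n-gram features
        x, _ = self.bigru(x)                              # bidirectional context
        x, _ = self.self_attn(x, x, x)                    # long-range dependencies
        return x                                          # (B, L, hidden_dim)


class CrossModalFusion(nn.Module):
    """Concatenate flattened visual tokens with language tokens and run a Transformer encoder."""

    def __init__(self, d_model=256, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, C, H, W) from the image backbone (e.g. a Darknet-style CNN);
        # lang_feat: (B, L, C) from LanguageEncoder, with C == d_model.
        b, c, h, w = vis_feat.shape
        vis_tokens = vis_feat.flatten(2).transpose(1, 2)              # (B, H*W, C)
        fused = self.encoder(torch.cat([vis_tokens, lang_feat], dim=1))
        return fused[:, : h * w].transpose(1, 2).reshape(b, c, h, w)  # back to a feature map
```

The evaluation protocol named in the abstract (mean IoU and precision at several IoU thresholds) can be computed as below; the threshold set (0.5 to 0.9) is an assumption based on common practice for referring segmentation benchmarks.

```python
import torch


def iou(pred_mask, gt_mask, eps=1e-6):
    """IoU between binary masks of shape (H, W) or (B, H, W)."""
    pred, gt = pred_mask.bool(), gt_mask.bool()
    inter = (pred & gt).float().sum(dim=(-2, -1))
    union = (pred | gt).float().sum(dim=(-2, -1))
    return inter / (union + eps)


def evaluate(ious, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """mIoU and precision@X over a 1-D tensor of per-sample IoU values (assumed thresholds)."""
    miou = ious.mean().item()
    prec = {f"P@{t}": (ious > t).float().mean().item() for t in thresholds}
    return miou, prec
```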

deep learning; referring image segmentation; natural language processing; attention mechanism; Transformer model

Duan Yong (段勇), Liu Tie (刘铁)


School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110870, Liaoning, China


Program for Supporting Excellent Scientific and Technological Talents in Higher Education Institutions of Liaoning Province; General Project of the Scientific Research Fund of the Liaoning Provincial Department of Education

LR15045, LJKZ0139

2024

Chinese Journal of Sensors and Actuators (传感技术学报)
Southeast University; Chinese Society of Micro-Nano Technology

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 1.276
ISSN: 1004-1699
Year, Volume (Issue): 2024, 37(7)