To address the problems of ambiguous language expression, insufficient multimodal feature alignment, and incomplete understanding of the image as a whole in referring image segmentation tasks, a multimodal deep learning model based on Transformer feature fusion and alignment is proposed. The model uses an optimized Darknet53 backbone network for image feature extraction to enhance global feature understanding. For linguistic feature extraction, it combines a convolutional neural network, a bi-directional gated recurrent unit (Bi-GRU), and a self-attention mechanism to mine deep semantic features and resolve ambiguity in linguistic expressions. Furthermore, a Transformer-based feature alignment structure is constructed to improve the model's segmentation detail and accuracy. Finally, the mean intersection over union (mIoU) and the recognition accuracy at different thresholds are used as evaluation metrics. Experiments verify the effectiveness of the model: it fully fuses the multimodal features, captures their deep semantic information, and produces more accurate recognition results.
Key words
deep learning/referring image segmentation/natural language processing/attention mechanism/transformer model
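The mIoU metric named in the abstract can be illustrated with a minimal sketch: for each predicted/ground-truth mask pair, IoU is the intersection area divided by the union area, and mIoU averages this over the test set. The function names below are illustrative, not from the paper.

```python
import numpy as np

def iou(pred, gt):
    # Intersection over union between two binary masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def mean_iou(preds, gts):
    # mIoU: average IoU over all (prediction, ground truth) pairs.
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

def precision_at(preds, gts, threshold):
    # Recognition accuracy at a threshold: fraction of samples
    # whose IoU exceeds the given threshold (e.g. 0.5).
    return float(np.mean([iou(p, g) > threshold for p, g in zip(preds, gts)]))

pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt = np.array([[1, 0], [0, 0]], dtype=bool)
print(mean_iou([pred], [gt]))            # 0.5
print(precision_at([pred], [gt], 0.5))   # 0.0 (IoU is not strictly above 0.5)
```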