Referring Image Segmentation Based on Language and Visual Fusion Transformer
To address the problems of ambiguous language expressions, insufficient multimodal feature alignment, and incomplete understanding of the image as a whole in referring image segmentation tasks, a multimodal deep learning model based on Transformer feature fusion and alignment is proposed. The model uses an optimized Darknet53 backbone network for image feature extraction to enhance global feature understanding. For linguistic feature extraction, it combines a convolutional neural network, a bi-directional gated recurrent unit (Bi-GRU), and a self-attention mechanism to mine deep semantic features and eliminate the ambiguity of linguistic expressions. Furthermore, a Transformer-based feature alignment structure is constructed to improve the segmentation detail and accuracy of the model. Finally, the mean intersection over union (mIoU) and the recognition accuracy at different thresholds are used as evaluation metrics. Experiments verify the effectiveness of the model: it fully fuses the multimodal features, captures their deep semantic information, and produces more accurate recognition results.
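The evaluation metrics named above can be sketched concretely. The following is a minimal NumPy illustration (not the authors' code) of how mIoU and recognition accuracy at an IoU threshold (often written prec@X in segmentation work) are typically computed over binary masks; the function names and mask format are assumptions for the sake of the example.

```python
import numpy as np

def mean_iou(pred_masks, gt_masks):
    """Mean intersection over union over pairs of binary masks.

    pred_masks, gt_masks: sequences of boolean NumPy arrays of equal shape.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        # Empty prediction and ground truth count as a perfect match.
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

def precision_at(pred_masks, gt_masks, thresh=0.5):
    """Fraction of samples whose IoU meets the threshold (prec@thresh)."""
    hits = 0
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        iou = 1.0 if union == 0 else inter / union
        hits += iou >= thresh
    return hits / len(pred_masks)
```

For example, a prediction overlapping the ground truth on one of two marked pixels yields an IoU of 0.5, so it counts as a hit at a 0.5 threshold but a miss at 0.6.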
Keywords: deep learning; referring image segmentation; natural language processing; attention mechanism; transformer model