Multiframe spatiotemporal attention-guided semisupervised video segmentation
Objective Video object segmentation (VOS) aims to provide high-quality segmentation of target object instances throughout an input video sequence, obtaining pixel-level masks of the target objects and thereby finely separating the targets from the background. Compared with bounding-box-level tasks such as object tracking and detection, which select targets with rectangular boxes, VOS offers pixel-level accuracy, which is more conducive to locating the target precisely and outlining the details of its edges. Depending on the supervision information provided, VOS can be divided into three scenarios: semisupervised VOS, interactive VOS, and unsupervised VOS. In this study, we focus on the semisupervised task. In semisupervised VOS, a pixel-level annotated mask of the first frame of the video is provided, and every subsequent prediction frame can fully utilize this annotated mask to assist in computing its segmentation result. With the development of deep neural networks, current semisupervised VOS methods are mostly based on deep learning. These methods fall into three categories: detection-, matching-, and propagation-based methods. Detection-based algorithms treat VOS as an image object segmentation task without considering the temporal association of the video, assuming that a strong frame-level object detector and segmenter is sufficient to segment the target frame by frame. Matching-based works typically segment video objects by calculating pixel-level or semantic feature matching scores between the template frame and the current prediction frame. Propagation-based methods propagate the feature information of the frames preceding the prediction frame to that frame and calculate the correlation between the prediction-frame features and the previous-frame features to represent video context information. This context information locates the key areas of the entire video and can guide single-frame image segmentation. Motion-based propagation methods come in two types: one introduces optical flow to train the VOS model, and the other learns deep target features from the previous frame's target mask and refines the target mask in the current frame. Existing semisupervised video segmentation mostly relies on optical flow to model the feature association between keyframes and the current frame. However, optical flow is prone to errors under occlusion, special textures, and similar conditions, leading to problems in multiframe fusion. Aiming to integrate multiframe features, this study extracts the appearance feature information of the first frame and the positional information of the adjacent keyframes and fuses these features through a Transformer and an improved path aggregation network (PAN) module, thereby learning and integrating features on the basis of multiframe spatiotemporal attention.

Method In this study, we propose a semisupervised VOS method that fuses features through the Transformer mechanism and integrates multiframe appearance feature information and positional feature information. Specifically, the algorithm comprises the following steps. 1) Appearance information feature extraction network: first, we construct an appearance information feature extraction network. This module, modified from CSPDarknet53, consists of CBS (convolution, batch normalization, and SiLU) modules, cross stage partial residual network (CSPRes) modules, residual spatial pyramid pooling (ResSPP) modules, and receptive field enhancement and pyramid pooling (REP) modules. The first frame of the video serves as the input, which is passed through three CBS modules to obtain the shallow features Fs. These features are then processed through six CSPRes modules, followed by a ResSPP module and another CBS module, to produce the output Fd, which represents the appearance information extracted from the first frame (see the backbone sketch after this section). 2) Current-frame feature extraction network: we then build a network to extract features from the current frame. This network comprises three cascaded CBS modules that extract the current frame's feature information. Simultaneously, the Transformer feature fusion module, which consists of an encoder and a decoder, merges the features of the current frame with those of the first frame so that the appearance information of the first frame guides the extraction of the current frame's features. 3) Local feature matching: with the aid of the mask maps of several adjacent frames and the feature map of the current frame, local feature matching is performed. This process determines the frames whose positional information is strongly correlated with the current frame, treats them as nearby keyframes, and uses them to guide the extraction of positional information from the current frame (see the matching sketch after this section). 4) Enhanced PAN feature aggregation module: finally, the input feature maps are passed through a spatial pyramid pooling (SPP) module that contains max-pooling layers of sizes 3 × 3, 5 × 5, and 9 × 9. The improved PAN structure then fuses the features across different layers, and a concatenation operation integrates deep semantic information with shallow semantic information (see the aggregation sketch after this section). By integrating these steps, the proposed method aims to improve the accuracy and robustness of VOS.
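To make step 1 concrete, the following is a minimal PyTorch-style sketch of a CBS block (convolution, batch normalization, SiLU) and of the stage layout that yields the shallow features Fs and the deep appearance features Fd. It is a sketch under stated assumptions, not the paper's implementation: the internals of the CSPRes and ResSPP modules are not given in this abstract, so a plain residual stand-in (ResStage) is used, and the class names, channel width, and strides are illustrative.

import torch
import torch.nn as nn

class CBS(nn.Module):
    # Convolution + Batch Normalization + SiLU (the "CBS" module named above)
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResStage(nn.Module):
    # Placeholder for a CSPRes / ResSPP stage: their internals are not specified
    # in the abstract, so a residual pair of CBS blocks stands in here.
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(CBS(c, c), CBS(c, c))
    def forward(self, x):
        return x + self.body(x)

class AppearanceBackbone(nn.Module):
    # First-frame appearance branch: 3 CBS modules -> shallow features Fs,
    # then 6 CSPRes-like stages, a ResSPP-like stage, and a final CBS -> Fd.
    def __init__(self, c=64):
        super().__init__()
        self.stem = nn.Sequential(CBS(3, c, s=2), CBS(c, c), CBS(c, c))  # -> Fs
        self.stages = nn.Sequential(*[ResStage(c) for _ in range(6)])    # 6 CSPRes-like stages
        self.res_spp = ResStage(c)                                       # ResSPP stand-in
        self.head = CBS(c, c)                                            # final CBS -> Fd
    def forward(self, first_frame):
        fs = self.stem(first_frame)                    # shallow features Fs
        fd = self.head(self.res_spp(self.stages(fs)))  # deep appearance features Fd
        return fs, fd

# usage (illustrative input size): fs, fd = AppearanceBackbone()(torch.randn(1, 3, 384, 384))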
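Step 3 selects, among several adjacent frames, those whose information correlates most strongly with the current frame and treats them as nearby keyframes. The sketch below illustrates one plausible realization using cosine similarity between mask-weighted frame features; the function name select_keyframes, the mask weighting, and the top-k criterion are assumptions rather than the paper's exact matching rule.

import torch
import torch.nn.functional as F

def select_keyframes(curr_feat, prev_feats, prev_masks, top_k=2):
    """Pick the adjacent frames whose mask-weighted features correlate most
    strongly with the current frame (a simplified, assumed criterion).

    curr_feat : (C, H, W) feature map of the current frame
    prev_feats: (T, C, H, W) feature maps of the adjacent frames
    prev_masks: (T, 1, H, W) predicted masks of the adjacent frames
    """
    q = F.normalize(curr_feat.flatten(), dim=0)              # query vector
    scores = []
    for feat, mask in zip(prev_feats, prev_masks):
        k = F.normalize((feat * mask).flatten(), dim=0)      # mask-weighted key
        scores.append(torch.dot(q, k))                       # cosine similarity
    scores = torch.stack(scores)
    idx = torch.topk(scores, k=min(top_k, len(scores))).indices  # strongest frames
    return idx, scores

# usage (illustrative shapes):
# idx, s = select_keyframes(torch.randn(64, 24, 24),
#                           torch.randn(5, 64, 24, 24),
#                           torch.rand(5, 1, 24, 24))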
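Step 4 states only the SPP kernel sizes (3 × 3, 5 × 5, 9 × 9) and the concatenation of deep and shallow semantics; the sketch below therefore shows an SPP module with exactly those max-pooling windows and a simple upsample-and-concatenate fusion step. The internal structure of the improved PAN and the names SPP and aggregate used here are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SPP(nn.Module):
    # Spatial pyramid pooling with the 3x3, 5x5, and 9x9 max-pooling layers
    # stated above; stride 1 with matching padding keeps the spatial size.
    def __init__(self, channels):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (3, 5, 9)
        )
        self.fuse = nn.Conv2d(channels * 4, channels, kernel_size=1)
    def forward(self, x):
        pooled = [x] + [p(x) for p in self.pools]
        return self.fuse(torch.cat(pooled, dim=1))

def aggregate(deep_feat, shallow_feat, spp):
    # Assumed PAN-style fusion step: pool the deep map, upsample it to the
    # shallow map's resolution, and concatenate deep and shallow semantics.
    deep = spp(deep_feat)
    deep = F.interpolate(deep, size=shallow_feat.shape[-2:], mode="nearest")
    return torch.cat([deep, shallow_feat], dim=1)

# usage (illustrative shapes):
# spp = SPP(256)
# out = aggregate(torch.randn(1, 256, 12, 12), torch.randn(1, 256, 48, 48), spp)
# out.shape -> torch.Size([1, 512, 48, 48])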
Result In the experiments, the proposed method required neither online fine-tuning nor postprocessing. Our algorithm was compared with 10 current mainstream methods on the DAVIS-2016 and DAVIS-2017 datasets and with five methods on the YouTube-VOS dataset. On the DAVIS-2016 dataset, the algorithm achieved commendable performance, with a region similarity score J of 81.5% and a contour accuracy score F of 80.9%, an improvement of 1.2% over the highest-performing comparison method. On the DAVIS-2017 dataset, the J and F scores reached 78.4% and 77.9%, respectively, an improvement of 1.3% over the highest-performing comparison method. The running speed of our algorithm is 22 frame/s, ranking second, 1.6% lower than that of the pixel-level matching (PLM) algorithm. On the YouTube-VOS dataset, competitive results were also achieved, with the average of the J and F scores reaching 71.2%, surpassing all comparison methods.

Conclusion The semisupervised video segmentation algorithm based on multiframe spatiotemporal attention can effectively integrate global and local information while segmenting target objects. The loss of detailed information is thus minimized, and while maintaining high efficiency, the algorithm also effectively improves the accuracy of semisupervised video segmentation.
video object segmentation (VOS); feature extraction network; appearance feature information; spatiotemporal attention; feature aggregation