Real-time high-resolution video portrait matting network combined with background image
Objective Video matting is one of the most common operations in visual image processing. It aims to separate a certain part of an image from the original image into a separate layer, which can then be applied to specific scenes for later video composition. In recent years, real-time portrait matting with neural networks has become a research hotspot in computer vision. Existing networks cannot meet real-time requirements when processing high-resolution video, and their matting results at the edges of high-resolution targets remain blurry. To address these problems, several recently proposed methods that use various kinds of auxiliary information to guide mask estimation for high-resolution images have demonstrated good performance. However, many of these methods still fail to learn the edge and detail information of portraits well. Therefore, this study proposes a real-time high-resolution video portrait matting network combined with background images. Method A two-stage network composed of a base network and a refinement network is presented. To keep the network lightweight, high-resolution feature maps are first downsampled at a sampling rate D. In the base network, the multi-scale features of video frames are extracted by the encoder module and fused by a pyramid pooling module; feeding these fused features into the recurrent decoder helps the decoder learn the multi-scale features of video frames. In the recurrent decoder, a residual gated recurrent unit (GRU) aggregates the temporal information between consecutive video frames and generates the mask map, foreground residual map, and hidden feature map. A residual structure is used to reduce model parameters and improve the real-time performance of the network. In the residual GRU, the temporal information of the video is fully exploited to guide the construction of the mask maps of the video frame sequence. To improve the real-time matting
performance of high-resolution images, a high-resolution information guidance module is designed in the refinement network. The initial high-resolution video frames and the low-resolution predicted features (mask map, foreground residual map, and hidden feature map) are fed into this module, which generates high-quality portrait matting results by using high-resolution image information to guide the low-resolution predictions. In the high-resolution information guidance module, the combination of covariance mean filtering, variance mean filtering, and pointwise convolution effectively improves the matting quality in the detailed areas of portrait contours in high-resolution video frames. Under the synergistic effects of the base and refinement networks, the designed network can not only fully extract multi-scale information from low-resolution video frames but also learn the edge information of portraits in high-resolution video frames more thoroughly. This design leads to more accurate prediction of mask maps and foreground images and improves the generalization ability of the matting network across multiple resolutions. In addition, the high-resolution image downsampling scheme, lightweight pyramid pooling module, and residual link structure further reduce the number of network parameters, improving the real-time performance of the network. Result We implement our network in PyTorch on an NVIDIA GTX 1080Ti GPU with 11 GB RAM. The batch size is 1, and the Adam optimizer is used. The network is trained on three datasets in sequence: first, the base network is trained on the Video240K SD dataset with an input frame sequence length of 15 for 8 epochs; then, the refinement network is trained on the Video240K HD dataset for 1 epoch; finally, to improve the robustness of the model on high-resolution videos, the refinement network is further trained on the Human2K dataset with a downsampling rate D of
0.25 and an input frame sequence length of 2 for 50 epochs. Compared with related network models from recent years, the experimental results show that the proposed method outperforms other methods on the Video240K SD dataset and the Human2K dataset. On the Video240K SD dataset, the evaluation metrics (sum of absolute differences (SAD), mean squared error (MSE), gradient error (Grad), and connectivity error (Conn)) are improved by 26.1%, 50.6%, 56.9%, and 39.5%, respectively. In particular, on the high-resolution Human2K dataset, the proposed method is significantly superior to other state-of-the-art methods, improving SAD, MSE, Grad, and Conn by 18.8%, 39.2%, 40.7%, and 20.9%, respectively, while achieving the lowest network complexity at 4K resolution (28.78 GMACs). The running speed reaches 49 frames/s for low-resolution video (512 × 288 pixels) and 42.4 frames/s for medium-resolution video (1 024 × 576 pixels). In particular, the running speed reaches 26 frames/s for 4K video and 43 frames/s for HD video on an NVIDIA GTX 1080Ti GPU, a significant improvement over other state-of-the-art methods. Conclusion The network model proposed in this study can better complete the real-time matting task for high-resolution portraits. The pyramid pooling module in the base network effectively extracts and integrates the multi-scale information of video frames, while the residual GRU module effectively aggregates temporal information between consecutive frames. The high-resolution information guidance module captures high-resolution information in images and guides the low-resolution predictions to learn it. The improved network effectively enhances the matting quality at the edges of high-resolution portraits. Experiments on the high-resolution Human2K dataset show that the proposed network is more effective in
predicting high-resolution mask maps. It achieves high real-time processing speed and can better support advanced applications such as film and television production, short-video social networking, and online conferencing.
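The residual GRU described in the Method section can be illustrated with a minimal sketch. This is not the paper's exact architecture: here the gates are pointwise (channel-wise) linear maps over NumPy feature maps rather than convolutions, the residual link simply adds the GRU state back to the input features, and the class and weight names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ResidualGRUCell:
    """Per-pixel GRU over (H, W, C) feature maps with a residual output link.

    Hypothetical sketch: gates are 1 x 1 (channel-wise) linear maps; the
    paper's residual GRU operates on convolutional feature maps instead.
    """

    def __init__(self, channels, seed=0):
        rng = np.random.default_rng(seed)
        shape = (2 * channels, channels)        # each gate acts on concat([x, h])
        self.w_update = rng.normal(0.0, 0.1, shape)
        self.w_reset = rng.normal(0.0, 0.1, shape)
        self.w_cand = rng.normal(0.0, 0.1, shape)

    def step(self, x, h):
        xh = np.concatenate([x, h], axis=-1)
        z = sigmoid(xh @ self.w_update)                                  # update gate
        r = sigmoid(xh @ self.w_reset)                                   # reset gate
        cand = np.tanh(np.concatenate([x, r * h], axis=-1) @ self.w_cand)
        h_new = (1.0 - z) * h + z * cand                                 # GRU state update
        return x + h_new, h_new    # residual link: output = input + recurrent state

    def forward(self, frames):
        """frames: (T, H, W, C); aggregates temporal info across the sequence."""
        h = np.zeros_like(frames[0])
        outputs = []
        for x in frames:
            y, h = self.step(x, h)
            outputs.append(y)
        return np.stack(outputs), h
```

The hidden state carries information between consecutive frames, which is what lets the mask map of each frame benefit from temporal context, while the residual output path keeps the extra parameter count small.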
Keywords: real-time human figure matting; neural network; multi-scale features; temporal information; high resolution
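The combination of covariance mean filtering, variance mean filtering, and pointwise processing in the high-resolution information guidance module closely resembles a classic guided filter. The following is a minimal NumPy sketch under that assumption: it assumes the coarse (low-resolution) mask map has already been upsampled to the guide's resolution, and the window radius r and regularizer eps are illustrative values, not the paper's.

```python
import numpy as np

def box_filter(x, r):
    """Mean over a (2r+1) x (2r+1) window, edge-padded, via an integral image."""
    k = 2 * r + 1
    pad = np.pad(x, r, mode="edge")
    s = np.cumsum(np.cumsum(pad, axis=0), axis=1)
    s = np.pad(s, ((1, 0), (1, 0)))          # s[i, j] = sum of pad[:i, :j]
    h, w = x.shape
    return (s[k:k + h, k:k + w] - s[:h, k:k + w]
            - s[k:k + h, :w] + s[:h, :w]) / (k * k)

def guided_refine(guide, coarse, r=4, eps=1e-4):
    """Refine a coarse mask map with a high-resolution guide image.

    Local variance/covariance mean filtering plus a pointwise linear
    model -- a guided-filter-style analog of the guidance module.
    """
    mean_i = box_filter(guide, r)
    mean_p = box_filter(coarse, r)
    var_i = box_filter(guide * guide, r) - mean_i ** 2        # variance mean filtering
    cov_ip = box_filter(guide * coarse, r) - mean_i * mean_p  # covariance mean filtering
    a = cov_ip / (var_i + eps)    # pointwise linear coefficients
    b = mean_p - a * mean_i
    return box_filter(a, r) * guide + box_filter(b, r)
```

Where the guide and the coarse matte vary together, the coefficient a approaches 1 and the output follows the high-resolution guide; in flat regions a collapses toward 0 and the output falls back to the smoothed coarse matte. This is the mechanism by which high-resolution edge detail is transferred to a low-resolution prediction.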