Video Decaptioning Based on Decoupled Spatial-Temporal Transformer
While video captions deliver information, captions burned into the video also hinder its reuse. This paper proposes a video subtitle removal model based on a decoupled spatial-temporal Transformer that removes subtitle text from video sequences and recovers the background obscured by the subtitle regions. The overall framework is divided into two parts: a subtitle mask extraction module and a subtitle removal module. The former quickly and accurately obtains the binary subtitle mask of the input video sequence and feeds it as auxiliary information to the latter, the decoupled spatial-temporal Transformer based subtitle removal module, which removes the subtitle text and restores the background texture.
video decaptioning; deep learning; Transformer; attention mechanism
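
The two-stage design described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the single-head attention without learned projections, the spatial-then-temporal ordering, and the final compositing step (keeping original pixels outside the mask) are all assumptions made for illustration, and `restore` is a hypothetical stand-in for the removal network.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens):
    # Plain self-attention over the second-to-last axis; tokens: (..., N, C).
    # Learned query/key/value projections are omitted for brevity (assumption).
    scale = np.sqrt(tokens.shape[-1])
    scores = tokens @ np.swapaxes(tokens, -1, -2) / scale
    return softmax(scores, axis=-1) @ tokens

def decoupled_st_attention(feats):
    # feats: (T, N, C) = frames, spatial tokens per frame, channels.
    # Spatial step: tokens attend only to tokens of the same frame.
    spatial = attention(feats)                         # (T, N, C)
    # Temporal step: each spatial location attends across frames.
    temporal = attention(np.swapaxes(spatial, 0, 1))   # (N, T, C)
    return np.swapaxes(temporal, 0, 1)                 # (T, N, C)

def remove_subtitles(frames, masks, restore):
    # frames: (T, H, W, 3); masks: (T, H, W) with 1 marking subtitle pixels.
    # restore: hypothetical callable taking (masked_frames, holes) and
    # returning predicted frames of the same shape as `frames`.
    holes = masks[..., None].astype(frames.dtype)
    masked = frames * (1 - holes)
    predicted = restore(masked, holes)
    # Assumed compositing: trust the prediction only inside the mask.
    return masked + predicted * holes
```

Splitting attention this way means each step compares N tokens within a frame or T tokens across frames, rather than all T×N tokens at once, which is the usual motivation for decoupling the two axes.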