Referring video object segmentation (RVOS) is an active research topic in cross-media tasks spanning video and language. It aims to segment the entities in a given video that are referred to by textual descriptions. Unlike conventional visual segmentation tasks, which depend on pre-defined classes, RVOS must understand the given expressions in order to locate and segment the referred entities without the help of pre-defined classes. Because the textual expressions are arbitrary and no pixel-wise masks are available as a reference, RVOS is more challenging than conventional video segmentation. Although RVOS is a new task in cross-modal understanding, it has important application prospects in many areas (e.g., security monitoring, vehicle tracking, and person re-identification), and a growing number of notable methods have been proposed. This paper roughly divides the existing solutions into four categories according to their research approach: dynamic convolution based, attention based, multi-level information learning based, and end-to-end sequence prediction based methods. Qualitative and quantitative performance comparisons are then presented for analysis. Lastly, the paper summarizes several issues in current methods and proposes some suggestions for further improving the performance of RVOS in future work.
cross-modal search; referring video object segmentation; cross-modal understanding