Self-supervised Learning Based on Multi-modal Arbitrary Rotation for RGB-D Semantic Segmentation
Self-supervised learning on RGB-D datasets has attracted extensive attention. However, most methods focus on global-level representation learning, which tends to lose the local details that are crucial for recognizing objects. The geometric consistency between image and depth in RGB-D data can serve as a clue to guide self-supervised feature learning. In this study, ArbRot is proposed: it rotates inputs by arbitrary, unrestricted angles and generates multiple pseudo-labels for pretext tasks, while also establishing the relationship between global and local context. ArbRot can be jointly trained with contrastive learning methods to build a multi-modal, multi-pretext-task self-supervised learning framework that enforces feature consistency across image and depth views, thereby providing an effective initialization for RGB-D semantic segmentation. Experimental results on SUN RGB-D and NYU Depth Dataset V2 show that the feature representations obtained by multi-modal, arbitrary-orientation rotation self-supervised learning are of higher quality than those of the baseline models.
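To make the pretext task concrete, the sketch below shows one plausible way an arbitrary-rotation view with an angle pseudo-label could be generated for an RGB-D pair; the function name, tensor shapes, and label normalisation are assumptions for illustration, not the authors' implementation.

```python
import random
import torch
import torchvision.transforms.functional as TF

def arbitrary_rotation_view(rgb: torch.Tensor, depth: torch.Tensor):
    """Hypothetical sketch: rotate an RGB-D pair by the same random angle
    and return that angle as the pseudo-label for a rotation-prediction
    pretext task.

    rgb:   (3, H, W) image tensor
    depth: (1, H, W) depth tensor
    """
    # Sample an unrestricted rotation angle in [0, 360) degrees.
    angle = random.uniform(0.0, 360.0)
    # Apply the identical rotation to both modalities so their
    # geometric correspondence is preserved.
    rgb_rot = TF.rotate(rgb, angle)
    depth_rot = TF.rotate(depth, angle)
    # The sampled angle serves as the rotation-prediction target
    # (normalised to [0, 1) here; an assumption, not the paper's choice).
    pseudo_label = torch.tensor(angle / 360.0)
    return rgb_rot, depth_rot, pseudo_label
```

Applying the same rotation to both modalities keeps the image-depth geometric consistency that the framework relies on, so the rotation target and any contrastive objective see aligned views.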