An occlusion object detection method based on self-supervised mask image modeling
As a fundamental task in computer vision, object detection addresses the challenge of categorizing objects and accurately localizing them. However, real-world scenes frequently contain objects that are partially or entirely occluded, which poses substantial difficulties for detection models. To improve the versatility and detection performance of object detection networks across a wide range of occlusion scenarios, this paper introduces a self-supervised masked image modeling approach. The approach is structured into two stages: pre-training and fine-tuning. In the pre-training stage, a surrogate task deliberately masks local regions of unlabeled images and then reconstructs them; this proxy task provides the model with pre-training experience that helps it adapt to varied occlusion patterns and degrees. In the fine-tuning stage, the challenge of detecting objects of varying scales and sizes within occluded environments is addressed by a pyramid structure built on the Vision Transformer (ViT), a state-of-the-art architecture in computer vision. The resulting ViT-FPN (Vision Transformer Feature Pyramid Network) substantially improves the detector's ability to handle diverse occlusion scenarios. The method is rigorously evaluated on benchmark datasets, including CrowdHuman and CityPersons. Experimental results demonstrate that the self-supervised masked image modeling approach presented in this study outperforms other methods in detecting occluded objects.
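The masking-and-reconstruction proxy task described above hinges on hiding random local patches of an unlabeled image so the model learns to cope with missing regions. A minimal sketch of such patch masking is given below; the function name, patch size, and mask ratio are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def random_patch_mask(image, patch_size=16, mask_ratio=0.6, seed=None):
    """Zero out a random subset of non-overlapping patches in an image.

    image: (H, W, C) array with H and W divisible by patch_size.
    Returns the masked image and a boolean (H//ps, W//ps) grid where
    True marks a masked patch (the reconstruction target).
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[0] // patch_size, image.shape[1] // patch_size
    n_patches = h * w
    n_masked = int(round(mask_ratio * n_patches))
    # Choose which patches to hide, without replacement.
    idx = rng.choice(n_patches, size=n_masked, replace=False)
    grid = np.zeros(n_patches, dtype=bool)
    grid[idx] = True
    grid = grid.reshape(h, w)
    # Zero out the pixels of every masked patch.
    masked = image.copy()
    for i in range(h):
        for j in range(w):
            if grid[i, j]:
                masked[i * patch_size:(i + 1) * patch_size,
                       j * patch_size:(j + 1) * patch_size] = 0
    return masked, grid

# Example: mask 60% of the 16x16 patches of a 224x224 RGB image.
img = np.ones((224, 224, 3), dtype=np.float32)
masked_img, mask_grid = random_patch_mask(img, mask_ratio=0.6, seed=0)
```

During pre-training, the reconstruction network would be trained to recover the original pixel values inside the masked patches, exposing the model to occlusion-like corruption at varying ratios.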