Weakly supervised semantic segmentation based on deep learning
Semantic segmentation is an important and fundamental task in computer vision. Its goal is to assign a semantic category label to each pixel in an image, achieving pixel-level understanding. It has wide applications in areas such as autonomous driving, virtual reality, and medical image analysis. With the development of deep learning in recent years, remarkable progress has been achieved in fully supervised semantic segmentation, which requires a large amount of training data with pixel-level annotations. However, accurate pixel-level annotations are difficult to obtain because they demand substantial time, money, and human labeling effort, which limits the widespread application of fully supervised methods in practice. To reduce the cost of annotating data and further expand the application scenarios of semantic segmentation, researchers are paying increasing attention to weakly supervised semantic segmentation (WSSS) based on deep learning. The goal is to develop a semantic segmentation model that uses weak annotation information instead of dense pixel-level annotations to predict pixel-level segmentation accurately. Weak annotations mainly include image-level, bounding-box, scribble, and point annotations. The key problem in WSSS is how to exploit the limited annotation information, incorporate appropriate training strategies, and design powerful models to bridge the gap between weak supervision and pixel-level annotations. This study aims to classify and summarize WSSS methods based on deep learning, analyze the challenges and problems encountered by recent methods, and provide insights into future research directions. First, we introduce WSSS as a solution to the limitations of fully supervised semantic segmentation. Second, we introduce the related datasets and evaluation metrics. Third, we review and discuss the research progress of WSSS in three categories: image-level annotations, other weak annotations, and assistance from large-scale models, where the second category
includes bounding-box, scribble, and point annotations. Specifically, image-level annotations provide only the object categories contained in the image, without specifying the positions of the target objects. Existing methods typically follow a two-stage training process: first, producing a class activation map (CAM), which serves as the initial seed regions from which high-quality pixel-level pseudo labels are generated; and second, training a fully supervised semantic segmentation model using the produced pixel-level pseudo labels. According to whether the pixel-level pseudo labels are updated during training in the second stage, WSSS based on image-level annotations can be further divided into offline and online approaches. For offline approaches, existing research treats the two stages independently: the initial seed regions are optimized to obtain more reliable pixel-level pseudo labels, which remain unchanged throughout the second stage. These approaches are often divided into six classes according to their optimization strategies: the ensemble of CAM, image erasing, co-occurrence relationship decoupling, affinity propagation, additional supervised information, and self-supervised learning. For online approaches, the pixel-level pseudo labels are updated continually throughout training in the second stage, so the production of pixel-level pseudo labels and the semantic segmentation model are jointly optimized. Online approaches can be trained end to end, making the training process more efficient. Compared with image-level annotations, other weak annotations, including bounding-box, scribble, and point annotations, provide stronger supervisory signals. Among them, bounding-box annotations provide not only object category labels but also information about object positions. The regions outside the bounding box are always considered background, while the box regions contain both foreground and background areas. Therefore, for bounding-box annotations, existing research mainly starts from
accurately distinguishing foreground areas from background regions within the bounding box, thereby producing more accurate pixel-level pseudo labels for training the subsequent semantic segmentation networks. Scribble and point annotations not only indicate the categories of objects contained in the image but also provide local positional information about the target objects. For scribble annotations, more complete pseudo labels can be produced to supervise semantic segmentation by inferring the category of unlabeled regions from the annotated scribbles. For point annotations, the associated semantic information is expanded to the entire image through label propagation, distance metric learning, and loss function optimization. In addition, with the rapid development of large-scale models, this paper further discusses recent research achievements in using large-scale models to assist WSSS tasks. Large-scale models can leverage their pretrained universal knowledge to understand images and generate accurate pixel-level pseudo labels, thus improving the final segmentation performance. This paper also reports quantitative segmentation results on the pattern analysis, statistical modeling and computational learning visual object classes 2012 (PASCAL VOC 2012) dataset to evaluate the performance of different WSSS methods. Finally, four challenges and potential future research directions are provided. First, a certain performance gap remains between weakly supervised and fully supervised methods. To bridge this gap, research should keep improving the accuracy of pixel-level pseudo labels. Second, when WSSS models are applied to real-world scenarios, they may encounter object categories that never appeared in the training data. This requires the models to have a certain adaptability to identify and segment unknown objects. Third, existing research mainly focuses on improving accuracy without considering the model size and inference speed of WSSS networks, posing a major
challenge for the deployment of the model in real-world applications that require real-time estimation and online decisions. Fourth, the scarcity of relevant datasets for evaluating different WSSS models and algorithms is also a major obstacle, which leads to performance degradation and limits generalization capability. Therefore, large-scale WSSS datasets with high quality, great diversity, and wide variation of image types must be constructed.
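
As an illustration of the two-stage image-level pipeline summarized above, the following minimal sketch shows how a CAM might be computed and converted into initial pixel-level pseudo labels. It assumes the common formulation in which a CAM is the classifier-weighted sum of the final convolutional feature maps of a network trained with global average pooling. The function names, the array shapes, and the background threshold of 0.3 are hypothetical choices made for this example, not details taken from any specific method covered by the survey, and the refinement strategies listed above (erasing, affinity propagation, and so on) are omitted.

```python
import numpy as np


def class_activation_maps(features, fc_weights):
    """Compute per-class CAMs, each normalized to [0, 1].

    features:   (C, H, W) final convolutional feature maps (assumed input).
    fc_weights: (K, C) weights of the classifier that follows global
                average pooling (assumed input).
    Returns an array of shape (K, H, W).
    """
    # Weighted sum of feature channels for every class.
    cams = np.einsum("kc,chw->khw", fc_weights, features)
    # Keep only positive evidence for each class.
    cams = np.maximum(cams, 0.0)
    # Normalize each class map by its own peak (guard against all-zero maps).
    peak = cams.reshape(cams.shape[0], -1).max(axis=1)
    peak = np.where(peak > 0, peak, 1.0)
    return cams / peak[:, None, None]


def cams_to_pseudo_labels(cams, image_labels, bg_threshold=0.3):
    """Turn CAMs into an (H, W) integer pseudo-label map.

    image_labels: (K,) binary vector, 1 if the class is present in the image.
    bg_threshold: hypothetical cut-off below which a pixel is background.
    Label 0 means background; label k + 1 means class index k.
    """
    # Suppress classes that the image-level annotation says are absent.
    cams = cams * image_labels[:, None, None]
    best_score = cams.max(axis=0)
    best_class = cams.argmax(axis=0) + 1
    return np.where(best_score >= bg_threshold, best_class, 0)


if __name__ == "__main__":
    # Toy example: 2 feature channels, 2 classes, a 2x2 image.
    features = np.array([[[1.0, 0.0], [0.0, 0.0]],
                         [[0.0, 0.0], [0.0, 2.0]]])
    fc_weights = np.eye(2)
    cams = class_activation_maps(features, fc_weights)
    print(cams_to_pseudo_labels(cams, np.array([1, 1])))
```

In an offline approach, a label map of this kind would be refined once and then held fixed while the segmentation network is trained; in an online approach, it would be recomputed as training proceeds.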