Unsupervised semantic segmentation of single-frame aerial images based on pretrained models
To address the challenges of high annotation cost, limited generalizability, and low accuracy in semantic segmentation of aerial images, a two-stage unsupervised semantic segmentation network (TUSSNet) was proposed to train on single-frame aerial images and produce the final semantic segmentation results. The algorithm was divided into two stages. First, the contrastive language-image pretraining (CLIP) model was applied to generate coarse-grained semantic labels for the aerial image, followed by warm-up training of the network. Second, building on the first stage, the segment anything model (SAM) was leveraged to predict the fine-grained categories of the aerial image and to generate refined category-mask pseudo-labels. The network was then iteratively optimized to obtain the final semantic segmentation results. Experimental results demonstrate a significant improvement in segmentation accuracy over existing unsupervised methods for aerial images. Moreover, the algorithm provides precise semantic information.
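The two-stage flow described above can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: the CLIP and SAM calls are replaced by stub functions (simulated per-pixel similarity scores and block-wise mask voting), and all function names, the block size, and the iteration count are assumptions made for illustration.

```python
import numpy as np


def clip_coarse_labels(image, class_names):
    # Stand-in for stage 1: the real pipeline would embed image regions with
    # CLIP's image encoder and compare them to text embeddings of class_names.
    # Here, similarity scores are simulated with a fixed random generator.
    rng = np.random.default_rng(0)
    scores = rng.random((len(class_names),) + image.shape[:2])
    return scores.argmax(axis=0)  # coarse per-pixel label map


def sam_refine_masks(coarse_labels, block=2):
    # Stand-in for stage 2: the real pipeline would prompt SAM for fine-grained
    # masks and assign each mask the majority coarse label as a pseudo-label.
    # Here, each block x block tile plays the role of one SAM mask.
    refined = coarse_labels.copy()
    h, w = refined.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = refined[y:y + block, x:x + block]
            vals, counts = np.unique(tile, return_counts=True)
            refined[y:y + block, x:x + block] = vals[counts.argmax()]
    return refined


def tussnet_pipeline(image, class_names, iterations=3):
    # Stage 1: CLIP-derived coarse labels (warm-up training of the
    # segmentation network is omitted in this sketch).
    labels = clip_coarse_labels(image, class_names)
    # Stage 2: iterate SAM-style refinement, standing in for the loop of
    # pseudo-label generation and network re-optimization.
    for _ in range(iterations):
        labels = sam_refine_masks(labels)
    return labels


seg = tussnet_pipeline(np.zeros((8, 8, 3)), ["building", "road", "vegetation"])
print(seg.shape)  # (8, 8)
```

In the actual method, the refined pseudo-labels would supervise further training of the segmentation network rather than being returned directly; the sketch only traces the label-flow between the two stages.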