A crowd counting network based on a multi-scale pyramid Transformer
A crowd counting network based on a multi-scale pyramid Transformer (MSPT-Net) is proposed to address the low counting accuracy in dense crowd scenes caused by complex backgrounds and large variations in target scale. In the feature extraction stage, a pyramid Transformer backbone built on depthwise separable self-attention is designed to capture both local and global image information, which mitigates the loss of counting accuracy caused by complex backgrounds. A feature pyramid fusion module and a regression head with multi-scale receptive fields are then designed to integrate shallow detail features with deep semantic features in dense crowd scenes, enhancing the network's ability to capture targets of different scales. Finally, the proposed model is trained with deep supervision and validated on three publicly available datasets. The experimental results show that MSPT-Net achieves higher counting accuracy than mainstream crowd counting networks under both fully supervised and weakly supervised learning strategies, particularly on dense crowd images with complex backgrounds and significant changes in target scale. At the same time, the proposed method keeps the number of parameters and the computational cost comparatively small.
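To make the idea of fusing shallow detail features with deep semantic features concrete, the following is a minimal sketch of a generic top-down pyramid fusion step, not the MSPT-Net module itself. It assumes single-channel feature maps stored as nested lists, a resolution that halves at each pyramid level, nearest-neighbour upsampling, and plain element-wise addition; the function names `upsample_nearest` and `fuse_top_down` are hypothetical, and the learned convolutions a real fusion module would contain are omitted.

```python
from typing import List

# One single-channel feature map stored as rows x cols (illustrative assumption).
Grid = List[List[float]]


def upsample_nearest(fmap: Grid, factor: int = 2) -> Grid:
    """Nearest-neighbour upsampling: repeat each value `factor` times per axis."""
    out: Grid = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse_top_down(pyramid: List[Grid]) -> Grid:
    """Fuse a pyramid ordered from shallow (high-res) to deep (low-res).

    Starting at the deepest map, upsample by 2 and add it element-wise to the
    next shallower map. Assumes each level halves the spatial resolution.
    """
    fused = pyramid[-1]
    for shallow in reversed(pyramid[:-1]):
        up = upsample_nearest(fused, 2)
        if len(up) != len(shallow) or len(up[0]) != len(shallow[0]):
            raise ValueError("pyramid levels must differ by a factor of 2")
        fused = [[s + u for s, u in zip(s_row, u_row)]
                 for s_row, u_row in zip(shallow, up)]
    return fused


# Toy example: a 4x4 shallow map (fine detail) and a 2x2 deep map (coarse context).
shallow = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 2.0]]
deep = [[10.0, 20.0],
        [30.0, 40.0]]

fused = fuse_top_down([shallow, deep])
for row in fused:
    print(row)
```

In this toy case every fused cell carries the coarse value of its deep-level parent plus whatever fine detail the shallow map had at that position, which is the intuition behind combining the two kinds of features for targets of different scales.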