A salient object detection model based on a ConvMixer backbone
Salient object detection (SOD) algorithms mostly use backbone networks based on convolutional neural networks (CNNs) to extract features; however, CNNs cannot capture long-range feature dependencies in images. The Vision Transformer (ViT) divides an image into patches and propagates global context information between patches through the transformer to obtain long-range feature dependencies, but the transformer's self-attention layer has quadratic time complexity. We therefore propose CM-PoolNet, a low-complexity patch-based SOD algorithm that improves the backbone of the classical PoolNet model for salient object detection: it replaces VGG and ResNet with the fully convolutional ConvMixer model and introduces a new feature fusion method. Specifically, within a U-shaped structure, the encoder applies patch embedding to the input image and feeds it into a ConvMixer feature extractor composed of repeatedly stacked depthwise separable convolutions and dilated convolutions. A patch-based feature fusion module is designed for the decoder. Three losses, BCE, SSIM, and IoU, guide the model to learn the mapping between the input image and the ground-truth map at the pixel level, block level, and feature level. Experiments on the DUTS and ECSSD datasets show that the proposed model effectively segments salient target regions and accurately predicts fine structures with clear boundaries.
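The combination of BCE, SSIM, and IoU losses described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes equal weighting of the three terms and uses a simplified global SSIM (the paper applies SSIM at the block level), with predictions and ground truth given as probability maps in [0, 1].

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    # Pixel-level binary cross-entropy between prediction and ground truth.
    p = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

def ssim_loss(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified structural-similarity loss computed over the whole map
    # (block-level SSIM, as in the paper, would apply this per patch).
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return 1.0 - ssim

def iou_loss(pred, target, eps=1e-7):
    # Soft intersection-over-union loss at the feature/region level.
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    return 1.0 - (inter + eps) / (union + eps)

def mixed_loss(pred, target):
    # Equal-weight sum of the three terms (the actual weights are an assumption).
    return bce_loss(pred, target) + ssim_loss(pred, target) + iou_loss(pred, target)
```

A perfect prediction drives all three terms toward zero, while each term penalizes a different failure mode: BCE penalizes per-pixel errors, SSIM penalizes local structural differences, and IoU penalizes poor region overlap, which encourages sharp, well-localized boundaries.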
salient object detection; patch embedding; mixed loss function; PoolNet; ConvMixer