Segmentation of abdominal CT and cardiac MR images with multi-scale visual attention
Objective Medical image segmentation is one of the important steps in computer-aided diagnosis and surgery planning. However, due to the complex and diverse structures of human organs, blurred tissue edges, variations in organ size, and other problems, segmentation performance remains limited and needs further improvement, while more accurate segmentation can more effectively help doctors carry out treatment and provide advice. Recently, deep-learning-based methods have become a hot spot in medical image segmentation research. The Transformer, which achieved great success in natural language processing, has also flourished in computer vision as the Vision Transformer (ViT); it is therefore favored by medical image segmentation researchers. However, current ViT-based medical image segmentation networks flatten image features into 1D sequences, ignoring the 2D structure of images and the spatial connections within it. Moreover, the quadratic computational complexity of ViT's multi-head self-attention (MHSA) mechanism increases the required computational overhead.

Method To address the above problems, this paper proposes MSVA-TransUNet, a U-shaped network with a Transformer-style backbone built on multi-scale visual attention, an attention mechanism implemented with multiple strip convolutions. Its structure is analogous to the multi-head attention mechanism, but it uses convolution operations to obtain long-distance dependencies. First, the network extracts features at different scales with convolution kernels of different sizes: a pair of strip convolutions approximates a large-kernel convolution, and strip convolutions of different sizes approximate diverse large-kernel convolutions. Convolution thus captures local information, while the large equivalent kernels also learn long-distance dependencies in the image. Second, strip convolution is lightweight: it remarkably reduces the number of parameters and floating-point operations of the network and lowers the required computational overhead, because the cost of convolution is much smaller than that incurred by the quadratic computational complexity of multi-head attention. Furthermore, it avoids flattening the image into a 1D sequence as input to the Vision Transformer and makes full use of the 2D structure of the image when learning its features. Finally, the first patch embedding in the encoding stage is replaced with a convolution stem, which avoids directly converting a low channel count into a high channel count, a design that runs counter to the typical structure of convolutional neural networks (CNNs); the patch embeddings elsewhere are retained.

Result Experimental results on the abdominal multi-organ segmentation dataset (mainly containing eight organs) and the cardiac segmentation dataset (comprising three cardiac structures) show that the segmentation accuracy of the proposed network is improved over the baseline model. The average Dice score improves by 3.74% on the abdominal multi-organ segmentation dataset and by 1.58% on the cardiac segmentation dataset. The floating-point operations and parameter count are reduced compared with the MHSA mechanism and with large-kernel convolution: the floating-point operations of the multi-scale visual attention mechanism are 1/278 of those of the self-attention mechanism, and the network has 15.31 M parameters, 1/6.88 of those of TransUNet.

Conclusion Experimental results show that the proposed network is comparable to, or even exceeds, current state-of-the-art networks. The multi-scale visual attention mechanism is used instead of the multi-head self-attention mechanism and can likewise capture long-distance relationships and extract long-range image features. Segmentation performance is improved while computational overhead is reduced; that is, the proposed network exhibits certain advantages. However, due to the particular locations and small sizes of some organs, the network does not have sufficient feature-learning ability for them; hence, their segmentation accuracy still needs further improvement, and we will continue to study in depth how to improve segmentation performance for these organs. The code of this paper will be open-sourced soon: https://github.com/BeautySilly/VA-TransUNet.
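The strip-convolution idea in the Method can be illustrated numerically. The following NumPy sketch is not the paper's implementation; the kernel size 5 and the naive `conv2d_same` helper are illustrative assumptions. It shows that a 1×k strip convolution followed by a k×1 strip convolution is exactly equivalent to a k×k convolution whose kernel is the outer product of the two strips, while using 2k weights per channel instead of k²:

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive single-channel 2D correlation with zero padding ('same' size)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

# Illustrative 8x8 feature map and a 1x5 / 5x1 strip pair (random weights).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
row = rng.standard_normal(5)   # 1x5 strip kernel
col = rng.standard_normal(5)   # 5x1 strip kernel

# Applying the strips in sequence ...
strip = conv2d_same(conv2d_same(x, row[None, :]), col[:, None])
# ... equals one convolution with the rank-1 5x5 outer-product kernel.
full = conv2d_same(x, np.outer(col, row))

assert np.allclose(strip, full)
print(2 * 5, 5 * 5)  # prints: 10 25  (weights: strip pair vs full kernel)
```

The equivalence is exact only for rank-1 large kernels; in the multi-scale attention described above, several strip pairs of different lengths approximate several large kernels at a fraction of the weight count.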
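The overhead argument can also be made concrete with a rough back-of-the-envelope FLOP model. The token counts, embedding width d = 768, and strip kernel sizes 7/11/21 below are illustrative assumptions, not the paper's exact configuration; the point is that depthwise strip attention scales linearly in the number of tokens while MHSA scales quadratically:

```python
def mhsa_flops(n, d):
    """Approximate multiply-accumulates for one MHSA layer on n tokens of width d:
    Q/K/V/output projections (4*n*d^2) plus QK^T and attn*V (2*n^2*d)."""
    return 4 * n * d * d + 2 * n * n * d

def strip_attn_flops(n, d, kernels=(7, 11, 21)):
    """Depthwise strip-convolution attention: each 1xk + kx1 pair costs
    about 2*k multiply-accumulates per token per channel."""
    return sum(2 * k * n * d for k in kernels)

# Ratio MHSA / strip attention at 14x14, 28x28, and 56x56 token grids.
for n in (196, 784, 3136):
    print(n, round(mhsa_flops(n, 768) / strip_attn_flops(n, 768), 1))
```

The ratio grows with the token count, so the savings are largest at high feature-map resolutions; the 1/278 figure reported in the Result depends on the paper's specific architecture and input size.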
Keywords: medical image segmentation; visual attention; Transformer; attention mechanism; abdominal multi-organ segmentation; cardiac segmentation