Segmentation of abdominal CT and cardiac MR images with multi-scale visual attention
Objective Medical image segmentation is one of the important steps in computer-aided diagnosis and surgery planning. However, due to the complex and diverse structures of human organs, blurred tissue edges, variations in organ size, and other problems, segmentation performance remains limited and needs further improvement, while more accurate segmentation can more effectively help doctors carry out treatment and provide advice. Recently, deep-learning-based methods have become a hot spot in medical image segmentation research. The Transformer, which achieved great success in natural language processing, has also flourished in computer vision as the Vision Transformer (ViT); it is therefore favored by medical image segmentation researchers. However, current ViT-based medical image segmentation networks flatten image features into 1D sequences, ignoring the 2D structure of images and the spatial connections within it. Moreover, the quadratic computational complexity of ViT's multi-head self-attention (MHSA) mechanism increases the required computational overhead.

Method To address the above problems, this paper proposes MSVA-TransUNet, a U-shaped network with a Transformer-style backbone built on multi-scale visual attention, an attention mechanism implemented with multiple strip convolutions. Its structure is analogous to the multi-head attention mechanism, but it uses convolution operations to obtain long-distance dependencies. First, the network extracts features at different scales with convolution kernels of different sizes: a pair of strip convolutions approximates a large-kernel convolution, and strip convolutions of different sizes approximate diverse large-kernel convolutions. Convolution thus captures local information, while the large equivalent kernels also learn long-distance dependencies in the image. Second, strip convolution is lightweight: it remarkably reduces the number of parameters and floating-point operations of the network and lowers the required computational overhead, because the cost of convolution is much smaller than that incurred by the quadratic computational complexity of multi-head attention. Furthermore, it avoids flattening the image into a 1D sequence as input to the Vision Transformer and makes full use of the 2D structure of the image when learning its features. Finally, the first patch embedding in the encoding stage is replaced with a convolution stem, which avoids directly converting a low channel count into a high channel count, a design that runs counter to the typical structure of convolutional neural networks (CNNs); the patch embeddings elsewhere are retained.

Result Experimental results on the abdominal multi-organ segmentation dataset (mainly containing eight organs) and the cardiac segmentation dataset (comprising three cardiac structures) show that the segmentation accuracy of the proposed network is improved over the baseline model. The average Dice score improves by 3.74% on the abdominal multi-organ segmentation dataset and by 1.58% on the cardiac segmentation dataset. The floating-point operations and parameter count are reduced compared with the MHSA mechanism and with large-kernel convolution: the floating-point operations of the multi-scale visual attention mechanism are 1/278 of those of the self-attention mechanism, and the network has 15.31 M parameters, 1/6.88 of those of TransUNet.

Conclusion Experimental results show that the proposed network is comparable to, or even exceeds, current state-of-the-art networks. The multi-scale visual attention mechanism is used instead of the multi-head self-attention mechanism and can likewise capture long-distance relationships and extract long-range image features. Segmentation performance is improved while computational overhead is reduced; that is, the proposed network exhibits certain advantages. However, due to the particular locations and small sizes of some organs, the network does not have sufficient feature-learning ability for them; hence, their segmentation accuracy still needs further improvement, and we will continue to study in depth how to improve segmentation performance for these organs. The code of this paper will be open-sourced soon: https://github.com/BeautySilly/VA-TransUNet.
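The strip-convolution idea in the Method can be illustrated numerically. The following NumPy sketch is not the paper's implementation; the kernel size 5 and the naive `conv2d_same` helper are illustrative assumptions. It shows that a 1×k strip convolution followed by a k×1 strip convolution is exactly equivalent to a k×k convolution whose kernel is the outer product of the two strips, while using 2k weights per channel instead of k²:

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive single-channel 2D correlation with zero padding ('same' size)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

# Illustrative 8x8 feature map and a 1x5 / 5x1 strip pair (random weights).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
row = rng.standard_normal(5)   # 1x5 strip kernel
col = rng.standard_normal(5)   # 5x1 strip kernel

# Applying the strips in sequence ...
strip = conv2d_same(conv2d_same(x, row[None, :]), col[:, None])
# ... equals one convolution with the rank-1 5x5 outer-product kernel.
full = conv2d_same(x, np.outer(col, row))

assert np.allclose(strip, full)
print(2 * 5, 5 * 5)  # prints: 10 25  (weights: strip pair vs full kernel)
```

The equivalence is exact only for rank-1 large kernels; in the multi-scale attention described above, several strip pairs of different lengths approximate several large kernels at a fraction of the weight count.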
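The overhead argument can also be made concrete with a rough back-of-the-envelope FLOP model. The token counts, embedding width d = 768, and strip kernel sizes 7/11/21 below are illustrative assumptions, not the paper's exact configuration; the point is that depthwise strip attention scales linearly in the number of tokens while MHSA scales quadratically:

```python
def mhsa_flops(n, d):
    """Approximate multiply-accumulates for one MHSA layer on n tokens of width d:
    Q/K/V/output projections (4*n*d^2) plus QK^T and attn*V (2*n^2*d)."""
    return 4 * n * d * d + 2 * n * n * d

def strip_attn_flops(n, d, kernels=(7, 11, 21)):
    """Depthwise strip-convolution attention: each 1xk + kx1 pair costs
    about 2*k multiply-accumulates per token per channel."""
    return sum(2 * k * n * d for k in kernels)

# Ratio MHSA / strip attention at 14x14, 28x28, and 56x56 token grids.
for n in (196, 784, 3136):
    print(n, round(mhsa_flops(n, 768) / strip_attn_flops(n, 768), 1))
```

The ratio grows with the token count, so the savings are largest at high feature-map resolutions; the 1/278 figure reported in the Result depends on the paper's specific architecture and input size.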
Keywords: medical image segmentation; visual attention; Transformer; attention mechanism; abdominal multi-organ segmentation; cardiac segmentation