Dual-encoder global-local cross-attention network for medical image segmentation
Objective With the rapid advancement of medical imaging technology, medical image segmentation has become a popular topic in the field of medical image processing and has been the subject of extensive study. Medical image segmentation has a wide range of applications and research value in medical research and practice. The segmentation results of medical images can be used by physicians to determine the location, size, and shape of lesions, providing an accurate basis for diagnosis and treatment. In recent years, UNet, based on convolutional neural networks (CNNs), has become a baseline architecture for medical image segmentation. However, this architecture cannot effectively extract global context information because of the limited receptive field of CNNs. The Transformer was originally designed to solve this problem but is limited in capturing local information. Therefore, hybrid CNN-Transformer networks built on the UNet architecture are gradually becoming popular. However, existing methods have some shortcomings. For example, they typically cannot effectively combine the global and local information extracted by the CNN and the Transformer. In addition, although the original skip connection can recover some of the location information lost by the target features in the downsampling stage, it may fail to capture all the fine-grained details, ultimately affecting the accuracy of the predicted segmentation. This paper proposes a dual-encoder global-local cross-attention network with CNN and Transformer (DGLCANet) to address these issues.

Method First, a dual-encoder network that combines the advantages of CNNs and Transformer networks is adopted to extract rich local and global information from the images. In the encoder stage, the Transformer and CNN branches extract global and local information, respectively. In addition, the CSWin Transformer, which has a low computational cost, is used in the Transformer branch to reduce the computational cost of the model. Next, a global-local cross-attention Transformer module is proposed to fully utilize the global and local information extracted by the two encoder branches. The core of this module is the cross-attention mechanism, which captures the correlation between global and local features by exchanging information between the two branches. Finally, a feature adaptation block is designed in the skip connections of DGLCANet to compensate for the shortcomings of the original skip connections. The feature adaptation module adaptively matches the features between the encoder and decoder, reducing the feature gap between them and improving the adaptive capability of the model. Meanwhile, the module can also recover detailed positional information lost during the encoder downsampling process. Tests are performed on four public datasets: ISIC-2017, ISIC-2018, BUSI, and the 2018 Data Science Bowl. Among them, ISIC-2017 and ISIC-2018 are dermoscopic image datasets for melanoma detection, containing 2,000 and 2,596 images, respectively. The BUSI dataset, which contains 780 images, is a breast ultrasound dataset for detecting breast cancer. The 2018 Data Science Bowl dataset, which contains 670 images, is used for examining cell nuclei in different microscope images. The resolution of all images is set to 256 × 256 pixels, and the images are randomly divided into training and test sets at a ratio of 8:2. DGLCANet is implemented in the PyTorch framework and trained on an NVIDIA GeForce RTX 3090Ti GPU with 24 GB of memory. In the experiments, the binary cross-entropy and Dice loss functions are mixed in proportion to construct a new loss function. Furthermore, the Adam optimizer is employed with an initial learning rate of 0.001, a momentum parameter of 0.9, and a weight decay of 0.0001.

Result In this study, four evaluation metrics, namely, intersection over union (IoU), Dice coefficient (Dice), accuracy, and recall, are used to evaluate the effectiveness of the proposed method. In theory, larger values of these metrics indicate better segmentation. Experimental results show that on the four datasets, the Dice coefficient reaches 91.88%, 90.82%, 80.71%, and 92.25%, which are 5.87%, 5.37%, 4.65%, and 2.92% higher than those of the classic UNet, respectively. Compared with recent state-of-the-art methods, the proposed method also demonstrates its superiority. Furthermore, the visualized results show that the proposed method effectively predicts the boundary area of the image and distinguishes the lesion area from the normal area. Meanwhile, compared with other methods, the proposed method can still achieve better segmentation results under multiple interference factors, such as brightness variations, producing predictions remarkably close to the ground truth. The results of a series of ablation experiments also show that each of the proposed components delivers satisfactory performance.

Conclusion In this study, a dual-encoder medical image segmentation method that integrates a global-local attention mechanism is proposed. The experimental results demonstrate that the proposed method not only improves segmentation accuracy but also obtains satisfactory segmentation results when processing complex medical images. Future work will focus on further optimization and in-depth research to promote the practical application of this method and contribute to important breakthroughs and advancements in the field of medical image segmentation.
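The abstract does not specify the exact form of the global-local cross-attention module. The following is a minimal, framework-agnostic NumPy sketch of one plausible form, in which queries come from the local (CNN) branch and keys/values from the global (Transformer) branch; all names and shapes here (`cross_attention`, the omitted linear projections) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(local_feats, global_feats):
    """Sketch of cross-attention between two encoder branches.
    Queries come from the local (CNN) branch; keys and values from the
    global (Transformer) branch, so each local token is re-expressed as
    a mixture of global context. Learned Q/K/V projections are omitted
    for brevity; shapes are (batch, tokens, channels)."""
    d_k = local_feats.shape[-1]
    Q, K, V = local_feats, global_feats, global_feats
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (B, N_local, N_global)
    return softmax(scores) @ V                        # (B, N_local, C)

# toy check: 1 image, 4 local tokens, 16 global tokens, 8 channels
rng = np.random.default_rng(0)
out = cross_attention(rng.standard_normal((1, 4, 8)),
                      rng.standard_normal((1, 16, 8)))
print(out.shape)  # (1, 4, 8)
```

In the paper's setting, the same interaction can also be applied in the opposite direction (global queries attending to local keys/values) so that both branches are enriched before decoding.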
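The loss is stated to be a proportional mix of binary cross-entropy and Dice loss; the exact mixing ratio is not given in the abstract. A common formulation looks like the sketch below, where the weight `lam` is an assumed placeholder rather than the paper's value.

```python
import numpy as np

def bce_dice_loss(pred, target, lam=0.5, eps=1e-6):
    """Weighted sum of binary cross-entropy and Dice loss.
    pred: predicted probabilities in (0, 1); target: binary mask.
    lam is an assumed mixing weight; the paper's ratio is not given."""
    pred = np.clip(pred, eps, 1 - eps)
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    return lam * bce + (1 - lam) * dice

# a confident, correct prediction drives both terms toward zero
t = np.array([[1.0, 0.0], [0.0, 1.0]])
good = bce_dice_loss(np.where(t > 0, 0.999, 0.001), t)
bad = bce_dice_loss(np.where(t > 0, 0.001, 0.999), t)
print(good, bad)  # good is near zero, bad is much larger
```

Mixing the two terms is a standard remedy for class imbalance in segmentation: BCE supervises every pixel, while the Dice term directly targets region overlap.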
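The four evaluation metrics used in the Result section are standard and can be computed from the confusion-matrix counts of a binary mask; the sketch below is a straightforward reference implementation, not the authors' evaluation script.

```python
import numpy as np

def segmentation_metrics(pred, target, eps=1e-6):
    """IoU, Dice, accuracy, and recall for binary masks (0/1 arrays)."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()    # correctly predicted lesion pixels
    fp = np.logical_and(pred, ~target).sum()   # false alarms
    fn = np.logical_and(~pred, target).sum()   # missed lesion pixels
    tn = np.logical_and(~pred, ~target).sum()  # correct background
    return {
        "iou": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "accuracy": (tp + tn) / (tp + fp + fn + tn + eps),
        "recall": tp / (tp + fn + eps),
    }

m = segmentation_metrics(np.array([[1, 1], [0, 0]]),
                         np.array([[1, 0], [0, 0]]))
# tp=1, fp=1, fn=0, tn=2 -> iou=0.5, dice~0.667, accuracy=0.75, recall=1.0
```

Note that Dice is always at least as large as IoU on the same prediction, which is why Dice scores in the 90% range (as reported above) are common even when IoU is noticeably lower.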
Keywords: medical image segmentation; convolutional neural network (CNN); dual-encoder; cross-attention mechanism; Transformer