A segmentation method that fuses a CNN and a vision Transformer (ViT) is proposed to address three difficulties in breast ultrasound image segmentation: the large variation in the shape and size of tumor regions, the limited ability of convolutional neural networks (CNNs) to model long-range dependencies and spatial correlations, and the huge amount of training data required by ViTs. Global features and local detail features are extracted by a modified Swin Transformer module and a CNN encoder module based on deformable convolution, respectively. A cross-attention mechanism fuses the feature representations of the two scales, and training uses a binary cross-entropy loss combined with a boundary loss function, which effectively improves segmentation accuracy. Experimental results on two public datasets show that the proposed method significantly outperforms existing classical algorithms, improving the Dice coefficient by 3.8412%. This outcome verifies the effectiveness and feasibility of the proposed method.
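The cross-attention fusion of the two feature scales can be sketched as below. This is a minimal, single-head illustration in NumPy, not the paper's implementation: the token counts, feature dimension, and random projection matrices (`Wq`, `Wk`, `Wv`) are assumptions standing in for learned parameters, and normalization layers are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(cnn_feats, vit_feats, d_k=32, seed=0):
    """Fuse local CNN features (as queries) with global ViT features
    (as keys/values) via scaled dot-product cross-attention.

    cnn_feats: (N, d) local-detail tokens from the CNN encoder branch
    vit_feats: (M, d) global-context tokens from the Swin Transformer branch
    Returns fused features of shape (N, d_k).
    """
    rng = np.random.default_rng(seed)
    d = cnn_feats.shape[1]
    # Hypothetical learned projections (random here, for illustration only)
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Q, K, V = cnn_feats @ Wq, vit_feats @ Wk, vit_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (N, M): each CNN token attends
    return attn @ V                         # over all global ViT tokens

# Toy example: 16 local tokens query 49 global tokens, both 64-dimensional
fused = cross_attention(np.ones((16, 64)), np.ones((49, 64)))
print(fused.shape)
```

In this formulation the CNN branch supplies the queries, so each local-detail position selectively aggregates global context from the Transformer branch; swapping the roles of the two branches is an equally plausible design choice.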