An improved YOLOv5m model was proposed for wolfberry pest detection in complex environments. The next generation vision transformer (Next-ViT) was used as the backbone network to improve the feature extraction ability of the model, so that the model paid more attention to key target features. An adaptive fusion context enhancement module was added to the neck to enhance the model's ability to understand and process contextual information, which improved the detection precision for small objects (aphids). The C3 module in the neck network was replaced with the C3_Faster module to reduce the model footprint and further improve precision. Experimental results showed that the proposed model achieved a precision of 97.0% and a recall of 92.1%. The mean average precision (mAP50) was 94.7%, 1.9 percentage points higher than that of YOLOv5m, and the average precision of aphid detection improved by 9.4 percentage points. Compared with the mainstream models YOLOv7, YOLOX, DETR, EfficientDet-D1, and Cascade R-CNN, the mAP50 of the proposed model was 1.6, 1.6, 2.8, 3.5, and 1.0 percentage points higher, respectively. The proposed model improves detection performance while maintaining a reasonable model footprint.
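
The reported gains are in percentage points, so the mAP50 of each comparison model can be recovered by simple subtraction from 94.7%. The sketch below does that arithmetic and also combines the reported precision and recall into an F1 score. The function names are hypothetical, and the derived values are implied by the figures in the abstract rather than reported in it, so they are subject to rounding in the original numbers.

```python
# Figures taken from the abstract (all in percent or percentage points).
PROPOSED_MAP50 = 94.7
GAINS_PP = {
    "YOLOv5m": 1.9,
    "YOLOv7": 1.6,
    "YOLOX": 1.6,
    "DETR": 2.8,
    "EfficientDet-D1": 3.5,
    "Cascade R-CNN": 1.0,
}


def implied_baselines(proposed: float, gains_pp: dict[str, float]) -> dict[str, float]:
    """Subtract each percentage-point gain from the proposed model's mAP50.

    Hypothetical helper: the results are derived, not reported in the abstract.
    """
    return {name: round(proposed - gain, 1) for name, gain in gains_pp.items()}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for name, value in implied_baselines(PROPOSED_MAP50, GAINS_PP).items():
        print(f"{name}: implied mAP50 = {value:.1f}%")
    # Derived from the reported precision (97.0%) and recall (92.1%).
    print(f"Implied F1 = {100 * f1_score(0.970, 0.921):.1f}%")
```

A percentage-point difference is an absolute difference between two percentages, which is why plain subtraction applies here; a relative improvement would instead require dividing by the baseline value.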