This paper proposes a lightweight end-to-end vehicle detection model that can be deployed on UAV terminals. In the backbone network, a focus mechanism first downsamples the original input image losslessly; depthwise separable convolutions with a lightweight attention module then form the feature extraction layers; finally, cross-scale multi-layer fusion is performed in the feature pyramid to enrich the information carried by the output feature maps at the three levels. The open-source UAV image dataset VisDrone is combined with UAV road images collected over multiple periods, and the combined data, after augmentation, is used as the training set. Experimental results show that the proposed model delivers stable detection performance across all vehicle categories and is significantly better than several baseline models in overall detection accuracy. In addition, the trained model is small and can be deployed for real-time detection on the embedded hardware terminal of the test environment.
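To illustrate why the focus step can be called lossless, the following minimal sketch rearranges a single-channel image into four half-resolution sub-images and then restores the original from them. It assumes the focus mechanism is the pixel-interleaved slicing used in YOLOv5-style detectors, which the abstract does not state explicitly; the function names `focus_slice` and `unfocus` are hypothetical and are not taken from the paper.

```python
def focus_slice(image):
    """Split an H x W grid (H and W even) into four H/2 x W/2 sub-grids.

    Assumed sub-grid order: (even row, even col), (odd row, even col),
    (even row, odd col), (odd row, odd col).
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
    # Each sub-grid keeps every second pixel, starting at (dy, dx).
    return [[row[dx::2] for row in image[dy::2]] for dy, dx in offsets]


def unfocus(slices):
    """Invert focus_slice, showing that no pixel value was discarded."""
    offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
    half_h, half_w = len(slices[0]), len(slices[0][0])
    image = [[None] * (2 * half_w) for _ in range(2 * half_h)]
    for (dy, dx), sub in zip(offsets, slices):
        for i, row in enumerate(sub):
            for j, value in enumerate(row):
                image[2 * i + dy][2 * j + dx] = value
    return image


# A 4 x 4 toy image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
slices = focus_slice(image)   # four 2 x 2 sub-images (stacked as channels)
restored = unfocus(slices)    # identical to the original image
```

Under this assumption, spatial resolution is halved while the channel count is quadrupled, so every input pixel survives the downsampling, in contrast to strided convolution or pooling.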