With the rapid advancement of deep learning, deepfake technology, a form of image manipulation based on generative models, has gained significant momentum. The proliferation of deepfake videos and images has detrimental sociopolitical impacts, making deepfake detection techniques increasingly important. Existing detection methods based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) commonly suffer from large parameter counts, slow training, susceptibility to overfitting, and limited robustness against video compression and noise. To address these challenges, a multi-scale deepfake detection method that fuses spatial features is proposed herein. First, an Automatic White Balance (AWB) algorithm adjusts the contrast of input images, improving the robustness of the model. Next, a Multi-scale ViT (MViT) and a CNN are used to extract multi-scale global features and local features of the input images, respectively. These global and local features are then fused by an improved sparse cross-attention mechanism to enhance the recognition performance of the model. Finally, the fused features are classified by a Multi-Layer Perceptron (MLP). Experimental results show that the proposed model achieves frame-level Area Under the Curve (AUC) scores of 0.986, 0.984, and 0.988 on the Deepfakes, FaceSwap, and Celeb-DF (v2) datasets, respectively, and demonstrates strong robustness in cross-compression experiments. In addition, ablation experiments comparing the model before and after each improvement validate the contribution of each module to the detection results.
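The abstract does not specify which AWB variant is used; as an illustration only, the widely used gray-world assumption gives a minimal sketch of the preprocessing step. The function below scales each RGB channel so that its mean matches the global mean, removing a color cast before the image is passed to the feature extractors. The function name, input layout (H×W×3 float array in [0, 1]), and the synthetic test image are assumptions, not details from the paper.

```python
import numpy as np

def gray_world_awb(img: np.ndarray) -> np.ndarray:
    """Gray-world automatic white balance (illustrative sketch).

    Assumes `img` is an HxWx3 float array with values in [0, 1].
    Each channel is rescaled so its mean equals the global mean,
    which neutralizes a uniform color cast.
    """
    channel_means = img.reshape(-1, 3).mean(axis=0)   # mean of R, G, B
    gray_mean = channel_means.mean()                  # target gray level
    gains = gray_mean / np.maximum(channel_means, 1e-8)  # per-channel gain
    return np.clip(img * gains, 0.0, 1.0)

# Usage: a synthetic image with an artificial reddish cast
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3)) * np.array([1.0, 0.7, 0.7])
balanced = gray_world_awb(img)
```

After balancing, the three channel means coincide, so downstream features are less sensitive to illumination differences between pristine and manipulated frames.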