Traditional convolutional neural networks (CNNs) have a limited receptive field and a weak ability to learn feature interactions, so CNN-based deepfake face detection techniques tend to extract a narrow range of features. To address these issues, this paper proposes a deepfake face detection method based on an enhanced Swin Transformer. The method introduces local and global multi-head self-attention mechanisms, leveraging the Swin Transformer's global receptive field and long-range dependency modeling to capture image context information and video temporal relationships effectively. Experimental results on the DFDC dataset show that the proposed approach outperforms the baseline methods in deepfake face detection.
enhanced Swin Transformer; deepfake face detection; audiovisual decomposition; consistency analysis; feature fusion