Combining ViT with contrastive learning for facial expression recognition
Objective Facial expression is an important factor in human communication that helps people understand the intentions of others. The task of facial expression recognition is to output the expression category corresponding to a given face image. Facial expression recognition has broad applications in areas such as security monitoring, education, and human-computer interaction. Currently, facial expression recognition under uncontrolled conditions suffers from low accuracy due to factors such as pose variation, occlusion, and lighting differences. Addressing these issues would significantly advance facial expression recognition in real-world scenarios and is highly relevant to the field of artificial intelligence. Self-supervised learning applies specific data augmentations to the input data and generates pseudo labels for training or pretraining models. It leverages large amounts of unlabeled data and extracts the prior distribution of the images themselves to improve performance on downstream tasks. Contrastive learning, a form of self-supervised learning, can further learn the intrinsically consistent features shared by similar images under changes in pose and lighting by increasing the difficulty of the task. This paper proposes an unsupervised contrastive learning-based facial expression classification method to address the low accuracy caused by occlusion, pose variation, and lighting changes in facial expression recognition.

Method To address occlusion in facial expression recognition datasets under real-world conditions, a self-supervised contrastive learning method with negative samples is employed. The method consists of two stages: contrastive learning pretraining and model fine-tuning. First, in the pretraining stage, an unsupervised contrastive loss is introduced to reduce the distance between images of the same class and increase the distance between images of different classes, improving the model's ability to discriminate facial expression images despite intraclass diversity and interclass similarity. Positive sample pairs are formed between the original images and occlusion-augmented images for contrastive learning, enhancing the robustness of the model to occlusion and illumination changes. Additionally, a dictionary mechanism is applied to MoCo v3 to overcome insufficient memory during training. The recognition model is pretrained on the ImageNet dataset and then fine-tuned on the facial expression recognition dataset to improve classification accuracy. This approach effectively enhances facial expression recognition performance in the presence of occlusion. Moreover, the vision Transformer (ViT) is employed as the backbone network to strengthen the model's feature extraction capability.

Result Experiments were conducted on four datasets to compare the proposed method with 13 recent methods. On the RAF-DB dataset, recognition accuracy increased by 0.48% over the Face2Exp model; on the FERPlus dataset, accuracy increased by 0.35% over the knowledgeable teacher network (KTN) model; on the AffectNet-8 dataset, accuracy increased by 0.40% over the self-cure network (SCN) model; on the AffectNet-7 dataset, accuracy was slightly lower, by 0.26%, than the deep attentive center loss (DACL) model. These results demonstrate the effectiveness of the proposed method.

Conclusion A self-supervised contrastive learning-based method for facial expression recognition is proposed to address the challenges of occlusion, pose variation, and illumination changes under uncontrolled conditions. The method consists of two stages: pretraining and fine-tuning. The contributions of this paper lie in the integration of ViT into the contrastive learning framework, which enables the use of large amounts of unlabeled, noise-occluded data to learn the distribution characteristics of facial expression data. The proposed method achieves promising accuracy on facial expression recognition datasets, including RAF-DB, FERPlus, AffectNet-7, and AffectNet-8. By leveraging the contrastive learning framework and advanced feature extraction networks, this work advances the application of deep learning methods in everyday visual tasks.
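The unsupervised contrastive objective described in the Method section can be sketched as an InfoNCE-style loss over a batch of anchor embeddings and their occlusion-augmented views, where each augmented view at the same index is the positive and all other views serve as negatives. The following NumPy function is an illustrative sketch of this loss formulation only, not the authors' implementation; the function name, temperature value, and batch layout are assumptions.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch).

    anchors:   (N, D) embeddings of the original images
    positives: (N, D) embeddings of their occlusion-augmented views;
               row i is the positive for anchor i, all other rows
               act as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Log-softmax over each row; the diagonal holds the positive pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Minimizing this pulls positive pairs together and pushes the
    # remaining (negative) pairs apart, as described in the Method section
    return -np.mean(np.diag(log_prob))
```

When the augmented embeddings closely match their anchors, the loss is low; misaligned pairs yield a higher loss, which is the pressure that makes the pretrained features robust to occlusion and illumination changes.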
Keywords: facial expression recognition; contrastive learning; self-supervised learning; Transformer; positive and negative samples