This paper adopts the Swin Transformer and Vision Transformer networks, combined with a transfer learning approach, for an in-depth study of the facial expression recognition task. To evaluate the performance of the different networks, four widely used facial expression datasets are selected: RAF-DB, FER2013, CK+, and JAFFE. Comparative experiments across model variants of the two networks show that the W-MSA and SW-MSA self-attention structures in the Swin Transformer attend more accurately to expression-relevant features, achieving recognition accuracies of 99.48%, 95.60%, and 86.73% on the CK+, JAFFE, and RAF-DB datasets, respectively. These results confirm that Transformer networks based on the self-attention mechanism hold great potential for the facial expression recognition task.
Key words
deep learning/facial expression recognition/transfer learning/Transformer/self-attention