Lipreading Method Based on Multi-Scale Spatiotemporal Convolution
Most existing lipreading models combine a single-layer 3D convolution with a 2D convolutional neural network to extract joint spatio-temporal features from lip video sequences. However, a single 3D convolutional layer captures only limited temporal information, and 2D convolutional neural networks have limited capacity to extract fine-grained lipreading features. To address these limitations, a Multi-Scale Lipreading Network (MS-LipNet) is proposed. In MS-LipNet, 3D spatio-temporal convolutions replace the traditional 2D convolutions in the Res2Net network to better extract joint spatio-temporal features, and a spatio-temporal coordinate attention module is proposed to make the network focus on task-relevant regions. The effectiveness of the proposed method is verified through experiments on the LRW and LRW-1000 datasets.
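The two ideas named in the abstract can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the abstract does not specify layer sizes or the internals of the attention module, so the class names `Res2Block3D` and `STCoordAttention`, the `scale` and `reduction` parameters, the 3x3x3 kernels, and the choice to pool along time, height, and width separately (an extension of 2D coordinate attention) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class STCoordAttention(nn.Module):
    """Hypothetical spatio-temporal coordinate attention (assumed design, not the paper's)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(channels // reduction, 4)
        # Shared 1x1 bottleneck applied to the concatenated axis descriptors.
        self.squeeze = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
        )
        # One gate per axis: time, height, width.
        self.gate_t = nn.Conv1d(hidden, channels, kernel_size=1)
        self.gate_h = nn.Conv1d(hidden, channels, kernel_size=1)
        self.gate_w = nn.Conv1d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # Average over the other two axes so each descriptor keeps one coordinate.
        pooled = torch.cat(
            [x.mean(dim=(3, 4)), x.mean(dim=(2, 4)), x.mean(dim=(2, 3))], dim=2
        )  # (b, c, t + h + w)
        y_t, y_h, y_w = torch.split(self.squeeze(pooled), [t, h, w], dim=2)
        a_t = torch.sigmoid(self.gate_t(y_t)).reshape(b, c, t, 1, 1)
        a_h = torch.sigmoid(self.gate_h(y_h)).reshape(b, c, 1, h, 1)
        a_w = torch.sigmoid(self.gate_w(y_w)).reshape(b, c, 1, 1, w)
        # Each gate lies in (0, 1) and rescales x along its own axis.
        return x * a_t * a_h * a_w


class Res2Block3D(nn.Module):
    """Res2Net-style bottleneck with 3D convolutions (illustrative sketch)."""

    def __init__(self, channels: int, scale: int = 4, reduction: int = 8):
        super().__init__()
        if channels % scale != 0:
            raise ValueError("channels must be divisible by scale")
        self.scale = scale
        width = channels // scale
        self.conv_in = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm3d(channels),
            nn.ReLU(),
        )
        # One 3x3x3 convolution per channel group; the first group passes through.
        self.branches = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Conv3d(width, width, kernel_size=3, padding=1, bias=False),
                    nn.BatchNorm3d(width),
                    nn.ReLU(),
                )
                for _ in range(scale - 1)
            ]
        )
        self.conv_out = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm3d(channels),
        )
        self.attention = STCoordAttention(channels, reduction)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        groups = torch.chunk(self.conv_in(x), self.scale, dim=1)
        outputs = [groups[0]]
        prev = None
        for branch, group in zip(self.branches, groups[1:]):
            # Each group also receives the previous branch output, so later
            # groups see a progressively larger spatio-temporal receptive field.
            prev = branch(group if prev is None else group + prev)
            outputs.append(prev)
        y = self.conv_out(torch.cat(outputs, dim=1))
        return self.relu(self.attention(y) + x)
```

With padding of 1 and stride 1, the block preserves the input shape, so `Res2Block3D(16)` maps a tensor of shape `(2, 16, 5, 8, 8)` to a tensor of the same shape. A real lipreading backbone would additionally need downsampling stages and a temporal back end, which the abstract does not describe.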