Lipreading Method Based on Multi-Scale Spatiotemporal Convolution

Most existing lipreading models combine a single-layer 3D convolution with a 2D convolutional neural network to extract joint spatiotemporal features from lip video sequences. However, a single 3D convolutional layer captures temporal information poorly, and 2D convolutional neural networks have limited ability to extract fine-grained lip features. To address both limitations, this paper proposes a Multi-Scale Lipreading Network (MS-LipNet). Within the Res2Net network, 3D spatiotemporal convolutions replace the conventional 2D convolutions to better extract joint spatiotemporal features, and a spatiotemporal coordinate attention module is proposed so that the network focuses on task-relevant regional features. Experiments on the LRW and LRW-1000 datasets verify the effectiveness of the proposed method.
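One way to see why replacing 2D with 3D convolutions inside a Res2Net-style block could help with temporal modeling is to track the receptive field of each branch of the block's hierarchical split. The sketch below is illustrative only and is not the authors' implementation: the function name `branch_receptive_fields`, the choice of 4 scales, and the stride-1, undilated kernels are all assumptions. It relies on the standard Res2Net recurrence, in which the first branch is an identity path and each later branch convolves its own input split plus the previous branch's output.

```python
from typing import List, Tuple

Size3D = Tuple[int, int, int]  # (time, height, width)


def branch_receptive_fields(scales: int, kernel_thw: Size3D) -> List[Size3D]:
    """Receptive field of each branch output in a Res2Net-style block.

    Assumes stride 1 and no dilation (hypothetical settings). Branch 1 is an
    identity path; every later branch applies one convolution to the sum of
    its own input split and the previous branch's output, so its receptive
    field grows by (kernel - 1) per axis relative to the previous branch.
    """
    fields: List[Size3D] = [(1, 1, 1)]  # identity branch sees one position
    for _ in range(scales - 1):
        prev = fields[-1]
        grown = tuple(p + k - 1 for p, k in zip(prev, kernel_thw))
        fields.append((grown[0], grown[1], grown[2]))
    return fields


if __name__ == "__main__":
    # 2D convolution applied per frame is equivalent to a (1, 3, 3) kernel.
    print("2D branches:", branch_receptive_fields(4, (1, 3, 3)))
    # 3D spatiotemporal convolution with a (3, 3, 3) kernel.
    print("3D branches:", branch_receptive_fields(4, (3, 3, 3)))
```

Under these assumptions, the per-frame 2D variant never looks beyond a single frame inside the block, while the 3D variant yields branches that span 1, 3, 5, and 7 frames. That is one plausible reading of "multi-scale" along the time axis.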

Keywords: Lipreading; Multi-scale spatiotemporal convolutional network; Res2Net; Spatiotemporal coordinate attention; Data augmentation

Ye Hong (叶鸿), Wei Jinsong (危劲松), Jia Zhaohong (贾兆红), Zheng Hui (郑辉), Liang Dong (梁栋), Tang Jun (唐俊)

School of Internet, Anhui University, Hefei 230039

School of Electronic and Information Engineering, Anhui University, Hefei 230601

2024

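Journal of Electronics & Information Technology (电子与信息学报)
Institute of Electronics, Chinese Academy of Sciences; Department of Information Sciences, National Natural Science Foundation of China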

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 1.302
ISSN: 1009-5896
Year, volume (issue): 2024, 46(11)