Journal of Electronics & Information Technology, 2024, Vol. 46, Issue 11: 4170-4177. DOI: 10.11999/JEIT240161

Lipreading Method Based on Multi-Scale Spatiotemporal Convolution

Ye Hong¹, Wei Jinsong¹, Jia Zhaohong¹, Zheng Hui¹, Liang Dong¹, Tang Jun²

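Author information

  • 1. School of Internet, Anhui University, Hefei 230039
  • 2. School of Electronics and Information Engineering, Anhui University, Hefei 230601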

Abstract

Most existing lipreading models combine a single-layer 3D convolution with a 2D convolutional neural network to extract joint spatiotemporal features from lip video sequences. However, a single 3D convolutional layer captures temporal information poorly, and 2D convolutional neural networks have limited capacity to extract fine-grained lipreading features. To address these limitations, this paper proposes a Multi-Scale Lipreading Network (MS-LipNet). Within the Res2Net network, 3D spatiotemporal convolutions replace the conventional 2D convolutions to better extract joint spatiotemporal features, and a spatiotemporal coordinate attention module is proposed so that the network focuses on task-relevant regional features. Experiments on the LRW and LRW-1000 datasets verify the effectiveness of the proposed method.
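
As a rough illustration of what replacing 2D convolutions with 3D convolutions inside a Res2Net-style block could look like, the following PyTorch sketch implements only the hierarchical multi-scale split that Res2Net is known for, with each branch using a 3x3x3 convolution. The abstract does not give the actual MS-LipNet layer configuration, so the class name `Res2Block3D`, the `channels` and `scales` parameters, the kernel size, and the use of batch normalization and ReLU are all assumptions made for illustration. The 1x1 convolutions of a full Res2Net bottleneck and the spatiotemporal coordinate attention module are not sketched here.

```python
import torch
import torch.nn as nn


class Res2Block3D(nn.Module):
    """Hypothetical Res2Net-style multi-scale block built from 3D convolutions."""

    def __init__(self, channels: int, scales: int = 4):
        super().__init__()
        if channels % scales != 0:
            raise ValueError("channels must be divisible by scales")
        self.scales = scales
        width = channels // scales
        # One 3x3x3 conv per channel group, except the first group,
        # which is passed through unchanged (as in Res2Net).
        self.convs = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Conv3d(width, width, kernel_size=3, padding=1, bias=False),
                    nn.BatchNorm3d(width),
                    nn.ReLU(),
                )
                for _ in range(scales - 1)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time, height, width).
        groups = torch.chunk(x, self.scales, dim=1)
        outputs = [groups[0]]
        prev = None
        for group, conv in zip(groups[1:], self.convs):
            # Each branch also sees the previous branch's output, so later
            # groups have progressively larger spatiotemporal receptive fields.
            inp = group if prev is None else group + prev
            prev = conv(inp)
            outputs.append(prev)
        # Concatenate all scales and add the residual connection.
        return torch.cat(outputs, dim=1) + x


if __name__ == "__main__":
    block = Res2Block3D(channels=16, scales=4)
    clip = torch.randn(2, 16, 5, 8, 8)  # toy lip-region clip features
    print(block(clip).shape)
```

Because every convolution uses padding 1 with a 3x3x3 kernel, the block preserves the temporal and spatial dimensions of its input, which is what allows the residual addition at the end.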

Key words

Lipreading/Multi-scale spatiotemporal convolutional network/Res2Net/Spatiotemporal coordinate attention/Data augmentation

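Publication year: 2024
Journal: Journal of Electronics & Information Technology
Institute of Electronics, Chinese Academy of Sciences; Department of Information Sciences, National Natural Science Foundation of China
Indexing: CSTPCD, CSCD, Peking University Core Journals (北大核心)
Impact factor: 1.302
ISSN: 1009-5896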