Sound Event Localization and Detection Model Based on Multi-View Attention
In recent years, the performance of sound event localization and detection (SELD) methods based on deep learning has improved rapidly. However, in practical applications, the existence of multiple sound sources makes it difficult for existing SELD models to accurately extract the spatiotemporal information of deep features, which seriously degrades performance. To exploit the key information contained in learned multi-channel deep representations, this study investigated a SELD model fused with multi-view attention, called the multi-view attention network (MVANet). First, the model adopted a soft-parameter-sharing network as the basic architecture to realize interactive learning between different tasks and to compute a multi-channel deep representation. Based on a comparison of different channel attention mechanisms, we chose multi-head self-attention, which attends to intra-channel features, along with a lightweight implementation of channel attention called efficient channel attention (ECA), which attends to inter-channel features. The multi-view attention mechanism helped the model pay more attention to the key features of a deep representation from the perspectives of channel, time, and frequency, enriching the high-dimensional feature information. Second, based on a comparison of the performance of the ECA module and the soft-parameter-sharing architecture at different positions, we chose the best scheme for extracting multi-view attention and improving the model's feature representations to the maximum extent. Experimental results showed that the MVANet model improved all localization and detection metrics compared with the baseline methods on the TAU-NIGENS Spatial Sound Events 2020 dataset, which contains overlapping acoustic events of the same category. In particular, in scenarios where multiple sound sources coexist, the detection error rate was reduced by 0.03 and the localization error by 1.5°.
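The ECA module mentioned in the abstract can be illustrated with a minimal NumPy sketch: channel descriptors from global average pooling are mixed by a 1-D convolution across the channel axis (local cross-channel interaction, no dimensionality reduction), then passed through a sigmoid to rescale each channel. A fixed averaging kernel stands in here for the learned 1-D convolution weights, and the kernel size `k=3` is illustrative; the paper's actual implementation may differ.

```python
import numpy as np

def eca(x, k=3):
    """Efficient Channel Attention (ECA) sketch for a (C, T, F) feature map.

    x : np.ndarray of shape (channels, time, frequency)
    k : odd kernel size of the 1-D cross-channel convolution
    """
    c = x.shape[0]
    y = x.mean(axis=(1, 2))            # (C,) global average pooling over time-frequency
    w = np.ones(k) / k                 # stand-in for the learned 1-D conv kernel
    y = np.convolve(np.pad(y, k // 2, mode="edge"), w, mode="valid")  # (C,) local channel mixing
    a = 1.0 / (1.0 + np.exp(-y))       # (C,) sigmoid gives per-channel attention weights in (0, 1)
    return x * a[:, None, None]        # rescale each channel of the feature map

rng = np.random.default_rng(0)
feat = np.abs(rng.standard_normal((8, 4, 5))) + 0.1  # toy (channel, time, freq) representation
out = eca(feat)
```

Because the attention weights lie in (0, 1), the module leaves the feature map's shape unchanged and only rescales channels, which is what makes it a lightweight alternative to reduction-based channel attention.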

sound event localization and detection; deep learning; multi-view attention; channel attention; multi-head self-attention

杨吉斌、黄翔、张雄伟、张强、梅鹏程


College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, Jiangsu, China

Unit 65334, Siping 136000, Jilin, China


Funding: National Natural Science Foundation of China (61471394, 62071484); University Basic and Frontier Science and Technology Innovation Project (KYZYJKQTZQ23001)

2024

Journal of Signal Processing (信号处理)
Chinese Institute of Electronics

Indexed in: CSTPCD; Peking University Core Journals List
Impact factor: 1.502
ISSN: 1003-0530
Year, volume (issue): 2024, 40(2)