Existing voice-face cross-modal association learning methods still face challenges in semantic correlation and supervision, and have not yet fully exploited the semantic interaction between voice and face. To address these problems, a self-supervised association learning method based on a multi-modal shared network was proposed. Firstly, the voice and face features were mapped onto the unit sphere to establish a shared feature space. Secondly, complex nonlinear relationships in the data were explored using the residual blocks of the multi-modal shared network, while a weight-sharing fully connected layer was used to strengthen the correlation between voice and face. Finally, pseudo-labels generated by the K-means clustering algorithm were used as supervisory signals to guide the metric learning process and accomplish the four cross-modal association learning tasks. Experimental results show that the proposed method achieves favorable performance on voice-face cross-modal verification, matching, and retrieval tasks, with several evaluation metrics improving in accuracy by 1%~4% over existing baseline methods.
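
The two ideas named above, mapping both modalities onto the unit sphere and routing them through a weight-sharing fully connected layer, can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation; the module names, layer sizes, and the plain linear projections are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceEncoder(nn.Module):
    """Illustrative two-branch encoder: modality-specific projections
    followed by a weight-sharing fully connected layer, with outputs
    L2-normalized onto the unit sphere (layer sizes are assumptions)."""

    def __init__(self, voice_dim=512, face_dim=512, embed_dim=128):
        super().__init__()
        # Modality-specific projections (hypothetical sizes).
        self.voice_proj = nn.Linear(voice_dim, embed_dim)
        self.face_proj = nn.Linear(face_dim, embed_dim)
        # One fully connected layer applied to BOTH modalities,
        # so its weights are shared between voice and face.
        self.shared_fc = nn.Linear(embed_dim, embed_dim)

    def forward(self, voice_feat, face_feat):
        v = self.shared_fc(F.relu(self.voice_proj(voice_feat)))
        f = self.shared_fc(F.relu(self.face_proj(face_feat)))
        # Map both embeddings onto the unit sphere (L2 normalization),
        # giving a common feature space where cosine similarity applies.
        return F.normalize(v, dim=-1), F.normalize(f, dim=-1)

if __name__ == "__main__":
    enc = SharedSpaceEncoder()
    v_emb, f_emb = enc(torch.randn(8, 512), torch.randn(8, 512))
    print(v_emb.norm(dim=-1))  # all ones: embeddings lie on the unit sphere
```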
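
The pseudo-label step can be sketched in the same spirit: cluster the pooled embeddings with K-means and use the resulting cluster assignments as supervisory signals for a metric-learning objective. The use of scikit-learn's KMeans and a simple margin-based cosine loss here are assumptions for illustration; the abstract does not specify the paper's clustering configuration or loss function.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def pseudo_label_metric_loss(voice_emb, face_emb, n_clusters=10, margin=0.2):
    """Illustrative self-supervision: K-means pseudo-labels over the joint
    embeddings act as class labels for a cosine-similarity margin loss
    (cluster count and margin are hypothetical)."""
    # Cluster voice and face embeddings jointly to obtain pseudo-labels.
    joint = torch.cat([voice_emb, face_emb], dim=0).detach().cpu().numpy()
    labels = torch.as_tensor(KMeans(n_clusters=n_clusters, n_init=10).fit_predict(joint))
    v_lab, f_lab = labels[: len(voice_emb)], labels[len(voice_emb):]

    # Cosine similarity between every voice and every face embedding
    # (inputs are assumed to be L2-normalized).
    sim = voice_emb @ face_emb.t()
    pos_mask = v_lab.unsqueeze(1) == f_lab.unsqueeze(0)

    # Pull voice-face pairs sharing a pseudo-label closer than pairs
    # from different clusters by at least `margin`.
    pos = sim[pos_mask].mean() if pos_mask.any() else sim.new_tensor(0.0)
    neg = sim[~pos_mask].mean() if (~pos_mask).any() else sim.new_tensor(0.0)
    return F.relu(neg - pos + margin)
```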