Existing voice-face cross-modal association learning methods still face challenges in semantic correlation and supervision, and have not yet fully exploited the semantic interaction between voice and face. To address these problems, a self-supervised association learning method based on a multi-modal shared network was proposed. Firstly, the voice and face features were mapped onto the unit sphere to establish a shared feature space. Secondly, complex nonlinear relationships in the data were explored using the residual blocks of the multi-modal shared network, while a weight-sharing fully connected layer was used to strengthen the correlation between voice and face. Finally, pseudo-labels generated by the K-means clustering algorithm were used as supervisory signals to guide the metric learning process and accomplish the four cross-modal association learning tasks. Experimental results show that the proposed method achieves favorable performance on voice-face cross-modal verification, matching, and retrieval tasks, with several evaluation metrics improving in accuracy by 1%~4% over existing baseline methods.
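
The two ideas named above, mapping both modalities onto the unit sphere and routing them through a weight-sharing fully connected layer, can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation; the module names, layer sizes, and the plain linear projections are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceEncoder(nn.Module):
    """Illustrative two-branch encoder: modality-specific projections
    followed by a weight-sharing fully connected layer, with outputs
    L2-normalized onto the unit sphere (layer sizes are assumptions)."""

    def __init__(self, voice_dim=512, face_dim=512, embed_dim=128):
        super().__init__()
        # Modality-specific projections (hypothetical sizes).
        self.voice_proj = nn.Linear(voice_dim, embed_dim)
        self.face_proj = nn.Linear(face_dim, embed_dim)
        # One fully connected layer applied to BOTH modalities,
        # so its weights are shared between voice and face.
        self.shared_fc = nn.Linear(embed_dim, embed_dim)

    def forward(self, voice_feat, face_feat):
        v = self.shared_fc(F.relu(self.voice_proj(voice_feat)))
        f = self.shared_fc(F.relu(self.face_proj(face_feat)))
        # Map both embeddings onto the unit sphere (L2 normalization),
        # giving a common feature space where cosine similarity applies.
        return F.normalize(v, dim=-1), F.normalize(f, dim=-1)

if __name__ == "__main__":
    enc = SharedSpaceEncoder()
    v_emb, f_emb = enc(torch.randn(8, 512), torch.randn(8, 512))
    print(v_emb.norm(dim=-1))  # all ones: embeddings lie on the unit sphere
```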
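
The pseudo-label step can be sketched in the same spirit: cluster the pooled embeddings with K-means and use the resulting cluster assignments as supervisory signals for a metric-learning objective. The use of scikit-learn's KMeans and a simple margin-based cosine loss here are assumptions for illustration; the abstract does not specify the paper's clustering configuration or loss function.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def pseudo_label_metric_loss(voice_emb, face_emb, n_clusters=10, margin=0.2):
    """Illustrative self-supervision: K-means pseudo-labels over the joint
    embeddings act as class labels for a cosine-similarity margin loss
    (cluster count and margin are hypothetical)."""
    # Cluster voice and face embeddings jointly to obtain pseudo-labels.
    joint = torch.cat([voice_emb, face_emb], dim=0).detach().cpu().numpy()
    labels = torch.as_tensor(KMeans(n_clusters=n_clusters, n_init=10).fit_predict(joint))
    v_lab, f_lab = labels[: len(voice_emb)], labels[len(voice_emb):]

    # Cosine similarity between every voice and every face embedding
    # (inputs are assumed to be L2-normalized).
    sim = voice_emb @ face_emb.t()
    pos_mask = v_lab.unsqueeze(1) == f_lab.unsqueeze(0)

    # Pull voice-face pairs sharing a pseudo-label closer than pairs
    # from different clusters by at least `margin`.
    pos = sim[pos_mask].mean() if pos_mask.any() else sim.new_tensor(0.0)
    neg = sim[~pos_mask].mean() if (~pos_mask).any() else sim.new_tensor(0.0)
    return F.relu(neg - pos + margin)
```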