Pitch Estimation Model Based on Mamba-UNet Architecture
Pitch estimation algorithms for single-source audio mainly include the Robust Algorithm for Pitch Tracking (RAPT), the Sawtooth Waveform Inspired Pitch Estimator (SWIPE), and Harvest. However, when the source is polyphonic music, such as singing with musical accompaniment, these algorithms show clear shortcomings in vocal pitch estimation. Drawing on existing research, this work improves the original Robust Model for Vocal Pitch Estimation (RMVPE) and proposes Mamba-RMVPE, a model based on the Mamba-UNet architecture, to address vocal pitch estimation in multi-source audio such as polyphonic music. Compared with the original RMVPE, Mamba-RMVPE achieves higher Raw Pitch Accuracy (RPA), Raw Chroma Accuracy (RCA), and Overall Accuracy (OA), and its inference time is substantially shorter.
Keywords: polyphony; pitch estimation; Robust Model for Vocal Pitch Estimation (RMVPE); Mamba-UNet
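The abstract reports RPA, RCA, and OA without defining them. The sketch below shows how these three metrics are commonly computed in melody-extraction evaluation; it is an illustration under stated assumptions, not the evaluation code used in this work. It assumes frame-aligned pitch tracks in Hz, that 0 Hz marks an unvoiced frame, and a tolerance of 50 cents (half a semitone). The function names `hz_to_cents` and `pitch_metrics` are hypothetical.

```python
import numpy as np


def hz_to_cents(f_hz, base_hz=10.0):
    """Convert positive frequencies to cents above base_hz; 0 Hz (unvoiced) maps to 0."""
    f_hz = np.asarray(f_hz, dtype=float)
    cents = np.zeros_like(f_hz)
    voiced = f_hz > 0
    cents[voiced] = 1200.0 * np.log2(f_hz[voiced] / base_hz)
    return cents


def pitch_metrics(ref_hz, est_hz, tol_cents=50.0):
    """Return (RPA, RCA, OA) for two frame-aligned pitch tracks.

    Assumption: a value of 0 Hz means the frame is unvoiced.
    """
    ref_hz = np.asarray(ref_hz, dtype=float)
    est_hz = np.asarray(est_hz, dtype=float)
    ref_voiced = ref_hz > 0
    est_voiced = est_hz > 0
    both_voiced = ref_voiced & est_voiced

    # Signed pitch error in cents for every frame.
    diff = hz_to_cents(est_hz) - hz_to_cents(ref_hz)

    # RPA: the estimate must lie within the tolerance of the reference pitch.
    pitch_ok = both_voiced & (np.abs(diff) < tol_cents)

    # RCA: fold the error into a single octave so octave errors are forgiven.
    chroma_diff = diff - 1200.0 * np.round(diff / 1200.0)
    chroma_ok = both_voiced & (np.abs(chroma_diff) < tol_cents)

    n_ref_voiced = int(ref_voiced.sum())
    rpa = pitch_ok.sum() / n_ref_voiced if n_ref_voiced else 0.0
    rca = chroma_ok.sum() / n_ref_voiced if n_ref_voiced else 0.0

    # OA: voiced frames with correct pitch plus unvoiced frames
    # correctly left unvoiced, over all frames.
    unvoiced_ok = ~ref_voiced & ~est_voiced
    oa = (pitch_ok.sum() + unvoiced_ok.sum()) / ref_hz.size
    return float(rpa), float(rca), float(oa)


# Four frames: correct pitch, octave error, correct silence, missed voicing.
rpa, rca, oa = pitch_metrics([220, 220, 0, 440], [220, 440, 0, 0])
```

In this four-frame example the octave error counts against RPA but not RCA, giving RPA = 1/3, RCA = 2/3, and OA = 0.5. This is why RCA is always at least as large as RPA, and why OA also depends on how well voiced and unvoiced frames are distinguished.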