Sound Event Localization and Detection Model Based on Multi-View Attention
In recent years, the performance of sound event localization and detection (SELD) methods based on deep learning has improved quickly. However, in practical applications, the presence of multiple sound sources makes it difficult for existing SELD models to accurately extract the spatiotemporal information of deep features, which seriously degrades performance. To study the key information contained in learned multi-channel deep representations, this study investigated a SELD model fused with multi-view attention, called the multi-view attention network (MVANet). First, the model adopted a soft-parameter-sharing network as the basic architecture to realize interactive learning between different tasks and compute a multi-channel deep representation. Based on a comparison of different channel attention mechanisms, we chose multi-head self-attention, which attends to intra-channel features, along with a lightweight implementation of channel attention called efficient channel attention (ECA), which attends to inter-channel features. The multi-view attention mechanism helped the model focus on the key features of a deep representation from the perspectives of channel, time, and frequency, enriching the high-dimensional feature information. Second, based on a comparison of the performance of the ECA module and the soft-parameter-sharing architecture at different positions, we chose the best scheme to extract the multi-view attention and improve the model's feature representations as much as possible. Experimental results showed that MVANet improved all localization and detection metrics compared to the baseline methods on the TAU-NIGENS Spatial Sound Events 2020 dataset, which contains overlapping acoustic events of the same category. In particular, the detection error rate was reduced by 0.03 and the localization error was reduced by 1.5° in scenarios where multiple sound sources coexisted.
sound event localization and detection; deep learning; multi-view attention; channel attention; multi-head self-attention
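As an illustration of the inter-channel attention mentioned in the abstract, the following is a minimal NumPy sketch of an ECA-style gate: each channel is reduced to one descriptor by global average pooling, a small 1-D filter mixes neighbouring channel descriptors, and a sigmoid turns the result into per-channel weights. It is not the MVANet implementation. The function name `eca`, the `(channels, time, frequency)` layout, and the fixed `kernel` argument (which would be learned in a trained model) are assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def eca(features, kernel):
    """ECA-style channel gate (illustrative sketch, not the paper's code).

    features: array of shape (C, T, F), an assumed channel/time/frequency layout.
    kernel:   1-D array of odd length k; stands in for learned conv weights.
    Returns the re-weighted features and the per-channel weights.
    """
    num_channels = features.shape[0]
    k = kernel.shape[0]
    # Global average pooling: one scalar descriptor per channel.
    pooled = features.mean(axis=(1, 2))
    # Zero-pad so every channel sees k neighbouring descriptors.
    padded = np.pad(pooled, k // 2)
    # 1-D cross-correlation along the channel axis (local channel interaction).
    scores = np.array([padded[i:i + k] @ kernel for i in range(num_channels)])
    weights = sigmoid(scores)
    # Broadcast each channel weight over the time and frequency axes.
    return features * weights[:, None, None], weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 10, 16))   # hypothetical feature map
    w = rng.standard_normal(3)             # placeholder for learned weights
    y, gate = eca(x, w)
    print(y.shape, gate.shape)
```

Because the gate uses only `k` weights regardless of the number of channels, it adds very few parameters, which matches the abstract's description of ECA as a lightweight form of channel attention.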