Audio object detection network with multimodal cross-level feature knowledge transfer
As an inherent property of objects, sound can provide valuable information for object detection. At present, methods that localize targets solely by monitoring environmental sound lack robustness. To address this problem, a multimodal self-supervised object detection network based on cross-level feature knowledge transfer is proposed. First, because a student network that learns only from the teacher network's features at the same level has limited learning ability, a fusion-based cross-level knowledge transfer loss is designed: the student's deep and shallow features are fused through attention, so that the student learns the corresponding intermediate-layer features of the teacher more efficiently and extracts more knowledge. Combined with KL divergence, this loss aligns the student network with the teacher network. In addition, to address the problem of missing localization information, a localization distillation loss is added, and more localization information is obtained by fitting the teacher's distribution. With the network trained on the multimodal audio-visual detection (MAVD) dataset, the mAP values improve over the baseline network by 6.71%, 14.36%, and 10.32% at IoU thresholds of 0.5 and 0.75 and on average, respectively. The experimental results demonstrate the superiority of this detection network.
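The two distillation terms described above can be illustrated with a minimal sketch in plain Python. It is not the network's actual implementation. The abstract does not give the exact formulas, so several details are assumptions: the attention fusion is reduced to two scalar weights computed from mean activations, features are treated as flat vectors turned into distributions by a temperature softmax, and the localization term assumes each box edge is predicted as a distribution over discretized offsets. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_fuse(shallow, deep):
    """Fuse shallow and deep student features with two attention weights.

    Assumption: the weights are a softmax over each level's mean
    activation; the abstract does not specify the attention form.
    """
    scores = [sum(shallow) / len(shallow), sum(deep) / len(deep)]
    w_shallow, w_deep = softmax(scores)
    return [w_shallow * s + w_deep * d for s, d in zip(shallow, deep)]


def cross_level_transfer_loss(student_shallow, student_deep, teacher_mid,
                              temperature=1.0):
    """KL between the teacher's intermediate features and the fused student features."""
    fused = attention_fuse(student_shallow, student_deep)
    p_teacher = softmax(teacher_mid, temperature)
    q_student = softmax(fused, temperature)
    return kl_divergence(p_teacher, q_student)


def localization_distillation_loss(teacher_edges, student_edges,
                                   temperature=1.0):
    """Mean KL over box edges.

    Assumption: each edge is a list of logits over discretized offsets,
    so the student can fit the teacher's per-edge distribution.
    """
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_edges, student_edges)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    teacher_mid = [0.0, 3.0, 0.0, 0.0]
    shallow = [0.1, 0.2, 0.3, 0.4]
    deep = [2.0, -1.0, 0.5, 0.0]
    print(cross_level_transfer_loss(shallow, deep, teacher_mid))

    teacher_edges = [[0.0, 2.0, 0.5]] * 4  # left, top, right, bottom
    student_edges = [[1.0, 0.0, 0.5]] * 4
    print(localization_distillation_loss(teacher_edges, student_edges))
```

Both terms vanish when the student reproduces the teacher's distribution and grow as the two diverge, which is the behavior the alignment objective relies on. In a real detector these would be computed on feature maps and summed with the detection loss using weighting coefficients that the abstract does not state.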