Research on Modality Fusion Strategies for a Multimodal Video Classification Task
Despite the success of AI-related technologies in many fields, they usually simulate only one type of human perception, which means they are limited to processing information from a single modality. Extracting features from multiple modalities and fusing them effectively is important for developing general AI. In this paper, a comparative study of different multimodal information fusion strategies is conducted on a video classification task, based on an encoder-decoder architecture: early feature fusion for encoding multimodal features, late decision fusion of the prediction results of each modality, and a combination of both. This paper also compares two ways of involving audio-modal information in modality fusion, i.e., encoding the audio directly as features before fusion, or converting the speech to text and fusing it in text form. Experiments show that, under the experimental approach of this study, decision fusion of the prediction results of the text and audio modalities alone with those of the fused features of the other two modalities can further improve classification accuracy. Moreover, converting speech into text-modal information by ASR (Automatic Speech Recognition) makes fuller use of the semantic information it contains.
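For concreteness, the following is a minimal PyTorch sketch contrasting the two strategies compared above: early feature fusion (concatenating per-modality features before a shared encoder) and late decision fusion (combining per-modality predictions). The feature dimensions, module names, three-modality setup, and the averaging rule are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 10  # hypothetical number of video categories


class EarlyFusionClassifier(nn.Module):
    """Early feature fusion: concatenate modality features, then encode and classify jointly."""

    def __init__(self, video_dim=512, text_dim=256, audio_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(video_dim + text_dim + audio_dim, 512),
            nn.ReLU(),
        )
        self.head = nn.Linear(512, NUM_CLASSES)

    def forward(self, video_feat, text_feat, audio_feat):
        fused = torch.cat([video_feat, text_feat, audio_feat], dim=-1)  # fuse at the feature level
        return self.head(self.encoder(fused))


class LateFusionClassifier(nn.Module):
    """Late decision fusion: classify each modality separately, then combine the predictions."""

    def __init__(self, video_dim=512, text_dim=256, audio_dim=128):
        super().__init__()
        self.video_head = nn.Linear(video_dim, NUM_CLASSES)
        self.text_head = nn.Linear(text_dim, NUM_CLASSES)
        self.audio_head = nn.Linear(audio_dim, NUM_CLASSES)

    def forward(self, video_feat, text_feat, audio_feat):
        logits = torch.stack([
            self.video_head(video_feat),
            self.text_head(text_feat),
            self.audio_head(audio_feat),
        ])
        return logits.mean(dim=0)  # fuse at the decision level by averaging logits
```

A combined strategy, as studied in the paper, would feed the early-fused representation and the individual modality predictions into a final decision-fusion step; the sketch above only isolates the two basic building blocks.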