Multi-subspace multimodal sentiment analysis method based on Transformer
Multimodal sentiment analysis aims to recognize the emotions expressed by speakers in a video from textual, visual, and acoustic information. Most existing methods learn multimodal consistency information by designing complex fusion schemes while ignoring inter- and intra-modal differentiation information, so the multimodal fusion representation lacks complementary information. To this end, we propose a multi-subspace Transformer fusion network for multimodal sentiment analysis (MSTFN). The method maps each modality into a private and a shared subspace to obtain private and shared representations, learning both modality-specific and unified information. Specifically, the initial feature representation of each modality is first mapped into its private and shared subspaces to learn a private representation that carries modality-unique information and a shared representation that carries unified information. Second, with the roles of the textual and audio modalities strengthened, a binary collaborative-attention cross-modal Transformer module is designed to obtain text-anchored and audio-anchored tri-modal representations. Then, the final representation of each modality is generated from its private and shared representations, and these are fused pairwise into bimodal representations that further complement the multimodal fusion representation. Finally, the unimodal, bimodal, and trimodal representations are concatenated as the final multimodal feature for sentiment prediction. Experimental results on two benchmark multimodal sentiment analysis datasets show that the proposed method improves binary classification accuracy by 0.0256/0.0143 and 0.0007/0.0023, respectively, over the best baseline method.
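As a rough illustration of the pipeline described above, the following PyTorch sketch wires together the main stages: per-modality private projections, a single weight-shared projection for the shared subspace, text- and audio-anchored cross-modal attention, pairwise bimodal fusion, and concatenation of unimodal/bimodal/trimodal features for prediction. All dimensions, module choices, and names (e.g. `MSTFNSketch`, the use of plain linear layers and `nn.MultiheadAttention`) are illustrative assumptions, not the paper's actual MSTFN configuration.

```python
import torch
import torch.nn as nn


class MSTFNSketch(nn.Module):
    """Minimal sketch of a multi-subspace Transformer-style fusion pipeline
    (illustrative; not the paper's exact architecture)."""

    def __init__(self, dims=None, d_model=128, heads=4, n_classes=1):
        super().__init__()
        dims = dims or {"text": 300, "audio": 74, "visual": 35}
        self.modalities = list(dims)
        # Project raw features of each modality to a common width.
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        # Private subspace: one encoder per modality (modality-unique information).
        self.private = nn.ModuleDict({m: nn.Linear(d_model, d_model) for m in dims})
        # Shared subspace: a single encoder applied to every modality (unified information).
        self.shared = nn.Linear(d_model, d_model)
        # Cross-modal attention blocks anchored on text and audio queries.
        self.text_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        # Pairwise (bimodal) fusion and final prediction head.
        self.bimodal = nn.Linear(2 * d_model, d_model)
        # Final feature: 3 unimodal + 3 bimodal + 2 anchored trimodal = 8 * d_model.
        self.head = nn.Linear(8 * d_model, n_classes)

    def forward(self, feats):
        # feats[m]: (batch, seq_len, input_dim) sequence features per modality.
        h = {m: self.proj[m](x) for m, x in feats.items()}
        priv = {m: torch.relu(self.private[m](h[m])) for m in h}
        shar = {m: torch.relu(self.shared(h[m])) for m in h}
        # Final per-modality representation: private + shared, mean-pooled over time.
        uni = {m: (priv[m] + shar[m]).mean(dim=1) for m in h}
        # Text- and audio-anchored tri-modal representations via cross-modal attention.
        others_t = torch.cat([h["audio"], h["visual"]], dim=1)
        others_a = torch.cat([h["text"], h["visual"]], dim=1)
        tri_t, _ = self.text_attn(h["text"], others_t, others_t)
        tri_a, _ = self.audio_attn(h["audio"], others_a, others_a)
        tri = torch.cat([tri_t.mean(dim=1), tri_a.mean(dim=1)], dim=-1)
        # Pairwise fusion of unimodal representations into bimodal representations.
        pairs = [("text", "audio"), ("text", "visual"), ("audio", "visual")]
        bi = torch.cat(
            [torch.relu(self.bimodal(torch.cat([uni[a], uni[b]], dim=-1))) for a, b in pairs],
            dim=-1,
        )
        # Concatenate unimodal, bimodal, and trimodal features for sentiment prediction.
        fused = torch.cat([uni[m] for m in self.modalities] + [bi, tri], dim=-1)
        return self.head(fused)


# Example usage with random sequence features (batch of 8, 50 time steps).
batch = {
    "text": torch.randn(8, 50, 300),
    "audio": torch.randn(8, 50, 74),
    "visual": torch.randn(8, 50, 35),
}
scores = MSTFNSketch()(batch)  # shape (8, 1): one sentiment score per sample
```

The sketch keeps the structural idea of the abstract (private/shared subspaces, text- and audio-anchored cross-modal attention, and concatenated unimodal/bimodal/trimodal features); details such as the attention stacking depth, pooling, and loss are left out.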