Objective To address the heterogeneity of multimodal data and effectively fuse data from different modalities for sentiment analysis. Methods We introduce a multimodal sentiment analysis model based on input space transformation, aiming to align the image and text modalities. For the image modality, an input space transformation module generates a textual description of each image in an autoregressive manner. For the text modality, the original text is combined with the generated text, providing rich textual input for the language model. The BERT language model is used to construct dynamic word embeddings, a Bi-GRU then captures the essential semantic features of the context, and softmax performs the final sentiment classification. Results The model surpasses the baseline models on two multimodal Twitter datasets. Conclusion The model can effectively process multimodal data.
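The classification stage described above (contextual word embeddings fed to a bidirectional GRU, followed by softmax) can be sketched minimally as follows. This is an illustrative reconstruction, not the paper's implementation: random vectors stand in for BERT token embeddings, the GRU weights are randomly initialised rather than trained, and the dimensions (16-d embeddings, 8-d hidden state, 3 sentiment classes) are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

class GRUCell:
    """Minimal GRU cell; weights here are random, trained in practice."""
    def __init__(self, d_in, d_hid):
        s = 1.0 / np.sqrt(d_hid)
        self.W = rng.uniform(-s, s, (3, d_hid, d_in))   # input weights for z, r, h~
        self.U = rng.uniform(-s, s, (3, d_hid, d_hid))  # recurrent weights

    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h)          # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ h)          # reset gate
        h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * h))  # candidate state
        return (1 - z) * h + z * h_tilde

def bi_gru(seq, fwd, bwd, d_hid):
    """Run the token sequence forwards and backwards; concatenate final states."""
    hf = np.zeros(d_hid)
    for x in seq:
        hf = fwd.step(x, hf)
    hb = np.zeros(d_hid)
    for x in reversed(seq):
        hb = bwd.step(x, hb)
    return np.concatenate([hf, hb])

# Toy dimensions; real BERT-base embeddings would be 768-dimensional per token.
d_emb, d_hid, n_classes = 16, 8, 3
# Stand-in for BERT embeddings of the combined original + generated text (10 tokens).
tokens = rng.normal(size=(10, d_emb))

fwd, bwd = GRUCell(d_emb, d_hid), GRUCell(d_emb, d_hid)
features = bi_gru(tokens, fwd, bwd, d_hid)       # 2 * d_hid contextual feature vector
W_cls = rng.normal(size=(n_classes, 2 * d_hid))  # untrained classifier weights
probs = softmax(W_cls @ features)                # sentiment class probabilities
print(probs.shape)
```

In the full model, the transformation module's generated caption is concatenated with the tweet text before tokenisation, so the image contributes to classification entirely through this textual pathway.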
Key words
multimodal sentiment analysis/input space transformation/modality fusion/BERT