Although multimodal text classification techniques show promise in specific scenarios, they still have limitations. Existing multimodal fusion models require the modalities in the input data to be aligned, so a large amount of incomplete multimodal data is discarded outright, which limits the scale and flexibility of the data available for inference. To address this problem, we propose a text classification model based on multimodal fusion enhancement, together with an insufficient multimodal resource training method. Compared with traditional methods, our model improves performance by an average of 4.25% on a standard dataset. Furthermore, when the missing rate of the non-text modalities is 50%, the insufficient multimodal resource training method improves performance by about 4% compared with traditional multi-route strategies. The experimental results demonstrate the effectiveness of the proposed model and training method.
Key words
text classification/cross attention/multimodal fusion/insufficient multimodal resource training method
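As a rough illustration of the cross-attention fusion named in the keywords, the sketch below lets text vectors attend over the vectors of a second modality and falls back to the text features alone when that modality is missing, so that incomplete samples need not be discarded. It is a minimal sketch under stated assumptions, not the model proposed here: the function names (`cross_attention`, `fuse`) are hypothetical, there are no learned projection matrices, and the fallback rule is only one plausible way to keep incomplete samples usable.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, other_tokens):
    """Text vectors act as queries; the other modality supplies keys and values.

    Assumption: keys and values are the raw feature vectors themselves
    (no learned projections), purely to keep the sketch short.
    """
    dim = len(text_tokens[0])
    fused = []
    for q in text_tokens:
        # Scaled dot-product score between this text vector and each key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in other_tokens
        ]
        weights = softmax(scores)
        # Attention-weighted sum of the other modality's vectors.
        context = [
            sum(w * v[i] for w, v in zip(weights, other_tokens))
            for i in range(dim)
        ]
        # Residual connection: text features enhanced by the attended context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


def fuse(text_tokens, other_tokens=None):
    """Hypothetical fallback: use text-only features when the modality is missing."""
    if not other_tokens:
        return [list(q) for q in text_tokens]
    return cross_attention(text_tokens, other_tokens)


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    image = [[2.0, 2.0]]
    print(fuse(text, image))  # sample with both modalities
    print(fuse(text))         # sample whose non-text modality is missing
```

In this toy setup a sample with a missing modality still yields a representation of the same shape, which is the property a classifier downstream would need in order to train on incomplete data.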