Aiming at the problem that single-modal sentiment analysis cannot fully capture emotional information, a cross-modal visual-textual sentiment analysis model (BERT-VistaNet) was proposed. Instead of using visual information directly as features, it was used as an alignment signal: an attention mechanism pointed out the important sentences in the text, yielding a visual-attention-based document representation. For text content that visual attention could not fully cover, the BERT model was used to perform sentiment analysis on the text and obtain a text-based document representation, and the two representations were fused for the sentiment classification task. On the Yelp public restaurant dataset, the accuracy of this model is 43% higher than that of the baseline model TFN-aVGG and 1.4% higher than that of the VistaNet model.
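The fusion idea described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, the dot-product scoring function, and the random projection matrix are all placeholder assumptions, whereas the real model uses learned attention parameters and a trained BERT encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8                                   # toy feature dimension (assumption)
sentences = rng.normal(size=(5, d))     # encodings of 5 sentences in a review
image = rng.normal(size=(d,))           # pooled visual feature of an image
text_doc = rng.normal(size=(d,))        # BERT-style text document vector

# Visual attention: the image scores each sentence; the normalised weights
# select the important sentences and produce a visual-attention document
# representation (visual information as alignment, not as a feature).
scores = sentences @ image
alpha = softmax(scores)                 # attention over sentences, sums to 1
visual_doc = alpha @ sentences

# Fuse the visual-attention and text-based representations, then classify
# sentiment (5 classes here, matching Yelp's 5-star ratings).
fused = np.concatenate([visual_doc, text_doc])
W = rng.normal(size=(2 * d, 5))         # untrained classifier (placeholder)
probs = softmax(fused @ W)
```

In the actual model the attention scores and classifier weights are learned end to end; this sketch only shows how the two document representations are formed and combined.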