Current multimodal pre-training techniques for vision and language predominantly focus on aligning global semantic features between images and text, yet they inadequately explore fine-grained feature interactions between modalities. To address this gap, this paper proposes a novel multimodal pre-training strategy based on cross-modal guidance and alignment. Our method employs a dual-stream feature extraction network designed for visual sequence compression to facilitate modality feature extraction. During this phase, synergistic image-text guidance is integrated within the visual encoder, orchestrating the compression of visual sequences layer by layer; this mitigates the obfuscation of modality-specific fine-grained interactions by irrelevant visual information. Subsequently, in the modality feature alignment phase, we perform fine-grained relational reasoning on the image and text features to achieve localized alignment between visual tokens and textual tokens, strengthening the model's comprehension of fine-grained alignment relationships. After fine-tuning, on image-text retrieval tasks our approach achieves an average recall of 86.4% for images and 94.88% for texts, a 5.36% improvement in zero-shot image-text retrieval over the canonical CLIP (Contrastive Language-Image Pre-training) algorithm. Moreover, our method also surpasses existing mainstream multimodal pre-training methods in accuracy on classification tasks such as visual question answering.
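To make the notion of localized alignment between visual and textual tokens concrete, the following is a minimal sketch of a token-level similarity score. It is not the paper's exact formulation: the function name, shapes, and the max-over-tokens aggregation (a FILIP-style late-interaction choice) are all assumptions for illustration.

```python
# Hedged sketch of fine-grained (token-level) image-text alignment.
# Assumption: both encoders emit L2-normalized token features; the
# max-over-tokens aggregation is one common choice, not the paper's.
import numpy as np

def token_alignment_score(img_tokens, txt_tokens):
    """Symmetric average of each token's best cross-modal match.

    img_tokens: (n_img, d) array of L2-normalized visual token features
    txt_tokens: (n_txt, d) array of L2-normalized textual token features
    """
    sim = img_tokens @ txt_tokens.T  # (n_img, n_txt) cosine similarities
    i2t = sim.max(axis=1).mean()     # image->text: best text token per image token
    t2i = sim.max(axis=0).mean()     # text->image: best image token per text token
    return 0.5 * (i2t + t2i)         # fine-grained alignment score in [-1, 1]

# toy usage with random normalized features (49 visual tokens, 12 text tokens)
rng = np.random.default_rng(0)
img = rng.normal(size=(49, 64)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(12, 64)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
score = token_alignment_score(img, txt)
```

In a contrastive setting, such a score would replace the single global image-text cosine similarity inside the loss, so that gradients reward local token correspondences rather than only the pooled representations.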