Cross-modal image-text retrieval, a pivotal topic in cross-modal computing research, has garnered considerable attention from both academia and industry. Over the past few decades, fueled by advances in deep learning technologies, particularly deep neural networks, Transformer architectures, and image-text contrastive learning, the field of image-text retrieval has witnessed significant progress and breakthroughs. Based on a systematic review of the development trajectory of image-text cross-modal retrieval, this paper focuses on five key steps in the modeling process: preparing training data, designing data input formats, selecting mechanisms for extracting image-text features, selecting image-text modeling methods, and establishing optimization objectives. To objectively evaluate the performance of existing models on cross-modal retrieval tasks, various models are compared across multiple authoritative benchmark datasets, revealing the practical performance boundaries of current retrieval methods. By analyzing and summarizing the evolution of each key step and considering current research outcomes, the paper predicts and envisions future trends in cross-modal learning. The findings indicate substantial advances in current image-text retrieval technologies while highlighting opportunities for further improvement. Researchers can advance the field by focusing on four areas: refined retrieval, economical pre-training methods, new image-text interaction approaches, and image-text pre-training empowered by Artificial Intelligence Generated Content (AIGC).
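Among the optimization objectives mentioned above, image-text contrastive learning is the one most closely tied to recent breakthroughs. As a minimal, illustrative sketch (not any specific model's implementation), the following shows a CLIP-style symmetric contrastive objective, assuming a batch of paired image/text embeddings where the i-th image matches the i-th text and a fixed temperature hyperparameter; all function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_style_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    The i-th image and i-th text form a positive pair; every other
    image-text combination in the batch serves as a negative.
    """
    img = [l2_normalize(v) for v in img_embs]
    txt = [l2_normalize(v) for v in txt_embs]
    n = len(img)
    # Temperature-scaled cosine-similarity matrix: sim[i][j] = <img_i, txt_j> / T.
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
            for j in range(n)] for i in range(n)]

    def cross_entropy(logits, target):
        # Numerically stable log-softmax cross-entropy for one row/column.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    loss_i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n  # image -> text
    loss_t2i = sum(cross_entropy([sim[i][j] for i in range(n)], j)  # text -> image
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

In practice this objective is computed on learned encoder outputs with a trainable temperature; the sketch only illustrates the symmetric pull-together/push-apart structure that aligns the two modalities in a shared embedding space.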