The advent of new infrastructure construction and the era of intelligent photogrammetry have driven the rapid development of global spaceborne and airborne remote sensing technology. Numerous multi-sensor stereoscopic observation systems have been deployed on spaceborne, airborne, and terrestrial platforms, and sensor types have evolved from traditional single-mode sensors (e.g., optical sensors) to a new generation of multimodal sensors (e.g., multispectral, hyperspectral, light detection and ranging (LiDAR), and synthetic aperture radar (SAR) sensors). These advanced sensors can dynamically provide multimodal remote sensing images with different spatial, temporal, and spectral resolutions. Through joint processing of spaceborne, airborne, and terrestrial multimodal data, they can yield more reliable, comprehensive, and accurate observations than single-modal sensors. Multi-level and multi-perspective Earth observation can be achieved effectively only by fully integrating and utilizing these multimodal remote sensing images, so multimodal remote sensing image registration is of great scientific significance. To promote research on multimodal remote sensing image registration, we systematically review, analyze, and summarize the current mainstream registration methods for multimodal remote sensing images. We first trace the development and evolution of research from single-modal to multimodal remote sensing image registration. We then analyze the core ideas of representative algorithms in the area-based, feature-based, and deep-learning-based pipelines, and introduce the contributions of our team to multimodal remote sensing image registration. The area-based registration (template matching) pipeline mainly includes two types: information theory-based and structural feature-based registration methods. The structural feature-based methods comprise sparse
structural features and dense structural features.From the perspective of the robustness and efficiency of comprehensive registration,dense-structure-feature-based methods have obvious effectiveness and advantages in handling significant non-linear radiation differences between multimodal remote sensing images and can meet many current application needs.By contrast,area-based registration pipeline generally relies on geo-referencing of remote sensing images to predict the rough range of template matching.Feature-based registration methods can be refined into three categories:feature registration based on gradient optimization,local self-similarity(LSS),and phase consistency.The feature registration of gradient opti-mization usually designs consistent gradients for specific multimodal images.The generalization of this type of method based on gradient optimization is generally poor,and it has difficulty maintaining the same performance on other types of multimodal images.The feature registration of LSS also has limitations,given that the relatively low discriminative power of LSS descriptors may result in the inability to maintain robust matching performance in the presence of complex nonlinear radiation differences.The feature registration of phase consistency has high computational complexity,and the registration process is generally time consuming.Feature-based registration pipeline utilizes the local spatial relationship between adja-cent pixels to construct a high-dimensional information feature vector for each feature point.Compared with template match-ing methods,they usually face a heavy computational burden,and inevitable serious outliers are prone to occur in match-ing,especially in multimodal registration situations where scale,rotation,and radiation differences exist simultaneously.In general,the registration robustness of feature-based methods is not as stable as that of area-based methods.The deep-learning-based pipeline can be divided into modular and 
end-to-end registration methods. The most common strategy for modular registration methods is to embed deep networks into feature-based or area-based methods. This approach exploits the fully data-driven, high-dimensional deep feature extraction capability of deep learning to generate more robust features, more effective descriptors, or better similarity measures, which improves the robustness of image registration. Modular registration methods can be subdivided into three categories: learning-based template matching, learning-based feature matching, and style transfer-based modal unification. Modular registration methods are easy to train and highly flexible, but they have difficulty avoiding the error accumulation that easily occurs in multi-stage tasks and may fall into local optima. End-to-end registration methods construct a single neural network that directly estimates the geometric transformation parameters or the deformation field to achieve image registration. The end-to-end network has a consistent training objective and can be optimized toward a global optimum. However, such methods are difficult to train and have poor interpretability. Moreover, no complete and comprehensive database containing all types of multimodal remote sensing image pairs is available to date, and the lack of training and testing data greatly limits the development of deep-learning-based registration methods. Furthermore, we share existing public registration datasets of multimodal remote sensing images, supplemented by a small number of registration datasets from the field of computer vision. Finally, the existing problems and challenges in current research on high-precision registration of multimodal remote sensing images are analyzed. An outlook on future research trends is also given, with the aim of promoting further breakthroughs and innovations in the field of multimodal remote sensing
image registration.
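
As an illustration of the information theory-based template matching mentioned above, the following minimal Python sketch scores candidate positions with mutual information, a commonly used similarity measure of this kind. It is not taken from any of the surveyed algorithms: the function names `mutual_information` and `match_template_mi`, the histogram bin count, and the exhaustive search are assumptions chosen for brevity.

```python
import numpy as np


def mutual_information(a, b, bins=8):
    """Estimate mutual information (in nats) between two equally sized patches."""
    # Joint intensity histogram, normalized to a joint probability table.
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()
    # Marginal distributions of each patch.
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    # Sum only over non-empty cells to avoid log(0).
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


def match_template_mi(image, template, bins=8):
    """Exhaustively search for the window of `image` with maximal MI to `template`.

    Returns ((row, col), score) for the top-left corner of the best window.
    """
    H, W = image.shape
    h, w = template.shape
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            score = mutual_information(image[r:r + h, c:c + w], template, bins)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

Because mutual information depends only on the statistical dependence between intensities, it can tolerate the kind of nonlinear radiation differences discussed above, for example a template whose intensities are a nonlinear, inverted mapping of the reference patch. In practice the search window would be restricted to the rough range predicted from geo-referencing rather than scanned exhaustively.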