Review of cross-view image geolocalization methods
Cross-view image geolocalization aims to determine the geographic location of images obtained from various viewpoints or perspectives, providing technical support for downstream tasks such as autonomous driving, robot navigation, and three-dimensional reconstruction. The field involves matching images captured from different views, such as satellite and ground-level images, to accurately estimate their geographic coordinates. Cross-view image geolocalization is difficult because the images differ in viewpoint, scale, illumination, and appearance. It requires handling viewpoint variation and geometric transformations as well as the large search space of possible matching locations. Early studies on image geolocalization were mainly based on single-view images. Single-view image geolocalization obtains the geolocation of a given image by searching an image database for a same-view reference image with prelabeled geolocation information. However, traditional single-view methods are usually limited by the quality and scale of the dataset, and thus their positioning accuracy is usually low. To overcome these limitations, researchers have proposed a series of cross-view image geolocalization methods that use image data from multiple perspectives and increase positioning accuracy by comparing and matching across those perspectives. Given the complexity of geolocalization tasks and solutions, existing cross-view image geolocalization methods can be classified in multiple ways. This review introduces several classification schemes and representative methods of each type, and compares their advantages and disadvantages. On the one hand, the diversification of platforms and the increase in multisource data provide more source data choices for cross-view image geolocalization. Based on the differences in matching image 
sources, cross-view image geolocalization methods can be classified into ground-satellite and drone-satellite image-oriented methods. Ground-satellite geolocalization locates a query ground-view image on a satellite image. Although ground-satellite geolocalization has various application prospects, the large change in viewing angle produces a huge visual difference between ground- and satellite-view images, which makes the matching task difficult. The drone-satellite geolocalization task, although a relatively new form of cross-view image geolocalization, is receiving increasing attention. Unlike ground images, drone images suffer less occlusion, cover more of the scene, and are closer to the satellite perspective. The release of University-1652, a geolocalization dataset containing drone, ground, and satellite images, provides data support for related research. On the other hand, feature extraction can be used to solve the geographic location problem of horizontal images. Based on how image features are extracted and expressed, cross-view image geolocalization methods can be classified into those based on hand-designed features and those based on features learned by deep neural networks. The former mainly comprise methods based on handcrafted feature descriptors, such as scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and oriented FAST and rotated BRIEF (ORB), which can be compared using Euclidean or cosine distance or fed directly into machine learning models, such as support vector machines and random forests. Nevertheless, methods in this category exhibit weak robustness, cannot be fine-tuned for specific tasks, and have limited accuracy. With the rise of deep learning and the release of large annotated datasets, such as CVUSA and CVACT, deep neural networks have been applied to cross-view image geolocalization. Based on whether 
view alignment is incorporated and how it is implemented, methods based on self-learning features of deep neural networks can be subdivided into three categories: those without view alignment, those with view alignment based on traditional image transformations, and those with view alignment based on image generation. Methods without view alignment focus on end-to-end learning of sufficiently discriminative image feature representations, with networks built mainly on convolutional neural networks and attention mechanisms. Such methods make full use of the content information in images but often ignore the spatial relationship between images of different views (such as ground and aerial views). Methods with view alignment based on traditional image transformations compensate for this defect. They apply traditional image transformations to explicitly provide additional spatial information for the input images, which narrows the domain gap between cross-view images. Such methods include the polar coordinate transformation and the perspective image transformation. Methods with view alignment based on image generation usually first use generative neural networks to generate image samples with realistic viewing angles and then match these generated images with real ones to infer the corresponding geographic positions. The generative adversarial network is a representative model in this category. Apart from describing and categorizing methods, this review summarizes the commonly used datasets and their characteristics, including CVUSA, CVACT, and VIGOR for street view-satellite image matching, University-1652 for ground-drone-satellite image matching, and SUES-200 for drone-satellite image matching. In addition, this paper summarizes the commonly used metrics for model performance evaluation, including Recall@K, average precision (AP), and Hit Rate-K. The 
evaluation is based on performance on the CVUSA, CVACT, and University-1652 datasets. Finally, this review offers a view on the application areas and future development directions of cross-view image geolocalization. Although this research field has achieved considerable breakthroughs and progress, it still faces obstacles and challenges, such as the lack of multimodal datasets, difficulties in nonrigid scenarios, and the need for real-time and online geolocation. Possible solutions and future research priorities are proposed to further promote development and innovation in this field. These include creating multimodal geolocalization datasets, combining multiscale and multiview information to solve the geolocation problem in nonrigid scenes, and fusing data from other sensors to achieve real-time geolocation.
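
As an illustration of the retrieval-style evaluation mentioned above, the following is a minimal sketch of Recall@K under a common definition: the fraction of queries whose ground-truth reference image appears among the K most similar references. It is not taken from any surveyed method. The function names, the use of cosine similarity for ranking, and the toy two-dimensional descriptors are all assumptions made for the example; real systems would rank learned or handcrafted image descriptors instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_references(query, references):
    """Return reference indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, ref) for ref in references]
    return sorted(range(len(references)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, references, ground_truth, k):
    """Fraction of queries whose true reference index is in the top-k ranking."""
    hits = 0
    for query, true_index in zip(queries, ground_truth):
        if true_index in rank_references(query, references)[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical toy descriptors (illustrative values, not real image features).
# references play the role of satellite-view descriptors,
# queries the role of ground- or drone-view descriptors.
references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.2]]
ground_truth = [0, 1, 2]  # index of the matching reference for each query

r1 = recall_at_k(queries, references, ground_truth, k=1)
r2 = recall_at_k(queries, references, ground_truth, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

On this toy data the third query ranks its true reference second, so Recall@1 is 2/3 while Recall@2 is 1.0, which shows why results are usually reported for several values of K.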