Abstract
© 2025 The Authors
Sparse-view camera pose estimation, which aims to recover 6-Degree-of-Freedom (6-DoF) poses from a limited number of unordered multi-view images, is fundamental yet challenging in remote sensing. Learning-based methods offer greater robustness than traditional Structure-from-Motion (SfM) pipelines by leveraging dense high-dimensional features and implicit learning, rather than sparse keypoints and limited geometric constraints. However, they often neglect pairwise translation cues between views, resulting in suboptimal performance in sparse-view scenarios. To address this limitation, we introduce T-Graph, a lightweight, plug-and-play module that enhances camera pose estimation in sparse-view settings. T-Graph takes paired image features as input and maps them through a Multilayer Perceptron (MLP). It then constructs a fully connected translation graph, where nodes represent cameras and edges encode their translation relationships. The module can be integrated into most existing learning-based models as an additional branch in parallel with the original prediction, maintaining efficiency and ease of use. Furthermore, we introduce two pairwise translation representations, relative-t and pair-t, formulated under different local coordinate systems. While relative-t captures intuitive spatial relationships, pair-t offers a rotation-disentangled alternative. Together, the two representations let the module adapt to diverse application scenarios, further improving its robustness. We also propose an indicator, the Camera Axis Dispersion Ratio (CADR), to quantitatively assess which pairwise translation representation is better suited to a given camera configuration in a dataset. Extensive experiments on three representative methods (RelPose++, Forge and 8Pt-ViT) using public datasets (CO3D and IMC PhotoTourism) validate both the effectiveness and generalizability of T-Graph.
The results demonstrate consistent improvements across various metrics, notably camera center accuracy, which improves by up to 6% across 2 to 8 viewpoints.
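To make the data flow described above concrete, the following is a minimal sketch of a translation graph built from paired image features. It is an illustration under stated assumptions, not the authors' implementation: the function names (`make_mlp`, `mlp_forward`, `build_translation_graph`), the two-layer MLP with random untrained weights, the feature and hidden dimensions, the use of concatenation to pair features, and the choice of one directed edge per ordered camera pair are all hypothetical. The 3-vector on each edge stands in for a pairwise translation; the sketch does not implement the relative-t or pair-t parameterizations.

```python
import itertools
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Return random (untrained) weights for a two-layer MLP. Illustrative only."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2


def mlp_forward(weights, x):
    """Apply linear -> ReLU -> linear to the input vector x."""
    w1, w2 = weights
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def build_translation_graph(features, weights):
    """Build a fully connected directed graph over cameras.

    Nodes are camera indices; each edge (i, j) stores a 3-vector predicted
    from the concatenated features of images i and j (an assumed pairing).
    """
    edges = {}
    for i, j in itertools.permutations(range(len(features)), 2):
        pair_feature = features[i] + features[j]  # list concatenation
        edges[(i, j)] = mlp_forward(weights, pair_feature)
    return edges


# Toy usage: 4 views with 8-dimensional placeholder features.
rng = random.Random(1)
num_views, feat_dim = 4, 8
features = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(num_views)]
weights = make_mlp(in_dim=2 * feat_dim, hidden_dim=16, out_dim=3)
graph = build_translation_graph(features, weights)
print(len(graph))  # 4 * 3 = 12 directed edges
```

In a real model the per-image features would come from the host network's backbone and the MLP would be trained jointly with the original pose prediction branch; here both are random placeholders so the example runs on its own.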