Incremental Joint Learning of Depth, Pose, and Implicit Scene Representation on Monocular Camera in Large-Scale Scenes
/ Authors
/ Abstract
Dense scene reconstruction for photo-realistic view synthesis has various applications, such as VR/AR and robotic navigation. Existing dense reconstruction methods are primarily designed for small room-scale scenarios, but in practice, the scenes encountered by robots are typically large-scale environments. Most existing methods struggle in large-scale scenes due to three core challenges: (a) inaccurate depth input. Depth information is crucial for both scene geometry reconstruction and pose estimation, yet accurate depth input is difficult to obtain in real-world large-scale scenes. (b) inaccurate pose estimation. Existing methods are not robust to the accumulation of drift in large scenes and long sequences. (c) insufficient scene representation capability. A single global radiance field lacks the capacity to scale effectively to large-scale scenes. To this end, we propose an incremental joint learning framework that achieves accurate depth estimation, pose estimation, and large-scale dense scene reconstruction. For depth estimation, a vision transformer-based network is adopted as the backbone to enhance performance in scale information estimation. For pose estimation, a feature-metric bundle adjustment (FBA) method is designed to achieve accurate and robust camera tracking in large-scale scenes and to eliminate pose drift. For implicit scene representation, we propose an incremental scene representation method that constructs the entire large-scale scene as multiple local radiance fields, enhancing the scalability of 3D scene representation. Within each local radiance field, we propose a tri-plane-based scene representation method to further improve the accuracy and efficiency of scene reconstruction. We conduct extensive experiments on various datasets, including our own collected data, to demonstrate the effectiveness and accuracy of our method in depth estimation, pose estimation, and large-scale scene reconstruction.
The code has been open-sourced at https://github.com/dtc111111/incre-dpsr.
Note to Practitioners—In practical robot deployment scenarios, dense scene reconstruction can play a crucial role, for example in precise localization and obstacle avoidance. However, existing scene reconstruction methods are mainly suited to small room-scale environments and struggle with larger-scale settings such as long corridors or extended sequences. Our method introduces an incremental joint learning framework that simultaneously performs depth estimation, pose estimation, and scene reconstruction. Experimental results demonstrate the accuracy and effectiveness of this approach across different scene types.
Journal: IEEE Transactions on Automation Science and Engineering