Depth-Supervised Neural Surface Reconstruction from Airborne Imagery

Vincent Hackstein

Duration: 6 months
Completion: August 2023
Supervisors: Prof. Dr.-Ing. Norbert Haala (ifp), Dr.-Ing. Mathias Rothermel, Dr.-Ing. Patrick Tutzauer (both nFrames GmbH)
Examiner: Prof. Dr.-Ing. Norbert Haala

Introduction and Motivation

3D surface reconstruction from multi-view images is useful for many applications, including urban modeling, environmental studies, simulations, robotics, and virtual reality. In the last decade, much of the research has focused on improving individual steps of multi-view stereo (MVS) pipelines with learning-based approaches. More recently, neural surface reconstruction based on Neural Radiance Fields (NeRF) [1] has emerged as a promising alternative to MVS approaches. Neural reconstruction methods use neural networks to learn an implicit representation of the geometry and appearance of a scene. Despite impressive reconstruction results, these methods share common limitations: the enormous computational cost of training such models [1,3], and convergence issues when the available data evidence is low, ambiguous, or contradictory [2,4,5,6]. One key reason for these challenges is that the optimization is mostly driven by photometric consistency, which becomes ambiguous in shaded or textureless regions. This leads to an under-constrained optimization problem, resulting in imprecise, incomplete, or non-smooth surfaces. Previous works have shown that the optimization can be constrained further by incorporating prior knowledge about the scene geometry.

The objective of this work is to evaluate the use of depth priors in neural surface reconstruction from airborne imagery. The main contributions are:

Focus on reconstruction from airborne imagery.
Evaluations of dense, low-resolution depth priors for geometric initialization.
Evaluations of sparse depth priors obtained through Aerial Triangulation (AT) for geometric initialization and guided optimization.
A proposed method to enhance robustness against outliers within priors obtained through AT.

Method

Reconstruction Pipeline. The reconstruction pipeline is based on VolSDF [7], which is an adaptation of NeRF [1]. To speed up training, the positional encoding [1] of VolSDF is replaced with the multi-resolution hash grid encoding proposed by Instant-NGP [9]. Training can be supervised by several loss functions: a photometric consistency loss [1]; an Eikonal [7] or explicit surface smoothness [8] loss for smoothness regularization; and signed-distance [4] and free-space [4] losses for depth supervision.
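
The two depth-supervision terms can be illustrated with a minimal sketch for a single ray, loosely following the formulation in [4]. The function name, the truncation parameter, and the choice to leave samples far behind the surface unsupervised are assumptions made for illustration; the exact form used in the thesis is not given in this summary.

```python
def depth_supervision_losses(sample_depths, predicted_sdf, prior_depth, truncation):
    """Return (free_space_loss, signed_distance_loss) for one ray.

    sample_depths: distances of the samples along the ray
    predicted_sdf: SDF value predicted by the network at each sample
    prior_depth:   depth of the surface along the ray according to the prior
    truncation:    half-width of the band around the prior surface (assumed)
    """
    free_space_terms = []
    sdf_terms = []
    for t, sdf in zip(sample_depths, predicted_sdf):
        # Approximate signed distance of the sample to the prior surface,
        # measured along the ray (positive in front of the surface).
        target = prior_depth - t
        if target > truncation:
            # Free space between camera and surface band:
            # pull the prediction towards the truncation value.
            free_space_terms.append((sdf - truncation) ** 2)
        elif abs(target) <= truncation:
            # Inside the band: pull the prediction towards the prior distance.
            sdf_terms.append((sdf - target) ** 2)
        # Samples far behind the surface are left unsupervised (assumption).

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(free_space_terms), mean(sdf_terms)
```

Both terms vanish when the predicted SDF agrees with the prior along the ray, so they only constrain the optimization where the prediction and the prior disagree.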

Generation of Depth Priors. Low-resolution depth maps are rendered from 2.5D meshes computed by the MVS software SURE [10]. The meshes are computed from different image resolutions to generate multiple sets of priors with different levels of precision. These priors are used for geometric initialization.

Sparse depth maps are generated by triangulating tie points (TPs). The set of TPs is obtained by matching and filtering corresponding image features across multi-view images. These priors are incorporated into the reconstruction process for initialization and guided training.
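
How triangulated tie points turn into a sparse depth map can be sketched with a simple pinhole projection. This is an illustrative sketch only: the camera model (no lens distortion, a single focal length), the use of z-depth in the camera frame, and all names are assumptions, not the thesis implementation.

```python
import math


def sparse_depth_map(points_world, rotation, center, focal, principal, width, height):
    """Project 3D tie points into one image and return {(row, col): depth}.

    rotation:  3x3 world-to-camera rotation matrix (list of rows)
    center:    camera projection center in world coordinates
    focal:     focal length in pixels
    principal: principal point (cx, cy) in pixels
    """
    depth = {}
    for point in points_world:
        # Transform the point into the camera frame: x_cam = R (X - C).
        offset = [point[i] - center[i] for i in range(3)]
        x_cam = [sum(rotation[r][i] * offset[i] for i in range(3)) for r in range(3)]
        z = x_cam[2]
        if z <= 0:
            continue  # point lies behind the camera
        u = focal * x_cam[0] / z + principal[0]
        v = focal * x_cam[1] / z + principal[1]
        col, row = math.floor(u), math.floor(v)
        if 0 <= col < width and 0 <= row < height:
            # Keep the closest point if several fall into the same pixel.
            key = (row, col)
            if key not in depth or z < depth[key]:
                depth[key] = z
    return depth
```

Only pixels hit by a tie point receive a depth value, which is why the resulting maps are sparse and leave regions without tie points unconstrained.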

To increase robustness against outliers within TP clouds, this thesis proposes two robustness measures that can be incorporated into training: a reliability measure based on the fold (the number of connected images per TP) and a precision measure based on the covariances estimated during triangulation.
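
This summary does not state how the two measures are computed, so the following is a purely hypothetical sketch of how fold and covariance could be combined into a per-tie-point weight for the depth loss. The saturation at fold_cap, the reference standard deviation sigma_ref, and the multiplicative combination are all assumptions. The only standard step is the variance propagation along the viewing ray, sigma_d^2 = r^T C r for a unit ray direction r and covariance matrix C.

```python
import math


def tie_point_weight(fold, covariance, ray_direction, fold_cap=5, sigma_ref=0.1):
    """Hypothetical weight in [0, 1] for one tie point.

    fold:          number of images in which the tie point was observed
    covariance:    3x3 covariance matrix of the triangulated point
    ray_direction: viewing ray from the camera to the point (any length)
    fold_cap:      fold at which the reliability saturates (assumed)
    sigma_ref:     depth standard deviation treated as "precise" (assumed)
    """
    # Reliability: more connected images -> higher weight, capped at 1.
    reliability = min(fold, fold_cap) / fold_cap

    # Precision: propagate the covariance onto the unit viewing ray.
    norm = math.sqrt(sum(c * c for c in ray_direction))
    r = [c / norm for c in ray_direction]
    variance = sum(r[i] * covariance[i][j] * r[j] for i in range(3) for j in range(3))
    precision = 1.0 / (1.0 + variance / sigma_ref ** 2)

    return reliability * precision
```

Such a weight would multiply the depth loss of the corresponding ray, so that poorly connected or imprecise tie points influence the optimization less than well-determined ones.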

Experiment Setup

Three datasets (Domkirk, Frauenkirche, and Hessigheim) were used for the experiments in this work. They were selected for their differing characteristics. The properties of the datasets are listed in Table 1. Mesh and Digital Surface Model (DSM) representations were extracted for qualitative analysis. For quantitative analysis, accuracy [12], completeness [12], and noise scores were computed.
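
The accuracy and completeness scores can be sketched on point sets in the spirit of [12]: accuracy is the fraction of reconstructed points within a distance tolerance of the ground truth, and completeness is the fraction of ground-truth points within that tolerance of the reconstruction. The brute-force nearest-neighbor search, the function names, and the tolerance value are assumptions for illustration; the sampling and tolerances used in the thesis are not stated here.

```python
import math


def fraction_within(source, target, tolerance):
    """Fraction of source points whose nearest target point is within tolerance."""
    if not source or not target:
        return 0.0
    hits = sum(
        1 for p in source
        if min(math.dist(p, q) for q in target) <= tolerance
    )
    return hits / len(source)


def accuracy_and_completeness(reconstruction, ground_truth, tolerance):
    """Return (accuracy, completeness) for two 3D point sets."""
    accuracy = fraction_within(reconstruction, ground_truth, tolerance)
    completeness = fraction_within(ground_truth, reconstruction, tolerance)
    return accuracy, completeness
```

Because the two scores swap the roles of reconstruction and ground truth, a reconstruction with spurious geometry loses accuracy, while one with holes loses completeness.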

Table 1: Properties of the individual datasets used in this thesis. The provided GSDs are average values estimated by SURE [10].

Results

Baseline. Baseline models are trained as a reference for analyzing the improvements gained from the different types of depth priors. For these models, the smoothness regularization was tuned; a combination of the Eikonal term [7] and the explicit surface smoothness term [8] works best in terms of accuracy and completeness. This combination is used for all subsequent experiments.
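
The Eikonal term penalizes deviations of the SDF gradient norm from 1. A minimal sketch is given below, using central finite differences on an analytic field instead of automatic differentiation on a network. The helper names, the step size, and the sphere example are assumptions for illustration.

```python
import math


def sdf_gradient(sdf, point, eps=1e-4):
    """Central finite-difference gradient of a scalar field at a 3D point."""
    grad = []
    for axis in range(3):
        forward = list(point)
        backward = list(point)
        forward[axis] += eps
        backward[axis] -= eps
        grad.append((sdf(forward) - sdf(backward)) / (2 * eps))
    return grad


def eikonal_loss(sdf, points, eps=1e-4):
    """Mean squared deviation of the gradient norm from 1 over the points."""
    if not points:
        return 0.0
    terms = []
    for p in points:
        g = sdf_gradient(sdf, p, eps)
        norm = math.sqrt(sum(c * c for c in g))
        terms.append((norm - 1.0) ** 2)
    return sum(terms) / len(terms)


def sphere_sdf(point, radius=1.0):
    """Exact signed distance to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in point)) - radius


def scaled_field(point):
    """Not a valid SDF: its gradient norm is 2 everywhere away from the origin."""
    return 2.0 * sphere_sdf(point)
```

For the exact sphere SDF the loss is close to zero, while the scaled field yields a loss close to 1, which shows how the term pushes the network output towards a true distance field.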

Dense Low-Resolution Depth Priors. The priors are used to initialize the geometry of the scenes. In many experiments, the initialization reduced convergence to local minima and led to a more complete and accurate reconstruction than the baseline models. The observed improvements increase with the precision of the priors. With the most precise priors, volumetric reconstruction is sped up by approximately a factor of 3 for the Domkirk and Hessigheim datasets. The Frauenkirche dataset shows, however, that better convergence is not guaranteed. In some experiments, the optimization did not recover from an imprecise initialization, which led to worse convergence than the baseline. Based on these observations, precise depth priors were investigated further.

Sparse Depth Priors from AT. All sets of priors improve the reconstruction in terms of accuracy, completeness, and runtime. By guiding the optimization, precise tie points reduce convergence to local minima caused, for example, by low data evidence or by ambiguous or contradictory photometric information. Because the topology becomes more reliable, a volumetric reconstruction of a scene is reached more reliably and faster (up to a factor of 3) than in reconstructions without depth priors. The reconstruction of detail is not accelerated. Outliers within the tie point clouds can cause noise or incompleteness in the reconstruction. The proposed method for increasing robustness against erroneous tie points successfully addresses these issues. However, it also reduces the benefit provided by valid depth estimates and slightly decreases the overall accuracy and completeness of the reconstructed surfaces.

Figure 1: Reconstruction results for the Domkirk dataset after approximately 3 hours of training. Left: Using RGB input only. Right: Depth-supervised training based on AT-derived priors.

Figure 2: Reconstruction results for the Frauenkirche dataset after approximately 3 hours of training. Left: Using RGB input only. Right: Depth-supervised training based on AT-derived priors.

Figure 3: Reconstruction results for the Hessigheim dataset after approximately 3 hours of training. Left: Using RGB input only. Right: Depth-supervised training based on AT-derived priors.

Figure 4: Accuracy (solid line) and completeness (dashed line) scores for the Frauenkirche dataset using the different priors. The model supported by the AT-derived prior reaches the best scores. Note that in some cases the accuracy score is equal to the completeness score.

Conclusion

The evaluations in this thesis show that depth priors can improve neural surface reconstruction in terms of completeness, accuracy, and runtime.

While geometric initialization with low-resolution depth priors improves reconstruction results for some datasets, priors derived from AT improve the results across all datasets. The most significant improvements are achieved in the first stages of training, namely the geometric initialization and the subsequent shape refinement towards topologically correct reconstructions. Topology becomes more reliable and is reached in less time. Metric scores and qualitative analysis indicate a speed-up of up to a factor of 3. The reconstruction of detail, required for an end-to-end reconstruction, is only slightly accelerated and still requires long training durations (> 10 hours for the tested scenes). As expected, reconstruction issues remain in areas where no tie points are computed. Erroneous tie points can cause noise and errors in the reconstruction. This thesis proposes a method that enhances robustness against incorrect depth estimates. While this method successfully addresses the issues introduced by outliers, it slightly reduces the benefit provided by valid estimates and slightly decreases overall accuracy and completeness.

References (Selection)

[1]: Ben Mildenhall et al. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In: ECCV. 2020.

[2]: Barbara Roessle et al. Dense Depth Priors for Neural Radiance Fields from Sparse Input Views. In: CVPR. June 2022.

[3]: Kangle Deng et al. Depth-supervised NeRF: Fewer Views and Faster Training for Free. In: CVPR. June 2022.

[4]: Dejan Azinović et al. Neural RGB-D Surface Reconstruction. In: CVPR. June 2022.

[5]: Qiancheng Fu et al. Geo-Neus: Geometry-Consistent Neural Implicit Surfaces Learning for Multi-view Reconstruction. In: NeurIPS. 2022.

[6]: Zehao Yu et al. MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction. In: NeurIPS. 2022.

[7]: Lior Yariv et al. Volume rendering of neural implicit surfaces. In: NeurIPS. 2021.

[8]: Michael Oechsle et al. UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction. In: ICCV. 2021.

[9]: Thomas Müller et al. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. In: ACM Trans Graph. July 2022.

[10]: Mathias Rothermel et al. SURE: Photogrammetric Surface Reconstruction from Imagery. 2012.

[11]: Michael Kölle et al. The Hessigheim 3D (H3D) benchmark on semantic segmentation of high-resolution 3D point clouds and textured meshes from UAV LiDAR and Multi-View-Stereo. In: ISPRS. 2021.

[12]: Thomas Schöps et al. A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos. In: CVPR. 2017.
