Sparse Depth Meets Monocular Depth Estimation: A Guided Diffusion Framework for Zero-Shot Depth Completion
Viola, Massimiliano
2023/2024
Abstract
Depth completion involves predicting a dense depth map from sparse depth measurements and a synchronized RGB image. Traditional deep learning-based methods for this task often lack flexibility and struggle to generalize, especially in zero-shot settings where the test data distribution differs from training. Meanwhile, diffusion models, originally introduced for image generation, are trained on large-scale datasets and have shown remarkable generalization capabilities across different modalities. Recently, these models have been successfully repurposed for other downstream computer vision tasks, including depth estimation from a single RGB image. Marigold, an affine-invariant monocular depth estimator derived from Stable Diffusion, has achieved state-of-the-art performance and demonstrated unprecedented levels of detail and spatial understanding. Our work aims to bridge the gap between monocular depth estimation, a rapidly advancing field, and depth completion, which lags behind recent innovations. Building on the Marigold framework, we reformulate depth completion as a monocular depth estimation task conditioned on extra information. Our novel plug-and-play approach uses diffusion guidance to integrate the available depth measurements into the diffusion process at inference time, without retraining or architectural changes. Because Marigold's depth predictions are affine-invariant, we iteratively learn a suitable scale and shift to produce results directly in metric space. Through extensive experiments, we demonstrate that this approach achieves state-of-the-art results in zero-shot depth completion across multiple diverse datasets. Our method outperforms specialized architectures, particularly in complex scenes with very sparse depth data, showcasing its robustness and superior generalization thanks to the world knowledge embedded in the base model weights. The significance of our findings extends beyond depth completion: by solving this task efficiently with a frozen, pretrained model, we open the door to broader applications of diffusion models in other computer vision tasks that require integrating additional constraints or data modalities, further demonstrating the versatility and potential of this class of generative models.
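To make the guidance mechanism described in the abstract concrete, the sketch below shows one plausible shape such inference-time steering could take in PyTorch: at each denoising step, the sparse measurements pull both the latent and a jointly learned scale/shift toward metric agreement, while the pretrained weights stay frozen. This is a minimal sketch under stated assumptions, not the thesis implementation; all component names (`unet`, `scheduler`, `decode_depth`) and hyperparameters are illustrative stand-ins for a Marigold-style latent diffusion pipeline.

```python
import torch

@torch.enable_grad()
def guided_completion(rgb_latent, sparse_depth, mask, unet, scheduler,
                      decode_depth, guide_lr=0.05, affine_lr=0.01):
    """Steer a frozen depth diffusion model with sparse metric measurements.

    rgb_latent:   encoded RGB conditioning image (hypothetical shape (C, h, w))
    sparse_depth: metric depth map (H, W), valid only where mask is True
    decode_depth: maps a depth latent to an affine-invariant map in [0, 1]
    """
    latent = torch.randn_like(rgb_latent)          # start from pure noise
    scale = torch.tensor(1.0, requires_grad=True)  # affine parameters mapping the
    shift = torch.tensor(0.0, requires_grad=True)  # model output to metric depth

    for t in scheduler.timesteps:
        latent = latent.detach().requires_grad_(True)
        eps = unet(torch.cat([rgb_latent, latent], dim=0), t)

        # Preview the clean depth the model currently implies (x0 estimate),
        # map it to metric space, and score it on the sparse observations.
        x0 = scheduler.step(eps, t, latent).pred_original_sample
        pred = scale * decode_depth(x0) + shift
        loss = torch.abs(pred[mask] - sparse_depth[mask]).mean()

        # One gradient step on the latent (guidance) and one on scale/shift,
        # leaving the pretrained model weights entirely untouched.
        g_lat, g_s, g_b = torch.autograd.grad(loss, [latent, scale, shift])
        with torch.no_grad():
            scale -= affine_lr * g_s
            shift -= affine_lr * g_b
            latent = scheduler.step(eps, t, latent - guide_lr * g_lat).prev_sample

    return (scale * decode_depth(latent) + shift).detach()
```

The key design point this illustrates is that optimizing scale and shift alongside the guidance gradient is what lets an affine-invariant predictor produce metric depth at test time, with no retraining or architectural change.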
| File | Size | Format |
|---|---|---|
| Viola_Massimiliano.pdf (embargo until 21/10/2025) | 14.61 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/74894