
Sparse Depth Meets Monocular Depth Estimation: A Guided Diffusion Framework for Zero-Shot Depth Completion

Massimiliano Viola
Academic year 2023/2024

Abstract

Depth completion involves predicting a dense depth map from sparse depth measurements and a synchronized RGB image. Traditional deep learning-based methods solving this task often lack flexibility and struggle to generalize, especially in zero-shot settings where the data distribution differs from that seen during training. Meanwhile, diffusion models, originally introduced for image generation, are trained on large-scale datasets and have shown remarkable generalization capabilities across different modalities. Recently, these models have been successfully repurposed for other downstream computer vision tasks, including depth estimation from a single RGB image. Marigold, an affine-invariant monocular depth estimator derived from Stable Diffusion, has achieved state-of-the-art performance and demonstrated unprecedented levels of detail and spatial understanding. Our work aims to bridge the gap between monocular depth estimation, a rapidly advancing field, and depth completion, which lags behind recent innovations. Building on the Marigold framework, we reformulate depth completion as a monocular depth estimation task conditioned on extra information. Our novel plug-and-play approach uses diffusion guidance to integrate the available depth measurements into the diffusion process at inference time, without requiring retraining or architectural changes. As Marigold's depth predictions are affine-invariant, we iteratively learn a suitable scale and shift to produce results directly in metric space. Through extensive experiments, we demonstrate that this approach achieves state-of-the-art results in zero-shot depth completion across multiple diverse datasets. Our method outperforms specialized architectures, particularly in complex scenes with very sparse depth data, showcasing its robustness and superior generalization thanks to the world knowledge embedded in the base model weights. The significance of our findings extends beyond depth completion: by solving this task efficiently with a frozen, pretrained model, we open the door to broader applications of diffusion models in other computer vision tasks that require integrating additional constraints or data modalities, further demonstrating the versatility and potential of this class of generative models.
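The abstract mentions recovering metric depth from affine-invariant predictions by learning a suitable scale and shift against the sparse measurements. The thesis learns these iteratively inside the diffusion loop; as a minimal illustration of the underlying idea, the sketch below fits the scale and shift in closed form by least squares over the measured pixels. All function and variable names here are hypothetical, not taken from the thesis code.

```python
import numpy as np

def align_scale_shift(pred, sparse_depth, mask):
    """Fit scale s and shift t so that s * pred + t best matches the
    sparse metric measurements in a least-squares sense.

    pred         : affine-invariant depth prediction, shape (H, W)
    sparse_depth : metric depth values, valid only where mask is True
    mask         : boolean array marking pixels with a measurement
    """
    x = pred[mask]          # predicted (relative) depths at measured pixels
    y = sparse_depth[mask]  # metric depths from the sparse sensor
    A = np.stack([x, np.ones_like(x)], axis=1)  # design matrix [x, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred + t, s, t
```

With even a handful of valid measurements, this aligns the entire dense prediction into metric space; the guided-diffusion formulation in the thesis goes further by also steering the denoising trajectory toward the measurements.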
Keywords: Depth Completion, RGB-D Fusion, Diffusion Models, Depth Estimation, Computer Vision
Files in this item:
Viola_Massimiliano.pdf (Adobe PDF, 14.61 MB), under embargo until 21/10/2025

The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license; metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/74894