
Sparse Depth Meets Monocular Depth Estimation: A Guided Diffusion Framework for Zero-Shot Depth Completion

Massimiliano Viola
Academic year 2023/2024

Abstract

Depth completion involves predicting a dense depth map from sparse depth measurements and a synchronized RGB image. Traditional deep learning-based methods solving this task often lack flexibility and struggle to generalize, especially in zero-shot settings where the data distribution differs from that seen during training. Meanwhile, diffusion models, originally introduced for image generation, are trained on large-scale datasets and have shown remarkable generalization capabilities across different modalities. Recently, these models have been successfully repurposed for other downstream computer vision tasks, including depth estimation from a single RGB image. Marigold, an affine-invariant monocular depth estimator derived from Stable Diffusion, has achieved state-of-the-art performance and demonstrated unprecedented levels of detail and spatial understanding. Our work aims to bridge the gap between monocular depth estimation, a rapidly advancing field, and depth completion, which lags behind recent innovations. Building on the Marigold framework, we reformulate depth completion as a monocular depth estimation task conditioned on extra information. Our novel plug-and-play approach uses diffusion guidance to integrate the available depth measurements into the diffusion process at inference time, without requiring retraining or architectural changes. As Marigold's depth predictions are affine-invariant, we iteratively learn a suitable scale and shift to produce results directly in metric space. Through extensive experiments, we demonstrate that this approach achieves state-of-the-art results in zero-shot depth completion across multiple diverse datasets. Our method outperforms specialized architectures, particularly in complex scenes with very sparse depth data, showcasing its robustness and superior generalization thanks to the world knowledge embedded in the base model weights. The significance of our findings extends beyond depth completion: by solving this task efficiently with a frozen, pretrained model, we open the door to broader applications of diffusion models in other computer vision tasks that require integrating additional constraints or data modalities, further demonstrating the versatility and potential of this class of generative models.
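The abstract mentions recovering metric depth from affine-invariant predictions by learning a suitable scale and shift against the sparse measurements. The thesis learns these iteratively inside the diffusion loop; as a minimal illustration of the underlying idea, the sketch below fits the scale and shift in closed form by least squares over the measured pixels. All function and variable names here are hypothetical, not taken from the thesis code.

```python
import numpy as np

def align_scale_shift(pred, sparse_depth, mask):
    """Fit scale s and shift t so that s * pred + t best matches the
    sparse metric measurements in a least-squares sense.

    pred         : affine-invariant depth prediction, shape (H, W)
    sparse_depth : metric depth values, valid only where mask is True
    mask         : boolean array marking pixels with a measurement
    """
    x = pred[mask]          # predicted (relative) depths at measured pixels
    y = sparse_depth[mask]  # metric depths from the sparse sensor
    A = np.stack([x, np.ones_like(x)], axis=1)  # design matrix [x, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred + t, s, t
```

With even a handful of valid measurements, this aligns the entire dense prediction into metric space; the guided-diffusion formulation in the thesis goes further by also steering the denoising trajectory toward the measurements.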
Keywords: Depth Completion, RGB-D Fusion, Diffusion Models, Depth Estimation, Computer Vision
Files in this item:
Viola_Massimiliano.pdf (Adobe PDF, 14.61 MB), under embargo until 21/10/2025

The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license; metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/74894