Leveraging generative models for the optimization of 3D implicit representations
TOSO, SIMONE
2023/2024
Abstract
In recent years, Neural Radiance Fields (NeRFs) have been used to reconstruct 3D scenes from RGB images. They represent a density function and a color function with neural networks that are trained by minimizing a reconstruction loss with respect to the input views. A limitation of NeRFs is that they require many input views to achieve a good reconstruction. To overcome this problem, many research works use generative models to optimize the NeRF parameters. These approaches often rely on Score Distillation Sampling, which leverages the prior of large diffusion models to obtain a plausible 3D representation. Score Distillation Sampling can be used alongside the minimization of the reconstruction loss, yielding a 3D reconstruction from a smaller number of input views. In this thesis, we review recent work in the field and develop a new approach by building upon MVDream, an existing text-to-3D model. MVDream leverages multi-view diffusion models, which generate several views at the same time. At every iteration, the method renders four orthogonal views and uses them to compute the Score Distillation Sampling gradient. The denoising process is conditioned on a textual prompt describing the object. Our pipeline takes a set of input views with pose information and learns a NeRF representation by jointly minimizing a reconstruction loss and the Score Distillation Sampling loss, evaluated using the MVDream denoising network. We show that the multi-view diffusion model makes it possible to effectively reconstruct areas of the object that were not seen in the input views.
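The Score Distillation Sampling step mentioned above can be sketched in a few lines. This is a minimal, self-contained illustration and not the thesis code: the function names, the noise schedule, and the `sigma**2` weighting are assumptions chosen for clarity, and the toy `denoiser` stands in for a large diffusion network such as MVDream's. As in the original SDS formulation, the gradient skips the Jacobian of the denoiser.

```python
import random


def sds_gradient(rendered, timestep, denoiser, noise_schedule, rng=None):
    """Hypothetical minimal sketch of a Score Distillation Sampling gradient.

    rendered:       flattened rendered pixels (list of floats) from the NeRF.
    denoiser:       callable(noisy, t) -> predicted noise (stands in for the
                    diffusion model; a real pipeline would also pass the
                    text-prompt conditioning here).
    noise_schedule: dict mapping timestep t -> (alpha_t, sigma_t).
    """
    rng = rng or random.Random(0)
    alpha, sigma = noise_schedule[timestep]

    # Diffuse the rendering: x_t = alpha_t * x + sigma_t * eps.
    noise = [rng.gauss(0.0, 1.0) for _ in rendered]
    noisy = [alpha * x + sigma * n for x, n in zip(rendered, noise)]

    # Ask the diffusion prior to predict the injected noise.
    eps_pred = denoiser(noisy, timestep)

    # SDS gradient w.r.t. the rendered pixels: w(t) * (eps_theta - eps),
    # with the denoiser's Jacobian omitted. w(t) = sigma_t^2 is one
    # common weighting choice (an assumption here).
    w = sigma ** 2
    return [w * (ep - n) for ep, n in zip(eps_pred, noise)]
```

In the pipeline described above, this gradient would be backpropagated through the renderer into the NeRF parameters and summed with the gradient of the reconstruction loss on the posed input views.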
Full text: Toso_Simone.pdf (Adobe PDF, 6.03 MB, restricted access)
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/74201