Learning Dense Semantic Embeddings for Descriptive 3D Scene Representations
GOBBIN, ALBERTO
2023/2024
Abstract
This thesis explores the integration of geometric and semantic scene understanding in computer vision. The work builds on recent advances in neural scene representation, particularly Neural Radiance Fields (NeRF), which have transformed how 3D scenes are reconstructed from images. These advances are combined with self-supervised learning techniques, specifically DINO descriptors, which provide semantic understanding without requiring extensive labeled datasets. The proposed method extends the HashNeRF framework with semantic feature prediction, retaining HashNeRF's efficient encoding and fast training while adding the ability to learn and predict DINO features. The experiments yield several findings about how such a unified system can be trained effectively. Contrary to initial expectations, introducing semantic learning gradually during training provided no significant benefit over learning both aspects simultaneously from the start. A careful balance of the different loss components proved crucial for optimal performance, with different loss functions showing distinct strengths in preserving either geometric or semantic fidelity. The results also showed that increasing the emphasis on geometric reconstruction improves visual quality at the cost of semantic feature accuracy, highlighting an inherent trade-off in such unified representations. These findings contribute to our understanding of how neural scene representations can be enriched with semantic information, potentially enabling more sophisticated scene understanding and manipulation in a range of computer vision applications.

File | Size | Format
---|---|---
Gobbin_Alberto.pdf (embargoed until 05/12/2027) | 2.42 MB | Adobe PDF
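The abstract describes balancing a geometric (RGB reconstruction) loss against a semantic (DINO-feature) loss via weighted terms. A minimal, self-contained sketch of such a weighted combination is shown below; the function names, weights, and plain-list inputs are illustrative assumptions, not taken from the thesis.

```python
# Hypothetical sketch of a weighted geometric + semantic loss,
# as the abstract describes. All names and default weights are
# illustrative; the thesis's actual formulation may differ.

def mse(pred, target):
    """Mean squared error over two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def combined_loss(rgb_pred, rgb_gt, feat_pred, feat_gt,
                  w_rgb=1.0, w_feat=0.1):
    """Weighted sum of a geometric and a semantic term.

    Raising w_rgb relative to w_feat favors visual fidelity at the
    expense of semantic feature accuracy, mirroring the trade-off
    reported in the abstract.
    """
    return w_rgb * mse(rgb_pred, rgb_gt) + w_feat * mse(feat_pred, feat_gt)
```

In a real NeRF training loop, `rgb_pred`/`feat_pred` would be per-ray renders and `rgb_gt`/`feat_gt` the image pixels and precomputed DINO descriptors; the scalar weights are the tunable balance the thesis studies.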
The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License
https://hdl.handle.net/20.500.12608/78055