Hybrid Encoder and Architectures for Advanced Monocular Depth Estimation: A Comparative Synthesis Approach

DI LABBIO, DANIELA
2024/2025

Abstract

Monocular depth estimation represents a critical capability in computer vision systems, with applications ranging from autonomous driving to augmented reality. Despite significant advances in this field, current solutions often require a trade-off between model performance and computational efficiency, limiting their practical deployment. This thesis introduces an architectural integration approach that combines the complementary strengths of two leading models, Xi-Net and Lite-Mono, to address this fundamental challenge. The methodology involves a systematic evaluation of architectural fusion strategies, analyzing how components from both source models can be combined while preserving their respective strengths. We conduct extensive experiments using the KITTI dataset, a comprehensive benchmark suite for autonomous driving applications, to validate our approach across diverse real-world scenarios including urban, residential, and highway environments. The model variants we develop offer different trade-offs between performance and resource utilization, making them suitable for a range of deployment scenarios from mobile devices to more powerful computing platforms. This work contributes to the field of computer vision by establishing a new paradigm for model integration that could serve as a blueprint for future efforts in creating efficient, high-performance vision systems. Furthermore, our findings advance the understanding of architectural design principles that enable effective model scaling across different computational constraints, potentially enabling broader adoption of advanced computer vision capabilities in resource-constrained environments.
2024
Vision
Depth
Estimation
Optimization
Files in this item:
File: DiLabbio_Daniela.pdf
Access: open access
Size: 19.39 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/81802