Image semantics understanding
Roder, Francesco
2025/2026
Abstract
Image semantics understanding has evolved from early feature-based categorization toward unified multimodal reasoning systems that integrate perception and language. Classical approaches framed semantics as a mapping from handcrafted visual features to predefined object labels, establishing foundational principles of representation and supervised prediction while remaining limited in relational and compositional expressiveness. These limitations motivated the introduction of attributes and scene graphs to capture interactions and spatial relationships, enabling richer semantic reasoning at the cost of annotation complexity and limited scalability. The emergence of vision-language learning marked a paradigm shift by treating natural language as an open semantic space for visual representation. Contemporary models differ primarily in how visual information is interfaced with language-processing components. Embedding-level alignment methods provide scalable global representations but impose a semantic bottleneck, since each image must be compressed into a single global vector before it meets language. Bridged feature-interaction models expose curated visual tokens to language models, improving compositional grounding while preserving modularity. Large multimodal models integrate visual and textual tokens within unified architectures, allowing semantics to emerge dynamically through context-dependent reasoning. This thesis analyzes these paradigms through the lens of semantic interface design, highlighting the trade-offs between scalability, grounding fidelity, and reasoning depth. We argue that modern image understanding is best characterized not as static representation extraction but as a dynamic multimodal construction process shaped by architectural interface choices.
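To make the embedding-level interface concrete, the sketch below shows a CLIP-style symmetric contrastive objective, where pooled image and text features are projected into a shared space and each image must identify its matching caption in the batch. The encoder shapes, names, and temperature are illustrative assumptions, not details taken from the thesis.

```python
import torch
import torch.nn.functional as F

# Hypothetical projection heads standing in for any CLIP-style pair of towers:
# pooled visual features and pooled text features map into one joint space.
image_proj = torch.nn.Linear(2048, 512)  # pooled CNN/ViT features -> joint space
text_proj = torch.nn.Linear(768, 512)    # pooled token features   -> joint space

def contrastive_alignment_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text pairs."""
    img = F.normalize(image_proj(image_feats), dim=-1)
    txt = F.normalize(text_proj(text_feats), dim=-1)
    logits = img @ txt.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(len(img))       # the i-th image matches the i-th caption
    # Each image must retrieve its caption, and each caption its image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example: a batch of 8 pre-pooled feature vectors.
loss = contrastive_alignment_loss(torch.randn(8, 2048), torch.randn(8, 768))
```

Note how the interface is a single vector per image: everything the language side can see must survive this global pooling, which is the semantic bottleneck the abstract contrasts with token-level bridged and unified architectures.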
| File | Size | Format | Access |
|---|---|---|---|
| Roder_Francesco.pdf | 2.3 MB | Adobe PDF | Open access |
https://hdl.handle.net/20.500.12608/104343