Image semantics understanding

Roder, Francesco
2025/2026

Abstract

Image semantics understanding has evolved from early feature-based categorization toward unified multimodal reasoning systems that integrate perception and language. Classical approaches framed semantics as a mapping from handcrafted visual features to predefined object labels, establishing foundational principles of representation and supervised prediction but remaining limited in relational and compositional expressiveness. These limitations led to the introduction of attributes and scene graphs to capture interactions and spatial relationships, enabling richer semantic reasoning at the cost of annotation complexity and limited scalability. The emergence of vision-language learning marked a paradigm shift by treating natural language as an open semantic space for visual representation. Contemporary models differ primarily in how visual information is interfaced with language processing components. Embedding-level alignment methods provide scalable global representations but impose a semantic bottleneck. Bridged feature interaction models expose curated visual tokens to language models, improving compositional grounding while preserving modularity. Large multimodal models integrate visual and textual tokens within unified architectures, allowing semantics to emerge dynamically through context-dependent reasoning. This thesis analyzes these paradigms through the lens of semantic interface design, highlighting the trade-offs between scalability, grounding fidelity, and reasoning depth. We argue that modern image understanding is best characterized not as static representation extraction but as a dynamic multimodal construction process shaped by architectural interface choices.
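To make the first paradigm concrete: embedding-level alignment maps an image and candidate text labels into a shared vector space and scores them by cosine similarity, as in CLIP-style zero-shot classification. The sketch below is a minimal illustration only, not the thesis's own method; it assumes precomputed toy embeddings, whereas real systems obtain them from learned image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so a dot product
    # equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_scores(image_emb, text_embs, temperature=0.07):
    # Cosine similarity between one image embedding and each
    # candidate label embedding, turned into a distribution
    # with a temperature-scaled softmax.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_embs)
    logits = txt @ img / temperature
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy 4-d embeddings (hypothetical): the image points mostly
# along the first label's direction.
image = np.array([0.9, 0.1, 0.0, 0.1])
labels = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. "a photo of a dog"
    [0.0, 1.0, 0.0, 0.0],   # e.g. "a photo of a cat"
    [0.0, 0.0, 1.0, 0.0],   # e.g. "a photo of a car"
])
probs = zero_shot_scores(image, labels)
print(probs.argmax())  # → 0: the first label matches best
```

The single similarity score per image–text pair is exactly the "semantic bottleneck" the abstract refers to: all visual detail must pass through one global vector, which is what the bridged and unified paradigms relax.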
Keywords: Image semantics, Image quality, Image understanding
File: Roder_Francesco.pdf (Adobe PDF, 2.3 MB, open access)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/104343