Image semantics understanding
Roder, Francesco
2025/2026
Abstract
Image semantics understanding has evolved from early feature-based categorization toward unified multimodal reasoning systems that integrate perception and language. Classical approaches framed semantics as a mapping from handcrafted visual features to predefined object labels, establishing foundational principles of representation and supervised prediction while remaining limited in relational and compositional expressiveness. These limitations motivated the introduction of attributes and scene graphs to capture interactions and spatial relationships, enabling richer semantic reasoning at the cost of annotation complexity and limited scalability. The emergence of vision-language learning marked a paradigm shift by treating natural language as an open semantic space for visual representation. Contemporary models differ primarily in how visual information is interfaced with language-processing components. Embedding-level alignment methods provide scalable global representations but impose a semantic bottleneck, since each image must be compressed into a single global vector before it meets language. Bridged feature-interaction models expose curated visual tokens to language models, improving compositional grounding while preserving modularity. Large multimodal models integrate visual and textual tokens within unified architectures, allowing semantics to emerge dynamically through context-dependent reasoning. This thesis analyzes these paradigms through the lens of semantic interface design, highlighting the trade-offs between scalability, grounding fidelity, and reasoning depth. We argue that modern image understanding is best characterized not as static representation extraction but as a dynamic multimodal construction process shaped by architectural interface choices.
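To make the embedding-level interface concrete, the sketch below shows a CLIP-style symmetric contrastive objective, where pooled image and text features are projected into a shared space and each image must identify its matching caption in the batch. The encoder shapes, names, and temperature are illustrative assumptions, not details taken from the thesis.

```python
import torch
import torch.nn.functional as F

# Hypothetical projection heads standing in for any CLIP-style pair of towers:
# pooled visual features and pooled text features map into one joint space.
image_proj = torch.nn.Linear(2048, 512)  # pooled CNN/ViT features -> joint space
text_proj = torch.nn.Linear(768, 512)    # pooled token features   -> joint space

def contrastive_alignment_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text pairs."""
    img = F.normalize(image_proj(image_feats), dim=-1)
    txt = F.normalize(text_proj(text_feats), dim=-1)
    logits = img @ txt.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(len(img))       # the i-th image matches the i-th caption
    # Each image must retrieve its caption, and each caption its image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example: a batch of 8 pre-pooled feature vectors.
loss = contrastive_alignment_loss(torch.randn(8, 2048), torch.randn(8, 768))
```

Note how the interface is a single vector per image: everything the language side can see must survive this global pooling, which is the semantic bottleneck the abstract contrasts with token-level bridged and unified architectures.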
| File | Size | Format | Access |
|---|---|---|---|
| Roder_Francesco.pdf | 2.3 MB | Adobe PDF | Open access |
https://hdl.handle.net/20.500.12608/104343