Improving LLaVA for Numerosity Perception: Analyzing Numerical Representations in a Multimodal LLM

ONER, TIMUR
2024/2025

Abstract

Numbers are fundamental to human civilization. Manipulating numbers effectively is so important to almost every aspect of human life that counting is one of the first skills we teach to children. Despite the central importance of numerical manipulation to civilization and daily life, the mechanisms that allow us to manipulate numbers effectively are not yet fully understood. Counting is not merely the ability to perform arithmetic; it involves numerosity perception, abstraction, and reasoning about quantities in context. In the age of AI, this raises an important and practical question: can machines count? Multimodal LLMs, the models at the cutting edge of AI systems that reason by integrating multiple modalities, currently struggle to count. In this work I will improve the numerosity perception of LLaVA, a multimodal LLM that introduced visual instruction tuning for better performance on vision-language tasks. My main goal will be to identify its weaknesses and to provide a theoretically grounded approach to improving its numerosity perception. Beyond the improvement in performance, I will focus on ensuring robustness to distribution shifts. To facilitate this I will make use of CLEVR, a synthetically generated dataset consisting of visual scenes with objects of diverse shapes and spatial arrangements.
LLaVA
Numerosity
Vision Transformer
Finetuning
Files in this record:
Data_Science_Thesis_Timur_OnerFINAL3.pdf (8.72 MB, Adobe PDF, restricted access)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/102128