Improving LLaVA for Numerosity Perception: Analyzing Numerical Representations in a Multimodal LLM

ONER, TIMUR
2024/2025

Abstract

Numbers are fundamental to human civilization. Manipulating numbers effectively is so important to almost every aspect of human life that counting is one of the first skills we teach to children. Despite the central importance of numerical manipulation to civilization and daily life, the mechanisms that allow us to manipulate numbers effectively are not yet fully understood. Counting is not merely the ability to perform arithmetic; it involves numerosity perception, abstraction, and reasoning about quantities in context. In the age of AI, this raises an important and practical question: can machines count? Multimodal LLMs, the models at the cutting edge of AI systems that reason by integrating multiple modalities, currently struggle to count. In this work I will improve the numerosity perception of LLaVA, a multimodal LLM that introduced visual instruction tuning for better performance on vision-language tasks. My main goal will be to identify its weaknesses and to provide a theoretically grounded approach to improving its numerosity perception. Beyond the improvement in performance, I will focus on ensuring robustness to distribution shifts. To facilitate this I will make use of CLEVR, a synthetically generated dataset consisting of visual scenes with objects of diverse shapes and spatial arrangements.
LLaVA
Numerosity
Vision Transformer
Finetuning
Files in this record:
Data_Science_Thesis_Timur_OnerFINAL3.pdf (8.72 MB, Adobe PDF, restricted access)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/102128