Probing the Sequential Enumeration Skills of Large Language Models
DAIBASOGLU, KAAN
2024/2025
Abstract
Numerosity estimation, an ability we share with many animal species, is a cornerstone of human cognitive development and a foundation for higher mathematical competencies. However, whether large language models (LLMs) comprehend this abstract numerical knowledge remains a matter of debate, despite the remarkable capabilities they show in statistical learning. To fill this gap, this study systematically evaluates both open-source models (the LLaMA-3 family, Mistral-7B) and proprietary models (Gemini 1.5 Pro, Claude 3.5 Sonnet, GPT-4o) by probing their numerical reasoning abilities across a range of numerosity-related tasks. These tasks, focused on numerosity generation and numerosity naming, assess the capacity of LLMs for numerical encoding, sequence generation, and estimation, potentially revealing fundamental differences in how these models internally process numerical information. Our results indicate that, while proprietary systems exhibit greater numerical consistency, none of the tested models yet possesses systematic enumeration skills, as their error patterns show. To determine whether these struggles reflect systematic biases in numerical tokenization and processing, we further analyze the embeddings of numbers and of the textual prompts in the open-source models. The findings underscore the need for architectural refinements, enhanced training methodologies, and multimodal learning approaches to bridge the gap between human and artificial numerical cognition. Addressing these challenges could enable more numerically grounded AI systems, capable of reliable quantitative reasoning and interaction.
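To make the numerosity-generation probe concrete, the evaluation described above can be sketched as a simple scoring step: the model is prompted to produce exactly N instances of an item, and the deviation between the produced and requested count is recorded. The helper names below (`count_items`, `enumeration_error`) are illustrative assumptions, not functions from the thesis.

```python
# Hypothetical scoring for a numerosity-generation probe: the model is asked
# to produce exactly `target` repetitions of an item, and we measure how far
# its output deviates from that target.

def count_items(response: str, item: str) -> int:
    """Count comma- or whitespace-separated occurrences of `item` in a response."""
    tokens = [t.strip().lower() for t in response.replace(",", " ").split()]
    return sum(1 for t in tokens if t == item.lower())

def enumeration_error(response: str, item: str, target: int) -> int:
    """Signed deviation between produced and requested numerosity."""
    return count_items(response, item) - target

# Example: a model asked for five stars that produced only four.
print(enumeration_error("star, star, star, star", "star", 5))  # -1
```

Aggregating such signed errors across target numerosities is one way to expose the systematic (rather than random) error patterns the study reports.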
https://hdl.handle.net/20.500.12608/81801