Understanding reasoning capabilities of small LLMs
GOLAN, RODRIGO
2024/2025
Abstract
Large Language Models (LLMs) are increasingly used as decision-making components in embodied AI systems, where agents must perceive, reason, and act within complex environments. Among these approaches, the Code as Policies (CaP) framework proposes to treat robot control and spatial reasoning as a program-synthesis problem: given natural-language instructions, an LLM generates executable Python policies that interface with perception and control APIs. While the original results rely on large proprietary models with strong coding priors, it remains unclear to what extent similar benefits can be obtained with smaller, openly available LLMs. This thesis provides a systematic empirical study of CaP using modern open-source models under realistic deployment constraints including quantization and limited fine-tuning. Building on a conceptual overview of embodied AI and LLM-based planners, we evaluate three axes that are central to CaP: (i) structured spatial reasoning with code vs. natural language, (ii) robotics-oriented code generation via the RoboCodeGen benchmark, and (iii) general functional code synthesis on HumanEval. We further investigate mitigation strategies such as hierarchical prompting, error-aware bug fixing, and LoRA fine-tuning. The results highlight both the promise and the limitations of small open-source models in programmatic embodied reasoning.
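To make the CaP setup concrete, the following is a minimal sketch of the kind of Python policy such a system would synthesize from a natural-language instruction. The API names `get_obj_pos` and `put_first_on_second` follow the style used in the Code as Policies work, but the implementations here are illustrative stubs, not the thesis's actual perception or control stack.

```python
# Stubbed perception API: in a real system this would query an object detector.
def get_obj_pos(name):
    positions = {"red block": (0.1, 0.2), "blue bowl": (0.4, 0.5)}
    return positions[name]

# Stubbed control API: in a real system this would command the robot arm.
def put_first_on_second(obj_name, target_pos):
    return f"placed {obj_name} at {target_pos}"

# Instruction: "put the red block in the blue bowl"
# An LLM following the CaP prompting scheme would emit a policy like this:
def policy():
    bowl_pos = get_obj_pos("blue bowl")
    return put_first_on_second("red block", bowl_pos)

print(policy())  # → placed red block at (0.4, 0.5)
```

The key point the thesis examines is whether small open-source models can reliably produce such policies: the generated code must be syntactically valid, call only the provided APIs, and encode the spatial reasoning (which object goes where) correctly.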
File: Golan_Rodrigo.pdf (open access) | Size: 5.57 MB | Format: Adobe PDF
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/101549