Understanding reasoning capabilities of small LLMs
GOLAN, RODRIGO
2024/2025
Abstract
Large Language Models (LLMs) are increasingly used as decision-making components in embodied AI systems, where agents must perceive, reason, and act within complex environments. Among these approaches, the Code as Policies (CaP) framework proposes to treat robot control and spatial reasoning as a program-synthesis problem: given natural-language instructions, an LLM generates executable Python policies that interface with perception and control APIs. While the original results rely on large proprietary models with strong coding priors, it remains unclear to what extent similar benefits can be obtained with smaller, openly available LLMs. This thesis provides a systematic empirical study of CaP using modern open-source models under realistic deployment constraints including quantization and limited fine-tuning. Building on a conceptual overview of embodied AI and LLM-based planners, we evaluate three axes that are central to CaP: (i) structured spatial reasoning with code vs. natural language, (ii) robotics-oriented code generation via the RoboCodeGen benchmark, and (iii) general functional code synthesis on HumanEval. We further investigate mitigation strategies such as hierarchical prompting, error-aware bug fixing, and LoRA fine-tuning. The results highlight both the promise and the limitations of small open-source models in programmatic embodied reasoning.
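To make the CaP setup concrete, the following is a minimal sketch of the kind of Python policy such a system would synthesize from a natural-language instruction. The API names `get_obj_pos` and `put_first_on_second` follow the style used in the Code as Policies work, but the implementations here are illustrative stubs, not the thesis's actual perception or control stack.

```python
# Stubbed perception API: in a real system this would query an object detector.
def get_obj_pos(name):
    positions = {"red block": (0.1, 0.2), "blue bowl": (0.4, 0.5)}
    return positions[name]

# Stubbed control API: in a real system this would command the robot arm.
def put_first_on_second(obj_name, target_pos):
    return f"placed {obj_name} at {target_pos}"

# Instruction: "put the red block in the blue bowl"
# An LLM following the CaP prompting scheme would emit a policy like this:
def policy():
    bowl_pos = get_obj_pos("blue bowl")
    return put_first_on_second("red block", bowl_pos)

print(policy())  # → placed red block at (0.4, 0.5)
```

The key point the thesis examines is whether small open-source models can reliably produce such policies: the generated code must be syntactically valid, call only the provided APIs, and encode the spatial reasoning (which object goes where) correctly.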
File: Golan_Rodrigo.pdf (open access) | Size: 5.57 MB | Format: Adobe PDF
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/101549