Understanding reasoning capabilities of small LLMs

GOLAN, RODRIGO
2024/2025

Abstract

Large Language Models (LLMs) are increasingly used as decision-making components in embodied AI systems, where agents must perceive, reason, and act within complex environments. Among these approaches, the Code as Policies (CaP) framework proposes to treat robot control and spatial reasoning as a program-synthesis problem: given natural-language instructions, an LLM generates executable Python policies that interface with perception and control APIs. While the original results rely on large proprietary models with strong coding priors, it remains unclear to what extent similar benefits can be obtained with smaller, openly available LLMs. This thesis provides a systematic empirical study of CaP using modern open-source models under realistic deployment constraints including quantization and limited fine-tuning. Building on a conceptual overview of embodied AI and LLM-based planners, we evaluate three axes that are central to CaP: (i) structured spatial reasoning with code vs. natural language, (ii) robotics-oriented code generation via the RoboCodeGen benchmark, and (iii) general functional code synthesis on HumanEval. We further investigate mitigation strategies such as hierarchical prompting, error-aware bug fixing, and LoRA fine-tuning. The results highlight both the promise and the limitations of small open-source models in programmatic embodied reasoning.
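To make the CaP setting concrete, the following is a minimal sketch of the kind of Python policy an LLM might synthesize for an instruction like "put the red block on the tray closest to it". The perception and control functions (`get_obj_pos`, `put_first_on_second`) and the scene representation are hypothetical stand-ins, not the actual CaP API:

```python
# Hypothetical stubs standing in for perception and control APIs.
def get_obj_pos(name, scene):
    """Stub perception call: look up an object's (x, y) position."""
    return scene[name]

def put_first_on_second(obj, target, scene):
    """Stub control call: 'place' obj at the target's position."""
    scene[obj] = scene[target]

# A policy an LLM might generate for:
# "put the red block on the tray closest to it"
def policy(scene):
    trays = [n for n in scene if n.startswith("tray")]
    red = get_obj_pos("red_block", scene)
    closest = min(
        trays,
        key=lambda t: sum((a - b) ** 2
                          for a, b in zip(red, get_obj_pos(t, scene))),
    )
    put_first_on_second("red_block", closest, scene)
    return closest

scene = {"red_block": (0.0, 0.0),
         "tray_left": (0.2, 0.1),
         "tray_right": (0.9, 0.8)}
chosen = policy(scene)  # picks the nearer tray and moves the block there
```

The point of the sketch is the division of labor the thesis studies: the model must compose spatial reasoning (the `min` over distances) with grounded API calls, rather than answer in free-form natural language.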
Computer vision
Robotics
NLP
Files in this item:

File: Golan_Rodrigo.pdf (open access)
Size: 5.57 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12608/101549