Application of Vision Language Models in Robotics
TUBALDO, TOMMASO
2024/2025
Abstract
Vision Language Models (VLMs) have recently emerged as a powerful class of multimodal foundation models capable of jointly reasoning over visual inputs and natural language. Unlike traditional deep learning models in robotics, which are typically trained on task-specific small-scale datasets, foundation models are pre-trained on internet-scale data, granting them superior generalization capabilities and, in many cases, emergent zero-shot reasoning abilities across unseen tasks. Integrating the visual and linguistic capabilities of VLMs could significantly expand the applications of robotics, making robotic systems more versatile, intuitive, and capable of interacting with humans in a natural way. Such integration opens up possibilities across a wide range of domains, including natural human-robot interaction, enriched navigation, semantic mapping, descriptive navigation, multimodal perception, object detection, and scene understanding. This work explores the application of VLMs to mobile robotics, with a focus on high-level embodied tasks such as vision language navigation and embodied question answering. To connect VLMs with low-level robot functionality, we introduce a novel function calling interface that allows the model to invoke perception and control tools, such as acquiring images, querying positions, or issuing motion commands, autonomously in response to natural language instructions. We evaluate our approach in the AI2-THOR simulator, enabling testing in indoor environments. Our results demonstrate that pre-trained VLMs, when properly interfaced through tool-augmented architectures, are capable of grounding language in action and perception for closed-loop control. These findings point toward the promising role of foundation models in advancing general-purpose, language-driven robot autonomy.
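
As a rough illustration of the tool-augmented setup described in the abstract, the sketch below shows one way perception and control tools could be exposed to a VLM on top of AI2-THOR's Python `Controller` API. It is not the interface developed in the thesis: the tool names (`capture_image`, `get_position`, `move`) and the `vlm_select_tool` stub standing in for the model's function-calling output are assumptions made purely for illustration.

```python
# Minimal sketch of a tool-augmented VLM control loop in AI2-THOR (illustrative only).
# Tool names and vlm_select_tool() are hypothetical placeholders, not the thesis's interface.
import json

from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")  # launches the simulator in a kitchen scene


def capture_image():
    """Return the current egocentric RGB frame (numpy array) from the last event."""
    return controller.last_event.frame


def get_position():
    """Return the agent's position dict {x, y, z} from simulator metadata."""
    return controller.last_event.metadata["agent"]["position"]


def move(action, degrees=None):
    """Issue a discrete motion command, e.g. MoveAhead or RotateRight(degrees=90)."""
    kwargs = {"action": action}
    if degrees is not None:
        kwargs["degrees"] = degrees
    event = controller.step(**kwargs)
    return event.metadata["lastActionSuccess"]


# Registry of callable tools the VLM is allowed to invoke via function calling.
TOOLS = {"capture_image": capture_image, "get_position": get_position, "move": move}


def vlm_select_tool(instruction, frame):
    """Placeholder for the VLM call: given the instruction and the latest image,
    return a tool name and JSON-encoded arguments (normally produced by the
    model's function-calling output)."""
    return "move", json.dumps({"action": "MoveAhead"})


# Closed-loop control: perceive, let the VLM pick a tool, execute, repeat.
instruction = "Go towards the fridge and report the agent's final position."
for _ in range(5):
    name, args = vlm_select_tool(instruction, capture_image())
    result = TOOLS[name](**json.loads(args))
    print(f"{name}({args}) -> {result}")
print("final position:", get_position())
```

In a real system the stub would be replaced by the VLM's function-calling output, closing the perceive, decide, act loop that the abstract refers to as closed-loop control.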
| File | Size | Format | |
|---|---|---|---|
| Tubaldo_Tommaso.pdf (open access) | 20.07 MB | Adobe PDF | View/Open |
https://hdl.handle.net/20.500.12608/93741