Visual Navigation Through Natural Language in Continuous Environments

From Philip K. Dick to Isaac Asimov, novelists of every era have dreamt of robots capable of assisting humans in their daily chores, or even of accomplishing tasks that would be entirely unfeasible for a mere human. Today, recent advances in artificial intelligence are bringing those writers' dreams closer to reality day by day. In recent years, research on artificial intelligence has quickly shifted from Internet AI tasks, which revolve entirely around datasets of images, text, or videos extracted from the internet, to tasks that fall under the field of Embodied AI. The term Embodied AI refers to all tasks that involve a physical agent capable of interacting with the real world through concrete, tangible hardware. More specifically, Visual Language Navigation (VLN) is a sub-field of Embodied AI that tasks an agent with navigating through an environment, of which it may have no prior knowledge, by following instructions given in natural language. VLN tasks were originally modeled through navigation graphs, which heavily abstract the environment by using nodes to represent locations and edges to indicate navigability between those locations. The problem with this approach is that it abstracts the task too much, making it far more similar to teleportation than to actual navigation. The next step beyond graph-based VLN was to place the agent inside a continuous environment, where it can navigate freely by executing low-level actions such as moving forward by x meters or turning left by y degrees. Tasks of this kind are called Visual Language Navigation in Continuous Environments (VLN-CE). VLN-CE tasks are very challenging because of the large number of input modalities that the agent must understand to achieve its goal.
The aim of this thesis project is to improve the performance of the baseline VLN-CE model on the RxR-Habitat dataset by proposing new solutions that exploit a better instruction encoding or the use of auxiliary tasks.
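
The contrast between graph-based and continuous navigation can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not part of the thesis model: the toy graph, the names `graph_step`, `continuous_step` and `Pose`, the action labels, and the step sizes (0.25 m forward, 15 degrees per turn), which stand in for the unspecified x and y.

```python
import math
from dataclasses import dataclass

# Assumed step sizes for illustration; the text only speaks of "x" and "y".
FORWARD_STEP_M = 0.25
TURN_STEP_DEG = 15.0

# Graph setting: nodes are locations, edges indicate navigability.
# This toy graph is hypothetical.
NAV_GRAPH = {
    "hall": {"kitchen", "bedroom"},
    "kitchen": {"hall"},
    "bedroom": {"hall"},
}


def graph_step(current: str, target: str) -> str:
    """Jump to an adjacent node; the path between nodes is never simulated."""
    if target not in NAV_GRAPH[current]:
        raise ValueError(f"{target} is not reachable from {current}")
    return target


@dataclass
class Pose:
    x: float = 0.0
    y: float = 0.0
    heading_deg: float = 0.0  # 0 faces +x, counter-clockwise is positive


def continuous_step(pose: Pose, action: str) -> Pose:
    """Apply one low-level action and return the resulting pose."""
    if action == "MOVE_FORWARD":
        rad = math.radians(pose.heading_deg)
        return Pose(
            pose.x + FORWARD_STEP_M * math.cos(rad),
            pose.y + FORWARD_STEP_M * math.sin(rad),
            pose.heading_deg,
        )
    if action == "TURN_LEFT":
        return Pose(pose.x, pose.y, (pose.heading_deg + TURN_STEP_DEG) % 360.0)
    if action == "TURN_RIGHT":
        return Pose(pose.x, pose.y, (pose.heading_deg - TURN_STEP_DEG) % 360.0)
    raise ValueError(f"unknown action: {action}")
```

In the graph setting, a single call such as `graph_step("hall", "kitchen")` moves the agent to a new location regardless of the distance involved, which is why it resembles teleportation. In the continuous setting the same displacement requires a long sequence of `continuous_step` calls, each of which changes the pose only slightly. Collisions and perception are deliberately left out of this sketch.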
