Zero-Shot Object Goal Navigation using Online Image Retrieval
AKKARA, JELIN RAPHAEL
2024/2025
Abstract
Embodied AI has made significant strides in navigating complex environments through the use of Vision-Language Models (VLMs) and queryable maps. However, conventional approaches often struggle to recognize rare, long-tail objects. To address this limitation, we propose a method that leverages online image retrieval to enrich the agent's understanding of target objects. During the mapping phase, we retrieve images related to the target query and extract their embedding vectors, which are then projected onto a queryable embedding map. This enriched map is used to generate a similarity grid, guiding the agent toward the target object. To evaluate performance on rare object categories, we introduce HSSD-rare, a dataset comprising over 1,300 episodes from 17 scenes in the HSSD validation set, specifically curated to represent long-tail object distributions. Our results demonstrate that augmenting text-based queries with online visual context significantly improves long-tail object localization in open-set navigation scenarios.
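The pipeline described in the abstract (fuse the text query with embeddings of online-retrieved reference images, then score a queryable embedding map to obtain a similarity grid) can be illustrated with a minimal sketch. This is not the thesis implementation: the CLIP checkpoint, the simple averaging of image and text embeddings, and the (H, W, D) map layout are assumptions made purely for illustration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed backbone for illustration; the thesis may use a different VLM.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def target_embedding(query: str, retrieved_images) -> torch.Tensor:
    """Fuse the text query with embeddings of online-retrieved reference images."""
    text_in = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_in)                 # (1, D)
    img_in = processor(images=retrieved_images, return_tensors="pt")
    img_emb = model.get_image_features(**img_in).mean(dim=0, keepdim=True)  # (1, D)
    # Simple averaging is an assumption; any fusion of text and image cues works here.
    fused = torch.nn.functional.normalize(text_emb + img_emb, dim=-1)
    return fused.squeeze(0)                                       # (D,)


@torch.no_grad()
def similarity_grid(embedding_map: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """embedding_map: (H, W, D) per-cell embeddings projected during mapping.
    Returns an (H, W) grid of cosine similarities used to guide the agent."""
    cells = torch.nn.functional.normalize(embedding_map, dim=-1)
    return cells @ target  # dot product of unit vectors == cosine similarity
```

In use, the agent would select the highest-scoring free cell of the grid as its next navigation goal; the retrieved images serve to sharpen the target embedding for rare, long-tail categories that the text query alone describes poorly.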
| File | Access | Size | Format |
|---|---|---|---|
| Akkara_JelinRaphael.pdf | open access | 6.69 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/87169