Zero-Shot Object Goal Navigation using Online Image Retrieval

AKKARA, JELIN RAPHAEL
2024/2025

Abstract

Embodied AI has made significant strides in navigating complex environments through the use of Vision-Language Models (VLMs) and queryable maps. However, conventional approaches often struggle to recognize rare, long-tail objects. To address this limitation, we propose a method that leverages online image retrieval to enrich the agent's understanding of target objects. During the mapping phase, we retrieve images related to the target query and extract their embedding vectors, which are then projected onto a queryable embedding map. This enriched map is used to generate a similarity grid, guiding the agent toward the target object. To evaluate performance on rare object categories, we introduce HSSD-rare, a dataset comprising over 1,300 episodes from 17 scenes in the HSSD validation set, specifically curated to represent long-tail object distributions. Our results demonstrate that augmenting text-based queries with online visual context significantly improves long-tail object localization in open-set navigation scenarios.
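The abstract describes a pipeline of retrieving images for the target query, embedding them, and scoring cells of a queryable embedding map to form a similarity grid. The sketch below illustrates that idea under stated assumptions: it uses OpenAI's CLIP ViT-B/32 as the embedding model, fuses the text query with retrieved images by simple averaging, and assumes the map is an (H, W, D) grid of unit-normalised embeddings. The retrieval backend, fusion scheme, map construction, and all function names here are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of a retrieval-augmented similarity grid (illustrative only).
import numpy as np
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def query_embedding(text_query: str, retrieved_images: list) -> np.ndarray:
    """Fuse the text query with embeddings of online-retrieved images.

    `retrieved_images` is a list of PIL images standing in for whatever
    image-search backend is used; retrieval and filtering are not shown.
    """
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([text_query]).to(device))
        feats = [text_feat / text_feat.norm(dim=-1, keepdim=True)]
        for img in retrieved_images:
            img_feat = model.encode_image(preprocess(img).unsqueeze(0).to(device))
            feats.append(img_feat / img_feat.norm(dim=-1, keepdim=True))
        # Simple average of text and image embeddings; other fusion schemes are possible.
        fused = torch.cat(feats).mean(dim=0)
        fused = fused / fused.norm()
    return fused.float().cpu().numpy()

def similarity_grid(embedding_map: np.ndarray, query_vec: np.ndarray) -> np.ndarray:
    """Cosine similarity between each map cell's embedding and the fused query.

    `embedding_map` is an (H, W, D) grid of unit-normalised embeddings
    projected from the agent's observations during the mapping phase.
    Returns an (H, W) grid that can guide navigation toward high-scoring cells.
    """
    return np.einsum("hwd,d->hw", embedding_map, query_vec)
```

A planner could then pick the highest-scoring reachable cell in the returned grid as the next navigation goal; how goals are actually selected in the thesis is not specified here.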
Keywords: Embodied AI, Computer Vision, Object Navigation, Zero-Shot Learning

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12608/87169