Zero-Shot Object Goal Navigation using Online Image Retrieval
AKKARA, JELIN RAPHAEL
2024/2025
Abstract
Embodied AI has made significant strides in navigating complex environments through the use of Vision-Language Models (VLMs) and queryable maps. However, conventional approaches often struggle to recognize rare, long-tail objects. To address this limitation, we propose a method that leverages online image retrieval to enrich the agent's understanding of target objects. During the mapping phase, we retrieve images related to the target query and extract their embedding vectors, which are then projected onto a queryable embedding map. This enriched map is used to generate a similarity grid, guiding the agent toward the target object. To evaluate performance on rare object categories, we introduce HSSD-rare, a dataset comprising over 1,300 episodes from 17 scenes in the HSSD validation set, specifically curated to represent long-tail object distributions. Our results demonstrate that augmenting text-based queries with online visual context significantly improves long-tail object localization in open-set navigation scenarios.
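The pipeline described in the abstract (fuse the text query with embeddings of online-retrieved reference images, then score a queryable embedding map to obtain a similarity grid) can be illustrated with a minimal sketch. This is not the thesis implementation: the CLIP checkpoint, the simple averaging of image and text embeddings, and the (H, W, D) map layout are assumptions made purely for illustration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed backbone for illustration; the thesis may use a different VLM.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def target_embedding(query: str, retrieved_images) -> torch.Tensor:
    """Fuse the text query with embeddings of online-retrieved reference images."""
    text_in = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_in)                 # (1, D)
    img_in = processor(images=retrieved_images, return_tensors="pt")
    img_emb = model.get_image_features(**img_in).mean(dim=0, keepdim=True)  # (1, D)
    # Simple averaging is an assumption; any fusion of text and image cues works here.
    fused = torch.nn.functional.normalize(text_emb + img_emb, dim=-1)
    return fused.squeeze(0)                                       # (D,)


@torch.no_grad()
def similarity_grid(embedding_map: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """embedding_map: (H, W, D) per-cell embeddings projected during mapping.
    Returns an (H, W) grid of cosine similarities used to guide the agent."""
    cells = torch.nn.functional.normalize(embedding_map, dim=-1)
    return cells @ target  # dot product of unit vectors == cosine similarity
```

In use, the agent would select the highest-scoring free cell of the grid as its next navigation goal; the retrieved images serve to sharpen the target embedding for rare, long-tail categories that the text query alone describes poorly.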
| File | Access | Size | Format |
|---|---|---|---|
| Akkara_JelinRaphael.pdf | open access | 6.69 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/87169