Exploring Untrained Zero-Shot Visual Grounding: Proposal Generation and Unsupervised Alignment Techniques

BUTTAU, SARA
2022/2023

Abstract

This thesis addresses visual grounding, the task of aligning words or phrases in natural language with specific objects or regions in an image. The task can be viewed as a two-stage process: proposal generation followed by region-phrase alignment. The unsupervised setting is more challenging still, because the model must perform without labeled training data. Our research objectives center on the proposal generation stage of an unsupervised visual grounding model and on understanding how it affects grounding performance, especially in a training-free setting. Specifically, we examine the impact of a large vocabulary on the performance of a zero-shot unsupervised visual grounding model by exploring three detectors with increasing vocabulary sizes. For each detector, we systematically examine statistics on our dataset and measure coverage and accuracy to assess the model's performance against several baseline models. We further extend the experiments to alignment mechanisms for grounding region-phrase pairs that consider alternative labels for proposals. Such mechanisms add an extra layer of context and interpretation to the grounding process, enabling more accurate matching between visual and textual elements. We compare these new mechanisms against state-of-the-art approaches. The thesis highlights the practical importance of proposal generation for accurate and contextually relevant multimodal processing. The findings provide insight into the implications of combining large-vocabulary detectors with alternative-label alignment, supporting the development of more effective and efficient visual grounding systems and advancing our understanding of the challenges and possibilities of unsupervised visual grounding.
2022
visual grounding
large vocabulary
object detector
Files in this item:
File: Buttau_Sara.pdf (restricted access)
Size: 2.8 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/52322