Exploring the Role of Spatial Relations in Visual Grounding: A Novel Benchmark and Synthetic Pretraining
RESTA, ALESSANDRO
2023/2024
Abstract
Spatial reasoning is fundamental to human cognition and language, shaping how we perceive and describe spatial relationships through concise expressions like “to the left of” or “behind.” By distilling complex spatial arrangements into simple linguistic terms, humans bridge visual perception and communication, enabling navigation, organization, and interaction with the world. These capabilities underscore the importance of spatial reasoning in daily problem-solving and decision-making. One of the core challenges for visual grounding models, which align natural language expressions to image regions, is learning the spatial reasoning skills needed to understand the context of real-world tasks. However, vision-language models often struggle with spatial relations, especially in complex or ambiguous contexts. This dissertation addresses these issues through two primary contributions. First, it introduces SpatialVG, a benchmark dataset carefully designed with template-based referring expressions that rely solely on spatial relations to locate objects. SpatialVG is built through a systematic procedure, with rigorous annotation and validation ensuring unambiguous queries. Evaluations on SpatialVG reveal significant performance gaps in modern visual grounding models, with accuracy more than 24 percentage points below the human ceiling. The second contribution is CLEVR-Ground, a synthetic pre-training dataset based on the CLEVR framework, designed to expose models to a wide range of spatial relations in a controlled environment. CLEVR-Ground automates annotation, enabling scalable data generation and precise spatial configurations, and was used to pre-train models before fine-tuning on real-world datasets in an effort to improve performance. Although models perform well on synthetic data, the findings reveal difficulties in generalizing to real-world datasets, underscoring the limitations of synthetic data in capturing the richness of natural scenes. This research poses a critical open question: what strategies could bridge the gap between humans and machines in truly understanding spatial relations between objects in visual grounding?
| File | Size | Format |
|---|---|---|
| Resta_Alessandro.pdf (restricted access) | 9.02 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/80211