Scalable Evaluation of Closed-Set and Open-Set Semantic and Spatial Alignment in Layout-Guided Diffusion Models
Nicla Faccioli
Academic year 2024/2025
Abstract
Text-to-image diffusion models have recently achieved impressive results, but evaluating their performance remains a challenge. When models are guided not only by text but also by layout constraints, evaluation becomes even harder, since it requires checking both the semantic accuracy of the content and the spatial fidelity of the layout. Despite their importance, existing benchmarks for layout-guided generation are still limited in size and flexibility, and, to the best of our knowledge, no benchmark currently exists for evaluating models on human-like, unconstrained prompts. This thesis introduces two complementary benchmarks to address these limitations. The first, called 7Bench++, is a closed-set benchmark in which prompts and bounding boxes are automatically generated through constrained randomization. This makes it possible to create large-scale datasets that remain structured, task-specific, and reproducible, overcoming the limitations of small hand-crafted collections. The second, an open-set benchmark based on human-written prompts and bounding boxes, allows models to be evaluated in more natural and unconstrained conditions, closer to real-world usage. Both benchmarks are integrated into an evaluation pipeline that reports text alignment and layout alignment as distinct scores and also includes a unified score that combines them into a single interpretable measure for ranking models. Using this framework, several state-of-the-art diffusion models are evaluated, highlighting their strengths and limitations across both controlled tasks and open-ended, human-like prompts. Overall, 7Bench++ and its open-set counterpart provide a scalable and extensible tool for benchmarking layout-guided text-to-image generation. By combining structured and human-like inputs, they enable fair comparisons between models and offer new insights into how different techniques generalize from systematic tasks to unconstrained inputs. All datasets, evaluation code, and generated results are publicly released to support reproducibility. The benchmark and tested models are available at https://github.com/stars/Nikura3/lists/7bench.
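The abstract does not specify how the unified score is computed. As a rough illustration of the dual-score idea only, the hypothetical sketch below combines a text-alignment score (e.g., a prompt-image similarity in [0, 1]) and a layout-alignment score (mean IoU between requested and detected boxes) with a harmonic mean, so that a model ranks highly only when it does well on both axes. All names, inputs, and the aggregation choice are assumptions, not the actual 7Bench++ pipeline.

```python
# Hypothetical sketch of a dual-score evaluation with a unified ranking measure.
# The score names, inputs, and harmonic-mean aggregation are illustrative
# assumptions; they do not reproduce the actual 7Bench++ evaluation pipeline.

from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def layout_alignment(target_boxes: List[Box], detected_boxes: List[Box]) -> float:
    """Mean IoU between each requested box and the detected box for that object."""
    if not target_boxes:
        return 0.0
    return sum(iou(t, d) for t, d in zip(target_boxes, detected_boxes)) / len(target_boxes)


@dataclass
class SampleScores:
    text_alignment: float    # e.g., prompt-image similarity in [0, 1]
    layout_alignment: float  # e.g., mean IoU over the target objects


def unified_score(s: SampleScores, eps: float = 1e-8) -> float:
    """Harmonic mean: high only when both text and layout alignment are high."""
    t, l = s.text_alignment, s.layout_alignment
    return 2 * t * l / (t + l + eps)


# Example: a sample with good semantics but mediocre spatial fidelity.
sample = SampleScores(text_alignment=0.82, layout_alignment=0.41)
print(round(unified_score(sample), 3))  # ~0.547
```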
| File | Size | Format |
|---|---|---|
| Faccioli_Nicla.pdf (open access) | 5.28 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/91851