Scalable Evaluation of Closed-Set and Open-Set Semantic and Spatial Alignment in Layout-Guided Diffusion Models
Nicla Faccioli
Academic year 2024/2025
Abstract
Text-to-image diffusion models have recently achieved impressive results, but evaluating their performance remains a challenge. When models are guided not only by text but also by layout constraints, evaluation becomes even harder, since it requires checking both the semantic accuracy of the content and the spatial fidelity of the layout. Despite their importance, existing benchmarks for layout-guided generation are still limited in size and flexibility, and, to the best of our knowledge, no benchmark currently exists for evaluating models on human-like, unconstrained prompts. This thesis introduces two complementary benchmarks to address these limitations. The first, called 7Bench++, is a closed-set benchmark in which prompts and bounding boxes are automatically generated through constrained randomization. This makes it possible to create large-scale datasets that remain structured, task-specific, and reproducible, overcoming the limitations of small hand-crafted collections. The second, an open-set benchmark based on human-written prompts and bounding boxes, allows models to be evaluated in more natural and unconstrained conditions, closer to real-world usage. Both benchmarks are integrated into an evaluation pipeline that reports text alignment and layout alignment as distinct scores and also includes a unified score that combines them into a single interpretable measure for ranking models. Using this framework, several state-of-the-art diffusion models are evaluated, highlighting their strengths and limitations across both controlled tasks and open-ended, human-like prompts. Overall, 7Bench++ and its open-set counterpart provide a scalable and extensible tool for benchmarking layout-guided text-to-image generation. By combining structured and human-like inputs, they enable fair comparisons between models and offer new insights into how different techniques generalize from systematic tasks to unconstrained inputs. All datasets, evaluation code, and generated results are publicly released to support reproducibility. The benchmark and tested models are available at https://github.com/stars/Nikura3/lists/7bench.
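The abstract does not specify how the unified score is computed. As a rough illustration of the dual-score idea only, the hypothetical sketch below combines a text-alignment score (e.g., a prompt-image similarity in [0, 1]) and a layout-alignment score (mean IoU between requested and detected boxes) with a harmonic mean, so that a model ranks highly only when it does well on both axes. All names, inputs, and the aggregation choice are assumptions, not the actual 7Bench++ pipeline.

```python
# Hypothetical sketch of a dual-score evaluation with a unified ranking measure.
# The score names, inputs, and harmonic-mean aggregation are illustrative
# assumptions; they do not reproduce the actual 7Bench++ evaluation pipeline.

from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def layout_alignment(target_boxes: List[Box], detected_boxes: List[Box]) -> float:
    """Mean IoU between each requested box and the detected box for that object."""
    if not target_boxes:
        return 0.0
    return sum(iou(t, d) for t, d in zip(target_boxes, detected_boxes)) / len(target_boxes)


@dataclass
class SampleScores:
    text_alignment: float    # e.g., prompt-image similarity in [0, 1]
    layout_alignment: float  # e.g., mean IoU over the target objects


def unified_score(s: SampleScores, eps: float = 1e-8) -> float:
    """Harmonic mean: high only when both text and layout alignment are high."""
    t, l = s.text_alignment, s.layout_alignment
    return 2 * t * l / (t + l + eps)


# Example: a sample with good semantics but mediocre spatial fidelity.
sample = SampleScores(text_alignment=0.82, layout_alignment=0.41)
print(round(unified_score(sample), 3))  # ~0.547
```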
| File | Size | Format |
|---|---|---|
| Faccioli_Nicla.pdf (open access) | 5.28 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/91851