Scalable Evaluation of Closed-Set and Open-Set Semantic and Spatial Alignment in Layout-Guided Diffusion Models

FACCIOLI, NICLA
2024/2025

Abstract

Text-to-image diffusion models have recently achieved impressive results, but evaluating their performance remains a challenge. When models are guided not only by text but also by layout constraints, evaluation becomes even harder, since it requires checking both the semantic accuracy of the content and the spatial fidelity of the layout. Despite their importance, existing benchmarks for layout-guided generation are still limited in size and flexibility, and, to the best of our knowledge, no benchmark currently exists for evaluating models on human-like, unconstrained prompts. This thesis introduces two complementary benchmarks to address these limitations. The first, 7Bench++, is a closed-set benchmark in which prompts and bounding boxes are generated automatically through constrained randomization, making it possible to build large-scale datasets that remain structured, task-specific, and reproducible, and overcoming the limitations of small hand-crafted collections. The second is an open-set benchmark based on human-written prompts and bounding boxes, which allows models to be evaluated under more natural, unconstrained conditions closer to real-world usage. Both benchmarks are integrated into an evaluation pipeline that reports text alignment and layout alignment as distinct scores and also provides a unified score that combines them into a single interpretable measure for ranking models. Using this framework, several state-of-the-art diffusion models are evaluated, highlighting their strengths and limitations across both controlled tasks and open-ended, human-like prompts. Together, 7Bench++ and its open-set counterpart provide a scalable and extensible tool for benchmarking layout-guided text-to-image generation. By combining structured and human-like inputs, they enable fair comparisons between models and offer new insights into how different techniques generalize from systematic tasks to unconstrained inputs. All datasets, evaluation code, and generated results are publicly released to support reproducibility. The benchmark and tested models are available at https://github.com/stars/Nikura3/lists/7bench.
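The abstract describes an evaluation pipeline that reports text alignment and layout alignment as separate scores and then combines them into a single unified score for ranking models. As a minimal illustration of how such a combination could be computed (not the exact formulation used in 7Bench++), the sketch below assumes both per-image scores are already normalised to [0, 1] and merges them with a harmonic mean, so a model must do well on both axes to rank highly; all names and the example values are hypothetical.

# Minimal sketch: combining text alignment and layout alignment into one score.
# Assumptions (not taken from the thesis): text alignment is a CLIP-style
# similarity normalised to [0, 1]; layout alignment is a mean IoU between the
# requested boxes and the boxes detected in the generated image; the two are
# combined with a harmonic mean. The actual metrics and weighting in 7Bench++
# may differ.
from dataclasses import dataclass


@dataclass
class Sample:
    text_alignment: float    # semantic accuracy of the content, in [0, 1]
    layout_alignment: float  # spatial fidelity to the input boxes, in [0, 1]


def unified_score(sample: Sample, eps: float = 1e-8) -> float:
    """Harmonic mean of the two alignment scores (illustrative choice)."""
    t, l = sample.text_alignment, sample.layout_alignment
    return 2 * t * l / (t + l + eps)


def rank_models(results: dict[str, list[Sample]]) -> list[tuple[str, float]]:
    """Average the unified score per model and sort models best-first."""
    averages = {
        model: sum(unified_score(s) for s in samples) / len(samples)
        for model, samples in results.items()
    }
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical per-image scores for two models on a tiny benchmark split.
    results = {
        "model_a": [Sample(0.82, 0.61), Sample(0.78, 0.70)],
        "model_b": [Sample(0.90, 0.35), Sample(0.88, 0.40)],
    }
    for model, score in rank_models(results):
        print(f"{model}: unified score = {score:.3f}")

The harmonic mean is used here only to show one way a unified score can penalise models that trade layout fidelity for text fidelity (or vice versa); the thesis defines its own combination.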
Keywords: Diffusion models; Layout-guided generation; Text-to-image; Model evaluation; Benchmark design
File: Faccioli_Nicla.pdf (open access, 5.28 MB, Adobe PDF)


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/91851