Colorista Artificiale: Una Pipeline di Fine-Tuning su Modelli Linguistico-Visivi per il Riconoscimento degli Attributi Cromatici (Artificial Colorist: A Fine-Tuning Pipeline on Vision-Language Models for Chromatic Attribute Recognition)
PASQUALOTTO, LORENZO
2023/2024
Abstract
The recent development and growing adoption of vision-language models (VLMs) have led to significant advancements in the field of computer vision. Numerous state-of-the-art models achieve human-comparable performance on traditional datasets; however, they still exhibit limited sensitivity to specific image attributes, such as spatial relationships between objects and chromatic characteristics, occasionally making gross errors on tasks that are intuitive for humans. This work proposes a pipeline to enhance the capability of recognizing chromatic attributes in one of the most renowned and widely used models in computer vision: CLIP (Contrastive Language–Image Pre-training). The proposed methodology involves generating a synthetic dataset composed of chromatic variants of segmented objects, derived from the images and annotations of the MSCOCO dataset. The adopted fine-tuning algorithm is based on a contrastive learning approach.
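The abstract outlines the dataset-generation step: the segmentation annotations of MSCOCO are used to produce chromatic variants of each annotated object. The sketch below illustrates one plausible realization using pycocotools and an HSV hue shift; the `recolor_object` helper, the hue-shift choice, and the file paths are illustrative assumptions, not the author's exact procedure.

```python
import cv2
import numpy as np
from pycocotools.coco import COCO

def recolor_object(image_bgr, mask, hue_shift):
    """Return a copy of the image with the hue of masked pixels shifted."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[..., 0].astype(int)
    # OpenCV stores hue in [0, 180); wrap around after the shift.
    hsv[..., 0] = np.where(mask, (hue + hue_shift) % 180, hue).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

# Hypothetical MSCOCO paths; adjust to the local layout.
coco = COCO("annotations/instances_train2017.json")
img_info = coco.loadImgs(coco.getImgIds())[0]
image = cv2.imread(f"train2017/{img_info['file_name']}")
ann = coco.loadAnns(coco.getAnnIds(imgIds=img_info["id"]))[0]
mask = coco.annToMask(ann).astype(bool)  # binary segmentation mask (H, W)

variant = recolor_object(image, mask, hue_shift=60)  # one chromatic variant
```

Varying `hue_shift` over several values would yield multiple colour versions of the same object, each of which can then be paired with a caption naming the new colour.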
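The abstract states that fine-tuning follows a contrastive learning approach. Below is a minimal sketch of CLIP's standard symmetric InfoNCE objective, assuming batches pair each chromatic variant with a colour-aware caption; the `contrastive_step` helper, the caption template, and the optimizer settings are assumptions, not the thesis's exact training setup.

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # fine-tune in fp32; the CUDA checkpoint loads in fp16
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def contrastive_step(images, captions):
    """One step of symmetric InfoNCE: each image matches only its own caption.

    `images` is a preprocessed batch (B, 3, 224, 224); `captions` is a list of
    B colour-aware strings, e.g. "a photo of a red car".
    """
    tokens = clip.tokenize(captions).to(device)
    image_features = F.normalize(model.encode_image(images.to(device)), dim=-1)
    text_features = F.normalize(model.encode_text(tokens), dim=-1)
    logits = model.logit_scale.exp() * image_features @ text_features.t()
    labels = torch.arange(len(captions), device=device)
    # Cross-entropy in both directions: image -> text and text -> image.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The symmetric loss penalizes a mismatch in either direction, so the model is pushed to distinguish, for instance, a recoloured blue car from the same car captioned as red.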
File: Pasqualotto_Lorenzo_2008651.pdf (open access) · Size: 32.02 MB · Format: Adobe PDF
https://hdl.handle.net/20.500.12608/80240