DOMAIN GENERALIZATION FOR SEMANTIC SEGMENTATION EXPLOITING VISION-LANGUAGE FEATURES

CAREDDU, LUCA
2024/2025

Abstract

Domain Generalization in Semantic Segmentation (DGSS) is an attractive open research field in Computer Vision. It tackles the drop in semantic segmentation performance that arises when predicting images from target datasets whose distribution differs substantially from that of the source dataset. Unlike Unsupervised Domain Adaptation (UDA), where the target images, although unlabeled, can be exploited during training to mitigate the domain shift, Domain Generalization relies solely on the source dataset at training time. Since manually labeling images for semantic segmentation is very time-consuming, in common settings the source dataset consists of synthetic images rendered by video games (e.g., GTA5) or game engines (e.g., SELMA), while target datasets of real images (e.g., Cityscapes) are employed only at inference time. Lately, Vision-Language Models (VLMs) such as CLIP have shown remarkable generalization capabilities across many image classification datasets. Indeed, the rich semantics learned from textual supervision allow them to cope far better with the domain shift at test time. Although several works have used these models to improve on previous specialized DGSS models, only a few have exploited the text representations to drive the task. In this work, we assess the direct contribution of language in solving the DGSS task in all generalization scenarios (i.e., synthetic-to-real, real-to-real, and synthetic-to-synthetic) by building a model that employs VLMs as encoders operating on image-text data and two decoders, one per modality, that fuse the heterogeneous representations and solve the segmentation task.
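
To make the architectural description concrete, below is a minimal PyTorch sketch of the kind of image-text fusion the abstract describes: dense features from a frozen VLM image encoder and class-name embeddings from its text encoder are refined by two small decoders and fused by per-pixel cosine similarity into segmentation logits. All module names, layer choices, and the fusion scheme here are illustrative assumptions, not the thesis implementation.

# Minimal sketch (assumptions: a CLIP-like VLM provides dense image features
# of shape [B, C, H, W] and per-class text embeddings of shape [K, D]; the two
# decoders and the cosine-similarity fusion are illustrative, not the thesis code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelDecoder(nn.Module):
    """Refines dense VLM image features and projects them into the joint space."""
    def __init__(self, in_channels: int, embed_dim: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.refine(feats)                      # [B, D, H, W]


class TextDecoder(nn.Module):
    """Adapts frozen class-name embeddings with a small residual MLP."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return text_emb + self.mlp(text_emb)           # [K, D]


class VLSegHead(nn.Module):
    """Fuses pixel and text representations into per-class segmentation logits."""
    def __init__(self, in_channels: int, embed_dim: int):
        super().__init__()
        self.pixel_decoder = PixelDecoder(in_channels, embed_dim)
        self.text_decoder = TextDecoder(embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(10.0))

    def forward(self, image_feats, text_emb, out_size):
        pix = F.normalize(self.pixel_decoder(image_feats), dim=1)   # [B, D, H, W]
        txt = F.normalize(self.text_decoder(text_emb), dim=-1)      # [K, D]
        # Per-pixel cosine similarity against each class embedding.
        logits = self.logit_scale * torch.einsum("bdhw,kd->bkhw", pix, txt)
        return F.interpolate(logits, size=out_size, mode="bilinear",
                             align_corners=False)


# Usage with placeholder tensors standing in for frozen CLIP encoder outputs.
if __name__ == "__main__":
    image_feats = torch.randn(2, 768, 32, 32)   # dense features from the VLM image encoder
    text_emb = torch.randn(19, 512)             # one embedding per Cityscapes-style class
    head = VLSegHead(in_channels=768, embed_dim=512)
    masks = head(image_feats, text_emb, out_size=(512, 1024))
    print(masks.shape)                           # torch.Size([2, 19, 512, 1024])

In this sketch the VLM encoders stay frozen and only the two lightweight decoders are trained on the source dataset, which is one plausible way to preserve the generalization ability of the pretrained image-text representations.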
Keywords: Domain Generalization; Semantic Segmentation; Vision-Language
File in this record:
Careddu_Luca.pdf (Adobe PDF, 13.79 MB, open access)


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/84780