DOMAIN GENERALIZATION FOR SEMANTIC SEGMENTATION EXPLOITING VISION-LANGUAGE FEATURES
CAREDDU, LUCA
2024/2025
Abstract
Domain Generalization in Semantic Segmentation (DGSS) is an attractive open research field in Computer Vision. It tackles the drop in semantic segmentation performance that arises when predicting images from target datasets whose distribution differs greatly from that of the source dataset. Unlike Unsupervised Domain Adaptation (UDA), where the target images, albeit without their labels, can be exploited during training to bridge the domain shift, Domain Generalization relies solely on the source dataset at training time. Since manually labeling images for semantic segmentation is very time-consuming, in common settings the source dataset is made of synthetic images coming from video games (e.g., GTA5) or game engines (e.g., SELMA), and target datasets with real images (e.g., Cityscapes) are employed only at inference time. Lately, Vision-Language Models (VLMs) such as CLIP have shown remarkable generalization capabilities across many image classification datasets. Indeed, the rich semantics learned from textual supervision allow them to handle the domain shift at test time much better. Although some works have used these models to improve on previous DGSS-specialized models, only a few have exploited the text representations to drive the task. In this work, we assess the direct contribution of language in solving the DGSS task in all the generalization scenarios (i.e., synthetic-to-real, real-to-real, and synthetic-to-synthetic) by building a model that employs VLMs as encoders to operate on image-text data and two decoders, one for each modality, that fuse the heterogeneous representations and solve the segmentation task.
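The sketch below illustrates the kind of dual-encoder, dual-decoder layout the abstract describes: frozen VLM encoders supply image and text features, a vision decoder refines the dense image features, and a text decoder fuses class (text) embeddings with them before producing per-pixel logits. It is a minimal, hedged illustration in PyTorch, not the thesis implementation; the stand-in encoders, module names, dimensions, and fusion strategy are all assumptions.

```python
# Minimal sketch (not the thesis architecture): VLM-style image/text encoders,
# one decoder per modality, and similarity-based fusion into segmentation logits.
import torch
import torch.nn as nn

class VLMSegmenter(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 512):
        super().__init__()
        # Stand-ins for frozen CLIP-like towers; real code would load
        # pretrained image/text encoders and keep them frozen.
        self.image_encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.text_embeddings = nn.Parameter(torch.randn(num_classes, embed_dim))

        # Vision decoder: refines the dense image features.
        self.pixel_decoder = nn.Sequential(
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1),
        )
        # Text decoder: class (text) queries attend to the image tokens.
        self.text_decoder = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, _, h, w = images.shape
        feats = self.image_encoder(images)             # (B, D, H/16, W/16)
        feats = self.pixel_decoder(feats)              # refined dense features

        # Fuse modalities: text queries cross-attend to flattened image tokens.
        tokens = feats.flatten(2).transpose(1, 2)      # (B, N, D)
        queries = self.text_embeddings.unsqueeze(0).expand(b, -1, -1)
        class_embeds = self.text_decoder(queries, tokens)  # (B, C, D)

        # Per-pixel logits as pixel-embedding / class-embedding similarity.
        logits = torch.einsum("bdhw,bcd->bchw", feats, class_embeds)
        return nn.functional.interpolate(
            logits, size=(h, w), mode="bilinear", align_corners=False
        )

# Usage: 19 Cityscapes-style classes, one dummy batch.
model = VLMSegmenter(num_classes=19)
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 19, 224, 224])
```

Scoring pixels against text-derived class embeddings, rather than a fixed classification head, is the usual way such models let the language side drive the segmentation task.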
| File | Size | Format |
|---|---|---|
| Careddu_Luca.pdf (open access) | 13.79 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/84780