Leveraging multimodal foundation models for weak supervision in few-shot classification and segmentation scenarios
ZENARO, NICCOLÒ
2025/2026
Abstract
In this thesis we study the problem of generating pseudo-masks from class labels for real-world few-shot classification and segmentation scenarios, where collecting fully annotated masks is labor-intensive and time-consuming. We focus on leveraging a vision-language foundation model to eliminate the need for masks, using text prompts or labels as weak supervision instead. Our work builds on a former state-of-the-art (SOTA) pipeline that uses a classification-segmentation transformer (CST), and we study new approaches to generating the pseudo-masks used to train a modified version of CST. The new pipeline uses the CLIP model to extract visual and text features and the SAM model to process them into pseudo-masks.

| File | Size | Format |
|---|---|---|
| MasterThesis_Zenaro.pdf (open access) | 3.34 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/108176