Leveraging multimodal foundation models for weak supervision in few-shot classification and segmentation scenarios

ZENARO, NICCOLÒ
2025/2026

Abstract

In this thesis we study the problem of generating pseudo-masks from class labels for real-world few-shot classification and segmentation scenarios, where collecting fully annotated masks is labor-intensive and time-consuming. We focus on leveraging a vision-language foundation model to eliminate the need for mask annotations, using text prompts or class labels as weak supervision instead. Our work builds on a former state-of-the-art pipeline based on a classification-segmentation transformer (CST), and we study new approaches to generating the pseudo-masks used to train a modified version of CST. The new pipeline uses the CLIP model to extract visual and text features and the SAM model to process them into pseudo-masks.
2025
Computer Vision
Multimodal models
Weak supervision
Segmentation
Vision Transformers
Files in this item:
File: MasterThesis_Zenaro.pdf
Access: open access
Size: 3.34 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 License.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108176