Leveraging multimodal foundation models for weak supervision in few-shot classification and segmentation scenarios

In this thesis we study the problem of generating pseudo-masks from class labels for real-world few-shot classification and segmentation scenarios, where collecting fully annotated masks is labor-intensive and time-consuming. We focus on leveraging a vision-language foundation model to eliminate the need for ground-truth masks, using text prompts or class labels as weak supervision instead. We build on a previously state-of-the-art pipeline that uses a classification-segmentation transformer (CST), and we study new approaches for generating the pseudo-masks used to train a modified version of CST. The new pipeline uses the CLIP model to extract visual and text features and the SAM model to process them into pseudo-masks.
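As a rough illustration of how text features can act as weak supervision, the sketch below scores each image patch against a class-prompt embedding and derives a coarse mask plus a peak location. It is only a minimal sketch under stated assumptions, not the method of the thesis: the function names `similarity_map` and `coarse_mask_and_point`, the `(H, W, D)` feature layout, the threshold, and the synthetic features standing in for real CLIP outputs are all hypothetical.

```python
import numpy as np


def similarity_map(patch_feats, text_feat):
    """Cosine similarity between every patch embedding and one text embedding.

    patch_feats: (H, W, D) array of per-patch visual features (assumed layout).
    text_feat:   (D,) embedding of the class prompt.
    Returns an (H, W) array with values in [-1, 1].
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    return p @ t


def coarse_mask_and_point(sim, threshold=0.5):
    """Min-max normalise the map, threshold it, and locate its peak.

    Returns a boolean (H, W) mask and the (row, col) of the highest score.
    """
    lo, hi = sim.min(), sim.max()
    norm = (sim - lo) / (hi - lo + 1e-8)
    mask = norm >= threshold
    row, col = np.unravel_index(np.argmax(norm), norm.shape)
    return mask, (int(row), int(col))


# Synthetic stand-in for real features (assumption: no model is loaded here).
rng = np.random.default_rng(0)
H, W, D = 7, 7, 16
text_feat = rng.normal(size=D)
patch_feats = rng.normal(size=(H, W, D))
# Make a 3x3 "object" region whose features align with the text embedding.
patch_feats[2:5, 3:6] += 4.0 * text_feat

sim = similarity_map(patch_feats, text_feat)
mask, point = coarse_mask_and_point(sim)
```

In a real setting the features would come from the CLIP image and text encoders, and a location such as `point`, rescaled to pixel coordinates, is one plausible way to prompt SAM for a refined pseudo-mask. The abstract does not specify how the features are converted into prompts, so that step is an assumption here.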

ZENARO, NICCOLÒ

2025/2026

Record

	Faculty/Department
	
				Dipartimento di Matematica "Tullio Levi-Civita" - DM
			
	Degree programme
	
				COMPUTER SCIENCE Master's Degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic Year
	
				2025
			
	Keywords
	
				Computer Vision
				Multimodal models
				Weak supervision
				Segmentation
				Vision Transformers
			
	Supervisor
	
				BALLAN, LAMBERTO
			
	Appears in collections:
	
				Master's degree theses (Lauree magistrali)

Files in this item:

File	Access	Size	Format
MasterThesis_Zenaro.pdf	Open access	3.34 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/108176