Leveraging Multi-Modality in Self-Supervised Learning for Segmentation

Ghorbanpour Arani, Mohammadreza
Academic year 2023/2024

Abstract

This thesis presents a novel framework that integrates a Vision Transformer (ViT) encoder with a UPerNet decoder for semantic segmentation using both RGB and depth modalities. The architecture leverages self-attention, enabling the model to capture intricate spatial dependencies among image patches. To improve performance, self-supervised learning with several patch mixing strategies is explored: Random Mixing, Chessboard Mixing, Contextual Mixing, Hierarchical Mixing, and Dynamic Patch Selection. Each strategy determines how patches are drawn from the two input modalities and thus shapes which features the model learns. The framework follows a pretraining-fine-tuning paradigm: the encoder-decoder first reconstructs a mixed image composed of patches sampled from each modality, and is then fine-tuned on RGB images for segmentation. A key contribution is a reinforcement learning agent, trained with Proximal Policy Optimization (PPO), that selects patches dynamically during training. This lets the model learn patch selection policies that better integrate information across modalities and ultimately improve semantic segmentation performance on the Cityscapes dataset. Overall, the proposed framework demonstrates the value of combining transformer architectures with patch selection techniques for segmenting urban scenes, and its findings can inform future use of multimodal data in computer vision tasks.
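
To make the patch mixing idea concrete, the sketch below implements two of the simpler strategies named above, Random Mixing and Chessboard Mixing, for an RGB-depth image pair. This is a minimal illustration under assumed conventions, not the thesis code: the 16-pixel patch size, the 224x224 input, the mixing probability, and all function names (patchify, random_mix, chessboard_mix) are assumptions, and the thesis may define these strategies differently.

```python
# Hypothetical sketch of two patch mixing strategies from the abstract:
# Random Mixing and Chessboard Mixing. Shapes, patch size, and names
# are assumptions for illustration, not taken from the thesis.
import torch

def patchify(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into a (N, C, patch, patch) stack of patches."""
    c, h, w = img.shape
    img = img.reshape(c, h // patch, patch, w // patch, patch)
    return img.permute(1, 3, 0, 2, 4).reshape(-1, c, patch, patch)

def unpatchify(patches: torch.Tensor, h: int, w: int, patch: int = 16) -> torch.Tensor:
    """Reassemble a (N, C, patch, patch) stack back into a (C, H, W) image."""
    c = patches.shape[1]
    grid = patches.reshape(h // patch, w // patch, c, patch, patch)
    return grid.permute(2, 0, 3, 1, 4).reshape(c, h, w)

def random_mix(rgb: torch.Tensor, depth: torch.Tensor,
               p: float = 0.5, patch: int = 16) -> torch.Tensor:
    """Random Mixing: take each patch from RGB with probability p, else from depth."""
    pr, pd = patchify(rgb, patch), patchify(depth, patch)
    mask = torch.rand(pr.shape[0]) < p
    mixed = torch.where(mask[:, None, None, None], pr, pd)
    return unpatchify(mixed, rgb.shape[1], rgb.shape[2], patch)

def chessboard_mix(rgb: torch.Tensor, depth: torch.Tensor,
                   patch: int = 16) -> torch.Tensor:
    """Chessboard Mixing: alternate modalities in a checkerboard over the patch grid."""
    pr, pd = patchify(rgb, patch), patchify(depth, patch)
    gh, gw = rgb.shape[1] // patch, rgb.shape[2] // patch
    ys, xs = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing="ij")
    mask = ((ys + xs) % 2 == 0).reshape(-1)
    mixed = torch.where(mask[:, None, None, None], pr, pd)
    return unpatchify(mixed, rgb.shape[1], rgb.shape[2], patch)

# Depth is replicated to 3 channels so both modalities share the ViT input shape.
rgb = torch.rand(3, 224, 224)
depth = torch.rand(1, 224, 224).repeat(3, 1, 1)
mixed = chessboard_mix(rgb, depth)  # reconstruction target during pretraining
```

Under this reading, Dynamic Patch Selection would replace the fixed or random mask with one produced by the PPO agent's policy at each training step.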
Keywords: Self-supervision, Multi-modal, Segmentation, RGB-D, Transformer
Full text: Mohammadreza_GhorbanpourArani.pdf (Adobe PDF, 11.37 MB, restricted access)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/73728