Leveraging Multi-Modality in Self-Supervised Learning for Segmentation

Ghorbanpour Arani, Mohammadreza
Academic year 2023/2024

Abstract

This thesis presents a novel framework that integrates a Vision Transformer (ViT) encoder with a UPerNet decoder for semantic segmentation using both RGB and depth modalities. The architecture leverages self-attention, enabling the model to capture intricate spatial dependencies among image patches. To improve performance, self-supervised learning with several patch mixing strategies is explored: Random Mixing, Chessboard Mixing, Contextual Mixing, Hierarchical Mixing, and Dynamic Patch Selection. Each strategy determines how patches are drawn from the two input modalities and thus shapes which features the model learns. The framework follows a pretraining-fine-tuning paradigm: the encoder-decoder first reconstructs a mixed image composed of patches sampled from each modality, and is then fine-tuned on RGB images for segmentation. A key contribution is a reinforcement learning agent, trained with Proximal Policy Optimization (PPO), that selects patches dynamically during training. This lets the model learn patch selection policies that better integrate information across modalities and ultimately improve semantic segmentation performance on the Cityscapes dataset. Overall, the proposed framework demonstrates the value of combining transformer architectures with patch selection techniques for segmenting urban scenes, and its findings can inform future use of multimodal data in computer vision tasks.
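
To make the patch mixing idea concrete, the sketch below implements two of the simpler strategies named above, Random Mixing and Chessboard Mixing, for an RGB-depth image pair. This is a minimal illustration under assumed conventions, not the thesis code: the 16-pixel patch size, the 224x224 input, the mixing probability, and all function names (patchify, random_mix, chessboard_mix) are assumptions, and the thesis may define these strategies differently.

```python
# Hypothetical sketch of two patch mixing strategies from the abstract:
# Random Mixing and Chessboard Mixing. Shapes, patch size, and names
# are assumptions for illustration, not taken from the thesis.
import torch

def patchify(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into a (N, C, patch, patch) stack of patches."""
    c, h, w = img.shape
    img = img.reshape(c, h // patch, patch, w // patch, patch)
    return img.permute(1, 3, 0, 2, 4).reshape(-1, c, patch, patch)

def unpatchify(patches: torch.Tensor, h: int, w: int, patch: int = 16) -> torch.Tensor:
    """Reassemble a (N, C, patch, patch) stack back into a (C, H, W) image."""
    c = patches.shape[1]
    grid = patches.reshape(h // patch, w // patch, c, patch, patch)
    return grid.permute(2, 0, 3, 1, 4).reshape(c, h, w)

def random_mix(rgb: torch.Tensor, depth: torch.Tensor,
               p: float = 0.5, patch: int = 16) -> torch.Tensor:
    """Random Mixing: take each patch from RGB with probability p, else from depth."""
    pr, pd = patchify(rgb, patch), patchify(depth, patch)
    mask = torch.rand(pr.shape[0]) < p
    mixed = torch.where(mask[:, None, None, None], pr, pd)
    return unpatchify(mixed, rgb.shape[1], rgb.shape[2], patch)

def chessboard_mix(rgb: torch.Tensor, depth: torch.Tensor,
                   patch: int = 16) -> torch.Tensor:
    """Chessboard Mixing: alternate modalities in a checkerboard over the patch grid."""
    pr, pd = patchify(rgb, patch), patchify(depth, patch)
    gh, gw = rgb.shape[1] // patch, rgb.shape[2] // patch
    ys, xs = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing="ij")
    mask = ((ys + xs) % 2 == 0).reshape(-1)
    mixed = torch.where(mask[:, None, None, None], pr, pd)
    return unpatchify(mixed, rgb.shape[1], rgb.shape[2], patch)

# Depth is replicated to 3 channels so both modalities share the ViT input shape.
rgb = torch.rand(3, 224, 224)
depth = torch.rand(1, 224, 224).repeat(3, 1, 1)
mixed = chessboard_mix(rgb, depth)  # reconstruction target during pretraining
```

Under this reading, Dynamic Patch Selection would replace the fixed or random mask with one produced by the PPO agent's policy at each training step.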
Keywords: Self-supervision, Multi-modal, Segmentation, RGB-D, Transformer
Full text: Mohammadreza_GhorbanpourArani.pdf (Adobe PDF, 11.37 MB, restricted access)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/73728