Leveraging Multi-Modality in Self-Supervised Learning for Segmentation
GHORBANPOUR ARANI, MOHAMMADREZA
2023/2024
Abstract
This thesis presents a novel framework that combines a Vision Transformer (ViT) encoder with a UPerNet decoder for semantic segmentation using both RGB and depth modalities. The architecture leverages self-attention, enabling the model to capture intricate spatial dependencies among image patches. To improve performance, self-supervised learning with several patch mixing strategies is explored: Random Mixing, Chessboard Mixing, Contextual Mixing, Hierarchical Mixing, and Dynamic Patch Selection. Each strategy determines how patches are sampled from the input images and thus shapes the features the model can capture. The framework follows a pretraining/fine-tuning paradigm: during pretraining, the encoder-decoder reconstructs a mixed image composed of patches sampled from each modality; the model is then fine-tuned on RGB images for segmentation. A key contribution is a reinforcement learning agent, trained with Proximal Policy Optimization (PPO), that dynamically selects patches during training. This allows the model to learn patch selection strategies that better integrate information across modalities, ultimately improving semantic segmentation performance on the Cityscapes dataset. Overall, the proposed framework illustrates the value of combining transformer architectures with patch selection techniques for urban scene understanding, and the findings offer insights for future work on multimodal data in computer vision tasks.
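The patch mixing idea described above — composing a single pretraining target from patches drawn alternately from the RGB and depth images — can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name, patch size, and the assumption that depth has been replicated to three channels to match RGB are all choices made here for clarity; only Random and Chessboard Mixing are shown.

```python
import numpy as np

def mix_patches(rgb, depth, patch=4, strategy="chessboard", seed=0):
    """Compose a mixed image by taking each patch from either the RGB or
    the depth modality. A boolean mask over the patch grid decides the
    source: 'chessboard' alternates modalities like chessboard squares,
    'random' samples each patch i.i.d. with probability 0.5.
    Assumes H and W are divisible by `patch` and depth matches rgb's shape.
    """
    H, W, _ = rgb.shape
    gh, gw = H // patch, W // patch  # patch-grid dimensions
    if strategy == "chessboard":
        # True where (row + col) is odd -> take that patch from depth.
        mask = (np.add.outer(np.arange(gh), np.arange(gw)) % 2).astype(bool)
    elif strategy == "random":
        rng = np.random.default_rng(seed)
        mask = rng.random((gh, gw)) < 0.5
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    mixed = rgb.copy()
    for i in range(gh):
        for j in range(gw):
            if mask[i, j]:
                ys, xs = i * patch, j * patch
                mixed[ys:ys + patch, xs:xs + patch] = \
                    depth[ys:ys + patch, xs:xs + patch]
    return mixed, mask
```

In the pretraining setup the abstract describes, an image mixed this way would serve as the encoder-decoder's input, with reconstruction of the original modalities as the self-supervised objective; the PPO agent mentioned above would replace the fixed mask with a learned selection policy.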
File: Mohammadreza_GhorbanpourArani.pdf (restricted access, 11.37 MB, Adobe PDF)
https://hdl.handle.net/20.500.12608/73728