Exploiting patches spatial relations in self-supervised models for vision tasks

MELISSARI, LUCA
2021/2022

Abstract

Self-supervised learning has gained popularity because of its ability to avoid the cost of annotating large-scale datasets. It adopts self-defined pseudo-labels as supervision and uses the learned representations for several downstream tasks. In Natural Language Processing it has produced remarkable results, and the state of the art in that domain benefits substantially from it. It is not an alternative to traditional supervised or unsupervised learning, but it can help achieve better generalization with less human effort spent building labelled datasets. This thesis investigates the use of self-supervised learning in computer vision through pretext tasks based on the spatial relations between image patches. It examines the improvements in two different contexts: a convolutional neural network (ResNet50) used for image classification, called RelCNN, and a transformer-based network, ViT, used for semantic segmentation, named RelVit. In particular, RelVit outperforms the standard ViT in all the experiments performed, whereas RelCNN outperforms ResNet50 only in a few situations, suggesting that applying self-supervised learning to convolutional neural networks requires more sophisticated solutions.
computer vision
deep learning
self-supervision
spatial-relations
Files in this item:
File: Melissari_Luca.pdf (restricted access)
Size: 6.66 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/34962