Exploiting patches spatial relations in self-supervised models for vision tasks

Self-supervised learning has gained popularity because it avoids the cost of annotating large-scale datasets. It adopts self-defined pseudo-labels as supervision and uses the learned representations for several downstream tasks. In Natural Language Processing it has brought remarkable results, and state-of-the-art models in this domain benefit substantially from it. It is not an alternative to traditional supervised or unsupervised learning, but it can help achieve better generalization with less human effort spent building labelled datasets. This thesis investigates the use of self-supervised learning in computer vision by means of tasks based on spatial relations between image patches. It examines the improvements in two different contexts: a convolutional neural network (ResNet50) used for image classification, called RelCNN, and a transformer-based network (ViT) used for semantic segmentation, called RelVit. In particular, RelVit outperforms the standard ViT in all the experiments performed, whereas RelCNN outperforms ResNet50 in only a few settings, suggesting that self-supervised learning in convolutional neural networks requires more sophisticated solutions.

MELISSARI, LUCA

2021/2022

Record

	Faculty/Department
	
				Dipartimento di Matematica "Tullio Levi-Civita" - DM
			
	Degree programme
	
				Computer Science (INFORMATICA), Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2021
			
	English title
	
				Exploiting patches spatial relations in self-supervised models for vision tasks
			
	Keywords
	
				computer vision
				deep learning
				self-supervision
				spatial-relations
			
	Supervisor
	
				BALLAN, LAMBERTO
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Melissari_Luca.pdf (restricted access)	6.66 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/34962