Unveiling Biases in Word Embeddings: An Algorithmic Approach for Comparative Analysis Based on Alignment

Word embeddings are state-of-the-art vector representations of words, designed to preserve semantic similarity. They are produced by specific learning algorithms trained on corpora that are usually large. Consequently, they inherit the biases of the corpora on which they were trained. The goal of the thesis is to devise and adapt an efficient algorithm that compares two different word embeddings in order to highlight the biases they are subject to. Specifically, we look for an alignment between the two vector spaces, one per word embedding, that minimises the difference between the stable words, i.e. the words whose representation has not changed between the two embeddings, and thereby highlights the words that did change. In this work, we test this idea by adapting a machine translation framework called MUSE which, after some improvements, can run over multiple cores in an HPC environment managed with SLURM. We also provide an amplpy implementation of linear and convex programming algorithms adapted to our case. We then test these techniques on a corpus of texts taken from Italian newspapers in order to identify which words are most subject to change across the different pairs of corpora.
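The alignment idea in the abstract can be illustrated with a minimal sketch. It assumes orthogonal Procrustes as the alignment step, a common closed-form choice for mapping one embedding space onto another; the abstract does not say that this is the exact formulation used in the thesis, which mentions MUSE and linear and convex programming. The function names, the toy data, and the choice of anchor ("stable") words are all hypothetical.

```python
import numpy as np


def procrustes_align(X, Y, anchor_idx):
    """Return the orthogonal matrix W minimising ||X[a] @ W - Y[a]||_F
    over the anchor rows a (the words assumed to be stable)."""
    A = X[anchor_idx]
    B = Y[anchor_idx]
    # Closed-form orthogonal Procrustes solution via SVD of A^T B.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


def shift_scores(X, Y, W):
    """Cosine distance between each aligned vector X[i] @ W and Y[i].
    Larger values suggest the word differs more between the embeddings."""
    XW = X @ W
    num = np.sum(XW * Y, axis=1)
    den = np.linalg.norm(XW, axis=1) * np.linalg.norm(Y, axis=1)
    return 1.0 - num / den


# Toy data (hypothetical): 6 words in 4 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# The second embedding is a rotated copy of the first one...
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ Q
# ...except for the last word, whose vector is flipped to simulate a change.
Y[5] = -Y[5]

anchors = np.arange(5)           # words assumed to be stable
W = procrustes_align(X, Y, anchors)
scores = shift_scores(X, Y, W)
most_changed = int(np.argmax(scores))
print(most_changed, np.round(scores, 3))
```

In this toy setting the anchors are aligned almost exactly, so their scores stay near zero, while the flipped word receives the largest score. With real embeddings the stable words are not known in advance, which is the part of the problem the thesis addresses.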

SANGUIN, PIETRO MARIA

2022/2023

	Faculty/Department
	
				Dipartimento di Matematica "Tullio Levi-Civita" - DM
			
	Degree programme
	
				DATA SCIENCE Laurea Magistrale (Master's degree, D.M. 270/2004)
			
	Academic year
	
				2022
			
	English title
	
				Unveiling Biases in Word Embeddings: An Algorithmic Approach for Comparative Analysis Based on Alignment
			
	Keywords
	
				Word embeddings
				Biases
				Alignment algorithm
				NLP
				Optimization
			
	Supervisor
	
				DA SAN MARTINO, GIOVANNI
			
	Appears in collections:
	
				Master's degree theses

Files in this item:

File	Size	Format
pdfA_Data_Science_thesis_NLP.pdf (open access)	4.15 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/52277