Word embeddings are state-of-the-art vectorial representation of words with the goal of preserving semantic similarity. They are the result of specific learning algorithms trained on usually large corpora. Consequently, they inherit all biases of the corpora on which they have been trained on. The goal of the thesis is to devise and adapt an efficient algorithm to compare two different word embeddings in order to highlight the biases they are subjected to. Specifically, we look for an alignment between the two vector spaces, corresponding to the two word embeddings, that minimises the difference between the stable words, i.e. the ones that have not changed in the two embeddings, thus highlighting the differences between the ones that did changed. In this work, we test this idea adapting a machine translation framework called MUSE that, after some improvements, can run over multiple cores in a HPC framework, specifically managed with SLURM. We also provide an amplpy implementation of linear and convex programming algorithms adapted to our case. We then test these techniques on a corpus of text taken from Italian newspapers in order to identify which words are more subject to change among the different pairs of corpora.

Word embeddings are state-of-the-art vectorial representation of words with the goal of preserving semantic similarity. They are the result of specific learning algorithms trained on usually large corpora. Consequently, they inherit all biases of the corpora on which they have been trained on. The goal of the thesis is to devise and adapt an efficient algorithm to compare two different word embeddings in order to highlight the biases they are subjected to. Specifically, we look for an alignment between the two vector spaces, corresponding to the two word embeddings, that minimises the difference between the stable words, i.e. the ones that have not changed in the two embeddings, thus highlighting the differences between the ones that did changed. In this work, we test this idea adapting a machine translation framework called MUSE that, after some improvements, can run over multiple cores in a HPC framework, specifically managed with SLURM. We also provide an amplpy implementation of linear and convex programming algorithms adapted to our case. We then test these techniques on a corpus of text taken from Italian newspapers in order to identify which words are more subject to change among the different pairs of corpora.

Unveiling Biases in Word Embeddings: An Algorithmic Approach for Comparative Analysis Based on Alignment

SANGUIN, PIETRO MARIA
2022/2023

Abstract

Word embeddings are state-of-the-art vectorial representation of words with the goal of preserving semantic similarity. They are the result of specific learning algorithms trained on usually large corpora. Consequently, they inherit all biases of the corpora on which they have been trained on. The goal of the thesis is to devise and adapt an efficient algorithm to compare two different word embeddings in order to highlight the biases they are subjected to. Specifically, we look for an alignment between the two vector spaces, corresponding to the two word embeddings, that minimises the difference between the stable words, i.e. the ones that have not changed in the two embeddings, thus highlighting the differences between the ones that did changed. In this work, we test this idea adapting a machine translation framework called MUSE that, after some improvements, can run over multiple cores in a HPC framework, specifically managed with SLURM. We also provide an amplpy implementation of linear and convex programming algorithms adapted to our case. We then test these techniques on a corpus of text taken from Italian newspapers in order to identify which words are more subject to change among the different pairs of corpora.
2022
Unveiling Biases in Word Embeddings: An Algorithmic Approach for Comparative Analysis Based on Alignment
Word embeddings are state-of-the-art vectorial representation of words with the goal of preserving semantic similarity. They are the result of specific learning algorithms trained on usually large corpora. Consequently, they inherit all biases of the corpora on which they have been trained on. The goal of the thesis is to devise and adapt an efficient algorithm to compare two different word embeddings in order to highlight the biases they are subjected to. Specifically, we look for an alignment between the two vector spaces, corresponding to the two word embeddings, that minimises the difference between the stable words, i.e. the ones that have not changed in the two embeddings, thus highlighting the differences between the ones that did changed. In this work, we test this idea adapting a machine translation framework called MUSE that, after some improvements, can run over multiple cores in a HPC framework, specifically managed with SLURM. We also provide an amplpy implementation of linear and convex programming algorithms adapted to our case. We then test these techniques on a corpus of text taken from Italian newspapers in order to identify which words are more subject to change among the different pairs of corpora.
Word embeddings
Biases
Alignment algorithm
NLP
Optimization
File in questo prodotto:
File Dimensione Formato  
pdfA_Data_Science_thesis_NLP.pdf

accesso aperto

Dimensione 4.15 MB
Formato Adobe PDF
4.15 MB Adobe PDF Visualizza/Apri

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12608/52277