Word Embedding: design and evaluation of a third-order model

In recent years, fields such as Natural Language Processing, Artificial Intelligence and Machine Learning have made great strides. Computers have become very adept at manipulating large amounts of information, but this information must always be encoded in some numerical form. This thesis examines how to encode what lies at the basis of all communication between human beings: language, and words in particular. To this end, a new word embedding model was designed and evaluated, i.e. a model that represents words and their meaning as vectors of real numbers. The model produces static word embeddings: it learns a single fixed embedding for each vocabulary word, regardless of the context in which the word appears. It is also a third-order model: unlike the best-known static models, which consider pairs of words at a time, it learns the embeddings from sets of three different words.
The thesis first introduces the research field of Natural Language Processing, presenting its main characteristics and tracing its history up to a simplified treatment of the problems of greatest current interest. Within this field, attention is then focused on word embeddings, that is, on how words can be represented by vectors that capture, in some way, the meaning of the words they are associated with. The first chapter explains the simplest methods for obtaining these vectors, and the third chapter analyses the main existing static word embedding models, such as Word2Vec, GloVe and fastText.
Particular attention was also paid to deep learning techniques and neural networks, as these were fundamental tools for the work carried out in this thesis. The second chapter therefore starts from a general introduction to Machine Learning and Deep Learning and then analyses Stochastic Gradient Descent and the Backpropagation algorithm. A whole chapter is then dedicated to explaining how the idea behind the model arose and how the model was designed; it also discusses the fundamental details of its implementation, including the problems encountered and the solutions adopted. Finally, the model is evaluated by comparing it with one of the best-known static models, Word2Vec, and the main problems and difficulties it encountered are examined through a careful error analysis.
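The two notions that define the model, static embeddings and third-order context, can be illustrated with a minimal sketch. It is not the procedure used in the thesis: the function names `build_static_embeddings` and `context_groups`, the window size, and the random initialisation are all assumptions made for illustration, and no training step is shown.

```python
import random
from itertools import combinations


def build_static_embeddings(vocabulary, dim=4, seed=0):
    """Assign one fixed vector per vocabulary word (hypothetical helper).

    "Static" means the vector depends only on the word itself,
    never on the sentence in which the word occurs.
    """
    rng = random.Random(seed)
    return {
        word: [rng.uniform(-0.5, 0.5) for _ in range(dim)]
        for word in sorted(vocabulary)
    }


def context_groups(tokens, window=2, order=3):
    """Yield groups of `order` distinct words that co-occur in a window.

    order=2 gives the (center, context) pairs used by pair-based models;
    order=3 gives triples, the kind of input a third-order model could use.
    This is an illustrative assumption, not the thesis's sampling scheme.
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        for others in combinations(context, order - 1):
            group = (center, *others)
            if len(set(group)) == order:  # keep only groups of distinct words
                yield group


tokens = "the cat sat on the mat".split()
embeddings = build_static_embeddings(set(tokens))
pairs = list(context_groups(tokens, window=1, order=2))
triples = list(context_groups(tokens, window=1, order=3))
```

Both occurrences of "the" in the example sentence map to the same entry of `embeddings`, which is what distinguishes a static model from a contextual one. Switching `order` from 2 to 3 changes only the size of the word groups passed on to learning.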
