Gender Bias in Datasets Used for Training Virtual Assistants

This thesis explores gender bias in the datasets used to train virtual assistants. Virtual assistants such as Siri, Alexa and Google Assistant have become an integral part of daily life, providing support and answering our queries. However, there is growing concern that these assistants are affected by gender bias. This study analyses the MASSIVE dataset in order to identify possible implicit gender biases. The analyses follow those conducted in Seaborn's study "Transcending the 'male code': Implicit masculine biases in NLP contexts": they were carried out in a similar manner, but consider only the Italian-language utterances. In addition to evidence of implicit masculine biases, a further contribution is AVA (Ambiguity for Virtual Assistants), a dictionary that collects ambiguous terms common to both gendered language and the language of virtual assistants. After two introductory chapters, the thesis turns to the analysis of the MASSIVE dataset. The third and fourth chapters describe the dictionaries used and the MASSIVE dataset. The fifth and sixth chapters present the first analyses performed on the dataset. The next two chapters describe the creation of the AVA dictionary and the analyses performed with it. Finally, the thesis discusses future analyses and research, the limitations of this work, and the conclusions.
