The use of Artificial Intelligence to solve Natural Language Processing problems: the example of BERT in a company project
MIGLIORE, GIULIA
2022/2023
Abstract
Natural Language Processing (NLP) is a field that has been expanding rapidly in recent years within Artificial Intelligence (AI); it is enough to consider the recent explosion of text-generation software such as OpenAI's ChatGPT, released in November 2022. The aim of this work is to explain what NLP is about and which tasks are most useful, to provide an overview of the main machine-learning models used to solve NLP tasks, and to give an example of their application in a company project, focusing in particular on the use of BERT during my internship at Ixly B.V. in Utrecht, Netherlands. Together with Ixly's Data Science team, I worked on their "Interview App", an application that records speech during a job interview and returns a report with the following features: the main topics covered, in terms of competencies, motivators and personality keywords; a word cloud of the most frequently used words; the number and types of questions asked; and the language style matching between the candidate and the interviewer. The "Interview App" is currently available in Dutch, so my job was to make it suitable for the Italian language. This meant that I had to find an Italian corpus of spoken text, normalize it, and train the model on this new dataset. I then had to adapt all the filters mentioned above to Italian and finally create the pipeline that computes the language analysis for Italian conversations. This pipeline uses BERT to identify question types within a conversation. To collect the information for this work, a detailed search was conducted on Google Scholar. The main websites used for machine learning and AI were also consulted: GitHub and Hugging Face to obtain the code needed to run BERT on my personal computer, the Azure API to transcribe spoken language, and the official OpenAI website to learn about its GPT products. Python was used as the main programming language, with Visual Studio Code as the development environment. The Italian corpus KIParla, collected and released in 2019 jointly by the Universities of Bologna and Turin, was adopted as the main dataset for Italian.
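The abstract mentions normalizing an Italian spoken-language corpus and training BERT to classify question types. The following is a minimal illustrative sketch of that kind of fine-tuning with the Hugging Face Transformers library; the checkpoint name, the three-class label set, and the toy utterances are assumptions made for illustration, not the data or code actually used in the Interview App.

```python
# Hypothetical sketch: fine-tune an Italian BERT checkpoint to classify
# question types in interview turns. Labels and examples are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "dbmdz/bert-base-italian-cased"   # a public Italian BERT checkpoint
LABELS = ["open", "closed", "probing"]          # assumed question-type labels

# Toy utterances standing in for normalized KIParla-style training data.
train_data = Dataset.from_dict({
    "text": [
        "Può descrivermi una situazione in cui ha lavorato in team?",
        "Ha mai usato Python?",
        "Perché ha preso quella decisione?",
    ],
    "label": [0, 1, 2],
})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Pad/truncate each utterance to a fixed length for BERT.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="question-type-bert",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_data,
)
trainer.train()
```

In a real setting, the fine-tuned model could then be loaded with a `text-classification` pipeline to label each question turn extracted from a transcribed interview.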
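The abstract also refers to the Azure API for transcribing spoken language. A minimal sketch of Italian speech-to-text with the Azure Speech SDK (`azure-cognitiveservices-speech`) might look like the following; the key, region, and audio file name are placeholders, and this is not necessarily how the Interview App invokes the service.

```python
# Hedged sketch: single-shot Italian transcription with the Azure Speech SDK.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<YOUR_KEY>", region="<YOUR_REGION>")
speech_config.speech_recognition_language = "it-IT"  # transcribe Italian speech

audio_config = speechsdk.audio.AudioConfig(filename="interview.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

# Recognize the first utterance in the audio file and print its transcript.
result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
```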
https://hdl.handle.net/20.500.12608/47137