An Experimental Assessment of the Efficacy of BERTopic

ZAHIR, FARIN BINTA

2022/2023

Abstract

Topic modelling is an unsupervised machine-learning technique for finding abstract topics in a large collection of documents. It helps to organize, understand, and summarize large bodies of text by discovering the latent topics that vary among the documents of a corpus. Recently developed topic modelling algorithms, such as BERTopic, have attracted significant and growing interest from researchers. This research sheds light on the efficacy of these algorithms and emphasizes the technical skills needed for meaningful investigation in this domain. Efficient and scalable implementations are essential for handling very large text corpora, and meaningful comparisons among topic modelling approaches require proficiency in data analysis and data visualization. Python provides the flexibility and robustness required for the algorithmic implementations, while a solid foundation in statistical modelling and mathematics is indispensable for accurate calculation and prediction. Specifically, the main contribution of the study is to introduce NMI (Normalized Mutual Information) and modularity, two evaluation metrics used to assess the quality of the clusters or topics generated by clustering algorithms, including those used in BERTopic. In essence, the research both explores the effectiveness of state-of-the-art topic modelling algorithms and underscores the importance of expertise in data analysis, data visualization, Python programming, and statistical modelling for comprehensive comparisons within the field.
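
As a rough illustration of the two metrics named in the abstract, the sketch below computes NMI between two labelings and the modularity of a graph partition using only the Python standard library. It is not taken from the thesis: the function names, the arithmetic-mean normalization of NMI, and the restriction to unweighted, undirected graphs without self-loops are all assumptions made for this example, and this record does not say how the thesis applies either metric.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two labelings of the same items.

    Assumption: normalized by the arithmetic mean of the two entropies,
    i.e. NMI = 2 * I(A; B) / (H(A) + H(B)).
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = entropy(labels_a) + entropy(labels_b)
    # Both labelings put everything in one cluster: treat as identical.
    return 1.0 if denom == 0 else 2 * mutual_info / denom


def modularity(edges, community):
    """Newman modularity Q of a partition.

    Assumptions: `edges` is a list of (u, v) pairs of an unweighted,
    undirected graph without self-loops; `community` maps node -> label.
    Q = sum over communities c of [L_c / m - (d_c / (2 m))^2], where
    L_c is the number of edges inside c and d_c is its total degree.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = Counter()
    degree = Counter()
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical toy data: topic assignments vs. reference labels,
# and two triangles joined by a single bridge edge.
predicted = [0, 0, 1, 1]
reference = ["x", "x", "y", "y"]
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

nmi_score = nmi(predicted, reference)
q_score = modularity(toy_edges, toy_partition)
```

NMI compares a clustering against a reference labeling and does not depend on how the clusters are named, whereas modularity needs only a graph and a partition, so it can be used when no reference labels exist.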

	Faculty/Department
	
				Dipartimento di Ingegneria dell'Informazione - DEI
			
	Degree programme
	
				ICT FOR INTERNET AND MULTIMEDIA - INGEGNERIA PER LE COMUNICAZIONI MULTIMEDIALI E INTERNET, Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2022
			
	English title
	
				An Experimental Assessment of the Efficacy of BERTopic
			
	Keywords
	
				BERTopic
TF-IDF
NMI
modularity
			
	Supervisor
	
				ERSEGHE, TOMASO
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Zahir_Farin Binta_pdfA.pdf (open access)	4.53 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/58772