Evaluation of Near-Duplicate Detection and Term Extraction Techniques for News Media Monitoring
JARRA, NIUMI
2024/2025
Abstract
The rapid growth of online news provides both an opportunity and a challenge for social scientists conducting media monitoring. Repetitive content in the form of near-duplicate articles can skew analysis, while extracting meaningful terms is valuable for analyzing public discourse. This study addresses these challenges by evaluating existing approaches for detecting near-duplicate content and extracting meaningful terms in the context of news media monitoring. First, we evaluated the performance of a Python implementation of locality-sensitive hashing (LSH) with MinHashing for detecting near-duplicates in a corpus of 3,932 English news articles, which had been manually labeled for duplicates and near-duplicates. Second, we conducted a comparative study of four tools – spaCy, KeyBERT, PhraseMachine, and LLaMA – used for term extraction in a domain-specific corpus of 636 genome editing articles. The results indicate that the LSH–MinHash approach is a robust method for detecting near-duplicate content in media streams. At the same time, the term extraction comparison shows that tool performance is context-dependent, with no single method outperforming the others across all the considered metrics. Collectively, these findings offer practical guidance for designing media monitoring pipelines and support more effective digital social science research.

| File | Size | Format | Access |
|---|---|---|---|
| Jarra_Niumi.pdf | 1.27 MB | Adobe PDF | Restricted |
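The LSH–MinHash approach evaluated in the thesis can be illustrated with a minimal, dependency-free sketch: character shingling, seeded keyed hashes standing in for random permutations, and the standard banding scheme that turns signature agreement into bucket collisions. The function names, parameters (64 permutations, 16 bands of 4 rows), and example texts below are illustrative assumptions, not taken from the thesis's actual implementation.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, k=5):
    # Normalize case and whitespace, then take the set of character k-grams.
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(shingle_set, num_perm=64):
    # One keyed hash per simulated "permutation"; the signature keeps the
    # minimum hash value of the set under each key. Two sets agree on a
    # signature position with probability equal to their Jaccard similarity.
    sig = []
    for seed in range(num_perm):
        key = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, key=key).digest(),
                "big")
            for s in shingle_set))
    return sig

def candidate_pairs(signatures, bands=16, rows=4):
    # LSH banding: split each signature into bands; documents whose
    # signatures match on any whole band land in the same bucket and
    # become a candidate near-duplicate pair.
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical mini-corpus: "a" and "b" are near-duplicates, "c" is unrelated.
docs = {
    "a": "Scientists edit the genome of wheat to resist drought.",
    "b": "Scientists edit the genome of wheat to resist severe drought.",
    "c": "Stock markets rallied after the central bank's announcement.",
}
sigs = {d: minhash_signature(shingles(t)) for d, t in docs.items()}
print(candidate_pairs(sigs))
```

With 16 bands of 4 rows, a pair with Jaccard similarity around 0.8 is flagged with near-certainty, while an unrelated pair almost never collides; tuning `bands` and `rows` trades recall against false candidates.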
The text of this website is © Università degli Studi di Padova. Full texts are published under a non-exclusive license; metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/96063