Stima di Frequenza in Stream di Dati tramite Predizioni Algoritmiche

An algorithm is presented that has the purpose of estimating the frequencies of the elements in a data stream and identifying the frequent elements. Recent studies have tried to solve similar problems using Machine Learning-based techniques. These algorithms, although they often produce encouraging results, tend to be very heavy in memory. The solution proposed here has the main objective of producing accurate estimates using as little memory as possible. To do this, a double Count-Min Sketches is used, the first is used to filter frequent elements from non frequent ones, the second is used to estimate the frequencies of non frequent elements. The algorithm has been tested on three types of datasets: synthetic data with uniform distribution and Zipf distribution, and the IC-DDoS2019 dataset containing network traffic from DDoS attacks. The results show promising but variable performance depending on the type of data distribution. On the CIC-DDoS2019 dataset, the algorithm demonstrates high effectiveness in identifying anomalous traffic spikes. On the uniform dataset, the dual structure consistently outperforms the classic Count-Min Sketch in terms of accuracy. However, on the Zipf data, difficulties in accurately estimating rates are observed for medium-sized configurations (1-3KB).

Viene presentato un algoritmo che ha lo scopo di stimare le frequenze degli elementi in uno stream di dati e di identificare gli elementi frequenti. Studi recenti hanno provato a risolvere problemi simili utilizzando tecniche basate sul Machine Learning. Questi algoritmi, nonostante spesso producano risultati incoraggianti, tendono ad essere molto pesanti in memoria. La soluzione qui proposta ha l'obiettivo principale di produrre stime accurate utilizzando quanta meno memoria possibile. Per fare ciò viene utilizzata una doppia Count-Min Sketches, la prima viene usata per filtrare gli elementi frequenti da quelli non frequenti, la seconda viene usata per stimare le frequenze degli elementi non frequenti. L'algoritmo è stato testato su tre tipologie di dataset: dati sintetici con distribuzione uniforme, dati con distribuzione Zipf e il dataset reale CIC-DDoS2019 contenente traffico di rete da attacchi DDoS. I risultati mostrano prestazioni promettenti ma variabili a seconda del tipo di distribuzione dei dati. Sul dataset CIC-DDoS2019, l'algoritmo dimostra elevata efficacia nell'identificazione di picchi di traffico anomalo. Sul dataset uniforme, la struttura doppia supera consistentemente la Count-Min Sketch classica in termini di accuratezza. Tuttavia, sui dati Zipf si osservano difficoltà nella stima precisa delle frequenze per configurazioni di medie dimensioni (1-3KB).

Stima di Frequenza in Stream di Dati tramite Predizioni Algoritmiche

TERRONE, STEFANO

2024/2025

Abstract

An algorithm is presented that has the purpose of estimating the frequencies of the elements in a data stream and identifying the frequent elements. Recent studies have tried to solve similar problems using Machine Learning-based techniques. These algorithms, although they often produce encouraging results, tend to be very heavy in memory. The solution proposed here has the main objective of producing accurate estimates using as little memory as possible. To do this, a double Count-Min Sketches is used, the first is used to filter frequent elements from non frequent ones, the second is used to estimate the frequencies of non frequent elements. The algorithm has been tested on three types of datasets: synthetic data with uniform distribution and Zipf distribution, and the IC-DDoS2019 dataset containing network traffic from DDoS attacks. The results show promising but variable performance depending on the type of data distribution. On the CIC-DDoS2019 dataset, the algorithm demonstrates high effectiveness in identifying anomalous traffic spikes. On the uniform dataset, the dual structure consistently outperforms the classic Count-Min Sketch in terms of accuracy. However, on the Zipf data, difficulties in accurately estimating rates are observed for medium-sized configurations (1-3KB).

Scheda

Scheda DC

	Facoltà/Dipartimento
	
				Dipartimento di Scienze Statistiche
			
	Corso di studio
	
				STATISTICA PER LE TECNOLOGIE E LE SCIENZE Laurea di Primo Livello (D.M. 270/2004)
			
	Anno Accademico
	
				2024
			
	Titolo inglese
	
				Frequency Estimation in Data Streams with Algorithmic Predictions
			
	Abstract in italiano
	
				Viene presentato un algoritmo che ha lo scopo di stimare le frequenze degli elementi in uno stream di dati e di identificare gli elementi frequenti. Studi recenti hanno provato a risolvere problemi simili utilizzando tecniche basate sul Machine Learning. Questi algoritmi, nonostante spesso producano risultati incoraggianti, tendono ad essere molto pesanti in memoria. La soluzione qui proposta ha l'obiettivo principale di produrre stime accurate utilizzando quanta meno memoria possibile. Per fare ciò viene utilizzata una doppia Count-Min Sketches, la prima viene usata per filtrare gli elementi frequenti da quelli non frequenti, la seconda viene usata per stimare le frequenze degli elementi non frequenti. L'algoritmo è stato testato su tre tipologie di dataset: dati sintetici con distribuzione uniforme, dati con distribuzione Zipf e il dataset reale CIC-DDoS2019 contenente traffico di rete da attacchi DDoS. I risultati mostrano prestazioni promettenti ma variabili a seconda del tipo di distribuzione dei dati. Sul dataset CIC-DDoS2019, l'algoritmo dimostra elevata efficacia nell'identificazione di picchi di traffico anomalo. Sul dataset uniforme, la struttura doppia supera consistentemente la Count-Min Sketch classica in termini di accuratezza. Tuttavia, sui dati Zipf si osservano difficoltà nella stima precisa delle frequenze per configurazioni di medie dimensioni (1-3KB).
			
	Parola chiave
	
				Stima di Frequenze
Data Streams
Algoritmi Online
			
	Relatore
	
				CECCARELLO, MATTEO
			
	Appare nelle tipologie:
	
				Lauree triennali

File in questo prodotto:

File	Dimensione	Formato
Terrone_Stefano.pdf accesso aperto Dimensione 339.99 kB Formato Adobe PDF Visualizza/Apri	339.99 kB	Adobe PDF	Visualizza/Apri

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12608/92981