An algorithm is presented that has the purpose of estimating the frequencies of the elements in a data stream and identifying the frequent elements. Recent studies have tried to solve similar problems using Machine Learning-based techniques. These algorithms, although they often produce encouraging results, tend to be very heavy in memory. The solution proposed here has the main objective of producing accurate estimates using as little memory as possible. To do this, a double Count-Min Sketches is used, the first is used to filter frequent elements from non frequent ones, the second is used to estimate the frequencies of non frequent elements. The algorithm has been tested on three types of datasets: synthetic data with uniform distribution and Zipf distribution, and the IC-DDoS2019 dataset containing network traffic from DDoS attacks. The results show promising but variable performance depending on the type of data distribution. On the CIC-DDoS2019 dataset, the algorithm demonstrates high effectiveness in identifying anomalous traffic spikes. On the uniform dataset, the dual structure consistently outperforms the classic Count-Min Sketch in terms of accuracy. However, on the Zipf data, difficulties in accurately estimating rates are observed for medium-sized configurations (1-3KB).
Viene presentato un algoritmo che ha lo scopo di stimare le frequenze degli elementi in uno stream di dati e di identificare gli elementi frequenti. Studi recenti hanno provato a risolvere problemi simili utilizzando tecniche basate sul Machine Learning. Questi algoritmi, nonostante spesso producano risultati incoraggianti, tendono ad essere molto pesanti in memoria. La soluzione qui proposta ha l'obiettivo principale di produrre stime accurate utilizzando quanta meno memoria possibile. Per fare ciò viene utilizzata una doppia Count-Min Sketches, la prima viene usata per filtrare gli elementi frequenti da quelli non frequenti, la seconda viene usata per stimare le frequenze degli elementi non frequenti. L'algoritmo è stato testato su tre tipologie di dataset: dati sintetici con distribuzione uniforme, dati con distribuzione Zipf e il dataset reale CIC-DDoS2019 contenente traffico di rete da attacchi DDoS. I risultati mostrano prestazioni promettenti ma variabili a seconda del tipo di distribuzione dei dati. Sul dataset CIC-DDoS2019, l'algoritmo dimostra elevata efficacia nell'identificazione di picchi di traffico anomalo. Sul dataset uniforme, la struttura doppia supera consistentemente la Count-Min Sketch classica in termini di accuratezza. Tuttavia, sui dati Zipf si osservano difficoltà nella stima precisa delle frequenze per configurazioni di medie dimensioni (1-3KB).
Stima di Frequenza in Stream di Dati tramite Predizioni Algoritmiche
TERRONE, STEFANO
2024/2025
Abstract
An algorithm is presented that has the purpose of estimating the frequencies of the elements in a data stream and identifying the frequent elements. Recent studies have tried to solve similar problems using Machine Learning-based techniques. These algorithms, although they often produce encouraging results, tend to be very heavy in memory. The solution proposed here has the main objective of producing accurate estimates using as little memory as possible. To do this, a double Count-Min Sketches is used, the first is used to filter frequent elements from non frequent ones, the second is used to estimate the frequencies of non frequent elements. The algorithm has been tested on three types of datasets: synthetic data with uniform distribution and Zipf distribution, and the IC-DDoS2019 dataset containing network traffic from DDoS attacks. The results show promising but variable performance depending on the type of data distribution. On the CIC-DDoS2019 dataset, the algorithm demonstrates high effectiveness in identifying anomalous traffic spikes. On the uniform dataset, the dual structure consistently outperforms the classic Count-Min Sketch in terms of accuracy. However, on the Zipf data, difficulties in accurately estimating rates are observed for medium-sized configurations (1-3KB).| File | Dimensione | Formato | |
|---|---|---|---|
|
Terrone_Stefano.pdf
accesso aperto
Dimensione
339.99 kB
Formato
Adobe PDF
|
339.99 kB | Adobe PDF | Visualizza/Apri |
The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License
https://hdl.handle.net/20.500.12608/92981