In the era of big data, efficiently analyzing large volumes of data requires probabilistic data structures, which offer a trade-off between accuracy and resources. This thesis focuses on the Count-Min Sketch, a structure that allows for compact representation of a data stream, providing fast responses to various types of queries (such as point, range, and inner product) while using sublinear memory space relative to the input size. However, due to the reduced size of the sketch, the responses are only approximate and characterized by an error factor for each query. After outlining the main characteristics, applications, implementation, and performance of the Count-Min Sketch, the thesis explores and implements two approaches to improve the accuracy of point query estimates: an oracle and a predictor. The oracle accurately identifies the "heavy hitters", the most frequent elements in the data stream, while the predictor uses a second, smaller Count-Min Sketch to roughly classify those elements. By predicting the identity of each item, collisions in the original sketch are reduced, thus improving overall accuracy. As a matter of fact, the counts of elements classified as heavy hitters are stored in distinct buckets. The experimental results, presented through graphs comparing accuracy and mean error in the three cases, support the effectiveness of these solutions in improving the performance of the Count-Min Sketch.
Nell'era dei big data, l'analisi efficiente di grandi volumi di dati richiede strutture dati probabilistiche, che offrono un compromesso tra accuratezza e risorse. Tra queste strutture, in questa tesi si è posta l'attenzione su Count-Min Sketch, che consente una rappresentazione compatta di un data stream, utilizzando uno spazio di memoria sublineare rispetto alle dimensioni dell'input e rispondendo efficientemente a diverse tipologie di query (come point, range e inner product). A causa della grandezza ridotta di tale sketch, tuttavia, quest'ultimo fornisce solo risposte approssimate, caratterizzate da un fattore di errore per ogni query. Dopo aver illustrato, quindi, le principali caratteristiche, applicazioni, implementazione e prestazioni di Count-Min Sketch, la tesi esplora e implementa due tecniche per migliorare la precisione delle stime in risposta a point query: un oracolo e un predittore. L'oracolo identifica con precisione gli "heavy hitter", ossia gli elementi più frequenti nel flusso di dati, mentre il predittore utilizza un secondo, più piccolo Count-Min Sketch per classificare approssimativamente tali elementi. Grazie alla previsione dell'identità di ciascun item, quindi, è possibile ridurre le collisioni nello sketch originale, migliorando l'accuratezza complessiva. Con queste strategie, infatti, i conteggi di elementi classificati come heavy hitter vengono salvati in bucket distinti. I risultati sperimentali, presentati attraverso grafici che confrontano l'accuratezza e l'errore medio nei tre scenari, supportano, infine, l'efficacia di queste soluzioni nel migliorare le performance di Count-Min Sketch.
Studio di algoritmi con predizione per la stima di elementi frequenti in uno stream
BERALDO, GIULIA
2023/2024
Abstract
In the era of big data, efficiently analyzing large volumes of data requires probabilistic data structures, which offer a trade-off between accuracy and resources. This thesis focuses on the Count-Min Sketch, a structure that allows for compact representation of a data stream, providing fast responses to various types of queries (such as point, range, and inner product) while using sublinear memory space relative to the input size. However, due to the reduced size of the sketch, the responses are only approximate and characterized by an error factor for each query. After outlining the main characteristics, applications, implementation, and performance of the Count-Min Sketch, the thesis explores and implements two approaches to improve the accuracy of point query estimates: an oracle and a predictor. The oracle accurately identifies the "heavy hitters", the most frequent elements in the data stream, while the predictor uses a second, smaller Count-Min Sketch to roughly classify those elements. By predicting the identity of each item, collisions in the original sketch are reduced, thus improving overall accuracy. As a matter of fact, the counts of elements classified as heavy hitters are stored in distinct buckets. The experimental results, presented through graphs comparing accuracy and mean error in the three cases, support the effectiveness of these solutions in improving the performance of the Count-Min Sketch.File | Dimensione | Formato | |
---|---|---|---|
Beraldo_Giulia.pdf
accesso aperto
Dimensione
1.73 MB
Formato
Adobe PDF
|
1.73 MB | Adobe PDF | Visualizza/Apri |
The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License
https://hdl.handle.net/20.500.12608/72149