Studio di algoritmi con predizione per la stima di elementi frequenti
in uno stream

In the era of big data, efficiently analyzing large volumes of data requires probabilistic data structures, which offer a trade-off between accuracy and resources. This thesis focuses on the Count-Min Sketch, a structure that allows for compact representation of a data stream, providing fast responses to various types of queries (such as point, range, and inner product) while using sublinear memory space relative to the input size. However, due to the reduced size of the sketch, the responses are only approximate and characterized by an error factor for each query. After outlining the main characteristics, applications, implementation, and performance of the Count-Min Sketch, the thesis explores and implements two approaches to improve the accuracy of point query estimates: an oracle and a predictor. The oracle accurately identifies the "heavy hitters", the most frequent elements in the data stream, while the predictor uses a second, smaller Count-Min Sketch to roughly classify those elements. By predicting the identity of each item, collisions in the original sketch are reduced, thus improving overall accuracy. As a matter of fact, the counts of elements classified as heavy hitters are stored in distinct buckets. The experimental results, presented through graphs comparing accuracy and mean error in the three cases, support the effectiveness of these solutions in improving the performance of the Count-Min Sketch.

Nell'era dei big data, l'analisi efficiente di grandi volumi di dati richiede strutture dati probabilistiche, che offrono un compromesso tra accuratezza e risorse. Tra queste strutture, in questa tesi si è posta l'attenzione su Count-Min Sketch, che consente una rappresentazione compatta di un data stream, utilizzando uno spazio di memoria sublineare rispetto alle dimensioni dell'input e rispondendo efficientemente a diverse tipologie di query (come point, range e inner product). A causa della grandezza ridotta di tale sketch, tuttavia, quest'ultimo fornisce solo risposte approssimate, caratterizzate da un fattore di errore per ogni query. Dopo aver illustrato, quindi, le principali caratteristiche, applicazioni, implementazione e prestazioni di Count-Min Sketch, la tesi esplora e implementa due tecniche per migliorare la precisione delle stime in risposta a point query: un oracolo e un predittore. L'oracolo identifica con precisione gli "heavy hitter", ossia gli elementi più frequenti nel flusso di dati, mentre il predittore utilizza un secondo, più piccolo Count-Min Sketch per classificare approssimativamente tali elementi. Grazie alla previsione dell'identità di ciascun item, quindi, è possibile ridurre le collisioni nello sketch originale, migliorando l'accuratezza complessiva. Con queste strategie, infatti, i conteggi di elementi classificati come heavy hitter vengono salvati in bucket distinti. I risultati sperimentali, presentati attraverso grafici che confrontano l'accuratezza e l'errore medio nei tre scenari, supportano, infine, l'efficacia di queste soluzioni nel migliorare le performance di Count-Min Sketch.

Studio di algoritmi con predizione per la stima di elementi frequenti in uno stream

BERALDO, GIULIA

2023/2024

Abstract

In the era of big data, efficiently analyzing large volumes of data requires probabilistic data structures, which offer a trade-off between accuracy and resources. This thesis focuses on the Count-Min Sketch, a structure that allows for compact representation of a data stream, providing fast responses to various types of queries (such as point, range, and inner product) while using sublinear memory space relative to the input size. However, due to the reduced size of the sketch, the responses are only approximate and characterized by an error factor for each query. After outlining the main characteristics, applications, implementation, and performance of the Count-Min Sketch, the thesis explores and implements two approaches to improve the accuracy of point query estimates: an oracle and a predictor. The oracle accurately identifies the "heavy hitters", the most frequent elements in the data stream, while the predictor uses a second, smaller Count-Min Sketch to roughly classify those elements. By predicting the identity of each item, collisions in the original sketch are reduced, thus improving overall accuracy. As a matter of fact, the counts of elements classified as heavy hitters are stored in distinct buckets. The experimental results, presented through graphs comparing accuracy and mean error in the three cases, support the effectiveness of these solutions in improving the performance of the Count-Min Sketch.

Scheda

Scheda DC

	Facoltà/Dipartimento
	
				Dipartimento di Ingegneria dell'Informazione - DEI
			
	Corso di studio
	
				INGEGNERIA INFORMATICA Laurea di Primo Livello (D.M. 270/2004)
			
	Anno Accademico
	
				2023
			
	Titolo inglese
	
				Study of prediction algorithms for estimating frequent elements in a data stream
			
	Abstract in italiano
	
				Nell'era dei big data, l'analisi efficiente di grandi volumi di dati richiede strutture dati probabilistiche, che offrono un compromesso tra accuratezza e risorse. Tra queste strutture, in questa tesi si è posta l'attenzione su Count-Min Sketch, che consente una rappresentazione compatta di un data stream, utilizzando uno spazio di memoria sublineare rispetto alle dimensioni dell'input e rispondendo efficientemente a diverse tipologie di query (come point, range e inner product). A causa della grandezza ridotta di tale sketch, tuttavia, quest'ultimo fornisce solo risposte approssimate, caratterizzate da un fattore di errore per ogni query.
Dopo aver illustrato, quindi, le principali caratteristiche, applicazioni, implementazione e prestazioni di Count-Min Sketch, la tesi esplora e implementa due tecniche per migliorare la precisione delle stime in risposta a point query: un oracolo e un predittore. L'oracolo identifica con precisione gli "heavy hitter", ossia gli elementi più frequenti nel flusso di dati, mentre il predittore utilizza un secondo, più piccolo Count-Min Sketch per classificare approssimativamente tali elementi. Grazie alla previsione dell'identità di ciascun item, quindi, è possibile ridurre le collisioni nello sketch originale, migliorando l'accuratezza complessiva. Con queste strategie, infatti, i conteggi di elementi classificati come heavy hitter vengono salvati in bucket distinti. 
I risultati sperimentali, presentati attraverso grafici che confrontano l'accuratezza e l'errore medio nei tre scenari, supportano, infine, l'efficacia di queste soluzioni nel migliorare le performance di Count-Min Sketch.
			
	Parola chiave
	
				Count-Min sketch
data stream
struttura dati
predizione
algoritmi
			
	Relatore
	
				VANDIN, FABIO
			
	Correlatore
	
				PELLEGRINA, LEONARDO
			
	Appare nelle tipologie:
	
				Lauree triennali

File in questo prodotto:

File	Dimensione	Formato
Beraldo_Giulia.pdf accesso aperto Dimensione 1.73 MB Formato Adobe PDF Visualizza/Apri	1.73 MB	Adobe PDF	Visualizza/Apri

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12608/72149

Studio di algoritmi con predizione per la stima di elementi frequenti in uno stream

BERALDO, GIULIA

2023/2024

Abstract

Scheda Scheda DC

Informazioni

Conferma cancellazione

Scheda

Scheda DC