CLANN: Clustered LSH-based Algorithm for the Nearest Neighbors problem

Efficient approximate nearest neighbour (ANN) search is a cornerstone of modern machine learning, data retrieval, and high-dimensional data analysis. Among the many methods to solve the problem, approaches based on locality-sensitive hashing (LSH) functions are among the most studied and widely used in academic contexts, owing to their inherent nature of provid- ing formal probabilistic guarantees for the quality of the solution. But traditional LSH meth- ods, while effective, often suffer from suboptimal performance due to hash collisions between distant points and struggles with query points that lie far from the dataset points. This the- sis addresses these limitations by proposing a novel algorithm, which we call CLANN, that integrates k-center clustering with PUFFINN, an optimized LSH implementation. By first partitioning data into clusters using k-center clustering and then applying PUFFINN indexing within each cluster, CLANN aims to achieve improved query performance while pre- serving LSH’s probabilistic guarantees. The central theoretical contribution is a framework for adaptively partitioning the data space to reduce unnecessary hash computations while main- taining guarantees of accuracy and efficiency. This work will analyze the theoretical bounds of the proposed method and evaluate its performance against existing approaches across various high-dimensional datasets.

CLANN: Clustered LSH-based Algorithm for the Nearest Neighbors problem

BOLZONELLO, ENRICO

2024/2025

Abstract

Efficient approximate nearest neighbour (ANN) search is a cornerstone of modern machine learning, data retrieval, and high-dimensional data analysis. Among the many methods to solve the problem, approaches based on locality-sensitive hashing (LSH) functions are among the most studied and widely used in academic contexts, owing to their inherent nature of provid- ing formal probabilistic guarantees for the quality of the solution. But traditional LSH meth- ods, while effective, often suffer from suboptimal performance due to hash collisions between distant points and struggles with query points that lie far from the dataset points. This the- sis addresses these limitations by proposing a novel algorithm, which we call CLANN, that integrates k-center clustering with PUFFINN, an optimized LSH implementation. By first partitioning data into clusters using k-center clustering and then applying PUFFINN indexing within each cluster, CLANN aims to achieve improved query performance while pre- serving LSH’s probabilistic guarantees. The central theoretical contribution is a framework for adaptively partitioning the data space to reduce unnecessary hash computations while main- taining guarantees of accuracy and efficiency. This work will analyze the theoretical bounds of the proposed method and evaluate its performance against existing approaches across various high-dimensional datasets.

Scheda

Scheda DC

	Facoltà/Dipartimento
	
				Dipartimento di Ingegneria dell'Informazione - DEI
			
	Corso di studio
	
				COMPUTER ENGINEERING Laurea Magistrale (D.M. 270/2004)
			
	Anno Accademico
	
				2024
			
	Titolo inglese
	
				CLANN: Clustered LSH-based Algorithm for the Nearest Neighbors problem
			
	Abstract in italiano
	
				Efficient approximate nearest neighbour (ANN) search is a cornerstone of modern machine
learning, data retrieval, and high-dimensional data analysis. Among the many methods to solve
the problem, approaches based on locality-sensitive hashing (LSH) functions are among the
most studied and widely used in academic contexts, owing to their inherent nature of provid-
ing formal probabilistic guarantees for the quality of the solution. But traditional LSH meth-
ods, while effective, often suffer from suboptimal performance due to hash collisions between
distant points and struggles with query points that lie far from the dataset points. This the-
sis addresses these limitations by proposing a novel algorithm, which we call CLANN, that
integrates k-center clustering with PUFFINN, an optimized LSH implementation.
By first partitioning data into clusters using k-center clustering and then applying PUFFINN
indexing within each cluster, CLANN aims to achieve improved query performance while pre-
serving LSH’s probabilistic guarantees. The central theoretical contribution is a framework for
adaptively partitioning the data space to reduce unnecessary hash computations while main-
taining guarantees of accuracy and efficiency. This work will analyze the theoretical bounds of
the proposed method and evaluate its performance against existing approaches across various
high-dimensional datasets.
			
	Parola chiave
	
				Nearest Neighbor
LSH
Clustering
			
	Relatore
	
				CECCARELLO, MATTEO
			
	Appare nelle tipologie:
	
				Lauree magistrali

File in questo prodotto:

File	Dimensione	Formato
Bolzonello_Enrico.pdf accesso aperto Dimensione 5.77 MB Formato Adobe PDF Visualizza/Apri	5.77 MB	Adobe PDF	Visualizza/Apri

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12608/83732