Efficient approximate nearest neighbour (ANN) search is a cornerstone of modern machine learning, data retrieval, and high-dimensional data analysis. Among the many methods to solve the problem, approaches based on locality-sensitive hashing (LSH) functions are among the most studied and widely used in academic contexts, owing to their inherent nature of provid- ing formal probabilistic guarantees for the quality of the solution. But traditional LSH meth- ods, while effective, often suffer from suboptimal performance due to hash collisions between distant points and struggles with query points that lie far from the dataset points. This the- sis addresses these limitations by proposing a novel algorithm, which we call CLANN, that integrates k-center clustering with PUFFINN, an optimized LSH implementation. By first partitioning data into clusters using k-center clustering and then applying PUFFINN indexing within each cluster, CLANN aims to achieve improved query performance while pre- serving LSH’s probabilistic guarantees. The central theoretical contribution is a framework for adaptively partitioning the data space to reduce unnecessary hash computations while main- taining guarantees of accuracy and efficiency. This work will analyze the theoretical bounds of the proposed method and evaluate its performance against existing approaches across various high-dimensional datasets.

Efficient approximate nearest neighbour (ANN) search is a cornerstone of modern machine learning, data retrieval, and high-dimensional data analysis. Among the many methods to solve the problem, approaches based on locality-sensitive hashing (LSH) functions are among the most studied and widely used in academic contexts, owing to their inherent nature of provid- ing formal probabilistic guarantees for the quality of the solution. But traditional LSH meth- ods, while effective, often suffer from suboptimal performance due to hash collisions between distant points and struggles with query points that lie far from the dataset points. This the- sis addresses these limitations by proposing a novel algorithm, which we call CLANN, that integrates k-center clustering with PUFFINN, an optimized LSH implementation. By first partitioning data into clusters using k-center clustering and then applying PUFFINN indexing within each cluster, CLANN aims to achieve improved query performance while pre- serving LSH’s probabilistic guarantees. The central theoretical contribution is a framework for adaptively partitioning the data space to reduce unnecessary hash computations while main- taining guarantees of accuracy and efficiency. This work will analyze the theoretical bounds of the proposed method and evaluate its performance against existing approaches across various high-dimensional datasets.

CLANN: Clustered LSH-based Algorithm for the Nearest Neighbors problem

BOLZONELLO, ENRICO
2024/2025

Abstract

Efficient approximate nearest neighbour (ANN) search is a cornerstone of modern machine learning, data retrieval, and high-dimensional data analysis. Among the many methods to solve the problem, approaches based on locality-sensitive hashing (LSH) functions are among the most studied and widely used in academic contexts, owing to their inherent nature of provid- ing formal probabilistic guarantees for the quality of the solution. But traditional LSH meth- ods, while effective, often suffer from suboptimal performance due to hash collisions between distant points and struggles with query points that lie far from the dataset points. This the- sis addresses these limitations by proposing a novel algorithm, which we call CLANN, that integrates k-center clustering with PUFFINN, an optimized LSH implementation. By first partitioning data into clusters using k-center clustering and then applying PUFFINN indexing within each cluster, CLANN aims to achieve improved query performance while pre- serving LSH’s probabilistic guarantees. The central theoretical contribution is a framework for adaptively partitioning the data space to reduce unnecessary hash computations while main- taining guarantees of accuracy and efficiency. This work will analyze the theoretical bounds of the proposed method and evaluate its performance against existing approaches across various high-dimensional datasets.
2024
CLANN: Clustered LSH-based Algorithm for the Nearest Neighbors problem
Efficient approximate nearest neighbour (ANN) search is a cornerstone of modern machine learning, data retrieval, and high-dimensional data analysis. Among the many methods to solve the problem, approaches based on locality-sensitive hashing (LSH) functions are among the most studied and widely used in academic contexts, owing to their inherent nature of provid- ing formal probabilistic guarantees for the quality of the solution. But traditional LSH meth- ods, while effective, often suffer from suboptimal performance due to hash collisions between distant points and struggles with query points that lie far from the dataset points. This the- sis addresses these limitations by proposing a novel algorithm, which we call CLANN, that integrates k-center clustering with PUFFINN, an optimized LSH implementation. By first partitioning data into clusters using k-center clustering and then applying PUFFINN indexing within each cluster, CLANN aims to achieve improved query performance while pre- serving LSH’s probabilistic guarantees. The central theoretical contribution is a framework for adaptively partitioning the data space to reduce unnecessary hash computations while main- taining guarantees of accuracy and efficiency. This work will analyze the theoretical bounds of the proposed method and evaluate its performance against existing approaches across various high-dimensional datasets.
Nearest Neighbor
LSH
Clustering
File in questo prodotto:
File Dimensione Formato  
Bolzonello_Enrico.pdf

accesso aperto

Dimensione 5.77 MB
Formato Adobe PDF
5.77 MB Adobe PDF Visualizza/Apri

The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12608/83732