Efficient approximate nearest neighbour (ANN) search is a cornerstone of modern machine learning, data retrieval, and high-dimensional data analysis. Among the many methods to solve the problem, approaches based on locality-sensitive hashing (LSH) functions are among the most studied and widely used in academic contexts, owing to their inherent nature of provid- ing formal probabilistic guarantees for the quality of the solution. But traditional LSH meth- ods, while effective, often suffer from suboptimal performance due to hash collisions between distant points and struggles with query points that lie far from the dataset points. This the- sis addresses these limitations by proposing a novel algorithm, which we call CLANN, that integrates k-center clustering with PUFFINN, an optimized LSH implementation. By first partitioning data into clusters using k-center clustering and then applying PUFFINN indexing within each cluster, CLANN aims to achieve improved query performance while pre- serving LSH’s probabilistic guarantees. The central theoretical contribution is a framework for adaptively partitioning the data space to reduce unnecessary hash computations while main- taining guarantees of accuracy and efficiency. This work will analyze the theoretical bounds of the proposed method and evaluate its performance against existing approaches across various high-dimensional datasets.
Efficient approximate nearest neighbour (ANN) search is a cornerstone of modern machine learning, data retrieval, and high-dimensional data analysis. Among the many methods to solve the problem, approaches based on locality-sensitive hashing (LSH) functions are among the most studied and widely used in academic contexts, owing to their inherent nature of provid- ing formal probabilistic guarantees for the quality of the solution. But traditional LSH meth- ods, while effective, often suffer from suboptimal performance due to hash collisions between distant points and struggles with query points that lie far from the dataset points. This the- sis addresses these limitations by proposing a novel algorithm, which we call CLANN, that integrates k-center clustering with PUFFINN, an optimized LSH implementation. By first partitioning data into clusters using k-center clustering and then applying PUFFINN indexing within each cluster, CLANN aims to achieve improved query performance while pre- serving LSH’s probabilistic guarantees. The central theoretical contribution is a framework for adaptively partitioning the data space to reduce unnecessary hash computations while main- taining guarantees of accuracy and efficiency. This work will analyze the theoretical bounds of the proposed method and evaluate its performance against existing approaches across various high-dimensional datasets.
CLANN: Clustered LSH-based Algorithm for the Nearest Neighbors problem
BOLZONELLO, ENRICO
2024/2025
Abstract
Efficient approximate nearest neighbour (ANN) search is a cornerstone of modern machine learning, data retrieval, and high-dimensional data analysis. Among the many methods to solve the problem, approaches based on locality-sensitive hashing (LSH) functions are among the most studied and widely used in academic contexts, owing to their inherent nature of provid- ing formal probabilistic guarantees for the quality of the solution. But traditional LSH meth- ods, while effective, often suffer from suboptimal performance due to hash collisions between distant points and struggles with query points that lie far from the dataset points. This the- sis addresses these limitations by proposing a novel algorithm, which we call CLANN, that integrates k-center clustering with PUFFINN, an optimized LSH implementation. By first partitioning data into clusters using k-center clustering and then applying PUFFINN indexing within each cluster, CLANN aims to achieve improved query performance while pre- serving LSH’s probabilistic guarantees. The central theoretical contribution is a framework for adaptively partitioning the data space to reduce unnecessary hash computations while main- taining guarantees of accuracy and efficiency. This work will analyze the theoretical bounds of the proposed method and evaluate its performance against existing approaches across various high-dimensional datasets.File | Dimensione | Formato | |
---|---|---|---|
Bolzonello_Enrico.pdf
accesso aperto
Dimensione
5.77 MB
Formato
Adobe PDF
|
5.77 MB | Adobe PDF | Visualizza/Apri |
The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License
https://hdl.handle.net/20.500.12608/83732