Classification and Automatic Annotation of Tandem Repeat Proteins in RepeatsDB

Abstract

Protein tandem repeats are structural elements with essential roles in cell adhesion, protein-protein interactions, and molecular recognition. Interest in these repetitive regions within structural biology and bioinformatics has led to specialized resources such as RepeatsDB, a curated database of annotated tandem repeat protein structures. In this study, we systematically analyzed the tandem repeats in RepeatsDB, focusing on Alpha-Solenoids and Beta-Propellers, to refine the existing classification system and deepen our understanding of these proteins. We began with a statistical analysis of how diverse and how populated the distinct repeat groups in the database are, and of how completely each group is annotated. This analysis was instrumental in addressing the many entries with missing annotation. We then compared structures by pairwise structural alignment and applied dimensionality reduction and visualization techniques to uncover structural relationships, which informed a refined classification system. We used the density-based clustering algorithm DBSCAN to establish structural similarity ranges for Clan members and to provide computational support for defining Clan boundaries. This method proved effective in detecting outlier entries and refining existing Clans, leading to the proposal of new repeat groups. Finally, we ran a supervised classification experiment with the K-Nearest Neighbors (KNN) algorithm, which enabled the automatic annotation of previously unannotated entries.
This study introduces an automatic annotation methodology that improves the efficiency of RepeatsDB curation and can be extended to other bioinformatics applications. The findings contribute to a more comprehensive understanding of protein tandem repeats and offer insights for future research in structural biology and bioinformatics.
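
The two computational steps named in the abstract, density-based clustering and nearest-neighbor annotation, can be illustrated with a minimal sketch. This is not the study's implementation: the distance values, the `eps`, `min_samples`, and `k` settings, and the labels `clan_a` and `clan_b` are all hypothetical, and the code assumes only that pairwise structural alignment has already been turned into a distance between entries. It uses the Python standard library alone so that the logic is visible; in practice a library such as scikit-learn would be a typical choice.

```python
from collections import Counter, deque

NOISE = -1  # label given to outlier entries


def dbscan_precomputed(dist, eps, min_samples):
    """Cluster entries from a square distance matrix; NOISE marks outliers."""
    n = len(dist)
    # A neighbourhood includes the point itself, as in the usual DBSCAN definition.
    neighbours = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
        cluster += 1
    return labels


def knn_predict(dist_to_labelled, labels, k):
    """Majority vote among the k labelled entries closest to the query."""
    nearest = sorted(range(len(labels)), key=lambda j: dist_to_labelled[j])[:k]
    votes = Counter(labels[j] for j in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two tight groups of three entries and one singleton.
groups = [0, 0, 0, 1, 1, 1, 2]


def toy_distance(i, j):
    if i == j:
        return 0.0
    return 0.1 if groups[i] == groups[j] else 0.8


dist = [[toy_distance(i, j) for j in range(7)] for i in range(7)]
cluster_labels = dbscan_precomputed(dist, eps=0.2, min_samples=3)
print(cluster_labels)  # [0, 0, 0, 1, 1, 1, -1]

# Hypothetical unannotated entry: its distances to six annotated entries.
annotated = ["clan_a"] * 3 + ["clan_b"] * 3
query_distances = [0.12, 0.20, 0.15, 0.70, 0.75, 0.90]
predicted = knn_predict(query_distances, annotated, k=3)
print(predicted)  # clan_a
```

In this toy example the singleton entry is labelled as noise, which mirrors the role of outlier detection in refining group boundaries, and the query entry inherits the label shared by its three nearest annotated neighbors.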
