Explainable Machine Learning for Classification Tasks in Genomics

FEIZYAB, SARA
2024/2025

Abstract

A clinical variant is a change in a patient's DNA sequence that is evaluated for its possible role in disease. Correct interpretation of these variants is central to genetic diagnosis, prognosis, and family counselling, but classifications are not static. As new evidence accumulates, many variants in databases such as ClinVar are reclassified, creating pressure on clinical laboratories to revisit past reports. This thesis asks whether information already available at an earlier time can be used to predict which variants are most likely to be reclassified later. The study uses two ClinVar releases (2018-12 and 2025-09) to construct a time-aware benchmark for reclassification prediction. Variants are aligned by their stable identifier (VariationID), and clinical significance in each snapshot is mapped to three categories compatible with ACMG/AMP guidelines: pathogenic or likely pathogenic, variant of uncertain significance, and likely benign or benign. A variant is labelled as undergoing a clinically meaningful reclassification if its classification moves across these categories between 2018 and 2025; variants that change only within a category or receive ambiguous labels are not counted. Features are limited to summary metadata available in 2018: review status, number of submitters, variant type, and standardised prior clinical significance. A temporal split based on the 2018 LastEvaluated date ensures that models are trained only on information available at that time and evaluated on a later slice that mimics prospective use. On this imbalanced task (about 3% positives), simple tabular models such as logistic regression recover a meaningful signal. On the held-out test set, the model achieves an average precision of about 0.062, compared with about 0.032 expected from random ranking at the observed prevalence, corresponding to roughly a twofold enrichment of future reclassifications among top-ranked variants.
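The labelling rule and temporal split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis code: the significance strings in `TIER`, the record layout, the helper names, the toy records, and the cutoff date are all hypothetical and chosen only for the example.

```python
from datetime import date

# Hypothetical mapping from clinical-significance strings to three tiers.
# Strings not listed here (e.g. conflicting interpretations) count as ambiguous.
TIER = {
    "pathogenic": "P/LP",
    "likely pathogenic": "P/LP",
    "pathogenic/likely pathogenic": "P/LP",
    "uncertain significance": "VUS",
    "likely benign": "B/LB",
    "benign": "B/LB",
    "benign/likely benign": "B/LB",
}


def to_tier(significance):
    """Return the tier for a significance string, or None if ambiguous."""
    return TIER.get(significance.strip().lower())


def label_reclassified(sig_early, sig_late):
    """1 if the tier differs between snapshots, 0 if not, None if ambiguous."""
    early, late = to_tier(sig_early), to_tier(sig_late)
    if early is None or late is None:
        return None
    return int(early != late)


def build_examples(early, late):
    """Align two snapshots on VariationID and attach reclassification labels.

    early: dict VariationID -> (significance, last_evaluated date)
    late:  dict VariationID -> significance
    """
    rows = []
    for vid, (sig, last_eval) in early.items():
        if vid not in late:
            continue  # variant absent from the later snapshot
        y = label_reclassified(sig, late[vid])
        if y is not None:
            rows.append({"variation_id": vid, "last_evaluated": last_eval, "label": y})
    return rows


def temporal_split(rows, cutoff):
    """Train on records last evaluated before the cutoff, test on the rest."""
    train = [r for r in rows if r["last_evaluated"] < cutoff]
    test = [r for r in rows if r["last_evaluated"] >= cutoff]
    return train, test


# Toy snapshots (invented records, not ClinVar data).
early = {
    101: ("Uncertain significance", date(2016, 5, 1)),
    102: ("Likely pathogenic", date(2017, 3, 9)),
    103: ("Benign", date(2018, 6, 20)),
    104: ("Conflicting interpretations of pathogenicity", date(2018, 7, 2)),
    105: ("Uncertain significance", date(2018, 8, 15)),
}
late = {101: "Likely benign", 102: "Pathogenic", 103: "Benign",
        104: "Benign", 105: "Pathogenic", 999: "Benign"}

rows = build_examples(early, late)
train, test = temporal_split(rows, cutoff=date(2018, 1, 1))  # cutoff is an assumption
```

In this toy example, variant 102 moves from likely pathogenic to pathogenic and is labelled 0 because it stays within one tier, while variant 104 is dropped because its earlier label is ambiguous.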
At moderate recall (around 0.74), the model produces a low-precision but enriched shortlist of variants to re-review, with most benefit in variants of uncertain significance and low-evidence review categories, and few flags among expert-panel or concordant multi-submitter entries. Together, these results define a leakage-controlled benchmark and transparent baseline models for ClinVar reclassification using only public, temporally valid metadata, and provide an extensible framework for future work on more expressive models and richer feature sets.
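The ranking metrics quoted in the abstract can be illustrated with a short, dependency-free Python sketch. It is an assumption-laden example rather than the evaluation code of the thesis: the function names, the toy labels, and the toy scores are invented, and ties between scores are not handled specially. The only figures taken from the abstract are the average precision (about 0.062), the prevalence baseline (about 0.032), and the recall level (about 0.74).

```python
def ranked_indices(scores):
    """Indices sorted from the highest to the lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    hits, total = 0, 0.0
    for rank, i in enumerate(ranked_indices(scores), start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def precision_at_recall(y_true, scores, target_recall):
    """Precision and size of the shortest top-ranked list reaching the target recall."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("no positive examples")
    hits = 0
    for rank, i in enumerate(ranked_indices(scores), start=1):
        hits += y_true[i]
        if hits / n_pos >= target_recall:
            return hits / rank, rank
    raise ValueError("target recall cannot be reached")


# Toy labels and scores (invented): 2 positives among 10 variants.
y_true = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.10, 0.90, 0.80, 0.20, 0.40, 0.30, 0.05, 0.15, 0.25, 0.35]

ap = average_precision(y_true, scores)
prevalence = sum(y_true) / len(y_true)  # approximate AP of a random ranking
enrichment = ap / prevalence
precision, shortlist_size = precision_at_recall(y_true, scores, target_recall=0.74)

# The same ratio applied to the figures reported in the abstract.
reported_enrichment = 0.062 / 0.032  # about 1.94, i.e. roughly twofold
```

Comparing average precision with prevalence, rather than reading it on an absolute scale, is what makes a value as small as 0.062 interpretable on a task with about 3% positives.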
Explainable AI
ClinVar
Genomic Variants
Files in this item:
File: Explainable Machine Learning for Classification Tasks in Genomics.pdf
Access: open access
Size: 2.62 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/102085