Statistics of unseen variants in genomics: mathematical modeling and data analysis

In recent years, quantitative approaches have gained increasing importance in genomics research because they make it possible to interpret and characterize the vast amount of genetic data that new sequencing technologies have made available. Statistical tools in particular play a central role in extracting information on the variability of mutations in human DNA. Such mutations may lead to tumor onset and progression, so there is a need for analytical methods that quantify the genetic heterogeneity of a tumor, knowledge of which may be crucial for designing the best therapeutic strategy. Within this framework, the present thesis aims to infer the statistical description of a genetic region from only a few samples. To this end, we present an ecology-inspired method to predict the number of mutations in a DNA sequence or in a whole tumor (global scale) from presence/absence information collected in a portion of the region (local scale). Our model assumes the neutral hypothesis of demographic equivalence of mutations and works within a parametric framework in which the global RSA, i.e. the frequency of mutations at a given occurrence abundance, follows a Negative Binomial distribution. This choice is justified both by the derivation of the Negative Binomial as the steady-state solution of a biological birth-and-death process and by its versatility in accommodating different empirical RSA shapes (power law, Log-Series, unimodal). Under the hypothesis of demographic equivalence of mutations, it can be proved that the Negative Binomial is form invariant, i.e. a random subsample is still described by a Negative Binomial distribution. In other words, if the global-scale RSA is a Negative Binomial, so is the local-scale RSA.
It follows that we can obtain a computable formula linking the RSA parameters at the local scale to those at the global scale, which we exploit to derive an unbiased and consistent estimator of the number of global mutations in the genetic region of interest. Finally, simulations on both DNA single-nucleotide polymorphism and synthetic spatial tumor growth datasets were performed to test our framework. Their promising results support the stability and reliability of the proposed method in the genetic field.
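
As an illustration of the form-invariance property and of the resulting upscaling step, the following Python sketch uses the standard binomial-thinning result for the Negative Binomial. It assumes the parameterization P(n) = C(n+r-1, n) xi^n (1-xi)^r with r > 0, and that each occurrence of a mutation is retained independently with probability p (the sampled fraction). Under these assumptions the local RSA keeps the same r, with xi_p = p*xi / (1 - xi*(1-p)), and the fraction of mutations seen at least once is 1 - (1-xi)^r. The function names, the parameterization, and the numerical values are illustrative assumptions, not taken from the thesis, whose exact formula and estimator may differ.

```python
import math


def nb_pmf(n: int, r: float, xi: float) -> float:
    """Negative Binomial pmf C(n+r-1, n) * xi^n * (1-xi)^r, for real r > 0."""
    log_p = (math.lgamma(n + r) - math.lgamma(r) - math.lgamma(n + 1)
             + n * math.log(xi) + r * math.log(1.0 - xi))
    return math.exp(log_p)


def local_xi(xi: float, p: float) -> float:
    """Local-scale parameter obtained by sampling a fraction p of the region."""
    return p * xi / (1.0 - xi * (1.0 - p))


def global_xi(xi_p: float, p: float) -> float:
    """Inverse of local_xi: recover the global parameter from the local one."""
    return xi_p / (p + xi_p * (1.0 - p))


def thinned_pmf(k: int, r: float, xi: float, p: float, n_max: int = 2000) -> float:
    """Probability of seeing k occurrences locally, computed by brute force:
    sum over global abundances n of NB(n) * Binomial(k | n, p), with 0 < p < 1.
    The sum is truncated at n_max (an assumption that the tail is negligible)."""
    total = 0.0
    for n in range(k, n_max + 1):
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                     + k * math.log(p) + (n - k) * math.log(1.0 - p))
        total += nb_pmf(n, r, xi) * math.exp(log_binom)
    return total


def upscale_richness(s_local: float, r: float, xi_p: float, p: float) -> float:
    """Estimate the global number of distinct mutations from the number
    observed locally, using the ratio of 'seen at least once' probabilities."""
    xi = global_xi(xi_p, p)
    return s_local * (1.0 - (1.0 - xi) ** r) / (1.0 - (1.0 - xi_p) ** r)


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the functions.
    r, xi, p = 1.5, 0.9, 0.1
    xi_p = local_xi(xi, p)
    for k in range(4):
        # The brute-force subsample and the closed-form NB should agree.
        print(k, thinned_pmf(k, r, xi, p), nb_pmf(k, r, xi_p))
    print(upscale_richness(s_local=500.0, r=r, xi_p=xi_p, p=p))
```

In practice r and xi_p would first be fitted to the locally observed RSA (for example by maximum likelihood on the distribution truncated at n >= 1); that fitting step is omitted here.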

Fochesato, Anna

2019/2020
