Definition of circular RNA expression signatures of T-cell acute lymphoblastic leukemia molecular subtypes by multimodal transcriptomics study in a large pediatric patient cohort
CAREGARI, ALBERTO
2023/2024
Abstract
Leukemia is a malignancy characterized by the abnormal production and accumulation of blood cells of the lymphoid or myeloid lineage. Acute lymphoblastic leukemia is the most common cancer in children and adolescents. T-cell acute lymphoblastic leukemia (T-ALL), which arises from the malignant transformation of immature T-cells due to genetic abnormalities, is a particularly aggressive disease for which cure rates are still unsatisfactory. Genetic and genomic studies have shown that T-ALL is a heterogeneous disease: molecular subtypes of patients can be defined by the presence of distinct driver mutations or other lesions, and each subtype is associated with a distinctive expression profile of genes and of non-coding RNAs. Classifiers of T-ALL molecular subtypes are therefore being developed, mainly based on gene expression data. Recent evidence has shown that, in addition to genes, circular RNAs (circRNAs) are altered in T-ALL in a subtype-specific manner; they can thus be informative for classifying the disease and possibly for discovering new players in the disease mechanisms. In this thesis, we analyzed RNA-seq data with CirComPara2 to obtain a comprehensive characterization of both gene and circRNA expression profiles in the largest available T-ALL cohort, comprising samples collected at diagnosis from 264 pediatric patients. We then used gene and circRNA expression values to train a support vector machine to classify T-ALL samples into five distinct molecular subtypes (HOXA, IMM, TAL_LMO, TLX1, TLX3). The classification model was developed on three versions of the dataset (all samples, a subset of samples, and a version that also included simulated samples), following best-practice procedures of five-fold cross-validation and grid search for parameter optimization. The optimal model reached an accuracy of 94%.
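The tuning procedure named above (five-fold cross-validation combined with a grid search over parameters) can be sketched as follows. This is a minimal, dependency-free illustration and not the thesis pipeline: the data are simulated toy vectors, the folds are stratified by subtype (an assumption, the abstract does not say how folds were built), and a k-nearest-neighbours classifier stands in for the support vector machine so that the example runs with the Python standard library alone. All function names and the `n_neighbors` grid are hypothetical.

```python
import itertools
import random
from collections import Counter, defaultdict

SUBTYPES = ["HOXA", "IMM", "TAL_LMO", "TLX1", "TLX3"]


def make_toy_data(n_per_class=20, n_features=6, seed=0):
    """Simulated expression-like vectors: each subtype has one elevated feature."""
    rng = random.Random(seed)
    X, y = [], []
    for idx, label in enumerate(SUBTYPES):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[idx] += 4.0  # hypothetical subtype-specific marker
            X.append(row)
            y.append(label)
    return X, y


def stratified_folds(y, k=5, seed=0):
    """Assign sample indices to k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def knn_predict(train_X, train_y, x, n_neighbors):
    """Majority vote among the closest training samples (stand-in for the SVM)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:n_neighbors])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, folds, n_neighbors):
    """Mean accuracy on the held-out fold, averaged over all folds."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(y)) if i not in held]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], n_neighbors) == y[i]
            for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def grid_search(X, y, grid, k=5):
    """Score every parameter combination with the same k folds; keep the best."""
    folds = stratified_folds(y, k=k)
    results = {}
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        results[values] = cv_accuracy(X, y, folds, **params)
    best = max(results, key=results.get)
    return dict(zip(grid.keys(), best)), results[best]


X, y = make_toy_data()
best_params, best_score = grid_search(X, y, {"n_neighbors": [1, 3, 5]})
print(best_params, round(best_score, 3))
```

Reusing the same folds for every parameter combination keeps the comparison between grid points fair. With a real SVM the grid would more plausibly range over the kernel and regularization settings, which are not reported in the abstract.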
Furthermore, the model was explained using LIME, a model-agnostic interpretability technique, to extract and analyze the genes and circRNAs most relevant for T-ALL subtype classification. In this way, both known and new group-characteristic genes and transcripts were identified, providing information useful for further studies. For instance, in addition to the overexpression of HOXA cluster genes, already known to be the distinctive feature of the HOXA T-ALL subtype, the model explanation pinpointed genes not previously associated with this group, including the oncogene SKIDA1, as well as transcripts of almost uncharacterized function, including LINC02718 and circXPO1. We envisage that the natural continuation of this thesis work would be further validation of the model in an independent cohort, attempts to classify T-ALL precisely into more than five groups, and the classification of a subset of cases whose subtype is still unknown.

File | Size | Format
---|---|---
Caregari_Alberto.pdf (restricted access) | 3.24 MB | Adobe PDF
Website text © Università degli Studi di Padova. Full texts are published under a non-exclusive license; metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/64786