Development of Machine Learning models to predict relapse in order to assess the time to cure after a Mycobacterium tuberculosis infection, using Relapsing Mouse Model (RMM) data

Tuberculosis (TB) remains one of the most prevalent respiratory diseases and is responsible for over one million deaths worldwide each year. The causative agent, Mycobacterium tuberculosis (Mtb), is usually spread through the air by coughing and is highly infectious. Among the different forms of TB, multidrug-resistant ones are the hardest to treat, since therapy is complex, long, and poorly tolerated. The long duration of currently used treatments, in particular, highlights the need for new, shorter regimens that improve cure rates and lower costs. In this context, the final goal when evaluating new drug regimens is to rank them by the treatment time required to achieve complete sterilization, that is, no evidence of relapse (defined as the recurrence of Mtb bacterial growth) after an off-treatment period. A widely used metric for this purpose is T90, the treatment time needed to reach a 90% probability of cure (i.e. a 10% probability of relapse). In the literature, T90 is calculated using descriptive models based on logistic approaches, which rely on relapse data usually collected 3 months after the end of treatment. The goal of this work is instead to explore machine learning (ML) techniques for predicting relapse from both experimental and treatment-characterizing variables, with the aim of anticipating relapse information without waiting for the experimental relapse assessment. Because such experiments are highly demanding in terms of resources, this is in line with the 3R principles (Replacement, Reduction and Refinement). From these models it will be possible to derive T90 values for each regimen and, consequently, to rank treatments by efficacy.
Specifically, three distinct ML approaches, namely logistic regression, random forest and boosting, are applied to Relapsing Mouse Model (RMM) data collected in four different studies. This thesis is structured in five chapters. Chapter 1 introduces Tuberculosis, covering its variants and standard treatment protocols, and then explains the RMM; the current state of the art in T90 derivation is then described, leading to the definition of the thesis objectives. In Chapter 2, the dataset, consisting of four different RMM studies, is analyzed and pre-processed. Chapter 3 describes the theory behind the adopted ML techniques and evaluation metrics, and also details the T90 calculation. Chapter 4 presents the modeling results, organized by technique, with a specific focus on ranking the regimens by their derived T90 values. Finally, in Chapter 5, the key findings and potential future developments of this thesis are reported.
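As an illustration of how a T90 value can follow from a logistic-type relapse model, the minimal sketch below assumes the simplest possible case: relapse probability depends on treatment duration alone through an intercept and a slope. The function names, the single-covariate form, and the coefficient values are hypothetical and chosen only for illustration; they are not the models or the data used in this thesis.

```python
import math


def relapse_probability(duration, intercept, slope):
    """Logistic model: predicted relapse probability for a given treatment duration."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * duration)))


def t90(intercept, slope, target_relapse=0.10):
    """Treatment duration at which the predicted relapse probability equals target_relapse."""
    if slope >= 0:
        # Relapse must decrease with longer treatment for T90 to be meaningful.
        raise ValueError("slope must be negative")
    logit_target = math.log(target_relapse / (1.0 - target_relapse))
    # Solve intercept + slope * duration = logit(target_relapse) for duration.
    return (logit_target - intercept) / slope


# Hypothetical (intercept, slope) pairs for two illustrative regimens; not fitted to real data.
regimens = {"regimen_A": (4.0, -0.5), "regimen_B": (3.0, -0.6)}

# Rank regimens from shortest to longest T90 (shorter means faster cure).
ranking = sorted(regimens, key=lambda name: t90(*regimens[name]))
```

Under these assumptions, a regimen with a shorter T90 ranks as more effective. With more covariates or a non-parametric model such as random forest or boosting, the same threshold would generally have to be located numerically on the predicted relapse curve rather than in closed form.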
