Deep articulatory vocal modelling. An encoder-decoder architecture.
NUCERA, VALERIO
2024/2025
Abstract
This work explores articulatory speech synthesis through a Deep Learning approach. Beginning with the history of articulatory models, it shows how modern imaging and AI technologies now make it possible to overcome the historical limitations of articulatory research. The thesis addresses both the "direct" problem of speech synthesis from anatomical parameters and the "inverse" problem of estimating articulator anatomy from speech. An Encoder/Decoder neural architecture is proposed: the Encoder generates speech from silent MRI videos of the vocal tract, while the Decoder reconstructs articulatory movements from audio. The discussion covers vocal tract modeling, deep learning fundamentals with a focus on convolutional neural networks, and the detailed design of the proposed model. The implementation is supported by documented code, and the results are validated through specific tests, including a Matlab algorithm developed for the anatomical interpretation of the Decoder's performance. The conclusions present possible clinical developments of this line of research.

| File | Size | Format |
|---|---|---|
| valerio_nucera.pdf (embargo until 29/08/2026) | 16.29 MB | Adobe PDF |
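The Encoder/Decoder pairing described in the abstract can be pictured as two mappings in opposite directions: video frames of the vocal tract to audio features (the "direct" problem), and audio features to articulatory parameters (the "inverse" problem). The sketch below illustrates only this data flow as shape transformations; the function names, feature dimensions, and the trivial mean-pooling stand-in for a learned CNN are all illustrative assumptions, not the thesis's actual implementation.

```python
def encode(video, n_audio_features=13):
    """Direct problem: MRI video (T frames of H x W pixels) ->
    one audio feature vector per frame. Each frame is collapsed to its
    mean pixel intensity and tiled, standing in for a learned CNN."""
    features = []
    for frame in video:                      # frame: H x W list of lists
        pixels = [p for row in frame for p in row]
        mean = sum(pixels) / len(pixels)
        features.append([mean] * n_audio_features)
    return features                          # T x n_audio_features

def decode(audio_features, n_articulators=6):
    """Inverse problem: audio features -> articulatory parameters,
    one vector per frame (again a placeholder for a trained network)."""
    params = []
    for vec in audio_features:
        mean = sum(vec) / len(vec)
        params.append([mean] * n_articulators)
    return params                            # T x n_articulators

# Toy input: 4 "video frames" of 2x3 pixels.
video = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]] for _ in range(4)]
audio = encode(video)
articulation = decode(audio)
print(len(audio), len(audio[0]))                 # 4 13
print(len(articulation), len(articulation[0]))   # 4 6
```

In the actual architecture both mappings would be learned jointly from paired MRI video and audio recordings; the sketch only fixes the input/output shapes each half of the model must respect.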
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license; metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/82091