Conditioning Sound Generation with Emotions: An Investigation on Generative AI for Controllable Audio Synthesis
GALLI, FILIPPO
2024/2025
Abstract
The advent of deep learning has opened new frontiers in music generation, but achieving fine-grained and interpretable emotional control remains a central challenge. This thesis addresses that challenge by analyzing two distinct methodological paradigms: direct generation from raw audio and generation based on symbolic representations. The first research thread explores an end-to-end approach that employs a Conditional Variational Autoencoder (CVAE) to generate music directly from audio waveforms, conditioned on valence and arousal parameters. In-depth analysis, however, revealed intrinsic limitations of this method: although the model reconstructs its input well, it systematically fails at the generative task and at emotional control. To overcome these limitations, an alternative approach was explored by developing a proof-of-concept system named "Drift." This system is based on a Transformer architecture that operates on symbolic musical representations (MIDI) to generate affective music. By decoupling the generation of musical structure from timbre synthesis, the Transformer model proved capable of learning effective and interpretable emotional mappings.
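To make the conditioning mechanism concrete, the sketch below shows the general shape of a CVAE in which a (valence, arousal) vector is concatenated to both the encoder input and the latent code, so that the decoder can be steered at sampling time. This is a minimal illustration under assumed dimensions, not the thesis's implementation: the layer sizes, the dense (rather than convolutional) layers, and the `ConditionalVAE` name are all hypothetical.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Minimal CVAE sketch: the (valence, arousal) condition is fed to
    both encoder and decoder, so generation can be steered by choosing
    the condition vector. All dimensions here are illustrative."""

    def __init__(self, input_dim=1024, latent_dim=64, cond_dim=2, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim + cond_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, input_dim))

    def forward(self, x, cond):
        h = self.encoder(torch.cat([x, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(torch.cat([z, cond], dim=-1)), mu, logvar

# Generation: sample z from the prior and choose an emotion target.
model = ConditionalVAE()
cond = torch.tensor([[0.8, -0.3]])  # hypothetical (valence, arousal) values
z = torch.randn(1, 64)
frame = model.decoder(torch.cat([z, cond], dim=-1))
```

Conditioning both the encoder and the decoder is the standard CVAE design: at generation time the encoder is discarded, `z` is sampled from the prior, and the condition vector alone carries the emotional target, which is exactly the control channel the thesis found unreliable in the raw-audio setting.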
| File | Size | Format |
|---|---|---|
| Galli_Filippo.pdf (restricted access) | 4.51 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/94382