"Riffusion Meets Emotions: Deep Learning with Stable Diffusion for Emotionally Expressive Music Composition"
ZARE, MOHAMMAD MEHDI
2023/2024
Abstract
In this work, I present the fine-tuning of the Riffusion model with DreamBooth, guided by the DEAM (Database for Emotional Analysis of Music) dataset, to enhance emotion-based music generation. Using software frameworks and computational resources accessible via Google Colab, I ran three distinct experiments in which key hyperparameters, such as spectrogram resolution, batch size, learning rate schedule, and regularization, were varied. The goal was to condition the model to synthesize spectrograms that accurately correspond to localized target emotions while retaining high overall musical quality. The experimental results support this, showing incremental gains in loss stability and spectrogram clarity with each configuration. In particular, the final experiment converged more stably and overfitted less, thanks to a cosine learning rate scheduler and the introduction of weight decay. Nevertheless, several issues, including prominent noise artifacts, unstable loss curves, and a relatively small and unbalanced dataset, prevented consistently high-quality outputs. These results demonstrate both the promise and the challenges of using diffusion models for emotion-driven music composition. I conclude by highlighting directions for future work, including larger datasets, increased computational resources, improved denoising, and diffusion model architecture design, to explore the full potential of generating emotionally convincing music.
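The improvement reported for the final experiment comes from pairing a cosine learning rate schedule with weight decay. The sketch below is a minimal, hypothetical illustration of that optimizer setup in PyTorch; the model, learning rate, batch size, and step count are placeholder assumptions, not values taken from the thesis.

```python
# Minimal sketch (not the thesis code): AdamW with weight decay plus a cosine
# learning-rate schedule, the combination credited with more stable convergence
# and reduced overfitting in the final experiment.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(512, 512)              # stand-in for the fine-tuned network
optimizer = AdamW(model.parameters(),
                  lr=1e-6,                     # assumed fine-tuning learning rate
                  weight_decay=1e-2)           # regularization noted in the abstract
scheduler = CosineAnnealingLR(optimizer, T_max=1000)  # assumed total training steps

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 512)).pow(2).mean()   # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()                           # decay the learning rate along a cosine curve
```

If the thesis relied on the Hugging Face diffusers DreamBooth training script (an assumption, not stated in the abstract), roughly analogous settings are exposed there as command-line options such as `--lr_scheduler cosine` and `--adam_weight_decay`.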
File | Access | Size | Format
---|---|---|---
Zare_Mohammad Mehdi.pdf | open access | 2.9 MB | Adobe PDF
https://hdl.handle.net/20.500.12608/76999