"Riffusion Meets Emotions: Deep Learning with Stable Diffusion for Emotionally Expressive Music Composition"

ZARE, MOHAMMAD MEHDI
2023/2024

Abstract

In this work, I present the fine-tuning of the Riffusion model with DreamBooth, guided by the DEAM (Database for Emotional Analysis of Music) dataset, to enhance emotion-based music generation. Using software frameworks and computational resources available through Google Colab, I ran three distinct experiments in which key hyperparameters, such as spectrogram resolution, batch size, learning rate schedule and regularization method, were varied. The goal was to condition the model to synthesize spectrograms that accurately correspond to localized target emotions while retaining high overall musical quality. The experimental results showed incremental gains in loss stability and spectrogram clarity with each configuration. In particular, the final experiment converged more stably and overfit less, thanks to a cosine learning rate scheduler and the introduction of weight decay. Nevertheless, several issues, including prominent noise artifacts, unstable loss curves and a relatively small and unbalanced dataset, prevented the model from producing consistently high-quality outputs. These results demonstrate both the promise and the challenges of using diffusion models for emotion-driven music composition. I conclude by highlighting directions for future work, such as larger datasets, increased computational resources, improved denoising and refinements to the diffusion model architecture, in order to realize the full potential of emotionally convincing music generation.
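To make the training setup described above more concrete, the sketch below shows one way to combine a cosine learning rate schedule with weight decay when fine-tuning the Riffusion U-Net using the Hugging Face diffusers library. This is a minimal illustration, not the thesis code: the checkpoint identifier, learning rate, warmup steps and training length are assumptions chosen for the example.

# Minimal sketch of the optimizer/scheduler setup mentioned in the abstract
# (cosine learning-rate schedule plus weight decay as a regularizer).
# NOTE: model id and all hyperparameter values below are illustrative
# assumptions, not the exact configuration used in the experiments.
import torch
from diffusers import UNet2DConditionModel
from diffusers.optimization import get_cosine_schedule_with_warmup

# Load the U-Net of the publicly released Riffusion checkpoint for fine-tuning.
unet = UNet2DConditionModel.from_pretrained(
    "riffusion/riffusion-model-v1", subfolder="unet"
)

max_train_steps = 1000  # assumed number of optimization steps

# AdamW with weight decay provides the regularization described in the abstract.
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-6, weight_decay=1e-2)

# Cosine schedule with a short warmup, aimed at more stable convergence.
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=50,
    num_training_steps=max_train_steps,
)

In a DreamBooth-style training loop, optimizer.step() and lr_scheduler.step() would then be called after each gradient update on the spectrogram batches.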
Deep Learning
Stable Diffusion
Riffusion
Transfer Learning
Music Generation
Files in this item:
Zare_Mohammad Mehdi.pdf (open access, 2.9 MB, Adobe PDF)


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/76999