Predicting Valence and Arousal States from Visual Content Using ResNet-50 Deep Learning Architecture

REZAEI TALEBI, FATEMEH
2023/2024

Abstract

This thesis presents a comprehensive study and implementation of artificial intelligence systems for music generation, focusing on Computational Creativity. As one of its most vibrant subfields, music generation leverages computational methods to create novel musical compositions. Given the rapid advancements in Generative AI, our research aims to explore state-of-the-art deep learning models specifically tailored for music generation. Initially, we will develop a novel dataset that connects food-related imagery and music through emotion vectors, enhancing the creative process by incorporating emotional responses. This dataset will be constructed by merging and normalizing several publicly available datasets, avoiding the challenges associated with collecting original data. Subsequently, we will assess the quality of our dataset by using pre-trained models to tokenize input images and compute token similarity, allowing us to verify the data's suitability. The core framework will be designed to process image inputs and yield music outputs in formats such as WAV or MP3, without incorporating text at this stage. Notably, the transformation from imagery to sound will be mediated through emotion embeddings, offering a unique approach to music creation that aligns with our multidisciplinary focus. Given the predominance of PyTorch in contemporary deep learning applications, we will familiarize ourselves with this framework, complementing our prior experience with TensorFlow. The project will focus primarily on the initial stages of data collection and similarity analysis, laying the groundwork for the future development of a deep learning architecture capable of performing image-to-sound transformation. By utilizing and fine-tuning existing pre-trained models in conjunction with the Hugging Face library, this research will contribute valuable insights into the fusion of visual and auditory creativity through artificial intelligence.
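To make the dataset-quality step concrete, the following is a minimal sketch of how such a similarity check could look: a pre-trained vision model from the Hugging Face library embeds two images, and their cosine similarity indicates how closely they relate. The choice of CLIP, the checkpoint name, and the file names are illustrative assumptions, not the thesis's actual pipeline.

import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# Assumed pre-trained model; the thesis only specifies "pre-trained models".
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image files standing in for dataset entries.
images = [Image.open(p).convert("RGB")
          for p in ("image_a.jpg", "image_b.jpg")]
inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    emb = model.get_image_features(**inputs)   # shape: (2, 512)

# L2-normalize so the dot product equals cosine similarity in [-1, 1].
emb = emb / emb.norm(dim=-1, keepdim=True)
similarity = (emb[0] @ emb[1]).item()
print(f"cosine similarity: {similarity:.3f}")

Aggregating such pairwise scores across the merged datasets would give a rough consistency signal for the image side of each (image, emotion, music) triple.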
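Likewise, the valence/arousal prediction named in the title can be illustrated with a hedged sketch: an ImageNet-pretrained ResNet-50 whose classification head is replaced by a two-unit regression head, one output per emotion dimension. The weights variant, preprocessing, and file name below are plausible assumptions, not the author's verified implementation; the new head produces meaningless values until fine-tuned on annotated valence/arousal labels.

import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# ImageNet-pretrained backbone; replace the 1000-class head with a
# 2-unit regression head: output[0] = valence, output[1] = arousal.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)
model.eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("food_image.jpg").convert("RGB")  # hypothetical input
x = preprocess(image).unsqueeze(0)                   # shape: (1, 3, 224, 224)

with torch.no_grad():
    valence, arousal = model(x)[0]  # random until the head is fine-tuned
print(f"valence={valence:.3f}, arousal={arousal:.3f}")

# Fine-tuning would typically minimize nn.MSELoss() between these two
# outputs and human-annotated (valence, arousal) pairs.

The resulting emotion vector is what would mediate the image-to-music mapping described in the abstract.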
Keywords: food-related imagery; deep learning; music generation

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/77855