Predicting Valence and Arousal States from Visual Content Using ResNet-50 Deep Learning Architecture

REZAEI TALEBI, FATEMEH
2023/2024

Abstract

This thesis presents a comprehensive study and implementation of artificial intelligence systems for music generation, focusing on Computational Creativity. As one of its most vibrant subfields, music generation leverages computational methods to create novel musical compositions. Given the rapid advancements in Generative AI, our research aims to explore state-of-the-art deep learning models specifically tailored for music generation. Initially, we will develop a novel dataset that connects food-related imagery and music through emotion vectors, enhancing the creative process by incorporating emotional responses. This dataset will be constructed by merging and normalizing several publicly available datasets, avoiding the challenges associated with collecting original data. Subsequently, we will assess the quality of our dataset by using pre-trained models to tokenize input images and compute token similarity, allowing us to verify the data's suitability. The core framework will be designed to process image inputs and yield music outputs in formats such as WAV or MP3, without incorporating text at this stage. Notably, the transformation from imagery to sound will be mediated through emotion embeddings, offering a unique approach to music creation that aligns with our multidisciplinary focus. Given the predominance of PyTorch in contemporary deep learning applications, we will familiarize ourselves with this framework, complementing our prior experience with TensorFlow. The project will focus primarily on the initial stages of data collection and similarity analysis, laying the groundwork for the future development of a deep learning architecture capable of performing image-to-sound transformation. By utilizing and fine-tuning existing pre-trained models in conjunction with the Hugging Face library, this research will contribute valuable insights into the fusion of visual and auditory creativity through artificial intelligence.
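To make the dataset-quality step concrete, the following is a minimal sketch of how such a similarity check could look: a pre-trained vision model from the Hugging Face library embeds two images, and their cosine similarity indicates how closely they relate. The choice of CLIP, the checkpoint name, and the file names are illustrative assumptions, not the thesis's actual pipeline.

import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# Assumed pre-trained model; the thesis only specifies "pre-trained models".
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image files standing in for dataset entries.
images = [Image.open(p).convert("RGB")
          for p in ("image_a.jpg", "image_b.jpg")]
inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    emb = model.get_image_features(**inputs)   # shape: (2, 512)

# L2-normalize so the dot product equals cosine similarity in [-1, 1].
emb = emb / emb.norm(dim=-1, keepdim=True)
similarity = (emb[0] @ emb[1]).item()
print(f"cosine similarity: {similarity:.3f}")

Aggregating such pairwise scores across the merged datasets would give a rough consistency signal for the image side of each (image, emotion, music) triple.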
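Likewise, the valence/arousal prediction named in the title can be illustrated with a hedged sketch: an ImageNet-pretrained ResNet-50 whose classification head is replaced by a two-unit regression head, one output per emotion dimension. The weights variant, preprocessing, and file name below are plausible assumptions, not the author's verified implementation; the new head produces meaningless values until fine-tuned on annotated valence/arousal labels.

import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# ImageNet-pretrained backbone; replace the 1000-class head with a
# 2-unit regression head: output[0] = valence, output[1] = arousal.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)
model.eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("food_image.jpg").convert("RGB")  # hypothetical input
x = preprocess(image).unsqueeze(0)                   # shape: (1, 3, 224, 224)

with torch.no_grad():
    valence, arousal = model(x)[0]  # random until the head is fine-tuned
print(f"valence={valence:.3f}, arousal={arousal:.3f}")

# Fine-tuning would typically minimize nn.MSELoss() between these two
# outputs and human-annotated (valence, arousal) pairs.

The resulting emotion vector is what would mediate the image-to-music mapping described in the abstract.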
Keywords: food-related imagery; deep learning; music generation

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/77855