Study and Implementation of Multimodal Generative AI Systems for Emotion-Conditioned Music Generation

KÖSE, İSMAIL DEHA
2024/2025

Abstract

This thesis presents the development of a multimodal generative artificial intelligence system capable of producing emotionally consistent content across visual and auditory data. Its main goal was to build a model that analyzes the perceived emotional content (specifically Valence and Arousal) of food images and then synthesizes original audio reflecting that emotional state. To this end, a model was developed based on a Variational Autoencoder (VAE) architecture, adopting a "token-to-token" paradigm. The research utilized the FoodPics Extended 2022 and DEAM datasets, both of which include the necessary emotional annotations. Two primary architectural strategies were systematically compared. The first used modality-specific tokenizers (ViT for images, EnCodec for audio), which required complex dimensional adaptation techniques. The second, in contrast, converted audio signals into mel-spectrograms, allowing a single unified Vision Transformer (ViT) to process both visual and auditory data. Experimental results demonstrated that the unified tokenizer architecture was markedly superior at learning cross-modal representations while also reducing architectural complexity. The most significant breakthrough among the six model versions was achieved in Version 3.2 with the introduction of a novel "CLS-informed" generative mechanism. This architecture processes the global semantic information from the ViT's [CLS] token separately from the patch tokens and then uses this global context to conditionally guide the reconstruction of local details. The approach reduced the reconstruction loss to below 0.025, an improvement of over 90% compared to earlier versions. Furthermore, the model proved able to accurately predict Valence-Arousal values from the latent space and to organize that space meaningfully according to both semantic content and emotional values. In conclusion, this thesis demonstrates the feasibility of emotion-aware cross-modal generative systems and the effectiveness of unified representation learning, establishing a solid foundation for future multimodal AI systems.
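
To make the unified-tokenizer strategy concrete, the following sketch (not the thesis code; the sample rate, mel resolution, clip length, and patch size are assumptions chosen to match a standard 224x224 ViT) converts an audio clip into a mel-spectrogram "image" and cuts it into the same 16x16 patch tokens a ViT would use for a picture.

    import torch
    import torchaudio

    # Hypothetical sketch, not the thesis pipeline: a 5-second mono clip at
    # 16 kHz (random noise here; in practice a DEAM excerpt) becomes a
    # 224x224 mel-spectrogram and then a sequence of 16x16 ViT patch tokens.
    sample_rate, waveform = 16000, torch.randn(1, 16000 * 5)

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=224
    )(waveform)                                   # (1, 224, frames)
    mel_db = torchaudio.transforms.AmplitudeToDB()(mel)

    mel_img = mel_db[..., :224]                   # crop the time axis to at most 224 frames
    mel_img = torch.nn.functional.pad(mel_img, (0, 224 - mel_img.shape[-1]))  # pad if shorter
    mel_img = mel_img.expand(3, -1, -1).unsqueeze(0)              # (1, 3, 224, 224)

    patches = torch.nn.functional.unfold(mel_img, kernel_size=16, stride=16)
    patch_tokens = patches.transpose(1, 2)        # (1, 196, 768) token sequence
    print(patch_tokens.shape)

Similarly, a minimal sketch of the "CLS-informed" conditioning idea, assuming a FiLM-style scale-and-shift modulation (module names, dimensions, and the exact conditioning form are illustrative, not the thesis implementation): a global context vector derived from the [CLS] part of the latent code modulates the reconstruction of all patch tokens.

    import torch
    import torch.nn as nn

    class CLSInformedDecoder(nn.Module):
        # Hypothetical sketch: the latent code is split into a global ([CLS])
        # part and a local (patch) part; the global part is mapped to
        # scale/shift parameters that condition every reconstructed patch token.
        def __init__(self, latent_dim=256, token_dim=768, num_patches=196):
            super().__init__()
            self.num_patches, self.token_dim = num_patches, token_dim
            self.cls_mlp = nn.Sequential(          # global-context pathway
                nn.Linear(latent_dim, token_dim),
                nn.GELU(),
                nn.Linear(token_dim, 2 * token_dim),
            )
            self.patch_mlp = nn.Linear(latent_dim, num_patches * token_dim)  # local pathway

        def forward(self, z_cls, z_patch):
            b = z_cls.size(0)
            gamma, beta = self.cls_mlp(z_cls).chunk(2, dim=-1)   # (b, token_dim) each
            patches = self.patch_mlp(z_patch).view(b, self.num_patches, self.token_dim)
            # Global semantics conditionally guide the local reconstruction.
            return patches * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

    decoder = CLSInformedDecoder()
    z_cls, z_patch = torch.randn(4, 256), torch.randn(4, 256)
    print(decoder(z_cls, z_patch).shape)          # torch.Size([4, 196, 768])
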
Keywords: Generative AI, Valence-Arousal, VAE, Token

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/87359