Study and Implementation of Multimodal Generative AI Systems for Emotion-Conditioned Music Generation

KÖSE, İSMAIL DEHA
2024/2025

Abstract

This thesis presents the development of a multimodal generative artificial intelligence system capable of producing emotionally consistent content across visual and auditory data. Its main goal was to build a model that analyzes the perceived emotional content (specifically Valence and Arousal) of food images and then synthesizes original audio reflecting that emotional state. To this end, a model was developed based on a Variational Autoencoder (VAE) architecture, adopting a "token-to-token" paradigm. The research utilized the FoodPics Extended 2022 and DEAM datasets, both of which include the necessary emotional annotations. Two primary architectural strategies were systematically compared. The first used modality-specific tokenizers (ViT for images, EnCodec for audio), which required complex dimensional adaptation techniques. The second, in contrast, converted audio signals into mel-spectrograms, allowing a single unified Vision Transformer (ViT) to process both visual and auditory data. Experimental results demonstrated that the unified tokenizer architecture was markedly superior at learning cross-modal representations while also reducing architectural complexity. The most significant breakthrough among the six model versions was achieved in Version 3.2 with the introduction of a novel "CLS-informed" generative mechanism. This architecture processes the global semantic information from the ViT's [CLS] token separately from the patch tokens and then uses this global context to conditionally guide the reconstruction of local details. The approach reduced the reconstruction loss to below 0.025, an improvement of over 90% compared to earlier versions. Furthermore, the model proved able to accurately predict Valence-Arousal values from the latent space and to organize that space meaningfully according to both semantic content and emotional values. In conclusion, this thesis demonstrates the feasibility of emotion-aware cross-modal generative systems and the effectiveness of unified representation learning, establishing a solid foundation for future multimodal AI systems.
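
To make the unified-tokenizer strategy concrete, the following sketch (not the thesis code; the sample rate, mel resolution, clip length, and patch size are assumptions chosen to match a standard 224x224 ViT) converts an audio clip into a mel-spectrogram "image" and cuts it into the same 16x16 patch tokens a ViT would use for a picture.

    import torch
    import torchaudio

    # Hypothetical sketch, not the thesis pipeline: a 5-second mono clip at
    # 16 kHz (random noise here; in practice a DEAM excerpt) becomes a
    # 224x224 mel-spectrogram and then a sequence of 16x16 ViT patch tokens.
    sample_rate, waveform = 16000, torch.randn(1, 16000 * 5)

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=224
    )(waveform)                                   # (1, 224, frames)
    mel_db = torchaudio.transforms.AmplitudeToDB()(mel)

    mel_img = mel_db[..., :224]                   # crop the time axis to at most 224 frames
    mel_img = torch.nn.functional.pad(mel_img, (0, 224 - mel_img.shape[-1]))  # pad if shorter
    mel_img = mel_img.expand(3, -1, -1).unsqueeze(0)              # (1, 3, 224, 224)

    patches = torch.nn.functional.unfold(mel_img, kernel_size=16, stride=16)
    patch_tokens = patches.transpose(1, 2)        # (1, 196, 768) token sequence
    print(patch_tokens.shape)

Similarly, a minimal sketch of the "CLS-informed" conditioning idea, assuming a FiLM-style scale-and-shift modulation (module names, dimensions, and the exact conditioning form are illustrative, not the thesis implementation): a global context vector derived from the [CLS] part of the latent code modulates the reconstruction of all patch tokens.

    import torch
    import torch.nn as nn

    class CLSInformedDecoder(nn.Module):
        # Hypothetical sketch: the latent code is split into a global ([CLS])
        # part and a local (patch) part; the global part is mapped to
        # scale/shift parameters that condition every reconstructed patch token.
        def __init__(self, latent_dim=256, token_dim=768, num_patches=196):
            super().__init__()
            self.num_patches, self.token_dim = num_patches, token_dim
            self.cls_mlp = nn.Sequential(          # global-context pathway
                nn.Linear(latent_dim, token_dim),
                nn.GELU(),
                nn.Linear(token_dim, 2 * token_dim),
            )
            self.patch_mlp = nn.Linear(latent_dim, num_patches * token_dim)  # local pathway

        def forward(self, z_cls, z_patch):
            b = z_cls.size(0)
            gamma, beta = self.cls_mlp(z_cls).chunk(2, dim=-1)   # (b, token_dim) each
            patches = self.patch_mlp(z_patch).view(b, self.num_patches, self.token_dim)
            # Global semantics conditionally guide the local reconstruction.
            return patches * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

    decoder = CLSInformedDecoder()
    z_cls, z_patch = torch.randn(4, 256), torch.randn(4, 256)
    print(decoder(z_cls, z_patch).shape)          # torch.Size([4, 196, 768])
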
Keywords: Generative AI, Valence-Arousal, VAE, Token

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/87359