Sound Food - Creating a Cross-Modal Dataset Based on Emotional Similarity Measures
SINGH, ROBINPREET
2024/2025
Abstract
The perception of food is a multisensory process in which sight, taste, and hearing interact to shape the overall experience. Neuroscientific studies have highlighted crossmodal effects between music and taste, showing how specific musical attributes can alter the perception of food. This thesis aims to explore such dynamics by creating a crossmodal dataset built from two key resources: Food-Pics_Extended, containing food images with normative ratings of valence and arousal, and DEAM, which provides musical clips annotated along the same emotional dimensions. Using similarity metrics (Euclidean distance and cosine similarity), images and songs were matched on emotional coherence, defined in the two-dimensional valence–arousal space. The resulting dataset includes 896 food images, each associated with four musical clips (two static, two dynamic). To test the quality of these pairings, an online survey was developed in which participants rated the emotional congruence between visual and auditory stimuli and reported the emotion they perceived. Results showed limited theoretical–empirical concordance (16.4%), suggesting that emotional coordinates alone may not sufficiently predict subjective experience. Nonetheless, the comparison between similarity metrics revealed specific advantages of cosine similarity in preserving the qualitative coherence of the evoked emotion. The collected data provide insights for improving sensory matching systems, with potential applications in experiential dining and emotional design. This project lays the groundwork for the future development of intelligent systems capable of generating personalized food–music pairings, although the training of artificial intelligence is not directly addressed in this thesis and will be explored in future research.
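As a rough illustration of the matching step described above, the sketch below pairs one food image with its closest music clips under both metrics. It assumes valence–arousal coordinates stored as 2-D NumPy vectors already centered on the midpoint of the rating scale (cosine similarity is only meaningful relative to the origin of the space); the function names, the parameter `k`, and all numeric values are hypothetical and not taken from the thesis.

```python
import numpy as np

# Hypothetical valence-arousal coordinates, centered on the scale midpoint.
# In the thesis, image ratings come from Food-Pics_Extended and clip
# annotations from DEAM; the values below are invented for illustration.
image_va = np.array([0.8, -0.3])            # one food image: (valence, arousal)
clips_va = np.array([[0.7, -0.2],           # candidate music clips
                     [-0.5, 0.9],
                     [0.9, -0.4],
                     [0.1, 0.1]])

def euclidean_match(image, clips, k=2):
    """Indices of the k clips with the smallest Euclidean distance."""
    dists = np.linalg.norm(clips - image, axis=1)
    return np.argsort(dists)[:k]            # smaller distance = better match

def cosine_match(image, clips, k=2):
    """Indices of the k clips with the largest cosine similarity."""
    sims = (clips @ image) / (np.linalg.norm(clips, axis=1) * np.linalg.norm(image))
    return np.argsort(sims)[::-1][:k]       # larger similarity = better match

print("euclidean:", euclidean_match(image_va, clips_va))
print("cosine:   ", cosine_match(image_va, clips_va))
```

Note the design trade-off the abstract alludes to: Euclidean distance rewards absolute proximity of the two points, while cosine similarity rewards agreement in direction from the origin (i.e., the same qualitative emotion at possibly different intensities), which may explain its reported advantage in preserving the qualitative coherence of the evoked emotion.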
| File | Size | Format |
|---|---|---|
| Singh_Robinpreet.pdf (open access) | 4.08 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/89716