Developing a Pipeline to Generate Audio from Food Images Using Combined AI Models

BÖĞÜRCÜ, ÇAĞIN
2024/2025

Abstract

In recent years, advancements in multimodal artificial intelligence have opened new possibilities for integrating visual and auditory data to build innovative applications. This thesis explores the novel task of generating audio from food images, aiming to develop a pipeline that translates visual features into corresponding auditory outputs. The approach begins by generating descriptive captions from food images using a vision-language model, which are then refined with a language model to enhance contextual accuracy. Audio samples are analyzed to extract meaningful features, which are matched with the generated captions based on semantic similarity, forming a structured dataset for training and evaluation. In the final stage, an audio generation model is used to synthesize sound outputs aligned with the visual input. The system is evaluated to assess the coherence between images, captions, and generated audio. Results demonstrate the feasibility of using multimodal AI for creative tasks like image-guided audio generation, with potential applications in entertainment and sensory experiences. This study contributes to the field by introducing a practical framework and highlighting its technical and creative potential.
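
As a rough illustration of the caption-to-audio matching step described in the abstract, the sketch below embeds image captions and audio-clip descriptions with a sentence-embedding model and pairs each caption with its most similar clip by cosine similarity. This is a minimal sketch under assumed choices: the embedding model ("all-MiniLM-L6-v2") and the sample texts are illustrative stand-ins, not the implementation used in the thesis.

# Minimal sketch (not the thesis implementation) of semantic matching between
# generated image captions and audio-clip descriptions.
# The model name and the sample texts below are assumptions for illustration.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

image_captions = [
    "crispy fried chicken sizzling in a pan",
    "a fizzy soft drink poured over ice",
]
audio_descriptions = [
    "carbonated liquid being poured into a glass with ice",
    "food frying and crackling in hot oil",
    "someone biting into a crunchy snack",
]

# Encode both sides into the same embedding space; with normalised vectors,
# cosine similarity reduces to a plain dot product.
cap_emb = model.encode(image_captions, normalize_embeddings=True)
aud_emb = model.encode(audio_descriptions, normalize_embeddings=True)

similarity = cap_emb @ aud_emb.T        # shape: (n_captions, n_audio_clips)
best_match = similarity.argmax(axis=1)  # index of the closest clip per caption

for i, caption in enumerate(image_captions):
    j = best_match[i]
    print(f"{caption!r} -> {audio_descriptions[j]!r} (score {similarity[i, j]:.2f})")

In the pipeline summarized above, pairs produced this way would populate the structured caption-audio dataset used for training and evaluation; the sketch shows only the matching mechanism itself.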

Keywords

Image to Audio
Audio Captioning
Image Captioning
Generative AI
Text2Vec

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/85210