Developing a Pipeline to Generate Audio from Food Images Using Combined AI Models
BÖĞÜRCÜ, ÇAĞIN
2024/2025
Abstract
In recent years, advances in multimodal artificial intelligence have opened new possibilities for integrating visual and auditory data in innovative applications. This thesis explores the novel task of generating audio from food images, aiming to develop a pipeline that translates visual features into corresponding auditory outputs. The approach begins by generating descriptive captions from food images using a vision-language model; these captions are then refined with a language model to improve contextual accuracy. Audio samples are analyzed to extract meaningful features, which are matched to the generated captions by semantic similarity, forming a structured dataset for training and evaluation. In the final stage, an audio generation model synthesizes sound outputs aligned with the visual input. The system is evaluated for coherence between images, captions, and generated audio. The results demonstrate the feasibility of using multimodal AI for creative tasks such as image-guided audio generation, with potential applications in entertainment and sensory experiences. This study contributes to the field by introducing a practical framework and highlighting its technical and creative potential.
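The abstract outlines a multi-stage pipeline (caption, refine, match, synthesize) without naming concrete models. Below is a minimal Python sketch of such a pipeline, assuming BLIP for image captioning, a MiniLM sentence encoder for semantic matching, and AudioLDM for text-to-audio synthesis; these checkpoints, the input path, and the audio descriptions are illustrative assumptions, not the thesis's actual choices, and the caption-refinement step with a language model is omitted here.

```python
# A minimal sketch of the image-to-audio pipeline described in the abstract.
# All model checkpoints and file paths below are illustrative stand-ins.
import torch
import scipy.io.wavfile
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer, util
from diffusers import AudioLDMPipeline

# Stage 1: caption a food image with a vision-language model (assumed: BLIP).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)
image = Image.open("food.jpg").convert("RGB")  # hypothetical input path
inputs = processor(images=image, return_tensors="pt")
caption = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)
# (The thesis additionally refines this caption with a language model; omitted here.)

# Stage 2: match the caption against audio-sample descriptions by cosine
# similarity of sentence embeddings (assumed encoder: all-MiniLM-L6-v2).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
audio_descriptions = [  # hypothetical metadata for a small audio library
    "sizzling of food frying in a pan",
    "crunchy bite into a fresh apple",
    "soup being poured into a bowl",
]
scores = util.cos_sim(encoder.encode(caption), encoder.encode(audio_descriptions))
best_match = audio_descriptions[int(scores.argmax())]

# Stage 3: synthesize audio conditioned on the matched text prompt
# (assumed generator: AudioLDM via diffusers; outputs 16 kHz audio).
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float32)
audio = pipe(best_match, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
scipy.io.wavfile.write("output.wav", 16000, audio)
```

In a full system along the lines the abstract describes, the matching stage would be run offline over an audio library to build the caption-audio pairs used for training and evaluation, rather than selecting a single prompt at inference time.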
File | Access | Size | Format
---|---|---|---
Bogurcu_Cagin.pdf | restricted access | 2.09 MB | Adobe PDF
https://hdl.handle.net/20.500.12608/85210