Interactive Music Generation from Text and Audio Prompts Using a Fine-Tuned Transformer Model

Vizzapu, Prakash
Academic year 2025/2026

Abstract

This thesis presents the design and development of an interactive system for generating music from both text and audio inputs, built on Tasty-MusicGen-Small, a fine-tuned transformer-based model developed at the University of Padova. The system supports multimodal input: users can provide either a textual description or an audio prompt to generate coherent musical output. The application is implemented in Python and delivered through a web-based interface built with Gradio, enabling real-time interaction and output delivery in WAV format. The backend pipeline integrates Hugging Face's Transformers library and supports GPU acceleration through CUDA for efficient inference and high-quality synthesis. Key contributions include dual-modality support (text-to-audio and audio-to-audio), robust preprocessing and handling of audio data, and a lightweight API for experimentation and creative exploration. This work contributes to the growing field of generative AI in music, providing tools that bridge the gap between human expression and machine-generated sound.
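
The record contains no implementation code, but the pipeline the abstract describes (a MusicGen-family checkpoint served through Hugging Face's Transformers library, CUDA acceleration when available, WAV output, and both text and audio conditioning) can be sketched with the public MusicGen API. The snippet below is a minimal, hypothetical sketch: the facebook/musicgen-small checkpoint stands in for Tasty-MusicGen-Small, whose Hugging Face identifier is not given in this record, and the prompt text, the prompt.wav input file, and the token budgets are illustrative assumptions.

```python
import numpy as np
import scipy.io.wavfile
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Public stand-in: the hub identifier of the thesis model
# (Tasty-MusicGen-Small) is not given in this record.
MODEL_ID = "facebook/musicgen-small"
device = "cuda" if torch.cuda.is_available() else "cpu"  # CUDA acceleration when available

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MusicgenForConditionalGeneration.from_pretrained(MODEL_ID).to(device)

# Text-to-audio: condition generation on a textual description.
inputs = processor(
    text=["mellow jazz trio with brushed drums and upright bass"],
    padding=True,
    return_tensors="pt",
).to(device)
audio = model.generate(**inputs, max_new_tokens=512)

# Write the result as WAV, the delivery format the abstract mentions.
rate = model.config.audio_encoder.sampling_rate  # 32 kHz for MusicGen
scipy.io.wavfile.write("generated.wav", rate=rate, data=audio[0, 0].cpu().numpy())

# Audio-to-audio: condition on an audio prompt (hypothetical prompt.wav,
# expected as mono at the model's 32 kHz rate; resample first otherwise).
prompt_rate, prompt_wave = scipy.io.wavfile.read("prompt.wav")
inputs = processor(
    audio=prompt_wave.astype(np.float32) / 32768.0,  # int16 PCM -> float in [-1, 1]
    sampling_rate=prompt_rate,
    text=["continue in the same style"],
    padding=True,
    return_tensors="pt",
).to(device)
continuation = model.generate(**inputs, max_new_tokens=256)
```

At MusicGen's 50 Hz codebook frame rate, max_new_tokens=512 corresponds to roughly ten seconds of generated audio.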
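The Gradio delivery layer the abstract mentions can be hedged in the same way. The sketch below reuses the stand-in checkpoint; the single Textbox-to-Audio layout and the interface title are assumptions, not the thesis's actual UI.

```python
import gradio as gr
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

MODEL_ID = "facebook/musicgen-small"  # stand-in, as above
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MusicgenForConditionalGeneration.from_pretrained(MODEL_ID).to(device)

def generate(prompt: str):
    """Synthesize audio from a text prompt and return it for playback."""
    inputs = processor(text=[prompt], padding=True, return_tensors="pt").to(device)
    audio = model.generate(**inputs, max_new_tokens=512)
    rate = model.config.audio_encoder.sampling_rate
    # Gradio's Audio component accepts a (sample_rate, waveform) tuple.
    return rate, audio[0, 0].cpu().numpy()

demo = gr.Interface(
    fn=generate,
    inputs=gr.Textbox(label="Describe the music"),
    outputs=gr.Audio(label="Generated audio"),
    title="Interactive Music Generation",
)

if __name__ == "__main__":
    demo.launch()  # serves the web UI locally
```

Launching serves a local web page where each submitted prompt triggers one generation pass and returns the result as playable audio.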
Keywords: Multimodal Music Generation; Transformer-Based Audio …; Text-to-Audio Synthesis
File in this record: Thesis.pdf, Adobe PDF, 1.47 MB (restricted access)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/106864