Perceptual and Quality Assessment of AI-Generated Short Sound Messages
CHEN, SIMONE
2024/2025
Abstract
Artificial intelligence (AI) has rapidly transformed audio content creation, enabling systems to generate both music and functional sounds for communication and accessibility purposes. Among these applications, the TIScode system represents an original approach: it transmits digital information through Short Sound Messages (SSMs), brief five-second AI-generated audio clips. However, how such sounds are perceived, and how their musical qualities should be assessed, remains an open challenge, since perception is strongly shaped by human factors. This thesis focuses on the generation of SSMs using MusicGen, a state-of-the-art text-to-music model, and on their perceptual evaluation through a structured listening test based on ITU-T standards. Participants rated audio samples created under diverse scenarios, enabling an analysis of how prompt design and model configuration influence perceived quality, recognizability, and acceptability. The results highlight the potential of text-to-music models for sound-based communication and indicate directions for future research on the perceptual evaluation of AI-generated audio.
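To make the generation step concrete, here is a minimal sketch of producing a five-second SSM-style clip with the audiocraft library that implements MusicGen. The checkpoint size and the prompt text are illustrative assumptions, not the configuration actually used in the thesis.

```python
# Minimal sketch: generating a five-second clip with MusicGen via audiocraft.
# The checkpoint ('facebook/musicgen-small') and the prompt are assumptions
# for illustration; the thesis's actual prompts and settings may differ.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=5)  # SSMs are five-second clips

# One text prompt per clip to generate (hypothetical example prompt).
prompts = ['short bright marimba jingle, simple and easily recognizable']
wavs = model.generate(prompts)  # tensor of shape [batch, channels, samples]

# Save the first clip at the model's native sample rate, loudness-normalized.
audio_write('ssm_sample', wavs[0].cpu(), model.sample_rate, strategy='loudness')
```

On the evaluation side, listening tests in the ITU-T tradition are commonly summarized as per-condition Mean Opinion Scores (MOS) with confidence intervals. Below is a minimal summary sketch, assuming a 1-to-5 absolute category rating scale; the condition names and scores are hypothetical placeholders.

```python
# Minimal sketch: summarizing listening-test ratings as MOS with a 95% CI.
# Scale (1-5), condition names, and scores are hypothetical placeholders.
import math
from statistics import mean, stdev

ratings = {  # condition -> individual participant scores
    'prompt_A / small model': [4, 5, 3, 4, 4, 5, 3],
    'prompt_B / small model': [2, 3, 3, 2, 4, 3, 2],
}

for condition, scores in ratings.items():
    mos = mean(scores)
    # Normal-approximation 95% CI; a t-distribution would be more precise
    # for samples this small, but this keeps the sketch self-contained.
    ci95 = 1.96 * stdev(scores) / math.sqrt(len(scores))
    print(f'{condition}: MOS = {mos:.2f} ± {ci95:.2f} (n = {len(scores)})')
```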
| File | Size | Format | Access |
|---|---|---|---|
| Chen_Simone.pdf | 2.58 MB | Adobe PDF | open access |
https://hdl.handle.net/20.500.12608/97826