Audio classification is a central task in machine learning with applications in environmental monitoring, music information retrieval, speech analysis, and bioacoustics. This thesis investigates how different audio representations influence downstream learning, focusing on three main families: Mel-spectrogram features, neural codec embeddings such as EnCodec, and large-scale self-supervised models such as Perch2.0. A unified evaluation framework is developed to benchmark these representations across heterogeneous datasets, including human vocalizations, musical genres, environmental sounds, and species-specific acoustic signals. In addition to assessing existing approaches, the thesis introduces TriOct–VQ, a perceptually grounded vector-quantised encoder based on third-octave decomposition and residual quantisation, designed to generate compact and interpretable discrete audio tokens. The study also evaluates ensemble strategies that combine different representations to integrate complementary information. All models are tested under consistent training protocols and temporal window configurations, enabling systematic comparison across domains. Overall, the thesis provides a comprehensive experimental analysis of contemporary audio representations, presents a new VQ-based encoder, and establishes a methodological framework for studying how representation design interacts with dataset characteristics and classification tasks.
Audio classification is a central task in machine learning, with applications ranging from environmental monitoring to music analysis, and from speech to bioacoustics. This thesis studies how different audio representations affect the performance of classification models, considering three main families: Mel-spectrograms, neural codecs such as EnCodec, and large-scale self-supervised models such as Perch2.0. A unified evaluation framework was developed to compare these representations on heterogeneous datasets comprising human vocalizations, musical genres, environmental sounds, and species-specific bioacoustic signals. Beyond comparing existing approaches, the thesis introduces TriOct–VQ, a vector-quantised encoder based on a third-octave representation and residual quantisation, designed to produce compact and interpretable discrete representations. The study also analyses ensemble techniques that combine multiple representations to integrate complementary information. All models are evaluated with consistent training protocols, enabling a systematic comparison. Overall, the thesis offers a broad experimental analysis of modern audio representations and introduces a new VQ encoder supported by a solid methodological framework.
Beyond Spectrograms: Testing Alternative Codec Features for Audio Classification
DE NAT, MARCO
2024/2025
| File | Size | Format |
|---|---|---|
| DeNat_Marco.pdf (open access) | 24.89 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/98776