Convolutional Recurrence in Spiking Neural Networks: A Parameter-Efficient Approach to Learnable Delays for Audio Classification
FOLLY SANCHES ZEBENDO, LÚCIO
2025/2026
Abstract
Audio classification involves complex temporal dependencies spanning multiple time scales, requiring models that integrate information over time while preserving fine-grained temporal structure. Spiking Neural Networks (SNNs) offer a biologically inspired framework for such processing by encoding information as sparse, asynchronous binary events (spikes). Their event-driven nature makes them particularly well suited to temporal data streams, enabling high responsiveness and significant energy savings compared to conventional neural networks. Extending SNNs with recurrent connections yields Recurrent Spiking Neural Networks (RSNNs), which are better able to capture long-range temporal dependencies. However, as with standard RNNs, training RSNNs with gradient-based optimization remains challenging due to vanishing and exploding gradients over long sequences, limiting the scalability of deep architectures. A promising solution is the introduction of learnable delays in recurrent connections. These delays model the propagation time of spikes between neurons and effectively act as temporal skip connections, facilitating gradient flow. The DELREC method learns axonal delays jointly with the network weights using surrogate gradient learning, achieving state-of-the-art performance on the Spiking Speech Commands (SSC) dataset and competitive results on the Spiking Heidelberg Digits (SHD) dataset. Despite its effectiveness, DELREC relies on fully dense recurrent connections, resulting in quadratic parameter complexity that hinders scalability. In this thesis we propose replacing dense recurrent connectivity with lightweight one-dimensional (1D) convolutions. This design leverages the strong local correlations present in audio representations, where adjacent frequency channels exhibit similar activation patterns due to the harmonic structure of speech, the spectral continuity of acoustic events, and overlapping cochlear filter responses.
By exploiting this locality, the model maintains expressive power while significantly reducing computational cost. The proposed approach achieves a 99.995% reduction in recurrent parameters, yielding 25–52× faster inference than the original DELREC method. Evaluated on the SHD and SSC datasets, it attains test accuracies of 91.51% ± 0.70% and 78.59% ± 0.39%, respectively. An ablation study further highlights the importance of learnable delays, which improve test accuracy by 5.23 (SHD) and 3.50 (SSC) percentage points. Learnable delays also significantly reduce cross-seed variance, yielding more stable and reliable training than fixed-delay approaches. These results demonstrate that convolutional recurrence with learnable delays constitutes an efficient and scalable alternative to fully connected RSNN architectures.
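To make the delay mechanism concrete, the following is a minimal forward-pass sketch of recurrence with per-neuron axonal delays, using a ring buffer of past spikes. All sizes, the diagonal (per-neuron) recurrence, and the leaky-integrate-and-fire dynamics are illustrative assumptions; DELREC's actual delay parameterization and its surrogate-gradient training are not shown.

```python
import numpy as np

# Toy RSNN step with per-neuron axonal delays (hypothetical sizes and dynamics).
rng = np.random.default_rng(0)
N, T, D_MAX = 8, 20, 5                        # neurons, timesteps, max delay

delays = rng.integers(1, D_MAX + 1, size=N)   # learnable in DELREC; fixed here
w = rng.normal(scale=0.5, size=N)             # one recurrent weight per neuron (toy diagonal case)
spike_history = np.zeros((D_MAX + 1, N))      # ring buffer of the last D_MAX+1 spike vectors
v = np.zeros(N)                               # membrane potentials
threshold = 1.0

for t in range(T):
    # each neuron receives the spike it emitted `delays[i]` timesteps ago
    delayed = spike_history[(t - delays) % (D_MAX + 1), np.arange(N)]
    v = 0.9 * v + w * delayed + rng.normal(scale=0.3, size=N)  # leak + recurrence + input noise
    spikes = (v >= threshold).astype(float)    # hard threshold (surrogate gradient in training)
    v = np.where(spikes > 0, 0.0, v)           # reset membrane after a spike
    spike_history[t % (D_MAX + 1)] = spikes    # write current spikes into the ring buffer
```

Because a delayed spike reaches the loss through far fewer recurrent steps than an undelayed one, each delay acts as a temporal skip connection, which is the gradient-flow argument made above.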
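The scale of the parameter reduction can be sketched with back-of-the-envelope arithmetic: a dense recurrent layer needs N² weights, while a single 1D kernel shared across channels needs only K. The values N = 1024 and K = 51 below are illustrative assumptions chosen to land near the reported figure, not sizes taken from the thesis.

```python
# Illustrative parameter count: dense recurrence vs. a shared 1D convolution.
# N and K are hypothetical, not the thesis's actual configuration.
N = 1024                         # hidden neurons / frequency channels
K = 51                           # 1D convolution kernel size, shared across channels

dense_params = N * N             # full N x N recurrent weight matrix
conv_params = K                  # one shared kernel replaces the whole matrix

reduction = 100 * (1 - conv_params / dense_params)
print(f"dense: {dense_params}, conv: {conv_params}, reduction: {reduction:.3f}%")
```

With these assumed sizes the reduction comes out to roughly 99.995%, matching the order of magnitude reported above; the key point is that the conv cost is independent of N², so the gap widens as the layer grows.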
File: Zebendo_Lucio.pdf (open access), 492.21 kB, Adobe PDF
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license; metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/106018