A system-theoretic perspective on Transformers
ZATTRA, RICCARDO
2024/2025
Abstract
This thesis begins with an in-depth analysis of the Transformer architecture, which has become the foundation for a wide range of sequence modeling tasks thanks to its attention mechanism. Despite this success, recent studies have highlighted limitations of attention, particularly its computational cost and limited scalability on long sequences. To address these issues, this work explores an alternative class of models based on State Space Models (SSMs). In particular, the S6 model, a recently proposed SSM, is studied within the context of the MAMBA architecture, which leverages the strengths of state space formulations while aiming to match or exceed the efficiency of Transformers. Following a thorough analysis of S6 and its integration into MAMBA, a novel SSM-based model is introduced. This model, developed as part of this thesis, demonstrates improved performance over S6 in specific application scenarios. The thesis also includes an extensive experimental section in which S6, the proposed model, and other baseline architectures are compared across several benchmarks, with a focus on accuracy, efficiency, and generalization capabilities.

| File | Size | Format |
|---|---|---|
| Zattra_Riccardo.pdf (embargoed until 14/03/2027) | 1.22 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/90729