The cost of backpropagation in transformer network
BARBATO, ALBERTO
2024/2025
Abstract
This thesis studies the training cost of transformer networks by counting the arithmetic operations required during backpropagation. Focusing on a simplified, single-headed version of the original transformer block, we break down the time complexity of each major component: attention, feedforward layers, and normalization. We also include a comparison with the multi-head version. The aim is to understand how much each part contributes to the total training cost, indicating where future optimization might be most effective.

| File | Size | Format |
|---|---|---|
| Barbato_Alberto.pdf (open access) | 463.09 kB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/91702