The cost of backpropagation in transformer network

BARBATO, ALBERTO
2024/2025

Abstract

This thesis studies the training cost of transformer networks by analyzing the number of arithmetic operations required during backpropagation. Focusing on a simplified, single-headed version of the original transformer block, we break down the time complexity of each major component: attention, the feedforward layers, and normalization. We also include a comparison with the multi-head version. The aim is to quantify the contribution of each part to the total training cost, indicating where future optimization efforts might be most effective.
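
The per-component breakdown summarized in the abstract follows the standard operation counts for a transformer block. As a rough orientation only, the sketch below is a hypothetical illustration (not the thesis's own derivation): it counts multiply-add operations for the forward pass of a single-headed block, with sequence length n, model dimension d, and feedforward width d_ff as assumed symbols; backpropagation is commonly estimated at roughly twice the forward-pass cost.

```python
# Illustrative sketch: rough forward-pass multiply-add counts for one
# single-head transformer block. Symbols n (sequence length), d (model
# dimension), and d_ff (feedforward width) are assumptions for the example;
# backprop is often taken as roughly 2x the forward pass.

def attention_ops(n: int, d: int) -> int:
    """Single-head self-attention: Q/K/V/output projections plus the
    n x n score matrix and its product with V."""
    projections = 4 * (2 * n * d * d)   # four d x d matrix multiplies
    scores = 2 * n * n * d              # Q @ K^T
    weighted_sum = 2 * n * n * d        # softmax(scores) @ V
    return projections + scores + weighted_sum

def feedforward_ops(n: int, d: int, d_ff: int) -> int:
    """Position-wise feedforward with hidden width d_ff (often 4*d)."""
    return 2 * n * d * d_ff + 2 * n * d_ff * d

def layernorm_ops(n: int, d: int) -> int:
    """Layer normalization: mean, variance, scale, and shift per token,
    i.e. a few elementwise passes over n*d values."""
    return 5 * n * d

if __name__ == "__main__":
    n, d = 512, 512
    print("attention:  ", attention_ops(n, d))
    print("feedforward:", feedforward_ops(n, d, 4 * d))
    print("layernorm:  ", layernorm_ops(n, d))
```

With these assumed counts, attention scales as O(n^2 * d) while the feedforward sublayer scales as O(n * d^2), which is why their relative weight in the total training cost depends on the sequence length relative to the model dimension.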
Keywords

Transformer
Time Complexity
Training
Files in this record:

Barbato_Alberto.pdf (open access, 463.09 kB, Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license; metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/91702