The cost of backpropagation in transformer network

BARBATO, ALBERTO
2024/2025

Abstract

This thesis studies the training cost of transformer networks by analyzing the number of arithmetic operations required during backpropagation. Focusing on a simplified, single-headed version of the original transformer block, we break down the time complexity of each major component: attention, the feedforward layers, and normalization. We also include a comparison with the multi-head version. The aim is to quantify the contribution of each part to the total training cost, indicating where future optimization efforts might be most effective.
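
The per-component breakdown summarized in the abstract follows the standard operation counts for a transformer block. As a rough orientation only, the sketch below is a hypothetical illustration (not the thesis's own derivation): it counts multiply-add operations for the forward pass of a single-headed block, with sequence length n, model dimension d, and feedforward width d_ff as assumed symbols; backpropagation is commonly estimated at roughly twice the forward-pass cost.

```python
# Illustrative sketch: rough forward-pass multiply-add counts for one
# single-head transformer block. Symbols n (sequence length), d (model
# dimension), and d_ff (feedforward width) are assumptions for the example;
# backprop is often taken as roughly 2x the forward pass.

def attention_ops(n: int, d: int) -> int:
    """Single-head self-attention: Q/K/V/output projections plus the
    n x n score matrix and its product with V."""
    projections = 4 * (2 * n * d * d)   # four d x d matrix multiplies
    scores = 2 * n * n * d              # Q @ K^T
    weighted_sum = 2 * n * n * d        # softmax(scores) @ V
    return projections + scores + weighted_sum

def feedforward_ops(n: int, d: int, d_ff: int) -> int:
    """Position-wise feedforward with hidden width d_ff (often 4*d)."""
    return 2 * n * d * d_ff + 2 * n * d_ff * d

def layernorm_ops(n: int, d: int) -> int:
    """Layer normalization: mean, variance, scale, and shift per token,
    i.e. a few elementwise passes over n*d values."""
    return 5 * n * d

if __name__ == "__main__":
    n, d = 512, 512
    print("attention:  ", attention_ops(n, d))
    print("feedforward:", feedforward_ops(n, d, 4 * d))
    print("layernorm:  ", layernorm_ops(n, d))
```

With these assumed counts, attention scales as O(n^2 * d) while the feedforward sublayer scales as O(n * d^2), which is why their relative weight in the total training cost depends on the sequence length relative to the model dimension.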
Keywords

Transformer
Time Complexity
Training
Files in this record:

Barbato_Alberto.pdf (open access, 463.09 kB, Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license; metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/91702