Layer Redundancy in Transformers: Identifying What Truly Matters
VIESPOLI, ALESSANDRO
2025/2026
Abstract
Transformer-based architectures have achieved strong results in natural language processing and computer vision, but their deployment is limited by high computational, memory, and storage costs, motivating research into model compression techniques, such as pruning, that reduce these costs while preserving performance. While prior work has shown substantial redundancy in large language models (LLMs), redundancy in vision transformers, although also investigated, remains less systematically analyzed. In this work, we extend similarity-based redundancy analysis to vision models by measuring cosine similarity between layer inputs and outputs to identify transformations with minimal impact. We study redundancy patterns across several transformer architectures, including ViT, DINOv2 and SwinV2, and find that vision transformers, like language models, exhibit significant redundancy. We then evaluate how this redundancy can be exploited to improve efficiency, analyzing the trade-off between inference speed and performance degradation under different pruning strategies. Our results show that moderate pruning can yield substantial acceleration with limited impact on accuracy; for instance, in DINOv2, removing approximately 25% of the transformer blocks achieves a 33% speedup with only a 1.65% reduction in performance. We also explore a model healing approach based on the selective adaptation of highly redundant layers. Finally, we conduct additional experiments on large language models to investigate when, why and how high-similarity behavior emerges.
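To make the similarity-based redundancy criterion concrete, the following is a minimal sketch of the idea described in the abstract: for each transformer block, compare the block's input and output token representations with cosine similarity, and treat blocks whose outputs are nearly identical to their inputs (similarity close to 1) as pruning candidates. The block stack below is a hypothetical stand-in built from PyTorch's generic encoder layers, not the thesis's actual ViT, DINOv2, or SwinV2 pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for a vision transformer's block stack: a list of
# encoder layers applied sequentially to a (batch, tokens, dim) tensor.
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
     for _ in range(12)]
)

@torch.no_grad()
def block_redundancy(blocks, x):
    """Return, per block, the mean cosine similarity between its input and
    output token representations. Values near 1 indicate the block barely
    changes the representation, marking it as a candidate for pruning."""
    scores = []
    for block in blocks:
        y = block(x)
        # Cosine similarity per token, averaged over batch and tokens.
        sim = F.cosine_similarity(x.flatten(0, 1), y.flatten(0, 1), dim=-1).mean()
        scores.append(sim.item())
        x = y  # propagate the output, as in a normal forward pass
    return scores

# Example: 2 images, 196 patch tokens + 1 [CLS] token, 384-dim embeddings.
x = torch.randn(2, 197, 384)
for i, s in enumerate(block_redundancy(blocks, x)):
    print(f"block {i:2d}: mean input/output cosine similarity = {s:.3f}")
```

In practice such scores would be averaged over a calibration set of real images before ranking blocks for removal; with randomly initialized layers, as here, the printed values only illustrate the mechanics of the measurement.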
https://hdl.handle.net/20.500.12608/107665