Layer Redundancy in Transformers: Identifying What Truly Matters

Viespoli, Alessandro
2025/2026

Abstract

Transformer-based architectures have achieved strong results in natural language processing and computer vision, but their deployment is limited by high computational, memory, and storage costs, motivating research into compression techniques, such as pruning, that reduce these costs while preserving performance. Prior work has demonstrated substantial redundancy in large language models (LLMs); redundancy in vision transformers has also been investigated, but far less systematically. In this work, we extend similarity-based redundancy analysis to vision models, measuring the cosine similarity between each layer's input and output to identify transformations with minimal impact on the representation. We study redundancy patterns across several transformer architectures, including ViT, DINOv2, and SwinV2, and find that vision transformers, like language models, exhibit significant redundancy. We then evaluate how this redundancy can be exploited to improve efficiency, analyzing the trade-off between inference speed and performance degradation under different pruning strategies. Our results show that moderate pruning yields substantial acceleration with limited impact on accuracy: in DINOv2, for instance, removing approximately 25% of the transformer blocks achieves a 33% speedup with only a 1.65% drop in performance. We also explore a model-healing approach based on the selective adaptation of highly redundant layers. Finally, we conduct additional experiments on large language models to investigate when, why, and how high-similarity behavior emerges.
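The redundancy criterion described in the abstract — scoring each block by the cosine similarity between its input and output hidden states — can be sketched as follows. This is a minimal illustration, not the thesis's actual pipeline: the toy `nn.TransformerEncoderLayer` blocks stand in for pretrained ViT/DINOv2 blocks, and the function name `block_redundancy` is hypothetical.

```python
import torch
import torch.nn as nn

# Toy stack of transformer blocks (stand-ins for pretrained ViT/DINOv2
# blocks, which the actual study would load from checkpoints).
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(6)
])

def block_redundancy(blocks, x):
    """Score each block by the cosine similarity of its input and output.

    A score near 1 means the block barely changes the representation,
    making it a candidate for pruning.
    """
    scores = []
    with torch.no_grad():
        for blk in blocks:
            y = blk(x)
            # Mean cosine similarity across the batch, with token and
            # feature dimensions flattened together.
            sim = torch.nn.functional.cosine_similarity(
                x.flatten(1), y.flatten(1), dim=1
            ).mean().item()
            scores.append(sim)
            x = y  # feed the output forward to the next block
    return scores

x = torch.randn(2, 16, 64)  # (batch, tokens, hidden)
scores = block_redundancy(blocks, x)
```

A pruning strategy along the lines the abstract describes would then drop the blocks with the highest scores (e.g. the top 25%) and re-measure accuracy and latency on held-out data.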
Keywords: Transformers, Pruning, Layer Redundancy, Vision Transformers

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/107665