Layer Redundancy in Transformers: Identifying What Truly Matters
VIESPOLI, ALESSANDRO
2025/2026
Abstract
Transformer-based architectures have achieved strong results in natural language processing and computer vision, but their deployment is limited by high computational, memory, and storage costs, motivating research into model compression techniques, such as pruning, that reduce these costs while preserving performance. While prior work has shown substantial redundancy in large language models (LLMs), redundancy in vision transformers, although also investigated, remains less systematically analyzed. In this work, we extend similarity-based redundancy analysis to vision models by measuring cosine similarity between layer inputs and outputs to identify transformations with minimal impact. We study redundancy patterns across several transformer architectures, including ViT, DINOv2 and SwinV2, and find that vision transformers, like language models, exhibit significant redundancy. We then evaluate how this redundancy can be exploited to improve efficiency, analyzing the trade-off between inference speed and performance degradation under different pruning strategies. Our results show that moderate pruning can yield substantial acceleration with limited impact on accuracy; for instance, in DINOv2, removing approximately 25% of the transformer blocks achieves a 33% speedup with only a 1.65% reduction in performance. We also explore a model healing approach based on the selective adaptation of highly redundant layers. Finally, we conduct additional experiments on large language models to investigate when, why and how high-similarity behavior emerges.
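To make the similarity-based redundancy criterion concrete, the following is a minimal sketch of the idea described in the abstract: for each transformer block, compare the block's input and output token representations with cosine similarity, and treat blocks whose outputs are nearly identical to their inputs (similarity close to 1) as pruning candidates. The block stack below is a hypothetical stand-in built from PyTorch's generic encoder layers, not the thesis's actual ViT, DINOv2, or SwinV2 pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for a vision transformer's block stack: a list of
# encoder layers applied sequentially to a (batch, tokens, dim) tensor.
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
     for _ in range(12)]
)

@torch.no_grad()
def block_redundancy(blocks, x):
    """Return, per block, the mean cosine similarity between its input and
    output token representations. Values near 1 indicate the block barely
    changes the representation, marking it as a candidate for pruning."""
    scores = []
    for block in blocks:
        y = block(x)
        # Cosine similarity per token, averaged over batch and tokens.
        sim = F.cosine_similarity(x.flatten(0, 1), y.flatten(0, 1), dim=-1).mean()
        scores.append(sim.item())
        x = y  # propagate the output, as in a normal forward pass
    return scores

# Example: 2 images, 196 patch tokens + 1 [CLS] token, 384-dim embeddings.
x = torch.randn(2, 197, 384)
for i, s in enumerate(block_redundancy(blocks, x)):
    print(f"block {i:2d}: mean input/output cosine similarity = {s:.3f}")
```

In practice such scores would be averaged over a calibration set of real images before ranking blocks for removal; with randomly initialized layers, as here, the printed values only illustrate the mechanics of the measurement.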
https://hdl.handle.net/20.500.12608/107665