Why do Overparameterized Neural Networks Generalize?

SINGLA, KAVITA
2024/2025

Abstract

Classical statistical theory predicts that as model complexity increases, a model risks perfectly interpolating the training data and therefore overfitting; this prediction rests on the foundational bias-variance tradeoff. Yet modern deep neural networks defy it: overparameterized networks achieve near-zero training error, as expected, but still generalize impressively. This thesis traces the evolution of generalization theory from classical frameworks such as VC (Vapnik-Chervonenkis) dimension, PAC (Probably Approximately Correct) learning, and explicit regularization to modern explanations such as implicit regularization, flat minima, NTK (Neural Tangent Kernel) theory, PAC-Bayes theory, the information bottleneck, and the double-descent phenomenon, and attempts to bridge the gap between them. By synthesizing theoretical and empirical insights, the thesis investigates why classical measures of model capacity fail to capture the geometric and dynamical features of deep learning. It also discusses the practical and experimental challenges posed by industry-scale models, emphasizing the constraints faced by modern deep learning research. The thesis concludes by outlining open theoretical challenges and suggesting future work toward a unified generalization theory for deep learning.
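The double-descent behaviour summarized in the abstract can be illustrated with a small, self-contained experiment. The following is a minimal sketch, not taken from the thesis: it fits a minimum-norm least-squares model on random ReLU features and sweeps the number of features past the interpolation threshold. The feature map, dataset, and hyperparameters are illustrative assumptions.

# Minimal double-descent sketch (illustrative; not the thesis's experiment).
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Ground-truth function; labels get additive noise below.
    return np.sin(2 * np.pi * x)

n_train, n_test, noise = 30, 500, 0.1
x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + noise * rng.standard_normal(n_train)
x_test = rng.uniform(-1, 1, n_test)
y_test = target(x_test)

def relu_features(x, W, b):
    # Random ReLU feature map: phi(x) = max(0, x*W + b), one column per feature.
    return np.maximum(0.0, np.outer(x, W) + b)

for n_features in [5, 10, 20, 30, 40, 60, 100, 300, 1000]:
    W = rng.standard_normal(n_features)
    b = rng.standard_normal(n_features)
    Phi_train = relu_features(x_train, W, b)
    Phi_test = relu_features(x_test, W, b)

    # Minimum-norm least squares: in the overparameterized regime
    # (n_features > n_train) this interpolates the training data.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)

    train_mse = np.mean((Phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"features={n_features:5d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")

Running the sweep typically shows test error rising as the feature count approaches the interpolation threshold (n_features close to n_train) and falling again as the model becomes heavily overparameterized, mirroring the double-descent curve the thesis discusses.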
Keywords: Overparameterization, Double descent, PAC-Bayes bounds, Learning curves, VC dimension
File: singla_kavita.pdf (open access, Adobe PDF, 644.54 kB)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/97713