Permutation-Based Inference for High-Dimensional Linear Models
DELLA PENNA, PAOLO
2024/2025
Abstract
The rapid growth of high-dimensional data has made reliable inference, not merely prediction, a pressing challenge in linear regression when the number of covariates far exceeds the sample size. While sparsity-inducing estimators such as the Lasso excel at variable selection, post-selection p-values and multiple-testing guarantees remain elusive. This thesis advances the state of the art by extending the permutation-based framework of De Santis et al. (2022) to Gaussian linear models, delivering valid per-variable p-values. For every coefficient, the proposed procedure treats the remaining predictors as potential confounders. To mitigate the limitations of the screening property (the requirement that the preliminary selection step include all relevant variables, a condition often violated under high collinearity), we explore two complementary strategies. The first performs a principal component decomposition of the confounder set, followed by sparse estimation in the reduced space. The second applies forward stepwise selection directly within the confounder set. In both cases, a standardized, sign-flipped score statistic is computed for the target variable conditional on the selected components. Embedding these statistics in a Westfall–Young maxT permutation scheme automatically adjusts for dependence across tests, yielding simultaneous confidence statements that remain valid after model exploration. The method is fully non-parametric, distribution-free, and naturally extensible to generalized linear models. An extensive simulation study spanning Toeplitz and other covariance structures evaluates type I error, power, and family-wise error rate. Across a wide range of sample sizes, signal-to-noise ratios, sparsity levels, and correlation strengths, both proposed methods achieve reliable control of type I error at the marginal level as well as strong control of the family-wise error rate (FWER).
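The core resampling idea described above, sign-flipping per-observation score contributions and taking the maximum statistic across tests in a Westfall–Young fashion, can be illustrated with a minimal sketch. This is not the thesis' implementation: the function name, the shape conventions, and the exact standardization are assumptions for illustration only; the key point it shows is that each sign-flip vector is shared across all coefficients, which is what lets the maxT adjustment capture dependence among the tests.

```python
import numpy as np

def maxt_sign_flip(score_contribs, n_flips=999, seed=0):
    """Westfall-Young maxT adjustment via sign flipping: a minimal sketch.

    score_contribs: (n, p) array whose column j holds the per-observation
    score contributions for coefficient j (computed after projecting out
    that coefficient's selected confounders). Returns FWER-adjusted
    p-values, one per coefficient. Names and standardization are
    illustrative, not the thesis' notation.
    """
    rng = np.random.default_rng(seed)
    n, p = score_contribs.shape

    def stats(c):
        # standardized score statistic per column: |sum| / sqrt(sum of squares)
        return np.abs(c.sum(axis=0)) / np.sqrt((c ** 2).sum(axis=0))

    observed = stats(score_contribs)                    # (p,) observed statistics
    max_null = np.empty(n_flips)
    for b in range(n_flips):
        # one random sign vector, reused by every test to preserve dependence
        eps = rng.choice([-1.0, 1.0], size=(n, 1))
        max_null[b] = stats(score_contribs * eps).max()
    # adjusted p-value: share of max-null draws at least as large as each
    # observed statistic, with the +1 correction for validity
    return (1 + (max_null[:, None] >= observed[None, :]).sum(axis=0)) / (1 + n_flips)
```

Because each coefficient's statistic is compared against the permutation distribution of the maximum over all coefficients, rejecting when the adjusted p-value falls below the nominal level controls the family-wise error rate in the strong sense.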
Notably, the procedure based on forward stepwise selection also attains power comparable to that of state-of-the-art approaches such as ridge-projection inference and the debiased Lasso, particularly in settings with strong predictor correlation. The primary trade-off is computational cost, which is mitigated through parallelized resampling. Overall, the thesis provides a practical and theoretically grounded toolkit for rigorous inference in high-dimensional regression and paves the way for analogous advances in broader classes of models.
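The forward stepwise strategy applied within the confounder set follows the generic greedy scheme: repeatedly add the candidate column most correlated with the current residual, then refit. A minimal sketch is given below; the function name, `max_steps`, and `tol` are illustrative placeholders, and the thesis' actual stopping rule may differ.

```python
import numpy as np

def forward_stepwise(Z, y, max_steps=10, tol=1e-8):
    """Greedy forward stepwise selection among candidate confounders Z.

    Z: (n, q) matrix of candidate confounders; y: (n,) response.
    At each step, the column with the largest absolute correlation with
    the current residual is added, and the model is refit by least
    squares. Returns the indices of the selected columns.
    """
    n, q = Z.shape
    selected = []
    resid = y.astype(float).copy()
    col_norms = np.linalg.norm(Z, axis=0)
    for _ in range(min(max_steps, q)):
        # score each candidate by absolute correlation with the residual
        scores = np.abs(Z.T @ resid) / col_norms
        scores[selected] = -np.inf                  # skip already-chosen columns
        j = int(np.argmax(scores))
        if scores[j] < tol:                         # nothing useful left to add
            break
        selected.append(j)
        # refit on the selected columns and update the residual
        Zs = Z[:, selected]
        beta, *_ = np.linalg.lstsq(Zs, y, rcond=None)
        resid = y - Zs @ beta
    return selected
```

In the procedure described above, the score statistic for the target variable would then be computed conditional on the columns this step selects, before entering the sign-flip permutation scheme.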
File: DellaPenna_Paolo.pdf (open access, 1.29 MB, Adobe PDF)
The text of this website © Università degli studi di Padova. Full Text are published under a non-exclusive license. Metadata are under a CC0 License
https://hdl.handle.net/20.500.12608/93034