ROBUST DATA SELECTION AND OVERFITTING FOR INTRADAY TRADING WITH MACHINE LEARNING
BERTO, ENRICO
2024/2025
Abstract
Overfitting remains one of the main obstacles to applying machine learning techniques to algorithmic trading, especially with high-frequency data. While recent research suggests that data selection mitigates this issue, empirical applications often lack robust statistical tools to quantify overfitting risk. This paper extends that analysis by combining a systematic data selection framework with several machine learning models and modern overfitting diagnostics, including Purged Cross-Validation, the Probability of Backtest Overfitting, and the Deflated Sharpe Ratio. Using one-minute Foreign Exchange data across multiple currency pairs and market regimes, we evaluate how the choice of data source, sampling frequency, machine learning model, and market instrument affects both predictive accuracy and robustness. The results show that apparent in-sample profitability collapses out-of-sample, even when data are carefully selected and validated with stringent statistical tests: none of the selected strategies remains profitable once robustness diagnostics and realistic trading costs are applied. We propose a reproducible methodological pipeline that researchers and practitioners can adopt to design more reliable trading strategies.

| File | Size | Format |
|---|---|---|
| Berto_Enrico.pdf (restricted access) | 2.55 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/101977