ROBUST DATA SELECTION AND OVERFITTING FOR INTRADAY TRADING WITH MACHINE LEARNING

BERTO, ENRICO
2024/2025

Abstract

Overfitting remains one of the most important obstacles to applying machine learning techniques to algorithmic trading, especially with high-frequency data. While recent research shows that data selection mitigates this issue, empirical applications often lack robust statistical tools to quantify overfitting risk. This paper extends that analysis by combining a systematic data selection framework with modern overfitting diagnostics (Purged Cross-Validation, the Probability of Backtest Overfitting, and the Deflated Sharpe Ratio) applied across several machine learning models. Using one-minute foreign exchange data across multiple currency pairs and market regimes, we evaluate how the choice of data source, sampling frequency, machine learning model, and market instrument affects both predictive accuracy and robustness. The results show that apparent in-sample profitability collapses out-of-sample even when data are carefully selected and validated with stringent statistical tests: none of the selected strategies remains profitable once robustness diagnostics and realistic trading costs are applied. We propose a reproducible methodological pipeline that researchers and practitioners can adopt to design more reliable trading strategies.
2024
Machine learning
Algorithmic trading
Foreign exchange
Overfitting
Trading profitability
Files in this item:
File: Berto_Enrico.pdf
Access: restricted
Size: 2.55 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/101977