Innovative Solutions for Policy Optimisation of Model-Based Reinforcement Learning Algorithms
CALÌ, MARCO
2024/2025
Abstract
This thesis addresses the challenge of slow optimization in model-based reinforcement learning (MBRL) by accelerating Monte Carlo Probabilistic Inference for Learning Control (MC-PILCO) through a strategic integration with trajectory optimization. We introduce Exploration-Boosted MC-PILCO (EB-MC-PILCO), a framework that combines Gaussian Process (GP) dynamics models with the iterative Linear Quadratic Regulator (iLQR) to expedite learning. Our approach unfolds in two phases: (1) Guided Exploration, in which iLQR rapidly generates near-optimal trajectories that both explore the state space to improve the GP model and provide a strong initialization for policy optimization; and (2) Pretrained Policy Optimization, in which MC-PILCO's policy is initialized from the iLQR-derived solutions to avoid a costly cold start. To reconcile the deterministic nature of iLQR with the probabilistic framework of GP-based MBRL, we extend iLQR to accommodate GP-modeled dynamics and enforce input constraints via a squashing function, thereby ensuring real-world feasibility. The primary contributions of this thesis are twofold: (1) a novel method that integrates iLQR's exploratory trajectories into the probabilistic policy search of MC-PILCO, and (2) a demonstration that initializing MC-PILCO's policy with iLQR solutions significantly reduces the time required to solve the task while keeping the number of system interactions unchanged. Extensive simulations validate the efficacy of our approach by comparing several pretraining setups: our method, exact mean squared error (MSE) pretraining, and no pretraining. We further benchmark success rates and cumulative costs among MC-PILCO, EB-MC-PILCO, and standalone iLQR. The proposed methods are evaluated on the cartpole, a nonlinear underactuated system, on the swing-up and stabilization task. Experimental results confirm a substantial reduction in optimization time and improved overall performance, highlighting the effectiveness of our integrated approach.
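To make the two mechanisms named in the abstract concrete, the sketch below illustrates, in PyTorch, (a) a policy whose output is passed through a tanh squashing function to enforce input constraints, and (b) MSE pretraining of that policy on state-action pairs collected from iLQR rollouts during the guided-exploration phase. The network architecture, input bound, variable names, and hyperparameters are illustrative assumptions, not the exact implementation used in the thesis.

```python
# Minimal sketch (assumptions, not the thesis implementation): a squashed
# policy pretrained by MSE on iLQR-derived state-action pairs.
import torch
import torch.nn as nn

U_MAX = 10.0  # assumed actuation bound; squashing keeps |u| <= U_MAX


class SquashedPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x):
        # tanh squashing enforces the input constraint on the policy output
        return U_MAX * torch.tanh(self.net(x))


def pretrain_on_ilqr(policy, states, actions, epochs=200, lr=1e-3):
    """Fit the policy to iLQR (state, action) pairs by mean squared error."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(states), actions)
        loss.backward()
        opt.step()
    return policy


# Hypothetical usage with cartpole-sized data from the exploration phase:
# policy = SquashedPolicy(state_dim=4, action_dim=1)
# policy = pretrain_on_ilqr(policy, ilqr_states, ilqr_actions)
```

The pretrained policy then serves as the starting point for MC-PILCO's policy optimization, replacing the cold start referred to in the abstract.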
https://hdl.handle.net/20.500.12608/81938