Innovative Solutions for Policy Optimisation of Model-Based Reinforcement Learning Algorithms

CALÌ, MARCO
2024/2025

Abstract

This thesis addresses the challenge of slow policy optimization in model-based reinforcement learning (MBRL) by accelerating Monte Carlo Probabilistic Inference for Learning Control (MC-PILCO) through integration with trajectory optimization. We introduce Exploration-Boosted MC-PILCO (EB-MC-PILCO), a framework that combines Gaussian Process (GP) dynamics models with the iterative Linear Quadratic Regulator (iLQR) to speed up learning. The approach unfolds in two phases: (1) Guided Exploration, in which iLQR rapidly generates near-optimal trajectories that both explore the state space to train the GP model and supply a strong initialization for the policy; and (2) Pretrained Policy Optimization, in which MC-PILCO's policy is initialized from the iLQR-derived solutions to avoid a costly cold start. To reconcile the deterministic nature of iLQR with the probabilistic framework of GP-based MBRL, we extend iLQR to operate on GP-modeled dynamics and enforce input constraints through a squashing function, ensuring real-world feasibility. The primary contributions of this thesis are twofold: (1) a novel method that integrates iLQR's exploratory trajectories into MC-PILCO's probabilistic policy search, and (2) a demonstration that initializing MC-PILCO's policy with iLQR solutions significantly reduces the time required to solve the task while keeping the number of system interactions unchanged. Extensive simulations validate the approach by comparing several pretraining setups, namely our method, exact mean squared error (MSE) pretraining, and no pretraining, and by benchmarking success rates and cumulative costs across MC-PILCO, EB-MC-PILCO, and standalone iLQR. The methods are evaluated on the cartpole, a nonlinear underactuated system, on the swing-up and stabilization task. Experimental results confirm a substantial reduction in optimization time and improved overall performance, demonstrating the effectiveness of the integrated approach.
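As a rough illustration of the pretraining step described in the abstract, the sketch below fits a squashed policy to iLQR (state, control) pairs by minimizing a mean squared error loss, using a tanh squashing of the form u = u_max * tanh(v / u_max) to keep commanded inputs within bounds. This is a minimal sketch under assumptions made here for illustration only: the PyTorch framework, the MLP architecture, the state dimension, and the bound U_MAX are not taken from the thesis.

    # Hypothetical sketch of the pretraining step: fit a squashed policy to
    # iLQR (state, control) pairs by minimizing mean squared error before
    # handing the policy to MC-PILCO's gradient-based policy search.
    # All names, shapes, and hyperparameters are illustrative, not the thesis implementation.
    import torch
    import torch.nn as nn

    U_MAX = 10.0  # assumed input bound for the cartpole actuator

    class SquashedPolicy(nn.Module):
        """Small MLP whose output is squashed into (-U_MAX, U_MAX) via tanh."""
        def __init__(self, state_dim=4, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x):
            # Squashing keeps every commanded input feasible on the real system.
            return U_MAX * torch.tanh(self.net(x) / U_MAX)

    def pretrain_policy(policy, states, controls, epochs=500, lr=1e-3):
        """MSE pretraining on iLQR rollouts: states (N, 4), controls (N, 1)."""
        opt = torch.optim.Adam(policy.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(policy(states), controls)
            loss.backward()
            opt.step()
        return policy

    # Usage: the exploratory iLQR trajectories collected in phase 1 provide the
    # (state, control) dataset; the pretrained policy then replaces the random
    # initialization ("cold start") in MC-PILCO's policy optimization phase.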
Keywords

MBRL
Importance Sampling
Guiding Distribution
Gaussian Processes
iLQR
Files in this item:

Calì_Marco.pdf (Adobe PDF, 1.58 MB, open access)


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/81938