Innovative Solutions for Policy Optimisation of Model-Based Reinforcement Learning Algorithms
CALÌ, MARCO
2024/2025
Abstract
This thesis addresses the challenge of slow optimization in model-based reinforcement learning (MBRL) by accelerating Monte Carlo Probabilistic Inference for Learning Control (MC-PILCO) through a strategic integration with trajectory optimization. We introduce Exploration-Boosted MC-PILCO (EB-MC-PILCO), a framework that combines Gaussian Process (GP) dynamics models with the iterative Linear Quadratic Regulator (iLQR) to expedite learning. Our approach unfolds in two phases: (1) Guided Exploration, in which iLQR rapidly generates near-optimal trajectories that both explore the state space to improve the GP model and provide a strong initialization for policy optimization; and (2) Pretrained Policy Optimization, in which MC-PILCO's policy is initialized from the iLQR-derived solutions to avoid a costly cold start. To reconcile the deterministic nature of iLQR with the probabilistic framework of GP-based MBRL, we extend iLQR to accommodate GP-modeled dynamics and enforce input constraints via a squashing function, thereby ensuring real-world feasibility. The primary contributions of this thesis are twofold: (1) a novel method that integrates iLQR's exploratory trajectories into the probabilistic policy search of MC-PILCO, and (2) a demonstration that initializing MC-PILCO's policy with iLQR solutions significantly reduces the time required to solve the task while keeping the number of system interactions unchanged. Extensive simulations validate the efficacy of our approach by comparing several pretraining setups: our method, exact mean squared error (MSE) pretraining, and no pretraining. We further benchmark success rates and cumulative costs among MC-PILCO, EB-MC-PILCO, and standalone iLQR. The proposed methods are evaluated on the cartpole, a nonlinear underactuated system, on the swing-up and stabilization task. Experimental results confirm a substantial reduction in optimization time and improved overall performance, highlighting the effectiveness of our integrated approach.
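To make the two mechanisms named in the abstract concrete, the sketch below illustrates, in PyTorch, (a) a policy whose output is passed through a tanh squashing function to enforce input constraints, and (b) MSE pretraining of that policy on state-action pairs collected from iLQR rollouts during the guided-exploration phase. The network architecture, input bound, variable names, and hyperparameters are illustrative assumptions, not the exact implementation used in the thesis.

```python
# Minimal sketch (assumptions, not the thesis implementation): a squashed
# policy pretrained by MSE on iLQR-derived state-action pairs.
import torch
import torch.nn as nn

U_MAX = 10.0  # assumed actuation bound; squashing keeps |u| <= U_MAX


class SquashedPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x):
        # tanh squashing enforces the input constraint on the policy output
        return U_MAX * torch.tanh(self.net(x))


def pretrain_on_ilqr(policy, states, actions, epochs=200, lr=1e-3):
    """Fit the policy to iLQR (state, action) pairs by mean squared error."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(states), actions)
        loss.backward()
        opt.step()
    return policy


# Hypothetical usage with cartpole-sized data from the exploration phase:
# policy = SquashedPolicy(state_dim=4, action_dim=1)
# policy = pretrain_on_ilqr(policy, ilqr_states, ilqr_actions)
```

The pretrained policy then serves as the starting point for MC-PILCO's policy optimization, replacing the cold start referred to in the abstract.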
https://hdl.handle.net/20.500.12608/81938