On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning

Thu, 01 Apr 2021 00:00:00 +0000

Highlights

Why HPO for MBRL? Model-based RL pipelines stack dynamics learning and planning, exposing tens of interacting hyperparameters that are usually hand-tuned by experts.
Approach. We apply automated HPO, including multi-fidelity search (top-right) and Population-Based Training (top-left), to MBRL, and additionally allow hyperparameters to be tuned dynamically during training. Figures by André Biedenkapp.
Result. Automated HPO beats human-expert tuning across MuJoCo tasks (bottom-left), and dynamic tuning yields further gains over the best static configuration. Our experiments also uncovered a bug in MuJoCo that lets HalfCheetah move wildly, like a helicopter.
Takeaways. The paper also dissects which hyperparameters (planning horizon, learning rate, etc.) drive stability and final reward in MBRL.
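
To make the idea of dynamic tuning concrete, here is a minimal sketch of a Population-Based Training loop on a toy problem. It is not the setup used in the paper: `toy_train` is a hypothetical stand-in for a round of MBRL training, and the population size, the quartile cutoff, and the 0.8/1.2 perturbation factors are illustrative assumptions.

```python
import random


def toy_train(score, lr):
    """Hypothetical stand-in for one round of MBRL training (not from the paper)."""
    # Assumption: the best learning rate shrinks as training progresses,
    # so a single static value cannot stay optimal for the whole run.
    best_lr = 0.1 / (1.0 + score)
    gain = max(0.0, 1.0 - abs(lr - best_lr) / best_lr)
    return score + gain


def pbt(pop_size=8, rounds=20, seed=0):
    rng = random.Random(seed)
    # Each member carries its own hyperparameter (lr) and training state (score).
    population = [
        {"lr": 10 ** rng.uniform(-3, 0), "score": 0.0} for _ in range(pop_size)
    ]
    for _ in range(rounds):
        # Train every member for one round with its current hyperparameters.
        for member in population:
            member["score"] = toy_train(member["score"], member["lr"])

        # Rank members and pick the top and bottom quartiles.
        population.sort(key=lambda m: m["score"], reverse=True)
        cutoff = max(1, pop_size // 4)
        top, bottom = population[:cutoff], population[-cutoff:]

        for loser in bottom:
            winner = rng.choice(top)
            # Exploit: copy the state of a better-performing member.
            loser["score"] = winner["score"]
            # Explore: perturb the copied hyperparameter.
            loser["lr"] = winner["lr"] * rng.choice([0.8, 1.2])
    return max(population, key=lambda m: m["score"])


if __name__ == "__main__":
    best = pbt()
    print(f"best lr at end of training: {best['lr']:.4f}, score: {best['score']:.2f}")
```

Because weak members repeatedly inherit and perturb the settings of strong ones, the learning rate can drift over the course of training instead of staying fixed, which is the property that separates dynamic tuning from a static configuration.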