Baohe Zhang 张宝赫

Fitting Reinforcement Learning Model to Behavioral Data under Bandits

Thu, 01 Jan 2026 00:00:00 +0000

Mon, 03 Nov 2025 00:00:00 +0000

Problem. Standard Lagrangian-based constrained RL is hyperparameter-sensitive and often oscillates between greedy and overly-conservative behavior; pure barrier methods are numerically unstable near the constraint boundary.
Idea. We introduce a Smoothed Log Barrier (CSAC-LB) that approximates the indicator of the feasible set while remaining differentiable across the boundary, bridging soft penalty and hard barrier formulations while keeping the optimization numerically stable.
Theory. We show CSAC-LB is a smooth approximation of the feasible-set indicator and prove that its optimal policy converges to the optimal policy of the original constrained problem as the smoothing parameter tightens.
Practice. On safety-critical continuous control benchmarks, CSAC-LB matches or exceeds prior constrained-RL baselines (Lagrangian-SAC, CPO-style methods) in return while staying within the safety budget, and does so without per-task tuning of the barrier hyperparameter. In the two example videos above, I showcase the algorithm on real robots. We define the negative energy consumption as the reward and use different velocities as constraints. The robot dog learns different gaits under different velocity constraints. I have further extended this algorithm to a heating system. In the end, we did not include the robot experiments in the paper because of faulty hardware.
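To make the smoothed-barrier idea concrete, here is a minimal sketch of one common construction: a log barrier that is continued by its tangent line close to the boundary. Whether this matches the exact form used in CSAC-LB is an assumption; the function name `smoothed_log_barrier`, the constraint convention `z <= 0`, and the parameter `t` are illustrative choices, not taken from the paper.

```python
import numpy as np

def smoothed_log_barrier(z, t=5.0):
    """Log barrier for a constraint z <= 0, extended linearly near the boundary.

    For z <= -1/t**2 this is the usual barrier -log(-z)/t. Beyond that point it
    continues as the tangent line, so it stays finite and differentiable even
    for infeasible z > 0. (Assumed form, for illustration only.)
    """
    z = np.asarray(z, dtype=float)
    threshold = -1.0 / t**2
    # Clamp the log argument so the unused branch never takes log of a non-positive number.
    safe = np.minimum(z, threshold)
    barrier = -np.log(-safe) / t
    # Tangent line at the threshold: same value and same slope (t) as the barrier there.
    linear = t * z - np.log(1.0 / t**2) / t + 1.0 / t
    return np.where(z <= threshold, barrier, linear)
```

As `t` grows, the penalty on feasible points shrinks toward zero and the penalty on infeasible points grows, which is the sense in which such a function approximates the indicator of the feasible set.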

Thu, 31 Oct 2024 00:00:00 +0000

Go2 Beam Climbing

Ant on Ball

Humanoid Ball Dribbling

Humanoid Cartwheel

Humanoid Rope Walking

Problem. Robot-control RL often gets stuck in the sparse-reward setting: the agent only sees a learning signal after the full task is completed, leading to slow convergence or outright failure.
Key observation. Even when extrinsic reward is sparse, intermediate goal-achievement signals (sub-goals being reached, contacts being made, balance being held) carry useful information that standard exploration ignores.
Method — GAGE. Goal Achievement Guided Exploration turns these goal-achievement signals into an exploration bonus that steers the policy away from premature convergence onto trivial behaviors, without changing the underlying RL algorithm.
Result. Across challenging whole-body control tasks — Go2 beam climbing, Ant balancing on a ball, humanoid dribbling, cartwheels, and rope walking (videos above) — GAGE learns where standard sparse-reward RL stalls, and improves sample efficiency on the tasks where baselines do learn.
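As a rough illustration of how goal-achievement signals can be turned into an exploration bonus without touching the RL algorithm, the sketch below wraps the reward with a count-based bonus on achieved sub-goals. This is a stand-in technique, not GAGE's actual formulation; the class `GoalAchievementBonus`, the `scale` parameter, and the sub-goal labels are all hypothetical.

```python
from collections import defaultdict

class GoalAchievementBonus:
    """Hypothetical count-based bonus on achieved sub-goals (not GAGE itself)."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.counts = defaultdict(int)  # how often each sub-goal has been achieved so far

    def __call__(self, extrinsic_reward, achieved_goals):
        bonus = 0.0
        for goal in achieved_goals:
            self.counts[goal] += 1
            # Rarely achieved sub-goals earn a larger bonus; it decays as 1/sqrt(count).
            bonus += self.scale / self.counts[goal] ** 0.5
        return extrinsic_reward + bonus
```

Because the bonus only modifies the reward passed to the learner, any off-the-shelf RL algorithm could consume the shaped reward unchanged.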

Sun, 01 Sep 2024 00:00:00 +0000

Sun, 01 Sep 2024 00:00:00 +0000

Mon, 13 May 2024 00:00:00 +0000

MASA — Trifinger (Real)

Baseline — Trifinger (Real)

MASA — Ant on Ball

Baseline — Ant on Ball

Observation. Many robots are built symmetric — quadrupeds, humanoids, and multi-finger hands all carry reflectional/rotational symmetry in their structure. Yet standard MLP policies for single-agent control ignore this prior entirely.
Idea — MASA. We design network structures for single-agent control that explicitly encode the robot’s intrinsic geometric symmetry, drawing inspiration from parameter-sharing patterns used in cooperative multi-agent RL.
Connection to multi-agent RL. We make the relationship between geometric priors and parameter sharing precise: treating symmetric limbs/fingers as cooperating “agents” with shared parameters recovers an equivariant policy class.
Result. Symmetry-aware policies learn faster and transfer better than baselines across simulated and real-robot tasks (videos above: Trifinger manipulation and Ant-on-Ball balancing), and earned a Best Paper nomination in the Cognitive Robotics Track at ICRA 2024.
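The link between parameter sharing and equivariance can be checked on a toy example. The sketch below applies one shared network to every limb, where each limb sees its own observation plus an order-independent summary of the others; permuting the limbs then permutes the actions in the same way. It is an illustrative assumption rather than the MASA architecture: it covers only permutation symmetry (real reflections and rotations also transform coordinates within each limb), and all names and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HIDDEN, ACT_DIM = 4, 16, 2  # illustrative sizes

# One set of weights shared by every limb.
W1 = rng.normal(size=(2 * OBS_DIM, HIDDEN))
W2 = rng.normal(size=(HIDDEN, ACT_DIM))

def shared_limb_policy(limb_obs):
    """Map per-limb observations (n_limbs, OBS_DIM) to actions (n_limbs, ACT_DIM)."""
    # Each limb sees its own observation plus the sum of the other limbs' observations.
    others = limb_obs.sum(axis=0, keepdims=True) - limb_obs
    features = np.concatenate([limb_obs, others], axis=1)
    return np.tanh(np.tanh(features @ W1) @ W2)

obs = rng.normal(size=(4, OBS_DIM))
perm = np.array([2, 0, 3, 1])
# Equivariance check: permuting the limbs in the input permutes the actions identically.
assert np.allclose(shared_limb_policy(obs[perm]), shared_limb_policy(obs)[perm])
```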

Sun, 01 Jan 2023 00:00:00 +0000

Mon, 19 Sep 2022 00:00:00 +0000

Thu, 01 Apr 2021 00:00:00 +0000

Why HPO for MBRL? Model-based RL pipelines stack dynamics learning and planning, exposing tens of interacting hyperparameters that are usually hand-tuned by experts.
Approach. We apply automated HPO, including multi-fidelity search (top-right) and Population-Based Training (top-left), to MBRL, and additionally allow hyperparameters to be tuned dynamically during training. Figures were made by André Biedenkapp.
Result. Automated HPO beats human-expert tuning across MuJoCo tasks (bottom-left), and dynamic tuning yields further gains over the best static configuration. Our experiments also uncovered a bug in MuJoCo that lets the HalfCheetah move wildly, like a helicopter.
Takeaways. The paper also dissects which hyperparameters (plan horizon, learning rate, etc.) drive stability and final reward in MBRL.
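For readers unfamiliar with Population-Based Training, the following is a minimal sketch of a single exploit/explore step on a population of configurations. It is a generic illustration, not the pipeline used in the paper; the function `pbt_step`, the dictionary layout, the perturbation factors, and the replacement fraction are all assumptions.

```python
import random

def pbt_step(population, rng, perturb=(0.8, 1.2), frac=0.25):
    """One exploit/explore step.

    population: list of dicts with a 'score' and an 'hparams' dict of floats.
    The worst `frac` of members copy hyperparameters from a random top member
    and perturb them; all other members are left untouched.
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    k = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:k], ranked[-k:]
    for member in bottom:
        parent = rng.choice(top)
        # Exploit: start from the parent's hyperparameters (a real run would also copy weights).
        # Explore: multiply each value by a randomly chosen perturbation factor.
        member["hparams"] = {
            name: value * rng.choice(perturb)
            for name, value in parent["hparams"].items()
        }
    return population
```

Calling such a step periodically during training is what makes the hyperparameters dynamic: each member's configuration can change mid-run instead of being fixed up front.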