<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Baohe Zhang 张宝赫</title><link>https://2bh.github.io/</link><atom:link href="https://2bh.github.io/index.xml" rel="self" type="application/rss+xml"/><description>Baohe Zhang 张宝赫</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en</language><lastBuildDate>Thu, 01 Jan 2026 00:00:00 +0000</lastBuildDate><image><url>https://2bh.github.io/media/icon_hu_87928198b81ce9d.png</url><title>Baohe Zhang 张宝赫</title><link>https://2bh.github.io/</link></image><item><title>Fitting Reinforcement Learning Model to Behavioral Data under Bandits</title><link>https://2bh.github.io/publication/zhu2025fitting/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>https://2bh.github.io/publication/zhu2025fitting/</guid><description/></item><item><title>Constrained Reinforcement Learning with Smoothed Log Barrier Function</title><link>https://2bh.github.io/publication/zhang2025constrained/</link><pubDate>Mon, 03 Nov 2025 00:00:00 +0000</pubDate><guid>https://2bh.github.io/publication/zhang2025constrained/</guid><description>&lt;div class="bg-white p-3 rounded"&gt;
&lt;div class="row align-items-center"&gt;
&lt;div class="col-12 text-center mb-3"&gt;
&lt;img src="constrained.png" alt="CSAC-LB: Smoothed Log Barrier for Constrained RL" class="img-fluid" /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="row align-items-center"&gt;
&lt;div class="col-md-6 text-center mb-3 mb-md-0"&gt;
&lt;div class="embed-responsive embed-responsive-16by9"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/TjLUCbmbXuA" title="CSAC-LB demo 1" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="col-md-6 text-center mb-0"&gt;
&lt;div class="embed-responsive embed-responsive-16by9"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/mNxQfl_C-xg" title="CSAC-LB demo 2" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id="highlights"&gt;Highlights&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Standard Lagrangian-based constrained RL is hyperparameter-sensitive and often oscillates between greedy and overly conservative behavior; pure barrier methods are numerically unstable near the constraint boundary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idea.&lt;/strong&gt; We introduce a &lt;strong&gt;Smoothed Log Barrier (CSAC-LB)&lt;/strong&gt; that approximates the indicator of the feasible set while remaining differentiable across the boundary, bridging soft penalty and hard barrier formulations while keeping the optimization numerically stable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Theory.&lt;/strong&gt; We show CSAC-LB is a smooth approximation of the feasible-set indicator and prove that its optimal policy converges to the optimal policy of the original constrained problem as the smoothing parameter tightens.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Practice.&lt;/strong&gt; On safety-critical continuous control benchmarks, CSAC-LB matches or exceeds prior constrained-RL baselines (Lagrangian-SAC, CPO-style methods) in return while staying within the safety budget, and does so without per-task tuning of the barrier hyperparameter. The two example videos above showcase the algorithm on real robots. We define the reward as the negative energy consumption and use different velocities as constraints. The robot dog learns different gaits under different velocity constraints. I have further extended this algorithm to heating systems. In the end, we didn&amp;rsquo;t include the robot experiments in the paper due to hardware faults.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Goal Achievement Guided Exploration: Mitigating Premature Convergence in Learning Robot Control</title><link>https://2bh.github.io/publication/yan2024goal/</link><pubDate>Thu, 31 Oct 2024 00:00:00 +0000</pubDate><guid>https://2bh.github.io/publication/yan2024goal/</guid><description>&lt;div class="bg-white p-3 rounded"&gt;
&lt;div class="row align-items-start"&gt;
&lt;div class="col-md-4 text-center mb-3"&gt;
&lt;div class="embed-responsive embed-responsive-16by9"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/soR6QvfyyRc" title="Go2 Beam Climbing" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class="text-dark small mt-2 mb-0"&gt;Go2 Beam Climbing&lt;/p&gt;
&lt;/div&gt;
&lt;div class="col-md-4 text-center mb-3"&gt;
&lt;div class="embed-responsive embed-responsive-16by9"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/HRLLdQJpJWY" title="Ant on Ball" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class="text-dark small mt-2 mb-0"&gt;Ant on Ball&lt;/p&gt;
&lt;/div&gt;
&lt;div class="col-md-4 text-center mb-3"&gt;
&lt;div class="embed-responsive embed-responsive-16by9"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/xsBVo-pnXw0" title="Humanoid Ball Dribbling" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class="text-dark small mt-2 mb-0"&gt;Humanoid Ball Dribbling&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="row align-items-start justify-content-center"&gt;
&lt;div class="col-md-6 text-center mb-3 mb-md-0"&gt;
&lt;div class="embed-responsive embed-responsive-16by9"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/jYuPbZrJJrI" title="Humanoid Cartwheel" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class="text-dark small mt-2 mb-0"&gt;Humanoid Cartwheel&lt;/p&gt;
&lt;/div&gt;
&lt;div class="col-md-6 text-center mb-0"&gt;
&lt;div class="embed-responsive embed-responsive-16by9"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/YvCzhhtGuGo" title="Humanoid Rope Walking" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class="text-dark small mt-2 mb-0"&gt;Humanoid Rope Walking&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id="highlights"&gt;Highlights&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem.&lt;/strong&gt; Robot-control RL often struggles in the &lt;em&gt;sparse reward&lt;/em&gt; setting: the agent only sees a learning signal after the full task is completed, leading to slow convergence or outright failure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Key observation.&lt;/strong&gt; Even when extrinsic reward is sparse, intermediate &lt;em&gt;goal-achievement&lt;/em&gt; signals (sub-goals being reached, contacts being made, balance being held) carry useful information that standard exploration ignores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Method — GAGE.&lt;/strong&gt; &lt;em&gt;Goal Achievement Guided Exploration&lt;/em&gt; turns these goal-achievement signals into an exploration bonus that steers the policy away from premature convergence onto trivial behaviors, without changing the underlying RL algorithm.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Result.&lt;/strong&gt; Across challenging whole-body control tasks — Go2 beam climbing, Ant balancing on a ball, humanoid dribbling, cartwheels, and rope walking (videos above) — GAGE learns where standard sparse-reward RL stalls, and improves sample efficiency on the tasks where baselines do learn.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Constrained Reinforcement Learning for Safe Heat Pump Control</title><link>https://2bh.github.io/publication/zhang2024constrained/</link><pubDate>Sun, 01 Sep 2024 00:00:00 +0000</pubDate><guid>https://2bh.github.io/publication/zhang2024constrained/</guid><description/></item><item><title>Revisiting Safe Exploration in Safe Reinforcement learning</title><link>https://2bh.github.io/publication/eckel2024revisiting/</link><pubDate>Sun, 01 Sep 2024 00:00:00 +0000</pubDate><guid>https://2bh.github.io/publication/eckel2024revisiting/</guid><description/></item><item><title>Learning Continuous Control with Geometric Regularity from Robot Intrinsic Symmetry</title><link>https://2bh.github.io/publication/yan2024learning/</link><pubDate>Mon, 13 May 2024 00:00:00 +0000</pubDate><guid>https://2bh.github.io/publication/yan2024learning/</guid><description>&lt;div class="bg-white p-3 rounded"&gt;
&lt;div class="row align-items-center"&gt;
&lt;div class="col-12 text-center mb-0"&gt;
&lt;img src="masa.png" alt="MASA: leveraging robot intrinsic symmetry for continuous control" class="img-fluid" /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="row align-items-start mt-4"&gt;
&lt;div class="col-6 col-md-3 mb-3"&gt;
&lt;div class="embed-responsive" style="padding-bottom: 177.78%;"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/q_JT1ImIFM8" title="MASA Trifinger Real" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class="small mt-2 mb-0 text-center"&gt;MASA — Trifinger (Real)&lt;/p&gt;
&lt;/div&gt;
&lt;div class="col-6 col-md-3 mb-3"&gt;
&lt;div class="embed-responsive" style="padding-bottom: 177.78%;"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/XGPIhZw28sk" title="Baseline Trifinger Real" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class="small mt-2 mb-0 text-center"&gt;Baseline — Trifinger (Real)&lt;/p&gt;
&lt;/div&gt;
&lt;div class="col-6 col-md-3 mb-3"&gt;
&lt;div class="embed-responsive" style="padding-bottom: 177.78%;"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/nqAOm3owZig" title="MASA Ant on Ball" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class="small mt-2 mb-0 text-center"&gt;MASA — Ant on Ball&lt;/p&gt;
&lt;/div&gt;
&lt;div class="col-6 col-md-3 mb-3"&gt;
&lt;div class="embed-responsive" style="padding-bottom: 177.78%;"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/26OSSJZ7eO0" title="Baseline Ant on Ball" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;p class="small mt-2 mb-0 text-center"&gt;Baseline — Ant on Ball&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id="highlights"&gt;Highlights&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Observation.&lt;/strong&gt; Many robots are &lt;em&gt;built symmetric&lt;/em&gt; — quadrupeds, humanoids, and multi-finger hands all carry reflectional/rotational symmetry in their structure. Yet standard MLP policies for single-agent control ignore this prior entirely.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idea — MASA.&lt;/strong&gt; We design network structures for single-agent control that &lt;strong&gt;explicitly encode the robot&amp;rsquo;s intrinsic geometric symmetry&lt;/strong&gt;, drawing inspiration from parameter-sharing patterns used in cooperative multi-agent RL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection to multi-agent RL.&lt;/strong&gt; We make the relationship between geometric priors and &lt;em&gt;parameter sharing&lt;/em&gt; precise: treating symmetric limbs/fingers as cooperating &amp;ldquo;agents&amp;rdquo; with shared parameters recovers an equivariant policy class.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Result.&lt;/strong&gt; Symmetry-aware policies learn faster and transfer better than baselines across simulated and real-robot tasks (videos above: Trifinger manipulation and Ant-on-Ball balancing). The paper earned a Best Paper nomination in the Cognitive Robotics Track at ICRA 2024.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Automated Reinforcement Learning (AutoRL): A Survey and Open Problems</title><link>https://2bh.github.io/publication/rajan2023autorl/</link><pubDate>Sun, 01 Jan 2023 00:00:00 +0000</pubDate><guid>https://2bh.github.io/publication/rajan2023autorl/</guid><description/></item><item><title>Automated reinforcement learning (autorl): A survey and open problems</title><link>https://2bh.github.io/publication/parkerholder2022autorl/</link><pubDate>Mon, 19 Sep 2022 00:00:00 +0000</pubDate><guid>https://2bh.github.io/publication/parkerholder2022autorl/</guid><description/></item><item><title>On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning</title><link>https://2bh.github.io/publication/hpo4mbrl/</link><pubDate>Thu, 01 Apr 2021 00:00:00 +0000</pubDate><guid>https://2bh.github.io/publication/hpo4mbrl/</guid><description>&lt;div class="bg-white p-3 rounded"&gt;
&lt;div class="row align-items-center"&gt;
&lt;div class="col-md-6 text-center mb-3"&gt;
&lt;img src="pbt.png" alt="Population-Based Training (PBT)" class="img-fluid" /&gt;
&lt;/div&gt;
&lt;div class="col-md-6 text-center mb-3"&gt;
&lt;img src="multi-fidelity.png" alt="Multi-fidelity HPO" class="img-fluid" /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="row align-items-center"&gt;
&lt;div class="col-md-6 text-center mb-0"&gt;
&lt;img src="hpo4rl_curves.png" alt="HPO for MBRL learning curves" class="img-fluid" /&gt;
&lt;/div&gt;
&lt;div class="col-md-6 text-center mb-0"&gt;
&lt;div class="embed-responsive embed-responsive-16by9"&gt;
&lt;iframe class="embed-responsive-item" src="https://www.youtube.com/embed/ztuyicYEiXw" title="HPO for MBRL talk" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id="highlights"&gt;Highlights&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Why HPO for MBRL?&lt;/strong&gt; Model-based RL pipelines stack dynamics learning and planning, exposing tens of interacting hyperparameters that are usually hand-tuned by experts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Approach.&lt;/strong&gt; We apply automated HPO, including &lt;em&gt;multi-fidelity&lt;/em&gt; search (top-right) and &lt;em&gt;Population-Based Training&lt;/em&gt; (top-left), to MBRL, and additionally allow hyperparameters to be tuned &lt;em&gt;dynamically&lt;/em&gt; during training. Figures were made by André Biedenkapp.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Result.&lt;/strong&gt; Automated HPO beats human-expert tuning across MuJoCo tasks (bottom-left), and dynamic tuning yields further gains over the best static configuration. Our results also uncovered a bug in MuJoCo that lets HalfCheetah go wild like a &lt;strong&gt;helicopter&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Takeaways.&lt;/strong&gt; The paper also dissects which hyperparameters (plan horizon, learning rate, etc.) drive stability and final reward in MBRL.&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>