Corrected Policy Gradients: From Gu (2016) to OpenClawBrain

A practical derivation from the classic REINFORCE fix to trajectory-aware graph routing.

Jonathan Gu · February 2026

Introduction

In 2016, I proved that the standard REINFORCE update used in Williams (1992) was only a one-step approximation and did not by itself optimize the value of the current state for all future steps. The corrected rule I derived requires the score term to be summed over the remainder of the trajectory, not only multiplied by the reward at a single time step.

In modern language: the original rule was myopic. It rewarded only the action you just took. The corrected rule rewards that action together with everything that still follows it in the same rollout.

The Original Paper

Gu (2016) shows that the conventional REINFORCE rule is:

\[ \Delta \rho \propto \frac{\partial \log P_\rho(a_t\mid s_t)}{\partial \rho} z_t \]

and argues this updates toward a high-probability action for the immediate move, but it does not represent the right objective gradient for state-wise value maximization. The corrected update is

\[ \frac{\partial v_\rho(s_t)}{\partial \rho} = E\!\left[ z_T\sum_{l=t}^{T}\frac{\partial\log P_\rho(a_l\mid s_l)}{\partial \rho} \right] \]

The key difference is the sum over the tail of the trajectory \(l=t,\dots,T\). That sum makes the update aware of how each earlier choice affects all future decisions and outcomes.

So the corrected rule is the one that maximizes the value function at state \(s_t\), not only the immediate one-step contribution.

Policy Gradient From Scratch

Start with a concrete picture: one node, three doors, and a STOP sign

Imagine a node \(i\), and three outgoing actions \(A,B,C\), plus \(\text{STOP}\). OpenClawBrain chooses an action stochastically via a softmax policy:

\[ \pi(a\mid i)= \frac{\exp\left((r_{ia}+w_{ia})/\tau\right)}{\sum_{j\in \text{neighbors}(i)\cup\{\text{STOP}\}} \exp\left((r_{ij}+w_{ij})/\tau\right)} \]

Here \(r_{ia}\) is a fixed relevance term, \(w_{ia}\) is the trainable parameter, and \(\tau\) controls entropy.
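A minimal sketch of this softmax policy in plain Python. The function and argument names are illustrative, not OpenClawBrain's actual API; the math matches the formula above.

```python
import math

def policy(scores, tau=1.0):
    """Softmax over combined per-edge scores at one node.

    `scores` maps each action (including "STOP") to its combined score
    r_ia + w_ia; `tau` is the temperature controlling entropy.
    Names here are illustrative, not OpenClawBrain's actual API.
    """
    exps = {a: math.exp(s / tau) for a, s in scores.items()}
    z = sum(exps.values())  # the normalizer Z_i
    return {a: e / z for a, e in exps.items()}
```

Lowering `tau` sharpens the distribution toward the highest-scoring action; raising it moves the policy toward uniform exploration.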

Now, if the episode outcome is good, we should increase the chance of the chosen action and decrease the chance of the alternatives so that the total probability at node \(i\) stays normalized. That is the conservation intuition behind policy gradients.

Step 1: write \(\log \pi(a\mid i)\)

Let \(S_i\) be the set of actions at node \(i\) and \(Z_i=\sum_{j\in S_i}\exp((r_{ij}+w_{ij})/\tau)\). Then:

\[ \log \pi(a\mid i)=\frac{r_{ia}+w_{ia}}{\tau}-\log Z_i \]

Step 2: gradients w.r.t. learned logits

For the selected action \(a\):

\[ \frac{\partial \log \pi(a\mid i)}{\partial w_{ia}}=\frac{1}{\tau}\left(1-\pi(a\mid i)\right) \]

For any \(j\neq a\):

\[ \frac{\partial \log \pi(a\mid i)}{\partial w_{ij}}=-\frac{1}{\tau}\pi(j\mid i) \]

In vector form:

\[ \nabla_{\mathbf{w}_i}\log \pi(a\mid i)=\frac{1}{\tau}\left(\mathbf e_a-\boldsymbol\pi_i\right) \]

Summing components gives \(\sum_j (e_a(j)-\pi(j\mid i))=0\), so this is always a reallocation of mass.

Numerical example

Take \(S_i=\{A,B,C,\text{STOP}\}\), \(\tau=1\), and combined scores \(2.0,1.0,0.0,0.5\). Then \(\pi(A\mid i)\approx0.579\), \(\pi(B\mid i)\approx0.213\), \(\pi(C\mid i)\approx0.078\), and \(\pi(\text{STOP}\mid i)\approx0.129\).

If we chose \(B\) and got positive outcome:

\[ \Delta w_{iB}\propto 1-\pi(B\mid i)\approx 0.787,\qquad \Delta w_{iA}\propto -\pi(A\mid i)\approx -0.579, \]

\[ \Delta w_{iC}\propto -\pi(C\mid i)\approx -0.078,\qquad \Delta w_{i,\text{STOP}}\propto -\pi(\text{STOP}\mid i)\approx -0.129 \]

Chosen action goes up, all others go down in proportion to their current mass. Sum is approximately zero.
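The worked example can be checked numerically. The sketch below computes the per-node direction \(\frac1\tau(\mathbf e_a-\boldsymbol\pi_i)\) for the scores above; the function name and index-based actions are assumptions for illustration.

```python
import math

def pg_update(scores, chosen, tau=1.0):
    """Per-node policy-gradient direction (1/tau) * (e_chosen - pi).

    `scores` is the list of combined scores r_ij + w_ij for the actions
    at one node, `chosen` the index of the sampled action.  Illustrative
    sketch, not OpenClawBrain's actual API.
    """
    exps = [math.exp(s / tau) for s in scores]
    z = sum(exps)
    pi = [e / z for e in exps]
    # Indicator minus probability: positive for the chosen action,
    # negative (proportional to current mass) for all others.
    return [((1.0 if k == chosen else 0.0) - p) / tau for k, p in enumerate(pi)]

# Worked example: actions [A, B, C, STOP], chosen action B (index 1).
delta = pg_update([2.0, 1.0, 0.0, 0.5], chosen=1)
```

The components sum to zero, confirming the update is a pure reallocation of probability mass at the node.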

Full trajectory update

\[ \nabla J(\mathbf w)=\mathbb E\left[G\sum_{t=0}^{T-1}\nabla\log\pi(a_t\mid i_t)\right] \]

If a baseline \(b(i_t)\) is used:

\[ \Delta\mathbf w\propto\sum_{t=0}^{T-1}(G_t-b(i_t))\nabla_{\mathbf w}\log\pi(a_t\mid i_t) \]
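The baselined trajectory update can be sketched as a loop over a logged rollout. The data layout (a list of `(node, scores, chosen)` tuples and a `baseline` callable) is an assumption for illustration, not the production format.

```python
import math

def softmax(scores, tau=1.0):
    exps = {a: math.exp(s / tau) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def trajectory_update(trajectory, returns, baseline, lr=0.1, tau=1.0):
    """REINFORCE with baseline over one logged rollout.

    `trajectory` is a list of (node, scores, chosen) tuples, `returns[t]`
    is G_t, and `baseline(node)` is b(i_t).  All names are illustrative.
    Returns per-node logit deltas lr * (G_t - b) * (e_a - pi) / tau.
    """
    deltas = {}
    for t, (node, scores, chosen) in enumerate(trajectory):
        adv = returns[t] - baseline(node)  # advantage (G_t - b(i_t))
        pi = softmax(scores, tau)
        d = deltas.setdefault(node, {a: 0.0 for a in scores})
        for a in scores:
            d[a] += lr * adv * ((1.0 if a == chosen else 0.0) - pi[a]) / tau
    return deltas
```

Note that each node's accumulated deltas still sum to zero, because every per-step term is a scaled copy of the zero-sum direction \(\mathbf e_{a_t}-\boldsymbol\pi_{i_t}\).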

Why old heuristic updates are directionally right but incomplete

The heuristic “reward the chosen edge only” can increase the chosen edge’s logit by \(\delta\), but it does not explicitly reduce competing actions.

In softmax-based policies, increasing one edge and leaving others unchanged violates the coupling imposed by the normalizer. True policy gradient does both operations together: increase chosen mass and reallocate what was allocated elsewhere.

Comparison: true PG vs. heuristic

| Action | \(\pi(j\mid i)\) | True PG update \(\Delta w_{ij}\propto e_B(j)-\pi(j\mid i)\) | Heuristic update |
|---|---|---|---|
| A | 0.579 | \(-0.579\) | 0 |
| B | 0.213 | \(+0.787\) | \(+\delta\) |
| C | 0.078 | \(-0.078\) | 0 |
| STOP | 0.129 | \(-0.129\) | 0 |
| Sum | 1.000 | \(\approx 0\) | \(+\delta\) |

From Theory to Production: OpenClawBrain

In OpenClawBrain, we walk a document graph under a stochastic policy \[ \pi_W(a\mid i)=\frac{\exp\left((r_{ia}+w_{ia})/\tau\right)}{\sum_{j\in\text{out}(i)}\exp\left((r_{ij}+w_{ij})/\tau\right)}. \] The implementation logs one full trajectory of visited states and actions. The corrected rollout-level update applies across that full path, which is the same structure as the Gu (2016) correction.

\[ \Delta W \propto z_T \sum_{\ell=0}^{T-1}\nabla_W\log\pi_W(a_\ell\mid s_\ell) \]

That is exactly what apply_outcome_pg implements. For each visited node, it updates the whole local action vector with \(\frac1\tau(\mathbf e_{a_\ell}-\boldsymbol\pi_{s_\ell})\), so updates at each node sum to zero:

\[ \sum_{j\in\text{out}(i)}\Delta w_{ij}=0 \]

The legacy apply_outcome update used only the myopic piece (pushing the chosen edge), which is directionally right but misses the competition term. In production this shows up as unbalanced mass shifts over repeated traversals.
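A minimal sketch of what an update in the spirit of apply_outcome_pg could look like. The data structures (`graph_w[i][j]` logits, `relevance[i][j]` fixed terms, a path of `(node, chosen)` pairs) and the function name are assumptions for illustration, not the real implementation.

```python
import math

def apply_outcome_pg_sketch(graph_w, path, outcome, relevance, tau=1.0, lr=0.05):
    """Rollout-level PG update over a logged path (illustrative sketch).

    `graph_w[i][j]` are trainable edge logits, `relevance[i][j]` the fixed
    r_ij terms, `path` a list of (node, chosen_action) pairs, `outcome`
    the scalar z_T.  Structure assumed for illustration only.
    """
    for node, chosen in path:
        scores = {j: relevance[node][j] + graph_w[node][j] for j in graph_w[node]}
        exps = {j: math.exp(s / tau) for j, s in scores.items()}
        z = sum(exps.values())
        pi = {j: e / z for j, e in exps.items()}
        # Scale the zero-sum direction (e_chosen - pi)/tau by the outcome,
        # so mass moves toward the chosen edge and away from competitors.
        for j in graph_w[node]:
            graph_w[node][j] += lr * outcome * ((1.0 if j == chosen else 0.0) - pi[j]) / tau
```

Unlike the legacy single-edge push, each node's deltas cancel exactly, so repeated traversals reallocate probability rather than inflating total logit mass.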

See the open paper and code references here: OpenClawBrain project page and the 2016 PDF.

References

  1. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
  2. Gu, J. (2016). Corrected Policy-Gradient Update Rules for Reinforcement Learning. UCLA Econometrics Field Paper. [PDF]
  3. Gu, J. (2026). OpenClawBrain: Learned Graph Traversal for Retrieval Routing. [Paper/implementation page]