Corrected Policy Gradients: From Gu (2016) to OpenClawBrain

A practical derivation from the classic REINFORCE fix to trajectory-aware graph routing.

Jonathan Gu · February 2026

Introduction

In 2016, I proved that the standard REINFORCE update used in Williams (1992) was only a one-step approximation and did not by itself optimize the value of the current state for all future steps. The corrected rule I derived requires the score term to be summed over the remainder of the trajectory, not only multiplied by the reward at a single time step.

In modern language: the original rule was myopic. It rewarded only the action you just took. The corrected rule rewards that action together with everything that still follows it in the same rollout.

The Original Paper

Gu (2016) shows that the conventional REINFORCE rule is:

\[ \Delta \rho \propto \frac{\partial \log P_\rho(a_t\mid s_t)}{\partial \rho} z_t \]

and argues this updates toward a high-probability action for the immediate move, but it does not represent the right objective gradient for state-wise value maximization. The corrected update is

\[ \frac{\partial v_\rho(s_t)}{\partial \rho} = E\!\left[ z_T\sum_{l=t}^{T}\frac{\partial\log P_\rho(a_l\mid s_l)}{\partial \rho} \right] \]

The key difference is the sum over the tail of the trajectory \(l=t,\dots,T\). That sum makes the update aware of how each earlier choice affects all future decisions and outcomes.

So the corrected rule is the one that maximizes the value function at state \(s_t\), not only the immediate one-step contribution.

Policy Gradient From Scratch

Start with a concrete picture: one node, three doors, and a STOP sign

Imagine a node \(i\), and three outgoing actions \(A,B,C\), plus \(\text{STOP}\). OpenClawBrain chooses an action stochastically via a softmax policy:

\[ \pi(a\mid i)= \frac{\exp\left((r_{ia}+w_{ia})/\tau\right)}{\sum_{j\in \text{neighbors}(i)\cup\{\text{STOP}\}} \exp\left((r_{ij}+w_{ij})/\tau\right)} \]

Here \(r_{ia}\) is a fixed relevance term, \(w_{ia}\) is the trainable parameter, and \(\tau\) controls entropy.
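A minimal sketch of this softmax policy in plain Python. The function and argument names are illustrative, not OpenClawBrain's actual API; the math matches the formula above.

```python
import math

def policy(scores, tau=1.0):
    """Softmax over combined per-edge scores at one node.

    `scores` maps each action (including "STOP") to its combined score
    r_ia + w_ia; `tau` is the temperature controlling entropy.
    Names here are illustrative, not OpenClawBrain's actual API.
    """
    exps = {a: math.exp(s / tau) for a, s in scores.items()}
    z = sum(exps.values())  # the normalizer Z_i
    return {a: e / z for a, e in exps.items()}
```

Lowering `tau` sharpens the distribution toward the highest-scoring action; raising it moves the policy toward uniform exploration.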

Now, if the episode outcome is good, we should increase the chance of the chosen action and decrease the chance of the alternatives so that the total probability at node \(i\) stays normalized. That is the conservation intuition behind policy gradients.

Step 1: write \(\log \pi(a\mid i)\)

Let \(S_i\) be the set of actions at node \(i\) and \(Z_i=\sum_{j\in S_i}\exp((r_{ij}+w_{ij})/\tau)\). Then:

\[ \log \pi(a\mid i)=\frac{r_{ia}+w_{ia}}{\tau}-\log Z_i \]

Step 2: gradients w.r.t. learned logits

For the selected action \(a\):

\[ \frac{\partial \log \pi(a\mid i)}{\partial w_{ia}}=\frac{1}{\tau}\left(1-\pi(a\mid i)\right) \]

For any \(j\neq a\):

\[ \frac{\partial \log \pi(a\mid i)}{\partial w_{ij}}=-\frac{1}{\tau}\pi(j\mid i) \]

In vector form:

\[ \nabla_{\mathbf{w}_i}\log \pi(a\mid i)=\frac{1}{\tau}\left(\mathbf e_a-\boldsymbol\pi_i\right) \]

Summing components gives \(\sum_j (e_a(j)-\pi(j\mid i))=0\), so this is always a reallocation of mass.

Numerical example

Take \(S_i=\{A,B,C,\text{STOP}\}\), \(\tau=1\), and combined scores \(2.0,1.0,0.0,0.5\). Then \(\pi(A\mid i)\approx0.579\), \(\pi(B\mid i)\approx0.213\), \(\pi(C\mid i)\approx0.078\), and \(\pi(\text{STOP}\mid i)\approx0.129\).

If we chose \(B\) and got positive outcome:

\[ \Delta w_{iB}\propto 1-\pi(B\mid i)\approx 0.787,\qquad \Delta w_{iA}\propto -\pi(A\mid i)\approx -0.579, \]

\[ \Delta w_{iC}\propto -\pi(C\mid i)\approx -0.078,\qquad \Delta w_{i,\text{STOP}}\propto -\pi(\text{STOP}\mid i)\approx -0.129 \]

Chosen action goes up, all others go down in proportion to their current mass. Sum is approximately zero.
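The worked example can be checked numerically. The sketch below computes the per-node direction \(\frac1\tau(\mathbf e_a-\boldsymbol\pi_i)\) for the scores above; the function name and index-based actions are assumptions for illustration.

```python
import math

def pg_update(scores, chosen, tau=1.0):
    """Per-node policy-gradient direction (1/tau) * (e_chosen - pi).

    `scores` is the list of combined scores r_ij + w_ij for the actions
    at one node, `chosen` the index of the sampled action.  Illustrative
    sketch, not OpenClawBrain's actual API.
    """
    exps = [math.exp(s / tau) for s in scores]
    z = sum(exps)
    pi = [e / z for e in exps]
    # Indicator minus probability: positive for the chosen action,
    # negative (proportional to current mass) for all others.
    return [((1.0 if k == chosen else 0.0) - p) / tau for k, p in enumerate(pi)]

# Worked example: actions [A, B, C, STOP], chosen action B (index 1).
delta = pg_update([2.0, 1.0, 0.0, 0.5], chosen=1)
```

The components sum to zero, confirming the update is a pure reallocation of probability mass at the node.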

Full trajectory update

\[ \nabla J(\mathbf w)=\mathbb E\left[G\sum_{t=0}^{T-1}\nabla\log\pi(a_t\mid i_t)\right] \]

If a baseline \(b(i_t)\) is used:

\[ \Delta\mathbf w\propto\sum_{t=0}^{T-1}(G_t-b(i_t))\nabla_{\mathbf w}\log\pi(a_t\mid i_t) \]
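The baselined trajectory update can be sketched as a loop over a logged rollout. The data layout (a list of `(node, scores, chosen)` tuples and a `baseline` callable) is an assumption for illustration, not the production format.

```python
import math

def softmax(scores, tau=1.0):
    exps = {a: math.exp(s / tau) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def trajectory_update(trajectory, returns, baseline, lr=0.1, tau=1.0):
    """REINFORCE with baseline over one logged rollout.

    `trajectory` is a list of (node, scores, chosen) tuples, `returns[t]`
    is G_t, and `baseline(node)` is b(i_t).  All names are illustrative.
    Returns per-node logit deltas lr * (G_t - b) * (e_a - pi) / tau.
    """
    deltas = {}
    for t, (node, scores, chosen) in enumerate(trajectory):
        adv = returns[t] - baseline(node)  # advantage (G_t - b(i_t))
        pi = softmax(scores, tau)
        d = deltas.setdefault(node, {a: 0.0 for a in scores})
        for a in scores:
            d[a] += lr * adv * ((1.0 if a == chosen else 0.0) - pi[a]) / tau
    return deltas
```

Note that each node's accumulated deltas still sum to zero, because every per-step term is a scaled copy of the zero-sum direction \(\mathbf e_{a_t}-\boldsymbol\pi_{i_t}\).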

Why old heuristic updates are directionally right but incomplete

The heuristic “reward the chosen edge only” can increase the chosen edge’s logit by \(\delta\), but it does not explicitly reduce competing actions.

In softmax-based policies, increasing one edge and leaving others unchanged violates the coupling imposed by the normalizer. True policy gradient does both operations together: increase chosen mass and reallocate what was allocated elsewhere.

Comparison: true PG vs. heuristic

| Action | \(\pi(j\mid i)\) | True PG update \(\Delta w_{ij}\propto e_B(j)-\pi(j\mid i)\) | Heuristic update |
|---|---|---|---|
| A | 0.579 | \(-0.579\) | 0 |
| B | 0.213 | \(+0.787\) | \(+\delta\) |
| C | 0.078 | \(-0.078\) | 0 |
| STOP | 0.129 | \(-0.129\) | 0 |
| Sum | 1.000 | \(\approx 0\) | \(+\delta\) |

From Theory to Production: OpenClawBrain

In OpenClawBrain, we walk a document graph under a stochastic policy \[ \pi_W(a\mid i)=\frac{\exp\left((r_{ia}+w_{ia})/\tau\right)}{\sum_{j\in\text{out}(i)}\exp\left((r_{ij}+w_{ij})/\tau\right)}. \] The implementation logs one full trajectory of visited states and actions. The corrected rollout-level update applies across that full path, which is the same structure as the Gu (2016) correction.

\[ \Delta W \propto z_T \sum_{\ell=0}^{T-1}\nabla_W\log\pi_W(a_\ell\mid s_\ell) \]

That is exactly what apply_outcome_pg implements. For each visited node, it updates the whole local action vector with \(\frac1\tau(\mathbf e_{a_\ell}-\boldsymbol\pi_{s_\ell})\), so updates at each node sum to zero:

\[ \sum_{j\in\text{out}(i)}\Delta w_{ij}=0 \]

The legacy apply_outcome update used only the myopic piece (pushing the chosen edge), which is directionally right but misses the competition term. In production this shows up as unbalanced mass shifts over repeated traversals.
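A minimal sketch of what an update in the spirit of apply_outcome_pg could look like. The data structures (`graph_w[i][j]` logits, `relevance[i][j]` fixed terms, a path of `(node, chosen)` pairs) and the function name are assumptions for illustration, not the real implementation.

```python
import math

def apply_outcome_pg_sketch(graph_w, path, outcome, relevance, tau=1.0, lr=0.05):
    """Rollout-level PG update over a logged path (illustrative sketch).

    `graph_w[i][j]` are trainable edge logits, `relevance[i][j]` the fixed
    r_ij terms, `path` a list of (node, chosen_action) pairs, `outcome`
    the scalar z_T.  Structure assumed for illustration only.
    """
    for node, chosen in path:
        scores = {j: relevance[node][j] + graph_w[node][j] for j in graph_w[node]}
        exps = {j: math.exp(s / tau) for j, s in scores.items()}
        z = sum(exps.values())
        pi = {j: e / z for j, e in exps.items()}
        # Scale the zero-sum direction (e_chosen - pi)/tau by the outcome,
        # so mass moves toward the chosen edge and away from competitors.
        for j in graph_w[node]:
            graph_w[node][j] += lr * outcome * ((1.0 if j == chosen else 0.0) - pi[j]) / tau
```

Unlike the legacy single-edge push, each node's deltas cancel exactly, so repeated traversals reallocate probability rather than inflating total logit mass.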

See the open paper and code references here: OpenClawBrain project page and the 2016 PDF.

References

  1. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
  2. Gu, J. (2016). Corrected Policy-Gradient Update Rules for Reinforcement Learning. UCLA Econometrics Field Paper. [PDF]
  3. Gu, J. (2026). OpenClawBrain: Learned Graph Traversal for Retrieval Routing. [Paper/implementation page]