Corrected Policy Gradients: From Gu (2016) to OpenClawBrain
A practical derivation from the classic REINFORCE fix to trajectory-aware graph routing.
Introduction
In 2016, I proved that the standard REINFORCE update used in Williams (1992) was only a one-step approximation and did not by itself optimize the value of the current state for all future steps. The corrected rule I derived requires the score term to be summed over the remainder of the trajectory, not only multiplied by the reward at a single time step.
In modern language: the original rule was myopic. It rewarded what you just did. The corrected rule correctly rewards what you did and everything that can still happen after that in that same rollout.
The Original Paper
Gu (2016) shows that the conventional REINFORCE rule is
\[ \Delta w \;\propto\; r\,\nabla_w \log \pi(a_t \mid s_t), \]
and argues this updates toward a high-probability action for the immediate move, but it does not represent the right objective gradient for state-wise value maximization. The corrected update is
\[ \Delta w \;\propto\; r \sum_{l=t}^{T} \nabla_w \log \pi(a_l \mid s_l). \]
The key difference is the sum over the tail of the trajectory \(l=t,\dots,T\). That sum makes the update aware of how each earlier choice affects all future decisions and outcomes.
So the corrected rule is the one that maximizes the value function at state \(s_t\), not only the immediate one-step contribution.
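The difference is easy to see numerically. Below is a toy sketch (a tabular softmax policy over three actions and a hypothetical three-step episode; none of these names come from the paper) comparing the myopic weighting against the corrected tail-return weighting:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def score(a, pi):
    """Score vector d/dw log pi(a): e_a - pi for a tabular softmax."""
    g = -pi.copy()
    g[a] += 1.0
    return g

# One toy rollout over a single 3-action state: the reward arrives
# only at the final step, after actions 0, 0, 1.
actions = [0, 0, 1]
rewards = [0.0, 0.0, 1.0]
pi = softmax(np.zeros(3))          # uniform policy, for illustration

# Myopic rule: each score weighted only by that step's own reward,
# so the two early actions receive no credit at all.
myopic = sum(r * score(a, pi) for a, r in zip(actions, rewards))

# Corrected rule: each score weighted by the tail return
# sum_{l=t}^{T} r_l, so early actions share credit for later reward.
corrected = sum(sum(rewards[t:]) * score(a, pi)
                for t, a in enumerate(actions))
```

Under the myopic rule only the final action moves; under the corrected rule the earlier choices of action 0 are credited as well.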
Policy Gradient From Scratch
Start with a concrete picture: one node, three doors, and a STOP sign
Imagine a node \(i\) with three outgoing actions \(A,B,C\), plus \(\text{STOP}\). OpenClawBrain chooses an action stochastically via a softmax policy:
\[ \pi(a\mid i)=\frac{\exp\left((r_{ia}+w_{ia})/\tau\right)}{\sum_{j}\exp\left((r_{ij}+w_{ij})/\tau\right)}. \]
Here \(r_{ia}\) is a fixed relevance term, \(w_{ia}\) is the trainable parameter, and \(\tau\) controls entropy.
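As a concrete sketch of that policy (the function name `routing_policy` is introduced here for illustration; the formula is the one above):

```python
import numpy as np

def routing_policy(r, w, tau=1.0):
    """Softmax routing at one node: pi(a|i) ∝ exp((r_ia + w_ia)/tau)."""
    z = (np.asarray(r, float) + np.asarray(w, float)) / tau
    z -= z.max()                    # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

r = [2.0, 1.0, 0.0, 0.5]            # fixed relevance for A, B, C, STOP
w = np.zeros(4)                     # trainable weights, untrained
pi_sharp = routing_policy(r, w, tau=0.5)   # low tau: low-entropy routing
pi_flat = routing_policy(r, w, tau=5.0)    # high tau: near-uniform routing
```

Lowering \(\tau\) concentrates mass on the top-scoring edge; raising it flattens the distribution, which is the entropy-control role mentioned above.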
Now, if the episode outcome is good, we should increase the chance of the chosen action and decrease the chance of the alternatives so that the total probability at node \(i\) stays normalized. That is the conservation intuition behind policy gradients.
Step 1: write \(\log \pi(a\mid i)\)
Let \(S_i\) be the set of actions at node \(i\) and \(Z_i=\sum_{j\in S_i}\exp((r_{ij}+w_{ij})/\tau)\). Then:
\[ \log\pi(a\mid i)=\frac{r_{ia}+w_{ia}}{\tau}-\log Z_i. \]
Step 2: gradients w.r.t. learned logits
For the selected action \(a\):
\[ \frac{\partial \log\pi(a\mid i)}{\partial w_{ia}}=\frac{1}{\tau}\bigl(1-\pi(a\mid i)\bigr). \]
For any \(j\neq a\):
\[ \frac{\partial \log\pi(a\mid i)}{\partial w_{ij}}=-\frac{1}{\tau}\,\pi(j\mid i). \]
In vector form:
\[ \nabla_{\mathbf w_i}\log\pi(a\mid i)=\frac{1}{\tau}\bigl(\mathbf e_a-\boldsymbol\pi_i\bigr). \]
Summing components gives \(\sum_j (e_a(j)-\pi(j\mid i))=0\), so this is always a reallocation of mass.
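A quick numerical check of the vector form (a sketch; `log_pi_grad` is a name introduced here, not part of any library):

```python
import numpy as np

def log_pi_grad(a, pi, tau=1.0):
    """Gradient of log pi(a|i) w.r.t. the learned logits w_i:
    (1/tau) * (e_a - pi)."""
    g = -pi.copy()
    g[a] += 1.0
    return g / tau

pi = np.array([0.4, 0.3, 0.2, 0.1])   # any normalized local policy
g = log_pi_grad(1, pi, tau=2.0)       # suppose action index 1 was chosen

# g is positive on the chosen action, negative (proportional to current
# mass) on every alternative, and its components sum to zero.
```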
Numerical example
Take \(S_i=\{A,B,C,\text{STOP}\}\), \(\tau=1\), and combined scores \(2.0,1.0,0.0,0.5\). Then
- \(\pi(A\mid i)\approx0.579\)
- \(\pi(B\mid i)\approx0.213\)
- \(\pi(C\mid i)\approx0.078\)
- \(\pi(\text{STOP}\mid i)\approx0.129\)
If we choose \(B\) and the episode outcome is positive, the update direction is
\[ \Delta\mathbf w_i \;\propto\; \mathbf e_B-\boldsymbol\pi_i \approx (-0.579,\ +0.787,\ -0.078,\ -0.129). \]
The chosen action goes up; all others go down in proportion to their current mass, and the components sum to zero (up to rounding).
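The numbers in this example can be reproduced in a few lines of NumPy:

```python
import numpy as np

# Combined scores (r + w)/tau at node i for actions A, B, C, STOP.
scores = np.array([2.0, 1.0, 0.0, 0.5])
pi = np.exp(scores) / np.exp(scores).sum()
# pi ≈ [0.579, 0.213, 0.078, 0.129]

chosen = 1                       # B was sampled
delta = np.eye(4)[chosen] - pi   # true PG direction for a good outcome
# delta ≈ [-0.579, +0.787, -0.078, -0.129], summing to zero
```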
Full trajectory update
With terminal reward \(R\) and learning rate \(\eta\), the rollout-level update sums the per-step score vectors over the whole visited path \((i_0,a_0),\dots,(i_T,a_T)\), so each visited node \(i_t\) receives
\[ \Delta\mathbf w_{i_t} \;=\; \frac{\eta R}{\tau}\bigl(\mathbf e_{a_t}-\boldsymbol\pi_{i_t}\bigr). \]
If a baseline \(b(i_t)\) is used:
\[ \Delta\mathbf w_{i_t} \;=\; \frac{\eta}{\tau}\bigl(R-b(i_t)\bigr)\bigl(\mathbf e_{a_t}-\boldsymbol\pi_{i_t}\bigr). \]
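Sketched in code (the name `trajectory_update` and the trajectory record layout are assumptions for illustration, not the production interface):

```python
import numpy as np

def trajectory_update(trajectory, R, baselines, tau=1.0, lr=0.1):
    """Rollout-level policy-gradient update with a baseline.

    trajectory: list of (node, chosen_action_index, pi_vector) records.
    Returns a dict of per-node weight deltas; each delta sums to zero.
    """
    updates = {}
    for t, (node, action, pi) in enumerate(trajectory):
        g = -np.asarray(pi).copy()
        g[action] += 1.0                      # e_a - pi
        adv = R - baselines[t]                # advantage at step t
        updates[node] = updates.get(node, 0.0) + lr * adv * g / tau
    return updates

pi_u = np.full(3, 1.0 / 3.0)                  # uniform local policies
traj = [("n0", 0, pi_u), ("n1", 2, pi_u)]
ups = trajectory_update(traj, R=1.0, baselines=[0.2, 0.2])
```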
Why old heuristic updates are directionally right but incomplete
The heuristic “reward the chosen edge only” can increase the chosen edge’s logit by \(\delta\), but it does not explicitly reduce competing actions.
In softmax-based policies, increasing one edge and leaving others unchanged violates the coupling imposed by the normalizer. True policy gradient does both operations together: increase chosen mass and reallocate what was allocated elsewhere.
Comparison: true PG vs. heuristic
| Action | \(\pi(j\mid i)\) | True PG update \(\Delta w_{ij}\propto e_B(j)-\pi(j\mid i)\) | Heuristic update |
|---|---|---|---|
| A | 0.579 | -0.579 | 0 |
| B | 0.213 | +0.787 | +\(\delta\) |
| C | 0.078 | -0.078 | 0 |
| STOP | 0.129 | -0.129 | 0 |
| Sum | 1.000 | ~0 | +\(\delta\) |
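A quick simulation makes the drift concrete: repeatedly applying the one-sided heuristic inflates total logit mass at the node, while the true PG update stays zero-sum (a toy sketch, not the production code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
w_pg = np.zeros(4)       # logits trained with the true PG update
w_heur = np.zeros(4)     # logits trained with the one-sided heuristic
lr = 0.1

# Repeatedly reward whichever action gets sampled, a crude stand-in
# for many positive rollouts through the same node.
for _ in range(200):
    a = rng.choice(4, p=softmax(w_pg))
    w_pg += lr * (np.eye(4)[a] - softmax(w_pg))   # zero-sum reallocation
    w_heur[a] += lr                               # one-sided inflation
```

After 200 positive outcomes, `w_pg` still sums to zero while `w_heur` has accumulated 200 increments of unmatched logit mass.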
From Theory to Production: OpenClawBrain
In OpenClawBrain, we walk a document graph under a stochastic policy \[ \pi_W(a\mid i)=\frac{\exp\left((r_{ia}+w_{ia})/\tau\right)}{\sum_{j\in\text{out}(i)}\exp\left((r_{ij}+w_{ij})/\tau\right)}. \] The implementation logs one full trajectory of visited states and actions. The corrected rollout-level update applies across that full path, which is the same structure as the Gu (2016) correction.
That is exactly what apply_outcome_pg implements: for each visited node, it updates the whole local action vector with \(\frac1\tau(\mathbf e_{a_\ell}-\boldsymbol\pi_{s_\ell})\), so the updates at each node sum to zero.
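A minimal sketch of such an update, assuming `graph` maps node ids to trainable weight vectors and each trajectory record is a \((s_\ell, a_\ell, \boldsymbol\pi_{s_\ell})\) tuple (the actual OpenClawBrain data structures and signature are not shown in this post):

```python
import numpy as np

def apply_outcome_pg(graph, trajectory, reward, tau=1.0, lr=0.05):
    """Apply the rollout-level policy-gradient update to every node
    visited on the trajectory.

    graph: dict mapping node id -> trainable weight vector w_i.
    trajectory: list of (node, chosen_action_index, pi_vector) records.
    """
    for node, action, pi in trajectory:
        g = -np.asarray(pi, float).copy()
        g[action] += 1.0                  # e_a - pi: zero-sum per node
        graph[node] += lr * reward * g / tau

graph = {"n0": np.zeros(3)}
trajectory = [("n0", 1, np.array([0.5, 0.3, 0.2]))]
apply_outcome_pg(graph, trajectory, reward=1.0)
```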
The legacy apply_outcome update used only the myopic piece (pushing the chosen edge), which is directionally right but misses the competition term. In production this shows up as unbalanced mass shifts over repeated traversals.
- OpenClawBrain sums \(\nabla\log\pi\) across the full traversal, not just the last action.
- The softmax policy stays normalized and interpretable as routing probability at each node.
- Updates are signed redistributions at each node, not one-sided inflation.
See the open paper and code references here: OpenClawBrain project page and the 2016 PDF.
References
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
- Gu, J. (2016). Corrected Policy-Gradient Update Rules for Reinforcement Learning. UCLA Econometrics Field Paper. [PDF]
- Gu, J. (2026). OpenClawBrain: Learned Graph Traversal for Retrieval Routing. [Paper/implementation page]