We Shipped the Correct REINFORCE Update — and the Graph Started Behaving
OpenClawBrain v12.2.1: policy gradients, self-learning agents, and biological self-regulation
In 2016, during my PhD at UCLA, I derived a correction to the standard REINFORCE update rule. The key insight was simple: the classic Williams (1992) gradient only optimizes for the immediate action, not the value of the state across the full remaining trajectory. I wrote the proof, filed the field paper, and moved on to industry.
Ten years later, I built OpenClawBrain — a memory graph for AI agents where retrieval is a learned policy over document chunks. The system worked. Agents queried, traversed edges, got feedback, and the graph adapted. But something nagged: the learning rule was a heuristic. It was directionally right — reward the chosen edge, discount by depth — but it wasn't the actual policy gradient I'd derived a decade earlier.
So on a Thursday afternoon, I finally implemented the correct update. Within hours, the simulation results told the story: 13-19x better branch separation, 244x faster recovery after concept drift, and 30% less weight inflation. This post is about that correction — the math, the evidence, and the three things it unlocked: agents that teach themselves, a graph that self-regulates, and a system that can finally be trusted to run unsupervised.
TL;DR
- Switched to REINFORCE policy gradient: 13-19x better branch separation, 244x faster drift recovery
- New self_learn API: agents learn from their own outcomes (no human in the loop)
- Self-regulation: homeostatic decay + synaptic scaling + tier hysteresis
- Runtime node splitting: the inverse of merge
- 272 tests, 3 production brains: crash-safe persistence
The Bug in the Old Learning Rule
Imagine you're training a retrieval system. The agent queries your graph, follows edges A→B→C, retrieves some context, and the user says "good job." The natural thing is to reinforce that path: make A→B stronger, make B→C stronger. That's what the heuristic did.
The problem is what it didn't do. At node A, there were also edges to D, E, and F. The agent chose B — implicitly rejecting D, E, and F — but those edges were never updated. In RL terms, the heuristic only performed credit assignment on the on-policy trajectory, ignoring the counterfactual alternatives at each decision point.
This is like a teacher who praises a student for choosing answer B but never marks the other answers wrong. Over hundreds of interactions, the correct answer gets stronger, but the wrong answers never get weaker. All edges inflate. The total weight in the graph grows monotonically. And eventually, the signal-to-noise ratio degrades because incorrect paths maintain their original weight indefinitely.
Here's the concrete comparison at a single node with four outgoing actions:
| Action | π(j|i) | True PG update | Heuristic update |
|---|---|---|---|
| A | 0.579 | -0.579 | 0 |
| B (chosen) | 0.213 | +0.787 | +δ |
| C | 0.078 | -0.078 | 0 |
| STOP | 0.129 | -0.129 | 0 |
| Sum | 1.000 | ~0 | +δ |
The heuristic leaks +δ of probability mass into the system every update. The true policy gradient redistributes mass: the chosen action goes up, all alternatives go down, and the sum is zero. Over thousands of updates, this difference compounds.
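The table's numbers can be reproduced in a few lines of Python. The edge weights below are hypothetical values chosen to yield the probabilities shown; everything else follows from the softmax definition:

```python
import math

def softmax(weights, tau=1.0):
    """Per-node softmax policy over outgoing edge weights."""
    exps = [math.exp(w / tau) for w in weights]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical edge weights that reproduce the table's probabilities.
weights = [1.5, 0.5, -0.5, 0.0]   # A, B (chosen), C, STOP
chosen = 1

pi = softmax(weights)             # [0.579, 0.213, 0.078, 0.129]

# True policy-gradient update (reward = +1, tau = 1): the chosen edge
# gets 1 - pi, every alternative gets -pi. Sums to zero by construction.
pg_update = [-p for p in pi]
pg_update[chosen] += 1.0

# Heuristic update: only the chosen edge moves.
delta = 0.1
heuristic_update = [0.0, 0.0, 0.0, 0.0]
heuristic_update[chosen] = delta

print(sum(pg_update))         # ~0.0: probability mass is redistributed
print(sum(heuristic_update))  # +0.1: mass leaks in every single update
```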
The Fix: REINFORCE on Per-Node Softmax
The fix is to recognize that at each node in the graph, the agent is making a choice among outgoing edges. That choice is a policy — specifically, a softmax policy parameterized by edge weights:
\[
\pi(j \mid i) = \frac{\exp(w_{ij}/\tau)}{\sum_{k} \exp(w_{ik}/\tau)}
\]

with \(w_{ij}\) the weight of edge \(i \to j\) and \(\tau\) the softmax temperature.
The gradient of the log-policy with respect to the chosen edge weight is \(\frac{1}{\tau}(1 - \pi(a|i))\) — the complement of the action's current probability. For all unchosen edges, the gradient is \(-\frac{1}{\tau}\pi(j|i)\) — negative, proportional to each alternative's probability.
The key property: gradients at each node sum to exactly zero. This isn't an accident — it's a consequence of the softmax normalization. Every update is a redistribution, never an inflation. When the agent gets positive feedback, the chosen path absorbs probability from alternatives. When it gets negative feedback, the chosen path donates probability back. The total budget at each node is always conserved.
This is the update rule I derived in 2016 for general reinforcement learning. Applying it to graph traversal required recognizing that each node is a "state" and each outgoing edge is an "action." The rest follows directly.
In practice, this is why incorrect branches shrink automatically under the new rule instead of merely keeping their weight, as they did under the heuristic.
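The full per-node update can be sketched in a few lines. The function name and signature here are illustrative, not the actual OpenClawBrain API; the gradient expressions are exactly the ones above, scaled by the reward and a learning rate:

```python
import math

def reinforce_update(weights, chosen, reward, lr=0.1, tau=1.0):
    """One REINFORCE step on a node's outgoing edge weights (sketch).

    d/dw log pi(chosen) = (1 - pi(chosen)) / tau   for the chosen edge,
                          -pi(j) / tau             for every alternative.
    """
    exps = [math.exp(w / tau) for w in weights]
    z = sum(exps)
    pi = [e / z for e in exps]
    grads = [-p / tau for p in pi]
    grads[chosen] += 1.0 / tau
    # Positive reward pulls probability toward the chosen edge; negative
    # reward pushes it away. The gradients sum to zero either way, so the
    # node's total weight is conserved by every update.
    return [w + lr * reward * g for w, g in zip(weights, grads)]

w = reinforce_update([1.5, 0.5, -0.5, 0.0], chosen=1, reward=1.0)
```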
Full derivation and implementation notes: /openclawbrain/gu2016/.
Simulation: What Changes When You Fix the Math
Theory is nice. Evidence is better. We ran a controlled simulation: 50 nodes organized into three topic clusters sharing a central hub node. Three query types, each with a designated correct path through the graph. Feedback is realistic — 65% positive (the agent retrieved good context), 20% negative (bad retrieval), and 15% partial (first few hops were correct, then it went wrong). At step 200, the ground truth shifts: correct paths rotate to different clusters. Both learning rules start from identical graph states. Ten random seeds, averaged results.
Chart 1: Branch Separation Over Time
Chart 2: Gate Weight Evolution Under Policy Gradient
This chart shows the three competing gate weights at the hub node as PG learning adapts to mixed feedback and concept drift. Before drift (step 200), each query type's correct gate rises toward 1.0 while alternatives are actively suppressed. After drift, correct paths rotate — and PG adapts almost instantly.
Chart 3: Total Graph Weight Over Time
The aggregated numbers across 10 seeds tell the story clearly:
| Metric | Heuristic | PG | Winner |
|---|---|---|---|
| Branch separation | 0.0004 | 0.0058 | PG (15x) |
| Recovery after drift | 24.4 steps | 0.1 steps | PG (244x) |
| Wrong gate suppression | 0.994 | 0.984 | PG |
| Weight inflation | +54.1 | +38.2 | PG (30% less) |
Self-Learning: Agents Teach Themselves
With a principled learning rule in place, we could build something we'd been wanting for a long time: agents that learn from their own observations, not just human feedback.
Here's what happened with Pelican, our options trading agent. During a routine model retraining run, Pelican launched a GPU instance on AWS, trained a GBM model (15 ensemble members, ~40 minutes), and then — eager to stop the cost leak — terminated the instance. The problem: it hadn't downloaded the trained model first. The model was on an ephemeral volume. Gone.
Pelican didn't need a human to tell it this was bad. It tried to access the model, got a failure, and could observe the consequence directly. The question was: could the brain learn from this self-observation the same way it learns from human corrections?
That's what self_learn enables. One API call that covers the full spectrum of agent-initiated learning:
```python
# Agent detected its own mistake
client.self_learn(
    content='Always download artifacts before terminating instances',
    fired_ids=['infra.md::3', 'cleanup.md::1'],
    outcome=-1.0,
    node_type='CORRECTION',
)

# Agent observed success — reinforce
client.self_learn(
    content='Download-then-terminate sequence works reliably',
    fired_ids=['infra.md::3', 'download.md::1'],
    outcome=1.0,
    node_type='TEACHING',
)
```
| Situation | outcome | type | Effect |
|---|---|---|---|
| Mistake detected | -1.0 | CORRECTION | Penalize path + inhibitory edges |
| Fact learned | 0.0 | TEACHING | Inject knowledge only |
| Success observed | +1.0 | TEACHING | Reinforce path + inject knowledge |
Production: Real Stories
GUCLAW — General-purpose operator (1,160 nodes)
"We switched from heuristic to policy-gradient learning, retuned decay, added self-regulation, built the self_learn API, ran an audit, and fixed three bugs — all in a single afternoon session. The brain was running the new learning rule within 90 minutes of the decision to switch. Queries that used to pull 52KB of context now retrieve 3-13KB of focused, relevant chunks."
Pelican — Options trading ML agent (584 nodes)
"Pelican terminated a GPU training instance before downloading the model — a $40 mistake in retrain cost. Using self_learn, the lesson was injected directly into the brain: 'Always download artifacts before terminating instances.' The teaching node now fires whenever Pelican queries anything about instance management, preventing the same mistake without any human intervention."
Bountiful — Marketplace agent (286 nodes)
"The smallest brain, but it benefits the most from self-regulation. With homeostatic decay adjusting from the default half-life of 140 to 183 cycles, Bountiful's edges have more time to accumulate before being pruned — important for a smaller brain where each edge matters more. The 27% dormant ratio is healthy: edges that aren't reinforced are correctly fading."
Self-Regulation: Three Biological Lessons
Switching to policy gradients solved the learning rule. But it introduced a new problem: the learning rate, decay rate, temperature, and tier thresholds are all coupled. Changing the learning rule forced us to retune the decay half-life. That's fragile — every parameter change cascades.
We looked at how biological neural systems handle this. The brain doesn't have a sysadmin tuning synaptic decay rates. Instead, it uses homeostatic mechanisms — feedback loops that maintain stable behavior despite changing conditions. We implemented three of them.
Tier Hysteresis
In the graph, edges are classified into tiers: reflex (≥0.6, auto-followed), habitual (0.2-0.6, followed by weight), and dormant (<0.2, skipped). The old thresholds were hard cliffs — an edge at 0.199 was dormant (invisible to traversal), at 0.201 it was habitual (actively used). Tiny fluctuations from stochastic updates caused edges to thrash between states.
The fix: widen the habitual band to 0.15–0.6. This creates a buffer zone where edges transitioning downward get a softer landing before becoming invisible. It's the same principle as hysteresis in electrical engineering or neural threshold adaptation in neuroscience — you need different thresholds for activation and deactivation to prevent oscillation.
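The transition logic can be sketched as a small state machine. This assumes edges track their current tier and re-enter the habitual band at the original 0.2 threshold; that entry threshold, and the function name, are assumptions for illustration:

```python
def next_tier(weight, current_tier):
    """Tier transition with hysteresis (sketch).

    An edge must rise above 0.2 to leave dormancy, but only falls back
    to dormant below 0.15. The 0.15-0.2 buffer zone prevents thrashing
    from small stochastic weight fluctuations near the boundary.
    """
    if weight >= 0.6:
        return "reflex"
    if current_tier == "dormant":
        # Activation threshold: harder to enter than to stay.
        return "habitual" if weight >= 0.2 else "dormant"
    # Deactivation threshold: a softer floor on the way down.
    return "habitual" if weight >= 0.15 else "dormant"

next_tier(0.17, "dormant")   # stays dormant: never crossed 0.2
next_tier(0.17, "habitual")  # stays habitual: buffered above 0.15
```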
Chart 4: Equilibrium Weight by Query Frequency
Once policy gradients + decay converge, edge weight encodes usage frequency. The equilibrium chart uses query frequency from synthetic maintenance traces and overlays tier zones.
Homeostatic Decay
Real neurons maintain a target firing rate. If a neuron fires too rarely, its synapses strengthen globally. If it fires too much, they weaken. The neuron doesn't care about individual synapse weights — it cares about its overall behavior.
We applied the same principle. The graph monitors its reflex edge ratio — the fraction of edges in the highest-confidence tier. The target band is 5-15%. If too many edges become reflex (monoculture — the graph is overconfident), decay speeds up to erode weak edges faster. If too few are reflex (the graph hasn't learned enough), decay slows down to let edges accumulate. The half-life adjusts by 3% per maintenance cycle, bounded between 60 and 300 cycles. Slow, stable, bounded — like a thermostat.
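The controller described above fits in a few lines. `adjust_half_life` is an illustrative name, not the production implementation; the band, step size, and bounds are the values from the text:

```python
def adjust_half_life(half_life, reflex_ratio,
                     band=(0.05, 0.15), step=0.03, bounds=(60.0, 300.0)):
    """Thermostat-style controller for the decay half-life (sketch).

    Too many reflex edges: shorten the half-life (decay faster).
    Too few reflex edges: lengthen it (decay slower).
    Inside the band: hold. Moves at most 3% per maintenance cycle,
    clamped to [60, 300] cycles.
    """
    lo, hi = band
    if reflex_ratio > hi:
        half_life *= (1 - step)   # overconfident graph: erode faster
    elif reflex_ratio < lo:
        half_life *= (1 + step)   # undertrained graph: let edges accumulate
    return max(bounds[0], min(bounds[1], half_life))

# A brain at 1.2% reflex (below the 5% floor) drifts toward longer half-lives:
hl = 140.0
for _ in range(10):
    hl = adjust_half_life(hl, reflex_ratio=0.012)
```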
Synaptic Scaling
In neuroscience, synaptic scaling prevents individual neurons from dominating a network. If one neuron's total outgoing synaptic weight grows too large, all its synapses are scaled down proportionally — preserving relative strengths but capping absolute influence.
We do the same for graph nodes. Each node has a soft outgoing weight budget of 5.0. If the sum of positive outgoing edges exceeds this, a gentle fourth-root scaling is applied: w = w × (budget / L1)^0.25. Hub nodes that accumulate connections through Hebbian co-firing get their influence capped without losing their relative edge rankings. It's gentle — not a hard clamp — so the graph can still have strong hubs, just not runaway ones.
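A minimal sketch of the scaling step, using the budget and exponent from the text (the function name is illustrative):

```python
def scale_outgoing(weights, budget=5.0):
    """Soft synaptic scaling on a node's positive outgoing weights (sketch).

    If the positive L1 mass exceeds the budget, scale every positive edge
    by (budget / L1) ** 0.25 -- a gentle fourth-root pull toward the
    budget, not a hard clamp, so relative rankings are preserved and
    repeated maintenance cycles converge gradually.
    """
    l1 = sum(w for w in weights if w > 0)
    if l1 <= budget:
        return list(weights)          # under budget: leave untouched
    factor = (budget / l1) ** 0.25
    return [w * factor if w > 0 else w for w in weights]

w = scale_outgoing([4.0, 3.0, 2.0, -0.5])  # positive L1 = 9.0 > 5.0
```

Note that one application does not force the node all the way down to the budget; it nudges it, which is what keeps strong hubs alive while preventing runaway growth.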
Runtime Node Splitting
The final piece: nodes can now split at runtime. Previously, the graph could only consolidate — merging co-firing nodes into larger chunks. But merged nodes sometimes become multi-topic blobs that fire on unrelated queries. The only fix was a full rebuild.
Now the maintenance pipeline detects bloated nodes (via content length, hub degree, and edge-weight variance) and splits them into focused children. Edges are rewired based on embedding similarity — each child inherits the edges most relevant to its content. Inhibitory edges are broadcast to all children for safety. Sibling edges between children maintain cluster coherence.
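A simplified sketch of the rewiring step under these rules (nearest-child assignment for excitatory edges, broadcast for inhibitory ones; sibling-edge creation is omitted, and all names and data shapes are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rewire_after_split(edges, child_embeddings):
    """Assign a split node's edges to its children (sketch).

    edges: list of (neighbor_id, neighbor_embedding, weight, inhibitory).
    Excitatory edges go to the single most similar child; inhibitory
    edges are broadcast to every child so safety signals are never lost.
    Returns {child_index: [(neighbor_id, weight), ...]}.
    """
    out = {i: [] for i in range(len(child_embeddings))}
    for nid, emb, w, inhibitory in edges:
        if inhibitory:
            for i in out:
                out[i].append((nid, w))
        else:
            best = max(out, key=lambda i: cosine(emb, child_embeddings[i]))
            out[best].append((nid, w))
    return out

children = [[1.0, 0.0], [0.0, 1.0]]
edges = [
    ("a", [0.9, 0.1], 0.5, False),   # similar to child 0
    ("b", [0.0, 1.0], 0.4, False),   # similar to child 1
    ("c", [0.5, 0.5], -0.2, True),   # inhibitory: broadcast to both
]
rewired = rewire_after_split(edges, children)
```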
With both split and merge, the graph has a complete lifecycle: it can grow, consolidate, divide, and prune. Like cell division and fusion in biology, the system breathes — finding its natural granularity over time.
What We Shipped
All of this landed in one afternoon: 9 commits, +1,833 lines across 22 files, 272 tests passing. Three production brains — serving a general-purpose agent (1,160 nodes), an options trading agent (584 nodes), and a marketplace agent (286 nodes) — are now running the policy-gradient learning rule with self-regulation and autonomous self-learning.
The code is open source. The math is documented. The simulations are reproducible.
Our Production Brains
Three brains have been running on these updates since deployment: GUCLAW (1,160 nodes), Pelican (584 nodes), and Bountiful (286 nodes).
Links
- GitHub: github.com/jonathangu/openclawbrain
- PyPI: pip install openclawbrain==12.2.1
- Paper: jonathangu.com/openclawbrain/
- Derivation: jonathangu.com/openclawbrain/gu2016/
What's Next
The self_learn API opens a design space we've only begun to explore. Right now, agents decide when to self-correct based on explicit failure detection ("I can't access the model I need"). The next step is proactive outcome evaluation: after every action, automatically assess whether the retrieved context contributed to a good response. That requires cheap relevance scoring — probably a small model that rates each fired node against the final output.
On the self-regulation side, we're watching the homeostatic decay controller settle across our three production brains. All three currently sit below the 5% reflex floor (0.9-1.6% reflex), so the controller is gradually lengthening the half-life. In a few weeks we'll have real production data on whether the 5-15% target band is right or needs adjustment.
And there's a deeper question: should the policy gradient temperature τ also be adaptive? Lower temperature means sharper, more confident updates — faster learning but more lock-in. Higher temperature means more exploration but slower convergence. An entropy-based controller that maintains a target action-entropy per node would complete the self-regulation story. We're not there yet, but the infrastructure is ready.