
OpenClawBrain: Learned Graph Traversal for Retrieval Routing

Retrieval accuracy and context efficiency in one routing layer.

Graph-structured retrieval with trajectory-level learning.

Jonathan Gu · jonathangu@gmail.com

A brain-shaped neural network of document nodes and weighted edges — green reflex paths, blue habitual, gray dormant

Install & Run

# Build a brain
python3 -m openclawbrain init --workspace examples/sample_workspace --output /tmp/brain

# Check state health
python3 -m openclawbrain doctor --state /tmp/brain/state.json
# ... summary includes 8/9 checks

# Query (JSON output includes node IDs)
python3 -m openclawbrain query "how do I deploy" --state /tmp/brain/state.json --top 3 --json

# Teach positive outcome
python3 -m openclawbrain learn --state /tmp/brain/state.json --outcome 1.0 --fired-ids "deploy.md::0,deploy.md::1"

# Inject correction
python3 -m openclawbrain inject --state /tmp/brain/state.json \
  --id "fix::1" --content "Never skip CI for hotfixes" --type CORRECTION

Core graph and traversal are in pure Python. The package has no required third-party runtime dependencies; callers provide embeddings and LLM callbacks via optional interfaces. See README and REPRODUCE.md.
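Because embeddings arrive through caller-provided callbacks, wiring in a provider can be a single function. The sketch below uses a deterministic hash-based stand-in; `toy_embed` and the `embed_fn=` wiring in the trailing comment are illustrative names, not the package's actual API.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Deterministic stand-in for a real embedding callback (e.g. an OpenAI client)."""
    vec = []
    for i in range(dim):
        digest = hashlib.sha256(f"{i}:{text}".encode()).digest()
        # Map the first 4 bytes of each per-dimension hash into [-0.5, 0.5)
        vec.append(int.from_bytes(digest[:4], "big") / 2**32 - 0.5)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# A caller would hand a function with this shape to the brain in place of a
# real embedding client (hypothetical wiring, not the package API):
# brain = Brain.load(state_path, embed_fn=toy_embed)
```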

Abstract

Standard retrieval, including lexical or embedding similarity matching, selects chunks independently per query and does not learn persistent routing over multi-step trajectories from outcome feedback. OpenClawBrain provides value on two axes: retrieval accuracy and context efficiency. It augments query seeding with a learned graph router over document chunks where nodes are chunks and edges are mutable signed pointers. For accuracy, graph structure enables better retrieval than flat embedding-only retrieval on repeated procedural paths. For efficiency, repeated outcomes compress context from 30 nodes to 2.7 in deployment simulation (91% reduction) and from 52--66KB to 3--13KB per query in production via trajectory learning.

This page is the companion guide to the formal paper (openclawbrain.tex / openclawbrain.pdf) in this directory. The paper contains the reference definitions, equations, and benchmarks; this page presents the same claims with implementation context.

1. The Problem

Memory retrieval should be about behavior selection, not nearest-neighbor matching. Every query should ask: which sequence of dependencies, safeguards, and operations is useful now, given previous outcomes? A static similarity index cannot determine this.

OpenClawBrain treats memory as a decision surface: a directed, weighted graph where edges represent route-level transitions. We evaluate behavior through simulations backed by code in this repository, and where broader benchmark data are absent from this suite we say so rather than make unsupported claims.

The claims below are constrained to the 15 deterministic simulations and 236 tests, with results from the reproducible benchmark harness in this repo. Axis framing: Section 4 uses Axis 2 (context efficiency) for repeated-task experiments and Axis 1 (accuracy) for behavior correctness.

2. The Graph

OpenClawBrain is a directed graph of chunk nodes with typed edges carrying bounded scalar weight. Implementations may store metadata (for example, edge kind and summaries) alongside the graph state.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class Node:
    id: str
    content: str
    summary: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)

@dataclass
class Edge:
    source: str
    target: str
    weight: float
    kind: str
    summary: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)

A query produces seed nodes \(\mathcal S(q)\) and an initial frontier. Fired nodes are the nodes visited during traversal; only fired nodes are eligible for reading or execution.

Vertical flowchart: Query to embedding search, cheap router, graph traversal, and expensive context assembly
Figure 1. Inference architecture: seed retrieval, cheap routing, graph traversal, and expensive context assembly.

Tiering uses explicit precedence on each step: reflex, habitual, inhibitory, then dormant. Inhibition is a hard veto over support, not a soft penalty.

| Tier | Weight Range | Behavior | Expected Cost |
| --- | --- | --- | --- |
| Reflex | >= 0.6 | Auto-follow. | Near zero deliberation. |
| Habitual | 0.2 – 0.6 | Cheap routing decides whether to follow. | Low. |
| Inhibitory | <= -0.01 | Skips / suppresses nodes. | Near zero. |
| Dormant | < 0.2 | Skipped unless directly re-seeded. | Zero. |

The graph is deliberately sparse in active retrieval because routing should remain selective. Dormant edges are not deleted; they can recover if feedback returns.
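The precedence in the tier table can be captured in a few lines. The thresholds below come directly from the table; the function name `tier` is ours, not the package's.

```python
def tier(weight: float) -> str:
    """Classify an edge weight into the tier table's regimes.

    Precedence mirrors the text: inhibition is a hard veto checked first,
    then reflex, then habitual; everything else is dormant."""
    if weight <= -0.01:
        return "inhibitory"
    if weight >= 0.6:
        return "reflex"
    if weight >= 0.2:
        return "habitual"
    return "dormant"
```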

def route_fn(query, candidate_targets):
    """Policy hook for selecting habitual actions."""

    ...
    return selected_targets


def choose_next(current_node, graph, route_fn, query, visited):
    edges = graph.outgoing(current_node)
    candidates, suppressed = [], set()
    for edge in edges:
        if edge.weight <= -0.01:
            suppressed.add(edge.target)
            continue
        if edge.weight >= 0.6 and edge.target not in visited:
            candidates.append((edge.target, edge.weight))
        if 0.2 <= edge.weight < 0.6 and edge.target not in visited:
            candidates.append((edge.target, edge.weight))

    habitual = [t for t, w in candidates if 0.2 <= w < 0.6]
    auto = [t for t, w in candidates if w >= 0.6]
    # When no router is supplied, only reflex edges are auto-followed.
    selected = route_fn(query, habitual) if route_fn else []
    return [n for n in auto + selected if n not in suppressed and n not in visited]

3. Learning Rule and Dynamics

A production-oriented objective is:

\[\min_{W}\; \mathbb{E}_{q\sim\mathcal{D}}[\mathrm{tokens}(F(q;W))] \;\text{s.t.}\; \mathbb{E}_{q\sim\mathcal{D}}[\tilde z(q)]\ge \rho,\; |F|\le B,\]

\[\tilde z\in\{0,1\},\quad z=2\tilde z-1\in\{-1,+1\}\]

In plain terms: minimize context tokens while maintaining retrieval success, with signed outcome \(z\) used in updates.

How the Policy Gradient Actually Works

For a fixed-node softmax policy, each chosen action increases its own logit and decreases competing logits in proportion to their current probability mass.

The accessible math

\[\pi_W(a\mid s=i)=\frac{\exp\bigl((r_{ia}+w_{ia})/\tau\bigr)}{\sum_{j\in\mathcal{N}(i)\cup\{\texttt{STOP}\}}\exp\bigl((r_{ij}+w_{ij})/\tau\bigr)}\]

At node \(i\), write \(\boldsymbol{\pi}_i\) for the full action distribution and \(\mathbf{e}_a\) for the one-hot vector at the chosen action. The true score function is:

\[\nabla \log \pi_W(a\mid i)=\frac{1}{\tau}(\mathbf{e}_a-\boldsymbol{\pi}_i)\]

Numerical example

Suppose one node has three outgoing edges plus STOP, with \(w = [0.5, 0.3, -0.2, 0.0]\), \(\tau=1\), and chosen action \(A\).

| Target | Weight | Softmax \(\pi(j\mid i)\) |
| --- | --- | --- |
| A (chosen) | 0.500 | 0.342 |
| B | 0.300 | 0.280 |
| C | -0.200 | 0.170 |
| STOP | 0.000 | 0.208 |

With \(z=+1\), baseline \(b=0\), \(\eta=0.1\), \(\gamma=1\), and \(\tau=1\):

| Target | True PG update | Old heuristic update |
| --- | --- | --- |
| A (chosen) | \(+0.066\) | \(+0.100\) |
| B | \(-0.028\) | \(0\) |
| C | \(-0.017\) | \(0\) |
| STOP | \(-0.021\) | \(0\) |
| Sum | \(0.000\) | \(+0.100\) |

True PG reallocates probability mass: the +0.066 gain on A is exactly offset by -0.066 across alternatives. The heuristic only adds to A and drifts the local mass upward.
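The numerical example above can be checked directly. This snippet recomputes the softmax and the true-PG update from the stated weights, taking \(r_{ij}=0\), \(\tau=1\), \(\eta=0.1\), \(z=+1\), and \(b=0\) as in the example.

```python
import math

def softmax(logits, tau=1.0):
    exps = [math.exp(x / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

w = [0.5, 0.3, -0.2, 0.0]  # edges A, B, C, then STOP
pi = softmax(w)

# True-PG update for chosen action A: eta * (z - b) * (one_hot - pi) / tau
eta, chosen = 0.1, 0
delta = [eta * ((1.0 if j == chosen else 0.0) - p) for j, p in enumerate(pi)]

print([round(p, 3) for p in pi])     # [0.342, 0.28, 0.17, 0.208]
print([round(d, 3) for d in delta])  # [0.066, -0.028, -0.017, -0.021]
# The deltas sum to (numerically) zero: probability mass is reallocated, not added.
```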

Why this update matters

Connection to Gu (2016): full-trajectory credit

The corrected part is not just “use a better formula at one step.” We sum gradients across the full trajectory, not just the last transition:

\[\Delta W=\eta(z-b)\sum_{\ell=0}^{T}\gamma^\ell\nabla_W\log\pi_W(a_\ell\mid s_\ell)\]

Updates are applied at every trajectory step, including early hops, even when only the terminal outcome is observed.

In practical deployments, a strong early hop can still receive positive credit when a later branch fails; a bad mid-route branch can receive negative credit without requiring extra internal simulation.
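The full-trajectory update can be sketched over a per-state dictionary of edge logits. Function and variable names here are illustrative, not the package's internals, and the sketch omits STOP handling, decay, and weight bounds.

```python
import math

def trajectory_update(weights, trajectory, z, b=0.0, eta=0.1, gamma=1.0, tau=1.0):
    """Apply Delta W = eta*(z - b) * sum_l gamma**l * grad log pi over one episode.

    weights: {state: {action: logit}}; trajectory: [(state, chosen_action), ...].
    Sketch only -- the real updater also handles STOP, decay, and bounds."""
    for step, (state, action) in enumerate(trajectory):
        logits = weights[state]
        exps = {a: math.exp(w / tau) for a, w in logits.items()}
        total = sum(exps.values())
        scale = eta * (z - b) * (gamma ** step) / tau
        # grad log pi = (one_hot(action) - pi) / tau, applied to every action
        for a in list(logits):
            pi_a = exps[a] / total
            logits[a] += scale * ((1.0 if a == action else 0.0) - pi_a)
    return weights
```

With terminal outcome z = +1, every hop on the route is strengthened against its local alternatives, including the earliest ones.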

In addition to policy updates, traversal updates include decay and bounded autotuning. This keeps edge weights interpretable and keeps the graph reusable across changing workloads.

Traversal uses episode-local attenuation for repeated loops. For edge \(e\), within-episode effective weight is:

\[\tilde w = w\cdot \lambda^k,\quad \lambda\in(0,1),\ k=\text{reuses in episode}\]

A separate visit-penalty factor can be modeled as \(\tilde w \leftarrow \tilde w - \alpha v_j\) with \(\alpha\) separate from \(\lambda\).

When an edge is reused repeatedly in one route, its effective influence is reduced, reducing lock-in while preserving recoverability.
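The attenuation rule is a one-liner; the function below combines both factors from the text, with `lam` and `alpha` standing in for \(\lambda\) and the visit-penalty factor. Note that with \(\lambda=0.3\), even a reflex-strength edge (\(w=1.0\)) falls to 0.09 after two reuses in the same episode, below the habitual floor of 0.2.

```python
def effective_weight(w, reuses, lam=0.3, alpha=0.0, visits=0):
    """Within-episode effective weight: w * lam**k, minus an optional
    visit penalty alpha * v_j. Persistent weight w is untouched."""
    return w * (lam ** reuses) - alpha * visits
```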

Two-Timescale Architecture

OpenClawBrain runs on two explicit timescales: online learning per query, and scheduled structural maintenance.

Online learning (fast loop)

Each query performs:

query → traverse → log trace → feedback → learn
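The five stages above can be sketched as one function; the `brain` method names are illustrative, not the package API.

```python
def fast_loop(brain, query, outcome_fn):
    """One online-learning cycle: query -> traverse -> log trace -> feedback -> learn.

    `brain` is any object exposing traverse/log_trace/learn; `outcome_fn`
    maps the fired nodes to a signed outcome in {-1, +1}."""
    fired = brain.traverse(query)           # seed + graph traversal
    trace = brain.log_trace(query, fired)   # persist the route for credit assignment
    z = outcome_fn(fired)                   # feedback signal
    brain.learn(trace, outcome=z)           # trajectory-level edge update
    return fired, z
```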

Maintenance (slow loop)

Periodic maintenance runs structural operators outside request latency:

run_maintenance()

run_maintenance() is scheduler-agnostic (cron, timer, CI, ad-hoc orchestration).

Persistent worker (openclawbrain daemon)

For production, openclawbrain daemon --state PATH starts a long-lived process that loads state once and accepts JSON-RPC over stdin/stdout. This eliminates per-call state reload, saving roughly 100–800 ms per call.

Context Lifecycle

Files (edit) → Sync (re-embed) → Graph (learn) → Maintain (prune/merge) → Compact (shrink files)

This matches the operational flow used in setup-runbook automation.

4. Simulation Results

The following results are reproducible from 15 scripts in openclawbrain/sims and their generated figures.

4.1 Deploy Pipeline Compilation

Axis: Context efficiency (Axis 2). The 4-hop deployment chain reaches reflex (1.0) by Q10 and stays there, turning procedural traversal into near-reflex execution.

Measured: in 50 repeated deploy-route queries, edges move from habitual to reflex regime, with final weights at the reflex threshold.

| Query | deploy_query→check_ci | check_ci→inspect_manifest | inspect_manifest→rollback | rollback→verify |
| --- | --- | --- | --- | --- |
| 1 | 0.425 | 0.420 | 0.416 | 0.411 |
| 10 | 1.000 | 1.000 | 1.000 | 1.000 |
| 25 | 1.000 | 1.000 | 1.000 | 1.000 |
| 50 | 1.000 | 1.000 | 1.000 | 1.000 |
Deploy pipeline edge weights across 50 queries.
Figure 2. Deploy pipeline: edge strengths climb to reflex weights.

4.2 Negation Learning

Axis: Retrieval accuracy (Axis 1). Bad edge reaches -0.940 while the good edge remains 1.0, so stale wrong guidance is suppressed.

Measured: inhibitory learning reaches \(-0.940\) on a deprecated path under negative outcomes.

| Phase | Good edge | Bad edge |
| --- | --- | --- |
| Q1 | 1.0 | 0.45 |
| Q11 | 1.0 | -0.44 |
| Q12 | 1.0 | -0.94 |
| Q20 | 1.0 | -0.94 |
Good and bad edge weights over 20 queries
Figure 3. Negation path suppressed after repeated negative feedback.

4.3 Context Reduction

Axis: Context efficiency (Axis 2). Nodes fired drop from 30 on Q1 to 2.7 on average (Q91-100), a 91% reduction.

Measured: context consumption in a focused workload drops from 30 nodes on first query to 2.7 nodes on average in the final 10 queries.

| Metric | Value |
| --- | --- |
| Nodes fired (query 1) | 30 |
| Avg nodes fired (queries 91–100) | 2.7 |
Nodes fired per query in context reduction simulation
Figure 4. Selective context reduction from broad initial retrieval to focused routing.

4.4 Forgetting Dynamics

Axis: Context efficiency (Axis 2). Selective forgetting reaches 93.3% dormant by Q25, leaving 6.7% active.

Measured: selective forgetting produces high dormant mass after adaptation: 93.3% dormant by 100 queries.

| Query | Dormant | Habitual | Reflex | Dormant % |
| --- | --- | --- | --- | --- |
| 1 | 0 | 149 | 1 | 0.0% |
| 25 | 140 | 0 | 10 | 93.3% |
| 50 | 140 | 0 | 10 | 93.3% |
| 100 | 140 | 0 | 10 | 93.3% |
Dormant/habitual/reflex edge proportions across 100 queries
Figure 5. Edge tier distribution: most edges become dormant while useful edges remain reflex.

4.5 Edge Damping vs Undamped Traversal

Axis: Retrieval accuracy (Axis 1). Without damping, the toy cycle loops; with \(\lambda=0.3\), the target is reached in 4 hops.

Measured: with damping \(\lambda=0.3\), a branch to a target is reached; with undamped dynamics \(\lambda=1.0\), the traversal loops.

| Config | Reached D | Observed steps to D | Observed issue |
| --- | --- | --- | --- |
| Damped \(\lambda=0.3\) | Yes | 4 | Branch discovered |
| Undamped \(\lambda=1.0\) | No | — | Looping through A→B→C→A |
Edge damping comparison
Figure 6. Edge damping enables branch exploration under repeated traversal.

4.6 Domain Separation and Bridges

Axis: Retrieval accuracy (Axis 1). Only 5 cross-file edges emerge across 2 clusters, limiting irrelevant-domain bleed-through.

Measured: mixed query patterns induce sparse cross-file connectivity. The simulation produces 5 cross-file edges and two major clusters.

| Query Count | Cross-file Edges | Clusters | Representative Cross-edge Count |
| --- | --- | --- | --- |
| 50 | 5 | 2 | 5 |
Cross-file bridge emergence across simulated files
Figure 7. Domain separation with sparse bridges between groups.

4.7 Brain-death Recovery

Axis: Context efficiency (Axis 2). Recovery controls detect dormant-heavy states (>90%) and return the graph to recoverable routing dynamics.

Measured: in recovery conditions with aggressive decay, the autotune diagnostics identify and correct unhealthy dynamics. Dormancy is detected at \(>90\%\) and the system proposes control updates.

| Measure | Initial | After Recovery Sequence |
| --- | --- | --- |
| Dormant % | 100.0% | 97.5% |
| Autotune signal | Absent | Active (decay, Hebbian, promotion) |
| Status | Dormant-dominant | Recoverable and moving |
Health trajectory under brain-death and recovery
Figure 8. Health tracking over recovery rounds.

4.8 Individuation Across Workloads

Axis: Context efficiency (Axis 2). Same workload starts with shared structure; after updates, 27 edges differ by more than 0.05.

Measured: starting from identical graph state, different workloads produce structurally distinct outcomes. The number of edges that differ by more than 0.05 after 50 queries is 27.

| Metric | Graph A | Graph B |
| --- | --- | --- |
| Nodes | 18 | 18 |
| Edges | 306 | 306 |
| Mean abs diff (A vs B) | 0.0358 | |
| Edges with diff > 0.05 (A vs B) | 27 | |
Final graph weight matrices for coding versus safety workloads
Figure 9. Individuation: workload-specific graph topology after 50 updates.

4.9 Simulation Protocol

All reported studies share a single reproducibility convention: fixed random seeds within each script, one output JSON file per script, and, where a figure is reported, one dedicated figure generated from that JSON.

| Result | Script | Claim Output | Figure | Query Budget |
| --- | --- | --- | --- | --- |
| Deploy pipeline | deploy_pipeline.py | deploy_pipeline_results.json | deploy_pipeline.png | 50 |
| Negation | negation.py | negation_results.json | negation.png | 20 |
| Context reduction | context_reduction.py | context_reduction_results.json | context_reduction.png | 100 |
| Selective forgetting | forgetting.py | forgetting_results.json | forgetting.png | 100 |
| Edge damping | edge_damping.py | edge_damping_results.json | edge_damping.png | 1 traversal run |
| Domain separation | domain_separation.py | domain_separation_results.json | domain_separation.png | 50 |
| PG vs Heuristic | pg_vs_heuristic.py | pg_vs_heuristic_results.json | — | 100 |
| Noise robustness | noise_robustness.py | noise_robustness_results.json | — | 100 |
| Static vs learning | static_vs_learning.py | static_vs_learning_results.json | — | 100 |
| Scaling analysis | scaling_analysis.py | scaling_analysis_results.json | — | 50 (per graph size) |
| Brain death recovery | brain_death.py | brain_death_results.json | brain_death.png | 30 rounds |
| Individuation | individuation.py | individuation_results.json | individuation.png | 50 per workload |

Two commands run all 15 simulations and regenerate the figures (simulations first, since figures are built from their JSON outputs):

cd openclawbrain/sims
python run_all.py
python generate_figures.py

Result claims in this page are extracted from the stored JSON records and are intentionally limited to fields that can be traced to deterministic outputs.

4.10 Provenance of the Claim Set

| Claim | Evidence | Evidence Anchor |
| --- | --- | --- |
| Pipeline edges compile to reflex in 50 queries | Final edge map weight 1.0 across four edges | deploy_pipeline_results.json |
| Inhibitory edge reaches -0.940 | Bad edge final weight field | negation_results.json |
| Context drops 30→2.7 fired nodes | Final 10-query average field | context_reduction_results.json |
| 93.3% dormant after forgetting run | Final dormant percentage field | forgetting_results.json |
| Damped branch reached, undamped loops | Boolean reached_D flags | edge_damping_results.json |
| 5 cross-file edges emerge | Final cross-file edge count | domain_separation_results.json |
| Autotune flags dormant failure and suggests recovery | Autotune suggestion list non-empty | brain_death_results.json |
| 27 edges differ by >0.05 | Structural distinctness diff count | individuation_results.json |
| True PG keeps lower overall edge mass than heuristic | 53.68 vs 71.20 total mass | pg_vs_heuristic_results.json |
| Noise robustness collapses near 30% noise | Reflex at 30% is false by Q100 | noise_robustness_results.json |
| Scaling keeps route length stable | 5 fired nodes from 50 to 2000 nodes | scaling_analysis_results.json |

4.11 New Results (Ablations)

Axis: Context efficiency (Axis 2). True PG vs heuristic show 53.68 vs 71.20 total edge weight and 0.3124 vs 0.4226 distractor average.

True PG vs heuristic: both methods reach the target by query 2; true PG has lower total weight and lower distractor edge weight in this run.

| Method | Convergence | Total edge weight | Avg. distractor edge | Comment |
| --- | --- | --- | --- | --- |
| Heuristic PG | Q2 | 71.20 | 0.4226 | Higher mass growth |
| True PG | Q2 | 53.68 | 0.3124 | Less global inflation |

Axis: Retrieval accuracy (Axis 1). Correct path remains reflex through 0%, 10%, and 20% noise; degradation accelerates at 30%.

Noise robustness: degradation is gradual until it becomes sharp at 30%.

| Noise rate | Reflex by Q100 | Avg nodes fired (last 10) | Correct-minus-distractor gap | Inhibitory risk |
| --- | --- | --- | --- | --- |
| 0% | Yes | 5.0 | 0.5650 | none |
| 10% | Yes | 4.0 | 0.4625 | low |
| 20% | Yes | 5.0 | 0.4622 | low |
| 30% | No | 4.0 | 0.3072 | moderate |

Axis: Retrieval accuracy + Context efficiency (Axes 1+2). Static, heuristic, and PG all converge to stable paths; learned methods strengthen path confidence.

Static vs learning baseline: static traversal is stable but does not raise the same route confidence.

| Method | Converges to stable path | Final correct-path weights | Avg fired nodes | Queries until <5 nodes |
| --- | --- | --- | --- | --- |
| Static traversal | Q1 | 0.40, 0.40, 0.40, 0.40 | 5.0 | Not reached in 100 |
| Heuristic learning | Q1 | 1.00, 1.00, 1.00, 1.00 | 5.0 | Not reached in 100 |
| True PG | Q1 | 1.00, 1.00, 1.00, 1.00 | 5.0 | Not reached in 100 |

Axis: Context efficiency (Axis 2). Traversal remains at 5 fired nodes while graph size grows 50 → 2000 and time 0.0962 → 1.8766 ms.

Scaling: fixed hop/beam settings keep route length at 5 while traversal time grows sublinearly.

| Graph size | Avg traversal (ms) | Avg nodes fired |
| --- | --- | --- |
| 50 | 0.0962 | 5 |
| 100 | 0.1432 | 5 |
| 250 | 0.2814 | 5 |
| 500 | 0.5169 | 5 |
| 1000 | 0.9657 | 5 |
| 2000 | 1.8766 | 5 |

Benchmark comparison (expanded ablations)

This benchmark run includes 45 queries and covers baselines and ablations, including static_traverse and no_inhibition.

| Method | Recall@3 | Recall@5 | Precision@3 | MRR | Latency p50 / p95 (ms) |
| --- | --- | --- | --- | --- | --- |
| static_traverse | 0.270 | 0.304 | 0.170 | 0.381 | 46.27 / 47.79 |
| keyword_overlap | 0.289 | 0.515 | 0.163 | 0.295 | 10.74 / 11.11 |
| hash_embed_similarity | 0.270 | 0.433 | 0.170 | 0.428 | 44.81 / 46.51 |
| ocb_traverse | 0.270 | 0.304 | 0.170 | 0.381 | 46.73 / 48.07 |
| ocb_no_inhibition | 0.270 | 0.304 | 0.170 | 0.381 | 46.46 / 47.93 |
| ocb_pg | 0.270 | 0.304 | 0.170 | 0.381 | 45.98 / 47.64 |
| ocb_with_replay | 0.289 | 0.437 | 0.163 | 0.246 | 20.86 / 22.98 |
| ocb_pg_with_replay | 0.289 | 0.437 | 0.163 | 0.246 | 20.77 / 23.23 |
Results are reproducible from the 15 simulation scripts and local benchmark harness in this repository.

4.14 Structural Maintenance Results

OpenClawBrain separates per-query learning from periodic structural maintenance.

Maintenance runs through health → decay → merge → prune and can be scheduled with any workflow.
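A toy version of the slow loop over a `{(src, dst): weight}` edge map, in the documented health, decay, merge, prune order (merge is omitted since it needs embeddings); the thresholds are illustrative, not package defaults.

```python
def run_maintenance(graph, decay=0.9, prune_below=0.05):
    """Slow-loop sketch: health check, then decay, then prune.

    graph: {(src, dst): weight}. Returns a small health report computed
    before weights are modified."""
    # Health: fraction of edges already in the dormant regime (w < 0.2)
    dormant = sum(1 for w in graph.values() if w < 0.2)
    health = {"dormant_frac": dormant / len(graph) if graph else 0.0}
    # Decay: pull every weight toward zero
    for edge in graph:
        graph[edge] *= decay
    # Prune: drop edges whose magnitude fell below the floor
    for edge in [e for e, w in graph.items() if abs(w) < prune_below]:
        del graph[edge]
    return health
```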

All maintained brains use OpenAI text-embedding-3-small (1536-dim).

4.14.1 Merge Compression

Merge compression reduced active nodes from 40 to 32 and fired value from 2.0 to 1.0 (50% reduction).

| Metric | Before | After |
| --- | --- | --- |
| Nodes fired | 40 | 32 |
| Fired value | 2.0 | 1.0 |

4.14.2 Prune Health

Prune-health dropped edges from 870 to 24, while dormant moved from 98% to 29% and traversal still worked after prune.

| Metric | Before | After |
| --- | --- | --- |
| Edges | 870 | 24 |
| Dormant | 98% | 29% |

4.14.3 Full Maintain vs Edge-Only (200 queries)

Full maintenance is 12% cheaper in total context cost than edge-only (1,815 → 1,598).

| Strategy | Total context cost | Cost delta |
| --- | --- | --- |
| Edge-only | 1,815 | baseline |
| Full maintain | 1,598 | -12% |

4.14.4 Production Maintenance (first cycle)

Merged nodes are larger due to context combination, so context chars do not always shrink; routing becomes materially sparser.

| Brain | Nodes | Edges | Pruned | Merged | Learnings |
| --- | --- | --- | --- | --- | --- |
| MAIN | 1,160 | 2,551 | 70 | 2 | 43 |
| PELICAN | 555 | 2,211 | 271 | 1 | 181 |
| BOUNTIFUL | 289 | 1,101 | 450 | 2 | 35 |

Maintenance schedule and configuration are documented in docs/setup-guide.md.

4.12 External Benchmarks: MultiHop-RAG (n=1000)

Axis: Retrieval accuracy (Axis 1). Cold traversal is +7.7% over embedding on full hit (0.6430 vs 0.5660), while learning drops to 0.4770 (-16.6% at 1000).

Corpus: dataset facts and paragraphs from MultiHop-RAG facts. For each query, labels define evidence paragraphs to retrieve. Method set: embedding top-k and three OpenClawBrain variants with cold and online updates.

| Method | Full hit rate | Partial hit rate | Evidence recall@10 | MRR | Nodes fired |
| --- | --- | --- | --- | --- | --- |
| embedding_topk | 0.5660 | 0.9800 | 0.8058 | 0.7942 | 10.000 |
| ocb_cold | 0.6430 | 0.9770 | 0.8224 | 0.7933 | 10.599 |
| ocb_learning | 0.4770 | 0.8530 | 0.6658 | 0.6916 | 9.982 |
| ocb_pg_learning | 0.4070 | 0.7890 | 0.5900 | 0.6534 | 9.871 |

Cold traversal improves immediate evidence recall over embedding-only retrieval. Online updates on a non-repeating stream reduce the final full-hit rate, consistent with the assumption that sustained gains require repeated tasks.

4.13 External Benchmarks: HotPotQA distractor (n=500)

Axis: Retrieval accuracy (Axis 1). Cold traversal improves SP@5 by +33.6% (0.9700 vs 0.6340); learning is roughly neutral over early checks.

Corpus: dataset passages and labeled evidence for distractor suppression evaluation.

| Method | SP recall@5 | SP recall@10 | Single SP hit@5 | Distractor suppression | MRR | Nodes fired |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_topk | 0.6340 | 0.8080 | 0.9760 | 0.8202 | 0.8969 | 10.000 |
| ocb_cold | 0.9700 | 0.9840 | 0.9740 | 0.7623 | 0.8961 | 8.764 |
| ocb_learning | 0.9520 | 0.9840 | 0.9740 | 0.7656 | 0.8961 | 8.948 |
| ocb_pg_learning | 0.8980 | 0.9780 | 0.9740 | 0.7519 | 0.8945 | 8.316 |

Cold traversal is strongest at low and late checkpoints, while learned variants can improve early checkpoints and weaken under longer, diverse query streams.

Axis framing: These external benchmarks are Axis 1 only. They use unique, non-repeating queries, so Axis 2 (repetition-driven context compression) is not testable.

5. Production Deployment

Three production brains run on a Mac Mini M4 Pro, each built from workspace markdown files, session replay, and a learning database of human corrections:

Axis framing: Context efficiency on Axis 2. Startup context moved from 52--66KB to 3--13KB per query through on-demand graph routing after repeated task use.

| Brain | Nodes | Edges | Learning Corrections |
| --- | --- | --- | --- |
| MAIN | 1,160 | 2,551 | 43 |
| PELICAN | 555 | 2,211 | 181 |
| BOUNTIFUL | 289 | 1,101 | 35 |

All 2,004 nodes have real OpenAI text-embedding-3-small embeddings (1536-dim). Existing embeddings are cached and reused across rebuilds — only new nodes require API calls.

Learning corrections (259 active) are injected as first-class graph nodes and connected to relevant workspace nodes via cosine similarity. This means the graph retrieves not just workspace content but how to avoid past mistakes.
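A minimal sketch of correction injection by cosine similarity, assuming plain list vectors; `link_correction`, the 0.35 threshold, and the 0.4 initial edge weight are illustrative choices, not the package's values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def link_correction(correction_vec, node_vecs, threshold=0.35, init_weight=0.4):
    """Connect an injected correction node to every workspace node whose
    embedding clears the similarity threshold; returns {node_id: edge_weight}."""
    return {
        node_id: init_weight
        for node_id, vec in node_vecs.items()
        if cosine(correction_vec, vec) >= threshold
    }
```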

6. How It Compares

| | Plain RAG | OpenClawBrain |
| --- | --- | --- |
| Retrieval | Similarity search | Learned graph traversal |
| Feedback | None | +1/-1 outcomes update edge weights |
| Wrong answers | Keep resurfacing | Inhibitory edges suppress them |
| Over time | Same results | Routes compile to reflex behavior |
| Dependencies | Vector DB | Python core; optional embedding/LLM callbacks |

Positioning against related systems

MemGPT is a runtime memory operating-system stack, while OpenClawBrain is a routing layer: a graph policy for what to retrieve and feed next, not a replacement memory system.

Reflexion: textual reflection over time, whereas OpenClawBrain stores learned routing decisions as numeric edge weights over a graph.

Self-RAG: generation-level self-critique and revision; OpenClawBrain is retrieval-routing-level learning that shapes traversal before generation.

Related family: bandit-style routing and learning-to-rank ideas motivate the same principle in other forms — choose actions with feedback; OpenClawBrain applies it inside a sparse document graph with explicit STOP and edge-level inhibition.

7. Limitations and Conclusion

Limitations

  1. Evidence on this page is limited to the 15 reproducible simulation classes listed above, including ablations and robustness checks.
  2. Routing quality depends on seeding quality: poor embeddings or weak seeders produce weak starting routes, and we do not yet have a formal open benchmark gap-filling procedure for seeding failure modes.
  3. Absolute magnitudes depend on synthetic setup and parameter choices in each simulator.
  4. Autotune and traversal behavior should still be treated as adaptive heuristics for further real-workload calibration.

Conclusion

OpenClawBrain improves retrieval accuracy through graph-structured traversal (+7.7% on MultiHop-RAG, +33.6% on HotPotQA versus embedding-only baselines) and compresses context on repeated tasks through trajectory-level learning (30→2.7 nodes in simulation, 52--66KB→3--13KB in production).

The true REINFORCE policy gradient keeps edge weights bounded (53.68 vs 71.20 total weight) and suppresses distractor edges more effectively (0.3124 vs 0.4226 average distractor weight). Benefits depend on graph structure and repeated-query patterns; with diverse one-off queries, learning can add little or degrade accuracy.

This version is scoped to provable outputs: 15 deterministic simulations and the benchmark harness in this repo, with updated ablations and noise robustness.

References

  1. Gu, J. (2016). Corrected policy-gradient update rules for reinforcement learning. UCLA Econometrics Field Paper. [Background and derivation]
  2. Gu, J. (2016). Corrected policy-gradient update for recurrent action sequences. UCLA Econometrics Field Paper.
  3. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
  4. Collins, A. M. & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407–428.
  5. Graves, A., Wayne, G. & Danihelka, I. (2014). Neural Turing Machines. arXiv:1410.5401.
  6. Graves, A. et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538, 471–476.
  7. Park, J. S. et al. (2023). Generative agents: Interactive simulacra of human behavior. UIST 2023.
  8. Wang, G. et al. (2023). Voyager: An open-ended embodied agent with large language models. arXiv:2305.16291.
  9. Packer, C. et al. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
  10. Shinn, N. et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023.
  11. Yao, S. et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR 2023.
  12. Asai, A. et al. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ICLR 2024.
  13. Sun, J. et al. (2024). Think-on-Graph: Deep and responsible reasoning of large language model on knowledge graph. ICLR 2024.