
OpenClawBrain: Learned Graph Traversal for Retrieval Routing

Retrieval accuracy and context efficiency in one routing layer.

Graph-structured retrieval with trajectory-level learning.

Jonathan Gu · jonathangu@gmail.com

A brain-shaped neural network of document nodes and weighted edges — green reflex paths, blue habitual, gray dormant

Install & Run

# Build a brain
python3 -m openclawbrain init --workspace examples/sample_workspace --output /tmp/brain

# Check state health
python3 -m openclawbrain doctor --state /tmp/brain/state.json
# ... summary includes 8/9 checks

# Query (JSON output includes node IDs)
python3 -m openclawbrain query "how do I deploy" --state /tmp/brain/state.json --top 3 --json

# Teach positive outcome
python3 -m openclawbrain learn --state /tmp/brain/state.json --outcome 1.0 --fired-ids "deploy.md::0,deploy.md::1"

# Inject correction
python3 -m openclawbrain inject --state /tmp/brain/state.json \
  --id "fix::1" --content "Never skip CI for hotfixes" --type CORRECTION

Core graph and traversal are in pure Python. The package has no required third-party runtime dependencies; callers provide embeddings and LLM callbacks via optional interfaces. See README and REPRODUCE.md.
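Because embeddings arrive through caller-provided callbacks, wiring in a provider can be a single function. The sketch below uses a deterministic hash-based stand-in; `toy_embed` and the `embed_fn=` wiring in the trailing comment are illustrative names, not the package's actual API.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Deterministic stand-in for a real embedding callback (e.g. an OpenAI client)."""
    vec = []
    for i in range(dim):
        digest = hashlib.sha256(f"{i}:{text}".encode()).digest()
        # Map the first 4 bytes of each per-dimension hash into [-0.5, 0.5)
        vec.append(int.from_bytes(digest[:4], "big") / 2**32 - 0.5)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# A caller would hand a function with this shape to the brain in place of a
# real embedding client (hypothetical wiring, not the package API):
# brain = Brain.load(state_path, embed_fn=toy_embed)
```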

Abstract

Standard retrieval, including lexical or embedding similarity matching, selects chunks independently per query and does not learn persistent routing over multi-step trajectories from outcome feedback. OpenClawBrain provides value on two axes: retrieval accuracy and context efficiency. It augments query seeding with a learned graph router over document chunks where nodes are chunks and edges are mutable signed pointers. For accuracy, graph structure enables better retrieval than flat embedding-only retrieval on repeated procedural paths. For efficiency, repeated outcomes compress context from 30 nodes to 2.7 in deployment simulation (91% reduction) and from 52--66KB to 3--13KB per query in production via trajectory learning.

This page is the companion guide to the formal paper (openclawbrain.tex / openclawbrain.pdf) in this directory. The paper contains the reference definitions, equations, and benchmarks; this page presents the same claims with implementation context.

1. The Problem

Memory retrieval should be about behavior selection, not nearest-neighbor matching. Every query should ask: which sequence of dependencies, safeguards, and operations is useful now, given previous outcomes? A static similarity index cannot determine this.

OpenClawBrain treats memory as a decision surface: a directed, weighted graph where edges represent route-level transitions. We evaluate behavior through simulations backed by code in this repository, and where broader benchmark data are absent from this suite we say so rather than make unsupported claims.

The claims below are constrained to the 15 deterministic simulations and 236 tests, with results from the reproducible benchmark harness in this repo. Axis framing: Section 4 uses Axis 2 (context efficiency) for repeated-task experiments and Axis 1 (accuracy) for behavior correctness.

2. The Graph

OpenClawBrain is a directed graph of chunk nodes with typed edges carrying bounded scalar weight. Implementations may store metadata (for example, edge kind and summaries) alongside the graph state.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class Node:
    id: str
    content: str
    summary: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)

@dataclass
class Edge:
    source: str
    target: str
    weight: float
    kind: str
    summary: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)

A query produces seed nodes \(\mathcal S(q)\) and an initial frontier. Fired nodes are the nodes visited during traversal; only fired nodes are eligible for reading or execution.

Vertical flowchart: Query to embedding search, cheap router, graph traversal, and expensive context assembly
Figure 1. Inference architecture: seed retrieval, cheap routing, graph traversal, and expensive context assembly.

Tiering uses explicit precedence on each step: reflex, habitual, inhibitory, then dormant. Inhibition is a hard veto over support, not a soft penalty.

| Tier | Weight Range | Behavior | Expected Cost |
| --- | --- | --- | --- |
| Reflex | >= 0.6 | Auto-follow. | Near zero deliberation. |
| Habitual | 0.2 – 0.6 | Cheap routing decides whether to follow. | Low. |
| Inhibitory | <= -0.01 | Skips / suppresses nodes. | Near zero. |
| Dormant | < 0.2 | Skipped unless directly re-seeded. | Zero. |

The graph is deliberately sparse in active retrieval because routing should remain selective. Dormant edges are not deleted; they can recover if feedback returns.
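The precedence in the tier table can be captured in a few lines. The thresholds below come directly from the table; the function name `tier` is ours, not the package's.

```python
def tier(weight: float) -> str:
    """Classify an edge weight into the tier table's regimes.

    Precedence mirrors the text: inhibition is a hard veto checked first,
    then reflex, then habitual; everything else is dormant."""
    if weight <= -0.01:
        return "inhibitory"
    if weight >= 0.6:
        return "reflex"
    if weight >= 0.2:
        return "habitual"
    return "dormant"
```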

def route_fn(query, candidate_targets):
    """Policy hook for selecting habitual actions."""

    ...
    return selected_targets


def choose_next(current_node, graph, route_fn, query, visited):
    edges = graph.outgoing(current_node)
    candidates, suppressed = [], set()
    for edge in edges:
        if edge.weight <= -0.01:
            suppressed.add(edge.target)
            continue
        if edge.weight >= 0.6 and edge.target not in visited:
            candidates.append((edge.target, edge.weight))
        if 0.2 <= edge.weight < 0.6 and edge.target not in visited:
            candidates.append((edge.target, edge.weight))

    habitual = [t for t, w in candidates if 0.2 <= w < 0.6]
    auto = [t for t, w in candidates if w >= 0.6]
    # When no router is supplied, only reflex edges are auto-followed.
    selected = route_fn(query, habitual) if route_fn else []
    return [n for n in auto + selected if n not in suppressed and n not in visited]

3. Learning Rule and Dynamics

A production-oriented objective is:

\[\min_{W}\; \mathbb{E}_{q\sim\mathcal{D}}[\mathrm{tokens}(F(q;W))] \;\text{s.t.}\; \mathbb{E}_{q\sim\mathcal{D}}[\tilde z(q)]\ge \rho,\; |F|\le B,\]

\[\tilde z\in\{0,1\},\quad z=2\tilde z-1\in\{-1,+1\}\]

In plain terms: minimize context tokens while maintaining retrieval success, with signed outcome \(z\) used in updates.

How the Policy Gradient Actually Works

For a fixed-node softmax policy, each chosen action increases its own logit and decreases competing logits in proportion to their current probability mass.

The accessible math

\[\pi_W(a\mid s=i)=\frac{\exp\bigl((r_{ia}+w_{ia})/\tau\bigr)}{\sum_{j\in\mathcal{N}(i)\cup\{\texttt{STOP}\}}\exp\bigl((r_{ij}+w_{ij})/\tau\bigr)}\]

At node \(i\), write \(\boldsymbol{\pi}_i\) for the full action distribution and \(\mathbf{e}_a\) for the one-hot vector at the chosen action. The true score function is:

\[\nabla \log \pi_W(a\mid i)=\frac{1}{\tau}(\mathbf{e}_a-\boldsymbol{\pi}_i)\]

Numerical example

Suppose one node has three outgoing edges plus STOP, with \(w = [0.5, 0.3, -0.2, 0.0]\), \(\tau=1\), and chosen action \(A\).

| Target | Weight | Softmax \(\pi(j\mid i)\) |
| --- | --- | --- |
| A (chosen) | 0.500 | 0.342 |
| B | 0.300 | 0.280 |
| C | -0.200 | 0.170 |
| STOP | 0.000 | 0.208 |

With \(z=+1\), baseline \(b=0\), \(\eta=0.1\), \(\gamma=1\), and \(\tau=1\):

| Target | True PG update | Old heuristic update |
| --- | --- | --- |
| A (chosen) | \(+0.066\) | \(+0.100\) |
| B | \(-0.028\) | \(0\) |
| C | \(-0.017\) | \(0\) |
| STOP | \(-0.021\) | \(0\) |
| Sum | \(0.000\) | \(+0.100\) |

True PG reallocates probability mass: the +0.066 gain on A is exactly offset by -0.066 across alternatives. The heuristic only adds to A and drifts the local mass upward.
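The numerical example above can be checked directly. This snippet recomputes the softmax and the true-PG update from the stated weights, taking \(r_{ij}=0\), \(\tau=1\), \(\eta=0.1\), \(z=+1\), and \(b=0\) as in the example.

```python
import math

def softmax(logits, tau=1.0):
    exps = [math.exp(x / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

w = [0.5, 0.3, -0.2, 0.0]  # edges A, B, C, then STOP
pi = softmax(w)

# True-PG update for chosen action A: eta * (z - b) * (one_hot - pi) / tau
eta, chosen = 0.1, 0
delta = [eta * ((1.0 if j == chosen else 0.0) - p) for j, p in enumerate(pi)]

print([round(p, 3) for p in pi])     # [0.342, 0.28, 0.17, 0.208]
print([round(d, 3) for d in delta])  # [0.066, -0.028, -0.017, -0.021]
# The deltas sum to (numerically) zero: probability mass is reallocated, not added.
```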

Why this update matters

Connection to Gu (2016): full-trajectory credit

The corrected part is not just “use a better formula at one step.” We sum gradients across the full trajectory, not just the last transition:

\[\Delta W=\eta(z-b)\sum_{\ell=0}^{T}\gamma^\ell\nabla_W\log\pi_W(a_\ell\mid s_\ell)\]

Updates are applied at every trajectory step, including early hops, even when only the terminal outcome is observed.

In practical deployments, a strong early hop can still receive positive credit when a later branch fails; a bad mid-route branch can receive negative credit without requiring extra internal simulation.
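The full-trajectory update can be sketched over a per-state dictionary of edge logits. Function and variable names here are illustrative, not the package's internals, and the sketch omits STOP handling, decay, and weight bounds.

```python
import math

def trajectory_update(weights, trajectory, z, b=0.0, eta=0.1, gamma=1.0, tau=1.0):
    """Apply Delta W = eta*(z - b) * sum_l gamma**l * grad log pi over one episode.

    weights: {state: {action: logit}}; trajectory: [(state, chosen_action), ...].
    Sketch only -- the real updater also handles STOP, decay, and bounds."""
    for step, (state, action) in enumerate(trajectory):
        logits = weights[state]
        exps = {a: math.exp(w / tau) for a, w in logits.items()}
        total = sum(exps.values())
        scale = eta * (z - b) * (gamma ** step) / tau
        # grad log pi = (one_hot(action) - pi) / tau, applied to every action
        for a in list(logits):
            pi_a = exps[a] / total
            logits[a] += scale * ((1.0 if a == action else 0.0) - pi_a)
    return weights
```

With terminal outcome z = +1, every hop on the route is strengthened against its local alternatives, including the earliest ones.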

In addition to policy updates, traversal updates include decay and bounded autotuning. This keeps edge weights interpretable and keeps the graph reusable across changing workloads.

Traversal uses episode-local attenuation for repeated loops. For edge \(e\), within-episode effective weight is:

\[\tilde w = w\cdot \lambda^k,\quad \lambda\in(0,1),\ k=\text{reuses in episode}\]

A separate visit-penalty factor can be modeled as \(\tilde w \leftarrow \tilde w - \alpha v_j\) with \(\alpha\) separate from \(\lambda\).

When an edge is reused repeatedly in one route, its effective influence is reduced, reducing lock-in while preserving recoverability.
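The attenuation rule is a one-liner; the function below combines both factors from the text, with `lam` and `alpha` standing in for \(\lambda\) and the visit-penalty factor. Note that with \(\lambda=0.3\), even a reflex-strength edge (\(w=1.0\)) falls to 0.09 after two reuses in the same episode, below the habitual floor of 0.2.

```python
def effective_weight(w, reuses, lam=0.3, alpha=0.0, visits=0):
    """Within-episode effective weight: w * lam**k, minus an optional
    visit penalty alpha * v_j. Persistent weight w is untouched."""
    return w * (lam ** reuses) - alpha * visits
```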

Two-Timescale Architecture

OpenClawBrain runs on two explicit timescales: online learning per query, and scheduled structural maintenance.

Online learning (fast loop)

Each query performs:

query → traverse → log trace → feedback → learn
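The five stages above can be sketched as one function; the `brain` method names are illustrative, not the package API.

```python
def fast_loop(brain, query, outcome_fn):
    """One online-learning cycle: query -> traverse -> log trace -> feedback -> learn.

    `brain` is any object exposing traverse/log_trace/learn; `outcome_fn`
    maps the fired nodes to a signed outcome in {-1, +1}."""
    fired = brain.traverse(query)           # seed + graph traversal
    trace = brain.log_trace(query, fired)   # persist the route for credit assignment
    z = outcome_fn(fired)                   # feedback signal
    brain.learn(trace, outcome=z)           # trajectory-level edge update
    return fired, z
```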

Maintenance (slow loop)

Periodic maintenance runs structural operators outside request latency:

run_maintenance()

run_maintenance() is scheduler-agnostic (cron, timer, CI, ad-hoc orchestration).

Persistent worker (openclawbrain daemon)

For production, openclawbrain daemon --state PATH starts a long-lived process that loads state once and accepts JSON-RPC over stdin/stdout. This eliminates per-call state reload, saving roughly 100–800 ms per call.

Context Lifecycle

Files (edit) → Sync (re-embed) → Graph (learn) → Maintain (prune/merge) → Compact (shrink files)

This matches the operational flow used in setup-runbook automation.

4. Simulation Results

The following results are reproducible from 15 scripts in openclawbrain/sims and their generated figures.

4.1 Deploy Pipeline Compilation

Axis: Context efficiency (Axis 2). The 4-hop deployment chain reaches reflex (1.0) by Q10 and stays there, turning procedural traversal into near-reflex execution.

Measured: in 50 repeated deploy-route queries, edges move from habitual to reflex regime, with final weights at the reflex threshold.

| Query | deploy_query→check_ci | check_ci→inspect_manifest | inspect_manifest→rollback | rollback→verify |
| --- | --- | --- | --- | --- |
| 1 | 0.425 | 0.420 | 0.416 | 0.411 |
| 10 | 1.000 | 1.000 | 1.000 | 1.000 |
| 25 | 1.000 | 1.000 | 1.000 | 1.000 |
| 50 | 1.000 | 1.000 | 1.000 | 1.000 |
Deploy pipeline edge weights across 50 queries.
Figure 2. Deploy pipeline: edge strengths climb to reflex weights.

4.2 Negation Learning

Axis: Retrieval accuracy (Axis 1). Bad edge reaches -0.940 while the good edge remains 1.0, so stale wrong guidance is suppressed.

Measured: inhibitory learning reaches \(-0.940\) on a deprecated path under negative outcomes.

| Phase | Good edge | Bad edge |
| --- | --- | --- |
| Q1 | 1.0 | 0.45 |
| Q11 | 1.0 | -0.44 |
| Q12 | 1.0 | -0.94 |
| Q20 | 1.0 | -0.94 |
Good and bad edge weights over 20 queries
Figure 3. Negation path suppressed after repeated negative feedback.

4.3 Context Reduction

Axis: Context efficiency (Axis 2). Nodes fired drop from 30 on Q1 to 2.7 on average (Q91-100), a 91% reduction.

Measured: context consumption in a focused workload drops from 30 nodes on first query to 2.7 nodes on average in the final 10 queries.

| Metric | Value |
| --- | --- |
| Nodes fired (query 1) | 30 |
| Avg nodes fired (queries 91–100) | 2.7 |
Nodes fired per query in context reduction simulation
Figure 4. Selective context reduction from broad initial retrieval to focused routing.

4.4 Forgetting Dynamics

Axis: Context efficiency (Axis 2). Selective forgetting reaches 93.3% dormant by Q25, leaving 6.7% active.

Measured: selective forgetting produces high dormant mass after adaptation: 93.3% dormant by 100 queries.

| Query | Dormant | Habitual | Reflex | Dormant % |
| --- | --- | --- | --- | --- |
| 1 | 0 | 149 | 1 | 0.0% |
| 25 | 140 | 0 | 10 | 93.3% |
| 50 | 140 | 0 | 10 | 93.3% |
| 100 | 140 | 0 | 10 | 93.3% |
Dormant/habitual/reflex edge proportions across 100 queries
Figure 5. Edge tier distribution: most edges become dormant while useful edges remain reflex.

4.5 Edge Damping vs Undamped Traversal

Axis: Retrieval accuracy (Axis 1). Without damping, the toy cycle loops; with \(\lambda=0.3\), the target is reached in 4 hops.

Measured: with damping \(\lambda=0.3\), a branch to a target is reached; with undamped dynamics \(\lambda=1.0\), the traversal loops.

| Config | Reached D | Observed steps to D | Observed issue |
| --- | --- | --- | --- |
| Damped \(\lambda=0.3\) | Yes | 4 | Branch discovered |
| Undamped \(\lambda=1.0\) | No | — | Looping through A→B→C→A |
Edge damping comparison
Figure 6. Edge damping enables branch exploration under repeated traversal.

4.6 Domain Separation and Bridges

Axis: Retrieval accuracy (Axis 1). Only 5 cross-file edges emerge across 2 clusters, limiting irrelevant-domain bleed-through.

Measured: mixed query patterns induce sparse cross-file connectivity. The simulation produces 5 cross-file edges and two major clusters.

| Query Count | Cross-file Edges | Clusters | Representative Cross-edge Count |
| --- | --- | --- | --- |
| 50 | 5 | 2 | 5 |
Cross-file bridge emergence across simulated files
Figure 7. Domain separation with sparse bridges between groups.

4.7 Brain-death Recovery

Axis: Context efficiency (Axis 2). Recovery controls detect dormant-heavy states (>90%) and return the graph to recoverable routing dynamics.

Measured: in recovery conditions with aggressive decay, the autotune diagnostics identify and correct unhealthy dynamics. Dormancy is detected at \(>90\%\) and the system proposes control updates.

| Measure | Initial | After Recovery Sequence |
| --- | --- | --- |
| Dormant % | 100.0% | 97.5% |
| Autotune signal | Absent | Active (decay, Hebbian, promotion) |
| Status | Dormant-dominant | Recoverable and moving |
Health trajectory under brain-death and recovery
Figure 8. Health tracking over recovery rounds.

4.8 Individuation Across Workloads

Axis: Context efficiency (Axis 2). Same workload starts with shared structure; after updates, 27 edges differ by more than 0.05.

Measured: starting from identical graph state, different workloads produce structurally distinct outcomes. The number of edges that differ by more than 0.05 after 50 queries is 27.

| Metric | Graph A | Graph B |
| --- | --- | --- |
| Nodes | 18 | 18 |
| Edges | 306 | 306 |
| Mean abs diff (A vs B) | 0.0358 | |
| Edges with diff > 0.05 (A vs B) | 27 | |
Final graph weight matrices for coding versus safety workloads
Figure 9. Individuation: workload-specific graph topology after 50 updates.

4.9 Simulation Protocol

All reported studies share a single reproducibility convention: fixed random seeds within each script, one output JSON file per script, and, where a figure is reported, one dedicated figure generated from that JSON.

| Result | Script | Claim Output | Figure | Query Budget |
| --- | --- | --- | --- | --- |
| Deploy pipeline | deploy_pipeline.py | deploy_pipeline_results.json | deploy_pipeline.png | 50 |
| Negation | negation.py | negation_results.json | negation.png | 20 |
| Context reduction | context_reduction.py | context_reduction_results.json | context_reduction.png | 100 |
| Selective forgetting | forgetting.py | forgetting_results.json | forgetting.png | 100 |
| Edge damping | edge_damping.py | edge_damping_results.json | edge_damping.png | 1 traversal run |
| Domain separation | domain_separation.py | domain_separation_results.json | domain_separation.png | 50 |
| PG vs Heuristic | pg_vs_heuristic.py | pg_vs_heuristic_results.json | — | 100 |
| Noise robustness | noise_robustness.py | noise_robustness_results.json | — | 100 |
| Static vs learning | static_vs_learning.py | static_vs_learning_results.json | — | 100 |
| Scaling analysis | scaling_analysis.py | scaling_analysis_results.json | — | 50 (per graph size) |
| Brain death recovery | brain_death.py | brain_death_results.json | brain_death.png | 30 rounds |
| Individuation | individuation.py | individuation_results.json | individuation.png | 50 per workload |

Two commands run all 15 simulations and regenerate the figures (simulations first, since figures are built from their JSON outputs):

cd openclawbrain/sims
python run_all.py
python generate_figures.py

Result claims in this page are extracted from the stored JSON records and are intentionally limited to fields that can be traced to deterministic outputs.

4.10 Provenance of the Claim Set

| Claim | Evidence | Evidence Anchor |
| --- | --- | --- |
| Pipeline edges compile to reflex in 50 queries | Final edge map weight 1.0 across four edges | deploy_pipeline_results.json |
| Inhibitory edge reaches -0.940 | Bad edge final weight field | negation_results.json |
| Context drops 30→2.7 fired nodes | Final 10-query average field | context_reduction_results.json |
| 93.3% dormant after forgetting run | Final dormant percentage field | forgetting_results.json |
| Damped branch reached, undamped loops | Boolean reached_D flags | edge_damping_results.json |
| 5 cross-file edges emerge | Final cross-file edge count | domain_separation_results.json |
| Autotune flags dormant failure and suggests recovery | Autotune suggestion list non-empty | brain_death_results.json |
| 27 edges differ by >0.05 | Structural distinctness diff count | individuation_results.json |
| True PG keeps lower overall edge mass than heuristic | 53.68 vs 71.20 total mass | pg_vs_heuristic_results.json |
| Noise robustness collapses near 30% noise | Reflex at 30% is false by Q100 | noise_robustness_results.json |
| Scaling keeps route length stable | 5 fired nodes from 50 to 2000 nodes | scaling_analysis_results.json |

4.11 New Results (Ablations)

Axis: Context efficiency (Axis 2). True PG vs heuristic show 53.68 vs 71.20 total edge weight and 0.3124 vs 0.4226 distractor average.

True PG vs heuristic: both methods reach the target by query 2; true PG has lower total weight and lower distractor edge weight in this run.

| Method | Convergence | Total edge weight | Avg. distractor edge | Comment |
| --- | --- | --- | --- | --- |
| Heuristic PG | Q2 | 71.20 | 0.4226 | Higher mass growth |
| True PG | Q2 | 53.68 | 0.3124 | Less global inflation |

Axis: Retrieval accuracy (Axis 1). Correct path remains reflex through 0%, 10%, and 20% noise; degradation accelerates at 30%.

Noise robustness: degradation is gradual until it becomes sharp at 30%.

| Noise rate | Reflex by Q100 | Avg nodes fired (last 10) | Correct-minus-distractor gap | Inhibitory risk |
| --- | --- | --- | --- | --- |
| 0% | Yes | 5.0 | 0.5650 | none |
| 10% | Yes | 4.0 | 0.4625 | low |
| 20% | Yes | 5.0 | 0.4622 | low |
| 30% | No | 4.0 | 0.3072 | moderate |

Axis: Retrieval accuracy + Context efficiency (Axes 1+2). Static, heuristic, and PG all converge to stable paths; learned methods strengthen path confidence.

Static vs learning baseline: static traversal is stable but does not raise the same route confidence.

| Method | Converges to stable path | Final correct-path weights | Avg fired nodes | Queries until <5 nodes |
| --- | --- | --- | --- | --- |
| Static traversal | Q1 | 0.40, 0.40, 0.40, 0.40 | 5.0 | Not reached in 100 |
| Heuristic learning | Q1 | 1.00, 1.00, 1.00, 1.00 | 5.0 | Not reached in 100 |
| True PG | Q1 | 1.00, 1.00, 1.00, 1.00 | 5.0 | Not reached in 100 |

Axis: Context efficiency (Axis 2). Traversal remains at 5 fired nodes while graph size grows 50 → 2000 and time 0.0962 → 1.8766 ms.

Scaling: fixed hop/beam settings keep route length at 5 while traversal time grows sublinearly.

| Graph size | Avg traversal (ms) | Avg nodes fired |
| --- | --- | --- |
| 50 | 0.0962 | 5 |
| 100 | 0.1432 | 5 |
| 250 | 0.2814 | 5 |
| 500 | 0.5169 | 5 |
| 1000 | 0.9657 | 5 |
| 2000 | 1.8766 | 5 |

Benchmark comparison (expanded ablations)

This benchmark run includes 45 queries and covers baselines and ablations, including static_traverse and no_inhibition.

| Method | Recall@3 | Recall@5 | Precision@3 | MRR | Latency p50 / p95 (ms) |
| --- | --- | --- | --- | --- | --- |
| static_traverse | 0.270 | 0.304 | 0.170 | 0.381 | 46.27 / 47.79 |
| keyword_overlap | 0.289 | 0.515 | 0.163 | 0.295 | 10.74 / 11.11 |
| hash_embed_similarity | 0.270 | 0.433 | 0.170 | 0.428 | 44.81 / 46.51 |
| ocb_traverse | 0.270 | 0.304 | 0.170 | 0.381 | 46.73 / 48.07 |
| ocb_no_inhibition | 0.270 | 0.304 | 0.170 | 0.381 | 46.46 / 47.93 |
| ocb_pg | 0.270 | 0.304 | 0.170 | 0.381 | 45.98 / 47.64 |
| ocb_with_replay | 0.289 | 0.437 | 0.163 | 0.246 | 20.86 / 22.98 |
| ocb_pg_with_replay | 0.289 | 0.437 | 0.163 | 0.246 | 20.77 / 23.23 |
Results are reproducible from the 15 simulation scripts and local benchmark harness in this repository.

4.14 Structural Maintenance Results

OpenClawBrain separates per-query learning from periodic structural maintenance.

Maintenance runs through health → decay → merge → prune and can be scheduled with any workflow.
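A toy version of the slow loop over a `{(src, dst): weight}` edge map, in the documented health, decay, merge, prune order (merge is omitted since it needs embeddings); the thresholds are illustrative, not package defaults.

```python
def run_maintenance(graph, decay=0.9, prune_below=0.05):
    """Slow-loop sketch: health check, then decay, then prune.

    graph: {(src, dst): weight}. Returns a small health report computed
    before weights are modified."""
    # Health: fraction of edges already in the dormant regime (w < 0.2)
    dormant = sum(1 for w in graph.values() if w < 0.2)
    health = {"dormant_frac": dormant / len(graph) if graph else 0.0}
    # Decay: pull every weight toward zero
    for edge in graph:
        graph[edge] *= decay
    # Prune: drop edges whose magnitude fell below the floor
    for edge in [e for e, w in graph.items() if abs(w) < prune_below]:
        del graph[edge]
    return health
```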

All maintained brains use OpenAI text-embedding-3-small (1536-dim).

4.14.1 Merge Compression

Merge compression reduced active nodes from 40 to 32 and fired value from 2.0 to 1.0 (50% reduction).

| Metric | Before | After |
| --- | --- | --- |
| Nodes fired | 40 | 32 |
| Fired value | 2.0 | 1.0 |

4.14.2 Prune Health

Prune-health dropped edges from 870 to 24, while dormant moved from 98% to 29% and traversal still worked after prune.

| Metric | Before | After |
| --- | --- | --- |
| Edges | 870 | 24 |
| Dormant | 98% | 29% |

4.14.3 Full Maintain vs Edge-Only (200 queries)

Full maintenance is 12% cheaper in total context cost than edge-only (1,815 → 1,598).

| Strategy | Total context cost | Cost delta |
| --- | --- | --- |
| Edge-only | 1,815 | baseline |
| Full maintain | 1,598 | -12% |

4.14.4 Production Maintenance (first cycle)

Merged nodes are larger due to context combination, so context chars do not always shrink; routing becomes materially sparser.

| Brain | Nodes | Edges | Pruned | Merged | Learnings |
| --- | --- | --- | --- | --- | --- |
| MAIN | 1,160 | 2,551 | 70 | 2 | 43 |
| PELICAN | 555 | 2,211 | 271 | 1 | 181 |
| BOUNTIFUL | 289 | 1,101 | 450 | 2 | 35 |

Maintenance schedule and configuration are documented in docs/setup-guide.md.

4.12 External Benchmarks: MultiHop-RAG (n=1000)

Axis: Retrieval accuracy (Axis 1). Cold traversal is +7.7% over embedding on full hit (0.6430 vs 0.5660), while learning drops to 0.4770 (-16.6% at 1000).

Corpus: dataset facts and paragraphs from MultiHop-RAG facts. For each query, labels define evidence paragraphs to retrieve. Method set: embedding top-k and three OpenClawBrain variants with cold and online updates.

| Method | Full hit rate | Partial hit rate | Evidence recall@10 | MRR | Nodes fired |
| --- | --- | --- | --- | --- | --- |
| embedding_topk | 0.5660 | 0.9800 | 0.8058 | 0.7942 | 10.000 |
| ocb_cold | 0.6430 | 0.9770 | 0.8224 | 0.7933 | 10.599 |
| ocb_learning | 0.4770 | 0.8530 | 0.6658 | 0.6916 | 9.982 |
| ocb_pg_learning | 0.4070 | 0.7890 | 0.5900 | 0.6534 | 9.871 |

Cold traversal improves immediate evidence recall over embedding-only retrieval. Online updates on a non-repeating stream reduce the final full-hit rate, consistent with the assumption that sustained gains require repeated tasks.

4.13 External Benchmarks: HotPotQA distractor (n=500)

Axis: Retrieval accuracy (Axis 1). Cold traversal improves SP@5 by +33.6% (0.9700 vs 0.6340); learning is roughly neutral over early checks.

Corpus: dataset passages and labeled evidence for distractor suppression evaluation.

| Method | SP recall@5 | SP recall@10 | Single SP hit@5 | Distractor suppression | MRR | Nodes fired |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_topk | 0.6340 | 0.8080 | 0.9760 | 0.8202 | 0.8969 | 10.000 |
| ocb_cold | 0.9700 | 0.9840 | 0.9740 | 0.7623 | 0.8961 | 8.764 |
| ocb_learning | 0.9520 | 0.9840 | 0.9740 | 0.7656 | 0.8961 | 8.948 |
| ocb_pg_learning | 0.8980 | 0.9780 | 0.9740 | 0.7519 | 0.8945 | 8.316 |

Cold traversal is strongest at low and late checkpoints, while learned variants can improve early checkpoints and weaken under longer, diverse query streams.

Axis framing: These external benchmarks are Axis 1 only. They use unique, non-repeating queries, so Axis 2 (repetition-driven context compression) is not testable.

5. Production Deployment

Three production brains run on a Mac Mini M4 Pro, each built from workspace markdown files, session replay, and a learning database of human corrections:

Axis framing: Context efficiency on Axis 2. Startup context moved from 52--66KB to 3--13KB per query through on-demand graph routing after repeated task use.

| Brain | Nodes | Edges | Learning Corrections |
| --- | --- | --- | --- |
| MAIN | 1,160 | 2,551 | 43 |
| PELICAN | 555 | 2,211 | 181 |
| BOUNTIFUL | 289 | 1,101 | 35 |

All 2,004 nodes have real OpenAI text-embedding-3-small embeddings (1536-dim). Existing embeddings are cached and reused across rebuilds — only new nodes require API calls.

Learning corrections (259 active) are injected as first-class graph nodes and connected to relevant workspace nodes via cosine similarity. This means the graph retrieves not just workspace content but how to avoid past mistakes.
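A minimal sketch of correction injection by cosine similarity, assuming plain list vectors; `link_correction`, the 0.35 threshold, and the 0.4 initial edge weight are illustrative choices, not the package's values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def link_correction(correction_vec, node_vecs, threshold=0.35, init_weight=0.4):
    """Connect an injected correction node to every workspace node whose
    embedding clears the similarity threshold; returns {node_id: edge_weight}."""
    return {
        node_id: init_weight
        for node_id, vec in node_vecs.items()
        if cosine(correction_vec, vec) >= threshold
    }
```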

6. How It Compares

| | Plain RAG | OpenClawBrain |
| --- | --- | --- |
| Retrieval | Similarity search | Learned graph traversal |
| Feedback | None | +1/-1 outcomes update edge weights |
| Wrong answers | Keep resurfacing | Inhibitory edges suppress them |
| Over time | Same results | Routes compile to reflex behavior |
| Dependencies | Vector DB | Python core; optional embedding/LLM callbacks |

Positioning against related systems

MemGPT is a runtime memory operating-system stack, while OpenClawBrain is a routing layer: a graph policy for what to retrieve and feed next, not a replacement memory system.

Reflexion: textual reflection over time, whereas OpenClawBrain stores learned routing decisions as numeric edge weights over a graph.

Self-RAG: generation-level self-critique and revision; OpenClawBrain is retrieval-routing-level learning that shapes traversal before generation.

Related family: bandit-style routing and learning-to-rank ideas motivate the same principle in other forms — choose actions with feedback; OpenClawBrain applies it inside a sparse document graph with explicit STOP and edge-level inhibition.

7. Limitations and Conclusion

Limitations

  1. Evidence on this page is limited to the 15 reproducible simulation classes listed above, including ablations and robustness checks.
  2. Routing quality depends on seeding quality: poor embeddings or weak seeders produce weak starting routes, and we do not yet have a formal open benchmark gap-filling procedure for seeding failure modes.
  3. Absolute magnitudes depend on synthetic setup and parameter choices in each simulator.
  4. Autotune and traversal behavior should still be treated as adaptive heuristics for further real-workload calibration.

Conclusion

OpenClawBrain improves retrieval accuracy through graph-structured traversal (+7.7% on MultiHop-RAG, +33.6% on HotPotQA versus embedding-only baselines) and compresses context on repeated tasks through trajectory-level learning (30→2.7 nodes in simulation, 52--66KB→3--13KB in production).

The true REINFORCE policy gradient keeps edge weights bounded (53.68 vs 71.20 total weight) and suppresses distractor edges more effectively (0.3124 vs 0.4226 average distractor weight). Benefits depend on graph structure and repeated-query patterns; with diverse one-off queries, learning can add little or degrade accuracy.

This version is scoped to provable outputs: 15 deterministic simulations and the benchmark harness in this repo, with updated ablations and noise robustness.

References

  1. Gu, J. (2016). Corrected policy-gradient update rules for reinforcement learning. UCLA Econometrics Field Paper. [Background and derivation]
  2. Gu, J. (2016). Corrected policy-gradient update for recurrent action sequences. UCLA Econometrics Field Paper.
  3. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
  4. Collins, A. M. & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407–428.
  5. Graves, A., Wayne, G. & Danihelka, I. (2014). Neural Turing Machines. arXiv:1410.5401.
  6. Graves, A. et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538, 471–476.
  7. Park, J. S. et al. (2023). Generative agents: Interactive simulacra of human behavior. UIST 2023.
  8. Wang, G. et al. (2023). Voyager: An open-ended embodied agent with large language models. arXiv:2305.16291.
  9. Packer, C. et al. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
  10. Shinn, N. et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023.
  11. Yao, S. et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR 2023.
  12. Asai, A. et al. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ICLR 2024.
  13. Sun, J. et al. (2024). Think-on-Graph: Deep and responsible reasoning of large language model on knowledge graph. ICLR 2024.