CrabPath: The Graph is the Prompt
Learned document routing with corrected policy gradients
Abstract
This paper describes CrabPath, a memory system for LLM agents that replaces similarity search with learned document routing. Nodes are document chunks. Edges are weighted pointers. The weights update by the corrected policy-gradient estimator from Gu (2016), which credits the full traversal path rather than only the terminal action. Under simulation, the system reduces per-turn context cost from $0.091 to $0.004 (a 23× reduction) while producing five emergent properties: procedural memory, selective forgetting, domain separation, self-regulation after brain death, and individuation across agents sharing the same source files.
1. The Problem
I ran three autonomous agents on overlapping tasks for twenty days. The bill was $13,000. Most of the cost came from one behavior: the agents reloaded the same workspace files every turn, whether they needed them or not.
This is the default pattern for LLM agents. You have 31 files and 283,098 characters of workspace context. Every time the user asks a question, any question at all, the system loads all of it. It is the equivalent of pulling your entire filing cabinet onto your desk every time you open a drawer.
RAG is the standard fix. Embed the documents, embed the query, return the top-k nearest neighbors. This works when documents are dissimilar. It fails when they are not.
Consider a deployment pipeline with four steps: check CI, inspect manifest, rollback, verify. RAG embeds all four as "deployment-related." It cannot distinguish the first step from the last. It ranks by resemblance, not by usefulness, and has no concept of sequence.
The same failure appears in negation. If your workspace says "run tests before deploy" and also has a deprecated note saying "skip tests for hotfix," RAG returns both as relevant. It has no mechanism to suppress the wrong one.
| Case | Static Context | RAG | CrabPath |
|---|---|---|---|
| Deploy pipeline | Loads everything | Loads similar subset | Loads learned sequence |
| Negation | Ambiguous candidates | Ambiguous candidates | Suppresses invalid path |
The question this paper asks: can a memory system learn which documents actually help, from experience? Not by measuring text similarity, but by tracking which retrieval paths lead to good outcomes and which do not.
2. The Graph
CrabPath is a directed graph. Nodes are document chunks. Edges are weighted pointers with values in \([-1, 1]\). That is the entire data structure.
```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Node:
    id: str
    content: str
    summary: str
    type: Literal["fact", "procedure", "action", "tool_call"]
    pointers: List["Edge"] = field(default_factory=list)

@dataclass
class Edge:
    target: str
    weight: float  # in [-1, 1]
    kind: Literal["support", "inhibit", "follows", "tool"]
    summary: str
```
When a query arrives, an embedding search finds the entry node. From there, a cheap LLM reads the current node's outgoing edges and decides which to follow. It repeats for 2–3 hops. The expensive LLM only reads the nodes that fire.
Edge weights determine what happens at each hop. Three tiers:
| Tier | Weight Range | What Happens | Cost |
|---|---|---|---|
| Reflex | > 0.8 | Auto-follow. No deliberation. | Near zero |
| Habitual | 0.3 – 0.8 | Cheap LLM decides whether to follow. | Low |
| Dormant | < 0.3 | Skipped unless re-entered by seed search. | Zero |
Think of edge weights like paths in a forest. A path nobody walks becomes overgrown and invisible: that is dormant. A path people sometimes take stays clear, but you still have to decide whether to take it: that is habitual. A path so well-worn it has become a paved road, where you turn without thinking: that is reflex. CrabPath starts with all paths in the habitual range. Usage paves the important ones. Neglect lets the rest overgrow. After 100 queries, 95% of edges are overgrown, and the 5% that remain are the paths this particular agent actually needs.
```python
def choose_next(current_node, graph, router, query, visited):
    edges = graph.outgoing(current_node)
    next_nodes = []
    # Reflex tier: auto-follow, zero deliberation
    for edge in edges:
        if edge.weight > 0.8 and edge.target not in visited:
            next_nodes.append(edge.target)
    # Habitual tier: the cheap routing LLM decides
    candidates = [e for e in edges
                  if 0.3 <= e.weight <= 0.8
                  and e.target not in visited]
    if candidates:
        prompt = format_routing_prompt(query, current_node, candidates)
        selected = router.decide(prompt)
        next_nodes.extend(selected)
    # Dormant tier: ignored during live traversal
    return next_nodes
```
The forward pass is a controlled traversal, not wave propagation. At each hop, the LLM reasons about relevance, negation, and weight evidence before choosing candidates.
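As a minimal sketch of this controlled traversal, the hop-by-hop loop might look like the following. The `decide` callback stands in for the cheap routing LLM, and the dict-of-edge-lists graph representation is an assumption for illustration, not the project's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Edge:
    target: str
    weight: float

def traverse(graph: Dict[str, List[Edge]],
             decide: Callable[[str, str, Edge], bool],
             query: str, seed: str, max_hops: int = 3) -> List[str]:
    """Forward pass sketch: reflex edges (> 0.8) auto-follow, habitual
    edges (0.3-0.8) go to the `decide` callback, dormant edges (< 0.3)
    are skipped entirely."""
    visited, frontier, fired = {seed}, [seed], [seed]
    for _ in range(max_hops):
        nxt = []
        for node in frontier:
            for e in graph.get(node, []):
                if e.target in visited:
                    continue
                if e.weight > 0.8 or (e.weight >= 0.3 and decide(query, node, e)):
                    visited.add(e.target)
                    nxt.append(e.target)
        if not nxt:
            break  # traversal stops when no edge fires
        fired.extend(nxt)
        frontier = nxt
    return fired  # only these nodes reach the expensive LLM
```

Only the `fired` list is handed to the expensive model, which is where the context savings come from.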
Negation is explicit. An edge with weight −0.4 does not vanish; it stays visible as negative evidence. The routing LLM sees skip_tests (−0.4) and knows to suppress that path, rather than simply not finding it.
3. How It Learns
The first time you drive to work, you think about every turn. After a year, you do not think at all. CrabPath does the same thing with document routes.
3.1 Policy Gradients over Document Routes
Each query produces an episode: a sequence of nodes visited and a terminal signal. The formulation is a standard MDP.
- State \(s_t\): current node, accumulated context, query.
- Action \(a_t\): follow a pointer, or stop.
- Reward \(z\): +1 for silent success, −1 for explicit correction.
- Policy: softmax over outgoing relevance scores and edge weights.
\[\pi_W(a\mid s=i)=\frac{\exp\bigl((r_{ia}+w_{ia})/\tau\bigr)}{\sum_{j\in\mathcal{N}(i)\cup\{\texttt{STOP}\}}\exp\bigl((r_{ij}+w_{ij})/\tau\bigr)}\]
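The softmax above can be computed directly. In this sketch, `relevance` and `weights` are dicts mapping each outgoing action to \(r_{ij}\) and \(w_{ij}\); including STOP with zero relevance and zero weight is an assumption for illustration:

```python
import math

def policy(relevance, weights, tau=1.0):
    """Softmax over outgoing actions plus STOP: logits are
    (r_ij + w_ij) / tau, matching the pi_W formula."""
    actions = set(relevance) | set(weights) | {"STOP"}
    logits = {a: (relevance.get(a, 0.0) + weights.get(a, 0.0)) / tau
              for a in actions}
    m = max(logits.values())                 # stabilize the exponent
    exp = {a: math.exp(v - m) for a, v in logits.items()}
    z = sum(exp.values())
    return {a: e / z for a, e in exp.items()}
```

Lower temperatures sharpen the distribution toward the highest-scoring edge; higher temperatures keep exploration alive.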
Good outcome: strengthen the edges in the path. Bad outcome: weaken them. This is the standard policy-gradient update.
3.2 Why the Myopic Update Fails
The naive rule updates only the last action before the terminal signal.
\[\Delta W\propto z\nabla_W\log\pi_W(a_t\mid s_t)\]
This is wrong. If a three-hop route fails, the mistake might have been at hop one, not hop three. The myopic update credits (or blames) only the final edge.
The corrected estimator from Gu (2016) sums over the full trajectory:
\[\Delta W=\eta z\sum_{\ell=0}^{T}\nabla_W\log\pi_W(a_\ell\mid s_\ell)\]
With baseline and discount:
\[\Delta W=\eta(z-b)\sum_{\ell=0}^{T}\gamma^\ell\nabla_W\log\pi_W(a_\ell\mid s_\ell)\]
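A sketch of the corrected update, under the assumption that each trajectory step is stored as (state, chosen action, action probabilities from the softmax policy). For a softmax over edge logits, \(\partial \log\pi(a\mid s)/\partial w_{sj} = \mathbb{1}[j=a] - \pi(j\mid s)\), which is what the inner loop applies:

```python
def corrected_update(trajectory, z, weights,
                     eta=0.1, gamma=0.95, baseline=0.0):
    """Full-trajectory REINFORCE-style update: every step contributes
    grad log pi, discounted by gamma**t, not just the terminal action.
    `weights` maps (state, target) -> edge weight; names are assumptions."""
    adv = z - baseline
    for t, (state, action, probs) in enumerate(trajectory):
        scale = eta * adv * (gamma ** t)
        for j, p in probs.items():
            if j == "STOP":
                continue  # STOP has no edge weight to update
            grad = (1.0 if j == action else 0.0) - p
            w = weights.get((state, j), 0.0) + scale * grad
            weights[(state, j)] = max(-1.0, min(1.0, w))  # keep in [-1, 1]
    return weights
```

On a successful episode, every edge actually taken is strengthened while its unchosen siblings are slightly weakened, which is exactly the credit spread the myopic rule misses.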
3.3 Why Not Just Use a Value Network?
AlphaGo solves the long-range credit problem differently: train a separate value network on millions of games, then use it to evaluate any position immediately. Monte Carlo Tree Search simulates thousands of futures from each move. Both approaches sidestep trajectory credit assignment through brute force.
CrabPath cannot afford either. We have no millions of training games; each "game" is one user query. We cannot simulate forward; each "move" is an LLM call. We have only the trajectory and the terminal signal.
The corrected update is the right tool when data is sparse and simulation is expensive. AlphaGo needs a value network because it has one. CrabPath needs the trajectory sum because it has nothing else. Both get long-range credit assignment right. Ours is cheaper.
The mapping from Gu (2016) to CrabPath is direct: only the state space changes.
| Gu (2016) | CrabPath |
|---|---|
| Board states and moves | Document nodes and edge transitions |
| Weights \(\rho\) | Edge weights \(W\) |
| Terminal reward | Query feedback (+1 silent, −1 correction) |
3.4 What Emerges
Structural learning is always on. Hebbian co-firing reinforcement is free and runs by default on every co-activation; it does not call a scorer. It builds and maintains routing structure from co-occurrence alone.
Apply this learning rule to a document graph over hundreds of queries. Five things happen without being explicitly programmed.
Procedural memory. Repeated multi-step tasks create chains. A four-step deployment procedure (check CI → inspect manifest → rollback → verify) starts as disconnected nodes. By query 25, the edges connecting them reach the habitual tier (weight 0.46). By query 50, all three edges are reflex (0.88 → 0.82 → 0.82). The agent no longer deliberates about this sequence. It fires automatically.
Selective forgetting. In a healthy graph, 87–95% of edges are dormant. This is not a failure. It is the graph learning that most possible connections are not useful. A brain that remembers everything is as useless as one that remembers nothing.
The left panel shows bootstrap day: every chunk of every file is connected to its siblings. The graph loads everything because it has no evidence about what matters. The right panel shows the same graph after 100 queries. Ninety-five percent of those connections have faded to dormant. The surviving paths, the ones glowing green, are exactly the connections this agent uses. The graph did not delete anything. The unused edges are still there, dormant, waiting. If a query pattern changes, dormant edges can be reinforced back to life. Forgetting in CrabPath is reversible.
Domain separation with bridges. Nodes from the same file form clusters. A small number of cross-file edges connect them: 2–8 in simulation, depending on workload. The graph self-organizes into specialized domains with sparse bridging, not a homogeneous mesh.
Self-regulation. When external perturbation kills the graph (Section 5 shows the full demo), the autotuner detects the failure state and recovers. This is closed-loop: the system does not need a human to notice the problem.
Individuation. Two agents initialized from the same files, given different query workloads, produce structurally distinct graphs. Same genome, different phenotype. Brain A (coding-heavy queries) develops 8 cross-file edges and 5.3% reflex. Brain B (safety-heavy queries) develops 2 cross-file edges and 6.25% reflex concentrated in safety-related clusters.
Scoring matters. Correctness learning is the optional RL layer. It costs $0.001 per query and updates ranking/weights toward truthful retrieval rather than mere co-occurrence. In the A/B sim, this reduces wrong-path retrievals from 24% to 16%, a 33% reduction.
4. How It Grows
A new CrabPath graph starts empty. It needs to become a working memory. Four mechanisms handle this.
4.1 Bootstrap (Cell Division)
Read the workspace files. A cheap LLM splits each file into chunks by heading or semantic boundary. Each chunk becomes a node. Sibling edges connect chunks from the same file. On the GUCLAW workspace (31 files, 283,098 chars), this produces 278 nodes and 3,822 initial edges.
This is how the graph finds its resolution. A file like TOOLS.md starts as one node. The cheap LLM splits it into coherent sections โ coding rules, browser config, cost tracking, git hygiene. Initially, all sections are connected: loading one loads them all. But as queries arrive, the coding rules and git hygiene sections co-fire while the browser config and cost tracking do not. The unused connections fade. The co-firing sections grow stronger. Eventually, the graph has learned that "codex" and "worktree" belong together, without anyone telling it so.
All edges start at weight 0.27, solidly in the habitual tier. Everything requires deliberation. Nothing is free yet.
Nodes that grow too broad are split recursively: the router generates sub-node candidates, and stable small nodes merge when evidence suggests redundancy. This is LLM-driven and metadata-aware, not random.
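The bootstrap step can be sketched as follows. Here `split_fn` stands in for the cheap LLM chunker (a heading-based splitter would also work), and the node-id scheme is an assumption for illustration:

```python
from typing import Callable, Dict, List, Tuple

def bootstrap(files: Dict[str, str],
              split_fn: Callable[[str], List[str]],
              init_weight: float = 0.27) -> Tuple[dict, list]:
    """Cell division sketch: every chunk becomes a node; chunks of the
    same file get bidirectional sibling edges at the habitual starting
    weight of 0.27."""
    nodes, edges = {}, []
    for path, text in files.items():
        chunk_ids = []
        for i, chunk in enumerate(split_fn(text)):
            nid = f"{path}#{i}"          # hypothetical id scheme
            nodes[nid] = chunk
            chunk_ids.append(nid)
        for a in chunk_ids:              # fully connect siblings
            for b in chunk_ids:
                if a != b:
                    edges.append((a, b, init_weight))
    return nodes, edges
```

Sibling edges are deliberately dense at bootstrap; decay and disuse prune them into the sparse structure described in Section 3.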
4.2 New Edges (Synaptogenesis)
Bootstrap creates only within-file edges. Cross-file structure must emerge from use.
When two nodes are selected in the same query but have no edge between them, a proto-edge forms. If the co-selection repeats above a promotion threshold (minimum 2), the proto-edge promotes to a real edge. Its initial weight is set by the Hebbian increment parameter.
This is how the graph discovers connections that no one explicitly authored. A safety rule and a tool usage guide, selected together repeatedly, eventually form a direct link.
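A minimal sketch of proto-edge tracking, assuming co-selections are reported as sets of fired node ids (class and method names are hypothetical):

```python
from collections import Counter
from itertools import combinations

class ProtoEdges:
    """Synaptogenesis sketch: count co-selections that lack an existing
    edge; once a pair crosses the promotion threshold (minimum 2), emit
    a real edge at the Hebbian increment."""
    def __init__(self, threshold: int = 2, hebbian_increment: float = 0.06):
        self.threshold = threshold
        self.increment = hebbian_increment
        self.counts = Counter()

    def observe(self, fired_nodes, existing_edges):
        promoted = []
        for pair in combinations(sorted(fired_nodes), 2):
            if pair in existing_edges:
                continue
            self.counts[pair] += 1
            if self.counts[pair] >= self.threshold:
                promoted.append((pair[0], pair[1], self.increment))
                del self.counts[pair]   # the pair is now a real edge
        return promoted
```

The increment value 0.06 here is the simulation's starting Hebbian parameter; the autotuner moves it within its guardrail bounds.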
4.3 New Nodes (Neurogenesis)
When a query finds no matching entry node, the system can create a new node from the query content itself. This is rare but necessary: it handles topics that do not exist in the original workspace files.
4.4 Decay (Death)
Edges decay geometrically unless reinforced by traversal. The decay half-life controls the rate: at half-life 80, an edge loses half its weight every 80 queries of disuse.
Without forgetting, the graph bloats. Every proto-edge that ever formed would persist. Every temporary correlation would become permanent structure. Decay is the mechanism that makes the graph converge to a sparse, useful representation instead of a dense, noisy one.
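Geometric decay with a half-life reduces to one multiplication per edge per query. A sketch, with the dict-of-weights storage assumed for illustration:

```python
def decay_weights(edges, half_life: int = 80, reinforced=frozenset()):
    """Per-query decay sketch: every edge not reinforced this query is
    multiplied by 0.5 ** (1 / half_life), so it loses half its weight
    over `half_life` queries of disuse."""
    factor = 0.5 ** (1.0 / half_life)
    return {k: (w if k in reinforced else w * factor)
            for k, w in edges.items()}
```

Applied 80 times at half-life 80, a weight of 0.8 falls to 0.4, crossing from the reflex tier into the habitual tier on its way toward dormancy.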
5. How It Stays Healthy
A learning system without guardrails will destroy itself. The autotuner is the immune system.
5.1 Health Metrics
Eight metrics define what a healthy graph looks like. These targets come from observed operating points in simulation, not from theory.
| Metric | Target Range | Why |
|---|---|---|
| avg_nodes_fired_per_query | 1.0 – 2.0 | Too few hurts coverage. Too many hurts cost. |
| cross_file_edge_pct | 0% – 15% | Specialization with occasional bridges. |
| dormant_pct | 70% – 95% | Suppresses noise. Keeps an active spine. |
| reflex_pct | 0% – 10% | Only stable, high-confidence links auto-fire. |
| context_compression | ≤ 20% | Per-query token read stays bounded. |
| proto_promotion_rate | 0% – 50% | Controls discovery without churn. |
| reconvergence_rate | 0% | Sibling splits should diversify, not collapse. |
| orphan_nodes | 0 | Every node must be reachable. |
5.2 The Autotuner
The autotuner runs after every query. It measures health, diagnoses out-of-range metrics, and applies bounded adjustments to four knobs: decay half-life, promotion threshold, Hebbian increment, and reflex threshold.
```python
health = measure_health(graph, state, query_stats)
adjustments = autotune(health)
changes = filter_with_guardrails(adjustments)
apply_adjustments(graph, changes)
```
Every candidate change hits a safety filter before application:
- Decay half-life: bounded to [20, 200].
- Promotion threshold: integer, never below 2.
- Hebbian increment: capped at 0.12.
- Reflex threshold: bounded to [0.8, 0.95].
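A sketch of that safety filter, with the bounds table mirroring the limits listed above (the knob names and `None`-for-unbounded convention are assumptions):

```python
# Hypothetical bounds table; None means unbounded on that side.
BOUNDS = {
    "decay_half_life": (20, 200),
    "promotion_threshold": (2, None),   # integer, never below 2
    "hebbian_increment": (None, 0.12),
    "reflex_threshold": (0.8, 0.95),
}

def filter_with_guardrails(adjustments):
    """Clamp each proposed knob change into its allowed range and drop
    any knob the filter does not recognize."""
    safe = {}
    for knob, value in adjustments.items():
        if knob not in BOUNDS:
            continue
        lo, hi = BOUNDS[knob]
        if knob == "promotion_threshold":
            value = int(round(value))
        if lo is not None:
            value = max(lo, value)
        if hi is not None:
            value = min(hi, value)
        safe[knob] = value
    return safe
```

Clamping rather than rejecting keeps the autotuner moving in its proposed direction while bounding the blast radius of any single bad diagnosis.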
5.3 Meta-Learning
The autotuner keeps a history of every (metric, knob, direction) triple and whether it improved health. Successful triples are preferred in future cycles. Repeated failures are down-ranked.
This is a simple form of meta-learning: the controller learns which fixes work for this particular graph and workload.
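A minimal sketch of that history, assuming triples are plain tuples and using Laplace-smoothed success rates so untried fixes remain viable (class and method names are hypothetical):

```python
from collections import defaultdict

class TuningHistory:
    """Meta-learning sketch: record whether each (metric, knob,
    direction) triple improved health, then rank candidate fixes by
    empirical success rate."""
    def __init__(self):
        self.stats = defaultdict(lambda: [0, 0])  # triple -> [wins, tries]

    def record(self, triple, improved: bool):
        wins_tries = self.stats[triple]
        if improved:
            wins_tries[0] += 1
        wins_tries[1] += 1

    def rank(self, candidates):
        def score(t):
            wins, tries = self.stats[t]
            return (wins + 1) / (tries + 2)   # Laplace smoothing
        return sorted(candidates, key=score, reverse=True)
```

Repeated failures sink to the bottom of the ranking without ever being hard-banned, so a fix that stops working can still be retried if nothing better exists.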
5.4 Brain Death and Recovery
Set decay half-life to 20, far too aggressive. Here is what happens.
| Checkpoint | Dormant | Healthy Metrics | Status | Config |
|---|---|---|---|---|
| Q25 | 86.7% | 4 / 8 in range | Stable | τd=20, h=0.06 |
| Q50 | 100% | 3 / 8 in range | Brain dead | τd=20, h=0.06 |
| Q75 | 93.3% | 4 / 8 in range | Recovery underway | τd=162, h=0.12 |
| Q100 | 86.7% | 4 / 8 in range | Stabilized | τd=200, h=0.12 |
At Q50, every edge is dormant. The graph is dead: it cannot route anything. No human intervenes.
Every 25 queries, the autotuner runs a health check. It measures eight metrics: how many nodes fire per query, what percentage of edges are dormant, whether cross-file connections are forming, and so on. If any metric falls outside its empirically calibrated range, the autotuner adjusts: increase decay, lower the promotion threshold, strengthen Hebbian reinforcement. It tracks which adjustments worked and which made things worse. Over time, it learns to tune this particular graph for this particular usage pattern. If something goes badly wrong, say decay so aggressive that the brain is dying, the emergency brake reverts all changes to the last known good configuration.
The autotuner detects the failure: dormant percentage exceeds range, nodes fired too low, reflex at zero. It begins increasing decay half-life (20 → 30 → 40 → 53 → ...) and Hebbian increment (0.06 → 0.07 → 0.08 → ...) at every tuning cycle. From Q50 to Q100 (50 queries), decay half-life rises to 200; edges recover by Q75 and stabilize by Q100 at τd=200 with dormant percentage back in range.
5.5 Migration
When an agent migrates to a new machine or upgrades its workspace files, the bootstrap runs fresh: read files, split into chunks, create sibling edges. If historical session logs exist, they replay through the graph to warm up edge weights before the system goes live.
Replay is memory distillation. It accelerates early structure formation and creates cross-file links before real feedback arrives.
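Replay can be sketched as Hebbian reinforcement driven by historical co-firings instead of live queries. Here `sessions` is assumed to be a list of fired-node sets recovered from old logs, and the increment and cap are illustrative parameters:

```python
def replay_warmup(sessions, edges, increment=0.06, cap=1.0):
    """Migration sketch: each historical session reinforces the edges
    between nodes that co-fired, warming weights before live traffic."""
    for fired in sessions:
        for a in fired:
            for b in fired:
                if a != b and (a, b) in edges:
                    edges[(a, b)] = min(cap, edges[(a, b)] + increment)
    return edges
```

Edges never seen in the logs keep their bootstrap weight, so replay only accelerates structure the old agent actually used.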
6. Does It Work
6.1 Procedural Memory Compilation
A simulated agent receives 50 queries about deployment, interleaved with unrelated queries. The deployment procedure has four steps: check CI → inspect manifest → rollback → verify.
| Query | Chain State | Edge Weights | Notes |
|---|---|---|---|
| Q1 | No path | All edges missing | No stable pointer evidence yet. |
| Q25 | Habitual chain | 0.46 → 0.46 → 0.46 | Three habitual edges form a route. |
| Q50 | Reflex chain | 0.88 → 0.82 → 0.82 | Full route auto-fires. Zero deliberation. |
No distractors appear on this chain. The sequence self-organizes from repeated successful completions under sparse corrective signals. Zero distractor weight at all checkpoints.
6.2 Twin Brains
Two graphs initialized from the same 3 files (11 nodes, 30 edges). Brain A receives 100 coding-heavy queries. Brain B receives 100 safety-heavy queries.
| Brain | Dormant | Cross-file Edges | Reflex | Phenotype |
|---|---|---|---|---|
| Brain A (coding) | 73.7% | 8 (21.1%) | 5.3% | Tooling-heavy reflexes, broad cross-domain links |
| Brain B (safety) | 87.5% | 2 (6.25%) | 6.25% | Safety-oriented clusters, sparse bridging |
Same source files. Different structure. Brain A has 4× more cross-file connections because coding queries require jumping between tools, identity, and safety context. Brain B concentrates its reflex paths in the safety domain, where credentials and destructive action rules co-fire repeatedly.
Every agent that runs CrabPath develops a unique cognitive structure.
6.3 Hebbian Alone vs Hebbian + RL Scoring
Does outcome-based learning matter, or is structural co-firing sufficient? We ran the same 50 queries on the same ambiguous deployment corpus twice: once with Hebbian reinforcement only, once with Hebbian plus cheap-LLM scoring at $0.001 per query.
The corpus was deliberately adversarial: 5 correct deployment procedure nodes, 5 wrong/outdated procedure nodes, and 5 hotfix nodes โ all using similar vocabulary. A keyword-based router with 30% error rate simulated real routing confusion.
| Metric | Hebbian Only | Hebbian + RL |
|---|---|---|
| Wrong-path hit rate | 24% | 16% (-33%) |
| Reflex edges | 0 | 4 |
| Inhibitory edges | 0 | 21 |
| Health score | 3/8 | 5/8 |
| All edges dormant | 100% | 83% |
Hebbian alone produces a graph where everything is dormant: it learned what fires together but has no mechanism to distinguish correct from incorrect. With RL scoring, the graph compiled 4 correct procedures into reflex paths and created 21 inhibitory edges suppressing the wrong procedures. The cheap LLM scorer costs $0.001 per query. For that price, the graph learns not just structure, but truth.
6.4 Shadow Mode in Production
CrabPath runs in shadow mode on a live agent (GUCLAW) processing real conversations. After 40 queries with GPT-5-mini scoring at $0.001 per query, we observe:
| Signal | Fires | Edge Changes |
|---|---|---|
| Hebbian co-fire | 39/40 (98%) | 672 reinforcements |
| Proto-edge promotion | – | 122 new edges |
| Proto-edge creation | – | 1,550 exploring |
| RL scoring (GOOD) | 8/40 (20%) | ~40 updates |
| RL scoring (SKIP) | 32/40 (80%) | 0 (below gate) |
Hebbian learning dominates early graph life: 16× more edge changes than RL scoring. This is expected. The graph must first build structure (what fires together) before it can judge quality (what is actually correct). RL scoring fires on 20% of queries where retrieval relevance exceeds the 0.3 helpfulness gate. The remaining 80% receive no RL update; the graph learns from co-firing alone, and that is sufficient for structural formation.
We are honest about this tradeoff: the RL signal is sparse in a young graph. The A/B comparison above shows that when RL does fire, it matters: wrong-path retrieval drops 33%. As the graph matures and retrieval improves, we expect the scoring rate to increase. The Hebbian base layer ensures the graph never stops learning, even when RL is silent.
6.5 Scenario Comparison
Four routing configurations evaluated on the full GUCLAW simulation (278 nodes, 3,822 edges):
| Scenario | Queries | Edges | Cross-file | Proto | Ref / Hab / Dorm | Decay τ |
|---|---|---|---|---|---|---|
| Baseline | 153 | 3,822 | 82 | 100 | 13 / 51 / 3,758 | 80 |
| Aggressive Decay | 160 | 3,790 | 54 | 77 | 1 / 29 / 3,760 | 40 |
| Conservative Decay | 160 | 3,853 | 111 | 152 | 24 / 3,775 / 54 | 200 |
| High Traffic | 500 | 3,861 | 146 | 76 | 16 / 66 / 3,779 | 80 |
This histogram tells the whole story in one image. The massive spike near zero is 3,800 sibling edges that decayed because the agent never needed them together. The small bump in the habitual range is 170 actively useful connections. The tiny green spike at the right is 25 reflex edges: procedures so well-learned that the router skips them automatically. This is what a healthy brain looks like: mostly quiet, selectively active, with a few compiled reflexes.
Aggressive decay (τ=40) collapses structure: reflex edges drop to 1 and cross-file links fall 34%. Conservative decay (τ=200) shifts 98% of edges into the habitual tier, where everything requires deliberation and nothing is free. High traffic improves cross-file linking through co-firing frequency: 146 cross-file edges vs 82 at baseline.
6.6 Cost Reduction
| Experiment | Static (tokens) | RAG (tokens) | CrabPath (tokens) | Reduction |
|---|---|---|---|---|
| Context Bloat | 6,066 | 744 | 297 | 95% |
| Gate Bloat | 8,163 | 407 | 89 | 99% |
| Stale Context | 895 | 507 | 88 | 90% |
| Procedure | 548 | 467 | 205 | 63% |
7. Limitations and Conclusion
Limitations
CrabPath has not been deployed in production. The results here come from simulation. Long-horizon stability under real workloads is still an open question.
Fresh graphs are expensive: 100% habitual at bootstrap means every edge requires deliberation until the system has enough feedback to learn. The cold-start window depends on workload diversity and query volume.
Credit assignment remains hard. The corrected estimator from Gu (2016) is better than myopic, but discount factor selection and baseline estimation are still sensitive. The cheap routing model's convergence needs caveats: it is not a general-purpose optimizer, and pathological workloads could produce degenerate policies.
Health metric target ranges were calibrated on one workspace. They will shift with corpus size and workload distribution. Multi-user credit isolation does not exist yet.
Conclusion
We have five main findings:
- Context load falls 63–99% versus static retrieval, with estimated per-turn cost dropping from $0.091 to $0.004.
- Corrected policy gradients propagate signal to every routing decision in the trajectory, not just the terminal one.
- Procedural memory self-compiles from repeated behavior: a four-step deployment chain reaches reflex tier within 50 queries.
- The autotuner keeps the graph bounded and recovers from brain death without human intervention.
- Agents sharing the same source files develop structurally distinct graphs under different workloads.
Code: github.com/jonathangu/crabpath.
References
- Gu, J. (2016). Corrected policy-gradient update for recurrent action sequences. UCLA Econometrics Field Paper.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
- Collins, A. M. & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407–428.
- Graves, A., Wayne, G. & Danihelka, I. (2014). Neural Turing Machines. arXiv:1410.5401.
- Graves, A. et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538, 471–476.
- Weston, J., Chopra, S. & Bordes, A. (2015). Memory Networks. ICLR 2015.
- Park, J. S. et al. (2023). Generative agents: Interactive simulacra of human behavior. UIST 2023.
- Wang, G. et al. (2023). Voyager: An open-ended embodied agent with large language models. arXiv:2305.16291.
- Packer, C. et al. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
- Shinn, N. et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023.
- Yao, S. et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR 2023.
- Asai, A. et al. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ICLR 2024.
- Edge, D. et al. (2024). From local to global: A graph RAG approach to query-focused summarization. arXiv:2404.16130.
- Sun, J. et al. (2024). Think-on-Graph: Deep and responsible reasoning of large language model on knowledge graph. ICLR 2024.
- Schulman, J. et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
- Rafailov, R. et al. (2023). Direct preference optimization: Your language model is secretly a reward model. NeurIPS 2023.