โ† jonathangu.com

CrabPath: The Graph is the Prompt

Learned document routing with corrected policy gradients

Jonathan Gu · jonathangu@gmail.com

GitHub · PDF · LaTeX · v3.0

February 2026
A brain-shaped neural network of document nodes and weighted edges: green reflex paths, blue habitual, gray dormant

Abstract

This paper describes CrabPath, a memory system for LLM agents that replaces similarity search with learned document routing. Nodes are document chunks. Edges are weighted pointers. The weights update by the corrected policy-gradient estimator from Gu (2016), which credits the full traversal path rather than only the terminal action. Under simulation, the system reduces per-turn context cost from $0.091 to $0.004 (a 23× reduction) while producing five emergent properties: procedural memory, selective forgetting, domain separation, self-regulation after brain death, and individuation across agents sharing the same source files.

1. The Problem

I ran three autonomous agents on overlapping tasks for twenty days. The bill was $13,000. Most of the cost came from one behavior: the agents reloaded the same workspace files every turn, whether they needed them or not.

This is the default pattern for LLM agents. You have 31 files and 283,098 characters of workspace context. Every time the user asks a question, any question at all, the system loads all of it. It is the equivalent of pulling your entire filing cabinet onto your desk every time you open a drawer.

RAG is the standard fix. Embed the documents, embed the query, return the top-k nearest neighbors. This works when documents are dissimilar. It fails when they are not.

Consider a deployment pipeline with four steps: check CI, inspect manifest, rollback, verify. RAG embeds all four as "deployment-related." It cannot distinguish the first step from the last. It ranks by resemblance, not by usefulness, and has no concept of sequence.

The same failure appears in negation. If your workspace says "run tests before deploy" and also has a deprecated note saying "skip tests for hotfix," RAG returns both as relevant. It has no mechanism to suppress the wrong one.

Left: drowning in documents. Right: calmly following a learned path.
The problem and the solution. Left: every query loads everything. Right: a learned route to exactly what you need.
Case            | Static Context       | RAG                  | CrabPath
Deploy pipeline | Loads everything     | Loads similar subset | Loads learned sequence
Negation        | Ambiguous candidates | Ambiguous candidates | Suppresses invalid path

The question this paper asks: can a memory system learn which documents actually help, from experience? Not by measuring text similarity, but by tracking which retrieval paths lead to good outcomes and which do not.

2. The Graph

CrabPath is a directed graph. Nodes are document chunks. Edges are weighted pointers with values in \([-1, 1]\). That is the entire data structure.

from dataclasses import dataclass, field
from typing import Literal, List

@dataclass
class Node:
    id: str
    content: str
    summary: str
    type: Literal["fact", "procedure", "action", "tool_call"]
    pointers: List["Edge"] = field(default_factory=list)

@dataclass
class Edge:
    target: str
    weight: float      # in [-1, 1]
    kind: Literal["support", "inhibit", "follows", "tool"]
    summary: str

When a query arrives, an embedding search finds the entry node. From there, a cheap LLM reads the current node's outgoing edges and decides which to follow. It repeats for 2–3 hops. The expensive LLM reads only the nodes that fire.

Vertical flowchart: Query → Embedding Search → Cheap Router → Graph Traversal → Context Assembly → Expensive LLM
Figure 1. Inference architecture: query seeding, cheap routing, tiered graph traversal, and expensive-context read.

Edge weights determine what happens at each hop. Three tiers:

Tier     | Weight Range | What Happens                               | Cost
Reflex   | > 0.8        | Auto-follow. No deliberation.              | Near zero
Habitual | 0.3 – 0.8    | Cheap LLM decides whether to follow.       | Low
Dormant  | < 0.3        | Skipped unless re-entered by seed search.  | Zero
Three edge tiers: dormant gray, habitual blue, reflex green
Figure 2. Edge tiers: weight determines visibility and behavior.

Think of edge weights like paths in a forest. A path nobody walks becomes overgrown and invisible: that is dormant. A path people sometimes take stays clear, but you still have to decide whether to take it: that is habitual. A path so well-worn it has become a paved road, where you turn without thinking: that is reflex. CrabPath starts with all paths in the habitual range. Usage paves the important ones. Neglect lets the rest overgrow. After 100 queries, 95% of edges are overgrown, and the 5% that remain are the paths this particular agent actually needs.

def choose_next(current_node, graph, router, query, visited):
    edges = graph.outgoing(current_node)
    next_nodes = []

    # Reflex: zero deliberation
    for edge in edges:
        if edge.weight > 0.8 and edge.target not in visited:
            next_nodes.append(edge.target)

    # Habitual: LLM decides
    candidates = [e for e in edges
                  if 0.3 <= e.weight <= 0.8
                  and e.target not in visited]
    if candidates:
        prompt = format_routing_prompt(query, current_node, candidates)
        selected = router.decide(prompt)
        next_nodes.extend(selected)

    # Dormant: ignored during live traversal
    return next_nodes

The forward pass is a controlled traversal, not wave propagation. At each hop, the LLM reasons about relevance, negation, and weight evidence before choosing candidates.
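The hop loop around this routing step can be sketched as a small driver. The dict-based graph and the `decide` callback below are illustrative stand-ins for the real graph and router interfaces, not the shipped API.

```python
def traverse(graph, decide, seed, max_hops=3):
    """Tiered traversal sketch. `graph` maps node id -> [(target, weight)];
    `decide` is the cheap-router callback consulted for habitual edges.
    Reflex edges (> 0.8) auto-follow; dormant edges (< 0.3) are skipped."""
    visited = {seed}
    frontier = [seed]
    for _ in range(max_hops):
        nxt = []
        for node in frontier:
            for target, weight in graph.get(node, []):
                if target in visited:
                    continue
                if weight > 0.8 or (weight >= 0.3 and decide(node, target)):
                    visited.add(target)
                    nxt.append(target)
        if not nxt:
            break
        frontier = nxt
    return visited  # node ids whose content the expensive LLM will read

# Toy graph: a reflex chain plus a dormant distractor that never fires.
g = {"ci": [("manifest", 0.9), ("notes", 0.1)],
     "manifest": [("rollback", 0.85)],
     "rollback": [("verify", 0.82)]}
fired = traverse(g, decide=lambda src, dst: False, seed="ci")
```

With the router declining every habitual edge, only the reflex chain fires; the dormant `notes` node is never loaded.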

Negation is explicit. An edge with weight -0.4 does not vanish; it stays visible as negative evidence. The routing LLM sees skip_tests (-0.4) and knows to suppress that path, rather than simply not finding it.

Instead of the expensive model reading everything, a cheap model figures out what is worth reading. Estimated turn cost drops from $0.091 to $0.004, a 23× reduction.

3. How It Learns

The first time you drive to work, you think about every turn. After a year, you do not think at all. CrabPath does the same thing with document routes.

3.1 Policy Gradients over Document Routes

Each query produces an episode: a sequence of nodes visited and a terminal signal. The formulation is a standard MDP.

\[\pi_W(a\mid s=i)=\frac{\exp\bigl((r_{ia}+w_{ia})/\tau\bigr)}{\sum_{j\in\mathcal{N}(i)\cup\{\texttt{STOP}\}}\exp\bigl((r_{ij}+w_{ij})/\tau\bigr)}\]
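The routing distribution above can be sketched numerically. The logit values (retrieval score r plus edge weight w) below are made up for illustration.

```python
import math

def policy(scores, tau=1.0):
    """Softmax routing distribution over a node's neighbors plus STOP.
    `scores` maps each candidate action to its logit (r + w)."""
    exps = {a: math.exp(s / tau) for a, s in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

# Hypothetical logits at one node; lower temperature sharpens the choice.
probs = policy({"manifest": 0.9, "notes": 0.1, "STOP": 0.0}, tau=0.5)
```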

Good outcome: strengthen the edges in the path. Bad outcome: weaken them. This is the standard policy-gradient update.

3.2 Why the Myopic Update Fails

The naive rule updates only the last action before the terminal signal.

\[\Delta W\propto z\nabla_W\log\pi_W(a_t\mid s_t)\]

This is wrong. If a three-hop route fails, the mistake might have been at hop one, not hop three. The myopic update credits (or blames) only the final edge.

The corrected estimator from Gu (2016) sums over the full trajectory:

\[\Delta W=\eta z\sum_{\ell=0}^{T}\nabla_W\log\pi_W(a_\ell\mid s_\ell)\]

With baseline and discount:

\[\Delta W=\eta(z-b)\sum_{\ell=0}^{T}\gamma^\ell\nabla_W\log\pi_W(a_\ell\mid s_\ell)\]
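A tabular sketch of the corrected update for the softmax policy above. The trajectory format is an assumption of this sketch, and skipping a weight update for STOP (which has a logit but no stored edge) is a simplification, not the reference implementation.

```python
import math

def softmax(scores, tau=1.0):
    exps = {a: math.exp(s / tau) for a, s in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def corrected_update(w, trajectory, z, eta=0.1, gamma=0.95, baseline=0.0, tau=1.0):
    """Credit the terminal signal z to every hop, not just the last one.
    For a softmax policy, d log pi(a|s) / d logit(s, j) = 1[j == a] - pi(j|s)."""
    for step, (state, action, logits) in enumerate(trajectory):
        pi = softmax(logits, tau)
        scale = eta * (z - baseline) * gamma ** step
        for j, p in pi.items():
            if j == "STOP":
                continue  # simplification: STOP carries no edge weight
            grad = (1.0 if j == action else 0.0) - p
            w[(state, j)] = w.get((state, j), 0.0) + scale * grad
    return w

# A failed 2-hop route: every hop in the path is blamed, discounted by gamma.
traj = [("ci", "manifest", {"manifest": 0.5, "notes": 0.5, "STOP": 0.0}),
        ("manifest", "rollback", {"rollback": 0.4, "STOP": 0.0})]
w = corrected_update({}, traj, z=-1.0)
```

Both traversed edges are weakened, while the untaken alternative at the first hop gains a little probability mass, exactly the behavior the myopic update cannot produce for hop one.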

3.3 Why Not Just Use a Value Network?

AlphaGo solves the long-range credit problem differently: train a separate value network on millions of games, then use it to evaluate any position immediately. Monte Carlo Tree Search simulates thousands of futures from each move. Both approaches sidestep trajectory credit assignment through brute force.

CrabPath cannot afford either. We have no millions of training games; each "game" is one user query. We cannot simulate forward; each "move" is an LLM call. We have only the trajectory and the terminal signal.

The corrected update is the right tool when data is sparse and simulation is expensive. AlphaGo can afford a value network because it has millions of games to train one; CrabPath relies on the trajectory sum because it has nothing else. Both get long-range credit assignment right. Ours is cheaper.

The mapping from Gu (2016) to CrabPath is direct; only the state space changes.

Gu (2016)              | CrabPath
Board states and moves | Document nodes and edge transitions
Weights \(\rho\)       | Edge weights \(W\)
Terminal reward        | Query feedback (+1 silent, -1 correction)

3.4 What Emerges

Structural learning is always on. Hebbian co-firing reinforcement is free and runs by default on every co-activation; it does not call a scorer. It builds and maintains routing structure from co-occurrence alone.
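A minimal sketch of this structural rule, assuming the Hebbian increment defaults to 0.06 (the value reported in Section 5) and weights stay clipped to [-1, 1]:

```python
def hebbian_reinforce(weights, fired, increment=0.06):
    """Strengthen every existing edge whose endpoints co-fired this query.
    `weights` maps (src, dst) -> weight; `fired` is the set of node ids
    activated by the query. No scorer is consulted."""
    for (src, dst), w in weights.items():
        if src in fired and dst in fired:
            weights[(src, dst)] = min(1.0, w + increment)
    return weights

weights = {("coding", "git"): 0.27, ("coding", "browser"): 0.27}
weights = hebbian_reinforce(weights, fired={"coding", "git"})
```

Only the co-activated pair is reinforced; the edge to the node that did not fire is left for decay to handle.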

Apply this learning rule to a document graph over hundreds of queries. Five things happen without being explicitly programmed.

Procedural memory. Repeated multi-step tasks create chains. A four-step deployment procedure (check CI → inspect manifest → rollback → verify) starts as disconnected nodes. By query 25, the edges connecting them reach the habitual tier (weight 0.46). By query 50, all three edges are reflex (0.88 → 0.82 → 0.82). The agent no longer deliberates about this sequence. It fires automatically.

Three stages: Q1 no path, Q25 habitual chain, Q50 reflex chain
Figure 3. A four-step deployment procedure compiles from missing โ†’ habitual โ†’ reflex over 50 queries.
Three-panel visualization showing graph evolution: Q1 isolated nodes, Q25 habitual connections, Q50 reflex chain
Figure 4. The same compilation visualized: scattered nodes โ†’ tentative connections โ†’ compiled reflex path.

Selective forgetting. In a healthy graph, 87–95% of edges are dormant. This is not a failure. It is the graph learning that most possible connections are not useful. A brain that remembers everything is not intelligent; it is cluttered.

Before: tangled web of connections. After: clean structure with only important paths remaining
Figure 5. Day 1: everything connected. Day 100: the brain learned what matters.

A brain that remembers everything is as useless as one that remembers nothing. The left panel shows bootstrap day: every chunk of every file is connected to its siblings. The graph loads everything because it has no evidence about what matters. The right panel shows the same graph after 100 queries. Ninety-five percent of those connections have faded to dormant. The surviving paths, the ones glowing green, are exactly the connections this agent uses. The graph did not delete anything. The unused edges are still there, dormant, waiting. If a query pattern changes, dormant edges can be reinforced back to life. Forgetting in CrabPath is reversible.

Domain separation with bridges. Nodes from the same file form clusters. A small number of cross-file edges connect them: 2–8 in simulation, depending on workload. The graph self-organizes into specialized domains with sparse bridging, not a homogeneous mesh.

Self-regulation. When external perturbation kills the graph (Section 5 shows the full demo), the autotuner detects the failure state and recovers. This is closed-loop: the system does not need a human to notice the problem.

Individuation. Two agents initialized from the same files, given different query workloads, produce structurally distinct graphs. Same genome, different phenotype. Brain A (coding-heavy queries) develops 8 cross-file edges and 5.3% reflex. Brain B (safety-heavy queries) develops 2 cross-file edges and 6.25% reflex concentrated in safety-related clusters.

Scoring matters. Correctness learning is the optional RL layer. It costs $0.001 per query and updates ranking/weights toward truthful retrieval rather than mere co-occurrence. In the A/B sim, this reduces wrong-path retrievals from 24% to 16%, a 33% reduction.

4. How It Grows

A new CrabPath graph starts empty. It needs to become a working memory. Four mechanisms handle this.

4.1 Bootstrap (Cell Division)

Read the workspace files. A cheap LLM splits each file into chunks by heading or semantic boundary. Each chunk becomes a node. Sibling edges connect chunks from the same file. On the GUCLAW workspace (31 files, 283,098 chars), this produces 278 nodes and 3,822 initial edges.

Document node splitting into smaller specialized chunks over time
Recursive cell division. A file splits into coherent chunks. Over time, some connections fade while others strengthen. Chunks that outgrow their role split again.

This is how the graph finds its resolution. A file like TOOLS.md starts as one node. The cheap LLM splits it into coherent sections: coding rules, browser config, cost tracking, git hygiene. Initially, all sections are connected: loading one loads them all. But as queries arrive, the coding rules and git hygiene sections co-fire while the browser config and cost tracking do not. The unused connections fade. The co-firing sections grow stronger. Eventually, the graph has learned that "codex" and "worktree" belong together, without anyone telling it so.

All edges start at weight 0.27, solidly in the habitual tier. Everything requires deliberation. Nothing is free yet.
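The sibling wiring can be sketched as connecting every pair of chunks from the same file at the habitual starting weight. Treating "sibling" as every ordered within-file pair is an assumption here, chosen because it is consistent with the reported edge counts (278 nodes, 3,822 edges).

```python
from itertools import permutations

def bootstrap_edges(chunks_by_file, init_weight=0.27):
    """Connect every ordered pair of chunks from the same file at the
    habitual starting weight. No cross-file edges exist at bootstrap."""
    weights = {}
    for chunks in chunks_by_file.values():
        for a, b in permutations(chunks, 2):
            weights[(a, b)] = init_weight
    return weights

# Three chunks of one file yield six directed sibling edges.
edges = bootstrap_edges({"TOOLS.md": ["tools:1", "tools:2", "tools:3"]})
```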

Nodes that grow too broad are split recursively: the router generates sub-node candidates, and stable small nodes merge when evidence suggests redundancy. This is LLM-driven and metadata-aware, not random.

4.2 New Edges (Synaptogenesis)

Bootstrap creates only within-file edges. Cross-file structure must emerge from use.

When two nodes are selected in the same query but have no edge between them, a proto-edge forms. If the co-selection repeats above a promotion threshold (minimum 2), the proto-edge promotes to a real edge. Its initial weight is set by the Hebbian increment parameter.

This is how the graph discovers connections that no one explicitly authored. A safety rule and a tool usage guide, selected together repeatedly, eventually form a direct link.
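A minimal sketch of proto-edge promotion, assuming the minimum promotion threshold of 2 and the Hebbian increment (0.06) as the new edge's initial weight, per the description above:

```python
def observe_coselection(protos, weights, pair, promote_at=2, init_weight=0.06):
    """Count co-selections of an unlinked node pair; once the count reaches
    the promotion threshold, the proto-edge becomes a real edge."""
    if pair in weights:
        return  # already a real edge; Hebbian reinforcement handles it
    protos[pair] = protos.get(pair, 0) + 1
    if protos[pair] >= promote_at:
        del protos[pair]
        weights[pair] = init_weight  # Hebbian increment seeds the new edge

protos, weights = {}, {}
observe_coselection(protos, weights, ("safety:1", "tools:2"))
observe_coselection(protos, weights, ("safety:1", "tools:2"))
```

After the second co-selection, the proto-edge graduates into the weight table and starts competing like any other edge.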

4.3 New Nodes (Neurogenesis)

When a query finds no matching entry node, the system can create a new node from the query content itself. This is rare but necessary: it handles topics that do not exist in the original workspace files.

4.4 Decay (Death)

Edges decay geometrically unless reinforced by traversal. The decay half-life controls the rate: at half-life 80, an edge loses half its weight every 80 queries of disuse.
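The half-life rule is a geometric multiplier per query of disuse; a sketch:

```python
def decay(weights, queries_elapsed=1, half_life=80):
    """Geometric decay toward zero: an untraversed edge loses half its
    weight every `half_life` queries."""
    factor = 0.5 ** (queries_elapsed / half_life)
    return {edge: w * factor for edge, w in weights.items()}

# After one full half-life of disuse, a 0.8 edge sits at 0.4 and a 0.1
# edge has sunk deeper into the dormant tier.
weights = decay({("a", "b"): 0.8, ("a", "c"): 0.1}, queries_elapsed=80)
```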

Without forgetting, the graph bloats. Every proto-edge that ever formed would persist. Every temporary correlation would become permanent structure. Decay is the mechanism that makes the graph converge to a sparse, useful representation instead of a dense, noisy one.

The graph starts as a flat copy of your files and self-organizes into something uniquely shaped by your usage. No manual curation. No taxonomy. Just feedback.

5. How It Stays Healthy

A learning system without guardrails will destroy itself. The autotuner is the immune system.

5.1 Health Metrics

Eight metrics define what a healthy graph looks like. These targets come from observed operating points in simulation, not from theory.

Metric                    | Target Range | Why
avg_nodes_fired_per_query | 1.0 – 2.0    | Too few hurts coverage. Too many hurts cost.
cross_file_edge_pct       | 0% – 15%     | Specialization with occasional bridges.
dormant_pct               | 70% – 95%    | Suppresses noise. Keeps an active spine.
reflex_pct                | 0% – 10%     | Only stable, high-confidence links auto-fire.
context_compression       | ≤ 20%        | Per-query token read stays bounded.
proto_promotion_rate      | 0% – 50%     | Controls discovery without churn.
reconvergence_rate        | 0%           | Sibling splits should diversify, not collapse.
orphan_nodes              | 0            | Every node must be reachable.
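A health check is just range membership. A sketch over four of the eight metrics, with target ranges copied from the table above:

```python
TARGETS = {"avg_nodes_fired": (1.0, 2.0),
           "cross_file_edge_pct": (0.0, 15.0),
           "dormant_pct": (70.0, 95.0),
           "reflex_pct": (0.0, 10.0)}

def health_score(metrics):
    """Count how many measured metrics fall inside their target range."""
    return sum(lo <= metrics[m] <= hi for m, (lo, hi) in TARGETS.items())

# A graph with too few dormant edges scores 3/4 on this subset.
score = health_score({"avg_nodes_fired": 1.5, "cross_file_edge_pct": 5.0,
                      "dormant_pct": 60.0, "reflex_pct": 3.0})
```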

5.2 The Autotuner

The autotuner runs after every query. It measures health, diagnoses out-of-range metrics, and applies bounded adjustments to four knobs: decay half-life, promotion threshold, Hebbian increment, and reflex threshold.

health = measure_health(graph, state, query_stats)  # eight metrics (Section 5.1)
adjustments = autotune(health)                      # propose knob changes
changes = filter_with_guardrails(adjustments)       # bound and clamp proposals
apply_adjustments(graph, changes)

Every candidate change passes through a safety filter before it is applied.
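A minimal guardrail sketch. The knob bounds, the 25% maximum relative step, and the two-argument signature (proposed values plus current config) are assumptions for illustration, not the shipped values:

```python
# Hypothetical per-knob bounds; the real guardrails are not specified here.
BOUNDS = {"decay_half_life": (20, 400),
          "promotion_threshold": (2, 10),
          "hebbian_increment": (0.01, 0.2),
          "reflex_threshold": (0.6, 0.95)}

def filter_with_guardrails(adjustments, current, max_step=0.25):
    """Clamp each proposed change to a bounded relative step, then to the
    knob's allowed range, so no single cycle can destabilize the graph."""
    safe = {}
    for knob, proposed in adjustments.items():
        lo, hi = BOUNDS[knob]
        cur = current[knob]
        step = max_step * cur
        bounded = max(cur - step, min(cur + step, proposed))
        safe[knob] = max(lo, min(hi, bounded))
    return safe

# A wild jump from 80 to 500 is trimmed to one bounded step.
safe = filter_with_guardrails({"decay_half_life": 500}, {"decay_half_life": 80})
```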

5.3 Meta-Learning

The autotuner keeps a history of every (metric, knob, direction) triple and whether it improved health. Successful triples are preferred in future cycles. Repeated failures are down-ranked.

This is a simple form of meta-learning: the controller learns which fixes work for this particular graph and workload.

5.4 Brain Death and Recovery

Set decay half-life to 20, far too aggressive. Here is what happens.

Checkpoint | Dormant | Healthy Metrics | Status            | Config
Q25        | 86.7%   | 4 / 8 in range  | Stable            | τd=20, h=0.06
Q50        | 100%    | 3 / 8 in range  | Brain dead        | τd=20, h=0.06
Q75        | 93.3%   | 4 / 8 in range  | Recovery underway | τd=162, h=0.12
Q100       | 86.7%   | 4 / 8 in range  | Stabilized        | τd=200, h=0.12

At Q50, every edge is dormant. The graph is dead; it cannot route anything. No human intervenes.

Circular feedback loop: measure, diagnose, apply, evaluate
The self-tuning loop. The autotuner is the immune system of the memory graph.

Every 25 queries, the autotuner runs a health check. It measures eight metrics: how many nodes fire per query, what percentage of edges are dormant, whether cross-file connections are forming. If any metric falls outside its empirically calibrated range, the autotuner adjusts: increase decay, lower promotion threshold, strengthen Hebbian reinforcement. It tracks which adjustments worked and which made things worse. Over time, it learns to tune this particular graph for this particular usage pattern. If something goes badly wrong (say, decay so aggressive that the brain is dying), the emergency brake reverts all changes to the last known good configuration.

Brain death at Q0-Q50 then recovery Q50-Q100 as autotuner ramps decay
Figure 6. Decay half-life over queries with brain death until Q50 and recovery after autotuning begins.

The autotuner detects the failure: dormant percentage exceeds range, nodes fired too low, reflex at zero. It begins increasing decay half-life (20 → 30 → 40 → 53 → ...) and Hebbian increment (0.06 → 0.07 → 0.08 → ...) at every tuning cycle. From Q50 to Q100 (50 queries), decay half-life rises to 200; edges recover by Q75 and stabilize by Q100 at τd=200 with dormant percentage back in range.

5.5 Migration

When an agent migrates to a new machine or upgrades its workspace files, the bootstrap runs fresh: read files, split into chunks, create sibling edges. If historical session logs exist, they replay through the graph to warm up edge weights before the system goes live.

Replay is memory distillation. It accelerates early structure formation and creates cross-file links before real feedback arrives.
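Replay reduces to running the Hebbian rule over logged co-activations before the graph goes live. A sketch with a hypothetical log format (one set of fired node ids per session entry):

```python
def warm_start(weights, session_logs, increment=0.06):
    """Replay historical co-activations to pre-strengthen edges (memory
    distillation). Each log entry is the set of node ids that co-fired."""
    for fired in session_logs:
        for (src, dst), w in list(weights.items()):
            if src in fired and dst in fired:
                weights[(src, dst)] = min(1.0, w + increment)
    return weights

weights = {("tools:1", "tools:2"): 0.27}
logs = [{"tools:1", "tools:2"}, {"tools:1", "tools:2"}]
weights = warm_start(weights, logs)
```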

6. Does It Work

6.1 Procedural Memory Compilation

A simulated agent receives 50 queries about deployment, interleaved with unrelated queries. The deployment procedure has four steps: check CI → inspect manifest → rollback → verify.

Query | Chain State    | Edge Weights       | Notes
Q1    | No path        | All edges missing  | No stable pointer evidence yet.
Q25   | Habitual chain | 0.46 → 0.46 → 0.46 | Three habitual edges form a route.
Q50   | Reflex chain   | 0.88 → 0.82 → 0.82 | Full route auto-fires. Zero deliberation.

No distractors appear on this chain. The sequence self-organizes from repeated successful completions under sparse corrective signals. Zero distractor weight at all checkpoints.

6.2 Twin Brains

Two graphs initialized from the same 3 files (11 nodes, 30 edges). Brain A receives 100 coding-heavy queries. Brain B receives 100 safety-heavy queries.

Coding brain vs safety brain stacked vertically
Figure 7. Same workspace files. Different query patterns. Different brains.
Two brains side by side - coding brain with dense tool connections, safety brain with dense safety connections
Figure 8. Individuation: same starting structure, different cognitive specialization from different experiences.
Brain            | Dormant | Cross-file Edges | Reflex | Phenotype
Brain A (coding) | 73.7%   | 8 (21.1%)        | 5.3%   | Tooling-heavy reflexes, broad cross-domain links
Brain B (safety) | 87.5%   | 2 (6.25%)        | 6.25%  | Safety-oriented clusters, sparse bridging

Same source files. Different structure. Brain A has 4× more cross-file connections because coding queries require jumping between tools, identity, and safety context. Brain B concentrates its reflex paths in the safety domain, where credentials and destructive action rules co-fire repeatedly.

Every agent that runs CrabPath develops a unique cognitive structure.

6.3 Hebbian Alone vs Hebbian + RL Scoring

Does outcome-based learning matter, or is structural co-firing sufficient? We ran the same 50 queries on the same ambiguous deployment corpus twice: once with Hebbian reinforcement only, once with Hebbian plus cheap-LLM scoring at $0.001 per query.

The corpus was deliberately adversarial: 5 correct deployment procedure nodes, 5 wrong/outdated procedure nodes, and 5 hotfix nodes, all using similar vocabulary. A keyword-based router with a 30% error rate simulated real routing confusion.

Hebbian-only versus Hebbian plus RL scoring graph comparison
Metric              | Hebbian Only | Hebbian + RL
Wrong-path hit rate | 24%          | 16% (-33%)
Reflex edges        | 0            | 4
Inhibitory edges    | 0            | 21
Health score        | 3/8          | 5/8
Dormant edges       | 100%         | 83%

Hebbian alone produces a graph where everything is dormant: it learned what fires together but has no mechanism to distinguish correct from incorrect. With RL scoring, the graph compiled 4 correct procedures into reflex paths and created 21 inhibitory edges suppressing the wrong procedures. The cheap LLM scorer costs $0.001 per query. For that price, the graph learns not just structure, but truth.

6.4 Shadow Mode in Production

CrabPath runs in shadow mode on a live agent (GUCLAW) processing real conversations. After 40 queries with GPT-5-mini scoring at $0.001 per query, we observe:

Signal               | Fires       | Edge Changes
Hebbian co-fire      | 39/40 (98%) | 672 reinforcements
Proto-edge promotion | –           | 122 new edges
Proto-edge creation  | –           | 1,550 exploring
RL scoring (GOOD)    | 8/40 (20%)  | ~40 updates
RL scoring (SKIP)    | 32/40 (80%) | 0 (below gate)

Hebbian learning dominates early graph life, with 16× more edge changes than RL scoring. This is expected. The graph must first build structure (what fires together) before it can judge quality (what is actually correct). RL scoring fires on the 20% of queries where retrieval relevance exceeds the 0.3 helpfulness gate. The remaining 80% receive no RL update; the graph learns from co-firing alone, and that is sufficient for structural formation.

We are honest about this tradeoff: the RL signal is sparse in a young graph. The A/B comparison above shows that when RL does fire, it matters: wrong-path retrieval drops 33%. As the graph matures and retrieval improves, we expect the scoring rate to increase. The Hebbian base layer ensures the graph never stops learning, even when RL is silent.

6.5 Scenario Comparison

Four routing configurations evaluated on the full GUCLAW simulation (278 nodes, 3,822 edges):

Scenario           | Queries | Edges | Cross-file | Proto | Reflex / Habitual / Dormant | Decay τ
Baseline           | 153     | 3,822 | 82         | 100   | 13 / 51 / 3,758             | 80
Aggressive Decay   | 160     | 3,790 | 54         | 77    | 1 / 29 / 3,760              | 40
Conservative Decay | 160     | 3,853 | 111        | 152   | 24 / 3,775 / 54             | 200
High Traffic       | 500     | 3,861 | 146        | 76    | 16 / 66 / 3,779             | 80
Histogram of edge weight (0.0–1.0) versus count (0–4,000), with dormant, habitual, and reflex regions marked
Figure 9. Edge weight distribution after 100 queries. The spike near zero is selective forgetting in action.

This histogram tells the whole story in one image. The massive spike near zero is 3,800 sibling edges that decayed because the agent never needed them together. The small bump in the habitual range is 170 actively useful connections. The tiny green spike at the right is 25 reflex edges: procedures so well-learned that the router no longer deliberates over them. This is what a healthy brain looks like: mostly quiet, selectively active, with a few compiled reflexes.

Aggressive decay (τ=40) collapses structure: reflex edges drop to 1 and cross-file links fall 34%. Conservative decay (τ=200) shifts 98% of edges into the habitual tier: everything requires deliberation, nothing is free. High traffic improves cross-file linking through co-firing frequency: 146 cross-file edges vs 82 at baseline.

6.4 Cost Reduction

Experiment    | Static (tokens) | RAG | CrabPath | Reduction
Context Bloat | 6,066           | 744 | 297      | 95%
Gate Bloat    | 8,163           | 407 | 89       | 99%
Stale Context | 895             | 507 | 88       | 90%
Procedure     | 548             | 467 | 205      | 63%

Estimated turn cost: $0.091 (static) → $0.004 (CrabPath). At 500 queries/day, that is the difference between $45.50 and $2.00.

7. Limitations and Conclusion

Limitations

CrabPath has not been deployed in production. The results here come from simulation. Long-horizon stability under real workloads is still an open question.

Fresh graphs are expensive: 100% habitual at bootstrap means every edge requires deliberation until the system has enough feedback to learn. The cold-start window depends on workload diversity and query volume.

Credit assignment remains hard. The corrected estimator from Gu (2016) is better than myopic, but discount factor selection and baseline estimation are still sensitive. The cheap routing model's convergence needs caveats: it is not a general-purpose optimizer, and pathological workloads could produce degenerate policies.

Health metric target ranges were calibrated on one workspace. They will shift with corpus size and workload distribution. Multi-user credit isolation does not exist yet.

Conclusion

We have five main findings:

  1. Context load falls 63–99% versus static retrieval, with estimated per-turn cost dropping from $0.091 to $0.004.
  2. Corrected policy gradients propagate signal to every routing decision in the trajectory, not just the terminal one.
  3. Procedural memory self-compiles from repeated behavior: a four-step deployment chain reaches reflex tier within 50 queries.
  4. The autotuner keeps the graph bounded and recovers from brain death without human intervention.
  5. Agents sharing the same source files develop structurally distinct graphs under different workloads.

Code: github.com/jonathangu/crabpath.

References

  1. Gu, J. (2016). Corrected policy-gradient update for recurrent action sequences. UCLA Econometrics Field Paper.
  2. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
  3. Collins, A. M. & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407–428.
  4. Graves, A., Wayne, G. & Danihelka, I. (2014). Neural Turing Machines. arXiv:1410.5401.
  5. Graves, A. et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538, 471–476.
  6. Weston, J., Chopra, S. & Bordes, A. (2015). Memory Networks. ICLR 2015.
  7. Park, J. S. et al. (2023). Generative agents: Interactive simulacra of human behavior. UIST 2023.
  8. Wang, G. et al. (2023). Voyager: An open-ended embodied agent with large language models. arXiv:2305.16291.
  9. Packer, C. et al. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
  10. Shinn, N. et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023.
  11. Yao, S. et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR 2023.
  12. Asai, A. et al. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ICLR 2024.
  13. Edge, D. et al. (2024). From local to global: A graph RAG approach to query-focused summarization. arXiv:2404.16130.
  14. Sun, J. et al. (2024). Think-on-Graph: Deep and responsible reasoning of large language model on knowledge graph. ICLR 2024.
  15. Schulman, J. et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
  16. Rafailov, R. et al. (2023). Direct preference optimization: Your language model is secretly a reward model. NeurIPS 2023.