The number
About $34k.

Approximate combined API spend: $10.3k observed (OpenAI) + ~$23.7k reconstructed (Anthropic), Feb 11 – Mar 13, 2026 (31 days, UTC buckets).

That is not a typo. It is not annualized. It is one month of direct API calls from a single person's development environment across two providers. But it is also not one clean observed number: the OpenAI half is directly observed from billing exports, and the Anthropic half is reconstructed from token logs. Read the total as approximate, not exact.
I found this number by pulling the raw billing exports from both dashboards on March 13, 2026 and analyzing them line by line. The OpenAI side is clean: a daily cost export for the full 31-day window. The Anthropic side is not. Its matched-window cost export covers only 15 of 31 UTC dates, so I reconstructed the rest from token-level logs and pricing recovered from the overlap days. That reconstruction matches the observed cost rows closely on most overlap dates, but it is still reconstructed.
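The reconstruction step can be sketched as follows: fit effective per-token rates from the overlap days where both a cost row and token logs exist, then apply those rates to the days missing from the cost export. All token counts and dollar figures below are illustrative placeholders, not the actual export rows.

```python
# Sketch of the Anthropic reconstruction: infer effective per-token rates
# from overlap days (both cost export and token logs present), then price
# the token-log-only days with those rates. Numbers are made up for
# illustration.
import numpy as np

# Overlap days: (input_tokens, output_tokens, billed_usd).
overlap = [
    (40_000_000, 1_000_000, 290.0),
    (60_000_000, 2_000_000, 450.0),
]

# Fit effective rates by least squares; with more overlap days than
# unknowns this also gives a residual to sanity-check the fit.
A = np.array([[inp, out] for inp, out, _ in overlap], dtype=float)
b = np.array([cost for _, _, cost in overlap])
rate_in, rate_out = np.linalg.lstsq(A, b, rcond=None)[0]

# Price a day that has token logs but no cost row.
inp, out = 50_000_000, 1_200_000
reconstructed = inp * rate_in + out * rate_out
print(f"inferred rates: ${rate_in*1e6:.2f}/M in, ${rate_out*1e6:.2f}/M out")
print(f"reconstructed day: ${reconstructed:,.2f}")
```

The real workload has more rate categories (cache writes at two TTLs, cache reads, uncached input, output, per model), so the actual fit has more columns, but the shape of the calculation is the same.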
This piece is an honest accounting of where that money seems to have gone, what it appears to have bought, and whether I'd do it again.
Where the money went
| Metric | OpenAI (observed) | Anthropic (reconstructed) |
|---|---|---|
| Total spend | $10,316 | ~$23,729 |
| Avg daily spend | $333 | ~$765 |
| Peak single day | $1,755 (Mar 7) | ~$3,175 (Feb 25) |
| Hottest 7-day run | $8,100 | ~$14,828 |
| Tokens (input + output) | 11.2B | 9.8B* |
| Dominant model | GPT-5.4 | Claude Opus 4.6 |

*Served input + output. Anthropic's raw cache-inclusive token total is higher; the closing tally reports both versions.
The spend gap is striking: reconstructed Anthropic spend was roughly 2.3x observed OpenAI spend, while Anthropic also shows fewer comparable served tokens. That is not a clean apples-to-apples cost-per-1M verdict on the models themselves. The workloads were different. Anthropic appears dominated by giant cached Opus contexts; OpenAI looks more like shorter-context work across a mix of models.
OpenAI: a late-stage ramp
The first 14 days of OpenAI usage were almost idle: $246 total. Then the OpenAI spend curve inflected sharply in the final week. The last 7 days alone were $8,100, which is 78.5% of the entire month. GPT-5.4 showing up in the workflow is the obvious concurrent change, but the export does not prove a single-cause story. Inferred
The reconstructed OpenAI model mix is concentrated. GPT-5.4 accounted for an estimated 80% of normalized spend and 46% of all requests. GPT-5.2 was the secondary model at roughly 18%. Everything else rounds to noise. Reconstructed
One anomaly I can't fully explain: actual billed spend was 25% higher than what visible token pricing would predict. Possible explanations include reasoning tokens, tool-use charges, or internal pricing that differs from published rates. I don't have the granularity to pin that down. Inferred
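The check behind that anomaly is simple to state: price the visible tokens at published per-1M rates and compare against actual billed spend. The rates and token counts below are placeholders, not OpenAI's real prices or my real usage rows; the point is the shape of the comparison.

```python
# Sanity check behind the 25% gap: predicted spend from visible tokens at
# published rates vs. actual billed spend. All figures are hypothetical.

PUBLISHED_RATES = {  # $ per 1M tokens (placeholder values)
    "gpt-5.4": {"input": 5.00, "output": 15.00},
    "gpt-5.2": {"input": 2.50, "output": 10.00},
}

usage = [  # (model, input_tokens, output_tokens) from the usage export
    ("gpt-5.4", 3_000_000_000, 200_000_000),
    ("gpt-5.2", 1_000_000_000, 80_000_000),
]

predicted = sum(
    inp / 1e6 * PUBLISHED_RATES[m]["input"]
    + out / 1e6 * PUBLISHED_RATES[m]["output"]
    for m, inp, out in usage
)

billed = 26_625.0  # from the billing export (placeholder)
gap = billed / predicted - 1
print(f"predicted ${predicted:,.0f}, billed ${billed:,.0f}, gap {gap:+.1%}")
```

A persistent positive gap like this usually means some billable units (reasoning tokens, tool calls, tiered pricing) are not in the visible token columns, which is exactly the ambiguity described above.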
Anthropic: the prompt-caching bill
Anthropic's spend was earlier and sharper. The big spike cluster ran from February 19 through February 27, with 84% of the month's estimated spend concentrated in that 9-day window.
The cost was not driven by generating text. Output tokens accounted for only 2.4% of the bill. The main visible driver was cache orchestration — writing and re-reading massive contexts through Claude's prompt-caching system:
- Cache writes (5-minute and 1-hour TTLs): ~67% of spend
- Cache reads: ~31% of spend
- Output: ~2.4%
- Uncached input: ~0.04%
Put another way: about 97.8% of reconstructed Anthropic spend was cache mechanics rather than uncached input or output. Claude Opus 4.6 was 94.7% of covered spend. The 200k–1M context window accounted for 80% of estimated spend. Reconstructed
The served-input cache hit rate was 99.99%, which sounds efficient until you look at where the money went. Cache writes themselves accounted for roughly two-thirds of estimated Anthropic spend. Each cached token was reused about 8x on average, but the write side still appears to have dominated the economics. Inferred
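The economics here follow directly from the cache pricing multipliers. Using Anthropic's published structure (cache write at 1.25x base input for the 5-minute TTL, 2x for the 1-hour TTL, cache read at 0.1x base input; the base rate below is a placeholder, not an actual price), a quick break-even sketch shows why writes dominate even at healthy reuse rates:

```python
# Break-even sketch for prompt caching. Multipliers follow Anthropic's
# published structure (write = 1.25x/2x base input by TTL, read = 0.1x);
# BASE_IN is a placeholder rate, not a real price.

BASE_IN = 15.0  # $ per 1M input tokens (hypothetical Opus-class rate)

def cached_cost_per_m(reuses: int, write_mult: float) -> float:
    """Cost per 1M context tokens: one cache write plus `reuses` cache reads."""
    return BASE_IN * write_mult + reuses * BASE_IN * 0.10

def uncached_cost_per_m(reuses: int) -> float:
    """Same context sent uncached every time (1 initial call + `reuses` repeats)."""
    return BASE_IN * (1 + reuses)

for r in (1, 8):  # 8x is roughly the reuse rate observed in the logs
    cached = cached_cost_per_m(r, write_mult=2.0)  # 1-hour TTL
    write_share = (BASE_IN * 2.0) / cached
    print(f"reuses={r}: cached ${cached:.2f}/M vs uncached "
          f"${uncached_cost_per_m(r):.2f}/M, write share {write_share:.0%}")
```

Two things fall out of this: at 1 reuse, a 1-hour-TTL cache write actually loses to sending the context uncached, and even at 8 reuses the write is still roughly 70% of the cached cost. That is consistent with the two-thirds write share in the bill above: caching was the right call, but the write side structurally dominates.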
Workspaces: multiple lanes, not clean attribution
The Anthropic token logs show usage across four workspace labels: Default (52% of token volume), guclaw (27%), bountiful (14%), and pelican (7%). That suggests multiple active lanes, not one clean runaway loop, but it does not prove project-level ROI. Workspace labels are observed. Mapping them to shipped artifacts or productivity is inference. Inferred
What we were actually building
During this 31-day window, my development environment, anchored by OpenClaw, was doing real work across multiple projects. Not all of this maps cleanly to specific dollar amounts. I'm going to be honest about what I can tie loosely to the spend and what I cannot cleanly attribute at all.
Shipped and durable Inferred
- OpenClawBrain — the memory/learning system for OpenClaw agents — went through a major push: four infrastructure PRs merged (replay watchdog, stall hardening, fast boot, default learning-on), a 0.1.5 release, and multiple top-to-bottom rewrites of the public site and documentation.
- This site (jonathangu.com) got a full design refresh, new hero art for all project cards, and a rewrite of the copy from jargon-heavy to plain English.
- openclawbrain.ai was rewritten from a compliance-style technical memo into a proper product site, with benchmark evidence surfaced prominently and a clear outsider-first explanation of what the brain does.
- OpenClawBrain proof artifacts — recorded-session head-to-head benchmarks (full_brain 0.975 vs vector_rag_rerank 0.896 across 800 queries), 10-seed sparse-feedback proof sweeps, and publication-style reporting.
- Bountiful Garden and Project Pelican (options trading) show workspace activity in the logs, but their token volume largely reflects positioning updates and automated data refresh rather than clearly model-spend-intensive shipped work. Attribution of spend to specific outputs in these workspaces is weak. Inferred
Thrash and inefficiency
I'm not going to pretend every dollar was productive.
- Multiple rewrite waves of the same pages. The OpenClawBrain site and this homepage went through at least four rounds of copy rewrites in a single weekend (March 7-8): technical → plain English → metaphor-first → elevator-pitch → "everywhere." That clearly cost money. My workflow likely also caused broad rereads and repeated passes over the same material, though I cannot prove the exact token mechanics of each pass. Inferred
- Long-context caching overhead. The Anthropic bill suggests that maintaining large cached contexts was itself the majority of cost, not just the work those contexts enabled. A smaller-context strategy, different TTL choices, or lighter models probably would have cut spend substantially, though it's hard to say how much agent effectiveness would have suffered. Inferred
- Model-switching churn. Several outages appear to line up with batch-editing model and service configurations simultaneously. The cost of the outage-recovery cycles was real. Inferred
- Mega brain rebuild stalls. A multi-hour replay process stalled repeatedly on pathological items, burning compute while producing nothing. That episode appears to have fed into the hardening PRs that landed afterward, so the waste at least seems to have produced a durable fix. Inferred
Timeline: spend vs. shipping
| Period | What happened | ~Spend |
|---|---|---|
| Feb 11–18 | Anthropic heavy: brain building, learning pipeline, early OpenClawBrain work | ~$4.5k |
| Feb 19–27 | Anthropic peak: massive long-context Opus 4.6 caching, multi-workspace activity | ~$15.5k |
| Feb 28–Mar 4 | Transition period: Anthropic winding down, OpenAI starting to ramp | ~$3k |
| Mar 5–8 | GPT-5.4 enters the workflow. OpenAI ramps. Site rewrites ship. OpenClawBrain PRs land. | ~$6k |
| Mar 9–13 | OpenAI at full burn. Design refresh. Proof benchmarks. This report. | ~$5k |
What changed after we killed the API keys
On March 13, 2026 — the day I pulled these exports — I deleted the direct OpenAI API key and the direct Anthropic API key from my environment.
This was not a gradual optimization. It was a hard cut. The main interactive chat and coding workflows now route through Codex OAuth / ChatGPT Pro subscription (for OpenAI) and Claude Code on Max subscription (for Anthropic) rather than direct API billing. That moved the biggest human-in-the-loop lanes off per-token billing, but it did not erase all cost exposure.
What this changes:
- Main interactive workflows moved off direct billing. The subscription plans cover the high-volume chat and coding lanes that appear to have been driving most of the bill.
- The remaining API surface is narrower. Direct API keys are still needed for specific programmatic use cases, including Pelican's automated trading pipeline and cron-driven agent tasks, but those are lower-volume and easier to monitor.
- The cost exposure is reduced, not eliminated. I didn't find the perfect per-token strategy. I moved the high-volume interactive work behind subscription tiers. Some direct API use remains, and subscription economics depend on provider plan terms staying favorable.
One honest caveat: the subscription approach works because these providers currently offer plans that absorb high interactive volume. If those plans change their caps or pricing, the economics change with them. And not all workloads transfer: programmatic pipelines still need direct API access. What changed is that the main interactive workflows moved off direct billing, not that every billable AI path disappeared.
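The arithmetic behind the switch is not subtle. With a hypothetical flat plan price (the figure below is a placeholder, not either provider's actual tier), the break-even point against this month's observed burn rates is almost immediate:

```python
# Back-of-envelope behind the key switch: how many days of typical API
# usage one month of a flat plan pays for. Plan price is a hypothetical
# placeholder; the daily averages are the observed figures from this month.

def days_to_breakeven(plan_price_usd: float, avg_daily_api_usd: float) -> float:
    """Days of API usage whose cost equals one month of the flat plan."""
    return plan_price_usd / avg_daily_api_usd

PLAN = 200.0  # hypothetical $/month subscription tier

print(f"OpenAI lane:    {days_to_breakeven(PLAN, 333.0):.2f} days")
print(f"Anthropic lane: {days_to_breakeven(PLAN, 765.0):.2f} days")
```

At these burn rates a flat plan pays for itself in well under a day of typical usage, which is why the hard cut was easy to justify in retrospect. The arithmetic deliberately ignores plan caps and rate limits, which is exactly the caveat above: the economics hold only while the plans absorb this volume.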
How solid is this data?
I want to be precise about what I actually know versus what I'm inferring.
| Claim | Evidence level | Source |
|---|---|---|
| OpenAI total: $10,316 | Observed | Daily cost export, all 31 days present |
| OpenAI model-level spend shares | Reconstructed | Token-price proxy normalized to daily actuals |
| Anthropic total: $23,729 | Reconstructed | Token logs + pricing inferred from overlapping cost/token rows |
| Anthropic partial: $13,121 | Observed | Cost export (covers 15 of 31 dates, ~48%) |
| Token volumes (both providers) | Observed | Usage/token exports |
| Spend tied to specific projects | Inferred | Workspace labels + git timeline + memory |
| Period spend by date range | Inferred | Mixed observed + reconstructed daily sums |
The biggest data gap is on Anthropic: the cost export stops at February 24 and then jumps to a single day on March 11. The token logs cover the full period, and the inferred pricing matches observed rows closely, so I trust the reconstructed total within a few percent. But I'd feel better with the complete cost export.
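The "matches observed rows closely" claim reduces to a per-day relative-error check on the overlap dates. A minimal version of that check, with placeholder daily figures rather than the actual rows:

```python
# Validation sketch: on dates where both the cost export and the
# reconstruction exist, compute the per-day relative error. Dollar
# figures are illustrative, not the actual daily rows.

observed = {"2026-02-20": 1800.0, "2026-02-21": 2400.0, "2026-02-22": 950.0}
reconstructed = {"2026-02-20": 1820.0, "2026-02-21": 2360.0, "2026-02-22": 955.0}

errors = {
    day: abs(reconstructed[day] - cost) / cost
    for day, cost in observed.items()
}
worst = max(errors.values())
print({d: f"{e:.1%}" for d, e in sorted(errors.items())}, f"worst: {worst:.1%}")
```

If the worst overlap-day error stays within a few percent, extrapolating the inferred rates to the missing dates is reasonable; a single overlap day with a large error would be a red flag that pricing changed mid-window.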
Was it worth it?
Verdict: Partly.
Real work shipped. Real infrastructure was built. Multiple public-facing products improved materially. But the spend was badly uncontrolled, a significant fraction was thrash, and the fix — deleting the API keys and moving to subscriptions — was straightforward and should have happened weeks earlier.
Worth it compared to what?
"Partly worth it" needs a benchmark. The real comparison is not versus doing nothing. It is versus a cheaper, slower, more disciplined version of the same month. Here are the two most obvious counterfactuals:
- Subscription plans from day one. Both OpenAI and Anthropic offer subscription tiers that cover a lot of high-volume interactive use. If I had started on those plans on February 11 instead of switching on March 13, the main interactive workloads would likely have been far cheaper than running them through direct API billing. The exact savings depend on plan tiers and caps, so I am not pretending this is a precise substitute.
- Smaller contexts and cheaper models. The Anthropic data is suggestive: an estimated 97.8% of Anthropic spend was cache mechanics on 200k–1M context windows. Running shorter contexts or using lighter models for some of the rewrite work would have produced a dramatically smaller bill. Whether it would have produced the same quality of output is harder to say — I don't have controlled evidence for that.
The uncomfortable answer is that a meaningful chunk of the spend looks avoidable with decisions that were available at the time, not just in hindsight. That is what keeps the verdict at "partly" instead of "yes."
The case for yes
- Durable systems shipped. OpenClawBrain went from a half-built prototype to a released package with proof artifacts, public documentation, and a real product site. The hardening PRs (replay watchdog, fast boot, stall recovery) directly addressed failures discovered during the spend window.
- Public surfaces improved. Both jonathangu.com and openclawbrain.ai look and read materially better than they did on February 11. That matters for a product trying to earn credibility.
- Learning and leverage. Running multiple AI agents across multiple projects at this scale taught me things about prompt caching economics, model-switching costs, and agent orchestration that I likely would not have learned as quickly at lower volume. The cost data itself is useful evidence.
- Multiple workstreams were active. OpenClawBrain, jonathangu.com, and at least some Bountiful and Pelican activity were moving through the same agent infrastructure. That matters, even though the project-level attribution is loose.
The case for no
- Massive waste in rewrite cycles. Rewriting the same site copy four times in one weekend is not productive iteration. It's expensive indecision running on a meter.
- The Anthropic caching bill was likely avoidable. An estimated ~$24k, most of it on cache writes for 200k–1M contexts, suggests a strategy mistake. Smaller contexts, smarter cache TTL choices, or shorter system prompts would likely have cut this dramatically — though the impact on agent effectiveness is uncertain.
- The fix was available, not just obvious in hindsight. Subscription plans from both providers appear to cover the main interactive workloads at far lower direct cost. Every day I ran those lanes on raw API billing after those plans existed was money left on the table.
- Attribution is weak. I can't cleanly tie most of the ~$34k to specific shipped artifacts. The workspace labels give me project-level buckets, but "guclaw workspace spent $X" doesn't tell me what that workspace actually accomplished per dollar.
The honest bottom line
I'd make the same bet (run hard on AI-assisted development across multiple projects), but I'd set the cost controls first. The work was real. The scale was real. The waste was also real, and it looks more like a failure of cost hygiene than a failure of the underlying approach.
About $34k bought a combined footprint of roughly 21.0B comparable tokens if you use Anthropic served-input-plus-output, or 22.3B if you use Anthropic raw cache-inclusive totals. Some of that built things that will last. Some of it burned on repeated passes over the same material. The ratio between those two is the thing I'd fix.
If you're running AI agents at scale against direct API billing: check your caching costs, check your context window sizes, and check whether a subscription plan covers your workload. Do it before the month ends, not after.
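The checks in that last paragraph are mechanical enough to automate. A minimal version, assuming a daily usage row with the field names below (these are assumptions about what a usage export might contain, not any provider's actual schema):

```python
# Minimal daily spend guard: flag a usage row when cache-write spend share
# or average context size crosses a threshold. Field names are assumed,
# not a real provider schema.

def flag_day(row: dict, cache_write_share_max: float = 0.5,
             avg_context_max: int = 200_000) -> list:
    """Return a list of warning labels for one day's usage row."""
    flags = []
    total = (row["cache_write_usd"] + row["cache_read_usd"]
             + row["input_usd"] + row["output_usd"])
    if total and row["cache_write_usd"] / total > cache_write_share_max:
        flags.append("cache-write-heavy")
    if row["requests"] and row["input_tokens"] / row["requests"] > avg_context_max:
        flags.append("huge-contexts")
    return flags

# A day shaped like this month's Anthropic peak would trip both checks:
day = {"cache_write_usd": 2100.0, "cache_read_usd": 900.0, "input_usd": 5.0,
       "output_usd": 70.0, "requests": 400, "input_tokens": 120_000_000}
print(flag_day(day))
```

Run something like this against each day's export in a cron job and the February spike pattern gets caught on day one instead of at month end.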
Appendix: Model concentration
OpenAI — top models by spend share Reconstructed
| Model | Spend share | Request share | Input token share |
|---|---|---|---|
| GPT-5.4 | 80.3% | 46.3% | 70.5% |
| GPT-5.2 | 17.9% | 15.5% | 24.8% |
| GPT-5 mini | 1.7% | 36.6% | 4.1% |
| All others | 0.1% | 1.5% | 0.2% |
Anthropic — top models by spend share Reconstructed
| Model | Spend share | Token share |
|---|---|---|
| Claude Opus 4.6 | 94.7% | 91.1% |
| Claude Sonnet 4.5 | 4.9% | 8.5% |
| All others | 0.4% | 0.4% |