The number
About $34k.

Approximate combined API spend: $10.3k observed (OpenAI) + ~$23.7k reconstructed (Anthropic), Feb 11 – Mar 13, 2026 (31 days, UTC buckets).

That is not a typo. It is not annualized. It is one month of direct API calls from a single person's development environment across two providers. But it is also not one clean observed number: the OpenAI half is directly observed from billing exports, and the Anthropic half is reconstructed from token logs. Read the total as approximate, not exact.
I found this number by pulling the raw billing exports from both dashboards on March 13, 2026 and analyzing them line by line. The OpenAI side is clean: a daily cost export for the full 31-day window. The Anthropic side is not. Its matched-window cost export covers only 15 of 31 UTC dates, so I reconstructed the rest from token-level logs and pricing recovered from the overlap days. That reconstruction matches the observed cost rows closely on most overlap dates, but it is still reconstructed.
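The reconstruction step can be sketched as follows: fit effective per-token rates from the overlap days where both a cost row and token logs exist, then apply those rates to the days missing from the cost export. All token counts and dollar figures below are illustrative placeholders, not the actual export rows.

```python
# Sketch of the Anthropic reconstruction: infer effective per-token rates
# from overlap days (both cost export and token logs present), then price
# the token-log-only days with those rates. Numbers are made up for
# illustration.
import numpy as np

# Overlap days: (input_tokens, output_tokens, billed_usd).
overlap = [
    (40_000_000, 1_000_000, 290.0),
    (60_000_000, 2_000_000, 450.0),
]

# Fit effective rates by least squares; with more overlap days than
# unknowns this also gives a residual to sanity-check the fit.
A = np.array([[inp, out] for inp, out, _ in overlap], dtype=float)
b = np.array([cost for _, _, cost in overlap])
rate_in, rate_out = np.linalg.lstsq(A, b, rcond=None)[0]

# Price a day that has token logs but no cost row.
inp, out = 50_000_000, 1_200_000
reconstructed = inp * rate_in + out * rate_out
print(f"inferred rates: ${rate_in*1e6:.2f}/M in, ${rate_out*1e6:.2f}/M out")
print(f"reconstructed day: ${reconstructed:,.2f}")
```

The real workload has more rate categories (cache writes at two TTLs, cache reads, uncached input, output, per model), so the actual fit has more columns, but the shape of the calculation is the same.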
This piece is an honest accounting of where that money seems to have gone, what it appears to have bought, and whether I'd do it again.
Where the money went
| Metric | OpenAI (observed) | Anthropic (reconstructed) |
|---|---|---|
| Total spend | $10,316 | ~$23,729 |
| Avg daily spend | $333 | ~$765 |
| Peak single day | $1,755 (Mar 7) | ~$3,175 (Feb 25) |
| Hottest 7-day run | $8,100 | ~$14,828 |
| Tokens (input + output) | 11.2B | 9.8B* |
| Dominant model | GPT-5.4 | Claude Opus 4.6 |

*Served input + output. Anthropic's raw cache-inclusive token total is higher; the closing tally reports both versions.
The spend gap is striking: reconstructed Anthropic spend was roughly 2.3x observed OpenAI spend, while Anthropic also shows fewer comparable served tokens. That is not a clean apples-to-apples cost-per-1M verdict on the models themselves. The workloads were different. Anthropic appears dominated by giant cached Opus contexts; OpenAI looks more like shorter-context work across a mix of models.
OpenAI: a late-stage ramp
The first 14 days of OpenAI usage were almost idle: $246 total. Then the OpenAI spend curve inflected sharply in the final week. The last 7 days alone were $8,100, which is 78.5% of the entire month. GPT-5.4 showing up in the workflow is the obvious concurrent change, but the export does not prove a single-cause story. Inferred
The reconstructed OpenAI model mix is concentrated. GPT-5.4 accounted for an estimated 80% of normalized spend and 46% of all requests. GPT-5.2 was the secondary model at roughly 18%. Everything else rounds to noise. Reconstructed
One anomaly I can't fully explain: actual billed spend was 25% higher than what visible token pricing would predict. Possible explanations include reasoning tokens, tool-use charges, or internal pricing that differs from published rates. I don't have the granularity to pin that down. Inferred
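The check behind that anomaly is simple to state: price the visible tokens at published per-1M rates and compare against actual billed spend. The rates and token counts below are placeholders, not OpenAI's real prices or my real usage rows; the point is the shape of the comparison.

```python
# Sanity check behind the 25% gap: predicted spend from visible tokens at
# published rates vs. actual billed spend. All figures are hypothetical.

PUBLISHED_RATES = {  # $ per 1M tokens (placeholder values)
    "gpt-5.4": {"input": 5.00, "output": 15.00},
    "gpt-5.2": {"input": 2.50, "output": 10.00},
}

usage = [  # (model, input_tokens, output_tokens) from the usage export
    ("gpt-5.4", 3_000_000_000, 200_000_000),
    ("gpt-5.2", 1_000_000_000, 80_000_000),
]

predicted = sum(
    inp / 1e6 * PUBLISHED_RATES[m]["input"]
    + out / 1e6 * PUBLISHED_RATES[m]["output"]
    for m, inp, out in usage
)

billed = 26_625.0  # from the billing export (placeholder)
gap = billed / predicted - 1
print(f"predicted ${predicted:,.0f}, billed ${billed:,.0f}, gap {gap:+.1%}")
```

A persistent positive gap like this usually means some billable units (reasoning tokens, tool calls, tiered pricing) are not in the visible token columns, which is exactly the ambiguity described above.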
Anthropic: the prompt-caching bill
Anthropic's spend was earlier and sharper. The big spike cluster ran from February 19 through February 27, with 84% of the month's estimated spend concentrated in that 9-day window.
The cost was not driven by generating text. Output tokens accounted for only 2.4% of the bill. The main visible driver was cache orchestration — writing and re-reading massive contexts through Claude's prompt-caching system:
- Cache writes (5-minute and 1-hour TTLs): ~67% of spend
- Cache reads: ~31% of spend
- Output: ~2.4%
- Uncached input: ~0.04%
Put another way: about 97.8% of reconstructed Anthropic spend was cache mechanics rather than uncached input or output. Claude Opus 4.6 was 94.7% of covered spend. The 200k–1M context window accounted for 80% of estimated spend. Reconstructed
The served-input cache hit rate was 99.99%, which sounds efficient until you look at where the money went. Cache writes themselves accounted for roughly two-thirds of estimated Anthropic spend. Each cached token was reused about 8x on average, but the write side still appears to have dominated the economics. Inferred
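The economics here follow directly from the cache pricing multipliers. Using Anthropic's published structure (cache write at 1.25x base input for the 5-minute TTL, 2x for the 1-hour TTL, cache read at 0.1x base input; the base rate below is a placeholder, not an actual price), a quick break-even sketch shows why writes dominate even at healthy reuse rates:

```python
# Break-even sketch for prompt caching. Multipliers follow Anthropic's
# published structure (write = 1.25x/2x base input by TTL, read = 0.1x);
# BASE_IN is a placeholder rate, not a real price.

BASE_IN = 15.0  # $ per 1M input tokens (hypothetical Opus-class rate)

def cached_cost_per_m(reuses: int, write_mult: float) -> float:
    """Cost per 1M context tokens: one cache write plus `reuses` cache reads."""
    return BASE_IN * write_mult + reuses * BASE_IN * 0.10

def uncached_cost_per_m(reuses: int) -> float:
    """Same context sent uncached every time (1 initial call + `reuses` repeats)."""
    return BASE_IN * (1 + reuses)

for r in (1, 8):  # 8x is roughly the reuse rate observed in the logs
    cached = cached_cost_per_m(r, write_mult=2.0)  # 1-hour TTL
    write_share = (BASE_IN * 2.0) / cached
    print(f"reuses={r}: cached ${cached:.2f}/M vs uncached "
          f"${uncached_cost_per_m(r):.2f}/M, write share {write_share:.0%}")
```

Two things fall out of this: at 1 reuse, a 1-hour-TTL cache write actually loses to sending the context uncached, and even at 8 reuses the write is still roughly 70% of the cached cost. That is consistent with the two-thirds write share in the bill above: caching was the right call, but the write side structurally dominates.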
Workspaces: multiple lanes, not clean attribution
The Anthropic token logs show usage across four workspace labels: Default (52% of token volume), guclaw (27%), bountiful (14%), and pelican (7%). That suggests multiple active lanes, not one clean runaway loop, but it does not prove project-level ROI. Workspace labels are observed. Mapping them to shipped artifacts or productivity is inference. Inferred
What we were actually building
During this 31-day window, my development environment, anchored by OpenClaw, was doing real work across multiple projects. Not all of this maps cleanly to specific dollar amounts. I'm going to be honest about what I can tie loosely to the spend and what I cannot cleanly attribute at all.
Shipped and durable Inferred
- OpenClawBrain — the memory/learning system for OpenClaw agents — went through a major push: four infrastructure PRs merged (replay watchdog, stall hardening, fast boot, default learning-on), a 0.1.5 release, and multiple top-to-bottom rewrites of the public site and documentation.
- This site (jonathangu.com) got a full design refresh, new hero art for all project cards, and a rewrite of the copy from jargon-heavy to plain English.
- openclawbrain.ai was rewritten from a compliance-style technical memo into a proper product site, with benchmark evidence surfaced prominently and a clear outsider-first explanation of what the brain does.
- OpenClawBrain proof artifacts — recorded-session head-to-head benchmarks (full_brain 0.975 vs vector_rag_rerank 0.896 across 800 queries), 10-seed sparse-feedback proof sweeps, and publication-style reporting.
- Bountiful Garden and Project Pelican (options trading) show workspace activity in the logs, but their token volume largely reflects positioning updates and automated data refresh rather than clearly model-spend-intensive shipped work. Attribution of spend to specific outputs in these workspaces is weak. Inferred
Thrash and inefficiency
I'm not going to pretend every dollar was productive.
- Multiple rewrite waves of the same pages. The OpenClawBrain site and this homepage went through at least four rounds of copy rewrites in a single weekend (March 7-8): technical → plain English → metaphor-first → elevator-pitch → "everywhere." That clearly cost money. My workflow likely also caused broad rereads and repeated passes over the same material, though I cannot prove the exact token mechanics of each pass. Inferred
- Long-context caching overhead. The Anthropic bill suggests that maintaining large cached contexts was itself the majority of cost, not just the work those contexts enabled. A smaller-context strategy, different TTL choices, or lighter models probably would have cut spend substantially, though it's hard to say how much agent effectiveness would have suffered. Inferred
- Model-switching churn. Several outages appear to line up with batch-editing model and service configurations simultaneously. The cost of the outage-recovery cycles was real. Inferred
- Mega brain rebuild stalls. A multi-hour replay process stalled repeatedly on pathological items, burning compute while producing nothing. That episode appears to have fed into the hardening PRs that landed afterward, so the waste at least seems to have produced a durable fix. Inferred
Timeline: spend vs. shipping
| Period | What happened | ~Spend |
|---|---|---|
| Feb 11–18 | Anthropic heavy: brain building, learning pipeline, early OpenClawBrain work | ~$4.5k |
| Feb 19–27 | Anthropic peak: massive long-context Opus 4.6 caching, multi-workspace activity | ~$15.5k |
| Feb 28–Mar 4 | Transition period: Anthropic winding down, OpenAI starting to ramp | ~$3k |
| Mar 5–8 | GPT-5.4 enters the workflow. OpenAI ramps. Site rewrites ship. OpenClawBrain PRs land. | ~$6k |
| Mar 9–13 | OpenAI at full burn. Design refresh. Proof benchmarks. This report. | ~$5k |
What changed after we killed the API keys
On March 13, 2026 — the day I pulled these exports — I deleted the direct OpenAI API key and the direct Anthropic API key from my environment.
This was not a gradual optimization. It was a hard cut. The main interactive chat and coding workflows now route through Codex OAuth / ChatGPT Pro subscription (for OpenAI) and Claude Code on Max subscription (for Anthropic) rather than direct API billing. That moved the biggest human-in-the-loop lanes off per-token billing, but it did not erase all cost exposure.
What this changes:
- Main interactive workflows moved off direct billing. The subscription plans cover the high-volume chat and coding lanes that appear to have been driving most of the bill.
- The remaining API surface is narrower. Direct API keys are still needed for specific programmatic use cases, including Pelican's automated trading pipeline and cron-driven agent tasks, but those are lower-volume and easier to monitor.
- The cost exposure is reduced, not eliminated. I didn't find the perfect per-token strategy. I moved the high-volume interactive work behind subscription tiers. Some direct API use remains, and subscription economics depend on provider plan terms staying favorable.
One honest caveat: the subscription approach works because these providers currently offer plans that absorb high interactive volume. If those plans change their caps or pricing, the economics change with them. And not all workloads transfer: programmatic pipelines still need direct API access. What changed is that the main interactive workflows moved off direct billing, not that every billable AI path disappeared.
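The arithmetic behind the switch is not subtle. With a hypothetical flat plan price (the figure below is a placeholder, not either provider's actual tier), the break-even point against this month's observed burn rates is almost immediate:

```python
# Back-of-envelope behind the key switch: how many days of typical API
# usage one month of a flat plan pays for. Plan price is a hypothetical
# placeholder; the daily averages are the observed figures from this month.

def days_to_breakeven(plan_price_usd: float, avg_daily_api_usd: float) -> float:
    """Days of API usage whose cost equals one month of the flat plan."""
    return plan_price_usd / avg_daily_api_usd

PLAN = 200.0  # hypothetical $/month subscription tier

print(f"OpenAI lane:    {days_to_breakeven(PLAN, 333.0):.2f} days")
print(f"Anthropic lane: {days_to_breakeven(PLAN, 765.0):.2f} days")
```

At these burn rates a flat plan pays for itself in well under a day of typical usage, which is why the hard cut was easy to justify in retrospect. The arithmetic deliberately ignores plan caps and rate limits, which is exactly the caveat above: the economics hold only while the plans absorb this volume.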
How solid is this data?
I want to be precise about what I actually know versus what I'm inferring.
| Claim | Evidence level | Source |
|---|---|---|
| OpenAI total: $10,316 | Observed | Daily cost export, all 31 days present |
| OpenAI model-level spend shares | Reconstructed | Token-price proxy normalized to daily actuals |
| Anthropic total: $23,729 | Reconstructed | Token logs + pricing inferred from overlapping cost/token rows |
| Anthropic partial: $13,121 | Observed | Cost export (covers 15 of 31 dates, ~48%) |
| Token volumes (both providers) | Observed | Usage/token exports |
| Spend tied to specific projects | Inferred | Workspace labels + git timeline + memory |
| Period spend by date range | Inferred | Mixed observed + reconstructed daily sums |
The biggest data gap is on Anthropic: the cost export stops at February 24 and then jumps to a single day on March 11. The token logs cover the full period, and the inferred pricing matches observed rows closely, so I trust the reconstructed total within a few percent. But I'd feel better with the complete cost export.
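The "matches observed rows closely" claim reduces to a per-day relative-error check on the overlap dates. A minimal version of that check, with placeholder daily figures rather than the actual rows:

```python
# Validation sketch: on dates where both the cost export and the
# reconstruction exist, compute the per-day relative error. Dollar
# figures are illustrative, not the actual daily rows.

observed = {"2026-02-20": 1800.0, "2026-02-21": 2400.0, "2026-02-22": 950.0}
reconstructed = {"2026-02-20": 1820.0, "2026-02-21": 2360.0, "2026-02-22": 955.0}

errors = {
    day: abs(reconstructed[day] - cost) / cost
    for day, cost in observed.items()
}
worst = max(errors.values())
print({d: f"{e:.1%}" for d, e in sorted(errors.items())}, f"worst: {worst:.1%}")
```

If the worst overlap-day error stays within a few percent, extrapolating the inferred rates to the missing dates is reasonable; a single overlap day with a large error would be a red flag that pricing changed mid-window.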
Was it worth it?
Verdict: Partly.
Real work shipped. Real infrastructure was built. Multiple public-facing products improved materially. But the spend was badly uncontrolled, a significant fraction was thrash, and the fix — deleting the API keys and moving to subscriptions — was straightforward and should have happened weeks earlier.
Worth it compared to what?
"Partly worth it" needs a benchmark. The real comparison is not versus doing nothing. It is versus a cheaper, slower, more disciplined version of the same month. Here are the two most obvious counterfactuals:
- Subscription plans from day one. Both OpenAI and Anthropic offer subscription tiers that cover a lot of high-volume interactive use. If I had started on those plans on February 11 instead of switching on March 13, the main interactive workloads would likely have been far cheaper than running them through direct API billing. The exact savings depend on plan tiers and caps, so I am not pretending this is a precise substitute.
- Smaller contexts and cheaper models. The Anthropic data is suggestive: an estimated 97.8% of Anthropic spend was cache mechanics on 200k–1M context windows. Running shorter contexts or using lighter models for some of the rewrite work would have produced a dramatically smaller bill. Whether it would have produced the same quality of output is harder to say — I don't have controlled evidence for that.
The uncomfortable answer is that a meaningful chunk of the spend looks avoidable with decisions that were available at the time, not just in hindsight. That is what keeps the verdict at "partly" instead of "yes."
The case for yes
- Durable systems shipped. OpenClawBrain went from a half-built prototype to a released package with proof artifacts, public documentation, and a real product site. The hardening PRs (replay watchdog, fast boot, stall recovery) directly addressed failures discovered during the spend window.
- Public surfaces improved. Both jonathangu.com and openclawbrain.ai look and read materially better than they did on February 11. That matters for a product trying to earn credibility.
- Learning and leverage. Running multiple AI agents across multiple projects at this scale taught me things about prompt caching economics, model-switching costs, and agent orchestration that I likely would not have learned as quickly at lower volume. The cost data itself is useful evidence.
- Multiple workstreams were active. OpenClawBrain, jonathangu.com, and at least some Bountiful and Pelican activity were moving through the same agent infrastructure. That matters, even though the project-level attribution is loose.
The case for no
- Massive waste in rewrite cycles. Rewriting the same site copy four times in one weekend is not productive iteration. It's expensive indecision running on a meter.
- The Anthropic caching bill was likely avoidable. An estimated ~$24k, most of it on cache writes for 200k–1M contexts, suggests a strategy mistake. Smaller contexts, smarter cache TTL choices, or shorter system prompts would likely have cut this dramatically — though the impact on agent effectiveness is uncertain.
- The fix was available, not just obvious in hindsight. Subscription plans from both providers appear to cover the main interactive workloads at far lower direct cost. Every day I ran those lanes on raw API billing after those plans existed was money left on the table.
- Attribution is weak. I can't cleanly tie most of the ~$34k to specific shipped artifacts. The workspace labels give me project-level buckets, but "guclaw workspace spent $X" doesn't tell me what that workspace actually accomplished per dollar.
The honest bottom line
I'd make the same bet (run hard on AI-assisted development across multiple projects), but I'd set the cost controls first. The work was real. The scale was real. The waste was also real, and it looks more like a failure of cost hygiene than a failure of the underlying approach.
About $34k bought a combined footprint of roughly 21.0B comparable tokens if you use Anthropic served-input-plus-output, or 22.3B if you use Anthropic raw cache-inclusive totals. Some of that built things that will last. Some of it burned on repeated passes over the same material. The ratio between those two is the thing I'd fix.
If you're running AI agents at scale against direct API billing: check your caching costs, check your context window sizes, and check whether a subscription plan covers your workload. Do it before the month ends, not after.
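The checks in that last paragraph are mechanical enough to automate. A minimal version, assuming a daily usage row with the field names below (these are assumptions about what a usage export might contain, not any provider's actual schema):

```python
# Minimal daily spend guard: flag a usage row when cache-write spend share
# or average context size crosses a threshold. Field names are assumed,
# not a real provider schema.

def flag_day(row: dict, cache_write_share_max: float = 0.5,
             avg_context_max: int = 200_000) -> list:
    """Return a list of warning labels for one day's usage row."""
    flags = []
    total = (row["cache_write_usd"] + row["cache_read_usd"]
             + row["input_usd"] + row["output_usd"])
    if total and row["cache_write_usd"] / total > cache_write_share_max:
        flags.append("cache-write-heavy")
    if row["requests"] and row["input_tokens"] / row["requests"] > avg_context_max:
        flags.append("huge-contexts")
    return flags

# A day shaped like this month's Anthropic peak would trip both checks:
day = {"cache_write_usd": 2100.0, "cache_read_usd": 900.0, "input_usd": 5.0,
       "output_usd": 70.0, "requests": 400, "input_tokens": 120_000_000}
print(flag_day(day))
```

Run something like this against each day's export in a cron job and the February spike pattern gets caught on day one instead of at month end.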
Appendix: Model concentration
OpenAI — top models by spend share Reconstructed
| Model | Spend share | Request share | Input token share |
|---|---|---|---|
| GPT-5.4 | 80.3% | 46.3% | 70.5% |
| GPT-5.2 | 17.9% | 15.5% | 24.8% |
| GPT-5 mini | 1.7% | 36.6% | 4.1% |
| All others | 0.1% | 1.5% | 0.2% |
Anthropic — top models by spend share Reconstructed
| Model | Spend share | Token share |
|---|---|---|
| Claude Opus 4.6 | 94.7% | 91.1% |
| Claude Sonnet 4.5 | 4.9% | 8.5% |
| All others | 0.4% | 0.4% |