How I Cut Claude Code Token Usage by 80% — Without Losing Quality
I was spending over $300 a day on tokens. That’s not an exaggeration — it’s a real number from my dashboard for March 4: $353.53 in a single day. Opus alone took $298 of that $353. And the worst part: a significant chunk of those tokens generated no value. They were noise.
I’ve written about using specifications to guide AI agents and about building coordinated agent teams. But I hadn’t yet covered something everyone using Claude Code daily eventually discovers: tokens are a finite resource, and optimizing how you use them is just as important as optimizing the code you produce.
This article is the result of weeks of testing, measuring, and tuning. These aren’t tricks — they are infrastructure changes that drastically reduced my token consumption without sacrificing output quality.
The problem: invisible tokens
Before optimizing, you need to understand where the tokens go. The answer will surprise you: most of them are not consumed by the code the agent generates. They’re consumed by everything else.
Every time an agent runs git status, the full output enters its context window. Every ls -la, every npm run build, every compilation error — all of it accumulates. In a typical development session, the agent runs between 20 and 50 commands. If each one returns between 500 and 5,000 tokens of output, you’re burning between 10,000 and 250,000 tokens just on terminal responses. Tokens the agent reads once and never needs again.
Then there’s the model. Opus is Anthropic’s most capable model. It’s also the most expensive. My March data confirms it: Opus consistently represented between 80% and 85% of my daily spend. Sonnet and Haiku split the rest. But not every task needs Opus. Many are solved just as well with Sonnet. And for some, Haiku is more than enough.
And then there are the MCPs. The Playwright MCP, for example, injects the full browser accessibility tree into every interaction. In 10 navigation steps, you accumulate over 10,000 tokens just in page states that aren’t even relevant anymore.
The good news: every one of these waste sources has a fix.
RTK: compress what the agent doesn’t need to read in full
RTK (Rust Token Killer) is a CLI proxy written in Rust that intercepts command output and compresses it before it reaches the LLM context. The agent doesn’t know it’s there — it just sees a clean response instead of the raw output.
How it works
RTK installs as a Claude Code PreToolUse hook. It intercepts each Bash command and rewrites it transparently:
- `git status` → `rtk git status`
- `npm test` → `rtk npm test`
- `ls -la` → `rtk ls -la`
The agent asks to run git status. RTK intercepts, runs rtk git status, and returns a compressed version. The agent receives the information it needs without the noise it doesn’t.
RTK applies four compression strategies: smart filtering (removes headers, hints, and visual noise), grouping (groups similar items), truncation (cuts redundancy while keeping context), and deduplication (collapses repeated lines with counters).
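To make deduplication concrete, here is a hypothetical before/after. The raw output and the compressed rendering are both invented for illustration; RTK's real formatting may differ:

```text
# Before: raw build output with repeated warnings
warning: unused variable `ctx` in src/build.rs
warning: unused variable `ctx` in src/build.rs
warning: unused variable `ctx` in src/build.rs
Compiling site v1.0.0
Finished in 4.2s

# After: a hypothetical deduplicated rendering with a counter
warning ×3: unused variable `ctx` in src/build.rs
Finished in 4.2s
```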
My real numbers
These are the numbers from my installation after 136 intercepted commands:
| Command | Without RTK | With RTK | Reduction |
|---|---|---|---|
| git status | 805 bytes | 268 bytes | 67% |
| ls -la | ~900 bytes | ~400 bytes | ~56% |
| npm run build (Astro) | 8,518 bytes | 221 bytes | 97% |
| Small files (cat) | 3,489 bytes | 3,489 bytes | 0% |
The pattern is clear: RTK saves the most on build logs, test output, and git status — exactly the most frequent commands in a development session. It does nothing for small file reads and greps with few results, which are already concise on their own.
My RTK dashboard shows 91.9% global savings across 136 commands. Out of 714,900 input tokens, only 58,200 reached the agent’s context. The other 656,700 were noise that RTK eliminated.
Installation
```sh
brew install rtk
rtk init -g
```
rtk init -g configures the Claude Code hook automatically in ~/.claude/settings.json. Restart Claude Code and you’re done — RTK starts intercepting from the next session.
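If you're curious what rtk init -g actually writes, Claude Code hooks live under a hooks key in ~/.claude/settings.json. The shape below follows Claude Code's documented hook format, but the "rtk claude-hook" command is a placeholder of mine, so check your own file for the real entry:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "rtk claude-hook" }
        ]
      }
    ]
  }
}
```

The matcher "Bash" is what scopes the hook to terminal commands, which is why RTK never touches file reads or edits.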
When it matters most
If you use Agent Teams with a 9-agent pipeline, each agent runs multiple commands (git status, npm test, build, etc.). RTK compresses everything transparently, and the savings multiply by the number of agents. In a full pipeline, that's hundreds of thousands of tokens saved on a single feature.
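A rough back-of-envelope makes the scale clear. The per-agent numbers here are made up but plausible; only the ~90% compression rate comes from my measured data:

```text
9 agents × ~25 commands each            ≈ 225 commands per pipeline run
225 commands × ~3,000 tokens of output  ≈ 675,000 raw tokens
at ~90% compression (my RTK average)    ≈ 600,000 tokens kept out of context
```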
Playwright CLI: the MCP that was burning tokens without you knowing
This was the optimization that surprised me the most because the problem was so invisible.
The Playwright MCP is the standard way to give an AI agent browser access. It works. But it has a fundamental design problem: every interaction injects the full page accessibility tree into the agent’s context.
In a typical testing session:
| | MCP | CLI |
|---|---|---|
| Tokens per session | ~114,000 | ~27,000 |
| Per interaction | ~1,000+ (full accessibility tree) | ~20 (shell command) |
| Across 7 steps | 7,400+ accumulated tokens of page state | ~150 tokens + 1 snapshot on disk |
| Savings | — | ~75% less |
The problem is cumulative. In 10 navigation steps, the agent has 10 versions of the accessibility tree stacked in its context. Old versions that are no longer relevant. And worse: the agent starts referencing elements from old versions because it has multiple page states mixed together.
The alternative: Playwright CLI
Playwright CLI is Microsoft’s answer to its own problem. Instead of injecting the browser state into the LLM context, it stores snapshots on disk as YAML. The agent runs shell commands to control the browser and reads snapshots only when it needs them.
```sh
playwright-cli open https://example.com
playwright-cli snapshot
playwright-cli fill e5 "user@example.com"
playwright-cli click e3
playwright-cli screenshot
playwright-cli close
```
Here e5 and e3 are references to elements in the snapshot. The agent reads the snapshot YAML, identifies the reference, and acts. The snapshot itself never enters the LLM context; it stays on disk.
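For intuition, a snapshot is roughly the page's accessibility tree serialized to YAML. This is an illustrative sketch of a login page, not the exact format the CLI produces:

```yaml
# illustrative snapshot; the real format may differ
- heading "Sign in" [level=1]
- textbox "Email" [ref=e5]
- textbox "Password" [ref=e6]
- button "Log in" [ref=e3]
```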
Installation and migration
```sh
npm install -g @playwright/cli@latest
playwright-cli install --skills
```
The second command installs the Claude Code skill that teaches the agent how to use the CLI instead of the MCP. After that, you disable the Playwright MCP from Claude Code (/mcp → select Playwright → Disable).
In my setup, I have the skill installed globally in ~/.claude/skills/playwright-cli/ so all my projects inherit it automatically.
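For reference, that global install is just a directory containing the skill's instructions. Assuming the standard skill layout (SKILL.md is the usual convention, but verify against your own install), it looks something like this:

```text
~/.claude/skills/
└── playwright-cli/
    └── SKILL.md   # teaches the agent to drive the browser via shell commands
```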
When to keep using the MCP
The CLI is better in almost every scenario where the agent has filesystem access. But if you’re using an agent in a sandbox without disk access (like Claude Desktop), the MCP is still your only option. It’s also more convenient for very short sessions (fewer than 5 interactions) where context accumulation isn’t a problem.
Model selection: the most obvious optimization nobody does
My March cost data, broken down by model:
| Day | Total | Opus | Sonnet | Haiku |
|---|---|---|---|---|
| Mar 2 | $106.82 | $85.66 (80%) | $13.59 | $7.57 |
| Mar 4 | $353.53 | $298.58 (84%) | $42.79 | $12.17 |
| Mar 7 | $202.32 | $172.72 (85%) | $25.66 | $3.95 |
Opus is between 10x and 15x more expensive than Sonnet. And yet I was using it for everything: implementing simple functions, generating migrations, writing tests that follow established patterns. All tasks where Sonnet produces exactly the same result.
When to use each model
The rule I follow now:
Opus: Architecture design, complex domain decisions, code review, debugging subtle problems. When I need judgment, not speed.
Sonnet: Implementation following specs, test generation, tasks with clear patterns. When project rules and the specification already define what to do, Sonnet executes just as well as Opus.
Haiku: Trivial tasks, formatting, boilerplate generation, lookup queries.
In my Agent Teams pipeline, the distribution is:
| Model | Agents |
|---|---|
| Opus | Architect, Devil’s Advocate, Code Reviewer |
| Sonnet | Domain Dev, Infra Dev, Unit Tester, Endpoint Tester, Documenter |
The agents that decide use Opus. The ones that execute use Sonnet. It’s the same logic as in a human team: the senior architect is paid more than the junior developer, and that’s fine — because their value lies in a different kind of work.
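In practice, the split is one line per agent. Claude Code custom agents declare their model in the frontmatter of their definition file; the example below is a simplified, hypothetical agent, assuming the standard subagent format:

```markdown
---
name: unit-tester
description: Writes unit tests that follow the project's established patterns
model: sonnet
---

You write unit tests. Follow the conventions of the existing test suite
and never modify production code.
```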
Context management: what the agent loads matters
Every Claude Code session starts by loading: the system prompt, tool definitions, the CLAUDE.md, the rules, the skills, and the custom agents. All of that consumes tokens before you even type a single line.
A quick look at my current context with /context:
- System prompt: 3,600 tokens
- System tools: 20,800 tokens
- Memory files (CLAUDE.md): 1,500 tokens
- Skills: 2,000 tokens
- Custom agents: 740 tokens
- Total before starting: ~28,600 tokens
That’s 14% of the 200k window consumed before writing anything. Can you optimize? Yes, but carefully.
.claudeignore
If you have a .gitignore, you need a .claudeignore. This file tells Claude Code which directories and files to ignore when indexing the project. Without it, the agent can end up reading node_modules/, dist/, logs, or build artifacts that don’t add anything.
```
node_modules/
dist/
.next/
coverage/
*.log
*.lock
```
It’s not direct token savings — it’s context savings when the agent searches files or explores the project.
/compact and /clear
/compact compresses the current conversation while keeping the essential context. Useful when you’ve had a long session and you notice the agent starting to repeat itself or losing the thread. You don’t lose what’s important, but you reclaim window space.
/clear wipes everything and starts fresh. More aggressive, but sometimes that’s what you need. If the previous task is done and you’re starting something new, /clear keeps the agent from dragging irrelevant context.
My rule: when /context shows more than 50% usage, I evaluate whether to compact. If I’m above 70%, I compact. If I’m switching tasks, I /clear.
Plan Mode
Shift+Tab activates Plan Mode. In this mode, the agent plans before executing. It looks like a small detail, but it has direct implications for token consumption.
Without Plan Mode, the agent can start implementing, realize the approach was wrong, backtrack, and re-implement. Every failed attempt consumes tokens. With Plan Mode, the agent proposes a plan, you validate it, and then it executes with confidence. Less back and forth, fewer wasted tokens.
For complex tasks, I always start in Plan Mode. For simple tasks, I go straight in.
Monitoring: what you don’t measure doesn’t improve
There’s no point optimizing if you can’t verify the optimizations work. These are the tools I use to measure.
ccusage
ccusage reads Claude Code’s local JSONL logs and generates cost reports. It needs no API or configuration — it works directly with npx.
```sh
npx ccusage daily --breakdown --compact --since 20260301
```
This gives me a daily breakdown by model: how much I spent on Opus, Sonnet, Haiku. It’s what allowed me to discover that Opus was eating 85% of the budget.
npx ccusage session is even more revealing — it shows the cost of each individual session. That’s how you identify which tasks consume the most and where there’s room to optimize.
rtk gain
```sh
rtk gain
```
RTK’s built-in dashboard shows total tokens, saved tokens, percentage reduction by command, and the ranking of which commands save the most. It’s the fastest way to validate that RTK is doing its job.
rtk gain --history gives the daily breakdown. rtk discover shows savings opportunities that RTK detected but didn’t apply (commands that could benefit from a rule that doesn’t exist yet).
/cost and /context
The native Claude Code commands. /cost shows tokens and cost for the current session. /context shows how the context window is distributed across system prompt, tools, memory, skills, and messages.
I use them as quick checks during sessions. If /cost shows an unexpected spike, I investigate which command caused it. If /context shows that messages occupy 70%, it’s time to compact.
The combined impact
None of these techniques is revolutionary on its own. The real impact comes from combining them.
RTK strips out terminal noise — 60-90% savings on command output.
Playwright CLI replaces an MCP that was accumulating unnecessary context — 75% fewer tokens in browser sessions.
Model selection matches cost to task type — switching from Opus to Sonnet on implementation reduces cost 10-15x for those tasks.
Context management keeps the window clean and the agent focused — fewer repetitions, less back and forth, fewer wasted tokens.
Monitoring closes the loop — you measure, you detect, you adjust.
My daily spend went from $200-350 down to $60-100 for the same volume of work. Not because I do less — because every token generates more value.
How much you save depends on how you work
If you use Claude Code for one-off tasks — a bug here, a function there — optimization has a moderate impact. The base consumption is already low and sessions are short.
But if you work like I do, with Agent Teams running 9-agent pipelines on top of formal specifications, every optimization multiplies. RTK saves tokens in every agent. Model selection cuts the cost of 5 of the 9 agents. Playwright CLI removes thousands of tokens from each e2e testing session.
The rule is simple: the more tokens you consume, the more impact optimizations have. And if you use agents, you consume a lot of tokens.
Conclusion
Tokens are the most expensive and most invisible resource in AI-assisted development. Most developers don’t know how much they spend, on what, or why. They install Claude Code, get to work, and at the end of the month they get a bill they don’t understand.
Optimization is not about restricting — it’s about being smart with a finite resource. RTK compresses what the agent doesn’t need to read in full. Playwright CLI removes an MCP that was accumulating unnecessary state. Model selection picks the right model for each task. And monitoring gives you the data to keep adjusting.
The first step is to measure. Run npx ccusage daily --breakdown and look where your budget is going. I guarantee you’ll find surprises.
P.S.: If you’ve found other ways to optimize token consumption, you’ll find me on Twitter as @lmmartinb. And if you don’t yet have a specification system for your agents, start with my article on SDD with OpenSpec.