This Free Tool Cuts Your Claude Token Costs by 95% (Setup in 10 Seconds)

A Netflix engineer got tired of watching his AI coding agent burn through tokens reading log files, so he built a free tool that sits between you and Claude and shrinks everything before it arrives. One real test went from 65,694 tokens to 5,118. Same answer, 92% less spend.

Here’s the thing. If you’ve used Claude Code or any coding agent for more than a week, you already know the pattern. You ask it to look at your project, it reads every file it can find, and your token usage jumps off a cliff before it’s written a single line back to you. Most of what you’re paying for isn’t thinking. It’s noise: build logs, repeated JSON, file contents the model only needed to skim.

That’s the exact problem Tejas Chopra, a senior engineer at Netflix, ran into while building an AI agent to handle SRE tasks like fetching logs and searching code. He couldn’t find a tool that fixed it without breaking the agent’s ability to actually do its job, so he built one. It’s called Headroom, it’s free, it’s open source, and it’s now sitting at over 37,000 stars on GitHub.

Let’s break down what it does, look at the actual numbers, and walk through getting it running on your machine.


What Headroom actually does

Headroom works as a middleman. It sits between your coding agent (Claude Code, Cursor, Codex, Aider, and a few others) and the model itself. Every tool output, file read, log dump, and search result passes through Headroom first. It detects what kind of content it’s looking at, compresses it using the right method for that content type, and only forwards what’s actually useful to the model.

It runs locally on your laptop, so your code and data never leave your machine to get compressed. It’s completely free under an Apache 2.0 license. And it doesn’t touch your system Python or your shell setup. Uninstall it and everything goes back to exactly how it was.

The part that matters most for daily use: it’s reversible. Headroom doesn’t throw information away. It compresses it and keeps the original cached locally, so if the model genuinely needs the full detail later, it can retrieve it. You get the savings without the model flying blind.


The numbers, verified

The claim that started this conversation is a 95% token cut, and the source video described a real test: reading every file in a project for a deep overview burned 65,000 tokens without Headroom, and the exact same task used 5,000 tokens with it running.

That lines up closely with the benchmark Headroom publishes on its own GitHub page. For an SRE incident debugging workload, the real-world numbers are 65,694 tokens before compression and 5,118 tokens after, a 92% reduction. Other workloads in the same benchmark table show similar territory: code search across 100 results dropped from 17,765 tokens to 1,408 (92%), and GitHub issue triage went from 54,174 to 14,761 (73%).

The number that should actually convince you isn’t the percentage. It’s the accuracy column sitting right next to it. On standard benchmarks, Headroom’s published results show no meaningful quality loss: GSM8K math accuracy stayed exactly the same before and after compression, and TruthfulQA factual accuracy actually went up slightly. Compression that costs you correctness isn’t a deal. Compression that doesn’t is.


Why this is happening at all

Here’s the part most people miss. When you talk to Claude in a chat window, you control what goes into the context. When an agent is working autonomously, it doesn’t. It reads files, runs commands, calls tools, and every single one of those outputs gets dumped into the conversation in full, whether the model needs all of it or not.

A directory listing doesn’t need to repeat the same boilerplate for every file. A build log doesn’t need every line when only the error matters. A JSON array of 200 search results doesn’t need every field for every result. None of that is the model’s fault and none of it is your fault. It’s just how raw tool output works by default, and it adds up fast across a long session.

Headroom’s approach is to put a content router in front of all of it. JSON gets compressed one way, code gets compressed in a way that’s aware of its actual syntax tree, and plain text gets compressed using a dedicated model trained specifically on this kind of agentic output. Each content type gets handled differently because each one wastes tokens differently.


Step-by-step: setting up Headroom

This takes about ten seconds of actual typing, plus however long the install takes to finish. Here’s the full walkthrough.

Step 1: Install Headroom

Open your terminal and run one of these, depending on your setup:

pip install "headroom-ai[all]"          # Python
npm install headroom-ai                 # Node / TypeScript

If you’re on a corporate machine with SSL inspection and the install fails with a certificate error, that’s a known issue tied to how the build pulls in Rust. The fix and full workaround are in the Headroom GitHub repo under the install section.

Step 2: Wrap your coding agent

If you’re using Claude Code, this is the one command that does the heavy lifting:

headroom wrap claude

This launches Claude Code through Headroom automatically. Every prompt, tool output, and file read now passes through the compression pipeline before it reaches the model. You don’t change how you use Claude Code at all. You just start it through this command instead of starting it directly.

Headroom also supports wrapping Codex, Cursor, Aider, and Copilot CLI the same way, so if you bounce between tools, the same setup covers all of them.

Step 3: Confirm it’s working

Run a normal task, the kind that usually burns through a chunk of your daily limit, like asking the agent to read through your whole project. Once it finishes, check the savings:

headroom perf

This shows you the before and after token counts for your actual session, not a generic benchmark. That’s the number that matters, because it’s your code and your workflow.

Step 4 (optional): Use it as a drop-in proxy instead

If you’re not using one of the supported coding agents directly and just want compression on any OpenAI-compatible client, run it as a standalone proxy with zero code changes on your end:

headroom proxy --port 8787

Point your existing setup at this local port instead of the provider’s API directly, and Headroom handles the compression transparently in between.


Where this helps, and where it doesn’t

Being honest about limits matters more than the headline number.

It’s a clear win if: you run Claude Code, Cursor, or a similar agent daily and regularly hit your usage limit before the day is done. It’s also a strong pick if you bounce between multiple agents, since Headroom shares memory and dedupes context across them.

It won’t do much for you if: you mostly send short, plain-text prompts in a chat window with no file reads or tool calls. Headroom skips anything under roughly 200 tokens or short arrays, because compressing something tiny costs more in overhead than it saves. The savings show up specifically on the noisy, repetitive stuff: logs, JSON, large file reads, and long agent sessions.

There’s also a real text-compression mode that adds a small delay and needs a one-time model download, so it’s built for cutting cost, not for shaving milliseconds off response time.


The bottom line

You don’t fix a token problem by writing shorter prompts. You fix it by stopping the waste that happens before your prompt even gets read, the logs, the JSON, the repeated file content nobody asked the model to memorize in full. Headroom does exactly that, runs locally, costs nothing, and the published numbers back up the claim instead of just repeating it.

If you’ve been rationing prompts toward the end of the week or paying overage charges on a coding subscription, this is worth the ten seconds it takes to install.

If this saved you from hitting your Claude limit before Friday, your developer friends are about to thank you. Share this with the one person on your team who always runs out of tokens first.


Sources

  1. Headroom, official GitHub repository and benchmark data
  2. Tejas Chopra, DEV Community, on why he built Headroom
  3. Tejas Chopra, LinkedIn profile
  4. Headroom official documentation

Leave a comment

Website Built by WordPress.com.

Up ↑