llmtrim cuts AI agent token costs with a local compression proxy
llmtrim is a local proxy that reduces the size of requests sent to AI models and of the answers returned from them. It trims repeated or less useful material from system instructions, tool descriptions, chat history, tool output, code, and long context before the request reaches the model provider. In 112 live A/B tests, input tokens fell by 31%, output tokens fell by 74%, total tokens fell by 43%, and round-trip cost dropped from $0.0365 to $0.0126, a 66% reduction.
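The reported figures can be cross-checked with a little arithmetic. The sketch below is illustrative only: the helper names are made up for this example, and the output-share calculation assumes the three token percentages were computed over the same pooled totals, which the summary does not state.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100.0


def implied_output_share(input_cut: float, output_cut: float, total_cut: float) -> float:
    """Baseline share of output tokens that makes the three cuts agree.

    Solves input_cut * (1 - s) + output_cut * s = total_cut for s.
    Assumption: all three cuts are measured on the same pooled token counts.
    """
    return (total_cut - input_cut) / (output_cut - input_cut)


# Dollar amounts and percentages as quoted in the summary above.
cost_cut = percent_reduction(0.0365, 0.0126)
output_share = implied_output_share(0.31, 0.74, 0.43)

print(f"cost reduction from rounded figures: {cost_cut:.1f}%")   # 65.5%
print(f"implied baseline output share: {output_share:.1%}")      # 27.9%
```

Recomputing from the rounded dollar amounts gives about 65.5%, close to the reported 66%; the small gap is plausibly due to rounding in the published figures. Under the pooled-totals assumption, output tokens would have made up roughly 28% of baseline traffic.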
Answer scores were 78.9% for the original requests and 82.2% for the compressed requests, but the difference sits inside the stated error range, so the safer reading is no clear quality loss rather than better answers. On live Claude Code traffic, the project claims to cut 68% of compressible input while leaving the cached prefix untouched, so existing prompt cache discounts can still apply. If a change does not save tokens, llmtrim rolls it back; if the provider rejects a compressed request, it retries with the original request.
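The two safeguards described here, rolling back changes that save nothing and retrying the original request after a rejection, can be sketched roughly as follows. This is not llmtrim's actual code: `ProviderRejected`, `apply_transforms`, `send_with_fallback`, and the callback parameters are all hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable


class ProviderRejected(Exception):
    """Hypothetical error raised when the provider refuses a request."""


def apply_transforms(
    text: str,
    transforms: Iterable[Callable[[str], str]],
    count_tokens: Callable[[str], int],
) -> str:
    """Apply each transform in turn, keeping it only if it saves tokens."""
    current = text
    for transform in transforms:
        candidate = transform(current)
        # Roll back (discard) any change that does not strictly reduce tokens.
        if count_tokens(candidate) < count_tokens(current):
            current = candidate
    return current


def send_with_fallback(
    original: str,
    transforms: Iterable[Callable[[str], str]],
    count_tokens: Callable[[str], int],
    send: Callable[[str], str],
) -> str:
    """Send the compressed request; retry the original if it is rejected."""
    compressed = apply_transforms(original, transforms, count_tokens)
    if compressed == original:
        # Nothing was saved, so there is no compressed variant to try.
        return send(original)
    try:
        return send(compressed)
    except ProviderRejected:
        # The provider refused the compressed form: fall back to the original.
        return send(original)
```

In this sketch the caller supplies the token counter and the sender, so the same fallback logic could wrap any provider client. The cost of a rejection is one extra round trip with the original request.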
It can be installed as a and works with many providers, including OpenAI, Anthropic, Google, and others, as long as the tool respects HTTPS proxy settings. Important limits remain: the default mode is quality-gated rather than fully lossless, and token counts for Anthropic and Gemini are approximate because exact public tokenizers are not available.
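For tools that honor HTTPS proxy settings, routing traffic through a local proxy usually means setting the `HTTPS_PROXY` environment variable before launching the tool. The sketch below shows that general pattern only. The address `http://127.0.0.1:8080` is a placeholder, not llmtrim's documented endpoint, and `proxied_env` and `run_through_proxy` are hypothetical helper names.

```python
import os
import subprocess
from typing import Mapping, Optional, Sequence

# Placeholder address for a local proxy; an assumption, not a documented value.
PROXY_URL = "http://127.0.0.1:8080"


def proxied_env(proxy_url: str, base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with HTTPS proxy variables set."""
    env = dict(os.environ if base is None else base)
    # Tools differ on which spelling they read, so set both.
    for name in ("HTTPS_PROXY", "https_proxy"):
        env[name] = proxy_url
    return env


def run_through_proxy(
    command: Sequence[str], proxy_url: str = PROXY_URL
) -> subprocess.CompletedProcess:
    """Launch a command whose HTTPS traffic should go via the local proxy."""
    return subprocess.run(list(command), env=proxied_env(proxy_url), check=False)
```

A call such as `run_through_proxy(["my-agent", "--task", "..."])` would start a hypothetical agent with the proxy variables in place. Tools that ignore these variables would bypass the proxy entirely, which is why the compatibility caveat above matters.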
Key points
- A local proxy compresses requests and responses to reduce token use.
- Live tests reported 31% fewer input tokens, 74% fewer output tokens, and 66% lower round-trip cost.
- Claude Code traffic reportedly kept prompt cache savings while cutting 68% of compressible input.
- llmtrim rolls back changes that do not reduce tokens and retries the original request if compression fails.
- The default mode is not fully lossless, and Anthropic and Gemini token counts are approximate.