A Reddit benchmark compares token costs across 8 code LLM providers

A Reddit user said they compared 8 LLM providers for a code generation pipeline that uses about 50 million tokens per month. They used the same 200 coding tasks and measured success rate, cost per task, and latency. The post claims DeepSeek V3 had low cost with an 83% pass rate, while a secondary market endpoint matched OpenAI and Anthropic model quality at about 10% of the direct cost.

Key points

  • The providers listed were OpenAI, Anthropic, Groq, Together, Fireworks, OpenRouter, DeepSeek API, and a secondary market endpoint.
  • The test used 200 coding tasks, including writing functions, refactoring, adding tests, and debugging.
  • The metrics were pass@1 (the share of tasks solved on the first attempt), total cost per task, and P95 latency (the response time that 95% of requests stay under).
  • The post says DeepSeek V3 cost $0.42 per 1 million completion tokens with an 83% pass rate.
  • The post claims the secondary market endpoint had the same quality at lower cost, but its operational risk needs separate checking.
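
To make the three metrics concrete, here is a minimal sketch of how they could be computed from per-task results. It is not the Reddit user's actual harness, which the post does not show. The TaskResult record, the function names, and the prompt-token price parameter are all assumptions made for illustration; the only figure taken from the post is the $0.42 per 1 million completion tokens quoted for DeepSeek V3.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record for one benchmark task (not from the post)."""
    passed: bool            # did the first generated attempt pass the task's checks?
    prompt_tokens: int
    completion_tokens: int
    latency_s: float        # wall-clock seconds for the request


def pass_at_1(results):
    """Fraction of tasks solved on the first attempt."""
    return sum(r.passed for r in results) / len(results)


def cost_per_task(results, prompt_price_per_m, completion_price_per_m):
    """Mean dollar cost per task, given prices per 1 million tokens."""
    total_dollars = sum(
        r.prompt_tokens * prompt_price_per_m
        + r.completion_tokens * completion_price_per_m
        for r in results
    ) / 1_000_000
    return total_dollars / len(results)


def p95_latency(results):
    """Nearest-rank 95th percentile of request latency, in seconds."""
    ordered = sorted(r.latency_s for r in results)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up data: 200 tasks, 166 of which pass (83%), to mirror the
    # pass rate quoted in the post. Token counts and latencies are invented.
    demo = [TaskResult(i < 166, 1_000, 2_000, 1.0 + i * 0.01) for i in range(200)]
    print("pass@1:", pass_at_1(demo))
    # Prompt price set to 0.0 only because the post quotes no prompt price.
    print("cost per task ($):", cost_per_task(demo, 0.0, 0.42))
    print("P95 latency (s):", p95_latency(demo))
```

For a rough sense of scale: if all of the pipeline's roughly 50 million monthly tokens were billed at the quoted completion rate, the bill would be about 50 x $0.42 = $21 per month. The post does not give the split between prompt and completion tokens, so the real figure would differ.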

Quick term guide

code generation
When an AI tool writes computer code from a user's instruction.
debugging
The process of finding and fixing the cause of errors or unexpected behavior in code.
benchmark
A test used to compare speed, quality, or cost.
production
The live version of a service that real users use.
reliability
How consistently a tool works without failing or behaving unexpectedly.
liability
Legal responsibility for causing an accident or damage.
OpenRouter
A service that gives access to many AI models through a single API, making it easy to switch between them.
refactoring
The process of reorganizing and cleaning up the internal code of a program without changing what it actually does on the outside.