Agent results may depend as much on the harness as on the model

This Reddit post argues that AI agents can perform very differently even when they use the same model. According to the writer, the harness around the model (its tool setup, work loop, permissions, tests, and traces) affects results as much as the model itself. The post cites a LangChain example in which the same model rose from 52.8% to 66.5% on Terminal-Bench 2.0, a gain of 13.7 percentage points, after a harness change. It also says Vercel removed about 80% of an agent's tools and got better results.

Key points

  • The writer compares the model to a CPU, not a whole computer.
  • The harness includes tools, permissions, the work loop, tests, and traces.
  • The post says the same model can score much higher when the harness changes.
  • More tools are not always better; the post says Vercel improved results after removing many tools.
  • The post argues smaller or open-weight models may need more harness work to compete.
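
To make the harness idea concrete, here is a minimal sketch in Python. It is not taken from the post or from any real framework: the names Harness, stub_model, allowed, check, and max_steps are all hypothetical, and the "model" is a fixed stand-in function rather than a real model call. It shows the pieces listed above (tools, permissions, a work loop, a test, and a trace) and how changing only the harness changes the outcome for the same model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration; this is not any real framework's API.
# A "model" takes the goal and the results so far, and picks (tool_name, argument).
Model = Callable[[str, list[str]], tuple[str, str]]


@dataclass
class Harness:
    tools: dict[str, Callable[[str], str]]   # functions the agent can call
    allowed: set[str]                        # permissions: tool names it may use
    check: Callable[[str], bool]             # test: is the latest result good enough?
    max_steps: int = 5                       # work-loop budget
    trace: list[str] = field(default_factory=list)  # record of every step

    def run(self, model: Model, goal: str) -> bool:
        history: list[str] = []
        for step in range(self.max_steps):
            tool_name, arg = model(goal, history)
            if tool_name in self.allowed and tool_name in self.tools:
                result = self.tools[tool_name](arg)
            else:
                # Blocked by permissions (or unknown tool); the model sees the refusal.
                result = f"denied: {tool_name}"
            self.trace.append(f"step {step}: {tool_name}({arg!r}) -> {result}")
            history.append(result)
            if self.check(result):
                return True
        return False


def stub_model(goal: str, history: list[str]) -> tuple[str, str]:
    # Stand-in for a real model: tries a disallowed tool first, then a permitted one.
    if not history:
        return ("delete_files", "/")
    return ("upper", goal)


tools = {"upper": str.upper, "delete_files": lambda path: "deleted"}


def is_done(result: str) -> bool:
    return result == "HELLO"


# Same model, two harnesses that differ only in the step budget.
harness = Harness(tools=tools, allowed={"upper"}, check=is_done)
tight = Harness(tools=tools, allowed={"upper"}, check=is_done, max_steps=1)

solved = harness.run(stub_model, "hello")
solved_tight = tight.run(stub_model, "hello")

for line in harness.trace:
    print(line)
print("default harness solved:", solved)
print("one-step harness solved:", solved_tight)
```

In this toy setup the model wastes its first step on a blocked tool, so the harness with a single-step budget fails while the default one succeeds. Real harnesses are far more involved, but the split is the same: the model proposes actions, and the harness decides what is available, what is permitted, when to stop, and what gets recorded.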

Quick term guide

AI agents
AI tools that can carry out a series of steps toward a goal rather than answering once.
permissions
Settings that define what files or actions a system or user is allowed to access.
LangChain
A popular open-source framework for building AI agents and applications that chain together language model calls.
Terminal-Bench 2.0
A test that measures how well an AI system handles tasks in a computer terminal.
tool calls
Requests an AI system makes to run an external function, such as search or file access.
open-weight models
AI models whose internal weights (the numbers learned during training) are publicly released, so people can download, run, and modify them themselves, subject to each model's license.
reasoning
An AI system's ability to work through a problem in multiple steps to reach a solution.