Agent results may depend as much on the harness as on the model
This Reddit post says AI agents can perform very differently even when they use the same model. The writer argues that the harness (the tool setup, work loop, permissions, tests, and traces around the model) affects results as much as the model itself. The post cites a LangChain example in which the same model rose from 52.8% to 66.5% on Terminal-Bench 2.0 after a harness change. It also says Vercel removed about 80% of an agent's tools and got better results.
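To put the cited Terminal-Bench 2.0 figures in perspective, the change can be read as an absolute gain (percentage points) or as a relative improvement over the starting score. The short Python sketch below works through that arithmetic using only the two numbers quoted in the post; the function name `score_gain` is a hypothetical helper made up for this illustration.

```python
def score_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before              # difference between the two scores
    relative = absolute / before * 100     # gain as a share of the starting score
    return round(absolute, 1), round(relative, 1)


# Scores quoted in the post for the same model before and after the harness change.
points, relative = score_gain(52.8, 66.5)
print(f"{points} percentage points, {relative}% relative improvement")
# prints: 13.7 percentage points, 25.9% relative improvement
```

In other words, the reported harness change is worth about 13.7 points, or roughly a quarter more tasks passed than before, with no change to the model.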
Key points
- The writer compares the model to a CPU, not a whole computer.
- The harness includes tools, permissions, the work loop, tests, and traces.
- The post says the same model can score much higher when the harness changes.
- More tools are not always better; the post says Vercel improved results after removing many tools.
- The post argues smaller or open-weight models may need more harness work to compete.
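The points above can be made concrete with a toy sketch. It is an illustration only, not the post's code and not any framework's API: every name in it (`Harness`, `stub_model`, the `shout` and `reverse` tools) is hypothetical, and the "model" is a stand-in that always picks the first tool it is shown. The sketch assumes a harness is just the tools, the permissions, a bounded work loop, a test, and a trace wrapped around a fixed model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration; not taken from the post or any library.
Tool = Callable[[str], str]
Model = Callable[[str, list[str]], tuple[str, str]]  # (goal, tool names) -> (tool, argument)


@dataclass
class Harness:
    tools: dict[str, Tool]            # everything the agent could call
    allowed: set[str]                 # permissions: which tools it may actually use
    check: Callable[[str], bool]      # test that decides whether the goal is met
    max_steps: int = 5                # bound on the work loop
    trace: list[str] = field(default_factory=list)  # record of every step

    def run(self, model: Model, goal: str) -> bool:
        # Only permitted tools are shown to the model.
        visible = [name for name in sorted(self.tools) if name in self.allowed]
        for step in range(self.max_steps):
            name, arg = model(goal, visible)
            if name not in visible:
                self.trace.append(f"step {step}: denied {name}")
                continue
            result = self.tools[name](arg)
            self.trace.append(f"step {step}: {name}({arg!r}) -> {result!r}")
            if self.check(result):
                return True
        return False


def stub_model(goal: str, tool_names: list[str]) -> tuple[str, str]:
    # Stand-in for a real model: always picks the first tool it is offered.
    return tool_names[0], goal


tools = {"shout": str.upper, "reverse": lambda s: s[::-1]}
is_done = lambda out: out == "HELLO"

wide = Harness(tools, allowed={"reverse", "shout"}, check=is_done)
narrow = Harness(tools, allowed={"shout"}, check=is_done)
print(wide.run(stub_model, "hello"), narrow.run(stub_model, "hello"))
# prints: False True
```

The same stub model fails with the wider tool set and succeeds with the narrower one, because the extra tool distracts it. This is a contrived example and does not reproduce the LangChain or Vercel results; it only shows the kind of mechanism the post describes.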
Quick term guide
- AI agents
- AI tools that can carry out a series of steps toward a goal, not just answer once.
- permissions
- Settings that define which files a system or user may access and which actions it may take.
- LangChain
- A popular open-source framework for building AI agents and applications that chain together language model calls.
- Terminal-Bench 2.0
- A test that measures how well an AI system handles tasks in a computer terminal.
- tool calls
- Requests an AI system makes to an outside function, such as search or file access.
- open-weight models
- AI models whose internal weights are available for people to run themselves.
- open-weight
- The model's trained parameters are publicly released, so people can download and run it themselves, subject to the terms of its license.
- reasoning
- The ability of the AI to think through complex steps to find a solution.