A Claude Code workflow claims a 23.7% agent benchmark jump in one day

A Reddit poster says they have been using Claude Code to improve agents through repeated rounds of testing and fixes. They claim that using Fable in the same workflow raised the score on their hardest internal agent benchmark by 23.7% in one day. They also say they plan to share the structure they built so others can try the workflow.

Key points

  • The post describes a Claude Code workflow for improving agents.
  • The workflow collects traces, analyzes errors, patches the agent, builds evals, and repeats.
  • The poster says Fable found root causes better than Opus in this setup.
  • They claim a 23.7% gain on their hardest internal agent benchmark in one day.
  • They say they will share the scaffolding used to run the workflow.
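
The loop in the key points above (collect traces, analyze errors, patch the agent, build evals, repeat) can be sketched in Python. This is only an illustrative sketch, not the poster's scaffolding, which has not been shared yet. Every name here (Trace, ToyAgent, toy_patch, improve) is hypothetical, and the toy agent and patch function stand in for the steps that a coding assistant would handle in the workflow described.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trace:
    """One recorded agent run (hypothetical structure)."""
    task: str
    passed: bool
    error: str = ""  # empty when the run passed


@dataclass(frozen=True)
class ToyAgent:
    """Stand-in agent: a task named "kind:id" fails until "kind" is fixed."""
    fixed: frozenset = frozenset()

    def __call__(self, task: str) -> Trace:
        kind = task.split(":")[0]
        ok = kind == "ok" or kind in self.fixed
        return Trace(task, ok, "" if ok else kind)


def toy_patch(agent: ToyAgent, cause: str) -> ToyAgent:
    """Stand-in for the patch step: mark one failure cause as fixed."""
    return ToyAgent(agent.fixed | {cause})


def pass_rate(traces: list[Trace]) -> float:
    return sum(t.passed for t in traces) / len(traces)


def most_common_error(traces: list[Trace]):
    """Stand-in for error analysis: pick the most frequent failure."""
    errors = Counter(t.error for t in traces if not t.passed)
    return errors.most_common(1)[0][0] if errors else None


def improve(agent, tasks, patch, max_rounds=5):
    """Run collect -> analyze -> patch -> eval until no failures remain."""
    evals = []    # failing tasks kept as regression evals
    history = []  # pass rate measured at the start of each round
    for _ in range(max_rounds):
        traces = [agent(t) for t in tasks]   # collect traces
        history.append(pass_rate(traces))
        cause = most_common_error(traces)    # analyze errors
        if cause is None:
            break
        for t in traces:                     # build evals from failures
            if t.error == cause and t.task not in evals:
                evals.append(t.task)
        agent = patch(agent, cause)          # patch the agent
    return agent, history, evals
```

Calling improve(ToyAgent(), ["ok:1", "timeout:2", "timeout:3", "parse:4"], toy_patch) on this toy setup yields a pass-rate history of [0.25, 0.75, 1.0]. The numbers only show the shape of the loop and say nothing about the gain claimed in the post.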

Quick term guide

testing
The process of checking that software does what it's supposed to do, usually by running it and looking for errors.
workflow
A repeatable set of steps for getting a task done.
benchmark
A test used to compare speed, quality, or cost.
solo makers
People who build and launch their own products or services entirely on their own.
pattern
A group of related tickets that point to the same repeated problem.
AI tool
Software that uses artificial intelligence to help with tasks like writing, coding, or research.
import
To bring a file or folder into a tool so it can use it.
patches
Small code updates that fix bugs or security problems.