AI · Importance: Medium

A Claude Code workflow claims a 23.7% agent benchmark jump in one day

r/ClaudeAI · Jun 13, 2026 · 4h ago

The Reddit poster says they have been using Claude Code to improve agents through repeated testing and fixes. They claim that using Fable in the same workflow raised their hardest internal agent benchmark by 23.7% in one day. They also say they plan to share the structure they built so others can try the workflow.

Key points

The post describes a Claude Code workflow for improving agents.
The workflow collects traces, analyzes errors, patches the agent, makes evals, and repeats.
The poster says Fable found root causes better than Opus in this setup.
They claim a 23.7% gain on their hardest internal agent benchmark in one day.
They say they will share the scaffolding used to run the workflow.
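
The loop in the key points (collect traces, analyze errors, patch the agent, make evals, repeat) can be sketched roughly as below. This is a minimal illustration, not the poster's scaffolding, which has not been shared yet. Every name here (Trace, collect_traces, improvement_loop, and the analyze, patch, and make_evals callbacks) is hypothetical, and the agent is assumed to be a plain function from prompt text to output text scored by exact match.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption: an "agent" is just a function from prompt text to output text.
Agent = Callable[[str], str]
# Assumption: an eval set maps a case id to (prompt, expected output).
EvalSet = Dict[str, Tuple[str, str]]


@dataclass
class Trace:
    """One recorded run of the agent on one eval case (hypothetical shape)."""
    case_id: str
    passed: bool
    error: str = ""


def collect_traces(agent: Agent, evals: EvalSet) -> List[Trace]:
    """Run the agent on every eval case and record pass/fail by exact match."""
    traces = []
    for case_id, (prompt, expected) in evals.items():
        output = agent(prompt)
        passed = output == expected
        error = "" if passed else f"got {output!r}, expected {expected!r}"
        traces.append(Trace(case_id, passed, error))
    return traces


def pass_rate(traces: List[Trace]) -> float:
    """Fraction of traces that passed (0.0 for an empty list)."""
    return sum(t.passed for t in traces) / len(traces) if traces else 0.0


def improvement_loop(
    agent: Agent,
    evals: EvalSet,
    analyze: Callable[[List[Trace]], str],
    patch: Callable[[Agent, str], Agent],
    make_evals: Callable[[List[Trace]], EvalSet],
    max_rounds: int = 5,
) -> Tuple[Agent, List[float]]:
    """Repeat: collect traces, analyze failures, patch, add evals."""
    evals = dict(evals)  # copy so the caller's eval set is not mutated
    history: List[float] = []
    for _ in range(max_rounds):
        traces = collect_traces(agent, evals)
        history.append(pass_rate(traces))
        failures = [t for t in traces if not t.passed]
        if not failures:
            break  # nothing left to fix on the current eval set
        diagnosis = analyze(failures)      # e.g. a model's root-cause summary
        agent = patch(agent, diagnosis)    # apply a fix based on the diagnosis
        evals.update(make_evals(failures)) # add regression cases for next round
    return agent, history
```

In a setup like the one described, analyze and patch would presumably be model calls made through Claude Code rather than plain functions, and the returned history of pass rates is the kind of number a claim like the 23.7% gain would be read from.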

Quick term guide

testing: The process of checking that software does what it's supposed to do, usually by running it and looking for errors.
workflow: A repeatable set of steps for getting a task done.
benchmark: A test used to compare speed, quality, or cost.
Solo makers: People who build and launch their own products or services entirely on their own.
Pattern: A group of related tickets that point to the same repeated problem.
AI tool: Software that uses artificial intelligence to help with tasks like writing, coding, or research.
import: To bring a file or folder into a tool so it can use it.
patches: Small code updates that fix bugs or security problems.
