Open SourceImportance: High

On-policy distillation targets agent mistakes more directly

r/MachineLearningJun 4, 2026 · 9d ago

is a method that helps an AI model correct specific mistakes made during a task. If an calls a tool that does not exist, the method does not rely only on the final success or failure score, because that signal is spread across the whole run. Another model reads the run and marks the area just before the mistake by adding .

The original model is then run forward with those hints, without generating a fresh run from scratch. The hints make the model give lower probability to the bad choice, and the original model is trained to copy that improved judgment. The method is described as important work behind models such as Qwen 3.6 and 3.7, GLM-5.1, and .

PapersWithCode now groups the first paper, method notes, and later papers that cite or mention it.

Key points

trains on the model’s own task runs and focuses on the specific mistake inside the run.
A helper model adds near the error so the original model can learn a sharper correction.
The method can avoid generating a new run from scratch during this correction step.
It is especially relevant to that call tools and may choose the wrong action.
It is linked with for models including Qwen, GLM, and DeepSeek.

Read original ↗