Agent verification can improve safety but lower task completion
A tool-using agent can finish a task while still breaking a safety rule or policy. That makes task completion alone a weak measure of whether the agent worked well.
The research separates results into safe success, unsafe success, and failure. It tests this idea with tool-use scenarios and proposes a two-step setup: simple policy and tool checks first, then an LLM-based verifier for cases that need more context.
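The three outcomes and the two-step check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the setup used in the research: the names `Outcome`, `ToolCall`, `rule_check`, `verify`, and the tool names in the rule sets are all hypothetical, and `contextual_verifier` stands in for whatever context-aware check handles escalated cases. Counting a failed task as a failure even when a policy was violated is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    SAFE_SUCCESS = "safe_success"
    UNSAFE_SUCCESS = "unsafe_success"
    FAILURE = "failure"


def classify(task_completed: bool, violated_policy: bool) -> Outcome:
    """Map one run onto the three outcomes (assumed mapping)."""
    if not task_completed:
        return Outcome.FAILURE
    return Outcome.UNSAFE_SUCCESS if violated_policy else Outcome.SAFE_SUCCESS


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


# Hypothetical rule sets, for illustration only.
BLOCKED_TOOLS = {"delete_account"}   # always rejected by the simple check
NEEDS_CONTEXT = {"send_email"}       # escalated to the contextual verifier


def rule_check(call: ToolCall) -> Optional[bool]:
    """Step 1: cheap policy/tool check.

    Returns True (allow) or False (block) for clear cases,
    and None when the call needs more context to judge.
    """
    if call.tool in BLOCKED_TOOLS:
        return False
    if call.tool in NEEDS_CONTEXT:
        return None
    return True


def verify(call: ToolCall, contextual_verifier: Callable[[ToolCall], bool]) -> bool:
    """Step 2 runs only when step 1 cannot decide."""
    decision = rule_check(call)
    if decision is None:
        return contextual_verifier(call)
    return decision
```

In this sketch the contextual verifier is passed in as a plain callable, so the more expensive check is only invoked for calls the rules cannot settle.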
Verification can reduce unsafe success, but it can also lower task completion when the task has more steps. The tradeoff between safer behavior and lower completion is called the verifier tax.
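One way to make the tradeoff concrete is to compare outcome counts from runs with and without verification. The sketch below is an assumption about how that could be measured, not the metric defined in the research: the function names and the choice to report the tax as the drop in completion rate are hypothetical, and the outcome labels are plain strings.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("safe_success", "unsafe_success", "failure")


def summarize(outcomes: List[str]) -> Dict[str, float]:
    """Return the unsafe-success rate and overall completion rate."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    unknown = set(outcomes) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    counts = Counter(outcomes)
    n = len(outcomes)
    completed = counts["safe_success"] + counts["unsafe_success"]
    return {
        "unsafe_success_rate": counts["unsafe_success"] / n,
        "completion_rate": completed / n,
    }


def verifier_tradeoff(baseline: List[str], verified: List[str]) -> Dict[str, float]:
    """Compare runs without (baseline) and with (verified) verification.

    Assumed definitions:
      unsafe_success_reduction = baseline unsafe rate - verified unsafe rate
      completion_drop          = baseline completion - verified completion
    """
    b, v = summarize(baseline), summarize(verified)
    return {
        "unsafe_success_reduction": b["unsafe_success_rate"] - v["unsafe_success_rate"],
        "completion_drop": b["completion_rate"] - v["completion_rate"],
    }
```

A positive `unsafe_success_reduction` together with a positive `completion_drop` would correspond to the pattern described above: safer behavior paid for with fewer completed tasks.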
Key points
- Task completion alone can hide unsafe agent behavior.
- The research separates outcomes into safe success, unsafe success, and failure.
- The proposed setup uses simple checks first and an LLM-based verifier only for more contextual cases.
- Verification can cut unsafe success but may reduce task completion on longer tasks.
- Using rule checks before an LLM-based verifier can help control tokens and cost.