AI benchmarks should test models with safety filters on, not off

Most AI performance tests measure models without their safety filters active — but real services like Claude always have those filters on. A Reddit thread argues this gap makes benchmark scores misleading for everyday users.

When researchers rank AI models, they typically test raw capability with few restrictions. But the Claude or ChatGPT you actually use has safeguards running at all times: it refuses certain requests, avoids sensitive topics, and adds caveats. Those safeguards can slow the model down or make it decline tasks it could otherwise handle, so a 'raw' benchmark score may look better than what you experience day-to-day.

The post argues that publishing two sets of scores — one with safeguards on, one off — would give users a much clearer picture of which model actually performs best in real-world use. Without that, top benchmark rankings can be misleading, because the 'winner' might drop significantly once its safety filters are active.
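As a rough illustration of the two-score idea, the sketch below ranks models once by raw score and once by safeguards-on score, then reports the gap between the two. Everything in it is an assumption made for illustration: the model names, the `ScorePair` type, and all the numbers are invented placeholders, not real benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorePair:
    """Hypothetical pair of scores for one model on the same benchmark."""
    model: str
    raw: float          # score with safeguards off
    safeguarded: float  # score with safeguards on


def gap(pair: ScorePair) -> float:
    """Points lost when safeguards are switched on."""
    return pair.raw - pair.safeguarded


def rank(pairs, key):
    """Return model names ordered from highest to lowest by the given key."""
    return [p.model for p in sorted(pairs, key=key, reverse=True)]


# Placeholder numbers, invented purely to show how a ranking can flip.
pairs = [
    ScorePair("model_a", raw=90.0, safeguarded=78.0),
    ScorePair("model_b", raw=86.0, safeguarded=84.0),
    ScorePair("model_c", raw=80.0, safeguarded=79.0),
]

raw_ranking = rank(pairs, key=lambda p: p.raw)
safeguarded_ranking = rank(pairs, key=lambda p: p.safeguarded)

print("Raw ranking:          ", raw_ranking)
print("Safeguards-on ranking:", safeguarded_ranking)
for p in pairs:
    print(f"{p.model}: gap = {gap(p):.1f} points")
```

With these made-up numbers, the model that tops the raw ranking falls to last place once safeguards are on, which is the kind of reversal the post is worried about.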

Key points

  • Most AI benchmarks test models without safety filters, inflating scores vs. real use
  • Services like Claude always run with safeguards active, which can limit or slow responses
  • The gap between benchmark scores and real-world performance can be significant
  • Publishing 'safeguards on' scores alongside raw scores would make comparisons fairer
  • This matters most when choosing an AI tool for everyday tasks, not lab experiments

Quick term guide

safety filters
Built-in rules that stop an AI model from producing content it judges harmful or risky.
benchmark
A standard test used to compare the performance of AI models, for example on speed, quality, or cost.
AI model
The underlying program that powers an AI tool; it reads prompts and produces text, code, or answers.
safeguards
Safety controls that block or redirect risky AI responses.
responses
The answers an AI model returns to a prompt.