AI benchmarks should test models with safety filters on, not off
A Reddit thread argues that most AI performance tests measure models without their safety filters active, while real services like Claude always have those filters on, and that this gap makes benchmark scores misleading for everyday users.
When researchers rank AI models, they typically test raw capability with few restrictions. But the Claude or ChatGPT you actually use has safeguards running at all times: it refuses certain requests, avoids sensitive topics, and adds caveats. Those safeguards can slow the model down or make it decline tasks it could otherwise handle, so a 'raw' benchmark score may look better than what you experience day-to-day.
The post argues that publishing two sets of scores, one with safeguards on and one with them off, would give users a much clearer picture of which model performs best in real-world use. Without both, top benchmark rankings can mislead, because the 'winner' might score noticeably lower, and even lose its top spot, once its safety filters are active.
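The two-score comparison the post proposes can be sketched in a few lines of Python. This is only an illustration: the model names and numbers are made-up placeholders, not real benchmark results, and `ModelScores`, `rank`, and `compare` are hypothetical helpers that belong to no actual benchmark suite. The sketch shows how a model that leads on raw score could fall behind once safeguarded scores are used.

```python
from dataclasses import dataclass


@dataclass
class ModelScores:
    name: str
    raw: float          # score with safeguards off (placeholder value)
    safeguarded: float  # score with safeguards on (placeholder value)


def rank(models, key):
    """Return model names ordered best-first by the given score attribute."""
    ordered = sorted(models, key=lambda m: getattr(m, key), reverse=True)
    return [m.name for m in ordered]


def compare(models):
    """Report each model's rank under both conditions and its score drop."""
    raw_order = rank(models, "raw")
    safe_order = rank(models, "safeguarded")
    return [
        {
            "name": m.name,
            "raw_rank": raw_order.index(m.name) + 1,
            "safeguarded_rank": safe_order.index(m.name) + 1,
            "drop": round(m.raw - m.safeguarded, 2),
        }
        for m in models
    ]


# Illustrative placeholder data only; not measurements of any real model.
models = [
    ModelScores("model_a", raw=90.0, safeguarded=78.0),
    ModelScores("model_b", raw=86.0, safeguarded=84.0),
    ModelScores("model_c", raw=80.0, safeguarded=79.0),
]

for row in compare(models):
    print(row)
```

With these placeholder numbers, `model_a` ranks first on raw score but third with safeguards on, which is the kind of reordering the post says a single raw leaderboard would hide.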
Key points
- Most AI benchmarks test models without safety filters, which can inflate scores relative to real use
- Services like Claude always run with safeguards active, which can limit or slow responses
- The gap between benchmark scores and real-world performance can be significant
- Publishing 'safeguards on' scores alongside raw scores would make comparisons fairer
- This matters most when choosing an AI tool for everyday tasks, not lab experiments
Quick term guide
- safety filter
- A built-in rule that stops an AI model from producing content it judges harmful, dangerous, or risky.
- benchmark
- A standard test used to compare AI models on measures such as speed, quality, or cost.
- AI model
- The underlying program that powers an AI tool; it interprets prompts and produces text, code, or answers.
- safeguards
- Safety controls that block or redirect risky AI responses.
- responses
- The answers an AI model returns to a prompt.