What makes AI ‘safe’? A conversation on flawed benchmarks and public trust
Outdated tests are still being used to judge today’s AI systems—creating the illusion of safety while leaving real risks unmeasured.
Would you buy a car that only passed safety tests from 20 years ago, or trust a bridge inspected with 1950s standards? Probably not. Yet with AI, we may already be doing something just as risky—trusting systems that pass outdated tests while calling them 'safe'.
AI benchmarks are used to measure how well AI systems perform specific tasks. Think of them as report cards for AI models that tell us whether they are ready for real-world use. But what happens when those report cards are outdated or misleading?
I caught up with Rokas Gipiskis and Ayrton San Joaquin, AI researchers at AI Standards Lab, to discuss this critical problem. Their recent paper, presented at the ICML Technical AI Governance Workshop in Vancouver, examines the consequences for AI safety when benchmarks stop being useful and proposes a new framework for retiring outdated benchmarks.
Why benchmarks matter
You say benchmarks are "fundamental to evaluating model outputs" and "critical indicators for when frontier models exhibit dangerous capabilities." Why has this become so urgent?
Ayrton: There is a push to develop and deploy models faster, and benchmarks can’t keep up with these developments. We are lacking benchmarks in certain areas, especially for safety, because of how expensive and capital-intensive they are to construct. Even when new benchmarks are developed, as evidenced by the hundreds introduced at conferences every year, there is a lag in adoption because certain older benchmarks are already entrenched through widespread use.
But making sure that our tests for AI products actually work is basic quality assurance, just as it is for any other product we buy from the grocery store. It is high time for AI to mature as a technology, given its widespread adoption and the expected upheaval of our way of life, from job security to education and healthcare.
The problem of "safety-washing"
What is "safety-washing"—and why should people outside the AI world care?
Rokas: Safety-washing refers to exaggerated or misleading claims about the safety of AI models or systems. This can occur when companies point to outdated or flawed benchmarks without actually putting in the rigorous work to ensure real-world safety. This matters to everyone because AI systems are increasingly making decisions that affect our daily lives, from hiring processes to medical diagnoses to financial approvals – so deceptive safety claims can put real people at real risk.
Think of a car company claiming their vehicle is crash-proof because it passed a 20-year-old crash test that only examined frontal impacts at low speeds. If they market it as "safe" based on that alone, while ignoring side collisions and other key crash scenarios, they’re safety-washing.
Or recall when Volkswagen marketed its diesel cars as eco-friendly while secretly using software to cheat emissions tests, a clear case of greenwashing. Similarly, an AI model might be labeled "safe" simply because it passed benchmarks that don't reflect certain key risks.
Real-world risks
What are the direct risks when AI systems are evaluated using outdated benchmarks?
Ayrton: The fundamental goal of benchmarks, like any other type of evaluation, is to check how a given product will operate in the real world. This means that a benchmark should be as close to real-world conditions as possible. If it is not, then nobody knows what a product is, or is not, fully capable of. That gap in knowledge is where accidents can arise.
For example, you would hesitate to buy a self-driving car that had only been benchmarked on understanding Chinese road signage if you live in the U.S. You would have no way of knowing how well the car could follow American rules of the road, no matter how skilled it is in other areas, like driving in different weather and terrain.
What you can do
What should we, as users or the public, look out for when it comes to AI claims? Any red flags or smart questions to ask?
Rokas: When evaluating AI safety claims, watch for vague safety assertions without mention of specific tests or benchmarks, or references to performance on "leaderboards" without proper context or limitations.
It’s also important to think about how you will use the AI product in your context based on what’s currently publicly available about it. Is the product known to work well in your language, especially if you’re not in an English-speaking country? Can it reason about the domain you’re interested in?
To cut through potential safety-washing, ask questions like:
What specific benchmarks were used to test this model, and are they still relevant to current risks?
Were outside experts or independent groups involved in evaluating the model's safety?
Do the safety tests reflect real problems people might face?
Are the limitations of these benchmarks disclosed transparently?
Does the company openly admit what their evaluations can't measure or what the model might get wrong?
The challenge of change
One challenge with updating benchmarks is getting people to actually adopt new frameworks. What's the biggest hurdle to getting your proposed deprecation framework widely used?
Ayrton: I think the biggest challenge is that there is currently no incentive to deprecate benchmarks. On one hand, the publishing process in AI encourages people to develop benchmarks, but there is no reward for researchers who spot a flaw in their own benchmark and try to correct it. Once you publish a benchmark through a peer-reviewed paper, you move on to the next project.
On the other hand, in industry there is pressure to make your product look great to stand out from the crowd of competing AI products, which may involve cherry-picking certain benchmarks. Even with good intentions, a company may use flawed benchmarks because every other competing product uses them. If one company explains why it does not use a particular benchmark, that may be seen as simply trying to hide its product’s poor performance on it.
Rebuilding trust
How can your framework help rebuild public trust in AI safety evaluations? What's at stake if outdated benchmarks stick around?
Rokas: Our framework starts with the idea that AI safety reporting should be transparent and trustworthy. We encourage clear reporting of how and when benchmarks are updated, and why some may no longer be useful. This openness is key to building public trust, not just among experts, but for everyone affected by AI.
If outdated or flawed benchmarks are allowed to stay in use, the public could be misled into thinking that an AI system is safer than it really is. So this isn’t just a technical issue, but a trust issue as well.
For non-experts, the risk is that without these changes, people might put their trust in AI systems that haven't truly earned it. This isn’t just about improving product quality, but about helping everyone have a clearer view of the technology we’ll be exposed to every day.
The framework safeguards public understanding by ensuring that tools like the AI Safety Index rely on meaningful, up-to-date benchmarks, not ones selected to make model providers look good. By requiring transparent reporting and version control, and by allowing independent governance actors to intervene when needed, it helps ensure that evaluations better reflect model capabilities.
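To make that idea of transparent, version-controlled reporting a little more concrete, here is a minimal sketch, in Python, of what a machine-readable deprecation record for a benchmark could look like. Everything in it, from the BenchmarkRecord name to the citable_for_safety_claims check, is a hypothetical illustration under assumed field names, not part of the authors' framework.

from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class BenchmarkRecord:
    # Hypothetical, version-controlled metadata for one safety benchmark.
    name: str
    version: str
    release_date: date
    deprecated: bool = False
    deprecation_date: Optional[date] = None
    deprecation_reason: Optional[str] = None  # e.g. "saturated" or "data contamination"
    successor: Optional[str] = None           # benchmark recommended instead
    known_limitations: List[str] = field(default_factory=list)

    def citable_for_safety_claims(self) -> bool:
        # A safety claim should not lean on a deprecated benchmark,
        # or on one whose limitations were never disclosed.
        return not self.deprecated and bool(self.known_limitations)

# Example: a fictional benchmark retired after becoming saturated.
record = BenchmarkRecord(
    name="ExampleSafetyBench",
    version="1.2",
    release_date=date(2021, 5, 1),
    deprecated=True,
    deprecation_date=date(2025, 3, 1),
    deprecation_reason="Saturated: top models score above 95%, so results no longer discriminate.",
    successor="ExampleSafetyBench-2",
)
print(record.citable_for_safety_claims())  # prints False

A public, versioned registry of records along these lines could let anyone check whether a benchmark cited in a safety claim is still considered meaningful, which is the kind of transparency the framework argues for.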
Check out their paper on deprecating benchmarks, which provides detailed technical recommendations for the AI safety community.