Out of 1,000 real-world claims put in front of five frontier AI configurations available right now, exactly 328 got unanimous agreement.
For just over 67% of the sample — 672 claims — at least one model handed back a different verdict than the others. Not a different explanation. A different call. True. Mostly True. Misleading. False. Same claim, different answer.
And for 34% of the total, the gap between the two furthest-apart verdicts was two or more labels wide: one model calling something "True" while another calls it "Misleading" or worse. That's not a difference in shade. That's a different verdict on whether the claim holds up.
What They Did
Lenz Research ran what might be the most quietly important AI study this month. They took 1,000 recent claims that real users submitted to a fact-checking platform — none older than February 2026, none from synthetic benchmarks the models might have seen in training. Real things real people wanted verified.
Then they fed each claim to five frontier configurations: GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro from Perplexity — four distinct models, with Gemini 3 Pro run both with and without search. Each had to pick exactly one label from four options: True, Mostly True, Misleading, or False. No hedging. No qualifiers. Just the verdict.
The design matters. Earlier fact-check evaluations used public corpora like PolitiFact or AVeriTeC — datasets that have been online for years and almost certainly appear in model training data. Measuring disagreement there partly measures which model memorized the "official" labels. Lenz used claims submitted after the models' training cutoffs, so the models actually have to reason. Decrypt covered the finding. The HN thread has 300+ comments running the numbers from different angles.
Unanimous agreement: 328 out of 1,000. The rest is disagreement.
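For anyone who wants to run the same two counts on their own data, here is a minimal sketch of how they could be computed. It is not Lenz's code. It assumes the four labels sit on an evenly spaced ordinal scale, and the names (`LABELS`, `max_pair_gap`, `summarize`) and the sample rows are made up for illustration.

```python
# Hypothetical sketch: count unanimous claims and wide (2+ label) disagreements.
# Assumption: the four labels form an ordinal scale, one step between neighbors.
LABELS = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: position for position, label in enumerate(LABELS)}


def is_unanimous(verdicts):
    """True when every model returned the same label for a claim."""
    return len(set(verdicts)) == 1


def max_pair_gap(verdicts):
    """Distance in labels between the two furthest-apart verdicts."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def summarize(rows):
    """Aggregate per-claim verdict lists into the headline counts."""
    total = len(rows)
    unanimous = sum(is_unanimous(row) for row in rows)
    wide = sum(max_pair_gap(row) >= 2 for row in rows)
    return {
        "claims": total,
        "unanimous": unanimous,
        "split": total - unanimous,
        "gap_2_plus": wide,
    }


# Invented example rows, one list of five verdicts per claim.
example = [
    ["True", "True", "True", "True", "True"],
    ["True", "Mostly True", "True", "True", "True"],
    ["True", "Misleading", "Mostly True", "False", "True"],
]

print(summarize(example))
```

In the study's terms, "unanimous" is the 328, "split" is the 672, and "gap_2_plus" is the count behind the 34%.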
Why It's Not Just Nuance
The natural defense is: fact-checking is genuinely hard. These claims are subtle. Humans disagree on this stuff too.
Sure. But 34% is the hard number to wave away. Those aren't cases where models are splitting "True" versus "Mostly True", two reasonable positions on a nuanced question. These are gaps of two labels or more. One model is telling you the claim is basically accurate. Another is telling you it's significantly deceptive or wrong. Those aren't two honest readings of the same ambiguous evidence. At least one of them is substantially off.
Think of it like five reference books, all described as authoritative, all shelved next to each other. You look up a claim in the first one: True. You look in the third: Misleading. You look in the fifth: False. You don't go "oh, nuance." You go "something is wrong with at least two of these books." The question is which ones — and you can't know without a sixth source that isn't one of the five.
Most people only consult one AI. So they never see the disagreement.
I've Been Calling It Verification
I've been building AI applications for a while now. I built an interview agent as part of a work project. I've integrated AI into workflows where it has to make judgments about information — summarizing, classifying, validating.
And if I'm being honest: there were moments where I leaned on the model to settle something and called it checked. It felt like verification. What I was actually doing was outsourcing the uncertainty to a single model and then moving on.
When I run a claim past an AI, I'm not consulting a neutral oracle. I'm consulting GPT-5.4's inference about the world, shaped by its training data and its fine-tuning and all the choices that went into how it was built. A different model has a different worldview. On 67% of real claims in this study, those worldviews produce different verdicts.
I'm not saying I've been building recklessly. I think I'm careful. But somewhere in the background I did assume that the model and I were basically checking a thing together. I wasn't thinking of it as one of five that might disagree with each other on most of what I asked.
Maybe the disagreements cluster in genuinely ambiguous territory and the clear cases are fine. The research isn't done: there's a companion study mapping which claims produce systematic disagreement and why. So I'm not ready to say "AI fact-checking is broken." But I'm ready to say "AI fact-checking is not neutral," and I had been treating it as if it were.
The Quiet Part
There's an ecosystem of products that present one model's verdict as the answer. Not as this model's best inference, but as a resolved truth. Content moderation systems. Medical information assistants. Platform features people don't interrogate. At some point, each of them came down to trusting one model's call.
The models are probably each doing their best. Truth is genuinely complicated — "Mostly True" is a real category because reality is messy. I'm not surprised that models disagree; I'm surprised by how much, and on how many ordinary claims.
We've been reaching for AI as if it were a calculator. Same inputs, interchangeable output, trustworthy anywhere you find the brand. Calculators don't disagree with each other on 67% of inputs. The models do.
The question isn't just which AI is right. It's whether you know you're picking one when you reach for any of them.
I'm still going to use AI to think through things. I just don't think I can call it verification anymore.
That's a small thing. It probably changes how I build.