Can You Trust What Your AI Just Told You? A Blind Test of 5 AI Assistants

We had Claude design a benchmark, then grade five assistants blind, on questions none of them could have memorized. It put a rival in first place and called another rival's answer the single best of the test. Everything is below, including the prompts, so you can run it yourself.

We asked five of the best AI assistants in the world whether any global health emergency was active right now. One of them answered in calm, confident, well-organized prose: nothing was active, and the most recent emergency it could point to was from 2024.

Three weeks earlier, the World Health Organization had declared a public health emergency over an Ebola outbreak tearing across the Democratic Republic of the Congo and Uganda, with hundreds of confirmed cases and climbing. The assistant was not hedging. It was not unsure. It was confidently, completely wrong, and nothing in the way it wrote the answer would have warned you.

That is the failure mode this test was designed to expose. Every knowledge worker knows the small flush of doubt that follows an AI answer you cannot check. For a casual question it does not matter. For a client memo, a due-diligence summary, or a research brief that someone acts on, a confident wrong answer is worse than no answer at all, because you cannot see that it is wrong.

So we built that test. Five leading AI assistants (ChatGPT, Claude, Gemini, Perplexity, and Alani) each received the same three questions, all with correct answers that had changed after the models were trained. No tool could fall back on memory. Every one had to go find current information, cite it, and be honest about what it did not know. Then the answers were graded blind.

First, the disclosure that matters

Alani Hub is our product. This article is on our website, and we have every commercial incentive to tilt this in our favor. So before anything else, here is the part that makes tilting almost impossible: we did not design this test.

We did not write the questions. We did not write the scoring rubric. We did not pick the criteria. Claude, one of the five tools in the test, did all of that, before a single answer was generated. Our entire instruction was to give us one easy question, one medium question, and one hard question to compare the assistants, and to design how the answers would be scored. The decision to use questions whose answers had changed after training, and to grade on accuracy, sourcing, recency, and honesty, was Claude's, not ours. We then ran all five tools against those questions and handed the answers back to Claude to grade, with the labels stripped off.

Read the rest of this page assuming we are trying to tilt it. Then notice that the structure does not give us the lever.

Why you can trust these results

Four features of the method are doing the work, and every one of them cuts against us.

Claude wrote the test. The questions and the rubric were authored by Claude, not by anyone at bundleIQ. Our only instruction was the difficulty of the three questions: one easy, one medium, one hard. We could not have steered the test toward Alani's strengths, because we did not choose what it measured. The specific questions, and the dimensions they stress (recency, sourcing, and honest hedging), were the test designer's call, not ours.

The grading was blind. The five answers were stripped of their labels and called A through E. They were scored against the fixed rubric before anyone knew which answer came from which tool. At scoring time, "Alani" did not exist. Only "Tool C" did.

The judge was a competitor, not us. The scoring was done by Claude, which was itself one of the five tools being graded. It ranked a rival, Alani, above its own answers, and named yet another rival, Gemini, as having the single best individual response in the test. A judge that hands first place to one competitor and the best single answer to another, both ahead of its own work, is not working from our script.

You can run it yourself. The three prompts are printed in full below. Paste them into all five tools today and check our scoring against your own. The whole point of choosing questions with verifiable, current answers is that nothing here rests on our say-so.
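If you do re-run it, the blinding step is easy to reproduce. Here is a minimal Python sketch. The method above only says that labels were stripped and the answers were called A through E; the random shuffle, the function name `blind_labels`, and the placeholder answer text are our assumptions for illustration, not a record of how the original labels were assigned.

```python
import random


def blind_labels(answers_by_tool, seed=None):
    """Shuffle the tools and hand out anonymous labels A, B, C, ...

    Returns (blinded, key):
      blinded maps label -> answer text, which is all the grader sees
      key     maps label -> tool name, kept sealed until scoring is done
    """
    rng = random.Random(seed)
    tools = list(answers_by_tool)
    rng.shuffle(tools)
    labels = [chr(ord("A") + i) for i in range(len(tools))]
    blinded = {label: answers_by_tool[tool] for label, tool in zip(labels, tools)}
    key = dict(zip(labels, tools))
    return blinded, key


# Placeholder text only; these are not the real responses from the test.
answers = {
    "ChatGPT": "placeholder answer 1",
    "Claude": "placeholder answer 2",
    "Gemini": "placeholder answer 3",
    "Perplexity": "placeholder answer 4",
    "Alani": "placeholder answer 5",
}

blinded, key = blind_labels(answers, seed=42)
print(sorted(blinded))  # ['A', 'B', 'C', 'D', 'E']
```

Hand `blinded` to whoever is grading and keep `key` out of sight until every score is written down.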

None of this puts the result beyond question. The honest limits are near the end, and they are real. But if Alani came first, the simplest explanation is that it gave the better answers that day, on a test it had no hand in writing.

The bottom line

If you read nothing else:

Every model can be confidently, fluently wrong about the present. One assistant missed an active global health emergency that had been in the news for three weeks, and stated plainly that nothing was happening. Two others reported a months-old software version as current. Fluency is not currency.

An answer you cannot trace is an answer you cannot use. The tool with the highest factual accuracy in the entire test finished fourth, because it did not cite a single source, even when explicitly asked. In real work, an unsourced fact is just a rumor you have to re-verify yourself.

The race was close, and that is the honest headline. The top three finished within five points of each other. The ranking is real, but the more useful result is what the test exposes: the dimensions that separate a trustworthy answer from a dangerous one.

How the test worked

Every answer was graded on five criteria, 1 to 5 each, for 25 points per question and 75 per tool:

Accuracy. Are the facts correct, verified independently against primary sources?
Citation integrity. Were sources cited, and do they actually support the claim?
Completeness. Did it answer every part of the question?
Recency. Is the information current as of the test date?
Honesty. Does it hedge where it should and admit what it does not know?

The scale: 5 is fully correct and complete, 4 is correct with a minor gap, 3 is partial or thin, 2 is a real error, 1 is absent or wrong.
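To make the arithmetic concrete, here is a small Python sketch of the rubric as described: five criteria, each scored 1 to 5, summed to a subtotal out of 25. The names `CRITERIA` and `score_answer` are ours, and the validation rules are an assumption about how one might enforce the scale, not part of the original grading.

```python
CRITERIA = ("accuracy", "citation", "completeness", "recency", "honesty")
MAX_PER_CRITERION = 5
MAX_PER_QUESTION = MAX_PER_CRITERION * len(CRITERIA)  # 25


def score_answer(scores):
    """Check one answer's five criterion scores and return its subtotal."""
    if set(scores) != set(CRITERIA):
        raise ValueError(f"expected exactly these criteria: {CRITERIA}")
    for name in CRITERIA:
        value = scores[name]
        if not isinstance(value, int) or not 1 <= value <= MAX_PER_CRITERION:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    return sum(scores[name] for name in CRITERIA)


# Alani's Python row, taken from the scorecards in this article.
alani_python = {
    "accuracy": 5,
    "citation": 4,
    "completeness": 5,
    "recency": 5,
    "honesty": 5,
}
print(score_answer(alani_python), "of", MAX_PER_QUESTION)  # 24 of 25
```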

The questions were designed so that a model leaning on training data would fail. Each had a single verifiable answer at test time, and a built-in trap that punished guessing and rewarded actually checking.

The questions

Every tool received this identical block, with an explicit instruction to search the web first.

Easy: What is the most recent stable release version of Python, when was it released, and what are two new features it introduced?
Medium: What is the current US federal funds rate target range, when was the most recent Fed rate decision, and what did the Fed signal about future moves?
Hard: Has the WHO declared any new public health emergencies of international concern in the past 12 months? If so, what are they and what is their current status? If not, what was the most recent one declared, and is it still active?

The result

Rank	Tool	Python	Fed	WHO	Total
1	Alani	24	21	22	67 / 75
2	Claude	20	22	22	64 / 75
3	Gemini	21	23	18	62 / 75
4	ChatGPT	19	16	19	54 / 75
5	Perplexity	14	16	7	37 / 75

Alani finished first, but it did not sweep. Gemini took the Fed question outright at 23 of 25, the response the grader singled out as the best individual answer in the test, and Claude tied Alani on the hardest question. Alani won by being the only tool with no weak answer anywhere. The full per-criterion math is below, so you can trace every number.
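The ranking itself is just addition, and you can check it in a few lines. This Python sketch copies the per-question subtotals from the table above, recomputes each total, and lists which tool or tools posted the top score on each question. The variable names are ours.

```python
QUESTIONS = ("Python", "Fed", "WHO")

# Per-question subtotals, copied from the results table (order follows QUESTIONS).
subtotals = {
    "Alani": (24, 21, 22),
    "Claude": (20, 22, 22),
    "Gemini": (21, 23, 18),
    "ChatGPT": (19, 16, 19),
    "Perplexity": (14, 16, 7),
}

totals = {tool: sum(scores) for tool, scores in subtotals.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

for rank, tool in enumerate(ranking, start=1):
    print(f"{rank}. {tool}: {totals[tool]} / 75")

# Top score on each question, including ties.
leaders = {}
for i, question in enumerate(QUESTIONS):
    best = max(scores[i] for scores in subtotals.values())
    leaders[question] = [tool for tool, scores in subtotals.items() if scores[i] == best]
    print(f"{question}: best {best} by {', '.join(leaders[question])}")
```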

One detail to keep in mind as you read those numbers: Alani and the standalone Claude entry ran on the same underlying model, Claude Sonnet. So the gap between first and second is not the story of a smarter model. It is the same model scoring higher inside Alani than in the raw chatbot, which points the difference squarely at the system around the model rather than the model itself.

What each question revealed

Python was the precision test. The correct answer was Python 3.14, most recently shipped as 3.14.6 on June 10, 2026, with template string literals (t-strings) and deferred evaluation of annotations among the headline features of the 3.14 line. Only Alani and Gemini reported the current release. Claude, ChatGPT, and Perplexity all named an older version despite being told to search. The lesson for your work: "latest" is a moving target. A model that does not genuinely re-check will hand you last quarter's answer with this quarter's confidence, and version numbers, prices, and policies are exactly the kind of fact that quietly goes stale.
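If you want to check a "latest version" claim mechanically, compare version numbers as numbers, not as text. The Python sketch below does that with plain tuples. The reference value 3.14.6 is the release this article treats as correct on the test date; the "reported" values are hypothetical examples, not the versions the tools actually named, and the helper names are ours.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.14.6' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_stale(reported, current):
    """True when the reported version is older than the reference version."""
    return parse_version(reported) < parse_version(current)


CURRENT = "3.14.6"  # the answer this article treats as correct on June 11, 2026

# Hypothetical reported values, for illustration only.
print(is_stale("3.13.5", CURRENT))  # True
print(is_stale("3.14.6", CURRENT))  # False
print(is_stale("3.9.2", CURRENT))   # True, although "3.9.2" > "3.14.6" as plain strings
```

One design choice to be aware of: a bare "3.14" parses to a shorter tuple and counts as older than "3.14.6" here. Whether that should count as stale is a judgment call for whoever is grading. This sketch also handles only simple dotted release numbers, not pre-release tags.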

The Fed was the honesty test. Every tool correctly named the 3.50% to 3.75% target range, held unchanged at the April 29, 2026 FOMC meeting. So the separation came entirely from discipline. Gemini cited the primary Federal Reserve release and avoided asserting a precise vote detail it was not sure of. ChatGPT got the facts right but cited nothing at all. The lesson: when everyone knows the answer, trust comes down to who shows their work and who knows the edge of what they know. (One note for anyone running this later: the next FOMC decision lands the week of June 17, 2026, so this answer is current as of the June 11 test date and may have moved since.)

The WHO question was the recency trap, and it was decisive. The correct answer: on May 17, 2026, the WHO declared a Public Health Emergency of International Concern over a Bundibugyo-virus Ebola outbreak in the DRC and Uganda, still active at test time. Claude and Alani handled it best, with current case figures and the notable detail that this was the first time a Director-General declared such an emergency before convening an expert committee. Perplexity missed it entirely, reporting a 2024 mpox emergency as the most recent and stating nothing was currently active. The lesson: this is the failure mode that should scare a knowledge worker. Not a hedge, not an "I am not sure," but a confident, well-written, completely outdated answer about a live situation.

Full scorecards

Every score, for every model, on every criterion, so the ranking is fully auditable.

Alani, 67 / 75

Question	Accuracy	Citation	Completeness	Recency	Honesty	Subtotal
Python	5	4	5	5	5	24
Fed	4	4	5	4	4	21
WHO	4	4	5	5	4	22
Totals	13	12	15	14	13	67

No weak answer anywhere. It paired the current Python release with the most precise feature descriptions, gave a complete and dated Fed answer, and on the WHO question had the correct declaration date, case figures, and a non-obvious procedural detail. It tied Claude for the top completeness score, a perfect 15 of 15. Its only soft spot was citation polish: sources were present but not always linked to the primary document.

Claude, 64 / 75

Question	Accuracy	Citation	Completeness	Recency	Honesty	Subtotal
Python	4	4	5	3	4	20
Fed	4	4	5	4	5	22
WHO	4	4	5	5	4	22
Totals	12	12	15	12	13	64

Strong, well-sourced, and honest throughout. It earned a perfect honesty score on the Fed answer for attributing a forecast to its source rather than stating it as fact. One stale Python version number is essentially the entire three-point gap between second and first.

Gemini, 62 / 75

Question	Accuracy	Citation	Completeness	Recency	Honesty	Subtotal
Python	4	4	4	5	4	21
Fed	5	5	4	4	5	23
WHO	3	4	3	4	4	18
Totals	12	13	11	13	13	62

It produced the single best answer in the test, a near-flawless Fed response built on a primary source. But it described a Python feature imprecisely and got the WHO declaration date wrong while omitting context, and that weaker WHO showing is what kept it off the top step.

ChatGPT, 54 / 75

Question	Accuracy	Citation	Completeness	Recency	Honesty	Subtotal
Python	5	1	5	4	4	19
Fed	4	1	4	3	4	16
WHO	5	1	5	5	3	19
Totals	14	3	14	12	11	54

This is the most important row in the table. ChatGPT had the highest factual accuracy of any tool, and finished fourth, because it cited nothing on any question despite an explicit instruction to do so. Right facts you cannot trace do not hold up in real work, and the scoring reflects that.

Perplexity, 37 / 75

Question	Accuracy	Citation	Completeness	Recency	Honesty	Subtotal
Python	3	1	4	3	3	14
Fed	4	1	3	4	4	16
WHO	2	1	1	1	2	7
Totals	9	3	8	8	9	37

Thin, uncited answers throughout, and the single worst result in the test, a 7 of 25 on WHO for confidently missing an active emergency. It was reliable on old, settled facts and unreliable wherever currency actually mattered.
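To audit the scorecards yourself, the Python sketch below copies every cell from the five tables above, recomputes the per-criterion totals, and shows how the tool with the most accuracy points can still land fourth overall. The data structure and function name are ours.

```python
CRITERIA = ("Accuracy", "Citation", "Completeness", "Recency", "Honesty")

# One tuple per question (Python, Fed, WHO); columns follow CRITERIA.
scorecards = {
    "Alani": [(5, 4, 5, 5, 5), (4, 4, 5, 4, 4), (4, 4, 5, 5, 4)],
    "Claude": [(4, 4, 5, 3, 4), (4, 4, 5, 4, 5), (4, 4, 5, 5, 4)],
    "Gemini": [(4, 4, 4, 5, 4), (5, 5, 4, 4, 5), (3, 4, 3, 4, 4)],
    "ChatGPT": [(5, 1, 5, 4, 4), (4, 1, 4, 3, 4), (5, 1, 5, 5, 3)],
    "Perplexity": [(3, 1, 4, 3, 3), (4, 1, 3, 4, 4), (2, 1, 1, 1, 2)],
}


def criterion_totals(rows):
    """Sum each criterion column across the three questions."""
    return dict(zip(CRITERIA, (sum(column) for column in zip(*rows))))


for tool, rows in scorecards.items():
    per_criterion = criterion_totals(rows)
    print(tool, per_criterion, "total:", sum(per_criterion.values()))

most_accurate = max(scorecards, key=lambda t: criterion_totals(scorecards[t])["Accuracy"])
print("Highest accuracy:", most_accurate)  # ChatGPT
```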

The honest limits of this test

Rigor includes saying what a test cannot tell you.

Claude designed and graded it. Having a competitor write the questions and rubric removes our thumb from the scale, which is the point. But it also means the test reflects Claude's view of what matters and Claude's grading judgment. The questions and rubric were written first, before any answers existed, and the finished answers were graded blind. Even so, the designer and the grader were both Claude, so this is a single perspective. And two of the five entries, the standalone Claude tool and Alani, ran on Claude Sonnet, the same family as the grader. Blind labeling meant the grader could not tell which answers those were, but a grader that shares a lineage with some of the field is a real limit, so we flag it rather than bury it. We paired it with blind labeling and independent fact-checking against primary sources, but a panel of human graders would be stronger.

It is a single run on a single day. AI outputs vary between attempts. A re-run could shift individual scores. This is one snapshot, June 11, 2026, not a championship.

Three questions cannot represent every use case. They were built to stress currency, sourcing, and honesty. A test built around coding, summarization, or creative work could rank these tools differently.

Some figures move by the hour. The WHO case counts differed by snapshot date across tools, and the Fed answer reflects the April 29 decision as of the test date, with the next decision due the week of June 17. Neither changed the ranking, and both are flagged here rather than buried.

We would rather you trust a test that admits its limits than one that pretends it has none.

Why this matters for knowledge work

Strip away the scores and the test is about one question: when an AI tells you something, can you act on it?

The answer depends on three things: whether the answer is current, whether it is sourced, and whether the tool knows the edge of its own knowledge. A model can be brilliant at reasoning and still fail all three, and when it does, the cost lands on you: the re-checking, the stale figure in a deck, the buried error in a confident paragraph you forwarded to a client.

Now scale it. This was three questions on one afternoon. But you do not consult an AI three times; you consult it constantly. A working professional might query an assistant a dozen times a day, roughly 2,500 times a year, or about 100,000 times across a 40-year career. At that volume, an unverifiable answer stops being an annoyance and becomes a permanent tax. Every response forces a quiet choice: spend the minutes to re-check it, or trust it and absorb the risk.
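Here is that back-of-envelope math as a Python sketch you can adjust to your own habits. Only the dozen queries a day comes from the paragraph above. The 210 working days, the 40-year career, and the two minutes per re-check are our illustrative assumptions, not measurements.

```python
def query_volume(per_day, working_days_per_year, career_years):
    """Return (queries per year, queries per career)."""
    per_year = per_day * working_days_per_year
    return per_year, per_year * career_years


QUERIES_PER_DAY = 12         # "a dozen times a day"
WORKING_DAYS_PER_YEAR = 210  # assumption: lands the yearly figure near 2,500
CAREER_YEARS = 40            # assumption

per_year, per_career = query_volume(QUERIES_PER_DAY, WORKING_DAYS_PER_YEAR, CAREER_YEARS)
print(per_year)    # 2520
print(per_career)  # 100800

# Hypothetical re-check cost, purely to show how the "tax" adds up.
MINUTES_PER_RECHECK = 2
hours_per_year = per_year * MINUTES_PER_RECHECK / 60
print(hours_per_year)  # 84.0
```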

That asymmetry is the whole game. The time you save by not verifying is small and constant. The cost of the one stale figure that reaches a client, or the one missed development in a live situation, is large and unpredictable. A tool that hands you the source alongside the answer collapses that trade-off, because the verification is already half done.

What you are actually choosing between

Here is the part the scoreboard does not show.

ChatGPT, Claude, Gemini, and Perplexity are extraordinary models, and they have grown well beyond plain chat. Persistent projects, memory, connectors, and multi-file handling are all real and getting better, and we are not pretending otherwise. The distinction is one of center of gravity. In those tools, the model is the product and your knowledge is something you bring to it. The library you build up tends to live as scattered conversations, and it is tied to one vendor's ecosystem.

Alani is organized the other way around. It is a knowledge management system with those same leading models running on top of it. Your files, PDFs, videos, transcripts, and web sources become a persistent, searchable library you own. Answers can be saved as notes with their citations attached, so a claim is still traceable weeks later. You choose the model for the task (OpenAI, Claude, Minimax, with more coming) instead of being married to one. And you can work across many sources at once, not one document at a time.

To be straight about it: this test exercised exactly one muscle, going out to the live web, getting it right, and showing the receipts. It did not measure the knowledge base, the multi-source search, or the save-with-citations features. So do not read the result as proof of those. Read it as a symptom. When a tool is built end to end around traceable, reusable knowledge, "find the current fact and cite it" tends to be the thing it is quietly good at.

The decision

You are not really choosing the smartest model. They are all strong, they all improve every few months, and on any given question the lead changes hands. In this test alone, three different tools posted or shared the top score across the three questions. Chasing "the best model" is chasing a moving target.

What you are choosing is your home base: whether what you learn accumulates into something searchable and yours, whether the answers you rely on come with their sources attached, and whether you have to start over every time the leaderboard shifts. That is a decision you make once, and it compounds for the rest of your working life.

Do not take our word for any of it. The three prompts are above. Run them yourself. Then point Alani at your own files, your own sources, the questions that fill your actual workday, and see whether it changes how much you have to second-guess.

Create a free account and try it on your own work.


Methodology: at our request for one easy, one medium, and one hard question to compare the assistants, Claude, one of the five tools tested, chose the specific questions and designed the five-criterion scoring rubric, all before any answers were generated. On June 11, 2026, all five assistants received identical prompts in the same sitting, with an instruction to search the web first. Responses were labeled A through E and scored blind against the fixed rubric, then fact-checked against primary sources before identities were revealed. Alani generated its answers using Claude Sonnet, the same model that powered the standalone Claude entry, chosen from Alani's model picker, so the two differ only in the system wrapped around the model. Scores reflect one run and will vary with question set, tooling, and timing.
