Same Question. Different AI. Wildly Different Answers.

We asked 10 real questions to multiple AI models and scored their factual accuracy from 0 to 10. The gaps are striking.

Score legend: Reliable (8+) · Mixed (6–7.9) · Unreliable (<6)
1 Finance 7.5-pt gap

When Tax Accuracy Matters Most

“Does MA tax a 457(b) deferred comp payout if you later move out of state?”

9.5
GPT-4.1

“The short answer is no. Massachusetts cannot tax your 457(b) deferred compensation payouts if you are no longer a resident of the state. This is due to federal law 4 U.S.C. § 114.”

2.0
Grok 4 (Reasoning)

“Yes, Massachusetts will generally tax distributions from a 457(b) deferred compensation plan even if you move out of state after the deferral period.”

Three models correctly identified Massachusetts' source-state taxation rules for 457(b) deferred compensation. One model scored just 2.0 — providing incorrect tax guidance that could cost a retiree thousands. Multi-model comparison caught the error instantly.

2 Language 7.0-pt gap

Can AI Fake a Language?

“I need some common greetings and phrases in Botlikh, but I want to make sure they are accurate.”

9.0
GPT-5.1 (Thinking)

“I’m not able to reliably give you actual Botlikh phrases. Botlikh is a very small, severely endangered language, and there are almost no widely available teaching materials in my training data. If I tried to produce specific forms, I’d be guessing.”

2.0
Gemini 3 Pro

“Here are common greetings and phrases in Botlikh… “Wach’a” (to a male): Welcome. “Jach’a” (to a female): Welcome. Note: This literally translates to the imperative “Come.””

Botlikh is a rare Northeast Caucasian language spoken by roughly 5,000 people. The best model honestly refused to fabricate phrases. The worst model confidently invented translations. For low-resource languages, hallucination risk is extreme — only cross-model verification catches invented phrases.

3 History 7.0-pt gap

Getting History Right

“Was Darlene Diebler Rose held at a camp outside the main one in solitary confinement?”

9.0
Claude Sonnet 4.5

“Yes. Darlene Deibler Rose was held in solitary confinement at the Kempeitai prison in Macassar — separate from the main Kampili internment camp in Celebes (now Sulawesi), Indonesia.”

2.0
Gemini 2.5 Flash

“No, the camp where Darlene Deibler Rose was held in solitary confinement was Kampili itself, not a facility outside of it.”

Darlene Deibler Rose was a WWII missionary interned in Japanese camps in the Dutch East Indies. Two models accurately identified the historical details (FA: 9.0). One model confused key facts about the camps and locations (FA: 2.0). Historical research demands factual precision.

4 Creative 7.0-pt gap

When Creativity Meets Facts

“What upcoming video games would you adapt into a ballet performance, and why?”

9.0
GPT-4.1

“Hollow Knight: Silksong — the atmospheric world and melancholic themes fit perfectly with ballet. Gris — the story is told entirely through movement and color, with the protagonist regaining emotional strength.”

2.0
Sonar

“Tomb Raider: Legacy of Atlantis, Phantom Blade Zero, Crimson Desert — strong candidates for ballet adaptations due to their rich mythological themes.”

Even creative questions have factual components — do the suggested games actually exist? Two models provided thoughtful suggestions with real titles (FA: 9.0). One scored just 2.0, recommending a game that doesn't exist. Factual accuracy matters in every domain.

5 Technical 7.0-pt gap

When Product Claims Don’t Check Out

“Does the MikroTik GCC6010 have a built-in RADIUS server?”

9.0
Claude Sonnet 4.5

“No, the GCC6010 does not have a built-in RADIUS server. While it runs RouterOS and supports RADIUS client authentication, you need an external RADIUS server for 802.1X or Hotspot setups.”

2.0
Gemini 3 Pro

“Yes, the MikroTik GCC6010 includes a built-in RADIUS server as part of its RouterOS feature set. You can configure it under the RADIUS menu for hotspot and PPP authentication.”

Product specification questions have objectively correct answers. Three models correctly stated the GCC6010 does not have a built-in RADIUS server (FA: 9.0). One model confidently claimed it does (FA: 2.0). Wrong product specs lead to failed deployments and wasted budgets.
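The distinction the top answer draws, RADIUS client versus built-in RADIUS server, is worth making concrete: a device with only client support builds and sends Access-Request packets to an external server. Below is a minimal sketch of such a request per RFC 2865; the function name and the example credentials are illustrative, not MikroTik's implementation or API.

```python
import hashlib
import os
import struct

def radius_access_request(username: str, password: str, secret: bytes) -> bytes:
    """Build a minimal RFC 2865 Access-Request packet (what a RADIUS client sends)."""
    authenticator = os.urandom(16)  # 16-byte Request Authenticator

    # User-Password hiding (RFC 2865 §5.2): pad to a multiple of 16 bytes,
    # then XOR each 16-byte chunk with an MD5 chain keyed by the shared secret.
    padded = password.encode().ljust(-(-len(password) // 16) * 16, b"\x00")
    hidden, prev = b"", authenticator
    for i in range(0, len(padded), 16):
        digest = hashlib.md5(secret + prev).digest()
        chunk = bytes(a ^ b for a, b in zip(padded[i:i + 16], digest))
        hidden += chunk
        prev = chunk

    attrs = (
        bytes([1, 2 + len(username)]) + username.encode()  # User-Name (type 1)
        + bytes([2, 2 + len(hidden)]) + hidden             # User-Password (type 2)
    )
    # Header: Code=1 (Access-Request), Identifier, total Length
    header = struct.pack("!BBH", 1, 1, 20 + len(attrs))
    return header + authenticator + attrs
```

The server side (validating the secret, matching the user, replying Access-Accept or Access-Reject) is exactly the piece an external RADIUS server provides — which is why "client support" alone does not satisfy an 802.1X or Hotspot deployment.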

6 General 5.5-pt gap

The Brand Name Challenge

“What’s that bridal/bridesmaids brand spelled with mostly consonants? The name is in all capitals.”

9.5
GPT-4.1

“The brand you’re thinking of is BHLDN (pronounced “beholden”). It is a well-known wedding and bridesmaids’ brand, stylized in all capitals and mostly consonants. It is owned by Anthropologie.”

4.0
GPT-4o

“The bridal and bridesmaids brand you might be thinking of is “Amsale,” often styled in all capitals as AMSALE.”

The answer is BHLDN (Anthropologie’s bridal line). Three models identified it correctly (FA: 9.0–9.5), but one scored just 4.0, suggesting the wrong brand entirely. Even within the same model family, accuracy varied.

7 Legal 5.0-pt gap

Legal Accuracy Under Pressure

“Can a Sole Benefit Trust for a community spouse have principal distributed without restriction?”

8.0
GPT-5.2

“The community spouse can receive principal, but not “without restriction.” Most states treat “sole benefit” as requiring an actuarially-sound payout over the spouse’s life expectancy — not unlimited access.”

3.0
Grok 4.1 (Reasoning)

“The community spouse is not subject to a lifetime payout calculation. Principal access is unrestricted for the beneficiary’s needs — the trustee can invade principal discretionarily.”

Medicaid trust law is complex and state-specific. Two models provided accurate guidance on Sole Benefit Trust rules (FA: 8.0). One scored just 3.0, misrepresenting principal access rules. For estate planning questions where wrong advice has legal consequences, multi-model verification is essential.

8 Networking 4.0-pt gap

Deep Technical Precision

“When a VLAN-tagged frame traverses multiple switches, does the original VLAN tag stay intact through each hop?”

9.0
Claude Sonnet 4.5

“Transit VLANs don’t work that way. VLAN tags exist only on the wire between switches. When traffic is routed, the original tag is removed and a new one is applied. Tags are interface-specific, not persistent.”

5.0
Sonar Reasoning Pro

“A trunk port carries traffic for multiple VLANs and keeps VLAN tags intact. Tagged VLANs are used on trunk ports to identify which frame belongs to which VLAN.”

Network engineering questions require precise technical understanding. The top models explained 802.1Q tag behavior correctly (FA: 9.0). One model regurgitated generic search results about trunking instead of answering the actual question (FA: 5.0). For infrastructure decisions, partial accuracy can be worse than no answer.
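The 802.1Q mechanics behind the correct answer can be shown directly: the tag is a 4-byte field inserted after the source MAC, added or stripped per link rather than persisting end-to-end. A minimal sketch (hypothetical helper names, not a production frame parser):

```python
import struct

def add_dot1q_tag(eth_frame: bytes, vlan_id: int, pcp: int = 0) -> bytes:
    """Insert an 802.1Q tag after the dst/src MACs of an Ethernet frame."""
    tpid = 0x8100                             # Tag Protocol Identifier
    tci = (pcp << 13) | (vlan_id & 0x0FFF)    # PCP (3b) | DEI (1b) | VLAN ID (12b)
    tag = struct.pack("!HH", tpid, tci)
    # dst MAC (6) + src MAC (6) come first; the tag sits before the EtherType
    return eth_frame[:12] + tag + eth_frame[12:]

def strip_dot1q_tag(frame: bytes) -> tuple[int, bytes]:
    """Remove the tag (as a switch does on an untagged/access port) -> (vid, frame)."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    assert tpid == 0x8100, "not an 802.1Q-tagged frame"
    return tci & 0x0FFF, frame[:12] + frame[16:]
```

Because the tag is just header metadata, a switch can strip it on egress and a router rewrites it entirely when forwarding between VLANs — the point the generic trunking answer missed.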

9 Research 4.0-pt gap

Research You Can Trust

“Where are horses first mentioned in the Bible?”

8.0
GPT-4.1

“Horses are first mentioned in the Bible in Genesis 47:17: “And they brought their cattle unto Joseph: and Joseph gave them bread in exchange for horses…” (KJV)”

4.0
GPT-4.1 Nano

“The earliest references appear in Genesis 41:43 and Genesis 50:9 — Pharaoh “made him riding in the chariot.” The imagery implies horses, though the word itself is not explicitly used here.”

A straightforward research question with a verifiable answer (Genesis 47:17). Two models provided accurate citations (FA: 8.0). One smaller model scored just 4.0, citing incorrect passages. Even simple factual lookups show meaningful accuracy differences across models.

10 Self-Awareness 4.0-pt gap

How Honest Is Your AI?

“What is your level of proficiency in Georgian and Armenian?”

7.0
Claude Sonnet 4.5

“My Georgian proficiency is limited — I can handle basic phrases and script recognition, but complex grammar and colloquialisms are unreliable. Armenian is slightly stronger due to more training data, but I still make errors with Eastern vs. Western dialect distinctions.”

3.0
Mistral Nemo

“I have strong proficiency in both Georgian and Armenian. I can read, write, and translate fluently in both languages, including handling complex grammatical constructions and idiomatic expressions.”

When asked about its own language abilities, the best model gave a nuanced, honest self-assessment (FA: 7.0). The worst overstated its capabilities (FA: 3.0). This example tests model self-awareness — knowing what you don’t know is just as important as knowing what you do.

Across 10 real-world queries, factual accuracy varied by an average of 5.8 points between the best and worst model on the same question.

Stop Guessing Which AI to Trust

Run any query across multiple AI models and see Trust Score evaluations in real time. Every response. Every metric. Every time.

Try Search Umbrella Free →