What Is Trust Score? The AI Evaluation Framework Explained
TL;DR: Trust Score is a patent-pending AI evaluation framework that rates every AI response on a 0-10 scale across 7 metrics. Unlike synthetic benchmarks, it uses real queries from real users. Unlike single-metric tests, it measures readability, factual accuracy, consistency, relevance, style, model consensus, and human likeness. Think of it as the FICO Score of AI evaluation.
Why Another AI Evaluation System?
There's no shortage of AI benchmarks. MMLU, HumanEval, ARC, HellaSwag. Each measures something specific, and each has the same fundamental limitation: models can be optimized for the test without improving on real tasks.
Trust Score takes a different approach. Instead of testing models on curated datasets, it evaluates them on real questions from real users on Search Umbrella. Every response is scored by a separate evaluator model running asynchronously, the same way a human expert would review the output, but at scale.
The 7 Metrics
Trust Score evaluates every response across 7 dimensions. The composite score (0-10) is derived using a proprietary, patent-pending algorithm. Here's what each metric measures:
RC Readability / Clarity
How clear, well-structured, and easy to understand is the response? Evaluates logical organization, grammar, formatting, and complexity matching. Average: 7.9/10
FA Factual Accuracy
Are the facts verifiable and correct? Checks claims against known information, identifies hallucinations, and evaluates source reliability. This is the hardest metric for AI models and the most variable. Average: 7.5/10 (widest spread: 6.0-9.0)
SC Semantic Consistency
Is the response internally consistent? Detects contradictions and evaluates whether reasoning follows logically. Average: 8.4/10
RF Relevance / Focus
Does it actually answer the question? Measures topic adherence and whether the core question is addressed without unnecessary tangents. Average: 8.7/10
ST Style / Tone
Is the writing style appropriate for the context? A legal question should get a professional response; a creative request should get an imaginative one. Average: 8.3/10
ED Ensemble Disagreement
When multiple models answer the same question, do they agree? This metric is unique to multi-model platforms. High agreement increases confidence; disagreement flags potential issues. Average: 6.9/10 (the lowest average of the 7 metrics)
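The exact disagreement computation isn't published, but one simple way to approximate agreement between several model answers is pairwise text similarity. Here is a minimal illustrative sketch using mean pairwise Jaccard similarity over word sets; the tokenization and similarity measure are assumptions for illustration, not Search Umbrella's actual method:

```python
# Illustrative only: the actual Ensemble Disagreement computation is not
# published. This sketch scores agreement between multiple model answers
# with mean pairwise Jaccard similarity over word sets, scaled to 0-10.
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two word sets: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def agreement_score(answers: list[str]) -> float:
    """Mean pairwise similarity across all answers, on a 0-10 scale."""
    words = [set(ans.lower().split()) for ans in answers]
    pairs = list(combinations(words, 2))
    if not pairs:
        return 10.0  # a single answer cannot disagree with itself
    return 10.0 * sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Three models answering the same question: two agree, one diverges.
answers = [
    "the capital of australia is canberra",
    "canberra is the capital of australia",
    "the capital of australia is sydney",
]
print(round(agreement_score(answers), 1))
```

High scores mean the models converge on similar answers; a divergent answer, as in the example above, pulls the score down and would flag the query for closer review.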
HL Human Likeness
Does it read like a knowledgeable human or a corporate chatbot? Evaluates whether communication feels authentic. Average: 8.2/10
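The composite algorithm itself is proprietary and patent-pending, but as a rough illustration of how 7 per-metric scores (each 0-10) could roll up into a single 0-10 number, here is a minimal sketch using a weighted average. The weights and the averaging function are hypothetical, not Search Umbrella's actual formula:

```python
# Hypothetical illustration only: the real Trust Score composite is
# proprietary and patent-pending. This sketch just shows how 7 metric
# scores (each 0-10) might combine into one composite 0-10 number.

# The 7 Trust Score metrics, with made-up example weights.
WEIGHTS = {
    "RC": 1.0,  # Readability / Clarity
    "FA": 2.0,  # Factual Accuracy (weighted higher, hypothetically)
    "SC": 1.0,  # Semantic Consistency
    "RF": 1.5,  # Relevance / Focus
    "ST": 0.5,  # Style / Tone
    "ED": 1.0,  # Ensemble Disagreement
    "HL": 1.0,  # Human Likeness
}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of per-metric scores, clamped to 0-10."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)
    return round(min(10.0, max(0.0, weighted / total_weight)), 1)

# Example input: the average per-metric scores quoted in this article.
example = {"RC": 7.9, "FA": 7.5, "SC": 8.4, "RF": 8.7,
           "ST": 8.3, "ED": 6.9, "HL": 8.2}
print(composite(example))
```

Any monotonic combination of the 7 metrics would work here; the point is only that a single 0-10 composite can summarize several independently scored dimensions.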
The Data Behind Trust Score
- 2,637 evaluations across 32 models and 8 domains
- 1,360 real queries from Search Umbrella users (not synthetic benchmarks)
- 51.1% of queries used multiple models, enabling Ensemble Disagreement measurement
- Continuously updated as new evaluations are collected
How to Use Trust Score
Check the leaderboard to see how 32 models rank overall and by domain. Sort by any metric to find the model that best fits your use case.
Browse the examples to see Trust Score in action on 10 real queries where model accuracy varied dramatically.
Read the full methodology for a deeper explanation of each metric, including scoring ranges and examples.
Try it yourself on Search Umbrella, where every query is automatically evaluated with Trust Score across multiple AI models.
Frequently Asked Questions
What is Trust Score?
Trust Score is a patent-pending AI evaluation framework that scores AI model responses on a 0-10 scale across 7 metrics: readability, factual accuracy, semantic consistency, relevance, style/tone, ensemble disagreement, and human likeness. It evaluates models on real user queries, not synthetic benchmarks.
How is Trust Score different from AI benchmarks like MMLU?
Traditional benchmarks test models on fixed datasets that models can be optimized for. Trust Score evaluates models on real queries from real users on Search Umbrella. It also measures 7 dimensions of quality instead of a single accuracy number, and includes ensemble disagreement, a metric only possible on multi-model platforms.
Who created Trust Score?
Trust Score was created by Search Umbrella LLC, a multi-LLM orchestration platform. It is protected by a provisional patent. The evaluation framework powers the rankings on howismyai.com.
How many models does Trust Score evaluate?
Trust Score currently evaluates 32 AI models across 8 domains, based on 2,637 real-world evaluations. The data is continuously updated as new queries are evaluated on Search Umbrella.