GPT-4.1
OpenAI GPT-4
Rank #8 overall · 50 evaluations
Overall score: 8.64
Performance Metrics
Metric breakdown (chart): Style/Tone, Relevance, Semantic Consistency, Human Likeness, Readability, Factual Accuracy, Ensemble Agreement
Strengths
- Style/Tone: 8.96
- Relevance: 8.90

Areas for Improvement
- Ensemble Agreement: 7.61
- Factual Accuracy: 8.34
Performance by Domain (chart not reproduced here)
Head-to-Head Record
| Opponent | Wins | Losses | Ties | Avg Diff |
|---|---|---|---|---|
| Claude Sonnet 4.5 | 6 | 1 | 6 | +0.34 |
| Grok 4 (Reasoning) | 10 | 0 | 3 | +0.61 |
| Gemini 3 Flash | 5 | 0 | 8 | +0.30 |
| GPT-4.1 Nano | 1 | 1 | 8 | +0.15 |
| GPT-4o | 4 | 1 | 4 | +0.44 |
| GPT-5 Mini | 2 | 3 | 4 | +0.26 |
| Gemini 3 Pro | 3 | 0 | 1 | +5.98 |
| Gemini 2.5 Pro | 0 | 0 | 3 | +0.07 |
| Grok 3 Mini | 2 | 0 | 1 | +0.25 |
| Claude 3.5 Haiku | 2 | 0 | 1 | +0.52 |
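The per-opponent rows above can be rolled up into an overall head-to-head record. A minimal sketch of that aggregation (the tuple layout mirrors the table's columns; the roll-up itself is our addition, not part of the source data):

```python
# Head-to-head records transcribed from the table above.
# Tuple fields: (wins, losses, ties, avg_diff).
records = {
    "Claude Sonnet 4.5": (6, 1, 6, 0.34),
    "Grok 4 (Reasoning)": (10, 0, 3, 0.61),
    "Gemini 3 Flash": (5, 0, 8, 0.30),
    "GPT-4.1 Nano": (1, 1, 8, 0.15),
    "GPT-4o": (4, 1, 4, 0.44),
    "GPT-5 Mini": (2, 3, 4, 0.26),
    "Gemini 3 Pro": (3, 0, 1, 5.98),
    "Gemini 2.5 Pro": (0, 0, 3, 0.07),
    "Grok 3 Mini": (2, 0, 1, 0.25),
    "Claude 3.5 Haiku": (2, 0, 1, 0.52),
}

# Sum the win/loss/tie columns across all opponents.
wins = sum(w for w, _, _, _ in records.values())
losses = sum(l for _, l, _, _ in records.values())
ties = sum(t for _, _, t, _ in records.values())
decided = wins + losses  # matchups that were not ties

print(f"Overall: {wins}-{losses}-{ties} "
      f"({wins / decided:.1%} win rate in decided matchups)")
# → Overall: 35-6-39 (85.4% win rate in decided matchups)
```

Note that a simple mean of the Avg Diff column would be skewed by the Gemini 3 Pro outlier (+5.98 over only 4 matchups), so the sketch reports only the win/loss/tie totals.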