Grok 4.1 (Reasoning)
xAI (Grok) · Grok 4
Rank #22 overall · 99 evaluations
Overall score: 7.68
Performance Metrics
Metric Breakdown (chart axes): Relevance, Style/Tone, Semantic Consistency, Human Likeness, Readability, Factual Accuracy, Ensemble Agreement
Strengths
Relevance: 8.38
Style/Tone: 8.05
Areas for Improvement
Ensemble Agreement: 6.61
Factual Accuracy: 6.96
Performance by Domain
Head-to-Head Record
| Opponent | Wins | Losses | Ties | Avg Diff |
|---|---|---|---|---|
| Claude Sonnet 4.5 | 8 | 11 | 37 | +0.04 |
| Gemini 3 Pro | 9 | 7 | 22 | +0.29 |
| GPT-5.2 (Thinking) | 0 | 21 | 10 | -0.61 |
| GPT-5.2 | 0 | 9 | 7 | -0.60 |
| GPT-5.1 (Thinking) | 2 | 7 | 4 | -0.19 |
| Claude Sonnet 4.5 (Thinking) | 3 | 3 | 6 | +0.06 |
| Gemini 3 Flash | 0 | 4 | 5 | -0.46 |
| Grok 4.1 (Non-Reasoning) | 1 | 2 | 5 | -0.26 |
| Sonar Reasoning Pro | 4 | 0 | 2 | +4.06 |
| GPT-5.1 | 0 | 1 | 5 | -0.16 |
| Grok Code | 1 | 0 | 4 | +0.12 |
| Grok 4 (Non-Reasoning) | 0 | 1 | 4 | -0.34 |
| Grok 4 (Reasoning) | 1 | 0 | 2 | +0.09 |