claude-sonnet-4-6 dropped on common-sense-1. gpt-5.4-mini is still failing spatial-1, though its score is rising. claude-haiku-4-5 and gemini-2.5-flash are recovering.
June 23, 2026 — 12:40 PM CT
Drift Alerts
- SCORE_RISE openai/gpt-5.4-mini on spatial-1
- SCORE_DROP anthropic/claude-sonnet-4-6 on common-sense-1
- IMPROVEMENT anthropic/claude-haiku-4-5 on common-sense-1
- IMPROVEMENT gemini/gemini-2.5-flash on causality-1
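The alert labels above appear to follow a simple rule when compared with the scorecard values: a fail-to-pass flip is reported as IMPROVEMENT, while a large score change with unchanged pass status is reported as SCORE_RISE or SCORE_DROP. The following is a minimal sketch of that rule, not the report's actual implementation. The function name `classify_drift`, the `min_delta` threshold of 0.5, and the `REGRESSION` label for a pass-to-fail flip are all assumptions introduced for illustration.

```python
from typing import Optional


def classify_drift(prev_pass: bool, prev_score: float,
                   curr_pass: bool, curr_score: float,
                   min_delta: float = 0.5) -> Optional[str]:
    """Return an alert label, or None when nothing notable changed.

    min_delta is a hypothetical threshold; the report does not state one.
    """
    # A change in pass status takes priority over the size of the score change.
    if not prev_pass and curr_pass:
        return "IMPROVEMENT"
    if prev_pass and not curr_pass:
        return "REGRESSION"  # hypothetical label, not seen in this report

    # Pass status unchanged: alert only on a sufficiently large score move.
    delta = curr_score - prev_score
    if delta >= min_delta:
        return "SCORE_RISE"
    if delta <= -min_delta:
        return "SCORE_DROP"
    return None


# Values taken from the scorecard in this report.
print(classify_drift(False, 3.0, True, 3.33))    # claude-haiku-4-5, common-sense-1
print(classify_drift(True, 4.75, True, 3.33))    # claude-sonnet-4-6, common-sense-1
print(classify_drift(False, 2.25, False, 3.67))  # gpt-5.4-mini, spatial-1
```

Under these assumptions the sketch reproduces all four alerts listed above. Note that pass status is treated as an input rather than derived from the score, since the scorecard shows a 3.67 that fails alongside a 3.33 that passes.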
Provider Status
- OpenAI: Users may experience elevated errors in ChatGPT uploading and downloading files
- Anthropic: Elevated error rate across multiple models
- Anthropic: Elevated errors for Claude Opus 4.8
- Anthropic: Elevated errors across many models
- Anthropic: Elevated errors for Claude Opus 4.8
- Anthropic: Elevated Error Rates for Opus 4.8, Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5
- Anthropic: We’ve suspended access to Claude Mythos 5 and Claude Fable 5
Scorecard
| Model | ambiguity-1 | causality-1 | code-1 | common-sense-1 | logic-1 | math-1 | spatial-1 |
|---|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 | ✓ (4.33) | ✓ (4.5) | ✓ (4.5) | ✓ (3.33), was ✗ (3) | ✓ (5) | ✓ (5) | ✓ (5) |
| anthropic/claude-opus-4-6 | ✓ (5) | ✓ (5) | ✓ (4.33) | ✓ (4.33) | ✓ (5) | ✓ (5) | ✓ (5) |
| anthropic/claude-sonnet-4-6 | ✓ (4.5) | ✓ (5) | ✓ (4.5) | ✓ (3.33), was 4.75 | ✓ (4.83) | ✓ (5) | ✓ (5) |
| gemini/gemini-2.5-flash | ✓ (4.67) | ✓ (5), was ✗ (1.75) | ✓ (4.67) | ✓ (3.5) | ✓ (5) | ✓ (5) | ✓ (5) |
| gemini/gemini-2.5-pro | ✓ (4.83) | ✓ (4.83) | ✓ (4.67) | ✓ (4.83) | ✓ (5) | ✓ (5) | ✓ (5) |
| ollama/llama3 | — | — | — | — | — | — | — |
| openai/gpt-5.4 | ✓ (4.5) | ✓ (4.83) | ✓ (4.67) | ✓ (4.33) | ✓ (5) | ✓ (5) | ✓ (5) |
| openai/gpt-5.4-mini | ✓ (4.67) | ✓ (4.67) | ✓ (4.67) | ✓ (4.33) | ✓ (5) | ✓ (4.67) | ✗ (3.67), was 2.25 |
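
For readers who want to process the scorecard text directly, here is a rough sketch of a parser for a single table cell. It is an illustration only: the `Cell` type and `parse_cell` function are hypothetical names, and the pattern is inferred from the cells in this report. It tolerates an optional comma or space before `was`, and it assumes that a previous value shown without a ✓ or ✗ mark had the same pass status as the current one.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    passed: bool
    score: float
    prev_passed: Optional[bool]  # None when no previous value is shown
    prev_score: Optional[float]


# Matches e.g. "✓ (5)", "✓ (3.33), was ✗ (3)", "✗ (3.67)was 2.25".
_CELL = re.compile(
    r"^(?P<mark>[✓✗])\s*\((?P<score>\d+(?:\.\d+)?)\)"
    r"(?:,?\s*was\s*"
    r"(?:(?P<pmark>[✓✗])\s*\((?P<pscore1>\d+(?:\.\d+)?)\)"
    r"|(?P<pscore2>\d+(?:\.\d+)?)))?$"
)


def parse_cell(text: str) -> Optional[Cell]:
    """Parse one scorecard cell; return None for an empty or '—' cell."""
    text = text.strip()
    if text in ("", "—"):
        return None
    m = _CELL.match(text)
    if m is None:
        raise ValueError(f"unrecognized scorecard cell: {text!r}")
    passed = m["mark"] == "✓"
    score = float(m["score"])
    if m["pmark"]:
        # Previous value carries its own pass/fail mark.
        return Cell(passed, score, m["pmark"] == "✓", float(m["pscore1"]))
    if m["pscore2"]:
        # Assumption: an unmarked previous score means the status did not change.
        return Cell(passed, score, passed, float(m["pscore2"]))
    return Cell(passed, score, None, None)


print(parse_cell("✓ (3.33), was ✗ (3)"))
print(parse_cell("✗ (3.67)was 2.25"))
```

Rows with no data, such as ollama/llama3, yield `None` for every cell under this sketch.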
Model Status
- ↑ anthropic/claude-haiku-4-5 up
- → anthropic/claude-opus-4-6 stable
- ↓ anthropic/claude-sonnet-4-6 down
- ↑ gemini/gemini-2.5-flash up
- → gemini/gemini-2.5-pro stable
- → openai/gpt-5.4 stable
- ↑ openai/gpt-5.4-mini up
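The status list above is consistent with a simple roll-up of the drift alerts: a model with a rising or improving alert is marked up, one with a drop is marked down, and one with no alerts is marked stable. The sketch below shows one way such a roll-up could work. The function `model_status`, the `DIRECTION` mapping, and the net-sum rule for models with mixed alerts are assumptions; the report does not say how conflicting alerts are resolved.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from alert kind to trend direction.
DIRECTION = {"SCORE_RISE": 1, "IMPROVEMENT": 1, "SCORE_DROP": -1}


def model_status(models: List[str],
                 alerts: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Roll (kind, model, task) alerts up into up/down/stable per model."""
    net = defaultdict(int)
    for kind, model, _task in alerts:
        net[model] += DIRECTION.get(kind, 0)  # unknown kinds count as neutral
    return {
        m: "up" if net[m] > 0 else "down" if net[m] < 0 else "stable"
        for m in models
    }


# Alerts and models copied from this report.
alerts = [
    ("SCORE_RISE", "openai/gpt-5.4-mini", "spatial-1"),
    ("SCORE_DROP", "anthropic/claude-sonnet-4-6", "common-sense-1"),
    ("IMPROVEMENT", "anthropic/claude-haiku-4-5", "common-sense-1"),
    ("IMPROVEMENT", "gemini/gemini-2.5-flash", "causality-1"),
]
models = [
    "anthropic/claude-haiku-4-5", "anthropic/claude-opus-4-6",
    "anthropic/claude-sonnet-4-6", "gemini/gemini-2.5-flash",
    "gemini/gemini-2.5-pro", "openai/gpt-5.4", "openai/gpt-5.4-mini",
]
status = model_status(models, alerts)
for name in models:
    print(name, status[name])
```

With the four alerts from this report, the sketch reproduces the seven statuses listed above.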
Raw Data
- Detail log — full responses and judge verdicts per prompt
- JSON — structured data for programmatic access
- Markdown — plain text report
- responses.json — raw model outputs
- judgments.json — raw judge verdicts
- run.log — debug log
- Agent Skill — how to read and interpret this data
- Methodology — how evaluations work