gpt-5.4 regressed on code-1 (pass to fail). gemini-2.5-flash dropped on common-sense-1. gpt-5.4-mini is failing spatial-1 and gemini-2.5-flash is failing causality-1. claude-sonnet-4-6 rose on common-sense-1.
June 5, 2026 — 8:45 PM CT
Drift Alerts
- REGRESSION openai/gpt-5.4 on code-1
- SCORE_RISE anthropic/claude-sonnet-4-6 on common-sense-1
- SCORE_DROP gemini/gemini-2.5-flash on common-sense-1
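The three alert types above can be reproduced from a previous and a current score. The following is a minimal sketch, not the report's actual rule: `PASS_THRESHOLD` and `DRIFT_DELTA` are assumed values chosen only because they are consistent with this run (3.33 passes, 2.83 fails, and both score alerts involve a 1.0-point change), and `classify_drift` is a hypothetical helper name.

```python
from typing import Optional

# Assumed values: the report does not state its thresholds.
PASS_THRESHOLD = 3.0  # scores at or above this are treated as a pass
DRIFT_DELTA = 1.0     # minimum score change that raises an alert


def classify_drift(previous: float, current: float) -> Optional[str]:
    """Return an alert label for a score change, or None if nothing fires."""
    was_passing = previous >= PASS_THRESHOLD
    is_passing = current >= PASS_THRESHOLD

    # A pass that turns into a fail outranks a plain score drop.
    if was_passing and not is_passing:
        return "REGRESSION"

    # Round to two decimals so float noise (e.g. 4.67 - 3.67) does not
    # push a 1.0-point change just under the threshold.
    delta = round(current - previous, 2)
    if delta >= DRIFT_DELTA:
        return "SCORE_RISE"
    if delta <= -DRIFT_DELTA:
        return "SCORE_DROP"
    return None


if __name__ == "__main__":
    print(classify_drift(4.67, 2.83))  # openai/gpt-5.4 on code-1
    print(classify_drift(3.67, 4.67))  # anthropic/claude-sonnet-4-6 on common-sense-1
    print(classify_drift(4.67, 3.67))  # gemini/gemini-2.5-flash on common-sense-1
```

Under these assumed thresholds, the three calls reproduce the alerts listed above. The real evaluator may use different cutoffs or compare against more than one prior run.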
Provider Status
- OpenAI: Some users may experience issues accessing OpenAI accounts
- OpenAI: Voice mode availability impacted
- OpenAI: Users unable to sign in using Microsoft personal accounts
- OpenAI: Elevated error rates for Free users in conversations
- OpenAI: Image API requests failing with 401s
- OpenAI: Increased latency for Codex compaction for a subset of users
- Anthropic: Elevated errors on many Claude models
Scorecard
| Model | ambiguity-1 | causality-1 | code-1 | common-sense-1 | logic-1 | math-1 | spatial-1 |
|---|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 | ✓ (4.33) | ✓ (4.67) | ✓ (4.67) | ✓ (3.33) | ✓ (5) | ✓ (5) | ✓ (5) |
| anthropic/claude-opus-4-6 | ✓ (5) | ✓ (4.83) | ✓ (4.67) | ✓ (4.33) | ✓ (5) | ✓ (5) | ✓ (5) |
| anthropic/claude-sonnet-4-6 | ✓ (4.67) | ✓ (5) | ✓ (4.67) | ✓ (4.67), was 3.67 | ✓ (5) | ✓ (5) | ✓ (5) |
| gemini/gemini-2.5-flash | ✓ (4.83) | ✗ (2.33) | ✓ (4.83) | ✓ (3.67), was 4.67 | ✓ (4.83) | ✓ (5) | ✓ (5) |
| gemini/gemini-2.5-pro | ✓ (5) | ✓ (4.83) | ✓ (4.83) | ✓ (4.83) | ✓ (5) | ✓ (5) | ✓ (5) |
| ollama/llama3 | — | — | — | — | — | — | — |
| openai/gpt-5.4 | ✓ (4.5) | ✓ (4.67) | ✗ (2.83), was ✓ (4.67) | ✓ (4.33) | ✓ (4.83) | ✓ (5) | ✓ (5) |
| openai/gpt-5.4-mini | ✓ (4.5) | ✓ (4.83) | ✓ (4.83) | ✓ (4.5) | ✓ (4.67) | ✓ (4.83) | ✗ (2.67) |
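For readers who work from this plain-text scorecard rather than the JSON export, a cell can be split into its pass mark, current score, and optional previous score. This is a rough sketch that assumes the cell shapes seen in this table, including cells that carry a "was" annotation; `Cell` and `parse_cell` are hypothetical names, not part of the report's tooling.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    passed: bool
    score: float
    previous: Optional[float]  # None when the cell has no "was" annotation


# Matches "✓ (4.33)", "✓ (4.67)was 3.67", "✓ (4.67), was 3.67"
# and "✗ (2.83)was ✓ (4.67)". The previous pass mark, if present, is ignored.
_CELL = re.compile(
    r"^\s*([✓✗])\s*\(([\d.]+)\)"
    r"(?:,?\s*was\s*(?:[✓✗]\s*)?\(?([\d.]+)\)?)?\s*$"
)


def parse_cell(text: str) -> Optional[Cell]:
    """Parse one scorecard cell; return None for empty cells such as '—'."""
    match = _CELL.match(text)
    if match is None:
        return None
    mark, score, previous = match.groups()
    return Cell(
        passed=(mark == "✓"),
        score=float(score),
        previous=float(previous) if previous else None,
    )


if __name__ == "__main__":
    for raw in ["✓ (4.33)", "✓ (4.67)was 3.67", "✗ (2.83)was ✓ (4.67)", "—"]:
        print(repr(raw), "->", parse_cell(raw))
```

The structured JSON listed under Raw Data is likely the more reliable source for programmatic use. This parser is only a convenience for the rendered table.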
Model Status
- → anthropic/claude-haiku-4-5 stable
- → anthropic/claude-opus-4-6 stable
- ↑ anthropic/claude-sonnet-4-6 up
- ↓ gemini/gemini-2.5-flash down
- → gemini/gemini-2.5-pro stable
- ↓ openai/gpt-5.4 down
- → openai/gpt-5.4-mini stable
Raw Data
- Detail log — full responses and judge verdicts per prompt
- JSON — structured data for programmatic access
- Markdown — plain text report
- responses.json — raw model outputs
- judgments.json — raw judge verdicts
- run.log — debug log
- Agent Skill — how to read and interpret this data
- Methodology — how evaluations work