LLM Weather Report

Tracking raw LLM reasoning drift — pure endpoint, no agents

claude-sonnet-4-6 dropped on common-sense-1. gpt-5.4-mini still failing spatial-1, but its score is rising. claude-haiku-4-5 and gemini-2.5-flash recovering.

June 23, 2026 — 12:40 PM CT

Drift Alerts

SCORE_RISE openai/gpt-5.4-mini on spatial-1
SCORE_DROP anthropic/claude-sonnet-4-6 on common-sense-1
IMPROVEMENT anthropic/claude-haiku-4-5 on common-sense-1
IMPROVEMENT gemini/gemini-2.5-flash on causality-1
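
The alert labels above appear to follow from comparing each model's previous and current result on a prompt. The Python sketch below is one plausible reading, not the report's documented logic. Going by the scorecard, IMPROVEMENT looks like a fail-to-pass flip, while SCORE_RISE and SCORE_DROP look like score movement with an unchanged verdict. The 1.0 threshold, the REGRESSION label, and the function name are assumptions introduced for illustration.

```python
from typing import Optional

# Hypothetical threshold: the report does not state how large a score change
# must be before a SCORE_RISE or SCORE_DROP alert fires.
SCORE_DELTA_THRESHOLD = 1.0


def classify_drift(prev_pass: bool, prev_score: float,
                   curr_pass: bool, curr_score: float,
                   threshold: float = SCORE_DELTA_THRESHOLD) -> Optional[str]:
    """Return an alert label for one model/prompt pair, or None if stable."""
    # A verdict flip takes priority over the raw score movement.
    if not prev_pass and curr_pass:
        return "IMPROVEMENT"
    if prev_pass and not curr_pass:
        return "REGRESSION"  # hypothetical label, not seen in this report

    # Same verdict as before: alert only on a large enough score change.
    delta = curr_score - prev_score
    if delta >= threshold:
        return "SCORE_RISE"
    if delta <= -threshold:
        return "SCORE_DROP"
    return None


# Values taken from the scorecard; a previous verdict shown without a mark
# is assumed to match the current verdict.
print(classify_drift(False, 3.0, True, 3.33))    # claude-haiku-4-5, common-sense-1
print(classify_drift(True, 4.75, True, 3.33))    # claude-sonnet-4-6, common-sense-1
print(classify_drift(False, 2.25, False, 3.67))  # gpt-5.4-mini, spatial-1
```

Under these assumptions the three calls reproduce the IMPROVEMENT, SCORE_DROP, and SCORE_RISE alerts listed above.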

Provider Status

OpenAI: Users may experience elevated errors in ChatGPT uploading and downloading files
Anthropic: Elevated error rate across multiple models
Anthropic: Elevated errors for Claude Opus 4.8
Anthropic: Elevated errors across many models
Anthropic: Elevated errors for Claude Opus 4.8
Anthropic: Elevated Error Rates for Opus 4.8, Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5
Anthropic: We’ve suspended access to Claude Mythos 5 and Claude Fable 5

Scorecard

Model	ambiguity-1	causality-1	code-1	common-sense-1	logic-1	math-1	spatial-1
anthropic/claude-haiku-4-5	✓ (4.33)	✓ (4.5)	✓ (4.5)	✓ (3.33), was ✗ (3)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-opus-4-6	✓ (5)	✓ (5)	✓ (4.33)	✓ (4.33)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-sonnet-4-6	✓ (4.5)	✓ (5)	✓ (4.5)	✓ (3.33), was 4.75	✓ (4.83)	✓ (5)	✓ (5)
gemini/gemini-2.5-flash	✓ (4.67)	✓ (5), was ✗ (1.75)	✓ (4.67)	✓ (3.5)	✓ (5)	✓ (5)	✓ (5)
gemini/gemini-2.5-pro	✓ (4.83)	✓ (4.83)	✓ (4.67)	✓ (4.83)	✓ (5)	✓ (5)	✓ (5)
ollama/llama3	—	—	—	—	—	—	—
openai/gpt-5.4	✓ (4.5)	✓ (4.83)	✓ (4.67)	✓ (4.33)	✓ (5)	✓ (5)	✓ (5)
openai/gpt-5.4-mini	✓ (4.67)	✓ (4.67)	✓ (4.67)	✓ (4.33)	✓ (5)	✓ (4.67)	✗ (3.67), was 2.25
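
Each scorecard cell packs a verdict mark, a score, and sometimes a previous value into one string. For readers who want to process the table programmatically, the Python sketch below parses a single cell. The `Cell` type and `parse_cell` function are hypothetical names, not part of the report's tooling. When a previous score appears without a mark, the sketch assumes the previous verdict matched the current one.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches cells such as "✓ (5)", "✓ (3.33)was ✗ (3)" or "✗ (3.67), was 2.25".
CELL_RE = re.compile(
    r"^(?P<mark>[✓✗])\s*\((?P<score>\d+(?:\.\d+)?)\)"
    r"(?:,?\s*was\s*(?P<prev_mark>[✓✗])?\s*\(?(?P<prev_score>\d+(?:\.\d+)?)\)?)?$"
)


@dataclass
class Cell:
    passed: bool
    score: float
    prev_passed: Optional[bool] = None
    prev_score: Optional[float] = None


def parse_cell(text: str) -> Optional[Cell]:
    """Parse one scorecard cell; return None for the no-data placeholder."""
    text = text.strip()
    if text == "\u2014":  # dash placeholder used for models without data
        return None
    m = CELL_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognised scorecard cell: {text!r}")
    passed = m["mark"] == "✓"
    if m["prev_score"] is None:
        return Cell(passed, float(m["score"]))
    # Assumption: a previous score shown without a mark kept the same verdict.
    prev_mark = m["prev_mark"]
    prev_passed = passed if prev_mark is None else prev_mark == "✓"
    return Cell(passed, float(m["score"]), prev_passed, float(m["prev_score"]))


print(parse_cell("✓ (3.33)was ✗ (3)"))
print(parse_cell("✗ (3.67), was 2.25"))
```

The pattern accepts the previous value both with and without a separating comma, so it handles either rendering of the cell.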

Model Status

↑ anthropic/claude-haiku-4-5 up
→ anthropic/claude-opus-4-6 stable
↓ anthropic/claude-sonnet-4-6 down
↑ gemini/gemini-2.5-flash up
→ gemini/gemini-2.5-pro stable
→ openai/gpt-5.4 stable
↑ openai/gpt-5.4-mini up
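
The status arrows above are consistent with rolling up the drift alerts per model. The Python sketch below shows one way that roll-up could work. It is an assumption, since the report does not describe the rule. The net-count tie-break and the `model_status` name are hypothetical.

```python
from collections import Counter

# Assumed grouping of alert labels into directions.
UP_ALERTS = {"IMPROVEMENT", "SCORE_RISE"}
DOWN_ALERTS = {"SCORE_DROP"}


def model_status(alerts, models):
    """Map each model to 'up', 'down' or 'stable' from (kind, model, prompt) alerts."""
    net = Counter()
    for kind, model, _prompt in alerts:
        if kind in UP_ALERTS:
            net[model] += 1
        elif kind in DOWN_ALERTS:
            net[model] -= 1
    # Hypothetical tie-break: the sign of the net alert count decides the arrow.
    return {
        m: "up" if net[m] > 0 else "down" if net[m] < 0 else "stable"
        for m in models
    }


# Alerts copied from the Drift Alerts section of this report.
alerts = [
    ("SCORE_RISE", "openai/gpt-5.4-mini", "spatial-1"),
    ("SCORE_DROP", "anthropic/claude-sonnet-4-6", "common-sense-1"),
    ("IMPROVEMENT", "anthropic/claude-haiku-4-5", "common-sense-1"),
    ("IMPROVEMENT", "gemini/gemini-2.5-flash", "causality-1"),
]
models = [
    "anthropic/claude-haiku-4-5", "anthropic/claude-opus-4-6",
    "anthropic/claude-sonnet-4-6", "gemini/gemini-2.5-flash",
    "gemini/gemini-2.5-pro", "openai/gpt-5.4", "openai/gpt-5.4-mini",
]
for name, status in model_status(alerts, models).items():
    print(name, status)
```

With this run's four alerts, the sketch yields the same up, down, and stable assignments as the list above.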

Raw Data

Detail log — full responses and judge verdicts per prompt
JSON — structured data for programmatic access
Markdown — plain text report
responses.json — raw model outputs
judgments.json — raw judge verdicts
run.log — debug log
Agent Skill — how to read and interpret this data
Methodology — how evaluations work