Benchmark: GPQA Diamond
| # | Model | Provider | Score (%) |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 94.3 |
| 2 | GPT-5.4 | OpenAI | 92.0 |
| 3 | GPT-5.3 Codex | OpenAI | 91.5 |
| 4 | Gemini 3.1 Pro | Google | 90.8 |
| 5 | Claude Sonnet 4.6 | Anthropic | 88.5 |
| 6 | Grok 4.20 | xAI | 86.2 |
| 7 | DeepSeek V4 | DeepSeek | 84.0 |
GPQA Diamond is a graduate-level, "Google-proof" Q&A benchmark. The Diamond subset contains 198 questions that both expert validators answered correctly but that most non-expert validators got wrong. PhD-level experts score 69.7% on it.
Key takeaways
- Claude Opus 4.6 leads at 94.3%, 2.3 points ahead of GPT-5.4 and 3.5 points ahead of Gemini 3.1 Pro.
- Every model in the table exceeds the 69.7% PhD-level expert baseline, by margins ranging from roughly 14 points (DeepSeek V4) to nearly 25 points (Claude Opus 4.6); see the quick calculation below.
- The benchmark covers biology, physics, and chemistry at the graduate research level.
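The margins quoted in the takeaways are simple differences between the leaderboard scores and the 69.7% expert baseline. Here is a minimal Python sketch that reproduces that arithmetic; the scores are copied from the table above rather than pulled from any official source.

```python
# Reproduce the margin calculations from the takeaways above.
# Scores are hard-coded from the leaderboard table; 69.7 is the
# PhD-level expert baseline cited for GPQA Diamond.
EXPERT_BASELINE = 69.7

scores = {
    "Claude Opus 4.6": 94.3,
    "GPT-5.4": 92.0,
    "GPT-5.3 Codex": 91.5,
    "Gemini 3.1 Pro": 90.8,
    "Claude Sonnet 4.6": 88.5,
    "Grok 4.20": 86.2,
    "DeepSeek V4": 84.0,
}

# Identify the leader, then report each model's gap to the leader
# and its margin over the expert baseline.
leader, top_score = max(scores.items(), key=lambda kv: kv[1])
print(f"Leader: {leader} ({top_score:.1f}%)")
for model, score in scores.items():
    behind_leader = top_score - score
    above_experts = score - EXPERT_BASELINE
    print(f"{model:<18} {score:5.1f}  "
          f"-{behind_leader:.1f} vs leader, +{above_experts:.1f} vs experts")
```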