
GPQA Diamond

#   Model               Provider    Score
1   Claude Opus 4.6     Anthropic   94.3
2   GPT-5.4             OpenAI      92.0
3   GPT-5.3 Codex       OpenAI      91.5
4   Gemini 3.1 Pro      Google      90.8
5   Claude Sonnet 4.6   Anthropic   88.5
6   Grok 4.20           xAI         86.2
7   DeepSeek V4         DeepSeek    84.0

GPQA Diamond is a graduate-level, “Google-proof” Q&A benchmark. The “Diamond” subset contains 198 questions that both expert validators answered correctly but that most non-expert validators answered incorrectly. PhD-level experts score 69.7% on it.

Key takeaways

  • Claude Opus 4.6 leads with 94.3% — a 2.3-point lead over GPT-5.4 and 3.5 points over Gemini 3.1 Pro.
  • All listed models exceed the 69.7% PhD-level expert baseline, by margins ranging from roughly 14 points (DeepSeek V4) to nearly 25 points (Claude Opus 4.6).
  • The benchmark covers biology, physics, and chemistry at the graduate research level.
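The margin arithmetic in the takeaways above can be checked with a short Python sketch. The scores are copied from the leaderboard table and the 69.7% expert baseline from the benchmark description; nothing else is assumed.

```python
# Margin of each model over the 69.7% PhD-level expert baseline on GPQA Diamond.
# Scores are taken verbatim from the leaderboard table above.

EXPERT_BASELINE = 69.7

scores = {
    "Claude Opus 4.6": 94.3,
    "GPT-5.4": 92.0,
    "GPT-5.3 Codex": 91.5,
    "Gemini 3.1 Pro": 90.8,
    "Claude Sonnet 4.6": 88.5,
    "Grok 4.20": 86.2,
    "DeepSeek V4": 84.0,
}

# Round to one decimal to match the table's precision.
margins = {model: round(score - EXPERT_BASELINE, 1) for model, score in scores.items()}

for model, margin in sorted(margins.items(), key=lambda kv: -kv[1]):
    print(f"{model:<18} +{margin:.1f} pts over expert baseline")
```

Running it confirms the spread: +24.6 points at the top of the table down to +14.3 at the bottom.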