Benchmark: GPQA Diamond
| # | Model | Provider | Score (%) |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 94.3 |
| 2 | GPT-5.4 | OpenAI | 92.0 |
| 3 | GPT-5.3 Codex | OpenAI | 91.5 |
| 4 | Gemini 3.1 Pro | Google | 90.8 |
| 5 | Claude Sonnet 4.6 | Anthropic | 88.5 |
| 6 | Grok 4.20 | xAI | 86.2 |
| 7 | DeepSeek V4 | DeepSeek | 84.0 |
GPQA Diamond is a graduate-level, "Google-proof" Q&A benchmark. The Diamond subset contains 198 questions that both expert validators answered correctly but that most non-expert validators got wrong. PhD-level experts score 69.7% on it.
Key takeaways
- Claude Opus 4.6 leads at 94.3%, 2.3 points ahead of GPT-5.4 and 3.5 points ahead of Gemini 3.1 Pro.
- Every model in the table exceeds the 69.7% PhD-level expert baseline, by margins ranging from roughly 14 points (DeepSeek V4) to nearly 25 points (Claude Opus 4.6); see the quick calculation below.
- The benchmark covers biology, physics, and chemistry at the graduate research level.
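The margins quoted in the takeaways are simple differences between the leaderboard scores and the 69.7% expert baseline. Here is a minimal Python sketch that reproduces that arithmetic; the scores are copied from the table above rather than pulled from any official source.

```python
# Reproduce the margin calculations from the takeaways above.
# Scores are hard-coded from the leaderboard table; 69.7 is the
# PhD-level expert baseline cited for GPQA Diamond.
EXPERT_BASELINE = 69.7

scores = {
    "Claude Opus 4.6": 94.3,
    "GPT-5.4": 92.0,
    "GPT-5.3 Codex": 91.5,
    "Gemini 3.1 Pro": 90.8,
    "Claude Sonnet 4.6": 88.5,
    "Grok 4.20": 86.2,
    "DeepSeek V4": 84.0,
}

# Identify the leader, then report each model's gap to the leader
# and its margin over the expert baseline.
leader, top_score = max(scores.items(), key=lambda kv: kv[1])
print(f"Leader: {leader} ({top_score:.1f}%)")
for model, score in scores.items():
    behind_leader = top_score - score
    above_experts = score - EXPERT_BASELINE
    print(f"{model:<18} {score:5.1f}  "
          f"-{behind_leader:.1f} vs leader, +{above_experts:.1f} vs experts")
```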