SWE-bench Verified
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 93.9 |
| 2 | GPT-5.3 Codex | OpenAI | 85.0 |
| 3 | Claude Opus 4.5 | Anthropic | 80.9 |
| 4 | Claude Opus 4.6 | Anthropic | 80.8 |
| 5 | Claude Sonnet 4.6 | Anthropic | 79.6 |
| 6 | Gemini 3.1 Pro | Google | 78.8 |
| 7 | GPT-5.4 | OpenAI | 77.2 |
| 8 | DeepSeek V4 | DeepSeek | 72.5 |
SWE-bench Verified evaluates AI models on real-world GitHub issues: given an issue report and the repository it comes from, can the model generate a patch that fixes the bug? The "Verified" subset contains human-validated issues with clear acceptance criteria.
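In practice, scoring works by applying the model's patch and running the issue's acceptance tests. The sketch below is illustrative only, not the official SWE-bench harness; the `Task` fields and helper names are assumptions made for the example.

```python
# Minimal sketch of a SWE-bench-style evaluation loop (illustrative;
# the official SWE-bench harness differs in packaging and isolation).
# Assumption: each task provides a repo checkout at the buggy commit,
# a model-generated diff, and the "fail-to-pass" acceptance tests.
import subprocess
from dataclasses import dataclass

@dataclass
class Task:
    repo_dir: str            # path to the repository at the buggy commit
    model_patch: str         # unified diff produced by the model
    fail_to_pass: list[str]  # tests that must pass after the patch

def resolves(task: Task) -> bool:
    """Apply the model's patch, then run the acceptance tests."""
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=task.repo_dir,
        input=task.model_patch, text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # patch does not apply cleanly
    tests = subprocess.run(
        ["python", "-m", "pytest", *task.fail_to_pass],
        cwd=task.repo_dir, capture_output=True,
    )
    return tests.returncode == 0  # resolved iff every acceptance test passes

def score(tasks: list[Task]) -> float:
    """Leaderboard score: percentage of issues resolved."""
    return 100.0 * sum(resolves(t) for t in tasks) / len(tasks)
```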
Notable
- Claude Mythos Preview, an Anthropic research preview model not yet generally available, leads by a wide margin at 93.9%.
- Anthropic dominates the top 5 with 4 entries.
- SWE-bench Pro is gaining traction as an alternative after OpenAI flagged data contamination concerns in the Verified subset. OpenAI has stopped reporting Verified scores.
83 models are currently tracked on this benchmark. Average score: 63.4%.