insidejob

Benchmark: SWE-bench Verified

Rank  Model                  Provider   Score
   1  Claude Mythos Preview  Anthropic   93.9
   2  GPT-5.3 Codex          OpenAI      85.0
   3  Claude Opus 4.5        Anthropic   80.9
   4  Claude Opus 4.6        Anthropic   80.8
   5  Claude Sonnet 4.6      Anthropic   79.6
   6  Gemini 3.1 Pro         Google      78.8
   7  GPT-5.4                OpenAI      77.2
   8  DeepSeek V4            DeepSeek    72.5

SWE-bench Verified evaluates AI models on real-world GitHub issues: given an issue and its repository, can the model generate a patch that fixes the bug? The “Verified” subset contains human-validated issues with clear acceptance criteria.
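
The grading idea can be sketched in a few lines. This is a toy, in-memory stand-in, not the real harness: an actual evaluation applies a git diff to the repository and runs the project's test suite, while here "modules" are dicts of functions, and the names `resolved`, `fail_to_pass`, and `pass_to_pass` are illustrative. An issue counts as resolved only if the patch makes every previously failing test pass without breaking any test that already passed.

```python
# Toy sketch of SWE-bench-style grading (assumed structure, not the real harness).

def resolved(module, patch, fail_to_pass, pass_to_pass):
    """True iff the patch turns every failing test green
    without breaking any previously passing test."""
    patched = {**module, **patch}      # "apply" the candidate patch
    run = lambda test: test(patched)   # a test passes if it returns True
    return all(run(t) for t in fail_to_pass) and all(run(t) for t in pass_to_pass)

# Toy instance: `add` has an off-by-one bug.
buggy = {"add": lambda a, b: a + b + 1}
fix   = {"add": lambda a, b: a + b}
f2p   = [lambda m: m["add"](2, 2) == 4]       # fails before the patch
p2p   = [lambda m: m["add"](1, 1) == m["add"](1, 1)]  # passes either way

resolved(buggy, fix, f2p, p2p)   # → True
resolved(buggy, {},  f2p, p2p)   # → False: an empty patch leaves the bug
```

The leaderboard score is then simply the fraction of issues a model resolves under this pass/fail rule.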

Notable

  • Claude Mythos Preview leads by a wide margin at 93.9% — an Anthropic research preview model not yet generally available.
  • Anthropic dominates the top 5 with 4 entries.
  • SWE-bench Pro is gaining traction as an alternative after OpenAI flagged data contamination concerns in the Verified subset. OpenAI has stopped reporting Verified scores.
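
The first two notes can be checked mechanically. A small sketch, with the top-5 table rows transcribed into Python by hand (the values come from the leaderboard above, not an API):

```python
from collections import Counter

# Top-5 rows from the table above: (rank, model, provider, score).
top5 = [
    (1, "Claude Mythos Preview", "Anthropic", 93.9),
    (2, "GPT-5.3 Codex",         "OpenAI",    85.0),
    (3, "Claude Opus 4.5",       "Anthropic", 80.9),
    (4, "Claude Opus 4.6",       "Anthropic", 80.8),
    (5, "Claude Sonnet 4.6",     "Anthropic", 79.6),
]

# Lead margin: rank 1 over rank 2.
margin = round(top5[0][3] - top5[1][3], 1)   # 8.9 points

# Provider counts in the top 5.
providers = Counter(p for _, _, p, _ in top5)  # Anthropic appears 4 times
```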

83 models are currently tracked on this benchmark. Average score: 63.4%.