SWE-bench Verified
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 93.9 |
| 2 | GPT-5.3 Codex | OpenAI | 85.0 |
| 3 | Claude Opus 4.5 | Anthropic | 80.9 |
| 4 | Claude Opus 4.6 | Anthropic | 80.8 |
| 5 | Claude Sonnet 4.6 | Anthropic | 79.6 |
| 6 | Gemini 3.1 Pro | Google | 78.8 |
| 7 | GPT-5.4 | OpenAI | 77.2 |
| 8 | DeepSeek V4 | DeepSeek | 72.5 |
SWE-bench Verified evaluates AI models on real-world GitHub issues: given an issue report and the repository it comes from, can the model generate a patch that fixes the bug? The "Verified" subset contains human-validated issues with clear acceptance criteria.
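In practice, scoring works by applying the model's patch and running the issue's acceptance tests. The sketch below is illustrative only, not the official SWE-bench harness; the `Task` fields and helper names are assumptions made for the example.

```python
# Minimal sketch of a SWE-bench-style evaluation loop (illustrative;
# the official SWE-bench harness differs in packaging and isolation).
# Assumption: each task provides a repo checkout at the buggy commit,
# a model-generated diff, and the "fail-to-pass" acceptance tests.
import subprocess
from dataclasses import dataclass

@dataclass
class Task:
    repo_dir: str            # path to the repository at the buggy commit
    model_patch: str         # unified diff produced by the model
    fail_to_pass: list[str]  # tests that must pass after the patch

def resolves(task: Task) -> bool:
    """Apply the model's patch, then run the acceptance tests."""
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=task.repo_dir,
        input=task.model_patch, text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # patch does not apply cleanly
    tests = subprocess.run(
        ["python", "-m", "pytest", *task.fail_to_pass],
        cwd=task.repo_dir, capture_output=True,
    )
    return tests.returncode == 0  # resolved iff every acceptance test passes

def score(tasks: list[Task]) -> float:
    """Leaderboard score: percentage of issues resolved."""
    return 100.0 * sum(resolves(t) for t in tasks) / len(tasks)
```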
Notable
- Claude Mythos Preview, an Anthropic research preview model not yet generally available, leads by a wide margin at 93.9%.
- Anthropic dominates the top 5 with 4 entries.
- SWE-bench Pro is gaining traction as an alternative after OpenAI flagged data contamination concerns in the Verified subset. OpenAI has stopped reporting Verified scores.
83 models are currently tracked on this benchmark. Average score: 63.4%.