
LM Arena (Chatbot Arena) Elo Rankings

Rank  Model                        Provider   Elo score
1     Claude Opus 4.6 Thinking     Anthropic  1504
2     Gemini 3.1 Pro Preview       Google     1493
3     Grok 4.20 Beta1              xAI        1491
4     GPT-5.4 High                 OpenAI     1484
5     Claude Sonnet 4.6 Thinking   Anthropic  1478
6     GPT-5.4                      OpenAI     1470
7     Gemini 3.1 Flash             Google     1455
8     DeepSeek V4                  DeepSeek   1445

LM Arena (formerly LMSYS Chatbot Arena) ranks models using Elo ratings from crowdsourced human pairwise comparisons. Users chat with two anonymous models and vote for the better response.
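
Under the classic Elo scheme the leaderboard is named for, each vote nudges the two models' ratings toward the observed outcome. Here is a minimal Python sketch of that mechanism, assuming a conventional chess-style K-factor of 32; LM Arena's production pipeline fits ratings over the full vote set rather than streaming one update per vote, so treat this as an illustration, not the site's code:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model
    (win probability, with draws counted as half)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one pairwise vote.

    k=32 is a conventional chess value assumed here for
    illustration; it is not LM Arena's parameter.
    """
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b
```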

  • Reasoning-optimized models dominate. Claude Opus 4.6 Thinking uses a hidden chain of thought to check and refine its answer before the user sees it.
  • Grok 4.20 disrupts the top tier, climbing to #3 globally and surpassing GPT-5.4 High.
  • Gemini 3.1 Pro Preview outperforms GPT-5.4 High by 9 Elo points in the text arena (see the worked example after this list).
  • Anything above 1400 Elo is considered frontier-level performance.
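
To put the 9-point gap in the third bullet in perspective, the expected_score sketch above converts an Elo difference into an expected head-to-head result (ratings taken from the table; this uses the illustrative helper defined earlier, not LM Arena's code):

```python
# 1493 (Gemini 3.1 Pro Preview) vs. 1484 (GPT-5.4 High): a 9-point gap.
p = expected_score(1493, 1484)
print(round(p, 3))  # 0.513: roughly a 51/49 split in head-to-head votes
```

Small gaps at the top of the table therefore correspond to near-coin-flip preferences, which is why positions among the frontier models can swap as fresh votes arrive.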

The leaderboard updates daily as thousands of new human comparisons are processed.
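
As a rough sketch of how a batch of new comparisons could be folded into an updated ranking, reusing elo_update from above (the vote tuple format and the 1000-point starting rating are assumptions for illustration, not details of LM Arena's pipeline):

```python
from collections import defaultdict

def rank_models(votes, initial=1000.0):
    """Fold pairwise votes (model_a, model_b, a_won) into a sorted leaderboard."""
    ratings = defaultdict(lambda: initial)
    for a, b, a_won in votes:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)
    # Highest rating first, matching the table above.
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

Note that this streaming update is order-dependent: the same votes replayed in a different order can land on slightly different ratings, which is one reason production leaderboards prefer fitting all votes jointly.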