insidejob
Sat, Apr 11. First edition.

March 2026 was the densest model-release window in AI history: GPT-5.4, Gemini 3.1, DeepSeek V4 (1T parameters), and Claude Managed Agents all shipped. Open-source models now match proprietary ones on many benchmarks.
Latest News

A comprehensive guide to MITRE ATLAS — 16 tactics, 84 techniques, and 42 case studies for understanding adversarial threats to AI/ML systems.

A technical breakdown of prompt injection attack classes, real CVEs, and the defense mechanisms that work — and those that don't.

Three frontier models in a single month — GPT-5.4, Gemini 3.1 Ultra, and Grok 4.20 — plus major open-source releases.

As AI agents gain autonomy, the OWASP LLM Top 10 tracks the most critical security risks for large language model applications.

Anthropic renames its SDK to reflect its broader applications beyond coding. Now available in Python and TypeScript.

A fully managed agent harness for running Claude autonomously with secure sandboxing, multi-agent coordination, and server-sent event streaming.
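Server-sent events are a plain-text streaming format: fields accumulate line by line and a blank line dispatches the event. A minimal parsing sketch in Python; the event name and payload below are hypothetical illustrations, not Anthropic's actual API schema:

```python
def parse_sse(lines):
    """Parse an iterable of text/event-stream lines into (event, data) pairs.

    Fields accumulate until a blank line, which dispatches the event.
    Events with no "event:" field default to "message", per the SSE format.
    """
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                          # blank line dispatches the event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        # comment lines (":...") and unknown fields are ignored

# Hypothetical stream from an agent session:
stream = [
    "event: agent_step\n",
    "data: {\"tool\": \"bash\"}\n",
    "\n",
    "data: done\n",
    "\n",
]
events = list(parse_sse(stream))
# events == [("agent_step", '{"tool": "bash"}'), ("message", "done")]
```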

The largest freely available AI model at 1T parameters, hosted on OpenRouter at $0.28/M input tokens.

Releases

Claude Managed Agents (Anthropic)
  • Fully managed agent harness on Anthropic infrastructure
  • Secure sandboxing and long-running sessions
  • Multi-agent coordination in research preview

GPT-5.4 (OpenAI)
  • Record 83% on GDPval
  • Record scores on OSWorld-Verified and WebArena-Verified
  • Standard, Thinking, and Pro variants

Claude Opus 4.6 and Claude Sonnet 4.6 (Anthropic)
  • 1M context window at standard pricing
  • Opus 80.8% and Sonnet 79.6% on SWE-bench Verified
  • Adaptive, extended, and interleaved thinking
Models (pricing per 1M tokens)

Model               Provider   In/Out ($)     Context  Benchmark
Qwen 3.6 Plus       Alibaba    $0.30/$1.20    1M       GPQA 82%
Gemma 4             Google     free           128K     GPQA 72%
DeepSeek V4         DeepSeek   $0.28/$1.10    128K     SWE-bench 72.5%
GPT-5.4             OpenAI     $2.50/$10.00   256K     GPQA 92%
Gemini 3.1 Pro      Google     $2.00/$12.00   2M       SWE-bench 78.8%
Claude Opus 4.6     Anthropic  $5.00/$25.00   1M       SWE-bench 80.8%
Claude Sonnet 4.6   Anthropic  $3.00/$15.00   1M       SWE-bench 79.6%
Llama 4 Maverick    Meta       free           10M      SWE-bench 68.5%
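Per-1M-token prices translate directly into per-request cost. A minimal sketch (prices copied from the table above; the token counts are made-up example values):

```python
# (input $/1M tokens, output $/1M tokens), from the pricing table
PRICES = {
    "DeepSeek V4":       (0.28, 1.10),
    "GPT-5.4":           (2.50, 10.00),
    "Claude Opus 4.6":   (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request at per-1M-token pricing."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# e.g. a 10K-in / 2K-out request on DeepSeek V4:
cost = request_cost("DeepSeek V4", 10_000, 2_000)
# ~$0.005 ($0.0028 input + $0.0022 output)
```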
Benchmarks

GPQA Diamond

  1. Claude Opus 4.6 94.3
  2. GPT-5.4 92
  3. GPT-5.3 Codex 91.5
  4. Gemini 3.1 Pro 90.8
  5. Claude Sonnet 4.6 88.5

SWE-bench Verified

  1. Claude Mythos Preview 93.9
  2. GPT-5.3 Codex 85
  3. Claude Opus 4.5 80.9
  4. Claude Opus 4.6 80.8
  5. Claude Sonnet 4.6 79.6

LM Arena (Chatbot Arena) Elo Rankings

  1. Claude Opus 4.6 Thinking 1504
  2. Gemini 3.1 Pro Preview 1493
  3. Grok 4.20 Beta1 1491
  4. GPT-5.4 High 1484
  5. Claude Sonnet 4.6 Thinking 1478
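Elo gaps map to expected win rates via the standard Elo expected-score formula; that LM Arena's ratings follow this exact scale is an assumption here, the formula itself is standard:

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score for A vs. B: 1 / (1 + 10^((r_b - r_a) / 400))."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# Opus 4.6 Thinking (1504) vs. Sonnet 4.6 Thinking (1478), ratings from the list above:
p = expected_score(1504, 1478)
# a 26-point gap is only about a 53.7% expected win rate
```

The takeaway: the top five are separated by 26 Elo points, so head-to-head preferences between them are close to a coin flip.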
ATLAS 5.5.0: 16 tactics / 101 techniques / 66 subtechniques