insidejob
Sat, Apr 11. First edition.

March 2026 was the densest model-release window in AI history: GPT-5.4, Gemini 3.1, DeepSeek V4 (1T parameters), and Claude Managed Agents all shipped. Open-source models now match proprietary ones on many benchmarks.
Latest News

A comprehensive guide to MITRE ATLAS — 16 tactics, 84 techniques, and 42 case studies for understanding adversarial threats to AI/ML systems.

A technical breakdown of prompt injection attack classes, real CVEs, and the defense mechanisms that work — and those that don't.

Three frontier models in a single month — GPT-5.4, Gemini 3.1 Ultra, and Grok 4.20 — plus major open-source releases.

As AI agents gain autonomy, the OWASP LLM Top 10 tracks the most critical security risks for large language model applications.

Anthropic renames its SDK to reflect its broader applications beyond coding. Now available in Python and TypeScript.

A fully managed agent harness for running Claude autonomously with secure sandboxing, multi-agent coordination, and server-sent event streaming.
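Server-sent events are a plain-text streaming format: fields accumulate line by line and a blank line dispatches the event. A minimal parsing sketch in Python; the event name and payload below are hypothetical illustrations, not Anthropic's actual API schema:

```python
def parse_sse(lines):
    """Parse an iterable of text/event-stream lines into (event, data) pairs.

    Fields accumulate until a blank line, which dispatches the event.
    Events with no "event:" field default to "message", per the SSE format.
    """
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                          # blank line dispatches the event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        # comment lines (":...") and unknown fields are ignored

# Hypothetical stream from an agent session:
stream = [
    "event: agent_step\n",
    "data: {\"tool\": \"bash\"}\n",
    "\n",
    "data: done\n",
    "\n",
]
events = list(parse_sse(stream))
# events == [("agent_step", '{"tool": "bash"}'), ("message", "done")]
```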

The largest freely available AI model at 1T parameters, hosted on OpenRouter at $0.28/M input tokens.

Releases

Claude Managed Agents (Anthropic)
  • Fully managed agent harness on Anthropic infrastructure
  • Secure sandboxing and long-running sessions
  • Multi-agent coordination in research preview

GPT-5.4 (OpenAI)
  • Record 83% on GDPval
  • Record scores on OSWorld-Verified and WebArena-Verified
  • Standard, Thinking, and Pro variants

Claude Opus 4.6 and Claude Sonnet 4.6 (Anthropic)
  • 1M context window at standard pricing
  • Opus 80.8% and Sonnet 79.6% on SWE-bench Verified
  • Adaptive, extended, and interleaved thinking
Models (pricing per 1M tokens)

Model               Provider   In/Out ($)     Context  Benchmark
Qwen 3.6 Plus       Alibaba    $0.30/$1.20    1M       GPQA 82%
Gemma 4             Google     free           128K     GPQA 72%
DeepSeek V4         DeepSeek   $0.28/$1.10    128K     SWE-bench 72.5%
GPT-5.4             OpenAI     $2.50/$10.00   256K     GPQA 92%
Gemini 3.1 Pro      Google     $2.00/$12.00   2M       SWE-bench 78.8%
Claude Opus 4.6     Anthropic  $5.00/$25.00   1M       SWE-bench 80.8%
Claude Sonnet 4.6   Anthropic  $3.00/$15.00   1M       SWE-bench 79.6%
Llama 4 Maverick    Meta       free           10M      SWE-bench 68.5%
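Per-1M-token prices translate directly into per-request cost. A minimal sketch (prices copied from the table above; the token counts are made-up example values):

```python
# (input $/1M tokens, output $/1M tokens), from the pricing table
PRICES = {
    "DeepSeek V4":       (0.28, 1.10),
    "GPT-5.4":           (2.50, 10.00),
    "Claude Opus 4.6":   (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request at per-1M-token pricing."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# e.g. a 10K-in / 2K-out request on DeepSeek V4:
cost = request_cost("DeepSeek V4", 10_000, 2_000)
# ~$0.005 ($0.0028 input + $0.0022 output)
```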
Benchmarks

GPQA Diamond

  1. Claude Opus 4.6 94.3
  2. GPT-5.4 92
  3. GPT-5.3 Codex 91.5
  4. Gemini 3.1 Pro 90.8
  5. Claude Sonnet 4.6 88.5

SWE-bench Verified

  1. Claude Mythos Preview 93.9
  2. GPT-5.3 Codex 85
  3. Claude Opus 4.5 80.9
  4. Claude Opus 4.6 80.8
  5. Claude Sonnet 4.6 79.6

LM Arena (Chatbot Arena) Elo Rankings

  1. Claude Opus 4.6 Thinking 1504
  2. Gemini 3.1 Pro Preview 1493
  3. Grok 4.20 Beta1 1491
  4. GPT-5.4 High 1484
  5. Claude Sonnet 4.6 Thinking 1478
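Elo gaps map to expected win rates via the standard Elo expected-score formula; that LM Arena's ratings follow this exact scale is an assumption here, the formula itself is standard:

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score for A vs. B: 1 / (1 + 10^((r_b - r_a) / 400))."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# Opus 4.6 Thinking (1504) vs. Sonnet 4.6 Thinking (1478), ratings from the list above:
p = expected_score(1504, 1478)
# a 26-point gap is only about a 53.7% expected win rate
```

The takeaway: the top five are separated by 26 Elo points, so head-to-head preferences between them are close to a coin flip.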
ATLAS 5.5.0: 16 tactics / 101 techniques / 66 subtechniques