
Gemma 4

Parameters: 26B (MoE)
Context: 128K tokens
Max output: 16K tokens
Architecture: Mixture-of-Experts, 26B parameters
Pricing (per 1M tokens): Free

Benchmark scores

GPQA Diamond: 72

Available via: Self-hosted, Ollama, LM Studio
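For the Ollama route, getting started is a two-command affair. A minimal sketch, assuming a hypothetical `gemma4` model tag (the actual tag in the Ollama library may differ; check `ollama list` or the library page):

```shell
# Hypothetical model tag -- verify the real name in the Ollama model library
ollama pull gemma4

# One-shot prompt from the command line
ollama run gemma4 "Summarize mixture-of-experts routing in two sentences."
```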

Gemma 4 brings genuine reasoning capability to consumer hardware — it runs at 85 tokens/second on a single consumer GPU.

Key specs

Spec        Value
Parameters  26B (MoE)
Disk size   ~14 GB
Speed       85 tok/s on consumer GPU
Context     128K tokens
Cost        Free (open weights)
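The numbers above are mutually consistent, as a quick back-of-envelope check shows. (The ~4-bit quantization level is an inference from the disk size, not something stated in the spec sheet.)

```python
PARAMS = 26e9          # 26B parameters
DISK_GB = 14           # ~14 GB on disk
TOK_PER_S = 85         # quoted decode speed
MAX_OUTPUT = 16_000    # 16K-token output cap

# Implied bits per weight: disk bytes * 8 / parameter count.
# ~4.3 bits/weight is consistent with 4-bit quantization plus overhead.
bits_per_weight = DISK_GB * 1e9 * 8 / PARAMS
print(f"{bits_per_weight:.1f} bits/weight")   # -> 4.3 bits/weight

# Worst-case wall-clock time to emit a full 16K-token response.
seconds = MAX_OUTPUT / TOK_PER_S
print(f"{seconds / 60:.1f} minutes")          # -> 3.1 minutes
```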

Why it matters

Gemma 4 is significant because it demonstrates that MoE architectures can deliver meaningful quality improvements at sizes that actually run on hardware people own. You don’t need an H100 cluster — a MacBook Pro or a gaming PC with 16GB+ VRAM will do.
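The MoE trick can be sketched with a toy top-k gating step: per token, a router scores every expert and only the top-scoring few actually run, so active compute is a fraction of total parameters even though all of them sit in memory. The expert count and k below are illustrative; Gemma 4's actual routing configuration isn't specified above.

```python
import math

def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 8 illustrative experts; with k=2, only ~1/4 of the expert parameters
# do work for this token, which is why a 26B MoE can decode quickly.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
print(top_k_gate(logits, k=2))  # experts 1 and 4 carry all the gate weight
```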

Strengths

  • Runs on consumer hardware (16GB+ VRAM)
  • Fast inference — 85 tok/s without specialized infrastructure
  • Google’s training data and methodology at open-source scale
  • Excellent for local/private deployment

Weaknesses

  • Benchmarks trail frontier models significantly (GPQA ~72%)
  • 128K context window is modest compared with the longer windows frontier models now offer
  • Limited to text (no multimodal)
  • Smaller community than Llama ecosystem