Capabilities

What frontier models can do, on what compute, and according to what evidence, plus what the public scores hide. This section plots benchmark performance against training compute, compares autonomous-task time horizons with the statutory thresholds they approach, cross-checks announced training runs against the physical evidence, and tracks the elicitation gap that makes every published score a floor rather than a ceiling. Start with the scaling curve, then follow a model up into the compute that could train it, or down into the floor its score really represents.

Capability Scaling

Benchmark score against estimated training compute across 3,000+ Epoch AI model entries on 36 evaluations (14 safety-relevant), each point carrying its training chip and current export-control status. Click any model to open its detail view, then see which facilities have the compute to train it and what floor its published score represents.