Benchmark score vs. estimated training FLOP, plotted across 3,000+ Epoch AI model entries on 36 evaluations (14 safety-relevant). Per-benchmark log-linear fits with R² and quality label; each data point carries hardware type, training country, and current export-control status.
Benchmark score plotted against training compute across 3,000+ models, with the safety-relevant subset (autonomous task completion, cybersecurity, deception) flagged separately. Each dot carries the training chip and its current export-control status, so each scaling trend can be read alongside the supply-chain provenance of the compute behind it.
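
To illustrate what a per-benchmark log-linear fit with R² could look like, here is a minimal sketch that regresses score on log10 of training FLOP using ordinary least squares. The function name `fit_log_linear` and the sample values are hypothetical and are not taken from the Epoch AI data. The quality label is left out because its thresholds are not stated here.

```python
import numpy as np


def fit_log_linear(flop, score):
    """Least-squares fit of score = intercept + slope * log10(FLOP).

    Hypothetical helper for illustration; returns (slope, intercept, r2).
    """
    x = np.log10(np.asarray(flop, dtype=float))
    y = np.asarray(score, dtype=float)

    # Degree-1 polynomial fit: coefficients come back highest power first.
    slope, intercept = np.polyfit(x, y, 1)

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    residuals = y - (intercept + slope * x)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return float(slope), float(intercept), r2


# Made-up values for one benchmark (assumption, not real measurements).
sample_flop = [1e21, 1e22, 1e23, 1e24, 1e25]
sample_score = [12.0, 25.0, 41.0, 52.0, 68.0]

slope, intercept, r2 = fit_log_linear(sample_flop, sample_score)
print(f"slope per 10x FLOP: {slope:.2f}, intercept: {intercept:.2f}, R^2: {r2:.3f}")
```

Under this reading, the slope is the change in score for each tenfold increase in training compute, and R² indicates how much of the score variance the trend line accounts for on that benchmark.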