Scrutica
Capability Scaling fits benchmark scores against training FLOP, using both log-linear and sigmoid curves. Training Evidence cross-references vendor training-run announcements against four independent public signals (chip count, facility power draw, capex, supply-chain consistency). Autonomy Monitor tracks METR time-horizon scores against estimated training compute.
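As a rough illustration of the two curve shapes Capability Scaling names, the sketch below fits a log-linear model and a sigmoid model to (training FLOP, benchmark score) pairs using only the standard library. It is not the site's implementation. The function names are hypothetical, and the sigmoid fit assumes scores normalized to the open interval (0, 1) with a floor of 0 and a ceiling of 1, so that it can be fitted as a straight line on the logit scale.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b * x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_log_linear(flops, scores):
    """Fit score ~ a + b * log10(FLOP). Hypothetical helper."""
    return linear_fit([math.log10(f) for f in flops], scores)


def fit_sigmoid(flops, scores, eps=1e-6):
    """Fit score ~ 1 / (1 + exp(-(a + b * log10(FLOP)))).

    Assumption: scores lie in (0, 1) with asymptotes at 0 and 1, so the
    model is linear after a logit transform. Scores are clamped to
    [eps, 1 - eps] to keep the logit finite.
    """
    logits = []
    for s in scores:
        s = min(max(s, eps), 1 - eps)
        logits.append(math.log(s / (1 - s)))
    return linear_fit([math.log10(f) for f in flops], logits)


def predict_log_linear(a, b, flop):
    """Predicted score under the log-linear fit."""
    return a + b * math.log10(flop)


def predict_sigmoid(a, b, flop):
    """Predicted score under the sigmoid fit."""
    return 1.0 / (1.0 + math.exp(-(a + b * math.log10(flop))))
```

The log-linear form extrapolates without bound, while the sigmoid saturates at the benchmark ceiling, which is one reason to compare both fits on the same data. A nonlinear least-squares fit with free asymptotes would be a reasonable alternative to the logit shortcut used here.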
Three tools, one substrate. The Governance pages anchor the compute side of the threshold story, which ends at “model X exceeds FLOP threshold Y”. This page sits one rung up: it asks what models at that compute level actually do, and whether the vendor's training-run announcement holds up against the same four public signals.
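The page does not say how an announcement is checked against a signal. One plausible sketch, for the chip-count signal only, is to compute the training FLOP that a reported chip count could deliver and compare it with the announced figure. Everything here is an assumption made for illustration: the function names, the per-chip peak throughput, the utilization figure, and the tolerance ratio.

```python
SECONDS_PER_DAY = 86_400


def implied_training_flop(chip_count, peak_flops_per_chip, utilization, training_days):
    """Total FLOP a cluster could deliver over a training run.

    chip_count          -- number of accelerators (from the public signal)
    peak_flops_per_chip -- peak FLOP per second per chip (assumed value)
    utilization         -- fraction of peak actually achieved, 0..1 (assumed value)
    training_days       -- wall-clock length of the run in days
    """
    seconds = training_days * SECONDS_PER_DAY
    return chip_count * peak_flops_per_chip * utilization * seconds


def is_consistent(announced_flop, implied_flop, max_ratio=3.0):
    """True if the two estimates agree within a multiplicative tolerance.

    max_ratio is a hypothetical tolerance, not a value taken from the site.
    """
    ratio = max(announced_flop, implied_flop) / min(announced_flop, implied_flop)
    return ratio <= max_ratio


# Illustrative inputs only, not data about any real training run.
implied = implied_training_flop(
    chip_count=10_000,
    peak_flops_per_chip=1e15,
    utilization=0.4,
    training_days=100,
)
print(is_consistent(announced_flop=3e25, implied_flop=implied))
```

A ratio test is used rather than an absolute difference because both figures span many orders of magnitude and each carries wide uncertainty. The other three signals (power draw, capex, supply-chain consistency) would need their own conversion steps, which are not sketched here.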