Scrutica
Capability Scaling plots benchmark scores against training FLOP and overlays log-linear and sigmoid fits. Training Evidence cross-checks reported training runs against independent signals (chip counts, power draw, cost, and supply-chain consistency). Autonomy Monitor tracks METR time horizons.
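As a rough illustration of the two fit types named above, the sketch below fits a log-linear curve and a sigmoid to benchmark scores as a function of log10 training FLOP. It is not Scrutica's implementation: the function names, the fixed score ceiling, and the logit-transform shortcut for the sigmoid are all assumptions made for this example, and the data points are invented.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_log_linear(flops, scores):
    """Fit score = a + b * log10(FLOP). Returns a prediction function."""
    a, b = linear_fit([math.log10(f) for f in flops], scores)
    return lambda f: a + b * math.log10(f)


def fit_sigmoid(flops, scores, ceiling=1.0):
    """Fit score = ceiling / (1 + exp(-(a + b * log10(FLOP)))).

    Assumption: the ceiling is known and every score lies strictly between
    0 and the ceiling, so the logit of the score is linear in log10(FLOP)
    and plain least squares can be reused.
    """
    xs = [math.log10(f) for f in flops]
    logits = [math.log(s / (ceiling - s)) for s in scores]
    a, b = linear_fit(xs, logits)
    return lambda f: ceiling / (1.0 + math.exp(-(a + b * math.log10(f))))


# Invented data points, for illustration only.
example_flops = [1e22, 1e23, 1e24, 1e25]
example_scores = [0.20, 0.35, 0.55, 0.72]

predict_linear = fit_log_linear(example_flops, example_scores)
predict_sigmoid = fit_sigmoid(example_flops, example_scores)
```

The practical difference shows up when extrapolating: the log-linear fit keeps rising past the maximum possible score, while the sigmoid saturates below the ceiling, which is why a tool might show both.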
These tools connect compute governance to capability outcomes. How much compute produces what level of capability? Are company claims about training runs plausible given known infrastructure? Which systems are approaching safety-relevant autonomy thresholds?
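The plausibility question can be made concrete with back-of-the-envelope arithmetic: a training run's total FLOP is roughly chip count times peak FLOP/s per chip times utilization times training time in seconds, and the chips' power draw must fit within the site's capacity. The following is a minimal sketch under those assumptions. The RunClaim fields, the utilization range, and the PUE default are hypothetical values chosen for illustration, not figures or logic taken from Scrutica.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class RunClaim:
    """A hypothetical record of what is claimed or known about a run."""
    claimed_flop: float         # total training FLOP stated by the company
    chip_count: int             # accelerators believed to be available
    peak_flops_per_chip: float  # peak FLOP/s per accelerator
    training_days: float        # reported or estimated run duration
    site_power_mw: float        # known facility power capacity
    chip_power_kw: float        # per-accelerator draw, including server overhead


def implied_flop(claim, utilization):
    """Total FLOP the hardware could deliver at a given utilization."""
    seconds = claim.training_days * SECONDS_PER_DAY
    return claim.chip_count * claim.peak_flops_per_chip * utilization * seconds


def check_plausibility(claim, util_range=(0.2, 0.6), pue=1.2):
    """Compare the claim against compute and power signals.

    util_range and pue are illustrative assumptions, not measured values.
    """
    low = implied_flop(claim, util_range[0])
    high = implied_flop(claim, util_range[1])
    power_needed_mw = claim.chip_count * claim.chip_power_kw * pue / 1000.0
    return {
        "flop_range": (low, high),
        "compute_consistent": low <= claim.claimed_flop <= high,
        "power_needed_mw": power_needed_mw,
        "power_consistent": power_needed_mw <= claim.site_power_mw,
    }


# Invented example: 10,000 chips at 1e15 FLOP/s for 90 days on a 20 MW site.
example = RunClaim(
    claimed_flop=3e25,
    chip_count=10_000,
    peak_flops_per_chip=1e15,
    training_days=90,
    site_power_mw=20.0,
    chip_power_kw=1.0,
)
report = check_plausibility(example)
```

A claim that falls outside the implied FLOP range, or that needs more power than the site can supply, is not proof of a false statement. It only marks where the signals disagree and where further evidence (cost, supply-chain records) would be worth examining.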