How do benchmark scores change as training compute increases? This explorer plots benchmark performance against estimated training compute (FLOP) across 31 evaluations, 11 of them safety-relevant (autonomous task completion, deception detection, reasoning). It covers 2,800+ model entries from Epoch AI, and each data point carries hardware provenance and export control status. Regression fits are available per benchmark family. For autonomous capability threshold monitoring, see Autonomy Monitor. To estimate any facility's training compute, see FLOP Engine.
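
As a rough illustration of what a per-family regression fit of score against compute could look like, here is a minimal sketch. It assumes a straight-line fit of score on log10(FLOP), which is an assumption: the explorer's actual functional form is not stated here. The records, family names, and the helpers `fit_line` and `fit_per_family` are hypothetical and use made-up numbers, not Epoch AI data.

```python
import math
from collections import defaultdict

# Hypothetical records: (benchmark_family, training_flop, score).
# These values are invented for illustration and are not real data.
RECORDS = [
    ("reasoning", 1e22, 30.0),
    ("reasoning", 1e23, 40.0),
    ("reasoning", 1e24, 50.0),
    ("autonomy", 1e23, 10.0),
    ("autonomy", 1e24, 25.0),
    ("autonomy", 1e25, 40.0),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_family(records):
    """Group records by benchmark family and fit score vs. log10(FLOP)."""
    grouped = defaultdict(list)
    for family, flop, score in records:
        grouped[family].append((math.log10(flop), score))
    fits = {}
    for family, points in grouped.items():
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        fits[family] = fit_line(xs, ys)
    return fits


FITS = fit_per_family(RECORDS)
for family, (slope, intercept) in sorted(FITS.items()):
    print(f"{family}: {slope:.2f} points per 10x compute, intercept {intercept:.2f}")
```

Under this assumed form, the slope reads as score points gained per tenfold increase in training compute. Bounded scores (for example, accuracy capped at 100%) tend to saturate, so a sigmoid in log-compute may describe real data better than a straight line.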