Scrutica
OpenAI's Preparedness Framework v2 says it cleanly: “we regard any one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling.” METR's time-horizon paper attaches a number: results are a “reasonable lower bound” at 2-3 engineer-weeks of elicitation per model, and o1's measured time horizon more than doubled with three engineer-weeks of dedicated effort. UK AISI's 13 May 2026 cyber post names the specific design choices producing its floor (a 2.5M-token-per-task cap, a simple agent scaffold). Anthropic concedes the gap in its RSP; Google DeepMind's FSF sets a threat-actor-matching standard its current framework does not claim to meet; Apollo Research has the structural argument that prompt-format variance alone moves accuracy by up to 76 points.
What the convergence does not include is a multiplier. No public source quantifies a single “gap factor” that converts a published score into the ceiling. METR's ~2× on o1 is the closest measured number, and it covers one model on one axis. Anyone quoting 10× or 100× across the frontier is past what the literature supports. The honest read is that the gap exists, its width is bounded only by what each lab is willing to spend on elicitation, and the structural reason public evaluators cannot close it (max-elicitation of cyber or CBRN capability is operationally close to building the misuse tool) applies to every domain Article 51 of the EU AI Act is drawn around.
Article 51(2) of the EU AI Act presumes systemic risk above 10^25 FLOP. The presumption is rebuttable; the rebuttal and most downstream deployment decisions lean on capability evaluations. The capability evaluations the public can read are floors, by the evaluators' own words. OpenAI: “lower bound, rather than a ceiling.” METR: “a reasonable lower bound” with two to three engineer-weeks of elicitation per model. UK AISI: “a simple agent scaffold... artificially lowers success rates and understates what models can do with more tokens and stronger scaffolds.” Anthropic and Google DeepMind concede the gap exists in their public safety frameworks without quantifying it. Apollo Research shows the structural reason: small scaffold changes alone move scores by tens of percentage points.
The gap factor, the multiplier that would turn a published score into a ceiling, is not in the public literature. The regulator drawing the 10^25 FLOP line in Article 51 is drawing it against floor evidence whose distance from a ceiling no public evaluator has measured. The July 2025 GPAI Guidelines put a ±30% tolerance on the cumulative-FLOP measurement itself; no comparable tolerance is documented on the capability side because the capability side has not specified a measurement regime to put bounds on.
The grid below collects the floor-language disclosures from the institutions whose published capability work is most often cited in policy. Each card is verbatim from the linked primary source; framing tags are descriptive (the card says what kind of admission it is, not whether the institution agrees with the page's consolidation of them).
“we regard any one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling, on capabilities that may emerge in real world use and misuse.” (OpenAI, Preparedness Framework v2)
“researchers put a limited amount of effort into eliciting models to get good performance on their tasks, with their results serving as a reasonable lower bound. The most work was done to elicit o1 and the original Claude 3.5 Sonnet, each of which had around 2-3 engineer weeks of iterative development.” (METR, time-horizon paper)
“The cap, alongside our use of a simple agent scaffold, artificially lowers success rates and understates what models can do with more tokens and stronger scaffolds.” (UK AISI, cyber post, 13 May 2026)
“it is impossible to guarantee they are eliciting all the relevant model capabilities during testing… [we] could be substantially underestimating the capabilities that external actors could elicit.” (Anthropic, RSP)
“elicitation methods used during evaluations must be comprehensive enough to match the elicitation efforts of potential threat actors.” (Google DeepMind, FSF)
“Evals often aim to estimate an upper bound of capabilities, [so] it is important to understand how to elicit maximal rather than average [capabilities]… several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points.” (Apollo Research)
No public evaluator has re-run a frontier-model cyber suite under a multi-agent scaffold; the closest available evidence comes from benchmarks that publish per-scaffold-mode scores on the same model. Cybench's Unguided / Subtask-Guided / Subtasks columns and OSWorld's 15- / 50- / 100-step budget series both have that shape. Across 8 paired Cybench rows the mean Subtasks / Unguided fold change is roughly 3×; the largest gap sits at o1-preview's 10% Unguided → 46.8% Subtasks (about 4.7×). Across 5 paired OSWorld rows the 100-step / 15-step lift averages ~1.6×. Both are still floors: the Subtasks column is the strongest scaffold the benchmark publisher chose to disclose, not the strongest a properly-resourced red team would produce.
The two charts below render “same model, same benchmark, different scaffold” on two primary-source datasets where the publisher disclosed per-scaffold-mode numbers. Move the model from a simpler scaffold to a stronger one and the score moves: by roughly 3× on Cybench across 8 models, by ~1.6× on OSWorld across 5. The numbers are not the gap; they are the lift the benchmark publisher chose to disclose. The true gap is at least that wide, and wider by construction, because none of these are maximum-elicitation runs.
What the charts do not claim: that the Subtasks column or the 100-step budget is the actual capability ceiling. The lift visible on each chart is the lift the benchmark publisher chose to disclose. The gap between either of those columns and the true ceiling is unmeasured by construction: the publisher did not run stronger scaffolds, and no one publicly has.
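The fold-change arithmetic behind those figures can be sketched as follows. This is a minimal illustration, assuming the aggregate is an arithmetic mean of per-model ratios (the text does not say which aggregation was used). The function names are hypothetical, and only the o1-preview pair comes from the text; the other rows are placeholders, not benchmark data.

```python
from statistics import mean


def fold_change(weak_score: float, strong_score: float) -> float:
    """Ratio of the stronger-scaffold score to the weaker-scaffold score."""
    if weak_score <= 0:
        raise ValueError("fold change is undefined for a non-positive baseline")
    return strong_score / weak_score


def mean_fold_change(pairs):
    """Arithmetic mean of per-model fold changes (one possible aggregation)."""
    return mean(fold_change(weak, strong) for weak, strong in pairs)


# The only pair quoted in the text: o1-preview on Cybench, in percent
# (Unguided, Subtasks).
o1_preview = (10.0, 46.8)
o1_lift = fold_change(*o1_preview)

# HYPOTHETICAL rows, used only to show the aggregation step; not real scores.
illustrative_pairs = [o1_preview, (20.0, 40.0), (30.0, 45.0)]
illustrative_mean = mean_fold_change(illustrative_pairs)

print(f"o1-preview lift: {o1_lift:.2f}x")  # o1-preview lift: 4.68x
```

A mean of ratios weights every model equally regardless of its baseline score; a geometric mean or a ratio of mean scores would give a different aggregate on the same rows, which is one more reason the disclosed lifts are not directly commensurable.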
The AISI cyber-range methodology paper (Folkerts, Payne, Inman et al., arXiv 2603.11214, 11 March 2026) defines the instrument: two ranges, a 32-step corporate-network attack (“The Last Ones”) and a 7-step ICS attack, under a 2.5M-token-per-task budget, evaluated across seven models from August 2024 to February 2026 at varying inference-time compute budgets. The Mythos Preview evaluation (13 April 2026) is one run on that instrument. On 13 May 2026 AISI named the two things that hold the published numbers below the ceiling: the token cap, and a “simple agent scaffold” that the institute holds constant across models so the cross-model comparison stays clean. Inspect, the evaluation framework AISI itself maintains at UKGovernmentBEIS/inspect_ai, added deepagent() in v0.3.213 on 27 April 2026 (subagent delegation, persistent memory, structured planning, with research / plan / general subagent factories built in). The Mythos cyber methodology has not adopted it. The cleanest read of the situation is the institute's own: the cross-model held-constant call is the right one for comparability, and the right call here understates capability.
The two-week gap between Mythos publication and the Inspect multi-agent release is the legible artifact of a real methodological commitment. Cross-model comparability requires holding the scaffold constant; running stronger scaffolds for later models would break the across-time comparison. The consequence is that the framework can do multi-agent orchestration now and the published methodology cannot use that capability without forfeiting comparability against the models already on the leaderboard. The structural problem is upstream of any single AISI methodology choice.
AISI's public cyber evaluations cap each task at 2.5 million tokens of model output and use what the institute itself calls a “simple agent scaffold,” held constant across models so the cross-model comparison is apples-to-apples. The institute is explicit that both choices push scores down — they are the right choices for measuring relative progress across models, and the wrong instrument for reading any single number as a capability ceiling.
Inspect — the evaluation framework AISI itself maintains — added a multi-agent path called deepagent() on 27 April 2026, two weeks after the Mythos cyber evaluation shipped. The framework can now do subagent delegation, persistent memory, and structured planning; the published cyber methodology cannot use those features without breaking cross-model comparability against the models already on the leaderboard. The gap between what the framework can run and what the public methodology reports is a known fact about the instrument, not a critique of it.
Three quantitative numbers exist in the public literature. METR's o1 time-horizon roughly doubled with three engineer-weeks of elicitation — a 2× lift on one model, one benchmark, one set of elicitation techniques. Cybench's Subtasks / Unguided fold-change across 8 paired models is roughly 3× on average. OSWorld's 100-step / 15-step lift across 5 paired models averages ~1.6×. None of these is a ceiling number, and the three are not directly commensurable: each captures a different cap-vs-scaffold delta on a different instrument. The aggregate “gap factor” — what would convert a typical published score into the maximum a well-resourced red team would produce — is not measured.
The structural reason it is not measured matters more than the missing number: maximum-elicitation of cyber capability means building a cyber attack tool, and the legal, ethical, and reputational considerations that keep public evaluators from publishing one apply identically to maximum-elicitation in biosecurity, CBRN, or chemical-weapons assistance. The evaluation regimes closest to the ceiling — internal lab pre-deployment work conducted under RSP / Preparedness / FSF review, and any classified national-security evaluation whose existence the public does not have to be told about — are the regimes whose results the public cannot read. The asymmetric publication regime is the reason the gap is open-ended, not the reason a specific multiplier exists in some other room.
The page does not claim a specific multiplier between published scores and capability ceilings. The three measured numbers we have are: METR's o1 horizon doubled with three engineer-weeks of elicitation; Cybench's Subtasks / Unguided lift across 8 models averages 3×; OSWorld's 100- / 15-step lift across 5 models averages 1.6×. The three are not commensurable; the aggregate gap factor is not in the public record.
Maximum-elicitation of cyber or CBRN capability is operationally close to building the misuse tool. The legal and reputational considerations that keep public evaluators from publishing those runs apply to every domain Article 51 of the EU AI Act is drawn around. Anyone citing a fixed multiplier (2×, 10×, 100×) is past what the public sources support.
The 10^25 FLOP presumption is rebuttable; the provider demonstrates the model does not pose systemic risk despite crossing the cumulative-compute line, and the demonstration leans on capability evaluations. The July 2025 GPAI Guidelines codify a ±30% measurement tolerance on the cumulative-FLOP measurement; no comparable tolerance is documented on the capability side, because the capability side has not specified a measurement regime to put tolerance bounds on. The asymmetry between a tolerance-bounded input and a floor-only output is the gap this page renders. At identical training FLOP, the same model can clear a deployment decision under a single-agent eval that it would fail under a multi-agent eval. Inspect now supports the multi-agent variant; AISI's published methodology has not adopted it; the comparison the framework can run has not appeared in the published record.
If the elicitation gap widens with capability (the working assumption the public literature does not contradict and no public evaluator has measured), the calibration error against any fixed cumulative-FLOP line widens monotonically as the training-compute frontier moves up. The 10^25 FLOP threshold drawn against today's public capability evidence is a different policy instrument from the 10^25 FLOP threshold drawn against tomorrow's.
Article 51(2) sets a presumption of systemic risk above 10^25 FLOP that the provider can rebut. The rebuttal process, and most downstream Member State deployment decisions, cites capability evaluations. The July 2025 GPAI Guidelines say the cumulative-FLOP measurement is good to ±30%. The capability side has no comparable tolerance, because public evaluators have not specified a measurement regime to put bounds on.
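What a tolerance-bounded compute estimate looks like next to the threshold can be sketched in a few lines. This is an illustration only: it assumes the ±30% is applied symmetrically around a point estimate, which is one reading of the tolerance and not a statement of how the Guidelines operationalise it. The function names and the three-way labels are hypothetical.

```python
THRESHOLD_FLOP = 1e25   # Article 51(2) presumption line, as stated in the text
TOLERANCE = 0.30        # ±30% measurement tolerance, as stated in the text


def tolerance_band(estimate_flop, tolerance=TOLERANCE):
    """Return (low, high) bounds, ASSUMING a symmetric relative tolerance."""
    return estimate_flop * (1 - tolerance), estimate_flop * (1 + tolerance)


def position_vs_threshold(estimate_flop, threshold=THRESHOLD_FLOP,
                          tolerance=TOLERANCE):
    """Classify an estimate as 'below', 'above', or 'indeterminate'.

    'indeterminate' means the threshold falls inside the tolerance band,
    so the measurement alone cannot say which side of the line the run is on.
    """
    low, high = tolerance_band(estimate_flop, tolerance)
    if high < threshold:
        return "below"
    if low > threshold:
        return "above"
    return "indeterminate"


for estimate in (5e24, 9e24, 2e25):
    print(f"{estimate:.1e} FLOP -> {position_vs_threshold(estimate)}")
```

The compute side yields a closed interval with both ends known. A published capability score, read as a floor, yields only the lower end; no equivalent function can be written for it until an upper bound is specified.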
A presumption rebutted against floor evidence is not a presumption rebutted against the ceiling. If the elicitation gap is constant across training-compute scales, the calibration error is constant. If it widens — the working assumption nothing in the literature contradicts — the calibration error widens with every training run that crosses the line.
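The constant-versus-widening distinction can be made concrete with a toy model. Everything in this sketch is an assumption made for illustration: the power-law form, the base gap of 2.0, and the growth exponent of 0.1 are placeholders, not estimates, and the public record supports none of them as measurements.

```python
def gap_factor(compute_flop, base_gap=2.0, growth_exponent=0.0,
               reference_flop=1e25):
    """TOY MODEL: gap = base_gap * (compute / reference) ** growth_exponent.

    growth_exponent == 0 gives a gap that is constant across compute scales;
    growth_exponent > 0 gives a gap that widens as training compute grows.
    All parameter values are hypothetical placeholders.
    """
    return base_gap * (compute_flop / reference_flop) ** growth_exponent


frontier = [1e25, 1e26, 1e27]  # successive training-compute scales, in FLOP

constant = [gap_factor(c, growth_exponent=0.0) for c in frontier]
widening = [gap_factor(c, growth_exponent=0.1) for c in frontier]

for c, g_const, g_wide in zip(frontier, constant, widening):
    print(f"{c:.0e} FLOP  constant gap {g_const:.2f}x  widening gap {g_wide:.2f}x")
```

Under the constant case, a fixed FLOP line is miscalibrated by the same factor at every scale. Under the widening case, the same line drifts further from the ceiling with each order of magnitude, which is the monotonic widening the paragraph describes.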
EO 14110's 10^26 FLOP Defense Production Act reporting trigger was rescinded by EO 14148 in January 2025; the federal posture on cumulative-FLOP reporting is currently absent. State-level activity is the live front (California SB 1047, at 10^26 FLOP plus $100M, was vetoed in September 2024). If the next US reporting regime gates on cumulative FLOP, at the 10^26 EO 14110 line or below, the floor-only capability evidence underneath is the substrate the regime will inherit.
The Biden-era EO 14110 set a 10^26 FLOP Defense Production Act reporting trigger for frontier training runs. EO 14148 rescinded that order in January 2025. The US has no current cumulative-FLOP reporting regime at the federal level. State activity is where the next round of compute thresholds is being drafted: California SB 1047 (10^26 FLOP plus $100M) was vetoed in September 2024 and the legislature has not yet re-engaged at the frontier-compute layer. Whatever regime succeeds the rescinded EO inherits the same capability-evidence substrate as the EU: floor-only, tolerance-undocumented.
Capability-frontier reporting in the public research literature — Epoch's capability indices, METR's time horizons, the Anthropic Frontier Red Team posts, the AISI cyber reports, the OpenAI Preparedness reports, the GDM FSF releases — treats per-model published scores as the unit of comparison. The scaffold variation each publisher chose to disclose is the dimension that gets reported; the scaffold variation that exists but was not reported sits in a structural blind spot. The cross-link to /capabilities on Scrutica renders the published-score frontier; this page is the seam on which that frontier sits.
Most public capability-frontier reporting treats published scores as the unit of comparison. The scaffold variation each publisher chose to surface is the dimension that gets reported; the variation that exists but was not reported sits in a structural blind spot. The page exists to name that blind spot in the same place the compute numbers it would modify are visible.
Scrutica's substrate-strength is the compute side. The capability side is Epoch / METR / AISI territory; the policy side is the EU AI Office and US OSTP. This page sits on the seam — the four pages below each render one slice of the compute × capability × policy triangle the elicitation gap structurally widens.
Which facilities have the FLOP to put a model across the EU AI Act 10^25 FLOP line. The atlas's addendum carries the short version of the argument on this page; the dedicated surface here carries the quantitative substrate.
The benchmark frontier — METR Time Horizons, Cybench, APEX Agents, TheAgentCompany, OSWorld, BALROG — plotted against compute. Each y-axis on the scaling explorer is a single-scaffold published score; the underlying capability curve sits somewhere above by a multiplier the chart on this page makes legible.
Per-run FLOP under three independent paths (hardware / power / cost). Multi-agent deployment configurations require ~N× the FLOP budget for equivalent context length; the cost-equivalence between a single-agent evaluation run and a multi-agent deployment run lives here.
Full derivation chains for every Scrutica estimate. The methodology page is the parent surface for the elicitation-gap addendum on /threshold-atlas; this dedicated page is the standalone analytical reading, with the primary-source quantitative work attached.
Per-model paired scaffold-mode rows from Cybench (Unguided / Subtask-Guided / Subtasks) and OSWorld (15 / 50 / 100 step budgets). Rows survive the paired-row filter only when every column is populated in the underlying Epoch CSV ingest; 8 Cybench models and 5 OSWorld models qualify. Scores are presented at the precision the source publishes; the chart applies no rounding or interpolation. The Epoch CSV ingest is the offline-verifiable anchor for both leaderboards, which render their per-model rows via client-side JS.
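The paired-row filter described above can be sketched as follows. This is a minimal illustration, not the page's actual ingest code: the column names and sample rows are hypothetical, and the real Epoch CSV schema may differ.

```python
import csv
import io

# HYPOTHETICAL column names and rows; not the real Epoch CSV schema or data.
SAMPLE_CSV = """model,unguided,subtask_guided,subtasks
model-a,12.5,20.0,37.5
model-b,12.5,,30.0
model-c,,,
model-d,5.0,7.5,15.0
"""

SCORE_COLUMNS = ["unguided", "subtask_guided", "subtasks"]


def paired_rows(csv_text, score_columns):
    """Keep only rows where every scaffold-mode column holds a value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        # A missing or blank cell in any score column drops the whole row.
        if all((row.get(col) or "").strip() for col in score_columns):
            kept.append(row)
    return kept


rows = paired_rows(SAMPLE_CSV, SCORE_COLUMNS)
kept_models = [row["model"] for row in rows]
print(kept_models)  # ['model-a', 'model-d']
```

Scores stay as the strings read from the file, so the filter applies no rounding or interpolation and the published precision is preserved.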
That the published Subtasks score or 100-step budget is the maximum-elicitation capability of any model on either benchmark. Both are the strongest scaffold the publisher chose to disclose; both are still floors. The gap on the chart is the lift the publisher rendered, not the lift the underlying capability curve supports.
A floor is the lower-bound use of the published number: the value is at least this. Each cited disclosure on the page treats its number that way. The page does not collapse these framings into a single concept; the convergent-disclosure panel preserves the difference between explicit floor language (OpenAI, METR, AISI), qualified acknowledgements (Anthropic), a threat-actor-matching standard (GDM), and a structural argument about evals generally (Apollo).
Held-constant scaffolding is the right call for cross-model comparability. The institute put the floor-estimate caveat on the page itself, in publishable English. The structural problem sits one layer up: published cyber-capability scores get cited in policy chains (Article 51 rebuttals, sovereign-AI program risk-tier allocations, US export-control impact assessments) as if they were capability measurements, when AISI itself, and the four other organizations alongside, has labelled them floors. The calibration chain is what this page is about, not the upstream evaluation.
BibTeX, APA 7, Chicago, MLA, plain-text, and DataCite formats are available from the “Cite this” control in the page header. For a specific data point on the chart or a specific organizational disclosure, cite the underlying primary source rather than this page; the primary-source table above is the audit anchor.