The Elicitation Gap

Gringras, David

The thesis · as of 2026-05-20

Five organizations have said, in print, that the public capability numbers are floors. The gap factor is unmeasured.

OpenAI's Preparedness Framework v2 says it cleanly: “we regard any one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling.” METR's time-horizon paper attaches a number: results are a “reasonable lower bound” at 2-3 engineer-weeks of elicitation per model, and o1's measured time horizon more than doubled with three engineer-weeks of dedicated effort. UK AISI's 13 May 2026 cyber post names the specific design choices producing its floor (a 2.5M-token-per-task cap, a simple agent scaffold). Anthropic concedes the gap in its RSP; Google DeepMind's FSF sets a threat-actor-matching standard its current framework does not claim to meet; Apollo Research has the structural argument that prompt-format variance alone moves accuracy by up to 76 points.

What the convergence does not include is a multiplier. No public source quantifies a single “gap factor” that converts a published score into the ceiling; METR's ~2× on o1 is the closest measured number, and it covers one model on one axis. Anyone quoting 10× or 100× across the frontier is past what the literature supports. The honest read is that the gap exists, its width is bounded only by what each lab is willing to spend on elicitation, and the structural reason public evaluators cannot close it (max-elicitation of cyber or CBRN capability is operationally close to building the misuse tool) applies to every domain Article 51 of the EU AI Act is drawn around.

Article 51(2) of the EU AI Act presumes systemic risk above 10²⁵ FLOP. The presumption is rebuttable; the rebuttal and most downstream deployment decisions lean on capability evaluations. The capability evaluations the public can read are floors, by the evaluators' own words. OpenAI: “lower bound, rather than a ceiling.” METR: “a reasonable lower bound” with two to three engineer-weeks of elicitation per model. UK AISI: “a simple agent scaffold... artificially lowers success rates and understates what models can do with more tokens and stronger scaffolds.” Anthropic and Google DeepMind concede the gap exists in their public safety frameworks without quantifying it. Apollo Research shows the structural reason: small scaffold changes alone move scores by tens of percentage points.

The gap factor — the multiplier that would turn a published score into a ceiling — is not in the public literature. The regulator drawing the 10²⁵ line on Article 51 is drawing it against floor evidence whose distance from a ceiling no public evaluator has measured. The July 2025 GPAI Guidelines put a ±30% tolerance on the cumulative-FLOP measurement itself; no comparable tolerance is documented on the capability side because the capability side has not specified a measurement regime to put bounds on.

The disclosures, side-by-side · tier-1, primary-source verbatim

Six organizations, the same admission, in their own words.

The grid below collects the floor-language disclosures from the institutions whose published capability work is most often cited in policy. Each card is verbatim from the linked primary source; framing tags are descriptive (the card says what kind of admission it is, not whether the institution agrees with the page's consolidation of them).

OpenAI · Explicit floor framing

“we regard any one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling, on capabilities that may emerge in real world use and misuse.”

METR · Qualified floor framing

“researchers put a limited amount of effort into eliciting models to get good performance on their tasks, with their results serving as a reasonable lower bound. The most work was done to elicit o1 and the original Claude 3.5 Sonnet, each of which had around 2-3 engineer weeks of iterative development.”

UK AI Security Institute · Explicit floor framing

“The cap, alongside our use of a simple agent scaffold, artificially lowers success rates and understates what models can do with more tokens and stronger scaffolds.”

Anthropic · Qualified floor framing

“it is impossible to guarantee they are eliciting all the relevant model capabilities during testing… [we] could be substantially underestimating the capabilities that external actors could elicit.”

Google DeepMind · Threat-actor-match standard

“elicitation methods used during evaluations must be comprehensive enough to match the elicitation efforts of potential threat actors.”

Apollo Research · Structural argument

“Evals often aim to estimate an upper bound of capabilities, [so] it is important to understand how to elicit maximal rather than average [capabilities]… several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points.”

The gap, rendered quantitatively · Cybench + OSWorld ingest 2026-05-19

Same model, same benchmark, different scaffold.

No public evaluator has re-run a frontier-model cyber suite under a multi-agent scaffold; the closest available evidence comes from benchmarks that publish scores for the same model under more than one scaffold mode. Cybench's Unguided / Subtask-Guided / Subtasks columns and OSWorld's 15- / 50- / 100-step budget series both have that shape. Across 8 paired Cybench rows the mean Subtasks / Unguided fold change is roughly 3×; the largest gap is o1-preview's 10% Unguided → 46.8% Subtasks, a 4.68× lift. Across 5 paired OSWorld rows the 100-step / 15-step lift averages ~1.6×. Both are still floors: the Subtasks column is the strongest scaffold the benchmark publisher chose to disclose, not the strongest a properly resourced red team would produce.

The two charts below render “same model, same benchmark, different scaffold” on two primary-source datasets where the publisher disclosed per-scaffold-mode numbers. Move the model from a simpler scaffold to a stronger one and the score moves — by roughly 3× on Cybench across 8 models, by ~1.6× on OSWorld across 5. The numbers are not the gap; they are the lift the benchmark publisher chose to disclose. The true gap is at least that wide, and wider by construction — none of these are maximum-elicitation runs.

What the chart does not claim: that the Subtasks column or the 100-step budget is the actual capability ceiling. The lift visible on the chart is the lift the benchmark publisher chose to disclose. The gap between either of those columns and the true ceiling is unmeasured by construction — the publisher did not run stronger scaffolds, and no one publicly has.
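
The fold changes quoted above are plain ratios of a stronger-scaffold score to a weaker-scaffold score on the same model. A minimal Python sketch, assuming the average is a mean of per-model ratios (the page does not say whether it averages ratios or divides averages); the function names are illustrative and not part of any Scrutica tooling. The only data point used is the o1-preview Cybench row quoted in the text.

```python
from typing import List, Tuple


def fold_change(weaker: float, stronger: float) -> float:
    """Ratio of the stronger-scaffold score to the weaker-scaffold score."""
    if weaker <= 0:
        raise ValueError("fold change is undefined for a non-positive baseline score")
    return stronger / weaker


def mean_fold_change(pairs: List[Tuple[float, float]]) -> float:
    """Arithmetic mean of per-model fold changes (assumed: mean of ratios)."""
    if not pairs:
        raise ValueError("need at least one paired row")
    return sum(fold_change(w, s) for w, s in pairs) / len(pairs)


# o1-preview on Cybench, as quoted in the text: 10% Unguided -> 46.8% Subtasks.
o1_preview = (10.0, 46.8)
print(round(fold_change(*o1_preview), 2))  # 4.68
```

A ratio of this kind measures only the lift between two disclosed scaffold modes; it says nothing about the distance from the stronger mode to a maximum-elicitation run.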

The AISI case in detail · 2026-03-11 → 2026-05-13

A token cap, a simple scaffold, and a multi-agent framework that shipped two weeks after the cyber report.

The AISI cyber-range methodology paper (Folkerts, Payne, Inman et al., arXiv 2603.11214, 11 March 2026) defines the instrument: two ranges, a 32-step corporate-network attack (“The Last Ones”) and a 7-step ICS attack, under a 2.5M-token-per-task budget, evaluated across seven models from August 2024 to February 2026 at varying inference-time compute budgets. The Mythos Preview evaluation (13 April 2026) is one run on that instrument. On 13 May 2026 AISI named the two things that hold the published numbers below the ceiling: the token cap, and a “simple agent scaffold” that the institute holds constant across models so the cross-model comparison stays clean. Inspect, the evaluation framework AISI itself maintains at UKGovernmentBEIS/inspect_ai, added deepagent() in v0.3.213 on 27 April 2026 (subagent delegation, persistent memory, structured planning, with research / plan / general subagent factories built in). The Mythos cyber methodology has not adopted it. The cleanest read of the situation is the institute's own: holding the scaffold constant across models is the right call for comparability, and the right call here understates capability.

The two-week gap between Mythos publication and the Inspect multi-agent release is the legible artifact of a real methodological commitment. Cross-model comparability requires holding the scaffold constant; running stronger scaffolds for later models would break the across-time comparison. The consequence is that the framework can do multi-agent orchestration now and the published methodology cannot use that capability without forfeiting comparability against the models already on the leaderboard. The structural problem is upstream of any single AISI methodology choice.

AISI's public cyber evaluations cap each task at 2.5 million tokens of model output and use what the institute itself calls a “simple agent scaffold,” held constant across models so the cross-model comparison is apples-to-apples. The institute is explicit that both choices push scores down — they are the right choices for measuring relative progress across models, and the wrong instrument for reading any single number as a capability ceiling.

Inspect — the evaluation framework AISI itself maintains — added a multi-agent path called deepagent() on 27 April 2026, two weeks after the Mythos cyber evaluation shipped. The framework can now do subagent delegation, persistent memory, and structured planning; the published cyber methodology cannot use those features without breaking cross-model comparability against the models already on the leaderboard. The gap between what the framework can run and what the public methodology reports is a known fact about the instrument, not a critique of it.

What this page does not claim

The gap factor is not in the public literature.

Three measured numbers exist in the public literature. METR's o1 time horizon roughly doubled with three engineer-weeks of elicitation: a 2× lift on one model, one benchmark, one set of elicitation techniques. Cybench's Subtasks / Unguided fold change across 8 paired models is roughly 3× on average. OSWorld's 100-step / 15-step lift across 5 paired models averages ~1.6×. None of these is a ceiling number, and the three are not directly commensurable: each captures a different cap-vs-scaffold delta on a different instrument. The aggregate “gap factor” (what would convert a typical published score into the maximum a well-resourced red team would produce) is not measured.

The structural reason it is not measured matters more than the missing number: maximum-elicitation of cyber capability means building a cyber attack tool, and the legal, ethical, and reputational considerations that keep public evaluators from publishing one apply identically to maximum-elicitation in biosecurity, CBRN, or chemical-weapons assistance. The evaluation regimes closest to the ceiling — internal lab pre-deployment work conducted under RSP / Preparedness / FSF review, and any classified national-security evaluation whose existence the public does not have to be told about — are the regimes whose results the public cannot read. The asymmetric publication regime is the reason the gap is open-ended, not the reason a specific multiplier exists in some other room.

The page does not claim a specific multiplier between published scores and capability ceilings. The three measured numbers we have are: METR's o1 horizon doubled with three engineer-weeks of elicitation; Cybench's Subtasks / Unguided lift across 8 models averages 3×; OSWorld's 100- / 15-step lift across 5 models averages 1.6×. The three are not commensurable; the aggregate gap factor is not in the public record.

Maximum-elicitation of cyber or CBRN capability is operationally close to building the misuse tool. The legal and reputational considerations that keep public evaluators from publishing those runs apply to every domain Article 51 of the EU AI Act is drawn around. Anyone citing a fixed multiplier (2×, 10×, 100×) is past what the public sources support.

Policy consequences

A floor-bound capability layer under a cumulative-FLOP regulatory ceiling.

EU AI Act — Article 51(2) presumes systemic risk above 10²⁵ FLOP; the rebuttal cites capability evaluations.

The 10²⁵ presumption is rebuttable; the provider demonstrates the model does not pose systemic risk despite crossing the cumulative-compute line, and the demonstration leans on capability evaluations. The July 2025 GPAI Guidelines codify a ±30% measurement tolerance on the cumulative-FLOP measurement; no comparable tolerance is documented on the capability side, because the capability side has not specified a measurement regime to put tolerance bounds on. The asymmetry between a tolerance-bounded input and a floor-only output is the gap this page renders. At identical training FLOP, the same model can clear a deployment decision under a single-agent eval that it would fail under a multi-agent eval — Inspect now supports the multi-agent variant; AISI's published methodology has not adopted it; the comparison the framework can run has not appeared in the published record.

If the elicitation gap widens with capability (the working assumption the public literature does not contradict and no public evaluator has measured), the calibration error against any fixed cumulative-FLOP line widens monotonically as the training-compute frontier moves up. The 10²⁵ threshold drawn against today's public capability evidence is a different policy instrument from the 10²⁵ threshold drawn against tomorrow's.

Article 51(2) sets a presumption of systemic risk above 10²⁵ FLOP that the provider can rebut. The rebuttal process — and most downstream Member State deployment decisions — cites capability evaluations. The July 2025 GPAI Guidelines say the cumulative-FLOP measurement is good to ±30%. The capability side has no comparable tolerance, because public evaluators have not specified a measurement regime to put bounds on.

A presumption rebutted against floor evidence is not a presumption rebutted against the ceiling. If the elicitation gap is constant across training-compute scales, the calibration error is constant. If it widens — the working assumption nothing in the literature contradicts — the calibration error widens with every training run that crosses the line.
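
One way to read the asymmetry: the compute input carries a two-sided band, while a capability score published as a floor carries only a one-sided bound. The sketch below assumes the ±30% tolerance is applied as a symmetric relative band around a provider's FLOP estimate, which is a simplification for illustration and not a statement of how the AI Office applies the Guidelines. The function names and status labels are hypothetical.

```python
from typing import Optional, Tuple

EU_THRESHOLD_FLOP = 1e25  # Article 51(2) presumption line
FLOP_TOLERANCE = 0.30     # relative tolerance cited from the July 2025 GPAI Guidelines


def flop_band(estimate: float, tolerance: float = FLOP_TOLERANCE) -> Tuple[float, float]:
    """Two-sided band around a cumulative-FLOP estimate (assumed symmetric)."""
    return estimate * (1 - tolerance), estimate * (1 + tolerance)


def presumption_status(estimate: float, threshold: float = EU_THRESHOLD_FLOP) -> str:
    """Classify an estimate against the threshold, given the tolerance band."""
    low, high = flop_band(estimate)
    if low > threshold:
        return "above"
    if high < threshold:
        return "below"
    return "within tolerance of the line"


def capability_bound(published_score: float) -> Tuple[float, Optional[float]]:
    """A floor is one-sided: the lower bound is known, the upper bound is not."""
    return published_score, None


print(presumption_status(2e25))  # above
print(presumption_status(9e24))  # within tolerance of the line
print(capability_bound(46.8))    # (46.8, None)
```

The `None` in the capability bound is the point of the comparison: there is no documented upper edge to report, so no tolerance can be attached to it.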

US — EO 14148 rescinded EO 14110; the federal reporting regime sits on no cumulative-FLOP trigger.

EO 14110's 10²⁶ Defense Production Act reporting trigger was rescinded by EO 14148 in January 2025; the federal posture on cumulative-FLOP reporting is currently absent. State-level activity is the live front (California SB 1047 at 10²⁶+$100M was vetoed September 2024). If the next US reporting regime gates on cumulative-FLOP — at the 10²⁶ EO 14110 line or below — the floor-only capability evidence underneath is the substrate the regime will inherit.

The Biden-era EO 14110 set a 10²⁶ Defense Production Act reporting trigger for frontier training runs. EO 14148 rescinded that order in January 2025. The US has no current cumulative-FLOP reporting regime at the federal level. State activity is where the next round of compute thresholds is being drafted: California SB 1047 (10²⁶ FLOP plus $100M in training cost) was vetoed in September 2024 and the legislature has not yet re-engaged at the frontier-compute layer. Whatever regime succeeds the rescinded EO inherits the same capability-evidence substrate as the EU: floor-only, tolerance-undocumented.

Compute-governance research community — the gap is the least-operationalized capability-frontier axis in the literature.

Capability-frontier reporting in the public research literature — Epoch's capability indices, METR's time horizons, the Anthropic Frontier Red Team posts, the AISI cyber reports, the OpenAI Preparedness reports, the GDM FSF releases — treats per-model published scores as the unit of comparison. The scaffold variation each publisher chose to disclose is the dimension that gets reported; the scaffold variation that exists but was not reported sits in a structural blind spot. The cross-link to /capabilities on Scrutica renders the published-score frontier; this page is the seam on which that frontier sits.

Most public capability-frontier reporting treats published scores as the unit of comparison. The scaffold variation each publisher chose to surface is the dimension that gets reported; the variation that exists but was not reported sits in a structural blind spot. The page exists to name that blind spot in the same place the compute numbers it would modify are visible.

Where this lives on Scrutica

The compute × capability × policy seam.

Scrutica's substrate-strength is the compute side. The capability side is Epoch / METR / AISI territory; the policy side is the EU AI Office and US OSTP. This page sits on the seam — the four pages below each render one slice of the compute × capability × policy triangle the elicitation gap structurally widens.

/threshold-atlas

Facility Threshold Atlas

Which facilities have the FLOP to put a model across the EU AI Act 10²⁵ line. The atlas's addendum carries the short version of the argument on this page; the dedicated surface here carries the quantitative substrate.

/capabilities/scaling-explorer

Capability Scaling

The benchmark frontier — METR Time Horizons, Cybench, APEX Agents, TheAgentCompany, OSWorld, BALROG — plotted against compute. Each y-axis on the scaling explorer is a single-scaffold published score; the underlying capability curve sits somewhere above by a multiplier the chart on this page makes legible.

/flop-engine

FLOP Estimation Engine

Per-run FLOP under three independent paths (hardware / power / cost). Multi-agent deployment configurations require ~N× the FLOP budget for equivalent context length; the cost-equivalence between a single-agent evaluation run and a multi-agent deployment run lives here.

/methodology

Scrutica Methodology

Full derivation chains for every Scrutica estimate. The methodology page is the parent surface for the elicitation-gap addendum on /threshold-atlas; this dedicated page is the standalone analytical reading, with the primary-source quantitative work attached.
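
The ~N× figure on the FLOP Estimation Engine card follows from simple accounting if each of N agents processes a context of equivalent length. A rough sketch, assuming the common dense-transformer approximation of about 2 FLOP per parameter per token for a forward pass (attention overhead ignored); this is not Scrutica's estimation method, and the parameter count is a hypothetical placeholder. The token figure reuses the 2.5M-token-per-task cap cited for AISI's cyber evaluations.

```python
def inference_flop(params: float, tokens: float) -> float:
    """Approximate forward-pass FLOP: ~2 FLOP per parameter per token (assumption)."""
    return 2.0 * params * tokens


def multi_agent_flop(params: float, tokens_per_agent: float, n_agents: int) -> float:
    """Total FLOP when each of n_agents processes the same token budget."""
    if n_agents < 1:
        raise ValueError("need at least one agent")
    return n_agents * inference_flop(params, tokens_per_agent)


HYPOTHETICAL_PARAMS = 1e11  # placeholder model size, not a real model
TOKENS_PER_TASK = 2.5e6     # the per-task token cap cited in the text

single = inference_flop(HYPOTHETICAL_PARAMS, TOKENS_PER_TASK)
multi = multi_agent_flop(HYPOTHETICAL_PARAMS, TOKENS_PER_TASK, n_agents=4)
print(multi / single)  # 4.0
```

The ratio depends only on the agent count under these assumptions; shared context, caching, or unequal per-agent budgets would move it away from N.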

Methodology + sources · v2.0.0 · 2026-05-20

What this surface measures, what it does not, every source it cites.

What the chart shows

Per-model paired scaffold-mode rows from Cybench (Unguided / Subtask-Guided / Subtasks) and OSWorld (15 / 50 / 100 step budgets). Rows survive the paired-row filter only when every column is populated in the underlying Epoch CSV ingest; 8 Cybench models and 5 OSWorld models qualify. Scores are presented at the precision the source publishes; the chart applies no rounding or interpolation. The Epoch CSV ingest is the offline-verifiable anchor for both leaderboards, which render their per-model rows via client-side JS.
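
The paired-row filter described above can be sketched as follows. This is an illustration only: the column names and sample rows are hypothetical placeholders, not the schema or contents of the Epoch CSV ingest. Scores stay as strings so the source-published precision is preserved without rounding.

```python
import csv
import io
from typing import Dict, List


def paired_rows(csv_text: str, model_col: str, score_cols: List[str]) -> List[Dict[str, str]]:
    """Keep only rows where every listed score column is populated."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        values = [(row.get(col) or "").strip() for col in score_cols]
        if all(values):  # an empty string means the column is unpopulated
            kept.append({model_col: row[model_col], **dict(zip(score_cols, values))})
    return kept


# Hypothetical placeholder data, not real leaderboard scores.
sample = """model,unguided,subtask_guided,subtasks
model-a,0.10,0.20,0.30
model-b,0.05,,0.25
model-c,,,
"""

rows = paired_rows(sample, "model", ["unguided", "subtask_guided", "subtasks"])
print([r["model"] for r in rows])  # ['model-a']
```

Dropping partially populated rows keeps every comparison within a single model, at the cost of a smaller sample.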

What the chart does not claim

That the published Subtasks score or 100-step budget is the maximum-elicitation capability of any model on either benchmark. Both are the strongest scaffold the publisher chose to disclose; both are still floors. The gap on the chart is the lift the publisher rendered, not the lift the underlying capability curve supports.

What “floor” means here

A floor is the lower-bound use of the published number: the value is at least this. Each cited disclosure on the page treats its number that way. The page does not collapse the different framings into a single concept; the convergent-disclosure panel preserves the difference between explicit floor language (OpenAI, METR, AISI), qualified acknowledgements (Anthropic), a threat-actor-matching standard (GDM), and a structural argument about evals generally (Apollo).

Why this is not a critique of AISI

Held-constant scaffolding is the right call for cross-model comparability. The institute put the floor-estimate caveat on the page itself, in publishable English. The structural problem sits one layer up: published cyber-capability scores get cited in policy chains (Article 51 rebuttals, sovereign-AI program risk-tier allocations, US export-control impact assessments) as if they were capability measurements, when AISI itself, along with the four other organizations, has labelled them floors. The calibration chain is what this page is about, not the upstream evaluation.

Primary sources

Tier 1 · Our evaluation of Claude Mythos Preview’s cyber capabilities — UK AI Security Institute (2026-04-13) · accessed 2026-05-20
The Mythos cyber evaluation. The post reports outcomes (CTF challenges, the 32-step "Last Ones" corporate-network range); the scaffold mechanics live in Folkerts et al. 2026 below.
Tier 1 · How fast is autonomous AI cyber capability advancing? — UK AI Security Institute (2026-05-13) · accessed 2026-05-20
AISI’s verbatim admission that the published scores are floors — the 2.5M-token-per-task cap plus "simple agent scaffold" framing.
Tier 1 · Cyber-range scaling: AISI capability evaluation methodology (Folkerts, Payne, Inman et al.) — AISI / arXiv:2603.11214 (2026-03-11) · accessed 2026-05-20
AISI cyber-range methodology paper: the 32-step corporate-network range, the 7-step ICS range, the 10M / 100M token budget series across seven models (Aug 2024 – Feb 2026).
Tier 1 · Inspect AI framework — CHANGELOG (v0.3.213, deepagent + subagent delegation + persistent memory + structured planning) — UK Government BEIS (2026-04-27) · accessed 2026-05-20
Inspect’s deepagent() path shipped two weeks after the Mythos report. Built-in research() / plan() / general() subagent factories; the published cyber methodology has not yet adopted it.
Tier 2 · Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models — Zhang, Diffenderfer, Bhatt et al. (arXiv 2408.08926) (2024-08-15) · accessed 2026-05-20
Cybench scaffold-mode definitions: Unguided / Subtask-Guided / Subtasks (per-step success rate).
Tier 2 · Cybench HAL leaderboard (Unguided / Subtask-Guided / Subtasks columns) — Cybench / HAL evaluation harness (rolling) · accessed 2026-05-20
Per-model, per-scaffold-mode scores. The audit anchor is the Epoch CSV ingest (data/raw/epoch/ai_capabilities/cybench_external.csv).
Tier 2 · OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments — Xie, Zhang, Chen et al. (NeurIPS 2024; arXiv 2404.07972) (2024-04-11) · accessed 2026-05-20
OSWorld step-budget protocol (15 / 50 / 100 steps).
Tier 2 · OSWorld leaderboard (15- / 50- / 100-step budget rows) — OSWorld project (rolling) · accessed 2026-05-20
Per-model, per-step-budget scores. The audit anchor is the Epoch CSV ingest (data/raw/epoch/ai_capabilities/os_world_external.csv).
Tier 1 · Regulation (EU) 2024/1689 — Article 51(2) (presumption of systemic risk above 10²⁵ FLOP) — European Parliament & Council (2024-07-12) · accessed 2026-05-20
The cumulative-FLOP threshold the published capability evidence is meant to anchor.
Tier 1 · Guidelines on GPAI obligations (±30% measurement tolerance) — European Commission, AI Office (2025-07-18) · accessed 2026-05-20
July 2025 GPAI Guidelines on cumulative-FLOP measurement.
Tier 1 · Executive Order 14148 — Initial Rescissions of Harmful Executive Orders and Actions — White House (45-W-EO-14148) (2025-01-20) · accessed 2026-05-20
Rescinds EO 14110 (the 10²⁶ reporting trigger and the dual-use-AI reporting regime it sat on).
Tier 1 · OpenAI Preparedness Framework v2 (the "lower bound, rather than a ceiling" verbatim) — OpenAI (2025-04-15) · accessed 2026-05-20
OpenAI: "we regard any one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling, on capabilities that may emerge in real world use and misuse." The cleanest lab-side floor framing on public record.
Tier 1 · Measuring AI Ability to Complete Long Tasks (METR) — METR / arXiv:2503.14499 (2025-03-18) · accessed 2026-05-20
METR: their numbers serve as a "reasonable lower bound" with elicitation budgets of 2-3 engineer-weeks per model; o1's time-horizon "more than doubled with three engineer-weeks of elicitation."
Tier 1 · Anthropic Responsible Scaling Policy v3.0 (under-elicitation acknowledgement) — Anthropic (2026-02-24) · accessed 2026-05-20
Anthropic acknowledges in v3.0 that capability evaluations cannot guarantee they elicit "all the relevant model capabilities" and may "substantially underestimate" what external actors elicit.
Tier 1 · Google DeepMind Frontier Safety Framework v3.0 (threat-actor-elicitation requirement) — Google DeepMind (2025-09-22) · accessed 2026-05-20
GDM: "elicitation methods used during evaluations must be comprehensive enough to match the elicitation efforts of potential threat actors." Threat-actor-matching standard, not a stated floor.
Tier 1 · We need a science of evals (Apollo Research) — Apollo Research (2024-01-22) · accessed 2026-05-20
Apollo: evals "aim to estimate an upper bound of capabilities" and small prompt-format changes produce up to 76-point accuracy swings. The structural argument for the floor-vs-ceiling gap.

Cite this surface

BibTeX, APA 7, Chicago, MLA, plain-text, and DataCite formats are available from the “Cite this” control in the page header. For a specific data point on the chart or a specific organizational disclosure, cite the underlying primary source rather than this page; the primary-source table above is the audit anchor.

Sources

OpenAI Preparedness Framework v2 (the "lower bound, rather than a ceiling" verbatim)
METR — Measuring AI Ability to Complete Long Tasks (engineer-weeks elicitation disclosure)
AISI cyber blog — Mythos eval, May 13 floor admission, cyber-range methodology paper
Anthropic RSP v3.0 (under-elicitation acknowledgement)
Google DeepMind FSF v3.0 (threat-actor-elicitation requirement)
Apollo Research — We Need A Science of Evals
Inspect AI framework CHANGELOG (deepagent() v0.3.213)
Cybench HAL leaderboard (per-scaffold-mode scores) via Epoch CSV ingest
OSWorld leaderboard (per-step-budget scores) via Epoch CSV ingest
EU AI Act (Regulation 2024/1689)

Processing

Convergent-disclosure quotes are verbatim from each cited primary source (WebFetch-verified 2026-05-20). Quantitative chart data comes from the Epoch CSV ingest of the Cybench and OSWorld leaderboards (data/raw/epoch/ai_capabilities/), filtered to rows where every scaffold-mode or step-budget column the source publishes is populated for the same model. No rounding, no synthetic interpolation; scores at source-published precision. Policy citations resolve to the Eur-Lex CELEX consolidated text and the Federal Register PDF respectively.

Threshold Atlas · Governance

The cumulative-FLOP threshold the floor-only capability evidence anchors against; the addendum here carries the short version

Capability Scaling · Capabilities

Benchmark frontier whose published scores sit on the same elicitation floor the convergent-disclosure panel names

FLOP Engine · Infrastructure

Multi-agent deployment scales the per-run FLOP budget by ~N× for N agents

Methodology · Reference

Full derivation chain underneath the chart and the convergent-disclosure substrate


The Elicitation Gap

The thesisas of 2026-05-20

Five organizations have said, in print, that the public capability numbers are floors. The gap factor is unmeasured.

The disclosures, side-by-sidetier-1, primary-source verbatim

Six organizations, the same admission, in their own words.

OpenAIExplicit floor framing

“we regard any one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling, on capabilities that may emerge in real world use and misuse.”

METRQualified floor framing

“researchers put a limited amount of effort into eliciting models to get good performance on their tasks, with their results serving as a reasonable lower bound. The most work was done to elicit o1 and the original Claude 3.5 Sonnet, each of which had around 2-3 engineer weeks of iterative development.”

UK AI Security InstituteExplicit floor framing

“The cap, alongside our use of a simple agent scaffold, artificially lowers success rates and understates what models can do with more tokens and stronger scaffolds.”

AnthropicQualified floor framing

“it is impossible to guarantee they are eliciting all the relevant model capabilities during testing… [we] could be substantially underestimating the capabilities that external actors could elicit.”

Google DeepMindThreat-actor-match standard

“elicitation methods used during evaluations must be comprehensive enough to match the elicitation efforts of potential threat actors.”

Apollo ResearchStructural argument

“Evals often aim to estimate an upper bound of capabilities, [so] it is important to understand how to elicit maximal rather than average [capabilities]… several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points.”

The gap, rendered quantitativelyCybench + OSWorld ingest 2026-05-19

Same model, same benchmark, different scaffold.

The AISI case in detail2026-03-11 → 2026-05-13

A token cap, a simple scaffold, and a multi-agent framework that shipped two weeks after the cyber report.

What this page does not claim

The gap factor is not in the public literature.

Policy consequences

A floor-bound capability layer under a cumulative-FLOP regulatory ceiling.

EU AI Act — Article 51(2) presumes systemic risk above 10²⁵ FLOP; the rebuttal cites capability evaluations.

US — EO 14148 rescinded EO 14110; the federal reporting regime sits on no cumulative-FLOP trigger.

Compute-governance research community — the gap is the least- operationalized capability-frontier axis in the literature.

Where this lives on Scrutica

The compute × capability × policy seam.

/threshold-atlas

Facility Threshold Atlas

/capabilities/scaling-explorer

Capability Scaling

/flop-engine

FLOP Estimation Engine

/methodology

Scrutica Methodology

Methodology + sourcesv2.0.0 · 2026-05-20

What this surface measures, what it does not, every source it cites.

What the chart shows

What the chart does not claim

What “floor” means here

Why this is not a critique of AISI

Primary sources

Tier 1Our evaluation of Claude Mythos Preview’s cyber capabilities — UK AI Security Institute (2026-04-13)· accessed 2026-05-20
The Mythos cyber evaluation. The post reports outcomes (CTF challenges, the 32-step "Last Ones" corporate-network range); the scaffold mechanics live in Folkerts et al. 2026 below.
Tier 1How fast is autonomous AI cyber capability advancing? — UK AI Security Institute (2026-05-13)· accessed 2026-05-20
AISI’s verbatim admission that the published scores are floors — the 2.5M-token-per-task cap plus "simple agent scaffold" framing.
Tier 1Cyber-range scaling: AISI capability evaluation methodology (Folkerts, Payne, Inman et al.) — AISI / arXiv:2603.11214 (2026-03-11)· accessed 2026-05-20
AISI cyber-range methodology paper: the 32-step corporate-network range, the 7-step ICS range, the 10M / 100M token budget series across seven models (Aug 2024 – Feb 2026).
Tier 1Inspect AI framework — CHANGELOG (v0.3.213, deepagent + subagent delegation + persistent memory + structured planning) — UK Government BEIS (2026-04-27)· accessed 2026-05-20
Inspect’s deepagent() path shipped two weeks after the Mythos report. Built-in research() / plan() / general() subagent factories; the published cyber methodology has not yet adopted it.
Tier 2Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models — Zhang, Diffenderfer, Bhatt et al. (arXiv 2408.08926) (2024-08-15)· accessed 2026-05-20
Cybench scaffold-mode definitions: Unguided / Subtask-Guided / Subtasks (per-step success rate).
Tier 2Cybench HAL leaderboard (Unguided / Subtask-Guided / Subtasks columns) — Cybench / HAL evaluation harness (rolling)· accessed 2026-05-20
Per-model, per-scaffold-mode scores. The audit anchor is the Epoch CSV ingest (data/raw/epoch/ai_capabilities/cybench_external.csv).
Tier 2OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments — Xie, Zhang, Chen et al. (NeurIPS 2024; arXiv 2404.07972) (2024-04-11)· accessed 2026-05-20
OSWorld step-budget protocol (15 / 50 / 100 steps).
Tier 2 · OSWorld leaderboard (15- / 50- / 100-step budget rows) — OSWorld project (rolling) · accessed 2026-05-20
Per-model, per-step-budget scores. The audit anchor is the Epoch CSV ingest (data/raw/epoch/ai_capabilities/os_world_external.csv).
Tier 1 · Regulation (EU) 2024/1689 — Article 51(2) (presumption of systemic risk above 10²⁵ FLOP) — European Parliament & Council (2024-07-12) · accessed 2026-05-20
The cumulative-FLOP threshold the published capability evidence is meant to anchor.
Tier 1 · Guidelines on GPAI obligations (±30% measurement tolerance) — European Commission, AI Office (2025-07-18) · accessed 2026-05-20
July 2025 GPAI Guidelines on cumulative-FLOP measurement.
Tier 1 · Executive Order 14148 — Initial Rescissions of Harmful Executive Orders and Actions — White House (45-W-EO-14148) (2025-01-20) · accessed 2026-05-20
Rescinds EO 14110 (the 10²⁶ reporting trigger and the dual-use-AI reporting regime it sat on).
Tier 1 · OpenAI Preparedness Framework v2 (the "lower bound, rather than a ceiling" verbatim) — OpenAI (2025-04-15) · accessed 2026-05-20
OpenAI: "we regard any one-time capability elicitation in a frontier model as a lower bound, rather than a ceiling, on capabilities that may emerge in real world use and misuse." The cleanest lab-side floor framing on public record.
Tier 1 · Measuring AI Ability to Complete Long Tasks (METR) — METR / arXiv:2503.14499 (2025-03-18) · accessed 2026-05-20
METR: their numbers serve as a "reasonable lower bound" with elicitation budgets of 2-3 engineer-weeks per model; o1's time-horizon "more than doubled with three engineer-weeks of elicitation."
Tier 1 · Anthropic Responsible Scaling Policy v3.0 (under-elicitation acknowledgement) — Anthropic (2026-02-24) · accessed 2026-05-20
Anthropic acknowledges in v3.0 that capability evaluations cannot guarantee they elicit "all the relevant model capabilities" and may "substantially underestimate" what external actors elicit.
Tier 1 · Google DeepMind Frontier Safety Framework v3.0 (threat-actor-elicitation requirement) — Google DeepMind (2025-09-22) · accessed 2026-05-20
GDM: "elicitation methods used during evaluations must be comprehensive enough to match the elicitation efforts of potential threat actors." Threat-actor-matching standard, not a stated floor.
Tier 1 · We need a science of evals (Apollo Research) — Apollo Research (2024-01-22) · accessed 2026-05-20
Apollo: evals "aim to estimate an upper bound of capabilities" and small prompt-format changes produce up to 76-point accuracy swings. The structural argument for the floor-vs-ceiling gap.
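The per-model, per-scaffold-mode rows cited above (Cybench modes, OSWorld step budgets) lend themselves to a simple within-benchmark comparison: for each model, take the score under the most restrictive configuration as the floor and compare it with the best score under any listed configuration. The sketch below is a minimal illustration under stated assumptions. The row layout `(model, mode, score)`, the mode labels, the function name `scaffold_spread`, and every value in `ROWS` are hypothetical placeholders, not the schema or contents of the Epoch CSV ingest. The result is a spread between published configurations, not a measured gap factor.

```python
from collections import defaultdict

# Hypothetical placeholder rows: (model, scaffold_mode, score in percent).
# These are NOT real leaderboard values; substitute rows from the actual ingest.
ROWS = [
    ("model-a", "unguided", 10.0),
    ("model-a", "subtask_guided", 25.0),
    ("model-b", "unguided", 20.0),
    ("model-b", "subtask_guided", 30.0),
    ("model-c", "unguided", 0.0),
    ("model-c", "subtask_guided", 5.0),
]


def scaffold_spread(rows, floor_mode="unguided"):
    """Per model, compare the floor-mode score with the best score in any mode."""
    by_model = defaultdict(dict)
    for model, mode, score in rows:
        by_model[model][mode] = score

    spread = {}
    for model, scores in by_model.items():
        # Skip models that lack the floor mode or have only one configuration.
        if floor_mode not in scores or len(scores) < 2:
            continue
        floor = scores[floor_mode]
        best = max(scores.values())
        spread[model] = {
            "floor": floor,
            "best": best,
            "delta_points": best - floor,
            # A ratio is undefined when the floor score is zero.
            "ratio": best / floor if floor > 0 else None,
        }
    return spread


if __name__ == "__main__":
    for model, row in scaffold_spread(ROWS).items():
        print(model, row)
```

Reporting both the percentage-point delta and the ratio is deliberate: a ratio is undefined at a zero floor and inflates quickly near it, so the two figures should be read together.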

Cite this surface

Sources

OpenAI Preparedness Framework v2 (the "lower bound, rather than a ceiling" verbatim)
METR — Measuring AI Ability to Complete Long Tasks (engineer-weeks elicitation disclosure)
AISI cyber blog — Mythos eval, May 13 floor admission, cyber-range methodology paper
Anthropic RSP v3.0 (under-elicitation acknowledgement)
Google DeepMind FSF v3.0 (threat-actor-elicitation requirement)
Apollo Research — We Need A Science of Evals
Inspect AI framework CHANGELOG (deepagent() v0.3.213)
Cybench HAL leaderboard (per-scaffold-mode scores) via Epoch CSV ingest
OSWorld leaderboard (per-step-budget scores) via Epoch CSV ingest
EU AI Act (Regulation 2024/1689)


Threshold Atlas · Governance

The cumulative-FLOP threshold the floor-only capability evidence anchors against; the addendum here carries the short version

Capability Scaling · Capabilities

Benchmark frontier whose published scores sit on the same elicitation floor the convergent-disclosure panel names

FLOP Engine · Infrastructure

Multi-agent deployment scales the per-run FLOP budget by ~N× per agent

Methodology · Reference

Full derivation chain underneath the chart and the convergent-disclosure substrate

Built for the AI governance research community

The Elicitation Gap

Five organizations have said, in print, that the public capability numbers are floors. The gap factor is unmeasured.

Six organizations, the same admission, in their own words.

Same model, same benchmark, different scaffold.

A token cap, a simple scaffold, and a multi-agent framework that shipped two weeks after the cyber report.

The gap factor is not in the public literature.

A floor-bound capability layer under a cumulative-FLOP regulatory ceiling.

EU AI Act — Article 51(2) presumes systemic risk above 10²⁵ FLOP; the rebuttal cites capability evaluations.

US — EO 14148 rescinded EO 14110; the federal reporting regime sits on no cumulative-FLOP trigger.

Compute-governance research community — the gap is the least-operationalized capability-frontier axis in the literature.

The compute × capability × policy seam.

What this surface measures, what it does not, every source it cites.

What the chart shows

What the chart does not claim

What “floor” means here

Why this is not a critique of AISI

Primary sources

Cite this surface
