Skip to content

Control Tax: The Price of Keeping AI in Check

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=BpyR6Wkl1c
Code: https://github.com/MikhailTerekhov/control-tax-code
Area: AI Safety / AI Control
Keywords: AI Control, Trusted Monitoring, Code Backdoor, Game Theory, Auditing Budget

TL;DR

This paper introduces "Control Tax"—the operational and financial costs of integrating AI Control (AIC) measures into production pipelines. The authors empirically measure the ROC performance of frontier models acting as monitors on the APPS code backdoor task, translate these ROC curves into "safety probability under a given auditing budget" using game theory, and finally plot the Pareto frontier of "Safety vs. Monitoring Cost," demonstrating that the most expensive monitors are not necessarily the safest.

Background & Motivation

Background: As agentic AI is deployed in high-risk real-world scenarios, "AI Control" (AIC; Greenblatt et al. 2023) has emerged as a new direction. It assumes that strong models may be misaligned and uses weaker but trusted models to monitor stronger, untrusted models during deployment to "safely extract useful work" while preventing AI subversion. The classic protocol is "trusted monitoring," where a monitor model assigns suspicion scores to the untrusted model's outputs; outputs exceeding a threshold are sent for human auditing.

Limitations of Prior Work: Whether AIC is adopted by the industry depends largely on the overhead it introduces, yet this has rarely been quantified. Existing research either provides static classifier performance (e.g., AUC, ROC) or discusses safety-usefulness trade-offs, but no work answers the most practical question: How much must be spent on monitoring to achieve a specific safety level? An ROC curve alone does not equate to safety—even a strong monitor with a poor deployment strategy (e.g., auditing only after many submissions have passed) provides weak protection.

Key Challenge: There is a gap between static "monitor accuracy" and dynamic "long-term deployment safety," constrained by a realistic bottleneck: the auditing budget (only a small fraction of submissions can be manually reviewed). Stronger monitors are usually more accurate but more expensive, creating an uncharacterized trade-off between "safety" and "cost."

Goal: (1) Establish a theoretical framework mapping monitor performance and cost to safety guarantees; (2) Measure the real performance of frontier models as attackers/monitors in adversarial settings; (3) Provide monetary cost estimates for control protocols and design optimal monitoring strategies that balance safety and cost-effectiveness.

Key Insight: The authors select "code backdooring" as the attacker task. An attacker model inserts hidden bugs into correct solutions for APPS competitive programming problems, and a monitor model attempts to catch them. This task allows for objective determination of backdoor success via test cases, transforming vague "alignment" issues into measurable detection problems.

Core Idea: The introduction of "Control Tax" from an economic perspective. By using ROC curves of monitors as input for a game-theoretic model, the authors calculate the "probability of the Blue Team (defenders) winning given an auditing budget \(c\)." Plotting this safety probability against real-world dollar costs for monitors reveals the Pareto frontier, allowing practitioners to convert "safety goals" directly into "monitoring expenditures."

Method

Overall Architecture

The workflow follows a pipeline from "empirical data to safety guarantees to cost decisions," consisting of three interconnected contributions:

First is Empirical Measurement: An attacker/monitor pair is constructed using the APPS dataset. The attacker model (agentic and iterative) injects hidden backdoors into correct solutions, while the monitor model assigns a suspicion score (1–100) to each code snippet. By pairing backdoor solutions with correct solutions in a balanced dataset, ROC curves are generated for each (attacker, monitor) combination. Second is Theoretical Translation: Since a single ROC curve lacks safety semantics, the authors build a game-theoretic model. They introduce an "affordable suspicion threshold" \(s(c,p_b)\) to translate ROC data into the Red Team's (attacker) optimal winning probability \(P_w^*\) under a budget \(c\), where safety \(= 1 - P_w^*\). The model distinguishes between batch (all outputs available for sorting) and sequential (outputs arrive one by one) deployment. Third is Cost Decision: Safety probabilities are plotted alongside the real-world dollar cost of running 1,000 samples for each monitor to identify the Pareto frontier and derive optimal strategies (e.g., hiding the total task count \(N\), front-loading auditing).

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["APPS Programming Problems<br/>+ Correct Solutions"] --> B["Backdoor Generation and Monitoring Evaluation Pipeline<br/>attacker injects backdoor → monitor assigns suspicion → ROC"]
    B --> C["Game-Theoretic Safety Model<br/>Affordable threshold s(c,pb) → Red Team optimal Pw*"]
    C -->|Batch vs. Sequential Output| D["Batch / Sequential Deployment<br/>SNAC Poisson rate λ"]
    D --> E["Control Tax Curves and Optimal Strategies<br/>Safety vs. Cost Pareto Frontier"]

Key Designs

1. Control Tax: Putting "Safety" and "Cost" on the Same Plot The core conceptual innovation is quantifying AIC as an economic problem. Control tax is defined as the operational and financial cost of integrating control measures into an AI pipeline—specifically, "the average dollar cost for a monitor to process 1,000 samples." The key observation is that an isolated ROC curve or AUC score cannot determine whether a monitor should be deployed, because safety depends on both monitor accuracy and deployment strategy. The "Control Tax Curve" (Figure 1) plots safety on the y-axis and average monitoring cost on the x-axis. This reveals a counter-intuitive conclusion: the most expensive, highest-performing monitor is not necessarily the safest. Under a limited budget, using a cheaper monitor (e.g., Gemini 2.0 Flash) may allow for more frequent auditing, resulting in higher overall safety.

2. Code Backdoor Generation and Monitoring Pipeline To obtain realistic ROC data, the authors created an environment where attack success can be objectively judged. A backdoor is defined as a solution that passes all public tests but fails on a specific, uncovered "backdoor input." The generation uses an agentic iterative process where the attacker brainstorms strategies, an independent instance of the same model ranks them, and the optimal strategy generates the solution and trigger input. Candidates must pass all public tests and produce a different output than the original solution on the trigger input. Monitors provide fine-grained suspicion scores (1–100) using Chain-of-Thought (CoT), which is essential for the game-theoretic model to adjust thresholds according to the auditing budget. Results show o3-mini achieves an AUC of 0.92–0.96 as a monitor.

3. Game-Theoretic Safety Model: Mapping ROC to Safety Probability The authors model the scenario as a Red-Blue game. The Red Team (misaligned AI) independently attempts to inject backdoors with probability \(p_b\). The Blue Team can audit a fraction \(c\) of submissions (assuming auditing always catches backdoors). Blue wins if any backdoor is audited; Red wins only by successfully deploying an undetected backdoor. The "affordable suspicion threshold" \(s(c,p_b)\) is defined by the implicit equation \(t(s)p_b + f(s)(1-p_b) = c\) (where \(t, f\) are TPR and FPR). In batch settings, Blue can sort outputs; the authors prove this is equivalent to using a higher effective auditing budget \(t_0 = t(s(c,0))\) for pre-auditing.

4. Sequential Deployment (SNAC) and Optimal Audit Allocation In sequential scenarios, Blue cannot reorder submissions. The authors define an "allocation strategy" \(p(\tau)\) (a non-increasing function of auditing probability over time, subject to \(\int_0^1 p(\tau)d\tau = c\)). They prove that non-increasing strategies are optimal because failing to catch a backdoor early can lead to irrevocable system takeover. In the limit \(N \to \infty\) with \(p_b \sim \lambda/N\), the Red Team's winning probability converges to:

\[P_w(p;\lambda) = \lambda p_+ \int_0^1 (1 - \tilde t(p(\tau))) \exp\left(-\lambda p_+ \tau - \lambda(1 - p_+) \int_0^\tau \tilde t(p(\theta)) d\theta\right) d\tau,\]

where \(p_+\) is the success probability of a backdoor attempt and \(\tilde t(x) = t(s(x,0))\). A key finding is that hiding the total number of tasks \(N\) from the attacker significantly reduces their success rate, as they cannot select the optimal \(\lambda\).

Loss & Training

Ours does not involve model training; all evaluations are based on API inference of frontier LMs. Optimization occurrs at the game-theoretic level by numerically solving the saddle-point problem \(\min_p \max_\lambda P_w(p; \lambda)\) to find the optimal audit allocation \(p(\tau)\).

Key Experimental Results

Main Results

Evaluation used the APPS dataset (2427 evaluable samples) across 13 attacker/monitor models (GPT-3.5-Turbo to o3-mini, Claude, and Gemini series).

Metric Result Description
Strongest monitor (o3-mini) AUC 0.92 – 0.96 Distinguishing backdoored vs. correct solutions
Monitoring Performance vs. Model Scale Positive Correlation Stronger models generally monitor more accurately
Backdoor Stealth vs. Attacker Strength Minimal Correlation Public unit tests likely act as a pre-filter
Next-gen Models (GPT-4.1 / Gemini 2.5) Significantly stealthier Potential "generational leap" in backdoor capabilities
Error Types Mostly False Positives (FP) Calibration improvements could enhance precision

Key Findings

  • Most expensive is not always safest: Given a fixed budget, using cheaper monitors allows for higher auditing frequency, which can lead to higher overall safety than using a single expensive monitor.
  • Hiding \(N\) is a critical defense: Since the Red Team's optimal attack rate depends on \(1/N\), hiding the deployment duration forces the attacker into sub-optimal strategies.
  • Front-load auditing: The optimal defense in sequential deployment is a non-increasing audit allocation, spending more budget early to prevent initial takeover.
  • The weakness of patient attackers: In infinite horizons (\(N\) unknown), the attacker's best strategy is to wait and inject rarely, exposing the inherent vulnerability of pure monitoring in long-term deployments.

Highlights & Insights

  • Economics of AI Safety: Control tax formalizes the safety-cost trade-off, moving AIC from "which monitor is most accurate" to "how to achieve safety within a budget."
  • Elegant Theoretical Mapping: The use of \(s(c, p_b)\) to feed empirical ROC data into game models avoids parametric assumptions about monitor distributions.
  • Actionable Defense Strategies: The suggestion to hide \(N\) is a concrete, executable strategy derived directly from the mathematical structure of the optimal attack rate.
  • Fine-grained suspicion scores: Using continuous scores instead of binary labels is the prerequisite for the game-theoretic framework to function across arbitrary budgets \(c\).

Limitations & Future Work

  • Domain Specificity: Verified only on APPS; needs expansion to agentic tasks like SWE-bench.
  • Generation Bias: Honest solutions are from APPS, while backdoor solutions are generated by attackers; using the same attacker to generate both would be more controlled.
  • Missing Usefulness Modeling: The analysis covers safety and audit cost but lacks the impact of monitoring on task throughput.
  • Anytime Scenarios: The discussion on infinite horizons (\(N\) unknown) is preliminary and requires more rigorous future work.
  • Lack of Stealth Feedback: Attackers were not explicitly trained or incentivized for stealth; providing feedback on monitored backdoors might increase their evasiveness.
  • vs. Greenblatt et al. (2023): Ours follows their "trusted monitoring" protocol but shifts focus to cost quantification.
  • vs. Griffin et al. (2024): While they study safety-usefulness, Ours investigates the safety-cost trade-off, which is more decisive for industrial adoption.
  • vs. Engels et al. (2025): They evaluate APPS backdoors with binary labels and no agentic scaffolding; Ours achieves higher attacker capability and more flexible evaluation via fine-grained scores.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Formally introducing "Control Tax" is a highly practical and novel contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Extensive model matrix and ROC data, though limited to the APPS domain.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear progression from empirical data to theory and cost analysis.
  • Value: ⭐⭐⭐⭐⭐ Provides a concrete decision-making tool for AIC deployment.