
Enabling Fine-Grained Operating Points for Black-Box LLMs

Conference: ICLR 2026 · arXiv: 2510.17727 · Code: not released (code snippets available in the appendix) · Area: LLM Evaluation · Keywords: black-box LLM, operating points, probability calibration, PR curve, confidence estimation

TL;DR

This paper identifies that verbalized probabilities from black-box LLMs produce only 16–23 unique values (low-cardinality problem), resulting in coarse PR/ROC curves that prevent fine-grained threshold tuning. By injecting parameterized noise and an optional MLP correction, the number of unique values increases from 16 to 20,000+, matching the performance of 20-sample ensembles with only 1–2 API calls.

Background & Motivation

Background: When deploying LLMs as classifiers, practitioners must select an appropriate operating point on the precision-recall curve. A common approach is to prompt the LLM to output a verbalized probability in \([0,1]\) as a confidence score.

Limitations of Prior Work: LLMs exhibit a severe "rounding bias"—their output probabilities concentrate on a small set of values such as 0, 0.5, 0.85, 0.9, and 0.95 (with a tendency to end in 0 or 5), yielding PR/ROC curves with only a handful of discrete points and preventing fine-grained threshold control.

Key Challenge: The coarse spacing on the PR curve forces practitioners to choose between high-precision/low-recall and low-precision/high-recall operating points, with no fine-grained trade-off available at deployment time. The sketch below illustrates the effect.
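Concretely, a PR curve can have at most one point per unique score value, so a score that takes only ~20 distinct values yields a PR curve with ~20 points. A minimal illustration on synthetic data (variable names and numbers are ours, not the paper's):

```python
# Illustrative only: rounded scores collapse the PR curve to a few points.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=5000)                            # binary labels
latent = np.clip(0.3 * y_true + rng.uniform(0, 0.7, 5000), 0, 1)  # noisy continuous score

# A continuous score admits up to one threshold per sample ...
_, _, t_cont = precision_recall_curve(y_true, latent)
# ... but a score snapped to multiples of 0.05 (the "rounding bias") admits at most 21.
verbalized = np.round(latent * 20) / 20
_, _, t_vrb = precision_recall_curve(y_true, verbalized)

print(len(t_cont), "thresholds vs.", len(t_vrb))  # roughly 5000 vs. <= 21
```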

Goal: How can black-box LLM probability outputs be made continuous without substantially increasing the number of API calls?

Key Insight: Inject parameterized noise into the discrete probabilities, effectively "diffusing" the discrete distribution into a continuous one.

Core Idea: Breaking rounding bias through noise injection—expanding the number of unique probability values from 16 to 20,000+ while preserving ranking performance.

Method

Overall Architecture

Three variants are proposed: (1) Unsup, unsupervised uniform noise injection; (2) Sup-1call, a supervised MLP with noise and a single API call; (3) Sup-2call, a supervised MLP with noise and two API calls (at temperatures \(T=0\) and \(T=1\)).

Key Designs

  1. Unsupervised Noise (Ours-Unsup):

     • Function: Adds uniform noise to verbalized probabilities, maximizing the noise magnitude while maintaining performance.
     • Mechanism: \(\max w \ \text{s.t.}\ \sum_i \text{loss}(y_i, \text{clip}(z_i \cdot w + y_{\text{vrb},i})) \leq \sum_i \text{loss}(y_i, y_{\text{vrb},i}),\ z_i \sim U(0,1)\). This finds the largest noise amplitude under the constraint that performance does not degrade.
     • Design Motivation: Requires no labeled data. Increases cardinality from 16 to 5,614.

  2. Supervised Noise + MLP (Ours-Sup):

     • Function: Learns a correction function \(f\) that maps discrete probabilities to better-calibrated outputs while simultaneously injecting noise.
     • Mechanism: \(\min_{\theta_f, w} \sum_i \text{loss}(y_i, \text{sigmoid}(z_i / w + f(y_{\text{vrb},i}; \theta_f))) + \lambda w,\ z_i \sim \mathcal{N}(0,1)\). Here \(f\) is a 2-layer ReLU MLP that jointly learns the calibration correction and the noise amplitude.
     • Design Motivation: The MLP correction addresses probability miscalibration (systematic over- or under-confidence), while the noise injection resolves the cardinality problem (discretization). A sketch of both objectives follows this list.
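A minimal reconstruction of the two objectives from the formulas above (not the authors' released code; the choice of log-loss, the grid search, and all hyperparameters are our assumptions):

```python
# Sketch only: reconstructed from the paper's objectives, with assumed details.
import numpy as np
import torch
import torch.nn as nn

def log_loss(y, p, eps=1e-7):
    """Summed binary log-loss; stands in for the paper's generic `loss`."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

def unsup_noise_amplitude(y, y_vrb, grid=np.linspace(0, 0.5, 501), seed=0):
    """Ours-Unsup: largest w such that uniform noise does not increase the loss."""
    z = np.random.default_rng(seed).uniform(0, 1, size=len(y))
    base = log_loss(y, y_vrb)
    best_w = 0.0
    for w in grid:  # plain grid search in place of a proper 1-D solver
        if log_loss(y, np.clip(z * w + y_vrb, 0, 1)) <= base:
            best_w = w
    return best_w

class SupCorrector(nn.Module):
    """Ours-Sup: 2-layer ReLU MLP correction f plus a learned noise scale w."""
    def __init__(self, hidden=16):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.w = nn.Parameter(torch.tensor(5.0))  # noise enters as z / w

    def forward(self, y_vrb):
        z = torch.randn_like(y_vrb)  # z ~ N(0, 1), redrawn on every call
        return torch.sigmoid(z / self.w + self.f(y_vrb.unsqueeze(-1)).squeeze(-1))

def train_sup(y, y_vrb, lam=1e-3, steps=2000, lr=1e-2):
    """Minimize loss + lambda * w; penalizing w pushes the noise z / w up.

    y, y_vrb: float tensors of shape (N,).
    """
    model = SupCorrector()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss(reduction="sum")
    for _ in range(steps):
        opt.zero_grad()
        (bce(model(y_vrb), y) + lam * model.w).backward()
        opt.step()
    return model
```

Note that fresh noise \(z\) is drawn for every query at inference time, which is what spreads identical verbalized scores into distinct values. The sketch shows the single-call input; Sup-2call would presumably feed both calls' probabilities into \(f\), a detail omitted here.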

Key Experimental Results

Cardinality Improvement

| Method             | API Calls | Unique Values | Cardinality Gain |
| ------------------ | --------- | ------------- | ---------------- |
| Prompt-Naive       | 1         | 10            | 1× (baseline)    |
| Sample-Class (20×) | 20        | 97            | 10×              |
| Ours-Unsup         | 1         | 5,614         | 561×             |
| Ours-Sup-2call     | 2         | 20,607        | 2,061×           |

Performance Comparison (11 datasets combined)

| Method         | API Calls | PRAUC | Precision Granularity (lower is finer) |
| -------------- | --------- | ----- | -------------------------------------- |
| Prompt-Naive   | 1         | 0.72  | 0.081                                  |
| Sample-Prob    | 20        | 0.78  | 0.014                                  |
| Ours-Sup-2call | 2         | 0.79  | 0.016                                  |

Key Findings

  • Sup-2call surpasses 20-sample ensembling with only 2 API calls (PRAUC 0.79 vs. 0.78).
  • Noise injection is necessary—MLP calibration alone without noise fails to resolve the cardinality problem.
  • Results are consistent across 11 diverse datasets spanning sentiment classification to fact verification.

Highlights & Insights

  • Engineering-driven problem discovery: The paper identifies and quantifies the rounding bias in LLM probability outputs (16–23 unique values), which is a valuable observation in its own right.
  • Noise as regularization: Adding noise does not degrade signal quality; rather, it smooths an over-discretized distribution into a continuous space, improving the flexibility of downstream decision-making.
  • Exceptional cost efficiency: 2 API calls outperform 20-sample ensembling, reducing API costs by 90%.

Limitations & Future Work

  • The supervised variant requires labeled data for MLP training, making it unsuitable for cold-start scenarios.
  • Validation is limited to Claude and a subset of open-source models; rounding bias may vary across different LLMs.
  • The noise amplitude and MLP architecture are fixed; adaptive schemes may yield further improvements.

Comparison with Prior Approaches

  • vs. Standard Sampling: Multi-sample ensembling (20 calls) increases cardinality but incurs linearly growing cost; this work achieves comparable results with 1–2 calls.
  • vs. Probability Calibration: Methods such as Platt scaling improve calibration but do not address the cardinality problem; this work resolves both simultaneously.
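The calibration point can be made concrete: any strictly monotone recalibration map, such as the sigmoid a Platt scaler fits, is injective, so it relabels the handful of unique values without creating new ones. A quick sanity check (fixed illustrative coefficients, not a fitted Platt model):

```python
import numpy as np

# Scores concentrated on the typical "rounded" values the paper reports.
scores = np.random.default_rng(0).choice([0.0, 0.5, 0.85, 0.9, 0.95], size=1000)
platt = 1.0 / (1.0 + np.exp(-(2.0 * scores - 1.0)))  # monotone sigmoid remap
print(len(np.unique(scores)), "->", len(np.unique(platt)))  # 5 -> 5: cardinality unchanged
```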

Rating

  • Novelty: ⭐⭐⭐⭐ Novel problem identification (rounding bias) with an intuitive and effective solution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 11 datasets, multiple baselines, and ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Clear problem motivation with rich figures and tables.
  • Value: ⭐⭐⭐⭐⭐ Direct practical value for LLM deployment.