
Enabling Fine-Grained Operating Points for Black-Box LLMs

Conference: ICLR 2026 · arXiv: 2510.17727 · Code: not released (code snippets available in the appendix) · Area: LLM Evaluation · Keywords: black-box LLM, operating points, probability calibration, PR curve, confidence estimation

TL;DR

This paper identifies that verbalized probabilities from black-box LLMs produce only 16–23 unique values (low-cardinality problem), resulting in coarse PR/ROC curves that prevent fine-grained threshold tuning. By injecting parameterized noise and an optional MLP correction, the number of unique values increases from 16 to 20,000+, matching the performance of 20-sample ensembles with only 1–2 API calls.

Background & Motivation

Background: When deploying LLMs as classifiers, practitioners must select an appropriate operating point on the precision-recall curve. A common approach is to prompt the LLM to output a verbalized probability in \([0,1]\) as a confidence score.

Limitations of Prior Work: LLMs exhibit a severe "rounding bias"—their output probabilities concentrate on a small set of values such as 0, 0.5, 0.85, 0.9, and 0.95 (with a tendency to end in 0 or 5), yielding PR/ROC curves with only a handful of discrete points and preventing fine-grained threshold control.

Key Challenge: The coarse spacing on the PR curve forces practitioners to choose between high-precision/low-recall and low-precision/high-recall operating points, with no fine-grained trade-off available at deployment time. The sketch below illustrates the effect.
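Concretely, a PR curve can have at most one point per unique score value, so a score that takes only ~20 distinct values yields a PR curve with ~20 points. A minimal illustration on synthetic data (variable names and numbers are ours, not the paper's):

```python
# Illustrative only: rounded scores collapse the PR curve to a few points.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=5000)                            # binary labels
latent = np.clip(0.3 * y_true + rng.uniform(0, 0.7, 5000), 0, 1)  # noisy continuous score

# A continuous score admits up to one threshold per sample ...
_, _, t_cont = precision_recall_curve(y_true, latent)
# ... but a score snapped to multiples of 0.05 (the "rounding bias") admits at most 21.
verbalized = np.round(latent * 20) / 20
_, _, t_vrb = precision_recall_curve(y_true, verbalized)

print(len(t_cont), "thresholds vs.", len(t_vrb))  # roughly 5000 vs. <= 21
```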

Goal: How can black-box LLM probability outputs be made continuous without substantially increasing the number of API calls?

Key Insight: Inject parameterized noise into the discrete probabilities, effectively "diffusing" the discrete distribution into a continuous one.

Core Idea: Breaking rounding bias through noise injection—expanding the number of unique probability values from 16 to 20,000+ while preserving ranking performance.

Method

Overall Architecture

Three variants are proposed: (1) Unsup, unsupervised uniform noise injection; (2) Sup-1call, a supervised MLP with noise and a single API call; (3) Sup-2call, a supervised MLP with noise and two API calls (at temperatures \(T=0\) and \(T=1\)).

Key Designs

  1. Unsupervised Noise (Ours-Unsup):

     • Function: Adds uniform noise to verbalized probabilities, maximizing the noise magnitude while maintaining performance.
     • Mechanism: \(\max w \ \text{s.t.}\ \sum_i \text{loss}(y_i, \text{clip}(z_i \cdot w + y_{\text{vrb},i})) \leq \sum_i \text{loss}(y_i, y_{\text{vrb},i}),\ z_i \sim U(0,1)\). This finds the largest noise amplitude under the constraint that performance does not degrade.
     • Design Motivation: Requires no labeled data. Increases cardinality from 16 to 5,614.

  2. Supervised Noise + MLP (Ours-Sup):

     • Function: Learns a correction function \(f\) that maps discrete probabilities to better-calibrated outputs while simultaneously injecting noise.
     • Mechanism: \(\min_{\theta_f, w} \sum_i \text{loss}(y_i, \text{sigmoid}(z_i / w + f(y_{\text{vrb},i}; \theta_f))) + \lambda w,\ z_i \sim \mathcal{N}(0,1)\). Here \(f\) is a 2-layer ReLU MLP that jointly learns the calibration correction and the noise amplitude.
     • Design Motivation: The MLP correction addresses probability miscalibration (systematic over- or under-confidence), while the noise injection resolves the cardinality problem (discretization). A sketch of both objectives follows this list.
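A minimal reconstruction of the two objectives from the formulas above (not the authors' released code; the choice of log-loss, the grid search, and all hyperparameters are our assumptions):

```python
# Sketch only: reconstructed from the paper's objectives, with assumed details.
import numpy as np
import torch
import torch.nn as nn

def log_loss(y, p, eps=1e-7):
    """Summed binary log-loss; stands in for the paper's generic `loss`."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

def unsup_noise_amplitude(y, y_vrb, grid=np.linspace(0, 0.5, 501), seed=0):
    """Ours-Unsup: largest w such that uniform noise does not increase the loss."""
    z = np.random.default_rng(seed).uniform(0, 1, size=len(y))
    base = log_loss(y, y_vrb)
    best_w = 0.0
    for w in grid:  # plain grid search in place of a proper 1-D solver
        if log_loss(y, np.clip(z * w + y_vrb, 0, 1)) <= base:
            best_w = w
    return best_w

class SupCorrector(nn.Module):
    """Ours-Sup: 2-layer ReLU MLP correction f plus a learned noise scale w."""
    def __init__(self, hidden=16):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.w = nn.Parameter(torch.tensor(5.0))  # noise enters as z / w

    def forward(self, y_vrb):
        z = torch.randn_like(y_vrb)  # z ~ N(0, 1), redrawn on every call
        return torch.sigmoid(z / self.w + self.f(y_vrb.unsqueeze(-1)).squeeze(-1))

def train_sup(y, y_vrb, lam=1e-3, steps=2000, lr=1e-2):
    """Minimize loss + lambda * w; penalizing w pushes the noise z / w up.

    y, y_vrb: float tensors of shape (N,).
    """
    model = SupCorrector()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss(reduction="sum")
    for _ in range(steps):
        opt.zero_grad()
        (bce(model(y_vrb), y) + lam * model.w).backward()
        opt.step()
    return model
```

Note that fresh noise \(z\) is drawn for every query at inference time, which is what spreads identical verbalized scores into distinct values. The sketch shows the single-call input; Sup-2call would presumably feed both calls' probabilities into \(f\), a detail omitted here.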

Key Experimental Results

Cardinality Improvement

| Method             | API Calls | Unique Values | Cardinality Gain |
| ------------------ | --------- | ------------- | ---------------- |
| Prompt-Naive       | 1         | 10            | 1× (baseline)    |
| Sample-Class (20×) | 20        | 97            | 10×              |
| Ours-Unsup         | 1         | 5,614         | 561×             |
| Ours-Sup-2call     | 2         | 20,607        | 2,061×           |

Performance Comparison (11 datasets combined)

| Method         | API Calls | PRAUC | Precision Granularity (lower is finer) |
| -------------- | --------- | ----- | -------------------------------------- |
| Prompt-Naive   | 1         | 0.72  | 0.081                                  |
| Sample-Prob    | 20        | 0.78  | 0.014                                  |
| Ours-Sup-2call | 2         | 0.79  | 0.016                                  |

Key Findings

  • Sup-2call surpasses 20-sample ensembling with only 2 API calls (PRAUC 0.79 vs. 0.78).
  • Noise injection is necessary—MLP calibration alone without noise fails to resolve the cardinality problem.
  • Results are consistent across 11 diverse datasets spanning sentiment classification to fact verification.

Highlights & Insights

  • Engineering-driven problem discovery: The paper identifies and quantifies the rounding bias in LLM probability outputs (16–23 unique values), which is a valuable observation in its own right.
  • Noise as regularization: Adding noise does not degrade signal quality; rather, it smooths an over-discretized distribution into a continuous space, improving the flexibility of downstream decision-making.
  • Exceptional cost efficiency: 2 API calls outperform 20-sample ensembling, reducing API costs by 90%.

Limitations & Future Work

  • The supervised variant requires labeled data for MLP training, making it unsuitable for cold-start scenarios.
  • Validation is limited to Claude and a subset of open-source models; rounding bias may vary across different LLMs.
  • The noise amplitude and MLP architecture are fixed; adaptive schemes may yield further improvements.

Comparison with Prior Approaches

  • vs. Standard Sampling: Multi-sample ensembling (20 calls) increases cardinality but incurs linearly growing cost; this work achieves comparable results with 1–2 calls.
  • vs. Probability Calibration: Methods such as Platt scaling improve calibration but do not address the cardinality problem; this work resolves both simultaneously.
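The calibration point can be made concrete: any strictly monotone recalibration map, such as the sigmoid a Platt scaler fits, is injective, so it relabels the handful of unique values without creating new ones. A quick sanity check (fixed illustrative coefficients, not a fitted Platt model):

```python
import numpy as np

# Scores concentrated on the typical "rounded" values the paper reports.
scores = np.random.default_rng(0).choice([0.0, 0.5, 0.85, 0.9, 0.95], size=1000)
platt = 1.0 / (1.0 + np.exp(-(2.0 * scores - 1.0)))  # monotone sigmoid remap
print(len(np.unique(scores)), "->", len(np.unique(platt)))  # 5 -> 5: cardinality unchanged
```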

Rating

  • Novelty: ⭐⭐⭐⭐ Novel problem identification (rounding bias) with an intuitive and effective solution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 11 datasets, multiple baselines, and ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Clear problem motivation with rich figures and tables.
  • Value: ⭐⭐⭐⭐⭐ Direct practical value for LLM deployment.