QuEst: Enhancing Estimates of Quantile-Based Distributional Measures Using Model Predictions¶

Conference: ICML 2025
arXiv: 2507.05220
Code: btleyre/quest
Area: LLM/NLP
Keywords: Quantile estimation, prediction-powered inference, confidence intervals, distributional measures, LLM auto-evaluation

TL;DR¶

QuEst combines a small amount of high-quality observed data with a large amount of model-predicted (imputed) data to produce more accurate point estimates and rigorous confidence intervals for quantile-based distributional measures (QBDMs), a family that includes classical metrics such as CVaR and Interval-VaR.

Background & Motivation¶

The existing Prediction-Powered Inference (PPI) framework can combine a small number of gold-standard labels with a large volume of model predictions to perform hybrid inference for M-estimators such as means and regression coefficients. However, the applicability of PPI is limited to M-estimators, making it unable to handle many crucial quantile-based distributional measures (QBDMs), such as:

CVaR (Conditional Value at Risk): Characterizes extreme tail performance, which is crucial in financial risk control and LLM safety evaluation.
Interval-VaR: Measures the mean over a specific segment of the population/data (such as the lowest 20% of income).
VaR (Value at Risk / single quantile): A single quantile can already be handled by PPI, but standard PPI provides no optimal choice of \(\lambda\) for it.

In fields such as economics, sociology, genomics, and LLM safety evaluation, decision-makers often need to understand the tail behavior of distributions or specific population segments rather than merely focusing on the mean. In scenarios where labeled data is expensive, estimating these metrics using only a small amount of observed data yields high variance, while pure model predictions suffer from systematic bias. QuEst is proposed precisely to fill this gap.

Method¶

Overall Architecture¶

The core idea of QuEst is to define a hybrid estimator for QBDM, leveraging large amounts of predicted data to reduce variance, while using a small amount of observed data for bias correction.

Unified definition of QBDM: Given a CDF \(F\), QBDM is defined as

\[Q_\psi(F) = \int_0^1 \psi(p) F^{-1}(p) dp\]

where \(\psi\) is a weight function satisfying \(\psi \geq 0\) and \(\int_0^1 \psi(p) dp = 1\). Multiple classical metrics can be recovered by choosing different \(\psi\):

Metric	Weight Function \(\psi(p)\)	Application Scenario
Mean (expectation)	\(\psi(p) = 1\)	General
\(\beta\)-VaR	\(\psi(p) = \delta_\beta(p)\)	Financial risk, quantile
\(\beta\)-CVaR	\(\psi(p) = \frac{1}{1-\beta}\) for \(p \geq \beta\)	Tail risk assessment
Interval-VaR	\(\psi(p) = \frac{1}{\beta_2 - \beta_1}\) for \(p \in [\beta_1, \beta_2]\)	Population segment analysis
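
As a rough illustration of how the weight functions in the table act on data, the sketch below evaluates plug-in QBDMs on a sample by replacing \(F^{-1}\) with the empirical quantile function, a step function over the sorted values. The function names (`interval_weights`, `empirical_qbdm`, `empirical_var`) are hypothetical and not taken from the QuEst codebase. The mean, CVaR, and Interval-VaR are all treated as uniform weights on an interval \([\beta_1, \beta_2]\), and the VaR uses one common empirical-quantile convention, which is an assumption.

```python
import math


def interval_weights(n, beta1, beta2):
    """Weight on the i-th order statistic for psi = 1/(beta2 - beta1) on [beta1, beta2].

    Each weight is the integral of psi over ((i-1)/n, i/n], i.e. the overlap of
    that cell with [beta1, beta2] divided by the interval width.
    """
    width = beta2 - beta1
    weights = []
    for i in range(1, n + 1):
        lo, hi = (i - 1) / n, i / n
        overlap = max(0.0, min(hi, beta2) - max(lo, beta1))
        weights.append(overlap / width)
    return weights


def empirical_qbdm(values, beta1=0.0, beta2=1.0):
    """Plug-in QBDM: weighted sum of order statistics.

    (0, 1) gives the mean, (beta, 1) gives beta-CVaR,
    (beta1, beta2) gives Interval-VaR.
    """
    xs = sorted(values)
    weights = interval_weights(len(xs), beta1, beta2)
    return sum(w * x for w, x in zip(weights, xs))


def empirical_var(values, beta):
    """Plug-in beta-VaR: the ceil(n * beta)-th order statistic (assumed convention)."""
    xs = sorted(values)
    k = max(1, math.ceil(beta * len(xs)))
    return xs[k - 1]
```

For example, `empirical_qbdm(scores, 0.8, 1.0)` averages the largest 20% of `scores`, while `empirical_qbdm(scores, 0.0, 0.2)` averages the smallest 20%.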

Data Setup:

A small labeled dataset (observed data): \(\{(X_i, Y_i, \tilde{Y}_i)\}_{i=1}^n\), with a small \(n\)
A large unlabeled dataset (predicted data): \(\{(X_j^u, \tilde{Y}_j^u)\}_{j=1}^N\), with \(N \gg n\)
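\(\tilde{Y}\) is generated by a prediction model \(g\), and \(M(X,Y)\) is a user-defined metric function; \(F\) and \(\tilde{F}\) denote the distributions of the metric computed on observed labels, \(M(X, Y)\), and on predicted labels, \(M(X, \tilde{Y})\), respectively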

Key Designs¶

1. QuEst Hybrid Estimator

\[\hat{Q}_\psi(\lambda) = \lambda \cdot Q_\psi(\tilde{F}_N^u) + \left( Q_\psi(F_n) - \lambda \cdot Q_\psi(\tilde{F}_n) \right)\]

First term \(\lambda Q_\psi(\tilde{F}_N^u)\): Leverages the empirical quantile function of the large prediction dataset to provide a low-variance estimate.
Second term \(Q_\psi(F_n) - \lambda Q_\psi(\tilde{F}_n)\): Uses the observed data to estimate and correct for prediction bias.
\(\lambda = 0\) degenerates to the classical estimator using only observed data.
\(\lambda = 1\) relies fully on predicted data with the bias corrected by observed data.
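
The estimator above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `quest_estimate` and `empirical_cvar` are hypothetical names, the metric values are assumed to be precomputed lists of numbers, and any plug-in QBDM can be passed in as the `qbdm` callable.

```python
def empirical_cvar(values, beta):
    """Plug-in beta-CVaR: psi = 1/(1 - beta) on [beta, 1] applied to order statistics."""
    xs = sorted(values)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        # Overlap of the cell ((i-1)/n, i/n] with the tail interval [beta, 1].
        overlap = max(0.0, i / n - max((i - 1) / n, beta))
        total += overlap / (1.0 - beta) * x
    return total


def quest_estimate(lam, m_obs, m_pred_labeled, m_pred_unlabeled, qbdm):
    """Hybrid estimate: lam * Q(pred, unlabeled) + (Q(obs) - lam * Q(pred, labeled)).

    m_obs            -- metric values from gold labels on the n labeled points
    m_pred_labeled   -- metric values from model predictions on the same n points
    m_pred_unlabeled -- metric values from model predictions on the N unlabeled points
    qbdm             -- callable mapping a list of values to a plug-in QBDM
    """
    correction = qbdm(m_obs) - lam * qbdm(m_pred_labeled)
    return lam * qbdm(m_pred_unlabeled) + correction
```

With `lam = 0` the call returns `qbdm(m_obs)`, matching the classical estimator. If the predictions on the labeled and unlabeled sets have the same plug-in QBDM, the two prediction terms cancel for every `lam`.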

2. Asymptotic Normality (Core Theorem)

Utilizing classical L-statistic theory, the paper expands \(Q_\psi(F_n)\) as a weighted sum of order statistics:

\[Q_\psi(F_n) = \sum_{i=1}^n \left[ \int_{(i-1)/n}^{i/n} \psi(p)dp \right] M_{(i)}\]

Based on this, Theorem 3.1 is derived: For any fixed \(\lambda\), under regularity conditions,

\[\sqrt{n} \left( \hat{Q}_\psi(\lambda) - Q_\psi(F) \right) \xrightarrow{D} \mathcal{N}(0, \rho_\psi^2(\lambda, F, \tilde{F}))\]

where the asymptotic variance is

\[\rho_\psi^2(\lambda, F, \tilde{F}) = \lambda^2(1+r)\sigma_\psi^2(\tilde{F}) + \sigma_\psi^2(F) - 2\lambda \eta_\psi(F, \tilde{F})\]

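Here \(r = n/N\), \(\sigma_\psi^2(\cdot)\) is the asymptotic variance of the plug-in QBDM estimator under the given distribution, and \(\eta_\psi(F, \tilde{F})\) is the asymptotic covariance between the QBDM estimators computed from the observed and the predicted values on the labeled sample.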

3. Closed-Form Solution for Optimal \(\lambda\)

Minimizing \(\rho_\psi^2\) with respect to \(\lambda\) yields:

\[\hat{\lambda} = \frac{\eta_\psi(F_n, \tilde{F}_n)}{(1 + n/N) \sigma_\psi^2(\tilde{F}_N^u)}\]

Key property: The final asymptotic variance is guaranteed not to exceed the variance of the classical estimator, i.e.,

\[\rho_\psi^2(\hat{\lambda}) = \sigma_\psi^2(F_n) - \frac{(\eta_\psi(F_n, \tilde{F}_n))^2}{(1+n/N)\sigma_\psi^2(\tilde{F}_N^u)} \leq \sigma_\psi^2(F_n)\]
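
The closed form and the variance guarantee can be checked numerically. The sketch below assumes the three variance components (\(\sigma_\psi^2(F)\), \(\sigma_\psi^2(\tilde{F})\), \(\eta_\psi\)) have already been estimated; how they are estimated is not covered here. The function names and the numbers in the usage note are purely illustrative.

```python
def asymptotic_variance(lam, sigma2_obs, sigma2_pred, eta, n, N):
    """rho^2(lam) = lam^2 (1 + r) sigma2_pred + sigma2_obs - 2 lam eta, with r = n / N."""
    r = n / N
    return lam ** 2 * (1.0 + r) * sigma2_pred + sigma2_obs - 2.0 * lam * eta


def optimal_lambda(sigma2_pred, eta, n, N):
    """Minimizer of rho^2(lam): eta / ((1 + n/N) * sigma2_pred)."""
    return eta / ((1.0 + n / N) * sigma2_pred)


def clip_unit_interval(lam):
    """Optional clipping of lam to [0, 1], as in the ablation on small samples."""
    return min(1.0, max(0.0, lam))
```

For instance, with `sigma2_obs = 4.0`, `sigma2_pred = 4.0`, `eta = 3.0`, `n = 100`, and `N = 10000`, the variance at `optimal_lambda(...)` falls below both the `lam = 0` value (the classical estimator) and the `lam = 1` value. When `eta` is close to zero, meaning the predictions carry little information, the optimal value is close to zero as well.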

4. Multidimensional Extension (Multidimensional QuEst)

When \(k\) QBDMs are estimated simultaneously, a multidimensional CLT yields a joint confidence region:

\[\hat{V}^{-1/2}\sqrt{n}\left(\hat{\mathbf{Q}}(\psi_{1:k}, \hat{\lambda}_{1:k}) - \mathbf{Q}(\psi_{1:k}, F)\right) \xrightarrow{D} \mathcal{N}(\mathbf{0}, I)\]

Compared to individual estimations followed by Bonferroni correction, this avoids excessively conservative confidence regions.
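
To make the comparison concrete for \(k = 2\), the sketch below computes the area of the ellipse implied by the CLT above and the area of a Bonferroni box built from two marginal intervals. It assumes the joint region is the standard Wald-type ellipsoid \(\{q : n(\hat{\mathbf{Q}} - q)^T \hat{V}^{-1} (\hat{\mathbf{Q}} - q) \le \chi^2_{k,1-\alpha}\}\), which is one natural reading of the CLT and may differ from the paper's exact construction. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def joint_region_area_2d(v11, v12, v22, n, alpha=0.05):
    """Area of the ellipse {q : n (q_hat - q)^T V^{-1} (q_hat - q) <= c} for k = 2.

    For 2 degrees of freedom the chi-square quantile at level 1 - alpha
    has the closed form c = -2 ln(alpha).
    """
    c = -2.0 * math.log(alpha)
    det_v = v11 * v22 - v12 * v12
    return math.pi * c * math.sqrt(det_v) / n


def bonferroni_box_area_2d(v11, v22, n, alpha=0.05):
    """Area of the box from two marginal intervals, each at level 1 - alpha / 2."""
    z = NormalDist().inv_cdf(1.0 - alpha / 4.0)  # alpha / (2k) with k = 2
    side1 = 2.0 * z * math.sqrt(v11 / n)
    side2 = 2.0 * z * math.sqrt(v22 / n)
    return side1 * side2
```

Under these assumptions the ellipse is smaller than the box even when the two estimators are uncorrelated, and it shrinks further as the off-diagonal entry `v12` grows in magnitude.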

5. QuEst-Opt: Generalized Weight Function Optimization

QuEst-Opt further allows the predicted data to use an adaptive weight function \(\tilde{\psi}\) that differs from the weight function \(\psi\) of the target QBDM:

\[\hat{Q}(\psi, \tilde{\psi}) = \underbrace{Q_{\tilde{\psi}}(\tilde{F}_N^u) - Q_{\tilde{\psi}}(\tilde{F}_n)}_{\text{adaptive term}} + \underbrace{Q_\psi(F_n)}_{\text{target QBDM}}\]

After parameterizing \(\tilde{\psi}(\cdot) = \xi^T \phi(\cdot)\), the variance is convex with respect to \(\xi\). The optimal \(\xi\) is found by regularized convex optimization, and the resulting estimator still satisfies a CLT. The additional gains are most pronounced on heteroscedastic data.
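
One way to see where the convexity comes from, sketched under assumptions: \(Q_{\tilde{\psi}}\) is linear in \(\tilde{\psi}\), so writing \(\tilde{\psi} = \xi^T \phi\) makes the adaptive term linear in \(\xi\). The symbols \(\Sigma\) (asymptotic covariance matrix of the basis QBDMs \(Q_{\phi_j}\) on predicted values) and \(c\) (the vector of their asymptotic covariances with \(Q_\psi(F_n)\)) are notation introduced for this sketch and are not taken from the paper. Repeating the variance calculation of the scalar case with \(r = n/N\) gives

\[\rho^2(\xi) = (1+r)\, \xi^T \Sigma\, \xi - 2\, \xi^T c + \sigma_\psi^2(F)\]

which is a convex quadratic because \(\Sigma\) is positive semidefinite. Adding the ridge penalty \(\frac{\alpha}{2}\|\xi\|^2\) and setting the gradient to zero:

\[2(1+r)\Sigma \xi - 2c + \alpha \xi = 0 \quad \Longrightarrow \quad \xi^\star = \left( (1+r)\Sigma + \tfrac{\alpha}{2} I \right)^{-1} c\]

As a consistency check, with a single basis function \(\phi = \psi\) and \(\alpha = 0\), this reduces to \(\xi^\star = \eta_\psi / ((1+r)\sigma_\psi^2(\tilde{F}))\), the scalar \(\hat{\lambda}\) given earlier.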

Loss & Training¶

QuEst is a purely inferential framework and does not involve model training. Its core "optimization" lies in:

\(\lambda\) Selection: Closed-form solution that minimizes asymptotic variance, requiring no hyperparameters.
\(\xi\) Optimization (QuEst-Opt): A convex optimization problem with L2 regularization: \(\min_\xi \rho_\psi^2(\xi) + \frac{\alpha}{2}\|\xi\|^2\)
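Confidence Interval Construction: \(\hat{Q}_\psi(\hat{\lambda}) \pm z_{1-\alpha/2} \cdot \hat{SE}\), where \(\hat{SE} = \rho_\psi(\hat{\lambda}) / \sqrt{n}\). Here \(\alpha\) denotes the significance level of the interval, not the regularization weight used in QuEst-Opt.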
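
The interval construction amounts to a standard normal-approximation interval. A minimal sketch, assuming the point estimate and the estimated asymptotic variance \(\rho_\psi^2(\hat{\lambda})\) are already available; the function name is hypothetical and `alpha` is the significance level:

```python
import math
from statistics import NormalDist


def quest_confidence_interval(estimate, rho2, n, alpha=0.05):
    """Return (lower, upper) = estimate -/+ z_{1 - alpha/2} * sqrt(rho2 / n)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = math.sqrt(rho2 / n)
    return estimate - z * se, estimate + z * se
```

Because the standard error scales as \(1/\sqrt{n}\) in the number of labeled points, any reduction in \(\rho_\psi^2\) from using predictions translates directly into a narrower interval at the same \(n\).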

Key Experimental Results¶

Main Results¶

Experiments span 3 types of scenarios and 6+ datasets, covering economics, genomics, public opinion, and LLM evaluation.

Dataset	Metric	QuEst Performance (n=100)	Observation Only	Gain
PovertyMap	Interval-VaR	Significantly lower error	Baseline	~30-40% error reduction
GeneExpression	CVaR (top 20%)	~50% reduction	Baseline	~50% reduction in error and interval width
OpinionQA	Interval-VaR (Middle 50%)	Clearly superior	Baseline	Significantly outperforms pure prediction and pure observation
Red-teaming Toxicity Evaluation	CVaR (Worst 25%)	Consistently lower error	Baseline	Higher correlation in model ranking
News Summarization (XSum)	Multidimensional Interval-VaR	MSE reduced by 7%	Baseline	Confidence region volume reduced by 34% (univariate)

Ablation Study¶

Configuration	Key Metric	Description
QuEst vs PPI (VaR)	Lower estimation error	Adaptive \(\lambda\) selection outperforms PPI with fixed \(\lambda\)
QuEst-Opt vs QuEst	Further error reduction on heteroscedastic data	Generalized weights yield additional variance reduction
Univariate vs Multidimensional Confidence Regions	Volume: 1.04e-2 vs 5.17e-3	Multidimensional QuEst yields the smallest confidence region
Informational Correlation Impact	Correlation \(\uparrow\) \(\rightarrow\) QuEst gains \(\uparrow\)	Consistent with intuition
\(\lambda\) clipping [0,1]	More stable for small samples	Does not affect asymptotic properties

Key Findings¶

Largest gains in low-data scenarios: With only 100 gold-standard samples, QuEst achieves approximately a 50% reduction in error and interval width on GeneExpression.
Guaranteed to perform no worse than classical methods: The asymptotic variance at the optimal \(\lambda\) is never greater than the variance obtained using only observed data.
Adaptability: QuEst automatically adjusts \(\lambda\) based on prediction quality. For poor models, \(\lambda \to 0\), degenerating to the classical estimator.
Multidimensional inference avoids Bonferroni penalty: The joint confidence region is tighter than that obtained from individual estimations with Bonferroni correction.
Model ranking in red-teaming evaluation: The toxicity rankings of 8 candidate models generated by QuEst achieve the highest correlation with the true rankings.

Highlights & Insights¶

Unified and elegant theoretical framework: Incorporates the hybrid inference problem of QBDMs into a unified framework using L-statistic theory, offering a closed-form solution for \(\lambda\) while guaranteeing performance no worse than classical methods.
Zero hyperparameters: \(\lambda\) is solved automatically; users only need to select the weight function \(\psi\) for the target QBDM.
Wide applicability: Has diverse application scenarios, ranging from economic wealth analysis to LLM safety red-teaming evaluation.
Scalability of QuEst-Opt: Achieves further variance reduction by parameterizing the adaptive weight function, with the variance objective maintaining convexity.
Practical value for LLM evaluation: Obtains reliable tail-risk estimates by combining gold-standard labels on a few samples from an expensive LLM with large-scale labels from inexpensive small models, which is particularly valuable for LLM safety evaluations.

Limitations & Future Work¶

Requires predicted and observed data to come from the same input distribution: Covariate shift scenarios are not covered.
Remains reliant on QuEst-Opt when heteroscedasticity is severe: The base QuEst only utilizes a scalar \(\lambda\), limiting its adaptive capability.
Lacks theoretical characterization of "how large \(N\) needs to be": The lower bound of predicted data volume is not explicitly provided.
Confidence intervals rely on asymptotic theory: The finite-sample coverage might be suboptimal for small sample sizes (\(n < 50\)).
Selection of the regularization parameter \(\alpha\): Although \(\alpha\) can theoretically be arbitrarily small in QuEst-Opt, its practical impact has not been thoroughly explored.

Related Work & Insights¶

PPI / PPI++ (Angelopoulos et al., 2023, 2024): QuEst directly extends PPI to the QBDM family.
Cross-PPI (Fisch et al., 2024): Another hybrid inference method.
LLM Auto-Evaluation (Boyeau et al., 2024; Eyre & Madras, 2024): A direct application area for QuEst, where a few expensive labels are combined with many cheap automatic ones.
Insights: The core ideas of this framework can be generalized to other non-M-estimator statistics, such as the Gini coefficient, Lorenz curve, etc.

Rating¶

Dimension	Score (1-5)	Description
Novelty	4	First to extend PPI to the complete QBDM family with solid theory
Practicality	4.5	Open-source code, zero hyperparameters, directly applicable to LLM eval scenarios
Theoretical Rigor	5	Complete CLT derivation, closed-form solution for optimal \(\lambda\), convex optimization guarantees
Experimental Thoroughness	4	Comprehensive coverage of multiple scenarios, but lacks analysis on extreme small-sample regimes
Writing Quality	4	Clear structure and consistent notation, but formula-heavy
Overall Score	4.3	Solid theoretical work that fills an important gap in PPI