UCS: Estimating Unseen Coverage for Improved In-Context Learning

Conference: ACL 2026
arXiv: 2604.12015
Code: https://github.com/Raina-Xin/UCS
Area: In-Context Learning
Keywords: In-Context Learning, Exemplar Selection, Coverage Estimation, Good-Turing Estimation, Clustering

TL;DR

This paper proposes UCS (Unseen Coverage Selection), a training-free, subset-level coverage prior based on the Smoothed Good-Turing estimator. UCS estimates how many latent clusters remain unobserved in the candidate exemplar set and uses that estimate to regularize existing ICL selection methods, improving accuracy by 2-6% on intent classification and reasoning tasks.

Background & Motivation

Background: In-Context Learning (ICL) performance is highly dependent on which exemplars are selected for the prompt. Existing methods select exemplars based on similarity (e.g., semantic proximity to the query), diversity (e.g., DPP), or information-theoretic criteria (e.g., MDL).

Limitations of Prior Work: Existing methods operate at the instance level—evaluating the relevance of individual exemplars or pairwise diversity—but lack a subset-level coverage perspective. A high-quality exemplar set should cover the various underlying latent clusters of the task, yet no current method quantifies how many latent clusters remain uncovered in a selected set.

Key Challenge: The distribution of latent clusters in ICL exemplar pools is heavy-tailed—a few clusters contain many samples, while a large number of clusters contain only a few. Methods based on similarity or diversity tend to pick from frequent clusters, resulting in the systematic neglect of rare patterns.

Goal: To propose a subset-level coverage prior that acts as a lightweight plug-in to enhance existing ICL selection methods, encouraging the selection of exemplar sets that cover more latent clusters.

Key Insight: Drawing from the classic method for estimating the "number of unseen species" in ecology—the Smoothed Good-Turing (SGT) estimator—the "uncovered latent clusters" in ICL exemplar selection are analogous to "unobserved species."

Core Idea: Latent clusters are defined using model-consistent embeddings. The SGT estimator then calculates how many clusters remain uncovered based on the frequency spectrum of the selected subset. This estimate is used as a regularization term for existing selection objectives.

Method

Overall Architecture

UCS consists of three steps: (1) representing all candidate exemplars using the LLM's own embeddings (model-consistent representation); (2) discretizing continuous embeddings into cluster IDs via dictionary learning and DBSCAN (discretization); (3) estimating the total number of clusters from the frequency spectrum of the selected subset using the Smoothed Good-Turing estimator (coverage estimation) and combining it with existing selection objectives.
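
Steps (1) and (2) can be made concrete with a minimal sketch of two small pieces: masked mean pooling, and the relabeling of noise points as singleton clusters. It assumes the clustering step returns integer labels with -1 marking noise (the convention scikit-learn's DBSCAN uses). The function names are hypothetical and are not taken from the UCS repository; dictionary learning and the clustering call itself are omitted.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def noise_to_singletons(labels, noise_label=-1):
    """Give every noise point its own fresh cluster id (assumed handling)."""
    next_id = max((l for l in labels if l != noise_label), default=-1) + 1
    relabeled = []
    for l in labels:
        if l == noise_label:
            relabeled.append(next_id)  # each noise point becomes a new cluster
            next_id += 1
        else:
            relabeled.append(l)
    return relabeled


# Three token vectors, the last one is padding.
pooled = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
# Two real clusters (0 and 1) plus two noise points.
cluster_ids = noise_to_singletons([0, -1, 1, 0, -1])
```

Treating noise points as singletons keeps rare exemplars visible in the frequency spectrum instead of discarding them, which matters because singleton counts drive the Good-Turing estimate.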

Key Designs

  1. Model-consistent Embedding and Cluster Discretization:

    • Function: Converts continuous LLM embeddings into discrete latent cluster labels.
    • Mechanism: The same LLM used at inference extracts hidden states for candidate exemplars (input only, excluding labels), followed by masked mean pooling to obtain fixed-length vectors. Dictionary learning (ridge coding) produces encoding vectors over \(K\) atoms, and DBSCAN (cosine distance) clusters these in the normalized coding space. Noise points are assigned as independent singleton clusters.
    • Design Motivation: Using argmax atom assignment over-concentrates on high-frequency atoms and ignores multi-atom structural combinations. Dictionary learning combined with clustering captures recurring latent patterns while preserving long-tail fine-grained units.
  2. Smoothed Good-Turing Coverage Estimation:

    • Function: Estimates the total (including unobserved) number of clusters from the frequency spectrum of the selected subset.
    • Mechanism: A frequency spectrum \(f_s(S)\) (the number of clusters appearing exactly \(s\) times) is built for the selected subset \(S\). The SGT estimator \(\hat{U}_t^{SGT}(S) = -\sum_{s=1}^{M} (-t)^s w_s(t,\alpha) f_s(S)\), truncated at order \(M\) with smoothing weights \(w_s(t,\alpha)\), predicts how many new clusters would be observed if the subset were extended by an expansion factor \(t\), that is, if \(m = t\,|S|\) additional samples were taken. The UCS coverage function is \(\Phi_{UCS}(S) = K_{seen}(S) + \hat{U}_t^{SGT}(S)\), accounting for both observed and predicted unobserved clusters.
    • Design Motivation: Good-Turing is a classic statistical tool for estimating "unobserved species." The key insight is that the counts of singletons (appearing once) and doubletons (appearing twice) in the frequency spectrum carry significant information about unobserved categories.
  3. UCS Regularized Selection:

    • Function: Seamlessly integrates the coverage prior into existing ICL selection methods.
    • Mechanism: The selection objective is \(S^* = \arg\max_{|S|=B} (U_{base}(S; x_{test}) + \lambda \Phi_{UCS}(S))\), where \(U_{base}\) is the original utility (DPP/MDL/VoteK) and \(\lambda\) controls regularization strength. Inverse frequency weighting is used for VoteK, marginal coverage gain for DPP, and direct addition for MDL. \(\lambda=0\) reverts to the base method.
    • Design Motivation: UCS is a subset-level function and cannot be decomposed into instance-level scores. Using it as a regularizer rather than as a standalone selector preserves the strengths of the base method.
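
The coverage function from design 2 can be illustrated with a small sketch. The form of the weights \(w_s(t,\alpha)\) is not spelled out here, so the code assumes binomial-tail weights \(w_s = P(\mathrm{Bin}(M, q) \ge s)\) with a hypothetical parameter `q` (default \(1/(1+t)\)), in the spirit of smoothed Good-Toulmin estimators. All function names are illustrative and are not taken from the UCS repository.

```python
from collections import Counter
from math import comb


def frequency_spectrum(cluster_ids):
    """Map s -> f_s, the number of clusters seen exactly s times."""
    counts = Counter(cluster_ids)      # cluster id -> multiplicity
    return Counter(counts.values())    # multiplicity s -> f_s


def binomial_tail(s, M, q):
    """P(Bin(M, q) >= s), used as the smoothing weight w_s (an assumption)."""
    return sum(comb(M, j) * q**j * (1 - q) ** (M - j) for j in range(s, M + 1))


def sgt_unseen(spectrum, t, M, q):
    """-sum_{s=1}^{M} (-t)^s * w_s * f_s."""
    return -sum(
        (-t) ** s * binomial_tail(s, M, q) * spectrum.get(s, 0)
        for s in range(1, M + 1)
    )


def phi_ucs(cluster_ids, t=1.0, M=20, q=None):
    """K_seen(S) plus the estimated number of unseen clusters."""
    q = 1.0 / (1.0 + t) if q is None else q
    k_seen = len(set(cluster_ids))
    return k_seen + sgt_unseen(frequency_spectrum(cluster_ids), t, M, q)


subset = ["a", "a", "b", "c", "d"]       # f_1 = 3 singletons, f_2 = 1 doubleton
spectrum = frequency_spectrum(subset)
unsmoothed = sgt_unseen(spectrum, t=1.0, M=2, q=1.0)   # all weights equal 1
smoothed = sgt_unseen(spectrum, t=1.0, M=2, q=0.5)     # weights 0.75 and 0.25
```

With `q=1.0` every weight equals 1, and for a spectrum containing only singletons and doubletons the estimate at `t=1` reduces to \(f_1 - f_2\). Subsets dominated by singletons therefore receive a larger unseen-cluster estimate than subsets that keep revisiting the same clusters.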

Loss & Training

UCS is entirely training-free. Offline preprocessing (embedding + clustering) takes 38-57 seconds per dataset, while online inference adds approximately 0-3 seconds. All hyperparameters have clear default values (dictionary size \(K\), SGT truncation order \(M=20\), expansion factor \(t\), etc.).
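
Because nothing is trained, the regularized objective \(U_{base}(S; x_{test}) + \lambda \Phi_{UCS}(S)\) only has to be optimized at selection time. The optimizer is not stated here, so the sketch below assumes greedy forward selection, a common way to approximate a budgeted argmax. `base_gain` and `coverage` are hypothetical callables, and the toy coverage (a count of distinct clusters) is a stand-in for \(\Phi_{UCS}\), not the estimator itself.

```python
def greedy_select(candidates, base_gain, coverage, budget, lam):
    """Greedily add the candidate with the best base gain + lam * coverage gain."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        current = coverage(selected)

        def score(i):
            # Marginal coverage gain of adding candidate i to the current subset.
            return base_gain(selected, i) + lam * (coverage(selected + [i]) - current)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy pool: three near-duplicates in cluster "a", plus one exemplar each in "b" and "c".
cluster_of = {0: "a", 1: "a", 2: "a", 3: "b", 4: "c"}
relevance = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.4, 4: 0.3}


def toy_base_gain(selected, i):
    return relevance[i]                      # instance-level utility only


def toy_coverage(selected):
    return len({cluster_of[i] for i in selected})   # stand-in for Phi_UCS


base_only = greedy_select(cluster_of, toy_base_gain, toy_coverage, budget=3, lam=0.0)
with_prior = greedy_select(cluster_of, toy_base_gain, toy_coverage, budget=3, lam=1.0)
```

With `lam=0.0` the loop falls back to the base utility and picks the three redundant exemplars from cluster "a"; with `lam=1.0` it picks one exemplar from each cluster.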

Key Experimental Results

Main Results

| Method | Banking77 (Qwen) | CLINC150 (Qwen) | HWU64 (Qwen) |
| --- | --- | --- | --- |
| VoteK | 0.518 | 0.703 | 0.609 |
| UCS+VoteK | 0.543 (+2.5%) | 0.744 (+4.1%) | 0.671 (+6.2%) |
| DPP | 0.831 | 0.755 | 0.791 |
| UCS+DPP | 0.831 | 0.775 (+2.0%) | 0.794 |
| MDL | 0.764 | 0.748 | 0.785 |
| UCS+MDL | 0.771 | 0.752 | 0.801 (+1.6%) |

Ablation Study

| Configuration | Key Metrics | Note |
| --- | --- | --- |
| UCS+VoteK | Unique Clusters 10.0, Cluster Size 1.0 | Completely eliminates redundancy |
| VoteK (Base) | Unique Clusters 9.67, Cluster Size 8.50 | Significant redundancy present |
| Cross-model Joint Dictionary | Decrease | Forced alignment of different embedding spaces loses information |

Key Findings

  • Query-independent methods benefit most: VoteK + UCS showed gains of 6.2% (Qwen) and 4.1% (Llama) on HWU64, as VoteK is naturally prone to selecting redundant exemplars.
  • Effectiveness on reasoning tasks: On BBEH reasoning tasks, UCS+DPP improved by 12.5 pp on Shuffled Objects, and UCS+MDL improved by 8.4 pp on Causal Understanding.
  • Heavy-tailed cluster distributions: Across all dataset-model combinations, cluster size distributions were extremely skewed—many singletons and few dominant clusters—validating the necessity of the coverage prior.
  • Model-consistent embeddings outperform joint embeddings: Joint dictionary learning harms the fine-grained discriminative power of high-capability models.
  • Minimal computational overhead: Offline preprocessing 38-57s, online overhead 0-3s.

Highlights & Insights

  • Elegant connection between Statistics and NLP: Applying "unobserved species" estimation from ecology to "uncovered latent clusters" in ICL is natural and methodologically rigorous.
  • Plug-and-play design: UCS acts as a regularization term that can be superimposed on any existing selection method without modifying the underlying retrieval flow.
  • Interpretable cluster analysis: The clusters generated by UCS are semantically interpretable (e.g., micro-topics like authentication or ATM withdrawal in Banking77), providing insights into task structure.

Limitations & Future Work

  • UCS provides limited gains for query-dependent methods that are already strong; for example, DPP is near saturation on some datasets.
  • Clustering quality depends on DBSCAN hyperparameter selection (eps requires adaptive heuristics).
  • The statistical reliability of the SGT estimator is limited under small selection budgets (\(B=10\)).
  • Evaluation was limited to a fixed budget of \(B=10\); performance under different budgets remains unknown.

Comparison with Base Methods

  • vs. DPP: DPP encourages diversity through determinant maximization but does not explicitly quantify coverage. UCS provides a complementary subset-level coverage signal.
  • vs. VoteK: VoteK selects a global exemplar set by voting, without diversity guarantees. UCS substantially reduces its redundancy via inverse frequency weighting.
  • vs. MDL: MDL selects informative exemplars using minimum description length; UCS provides an orthogonal optimization signal from a coverage perspective.

Rating

  • Novelty: ⭐⭐⭐⭐ The application of Good-Turing to ICL is novel; the subset-level coverage perspective is valuable.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three models across classification and reasoning tasks, though fixed budgets limit the depth of analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear and rigorous methodology with tight coupling between theory and experiments.
  • Value: ⭐⭐⭐⭐ A practical plug-and-play tool directly applicable to ICL deployment.