Unlabeled Data Can Provably Enhance In-Context Learning of Transformers

Conference: NeurIPS 2025 | arXiv: 2601.10058 | Code: None | Area: LLM Reasoning / Theory | Keywords: in-context learning, unlabeled data, semi-supervised learning, EM algorithm, chain-of-thought, transformer theory

TL;DR

This paper proposes an augmented ICL framework in which the prompt contains both a small set of labeled examples and a large collection of unlabeled examples. It theoretically proves that a multi-layer Transformer, via chain-of-thought (CoT) reasoning, can simulate the EM algorithm to extract information from unlabeled data, improving the classification excess risk from \(\mathcal{O}(1/\sqrt{N})\) to \(\mathcal{O}(1/\sqrt{N + \text{poly}(M)})\).

Background & Motivation

Background: ICL enables Transformers to learn new tasks from in-context examples without parameter updates, but it relies heavily on the quantity and quality of labeled demonstrations. Acquiring high-quality labeled data is costly (e.g., RLHF data for GPT-3.5/4 involves thousands of hours of expert annotation).

Limitations of Prior Work: (a) Prompt length limits the number of labeled examples; (b) existing methods use the LLM itself to generate pseudo-labels, which inherit model bias; (c) abundant unlabeled data remains unexploited by ICL.

Key Challenge: A large amount of unlabeled data exists, yet ICL has no mechanism to leverage it — conventional ICL processes only \((x, y)\) pairs, leaving unlabeled \(x\) unused.

Goal: To theoretically prove that unlabeled data can improve ICL performance, and to provide an explicit Transformer construction along with training convergence guarantees.

Key Insight: The paper connects augmented ICL (labeled + unlabeled in the prompt) to the classical EM algorithm in semi-supervised learning — a Transformer can iteratively refine class-mean estimates through multi-step CoT reasoning.

Core Idea: A 4-layer Transformer with CoT reasoning implicitly implements the EM algorithm, learning simultaneously from labeled and unlabeled data within the prompt.

Method

Overall Architecture

The input is a mixed prompt \(\mathcal{I} = \mathcal{D}_{label} \cup \mathcal{D}_{unlabel}\), where \(\mathcal{D}_{label} = \{(\mathbf{x}_j, y_j)\}_{j=1}^N\) and \(\mathcal{D}_{unlabel} = \{\mathbf{x}_j\}_{j=N+1}^{N+M}\). This is encoded into a matrix \(\mathbf{H}\) containing labeled, unlabeled, and reasoning blocks. The Transformer iteratively refines class-mean estimates \(\hat{\mu}_i^{(t)}\) in the reasoning block via CoT, and ultimately classifies unlabeled samples using nearest-neighbor assignment.
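
The paper's exact token layout is not reproduced here; the following is a minimal NumPy sketch of how such a mixed prompt could be assembled, assuming a column-wise encoding with a one-hot label block and a zero-initialized reasoning block (these layout choices are illustrative assumptions, not the paper's construction).

```python
import numpy as np

def build_prompt_encoding(X_lab, Y_lab, X_unlab, C, T):
    """Assemble a toy version of the mixed prompt matrix H.

    X_lab:   (N, d) labeled inputs; Y_lab: (N,) integer labels in {0, ..., C-1}
    X_unlab: (M, d) unlabeled inputs
    T:       number of CoT steps; the reasoning block reserves C columns per step.
    The column layout and zero padding are illustrative assumptions.
    """
    N, d = X_lab.shape
    M = X_unlab.shape[0]
    # Labeled block: feature rows stacked over one-hot label rows
    labeled = np.concatenate([X_lab.T, np.eye(C)[Y_lab].T], axis=0)     # (d + C, N)
    # Unlabeled block: same feature rows, zeroed label rows
    unlabeled = np.concatenate([X_unlab.T, np.zeros((C, M))], axis=0)   # (d + C, M)
    # Reasoning block: T * C columns that will hold the CoT class-mean estimates
    reasoning = np.zeros((d + C, T * C))
    return np.concatenate([labeled, unlabeled, reasoning], axis=1)      # (d + C, N + M + T*C)
```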

Key Designs

  1. CoT Encoding for Augmented ICL:

    • Function: Encodes labeled and unlabeled data together with intermediate reasoning states into a unified token sequence.
    • Mechanism: The reasoning block \(\mathbf{Q}^{(t)}\) stores the class-mean estimate \(\hat{\mu}_i^{(t)}\) at step \(t\). Each CoT step appends \(C\) newly generated tokens to the sequence.
    • Design Motivation: To exploit the autoregressive nature of CoT for EM iteration — each CoT step corresponds to one EM update.
  2. 4-Layer Transformer Construction (Theorem 4.1):

    • Function: Explicitly constructs a 4-layer Transformer that implements the EM update.
    • Mechanism: The update rule is \(\hat{\mu}_i^{(t+1)} = \hat{\mu}_i^{(t)} - \frac{\eta^{(t)}}{M} \sum_{j} p_{ij}^{(t)}(\hat{\mu}_i^{(t)} - \mathbf{x}_j) + \mathbf{1}_{\{t=0\}} \frac{C}{N} \sum_j (\mathbf{e}_i^\top \mathbf{y}_j) \mathbf{x}_j\). The first layer uses softmax attention to compute the E-step (class membership probabilities \(p_{ij}^{(t)}\)); subsequent layers implement the M-step (weighted mean update). Initialization uses class means from the labeled data.
    • Design Motivation: To precisely simulate the EM algorithm for a Gaussian mixture model — the E-step computes posterior probabilities and the M-step updates parameters (a code sketch of this update follows the list below).
  3. Convergence Analysis (Theorem 4.2):

    • Function: Proves that class-mean estimates converge to their true values as the number of CoT steps increases.
    • Mechanism: Under the conditions of signal-to-noise ratio \(\text{SNR} \geq \Omega(\sqrt{C \log(CM)})\), sufficiently large \(N\), and sufficiently large \(M\), the excess risk is \(\mathcal{O}(1/\sqrt{N + \text{poly}(M)})\), which strictly improves upon the labeled-only lower bound of \(\mathcal{O}(1/\sqrt{N})\).
    • Design Motivation: To rigorously establish the theoretical value of unlabeled data — beyond empirical observation alone.
  4. Training Convergence (Theorem 5.1):

    • Function: Proves that Transformer parameters trained with teacher forcing converge linearly to the target solution.
    • Mechanism: The gradient of the CoT training loss is decomposed into two analytically tractable terms; isotropy of the involved quantities simplifies the analysis.
    • Design Motivation: To show that the theoretical construction is not merely an existence result but can be recovered via standard training.
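
To make the E-step/M-step correspondence concrete, here is a minimal NumPy sketch of the update rule from Theorem 4.1, run as a plain loop rather than through actual attention layers. The Gaussian-posterior form of \(p_{ij}^{(t)}\) (isotropic components, equal priors) and the constant step size are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def em_style_update(X_lab, Y_onehot, X_unlab, T=10, eta=1.0):
    """Run the class-mean recursion that the 4-layer Transformer is constructed
    to implement (Theorem 4.1), as a plain loop.

    X_lab:    (N, d) labeled inputs; Y_onehot: (N, C) one-hot labels
    X_unlab:  (M, d) unlabeled inputs
    Returns mu: (C, d) final class-mean estimates.
    """
    N, d = X_lab.shape
    M = X_unlab.shape[0]
    C = Y_onehot.shape[1]
    mu = np.zeros((C, d))                       # reasoning block starts at zero

    for t in range(T):
        # E-step: class-membership probabilities p_ij for each unlabeled point,
        # using an isotropic-Gaussian posterior with equal priors (assumed form)
        logits = mu @ X_unlab.T - 0.5 * np.sum(mu**2, axis=1, keepdims=True)  # (C, M)
        p = softmax(logits, axis=0)                                           # p[i, j] = p_ij

        # M-step: move each mean toward the p-weighted average of unlabeled points
        resid = mu[:, None, :] - X_unlab[None, :, :]        # (C, M, d), mu_i - x_j
        mu = mu - (eta / M) * np.einsum('im,imd->id', p, resid)

        if t == 0:
            # Indicator term: initialization from the labeled class sums, (C/N) * sum_j (e_i^T y_j) x_j
            mu = mu + (C / N) * Y_onehot.T @ X_lab
    return mu
```

Classification of an unlabeled point then assigns the class whose estimated mean is nearest, matching the nearest-neighbor assignment described above.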

Loss & Training

Training uses teacher forcing: at each CoT step \(t\), the supervision signal is the ground-truth next-step output of the EM algorithm. Gradient descent on the population loss converges to the target parameters at a linear rate.
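
A minimal sketch of the teacher-forcing objective under these assumptions: at each CoT step the model's predicted next reasoning state is regressed onto the EM target computed from the ground-truth trajectory. The squared-error form and the `model_step` / `em_step` interfaces are hypothetical, for illustration only.

```python
import numpy as np

def teacher_forcing_loss(model_step, em_step, mu0, X_unlab, T):
    """Sum of per-step losses under teacher forcing.

    model_step(mu, X_unlab): Transformer's predicted next mean estimates (hypothetical interface)
    em_step(mu, X_unlab):    ground-truth EM update used as the supervision signal
    mu0:                     initial mean estimates from the labeled block
    Because the ground-truth trajectory is fed back at every step, errors do not compound.
    """
    loss = 0.0
    mu = mu0
    for t in range(T):
        target = em_step(mu, X_unlab)           # ground-truth next-step output of EM
        pred = model_step(mu, X_unlab)          # Transformer's prediction for the same step
        loss += np.mean((pred - target) ** 2)   # squared-error surrogate (assumed form)
        mu = target                             # teacher forcing: continue from the true trajectory
    return loss
```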

Key Experimental Results

Main Results

In a multi-class linear classification setting with \(d = 20\) and \(C \in \{3, 5\}\) classes:

| Method | Mean Estimation Error | Classification Accuracy |
| --- | --- | --- |
| Conventional ICL (\(N\) labeled only) | Baseline | Baseline |
| Augmented ICL (\(N\) labeled + \(M\) unlabeled) | Significantly reduced | Significantly improved |
| Bayes-optimal classifier (labeled only) | Worse than Augmented ICL | Lower than Augmented ICL |

Ablation Study

| Configuration | Effect | Remark |
| --- | --- | --- |
| Increasing \(M\) (more unlabeled data) | Consistent performance improvement | Validates \(\mathcal{O}(1/\sqrt{N+\text{poly}(M)})\) |
| Increasing \(T\) (more CoT steps) | Performance improves then saturates | Consistent with EM convergence behavior |
| Fixed \(N\), increasing \(M\) only | Performance improves beyond the \(N\)-only level | Independent contribution of unlabeled data |

Key Findings

  • Augmented ICL significantly outperforms conventional ICL and even surpasses the Bayes-optimal classifier trained on labeled data alone, demonstrating the genuine value of unlabeled data.
  • Performance gains grow consistently with \(M\), in agreement with theoretical predictions.
  • \(T = 5\)–\(10\) CoT steps are generally sufficient for convergence.

Highlights & Insights

  • Theoretical bridge between ICL and semi-supervised learning: This work is the first to theoretically establish the equivalence between Transformer CoT reasoning and the EM algorithm, offering a new perspective on the inference mechanism of ICL.
  • Precise quantification of excess risk improvement: The result \(\mathcal{O}(1/\sqrt{N}) \to \mathcal{O}(1/\sqrt{N + \text{poly}(M)})\) is a clean theoretical characterization that clearly captures the marginal contribution of unlabeled data.
  • Complete theory covering existence and learnability: Beyond constructing an ideal Transformer (existence), the paper also proves training convergence (learnability), forming a complete theoretical chain.

Limitations & Future Work

  • The theory is restricted to multi-class linear classification with Gaussian mixture models — whether it generalizes to nonlinear classification or regression tasks remains open.
  • Isotropic covariance \(\Sigma\) is assumed — theoretical analysis under anisotropic settings is more challenging.
  • The constructed 4-layer Transformer is idealized — it is unclear whether practically pretrained LLMs implicitly learn analogous EM strategies.
  • Experiments are conducted solely on synthetic data — the effectiveness of augmented ICL on real NLP/CV tasks is not validated.
  • Unlabeled data must be task-relevant — performance may degrade if the unlabeled data distribution does not match the task.

Comparison with Related Work

  • vs. Unsupervised ICL (Gupta et al. 2024): Purely unsupervised ICL relies on unlabeled data only; this paper adopts a semi-supervised framework with labeled + unlabeled data and provides stronger theoretical guarantees.
  • vs. Pseudo-label methods (Wan et al. 2023): Pseudo-labels inherit model bias; augmented ICL performs reasoning directly within the prompt without generating pseudo-labels.
  • vs. Li et al. 2025 (concurrent): That work uses linear Transformers, binary classification, and asymptotic analysis; this paper uses softmax attention, multi-class classification, and non-asymptotic convergence bounds.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First work to theoretically answer whether unlabeled data can benefit ICL, establishing the ICL–EM equivalence.
  • Experimental Thoroughness: ⭐⭐⭐ Experiments are limited to synthetic data with no validation on real tasks — standard practice for theoretical work, but still a limitation.
  • Writing Quality: ⭐⭐⭐⭐ Meets the conventions of theoretical paper writing with a clear notation system, though the technical density is high.
  • Value: ⭐⭐⭐⭐ Makes an important contribution to understanding ICL mechanisms, though the gap from linear classification to practical LLMs remains substantial.