Is In-Context Learning Learning?¶
Conference: ICLR 2026 arXiv: 2509.10414 Code: Not open-sourced Area: LLM Reasoning Keywords: in-context learning, ICL, memorisation, distributional shift, generalization, autoregressive models
TL;DR¶
This paper systematically investigates whether ICL constitutes genuine "learning" through large-scale controlled experiments. It demonstrates that ICL satisfies the formal mathematical definition of learning, yet empirical evidence reveals its generalization capacity to be limited — models primarily exploit structural regularities within the prompt via deduction rather than acquiring new capabilities from the provided demonstrations.
Background & Motivation¶
Background: In-context learning (ICL) enables autoregressive language models to solve downstream tasks via next-token prediction without parameter updates, requiring only a small number of exemplars in the prompt. This capability has sparked extensive debate on whether LLMs can "learn" unseen tasks from a handful of demonstrations.
Limitations of Prior Work:
- Existing research conflates "deduction" with "learning" — the two are not equivalent
- ICL does not explicitly encode the provided observations into model parameters; its predictions instead depend jointly on the model's pretrained prior knowledge and the exemplars conditioned on in the prompt
- Prior studies lack systematic control over confounding factors such as memorisation, pretraining data leakage, and distributional shift
- It remains unclear whether strong ICL performance stems from genuinely learning from demonstrations or from prior knowledge retrieval combined with pattern matching
Key Challenge: Although ICL formally resembles learning (inferring task rules from in-prompt exemplars), it is uncertain whether the autoregressive encoding mechanism provides sufficient inductive bias to support robust generalization and genuine knowledge acquisition — a question with direct implications for how LLMs should be understood and deployed.
Goal: The paper addresses the question "Is ICL learning?" at both theoretical and empirical levels. It first formally proves that ICL satisfies the mathematical definition of learning, then conducts large-scale controlled ablation studies to characterize the practical boundaries of ICL's learning capacity, systematically eliminating confounds including memorisation, pretraining leakage, distributional shift, and prompt style.
Method¶
Overall Architecture¶
The paper adopts a dual-track strategy combining theoretical analysis and large-scale empirical investigation. The theoretical component argues that ICL meets the formal criteria for learning (analogous to the PAC learning framework); the empirical component systematically ablates or controls multiple confounding factors to reveal the true boundaries of ICL's learning capacity. Experiments span diverse model architectures, prompt styles, task types, and data distribution settings, constituting one of the largest controlled studies of ICL behavior to date.
Key Design 1: Disentangling Memorisation from Pretraining Effects¶
The central goal is to separate the contribution of pretraining memorisation from genuine learning attributable to the prompt exemplars. Specific approaches include:
- Benchmark Contamination Detection: Multiple methods are applied to assess whether pretraining data already encodes test task information, quantifying the contribution of memorisation to ICL performance
- Zero-shot vs. Few-shot Delta Analysis: Zero-shot performance reflects pure prior knowledge; only the increment over zero-shot can potentially be attributed to "learning from demonstrations"
- Counterfactual Label Experiments: Random or inverted labels are used to test whether the model genuinely exploits the input–output mapping within the demonstrations
Formally, let \(f\) denote the target function, \(\hat{f}_{\text{ICL}}\) the ICL prediction function conditioned on the demonstrations, and \(\hat{f}_{0}\) the zero-shot prediction function. The learning gain of ICL is defined as:

\[
\Delta_{\text{learn}} = \mathcal{L}\big(f, \hat{f}_{0}\big) - \mathcal{L}\big(f, \hat{f}_{\text{ICL}}\big)
\]

where \(\mathcal{L}\) is the task loss. After adequately controlling for memorisation, \(\Delta_{\text{learn}}\) is found to decrease substantially, indicating that a large portion of ICL performance derives from prior knowledge rather than from learning from the exemplars.
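A minimal sketch of this zero-shot vs. few-shot delta analysis is given below, assuming a generic classification task. The prompt template and the `query_model` callable (which maps a prompt string to a predicted label) are illustrative stand-ins, not the paper's unreleased code.

```python
def task_loss(predictions, targets):
    """Average 0-1 loss over the evaluation set (one possible choice of L)."""
    return sum(p != t for p, t in zip(predictions, targets)) / len(targets)

def make_prompt(demos, query):
    """Render (input, label) demonstrations followed by the query input."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)

def learning_gain(query_model, demos, eval_set):
    """Delta_learn: loss reduction of ICL relative to the zero-shot baseline,
    i.e. the part of performance attributable to the demonstrations rather
    than to prior knowledge."""
    targets = [y for _, y in eval_set]
    zero_shot_preds = [query_model(make_prompt([], x)) for x, _ in eval_set]
    icl_preds = [query_model(make_prompt(demos, x)) for x, _ in eval_set]
    return task_loss(zero_shot_preds, targets) - task_loss(icl_preds, targets)
```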
Key Design 2: Distributional Shift and Exemplar Scaling Behavior Analysis¶
The core experimental methodology involves systematically varying the following factors and observing the resulting scaling behavior of ICL accuracy:
- Number of exemplars \(k\): from \(k=0\) (zero-shot) to many-shot, tracking accuracy as a function of \(k\)
- Exemplar distribution: varying class balance, sample difficulty, selection strategy, and presentation order
- Prompt style: standard few-shot, chain-of-thought (CoT), and diverse template formats and phrasings
- Model selection: autoregressive models of multiple scales and architectures
A key finding is that as \(k\) increases, accuracy converges to a limit that is largely independent of the specific configuration. Formally:

\[
\lim_{k \to \infty} \operatorname{Acc}\big(k;\ \mathcal{D}, \mathcal{M}, \mathcal{S}\big) \approx \operatorname{Acc}_{\infty},
\]

where \(\mathcal{D}\) denotes the exemplar distribution, \(\mathcal{M}\) the model, \(\mathcal{S}\) the prompt style, and \(\operatorname{Acc}_{\infty}\) is approximately constant across these factors. This finding contrasts with the expectation under genuine learning, whereby performance should continue to improve with more or better data.
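The sweep can be sketched as follows, under stated assumptions: the callables `evaluate_accuracy(model, demos, style)` and `sample_demos(dist, k)` stand in for the paper's evaluation harness, which is not open-sourced.

```python
import itertools

def sweep_exemplar_count(evaluate_accuracy, sample_demos,
                         models, demo_distributions, prompt_styles,
                         ks=(0, 1, 2, 4, 8, 16, 32, 64)):
    """Track accuracy as a function of k for every (model, distribution, style)
    configuration and measure the spread of the large-k limits."""
    curves = {}
    for model, dist, style in itertools.product(models, demo_distributions, prompt_styles):
        curves[(model, dist, style)] = [
            evaluate_accuracy(model, sample_demos(dist, k), style) for k in ks
        ]
    terminal = [accs[-1] for accs in curves.values()]  # accuracy at the largest k
    spread = max(terminal) - min(terminal)             # small spread => configuration-independent limit
    return curves, spread
```

A small terminal spread across all configurations is the signature of the saturation behaviour described above, whereas genuine learning from demonstrations would predict curves that keep separating as \(k\) grows.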
Key Design 3: Distributional Sensitivity of Chain-of-Thought¶
The paper specifically analyzes ICL behavior under CoT prompting. Although CoT substantially improves accuracy on certain tasks, it exhibits greater sensitivity to the distributional properties and format of the prompt. This suggests that CoT improvements do not arise from deeper task learning, but rather from exploiting the structural regularity of reasoning chains to perform pattern deduction more efficiently. Key observations include:
- Standard few-shot performance is relatively stable but subject to a low accuracy ceiling
- CoT performance is more variable and highly dependent on the format and structure of the reasoning chain
- On tasks that are formally similar yet semantically distinct, CoT accuracy diverges substantially
- ICL essentially extracts patterns from the statistical regularity of the prompt rather than encoding new knowledge
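The format-sensitivity comparison underlying these observations can be sketched as below: the same demonstrations are rendered under several chain-of-thought templates and the variance of the resulting accuracies is contrasted with a single standard few-shot template. The template strings and the `evaluate_accuracy(template, demos, eval_set)` callable are assumptions of this sketch, not the paper's prompts.

```python
import statistics

COT_TEMPLATES = [
    "Q: {question}\nLet's think step by step. {rationale}\nA: {answer}",
    "Question: {question}\nReasoning: {rationale}\nAnswer: {answer}",
    "{question}\nBecause {rationale}, the answer is {answer}.",
]
FEWSHOT_TEMPLATE = "Q: {question}\nA: {answer}"

def cot_format_sensitivity(evaluate_accuracy, demos, eval_set):
    """Return the few-shot accuracy, per-template CoT accuracies, and the
    standard deviation of the CoT accuracies (a proxy for format sensitivity)."""
    fewshot_acc = evaluate_accuracy(FEWSHOT_TEMPLATE, demos, eval_set)
    cot_accs = [evaluate_accuracy(t, demos, eval_set) for t in COT_TEMPLATES]
    return fewshot_acc, cot_accs, statistics.pstdev(cot_accs)
```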
Key Experimental Results¶
Main Results: Systematic Evaluation of ICL Learning Capacity¶
| Controlled Variable | Core Finding | Key Evidence |
|---|---|---|
| Memorisation | Pretraining memorisation substantially contributes to ICL performance | Performance drops markedly on contamination-free benchmarks |
| Number of exemplars | Accuracy saturates rapidly | Gains become negligible for \(k > 16\); accuracy follows a logarithmic saturation curve |
| Exemplar distribution | Distribution-insensitive at the limit | Convergence values are similar across different class/difficulty distributions |
| Model selection | Inter-model differences diminish at the limit | Comparison across multiple architectures and scales |
| Prompt style | Standard few-shot is insensitive; CoT is sensitive | Format variation experiments show larger CoT fluctuations |
| Input linguistic features | Surface features have limited impact on asymptotic performance | Paraphrasing and format changes produce minimal accuracy differences |
Ablation Study: Analysis of ICL Information Utilization Mechanisms¶
| Experimental Condition | Accuracy | Core Implication |
|---|---|---|
| Correct labels | Baseline (highest) | Standard ICL performance |
| Random labels | Marginal drop | Model does not fully rely on the input–output mapping in demonstrations |
| Counterfactual labels | Moderate drop | Model partially uses label information but it is not the primary cue |
| No labels (input format only) | Close to the correct-label baseline | Structural features of the prompt matter more than label content |
| Shuffled exemplar order | Negligible change | Model is insensitive to exemplar presentation order |
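A hedged sketch of how these ablation conditions could be constructed from a single demonstration set is shown below; the condition names mirror the table, but the helper itself is illustrative rather than the paper's code.

```python
import random

def make_label_ablations(demos, label_space, seed=0):
    """demos: list of (input, correct_label) pairs; label_space: all valid labels."""
    rng = random.Random(seed)
    return {
        "correct_labels": list(demos),
        "random_labels": [(x, rng.choice(label_space)) for x, _ in demos],
        "counterfactual_labels": [
            (x, rng.choice([l for l in label_space if l != y])) for x, y in demos
        ],
        "no_labels": [(x, "") for x, _ in demos],
        "shuffled_order": rng.sample(list(demos), k=len(demos)),
    }
```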
Rating¶
Rating: ⭐⭐⭐⭐
Strengths:
- Raises an important question about the nature of ICL, with both theoretical depth and empirical breadth
- Rigorous experimental design: systematically eliminates memorisation, distributional shift, prompt style, and other confounds
- The core finding — that accuracy saturates with more exemplars and is insensitive to hyperparameter choices — carries significant theoretical and practical implications
- The distributional sensitivity analysis of CoT offers a novel perspective for understanding CoT mechanisms
Weaknesses:
- The paper arrives primarily at negative conclusions (limited ICL learning capacity) without offering constructive improvements or solutions
- Some experiments may be subject to selection bias in the benchmark task set; generalizability to more complex reasoning and generation tasks requires further validation
- "Learning" admits multiple formulations (PAC learning, Bayesian learning, generalization-theoretic learning), and the robustness of the conclusions across these definitions is not fully discussed
- A systematic scaling law analysis of how ICL behavior varies across model sizes is absent
Key Distinctions from Related Work:
- Unlike work that explains ICL through Bayesian inference (Xie et al., 2021), this paper systematically challenges the hypothesis that ICL constitutes a robust learning mechanism from an empirical learning-theoretic perspective
- Unlike single-factor analyses, this paper simultaneously controls for memorisation, distribution, model, and prompt style, yielding more comprehensive and reliable negative conclusions
- Unlike purely theoretical analyses, this paper combines mathematical argumentation with large-scale experiments, substantially strengthening the persuasiveness of its conclusions