Exploring Explanations Improves the Robustness of In-Context Learning¶

Conference: ACL 2025
arXiv: 2506.02378
Code: https://github.com/CyberAgentAILab/x2-icl
Area: LLM/NLP
Keywords: In-Context Learning, OOD Robustness, Explanation Exploration, Latent Variable Models, Natural Language Understanding

TL;DR¶

This paper proposes X²-ICL, a framework that systematically explores the latent reasoning space by generating explanations for every possible label of each in-context exemplar, rather than only for the observed label. This improves the robustness of ICL on out-of-distribution (OOD) data: across 8 OOD datasets and 5 LLMs, X²-ICL outperforms both ICL and X-ICL on 6 to 8 of the datasets, depending on the model.

Background & Motivation¶

OOD Robustness Issues in ICL: Although In-Context Learning (ICL) is highly efficient, its accuracy degrades significantly on out-of-distribution (OOD) data, for example when there is an adversarial shift between the test distribution and the exemplar distribution.

Limitations of X-ICL: Existing explanation-based ICL (X-ICL) guides reasoning by generating explanations for the correct labels of exemplars. However, it only explores a single reasoning path corresponding to the "correct label", heavily constraining the latent variable space.

Latent Variable Modeling Perspective: From the perspective of statistical latent variable models, the reasoning underlying a label (explanation) is a latent variable. X-ICL focuses solely on the latent variables of the observed label, ignoring the reasoning possibilities of counterfactual (unrealized) labels.

Unreliability of Reasoning in OOD Scenarios: In OOD data, reasoning patterns learned from exemplars are not always reliable. Models need to analyze inputs from multiple perspectives to make accurate predictions.

Scalable Explanation Generation: One advantage of X-ICL is that explanations are generated by LLMs rather than manually annotated. This scalability makes it feasible to explore a richer reasoning space.

Theoretical Motivation of the Bayes Optimal Classifier: From classification theory, optimal decision-making requires comparing the posterior probabilities of all possible labels, which demands considering the reasoning paths associated with all labels.

Method¶

Overall Architecture¶

The core idea of X²-ICL: For each exemplar \((x, y)\), it generates explanations not only for the correct label \(y\), but also for all other possible labels \(\ell \in Y\). During inference, the LLM first generates explanations for all candidate labels of the test input, and then selects the label supported by the most valid reasoning as the prediction.

Key Designs¶

Key Design 1: Full-Label Explanation Generation (Preprocessing Phase)¶

For each exemplar \((x_i, y_i)\), a meta-prompt \(S_m\) is used to generate an explanation \(r_{i,\ell}\) for each possible label \(\ell = 1,...,L\):

\[r_{i,\ell} \sim \tilde{p}(r_\ell | y_i = \ell, x_i)\]

Ultimately, each exemplar is augmented as \((x_i, \mathbf{r}_i, y_i)\), where \(\mathbf{r}_i = (r_{i,1}, ..., r_{i,L})\). The meta-prompt only requires one explanation example per label, resulting in extremely low manual annotation costs.
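The preprocessing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_meta_prompt`, `augment_exemplar`, the prompt wording, and the `generate` callable (a stand-in for any LLM call) are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Tuple

Exemplar = Tuple[str, Dict[str, str], str]  # (x_i, r_i keyed by label, y_i)


def build_meta_prompt(x: str, label: str, seed_explanations: Dict[str, str]) -> str:
    """Build a hypothetical meta-prompt asking for reasoning that supports `label`."""
    return (
        f"Example explanation for label '{label}': {seed_explanations[label]}\n"
        f"Input: {x}\n"
        f"Explain why the label could be '{label}':"
    )


def augment_exemplar(
    x: str,
    y: str,
    labels: List[str],
    seed_explanations: Dict[str, str],
    generate: Callable[[str], str],
) -> Exemplar:
    """Return (x_i, r_i, y_i) with one generated explanation per candidate label."""
    if y not in labels:
        raise ValueError(f"observed label {y!r} is not in the label set")
    explanations = {
        label: generate(build_meta_prompt(x, label, seed_explanations))
        for label in labels
    }
    return x, explanations, y


# Usage with a stub in place of a real LLM call (assumption: any text generator fits).
labels = ["entailment", "neutral", "contradiction"]
seeds = {label: f"(one human-written explanation for {label})" for label in labels}
stub_llm = lambda prompt: f"stub explanation for: {prompt.splitlines()[-1]}"
x_i, r_i, y_i = augment_exemplar(
    "Premise: A man plays guitar. Hypothesis: A man plays an instrument.",
    "entailment",
    labels,
    seeds,
    stub_llm,
)
```

The explanation for the observed label and the explanations for the counterfactual labels are produced by the same call, so the only human input in this sketch is the one seed explanation per label.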

Key Design 2: Systematic Exploration of the Latent Variable Space¶

From the perspective of latent variable models:

- ICL: Directly models \(\hat{p}(y|x)\), with no latent variables.
- X-ICL: Models \(\hat{p}(y|r_y, x)\), only exploring the reasoning path \(r_y\) of the observed label.
- X²-ICL: Models \(\hat{p}(y|\mathbf{r}, x)\), exploring the set of reasoning paths for all labels \(\mathbf{r} = (r_1,...,r_L)\).

X²-ICL preserves the full dimensionality of the latent variable space, avoiding the constraint in X-ICL that binds the latent space to realized values.
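To make the difference between the three variants concrete, the sketch below renders one exemplar in each style. The template strings and the function name `format_demo` are assumptions made for illustration; the paper's actual prompt format may differ.

```python
from typing import Dict


def format_demo(x: str, y: str, explanations: Dict[str, str], mode: str) -> str:
    """Render one exemplar for the prompt under a hypothetical template."""
    if mode == "icl":
        body = []  # no latent reasoning shown
    elif mode == "x-icl":
        body = [f"Reasoning for '{y}': {explanations[y]}"]  # observed label only
    elif mode == "x2-icl":
        body = [  # reasoning for every candidate label
            f"Reasoning for '{label}': {text}" for label, text in explanations.items()
        ]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return "\n".join([f"Input: {x}", *body, f"Label: {y}"])


explanations = {
    "paraphrase": "Both questions ask for the same information.",
    "not paraphrase": "The second question adds a constraint the first lacks.",
}
question_pair = "Q1: How do I learn Python? Q2: How can I learn Python quickly?"
for mode in ("icl", "x-icl", "x2-icl"):
    print(f"--- {mode} ---")
    print(format_demo(question_pair, "paraphrase", explanations, mode))
```

In this sketch the only thing that changes across modes is how much of \(\mathbf{r}\) is exposed in the demonstration: nothing, \(r_y\) alone, or all of \((r_1,...,r_L)\).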

Key Design 3: Inference Phase¶

Given a test input \(x'\):

1. The LLM generates the reasoning paths for all labels: \(\mathbf{r}' = (r'_1,...,r'_L) \sim \hat{p}(\mathbf{r}|x')\)
2. Computes \(\hat{p}(y'|\mathbf{r}', x')\) for each label \(y'\)
3. Selects the label with the highest probability: \(\delta^{\text{X}^2\text{-ICL}}(x') = \arg\max_{y'} \hat{p}(y'|\mathbf{r}', x')\)
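The three inference steps can be sketched as below. The callables `explain` and `score` are hypothetical stand-ins for LLM calls, and splitting generation and scoring into separate calls is an assumption of this sketch; in practice an LLM may produce the reasoning and the final label in a single completion.

```python
from typing import Callable, Dict, List, Tuple


def x2_icl_predict(
    x: str,
    labels: List[str],
    explain: Callable[[str, str], str],
    score: Callable[[str, Dict[str, str], str], float],
) -> Tuple[str, Dict[str, str]]:
    """Predict a label for x after generating reasoning for every candidate label."""
    # Step 1: generate one reasoning path per candidate label, r' = (r'_1, ..., r'_L).
    reasoning = {label: explain(x, label) for label in labels}
    # Step 2: score each label given the full set of reasoning paths and the input.
    scores = {label: score(label, reasoning, x) for label in labels}
    # Step 3: pick the label with the highest score (arg max over y').
    prediction = max(labels, key=lambda label: scores[label])
    return prediction, reasoning


# Usage with toy stubs (assumptions: fixed scores instead of model probabilities).
toy_scores = {"entailment": 0.2, "neutral": 0.1, "contradiction": 0.7}
prediction, reasoning = x2_icl_predict(
    "Premise: The cat sleeps. Hypothesis: The cat is running.",
    list(toy_scores),
    explain=lambda x, label: f"possible reasoning for {label}",
    score=lambda label, r, x: toy_scores[label],
)
```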

Key Design 4: Theoretical Connection with the Bayes Optimal Classifier¶

The paper argues, within a classification-theory framework, that X²-ICL is closer to the Bayes optimal classifier \(\delta^*(x) = \arg\max_y p(y|x)\), because exploring the complete latent variable space yields a more accurate approximation of the conditional distribution \(p(y|x)\).
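One way to read this connection, offered as a sketch in the notation above rather than as the paper's derivation: by the law of total probability, the target conditional distribution marginalizes over the latent reasoning \(\mathbf{r}\),

\[p(y|x) = \sum_{\mathbf{r}} p(y|\mathbf{r}, x)\, p(\mathbf{r}|x)\]

The inference procedure replaces this sum with a single sampled reasoning set \(\mathbf{r}' \sim \hat{p}(\mathbf{r}|x')\), so that \(\arg\max_{y'} \hat{p}(y'|\mathbf{r}', x')\) acts as a plug-in estimate of \(\delta^*(x')\). Treating one sample as representative of the whole sum is an assumption of this sketch.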

Loss & Training¶

Evaluation uses the 0-1 classification loss, with the misclassification probability \(\text{Pr}\{y \neq \delta(x)\}\) as the metric. X²-ICL involves no training or parameter updates; all gains come from the explanations added to the exemplars and the reasoning paths explored during inference.
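As a small illustration of the metric, the empirical counterpart of \(\text{Pr}\{y \neq \delta(x)\}\) is the fraction of misclassified test inputs, and accuracy is one minus that value. The function name `zero_one_risk` and the toy labels are assumptions made for this example.

```python
from typing import Sequence


def zero_one_risk(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Average 0-1 loss: the share of inputs where the prediction differs from the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    errors = sum(truth != pred for truth, pred in zip(y_true, y_pred))
    return errors / len(y_true)


gold = ["entailment", "neutral", "contradiction", "neutral"]
pred = ["entailment", "neutral", "neutral", "neutral"]
risk = zero_one_risk(gold, pred)  # one error out of four examples
accuracy = 1.0 - risk
```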

Key Experimental Results¶

Main Results: OOD Accuracy on GPT-4o (8-shot ICL)¶

Dataset	ICL	X-ICL	X²-ICL	Type
SNLI (ID)	90.95	90.00	90.25	In-Distribution
HANS	88.05	86.35	88.85	OOD
NAN	75.97	78.29	78.78	OOD
PISP	77.90	81.40	83.76	OOD
ST	78.25	81.50	82.35	OOD
ANLI-R1	70.67	75.58	77.40	OOD
ANLI-R2	61.05	63.87	67.61	OOD
ANLI-R3	61.58	65.07	67.70	OOD
QQP (ID)	83.65	82.75	78.85	In-Distribution
PAWS	65.15	63.80	70.85	OOD

Multi-Model Consistency Validation¶

Model	No. of OOD datasets (out of 8) where X²-ICL outperforms both ICL and X-ICL
GPT-4o	8/8
Gemini-1.5-Pro	6/8
Gemini-2.0-Flash	7/8
Phi-4-14B	6/8
DeepSeek-R1-8B	7/8

Comparison with Retrieval-Based ICL (GPT-4o)¶

Method	HANS	PISP	ANLI-R1	ANLI-R2	PAWS
Set-BSR	85.40	79.99	74.42	58.69	72.25
X-ICL	86.35	81.40	75.58	63.87	63.80
X²-ICL	88.85	83.76	77.40	67.61	70.85

Key Findings¶

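X²-ICL outperforms both ICL and X-ICL on all 8 OOD datasets using GPT-4o. On ANLI-R2 the gain is +6.56 over ICL and +3.74 over X-ICL; per the main results table, the largest gain over ICL is on ANLI-R1 (+6.73) and the largest gain over X-ICL is on PAWS (+7.05).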
Stronger LLMs tend to benefit more from X²-ICL, plausibly because evaluating multiple reasoning paths simultaneously requires stronger reasoning capabilities.
On in-distribution (ID) data (SNLI, QQP), X²-ICL and X-ICL perform worse than standard ICL on GPT-4o, exhibiting an ID-OOD performance trade-off: the drop is under 1 point on SNLI, but X²-ICL loses 4.80 points on QQP (78.85 vs. 83.65).
Diversifying the reasoning space appears to contribute more to OOD robustness than diversifying the exemplar space: on GPT-4o, X²-ICL outperforms the retrieval-based ICL method Set-BSR on 4 of the 5 reported datasets, with PAWS as the exception (70.85 vs. 72.25).
Even for small-scale open-source models (DeepSeek-R1-8B), X²-ICL yields improvements on most OOD datasets.
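The per-dataset gains on GPT-4o can be recomputed directly from the main results table. The snippet below is plain arithmetic over those reported numbers; the variable names are introduced here for illustration.

```python
# GPT-4o OOD accuracies (ICL, X-ICL, X²-ICL), copied from the main results table.
ood_accuracy = {
    "HANS": (88.05, 86.35, 88.85),
    "NAN": (75.97, 78.29, 78.78),
    "PISP": (77.90, 81.40, 83.76),
    "ST": (78.25, 81.50, 82.35),
    "ANLI-R1": (70.67, 75.58, 77.40),
    "ANLI-R2": (61.05, 63.87, 67.61),
    "ANLI-R3": (61.58, 65.07, 67.70),
    "PAWS": (65.15, 63.80, 70.85),
}

# Gain of X²-ICL over each baseline, in accuracy points.
gains = {
    name: (round(x2 - icl, 2), round(x2 - x_icl, 2))
    for name, (icl, x_icl, x2) in ood_accuracy.items()
}

# Number of datasets where X²-ICL beats both baselines.
wins = sum(vs_icl > 0 and vs_x_icl > 0 for vs_icl, vs_x_icl in gains.values())
largest_vs_icl = max(gains, key=lambda name: gains[name][0])
largest_vs_x_icl = max(gains, key=lambda name: gains[name][1])

for name, (vs_icl, vs_x_icl) in gains.items():
    print(f"{name:8s} vs ICL: {vs_icl:+.2f}   vs X-ICL: {vs_x_icl:+.2f}")
```

With these values, X²-ICL wins on all 8 datasets; the gain on ANLI-R2 is +6.56 over ICL and +3.74 over X-ICL, the largest gain over ICL is on ANLI-R1 (+6.73), and the largest gain over X-ICL is on PAWS (+7.05).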

Highlights & Insights¶

Elegant Statistical Framework: From the perspective of latent variable modeling, the evolution of ICL \(\to\) X-ICL \(\to\) X²-ICL is clearly formulated as a progressive expansion of the latent variable space from empty \(\to\) partial \(\to\) complete. The theoretical motivation is highly intuitive.
Minimalist Method Design: X²-ICL requires no training or extra models; it only adds explanations for every label to the exemplars during preprocessing and asks the LLM to reason about every label at inference, so engineering complexity is very low.
"Reasoning Diversity > Exemplar Diversity" Insight: Experiments clearly demonstrate that exploring different reasoning paths for the same exemplar improves OOD robustness more effectively than retrieving different exemplars.
Scalability of meta-prompt: Only one human-written explanation exemplar is required per label, resulting in negligible annotation costs and high practical usability.
Large-scale Validation Across 5 Models and 8 OOD Datasets: The experiments have wide coverage, which lends credibility to the conclusions.

Limitations & Future Work¶

ID-OOD Performance Trade-off: X²-ICL may underperform standard ICL on in-distribution data; co-optimizing both remains an open question.
Increased Computational Cost: Generating an explanation for every label increases explanation-token consumption in preprocessing and inference by roughly a factor of \(L\) (the number of labels), which could become a bottleneck for tasks with many classes.
Validation Limited to Classification Tasks: NLI and paraphrase identification are both classification tasks; its applicability to generation, regression, or other tasks has not been validated.
Dependency on Explanation Quality: Since explanations are generated by GPT-4o, the upper limit of explanation quality is bounded by the capabilities of the generator LLM.
English-Only Data: All evaluation datasets are in English, and performance in cross-lingual scenarios is unknown.
Weak Theoretical Guarantees: While there is intuitive justification from the latent variable framework, rigorous theoretical guarantees regarding OOD robustness improvements are lacking.
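The computational-cost point above can be made concrete with a back-of-the-envelope estimate. The figure of 40 tokens per explanation is an invented placeholder, and `explanation_tokens` is a hypothetical helper; only the explanation tokens are counted, since the input text itself does not scale with \(L\).

```python
def explanation_tokens(
    num_exemplars: int,
    num_labels: int,
    tokens_per_explanation: int,
    all_labels: bool,
) -> int:
    """Estimate explanation tokens in the prompt for X-ICL (one label) or X²-ICL (all labels)."""
    explanations_per_exemplar = num_labels if all_labels else 1
    return num_exemplars * explanations_per_exemplar * tokens_per_explanation


# 8-shot prompt, 3-way NLI labels, 40 tokens per explanation (assumed value).
x_icl_cost = explanation_tokens(8, 3, 40, all_labels=False)
x2_icl_cost = explanation_tokens(8, 3, 40, all_labels=True)
ratio = x2_icl_cost / x_icl_cost  # equals the number of labels L
```

Under these assumptions the explanation budget grows from 320 to 960 tokens, a factor of \(L = 3\); for a task with dozens of labels the same factor would dominate the prompt length.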

Related Work & Insights¶

X-ICL (He et al., 2024): The direct predecessor of this work, which enhances ICL by generating explanations for the correct labels of exemplars. X²-ICL extends this from single-label explanations to full-label explanations.
Few-shot CoT (Wei et al., 2022): Provides human-written reasoning steps as context. X-ICL/X²-ICL achieves scalability through machine-generated explanations.
Set-BSR (Gupta et al., 2023): Enhances ICL by retrieving diverse exemplars, but experiments show that reasoning diversity is more crucial than exemplar distribution diversity.
Self-Consistency (Wang et al., 2022): Improves CoT by sampling multiple reasoning paths and voting, echoing the "multi-path reasoning" concept in X²-ICL.
Insights: The core idea of X²-ICL—"considering reasoning for all possible labels"—could be extended to other scenarios requiring multi-perspective analysis, such as multiple-choice reasoning, debate-based reasoning, and chain-of-thought critique.

Rating¶

Novelty: ⭐⭐⭐⭐ Reinterprets ICL from the perspective of latent variable modeling and derives the method naturally; elegant and simple.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 models \(\times\) 10 datasets (2 ID and 8 OOD), providing a comprehensive comparison with largely consistent results.
Writing Quality: ⭐⭐⭐⭐⭐ Clear statistical framework, natural derivation of the methodology, and intuitive illustrations.
Value: ⭐⭐⭐⭐ Provides a simple and effective solution for enhancing ICL robustness, with highly generalizable insights.