Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding¶

Basic Information¶

Conference: ACL 2025
arXiv: 2502.13738
Code: https://github.com/Romainpkq/CD_ICL
Area: LLM / NLP
Keywords: In-Context Learning, Contrastive Decoding, Input-Label Mapping, NLU, LLM
TL;DR: Proposes ICCD (In-Context Contrastive Decoding), a training-free method that strengthens LLMs' use of input-label mapping information in ICL by contrasting the output distributions obtained with positive and negative in-context examples, yielding consistent improvements across 7 NLU tasks.

Background & Motivation¶

In-Context Learning (ICL) is a core capability that allows LLMs to adapt to new tasks using a few examples. Existing research identifies two key factors that contribute to the success of ICL:

Task Recognition (TR): Identifying the task type from the examples and leveraging pre-trained knowledge to make predictions.

Task Learning (TL): Directly learning the input-label mapping relationships from the examples.

Core Problem: LLMs over-rely on pre-trained knowledge and neglect input-label mapping information in ICL.

Classic Finding: Min et al. (2022) showed that even when the labels of ICL examples are randomly shuffled, model performance does not drop significantly. This indicates that the model relies more on the signal of "observing a sentiment analysis task" (Task Recognition) than on the specific mapping of "this positive sentence corresponds to the positive label" (Task Learning).

Practical Impact: When the task is inconsistent with the pre-training distribution (for example, with custom label mappings), the model makes erroneous predictions because it neglects the mapping relationships in the examples.

Method¶

Core Idea¶

Drawing inspiration from contrastive decoding techniques, ICCD extracts and reinforces the input-label mapping information by contrasting the output distributions obtained with positive and negative in-context examples.

ICCD Formula Derivation¶

Standard ICL decoding, where $\mathbf{c}$ denotes the in-context examples and $\mathcal{T}(x)$ the templated test input: $y \sim p_\theta(y | \mathbf{c}, \mathcal{T}(x))$

ICCD enhanced decoding: $$y_t \sim \text{softmax}(\mathbf{z}_t + \alpha(\mathbf{z}_t - \mathbf{z}_t^-))$$

Equivalent form: $$\tilde{p}_\theta(y|\mathbf{c}, \mathbf{c}^-, \mathcal{T}(x)) \propto p_\theta(y|\mathbf{c}, \mathcal{T}(x)) \left(\frac{p_\theta(y|\mathbf{c}, \mathcal{T}(x))}{p_\theta(y|\mathbf{c}^-, \mathcal{T}(x))}\right)^\alpha$$
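
The step from the logit form to the probability form can be made explicit. The following is a sketch of the reasoning rather than a derivation reproduced from the paper; it assumes $\mathbf{z}_t$ and $\mathbf{z}_t^-$ are the logits produced under $\mathbf{c}$ and $\mathbf{c}^-$, and writes $z_t[y]$ for the entry of label $y$. Since a softmax only subtracts a normalizer that does not depend on $y$:

$$\log p_\theta(y|\mathbf{c}, \mathcal{T}(x)) = z_t[y] - \log\sum_{y'} e^{z_t[y']}, \qquad \log p_\theta(y|\mathbf{c}^-, \mathcal{T}(x)) = z_t^-[y] - \log\sum_{y'} e^{z_t^-[y']}$$

Substituting both into the logarithm of the equivalent form gives

$$\log p_\theta(y|\mathbf{c}, \mathcal{T}(x)) + \alpha\left(\log p_\theta(y|\mathbf{c}, \mathcal{T}(x)) - \log p_\theta(y|\mathbf{c}^-, \mathcal{T}(x))\right) = z_t[y] + \alpha\left(z_t[y] - z_t^-[y]\right) + C$$

where $C$ collects the two normalizers and is constant in $y$. Exponentiating and renormalizing over $y$ removes $C$, which leaves exactly $\text{softmax}(\mathbf{z}_t + \alpha(\mathbf{z}_t - \mathbf{z}_t^-))$.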

Intuitive Understanding:

- $\mathbf{z}_t$: Logits given the positive (original) examples, reflecting the correct mapping plus pre-trained knowledge.
- $\mathbf{z}_t^-$: Logits given the negative examples, reflecting an incorrect mapping plus the same pre-trained knowledge.
- $\mathbf{z}_t - \mathbf{z}_t^-$: The difference cancels the shared pre-trained knowledge and leaves the input-label mapping signal.
- Adding this signal back to the original logits, scaled by $\alpha$, strengthens the model's focus on the mapping information.
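
The two forms of the ICCD formula can be checked numerically. The snippet below is a minimal sketch in plain Python, not the authors' implementation: the function names (`iccd_probs`, `iccd_probs_ratio_form`) and the toy logits over three labels are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def iccd_probs(z_pos, z_neg, alpha=1.0):
    """Logit form: softmax(z_pos + alpha * (z_pos - z_neg))."""
    adjusted = [zp + alpha * (zp - zn) for zp, zn in zip(z_pos, z_neg)]
    return softmax(adjusted)


def iccd_probs_ratio_form(z_pos, z_neg, alpha=1.0):
    """Probability form: p_pos * (p_pos / p_neg) ** alpha, renormalized."""
    p_pos = softmax(z_pos)
    p_neg = softmax(z_neg)
    unnorm = [pp * (pp / pn) ** alpha for pp, pn in zip(p_pos, p_neg)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy logits (hypothetical values): the negative context lowers label 1 only,
# so label 1 is the one supported by the input-label mapping.
z_pos = [2.0, 1.5, 0.1]
z_neg = [2.0, 0.5, 0.1]

p_std = softmax(z_pos)                          # standard ICL distribution
p_iccd = iccd_probs(z_pos, z_neg, alpha=1.0)    # ICCD distribution
```

With $\alpha = 0$ the function reduces to standard ICL decoding, and in this toy case the probability of label 1 rises once the contrast term is added.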

Negative Example Construction (Key Designs)¶

Why modify the inputs instead of the labels?

- Directly modifying labels introduces completely different label biases, distorting the mapping information.
- Modifying inputs keeps the label distribution unchanged while altering only the mapping relationships.

Specific Method: For each example $(x_i, y_i)$:

1. Randomly select a different label $y_j$ ($y_j \neq y_i$).
2. Randomly select an input $x_j$ with label $y_j$ from the exemplar pool.
3. Construct the negative example $(x_j, y_i)$, in which the input and label do not match.
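
The construction steps above can be sketched as follows. This is an illustrative sketch, not the code from the linked repository: `build_negative_examples` is a hypothetical helper, the toy sentiment pool is made up, and the sketch assumes the "different label" is drawn only from labels that actually occur in the exemplar pool.

```python
import random


def build_negative_examples(examples, pool, rng=None):
    """Pair each original label y_i with an input x_j whose true label differs.

    examples: list of (input, label) pairs used as positive demonstrations.
    pool:     list of (input, label) pairs to draw replacement inputs from.
    """
    rng = rng or random.Random()

    # Index the pool by label so an input with a given label can be sampled.
    by_label = {}
    for x, y in pool:
        by_label.setdefault(y, []).append(x)

    negatives = []
    for _x_i, y_i in examples:
        # Step 1: pick a label y_j different from y_i.
        other_labels = [y for y in by_label if y != y_i]
        if not other_labels:
            raise ValueError("pool must contain an input with a different label")
        y_j = rng.choice(other_labels)
        # Step 2: pick an input x_j that carries label y_j.
        x_j = rng.choice(by_label[y_j])
        # Step 3: keep the original label, swap in the mismatched input.
        negatives.append((x_j, y_i))
    return negatives


# Toy exemplar pool (made-up data).
pool = [
    ("great movie", "positive"),
    ("loved it", "positive"),
    ("boring plot", "negative"),
    ("awful acting", "negative"),
]
demos = [("great movie", "positive"), ("boring plot", "negative")]
negatives = build_negative_examples(demos, pool, rng=random.Random(0))
```

Because only the inputs are swapped, the label sequence of the negative context is identical to that of the positive context, which is the property the design relies on.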

$\alpha$ Parameter: Controls the importance of the input-label mapping information, set to 1 by default.

Experiments¶

Experimental Setup¶

Models: Llama3.2-1B/3B, Llama3.1-8B, Qwen2-0.5B/1.5B/7B (6 models in total)
Tasks: 7 NLU tasks (SST-2, SST-5, CR, Subj, QNLI, MNLI, AG_NEWS)
Exemplar Selection Methods: Random, BM25, TopK
ICL Setup: 16-shot, $\alpha=1$

Main Results (Table 2)¶

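| Model | Method | SST2 | Subj | QNLI | MNLI | Avg |
|---|---|---|---|---|---|---|
| Llama3.2-1B | Regular | 89.8 | 72.8 | 53.5 | 36.6 | 66.1 |
| Llama3.2-1B | ICCD | 91.1 | 83.0 | 53.8 | 39.2 | 68.3 (+2.1) |
| Llama3.2-3B | Regular | 93.7 | 86.0 | 54.2 | 56.9 | 72.9 |
| Llama3.2-3B | ICCD | 94.0 | 92.1 | 57.2 | 57.0 | 74.6 (+1.7) |
| Llama3.1-8B | Regular | 96.7 | 94.0 | 60.3 | 65.3 | 77.6 |
| Llama3.1-8B | ICCD | 96.5 | 96.1 | 65.4 | 67.5 | 79.4 (+1.8) |
| Qwen2-1.5B | Regular | 95.2 | 72.3 | 60.2 | 61.8 | 72.3 |
| Qwen2-1.5B | ICCD | 95.1 | 81.5 | 61.8 | 65.2 | 74.6 (+2.3) |
| Qwen2-7B | Regular | 96.0 | 82.3 | 71.4 | 78.7 | 79.4 |
| Qwen2-7B | ICCD | 96.3 | 90.4 | 72.8 | 79.9 | 81.3 (+1.9) |

Only four of the seven tasks are shown. Avg is not the mean of these four columns; it matches the Random column of Table 1 and so presumably averages all 7 tasks.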

Compatibility with Different Exemplar Selection Methods (Table 1)¶

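| Model | Method | Random | BM25 | TopK |
|---|---|---|---|---|
| Llama3.1-8B | Regular | 77.6 | 79.7 | 80.2 |
| Llama3.1-8B | ICCD | 79.4 | 80.8 | 80.9 |

ICCD improves the average score under every exemplar selection method (+1.8 with Random, +1.1 with BM25, +0.7 with TopK) and is reported to have lower variance (greater stability). The gain shrinks as the selection method gets stronger.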

Larger Label Spaces (Table 3)¶

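| Model | Method | TREC (6 classes) | Dbpedia (14 classes) |
|---|---|---|---|
| Llama3.2-1B | Regular | 40.0 | 85.6 |
| Llama3.2-1B | ICCD | 46.2 (+6.2) | 90.5 (+4.9) |
| Llama3.1-8B | Regular | 41.0 | 87.5 |
| Llama3.1-8B | ICCD | 46.6 (+5.6) | 93.8 (+6.3) |

ICCD's gains grow on tasks with larger label spaces: about +5 to +6 points on TREC and +5 to +8 on Dbpedia across models (the two models shown gain +5.6 to +6.2 and +4.9 to +6.3, respectively), compared with roughly +2 points on average in the main results.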

Alignment (Chat) Model Validation¶

ICCD is equally effective on Llama3.2-1B/3B-Instruct and Llama3.1-8B-Instruct, demonstrating its applicability to instruction-tuned and RLHF-aligned models.

Key Analysis¶

1. Comparison of Negative Example Construction Methods (Table 4)

| Method | Random | BM25 | TopK |
|---|---|---|---|
| Regular | 77.6 | 79.7 | 80.2 |
| +NULL (No exemplars/blank) | 73.0 | 75.8 | 76.5 |
| +Label (Modified label) | 77.3 | 79.5 | 80.0 |
| +Input (Modified input) | 79.4 | 80.8 | 80.9 |

Using an empty (NULL) context as the negative reduces performance below regular decoding (the contrast removes the contribution of pre-trained knowledge).
Modifying labels is nearly ineffective (introducing label bias counteracts the mapping signal).
Modifying inputs is optimal (maintaining the label distribution unchanged while only altering the mapping relationships).

2. KL Divergence Analysis (Table 5)

The KL divergence between the output distributions obtained with positive and negative examples is substantial on most tasks (SST2: 0.64, AGNEWS: 0.79), which supports the claim that the contrast isolates distinct mapping information.
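
For reference, the quantity in this analysis can be computed as below. This is a minimal sketch under stated assumptions: the summary does not say which direction of the divergence or which logarithm base the paper uses, so the code assumes $\mathrm{KL}(p^{+} \,\|\, p^{-})$ in nats, and the two label distributions are made-up toy values.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same labels."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same label set")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Toy label distributions for one test input (hypothetical values).
p_pos = [0.7, 0.3]   # under the positive (original) examples
p_neg = [0.4, 0.6]   # under the negative (mismatched) examples

kl_value = kl_divergence(p_pos, p_neg)
```

A value of zero would mean the negative context changes nothing, in which case the contrast term $\mathbf{z}_t - \mathbf{z}_t^-$ would carry no signal.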

3. Influence of Number of Shots

As the number of shots increases from 1 to 16, ICCD's margin over regular decoding widens, because more examples provide richer input-label mapping information for the contrast.

4. $\alpha$ Sensitivity Analysis (Table 6)

- Performance continuously improves as $\alpha$ increases from 0 to 1.
- Performance stabilizes once $\alpha \geq 1.0$.
- For advanced selection methods (e.g., TopK), an excessively large $\alpha$ may slightly degrade performance.
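
If $\alpha$ needs tuning for a new task, a simple sweep on held-out data is one option. The sketch below is an assumption-laden illustration, not the paper's procedure: the helper names are hypothetical, and the five logit pairs are hand-built toy values (chosen so that one example flips at small $\alpha$ and another at large $\alpha$), not experimental results.

```python
def iccd_predict(z_pos, z_neg, alpha):
    """Index of the top label under ICCD; softmax is monotone, so argmax of logits suffices."""
    adjusted = [zp + alpha * (zp - zn) for zp, zn in zip(z_pos, z_neg)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


def accuracy_at_alpha(logit_pairs, gold, alpha):
    """Fraction of held-out examples predicted correctly at a given alpha."""
    correct = sum(
        iccd_predict(z_pos, z_neg, alpha) == y
        for (z_pos, z_neg), y in zip(logit_pairs, gold)
    )
    return correct / len(gold)


def sweep_alpha(logit_pairs, gold, alphas):
    """Accuracy for every candidate alpha."""
    return {a: accuracy_at_alpha(logit_pairs, gold, a) for a in alphas}


# Hand-built toy validation set: (positive-context logits, negative-context logits).
logit_pairs = [
    ([2.0, 1.5], [2.0, 0.5]),    # only correct once alpha is large enough
    ([1.0, 0.2], [0.1, 0.3]),
    ([0.4, 0.6], [0.5, 0.1]),
    ([1.2, 1.0], [1.0, 1.1]),
    ([1.0, 0.5], [0.5, -0.3]),   # becomes wrong when alpha is too large
]
gold = [1, 0, 1, 0, 0]

sweep = sweep_alpha(logit_pairs, gold, alphas=[0.0, 0.25, 0.75, 1.0, 2.0])
best_score = max(sweep.values())
best_alpha = min(a for a, acc in sweep.items() if acc == best_score)
```

Picking the smallest $\alpha$ among the best-scoring candidates is a design choice of this sketch; it keeps the decoding as close to standard ICL as the held-out data allows.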

Highlights & Insights¶

Extremely Simple Method: The core idea is summarized in a single formula; it requires no training and is plug-and-play.
Clear Theoretical Motivation: Starting from the separation of TR and TL in ICL, it precisely targets the problem (the neglect of input-label mapping).
Clever Design in Negative Example Construction: Modifying inputs instead of labels avoids introducing label bias.
Broad Compatibility: Effective across model series (Llama/Qwen), model scales (0.5B-8B), exemplar selection methods, and alignment states.
Open Source Code: Facilitates reproduction and extension.
Contribution to Understanding ICL Mechanisms: Provides a new perspective for understanding the internal mechanisms of ICL using analytical tools such as KL divergence.

Limitations & Future Work¶

Doubled Computational Cost: Each prediction requires two forward passes (positive + negative examples), roughly doubling the inference cost.
Choice of $\alpha$: $\alpha = 1$ works well as a default but may not be optimal for every task, so tuning may be required.
Only Validated on Classification Tasks: Not yet validated on generative tasks (such as translation, summarization, or dialogue).
Randomness of Negative Examples: The construction method relies on random sampling, which may introduce variance.
Upper Limit of Model Scale: Tested up to 8B/7B models, not yet validated on 70B+ models.
Limited Improvement on Fine-Grained Tasks Like SST-5: Potentially due to the inherently blurry boundaries between labels.

Related Work¶

ICL Mechanism Research: Label randomization experiments in Min et al. (2022), TR vs TL in Pan et al. (2023).
Contrastive Decoding: Contrastive Decoding (Li et al., 2023), DoLa, CD-ICL.
ICL Exemplar Selection: Random, BM25, TopK (Liu et al., 2022), KATE.
LLMs: Llama-3 series, Qwen2 series.

Rating ⭐⭐⭐⭐¶

Novelty: ⭐⭐⭐⭐ — Cleverly addresses the mapping neglect problem in ICL from a contrastive decoding perspective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Exhaustive evaluation with 6 model scales, 7 tasks, 3 selection methods, and multi-dimensional analysis.
Value: ⭐⭐⭐⭐ — No training required and the code is open-source, though doubled computational cost remains a practical roadblock.
Writing Quality: ⭐⭐⭐⭐ — Clear mathematical derivations and comprehensive ablation studies.