Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability¶
Conference: ICLR 2026
arXiv: 2510.06084
Code: To be confirmed
Area: Self-Supervised Learning
Keywords: Post-training, Distributional Coverage, In-Context Steerability, RLHF, DPO, Instruction Following
TL;DR¶
This paper reveals that post-training methods such as RLHF and DPO systematically impair models' in-context steerability, output coverage, and distributional alignment. It proposes the Spectrum Suite evaluation framework and the Spectrum Tuning method, which the authors present as the first post-training approach to improve distributional alignment.
Background & Motivation¶
- Post-training pipelines—including SFT, RLHF, and DPO—have become standard practice in LLM development.
- Although post-training aims to make models "more helpful and safer," its side effects are severely underestimated:
- Reduced output coverage: Post-trained models tend to produce "safe-mean" style responses, with dramatically reduced output diversity.
- Loss of in-context steerability: Models become harder to guide toward specific output styles or formats via few-shot examples.
- Degraded distributional alignment: The match between model output distributions and target task distributions deteriorates.
- Key conceptual distinction:
- Capability elicitation ICL: Using a few examples to surface capabilities already encoded in the model (e.g., sentiment classification).
- In-context steerability: Using examples to precisely control the distributional characteristics of model outputs.
Method¶
Overall Architecture¶
Spectrum Suite provides >90 tasks requiring models to adapt to diverse distributions. Spectrum Tuning performs supervised fine-tuning on these tasks—each training sequence consists of a task description followed by multiple randomly permuted in-distribution samples (input/output pairs), with cross-entropy loss computed only over output tokens. This corresponds to the meta-training phase of meta-learning, but with the objective of fitting a distribution rather than a single answer.
Key Designs¶
- Spectrum Suite Dataset:
- Compiled from >40 data sources into >90 tasks, covering natural interpersonal variation (opinion modeling, chat preferences), text distributions (poetry formats, synthetic data), numerical distributions (normal distribution sampling), and uncertainty reasoning.
- Focuses on individual-level modeling data—each person represents a distinct data-generating task, providing rich task diversity.
- Training and test tasks are drawn from disjoint data sources to ensure generalization evaluation.
- Formal Definition of In-Context Steerability:
- Distinguished from capability elicitation ICL (using examples to activate existing model knowledge).
- Steerability = leveraging contextual information to override the model's prior and guide it toward a new data-generating distribution (e.g., mimicking a specific user's writing style).
- Requires the model to maintain a prior distribution family and maximally exploit contextual information for Bayesian posterior updating.
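This Bayesian framing can be made concrete with a toy Beta–Bernoulli model (my illustration, not the paper's): a weakly held prior is largely overridden by a handful of in-context observations, while an overconfident prior — the paper's hypothesis for instruction-tuned models — barely moves.

```python
# Toy Beta-Bernoulli illustration of in-context steerability (not the
# paper's model). The "task" is a Bernoulli rate p; the k in-context
# examples are observations that should update the model's prior.
def posterior_mean(prior_a, prior_b, successes, k):
    """Mean of the posterior Beta(prior_a + successes, prior_b + k - successes)."""
    return (prior_a + successes) / (prior_a + prior_b + k)

# A weak (steerable) prior moves most of the way toward the context's
# empirical rate of 0.8:
steerable = posterior_mean(1, 1, successes=8, k=10)    # -> 0.75

# An overconfident prior resists the same evidence:
stubborn = posterior_mean(90, 10, successes=8, k=10)   # -> ~0.89, barely moved
```

The same contrast is what the paper measures at scale: steerability is the degree to which the context, rather than the prior, dominates the predictive distribution.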
- Spectrum Tuning Training Procedure:
- Each sequence: \([\text{description}] \| [\text{input}_1, \text{output}_1] \| [\text{input}_2, \text{output}_2] \| \cdots\)
- The description is randomly dropped (probability 0.2); when dropped, the loss on the first output is excluded—encouraging the model to utilize both descriptions and in-context examples.
- Cross-entropy loss is computed only over output tokens; training runs for ≤1 epoch to prevent overfitting. In the underfitting regime, CE loss encourages calibrated distributional estimation.
- Sample order is randomly permuted, reflecting exchangeability: in Bayesian terms, the posterior should be invariant to the order of the observations.
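A minimal sketch of how one such training sequence might be assembled (function names and the `-100` ignore-label convention are my assumptions, not the authors' code): pairs are permuted, the description is dropped with probability 0.2, and loss is masked to output tokens only, with the first output additionally masked when the description is dropped.

```python
import random

IGNORE = -100  # conventional "no loss" label (assumed; HF-style ignore_index)

def build_sequence(description, pairs, tokenize, rng, p_drop=0.2):
    """Build (token_ids, labels) for one Spectrum-Tuning-style sequence.

    pairs: list of (input_text, output_text). Labels equal the token ids
    on output positions and IGNORE elsewhere; if the description is
    dropped, the first output also gets IGNORE, since no signal yet
    identifies the task at that point.
    """
    pairs = list(pairs)
    rng.shuffle(pairs)                      # exchangeable sample order
    dropped = rng.random() < p_drop         # description dropout
    ids, labels = [], []
    if not dropped:
        toks = tokenize(description)
        ids += toks
        labels += [IGNORE] * len(toks)      # no loss on the description
    for i, (inp, out) in enumerate(pairs):
        in_toks, out_toks = tokenize(inp), tokenize(out)
        ids += in_toks + out_toks
        labels += [IGNORE] * len(in_toks)   # no loss on inputs
        if dropped and i == 0:
            labels += [IGNORE] * len(out_toks)  # first output carries no loss
        else:
            labels += out_toks              # CE on output tokens only
    return ids, labels
```

With cross-entropy computed only where `labels != IGNORE`, the model is trained to fit the conditional distribution of outputs given the accumulated context, not a single canonical answer.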
- Three Measurable Objectives:
- In-context steerability: \(k\)-shot accuracy / NLL, measuring whether the model can adapt to a new distribution.
- Valid output coverage: The range of valid answers covered in the output space.
- Distributional alignment: The degree of match between the output distribution and the target distribution.
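Illustrative implementations of the three objectives (names are hypothetical, and total variation distance stands in here for whichever divergence the paper actually uses for alignment):

```python
from collections import Counter

def kshot_nll(logprob_fn, context, target):
    """Steerability: negative log-likelihood of the target output given
    k in-context examples (lower = more steerable)."""
    return -logprob_fn(context, target)

def valid_coverage(samples, valid_set):
    """Coverage: fraction of the valid answer set reached by samples."""
    return len(set(samples) & valid_set) / len(valid_set)

def tv_distance(samples, target_probs):
    """Alignment: total variation distance between the empirical sample
    distribution and the target distribution (lower = better aligned)."""
    emp = Counter(samples)
    n = len(samples)
    support = set(emp) | set(target_probs)
    return 0.5 * sum(abs(emp[x] / n - target_probs.get(x, 0.0))
                     for x in support)
```

The three metrics are complementary: a mode-collapsed model can score well on NLL for the modal answer yet have near-zero coverage and high divergence from the target distribution.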
Loss & Training¶
Standard CE loss computed over output tokens only. Models are initialized from pretrained (non-instruction-tuned) weights, with 2–3 format-specific special tokens added (embeddings initialized from IT models). Training runs for 1 epoch, validated on three model families: gemma-3-12b, Llama-3.1-8B, and Qwen3-14B.
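The special-token initialization might look like the following sketch (shapes, row indices, and the function name are illustrative assumptions): new rows are appended to the pretrained embedding table and filled from the corresponding rows of the instruction-tuned checkpoint.

```python
import numpy as np

def add_special_tokens(pt_emb, it_emb, it_special_rows):
    """Append rows for new format tokens to a pretrained embedding table
    (pt_emb, shape [V, d]), initializing them from the given rows of an
    instruction-tuned checkpoint's table (it_emb). A sketch of the
    initialization described above, not the authors' implementation."""
    new_rows = it_emb[it_special_rows]               # (n_new, d)
    return np.concatenate([pt_emb, new_rows], axis=0)  # (V + n_new, d)
```

Reusing the IT embeddings gives the new format tokens a sensible starting point instead of random vectors, which matters when training runs for at most one epoch.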
Key Experimental Results¶
Main Results (In-Context Steerability, Spectrum Suite Test Tasks)¶
| Post-Training Method | Classification Accuracy Change | Loss Change | Notes |
|---|---|---|---|
| IT (instruction tuning) vs. PT | 35/76 groups show significant decline, only 7 show significant improvement | 50/50 groups worse (Gemma, Qwen) | IT systematically harms steerability |
| ST (Spectrum Tuning) vs. PT | Generally on par or improved | Generally on par or improved | ST restores or surpasses PT |
| ST vs. IT | Substantially outperforms IT | Substantially outperforms IT | Core value of ST |
Key Results (Held-out Test Tasks, gemma-3-12b)¶
| Task | ST Loss | PT Loss | IT Loss | Notes |
|---|---|---|---|---|
| WVS Individual Opinion Modeling (\(k\)=21) | 1.36 | 1.50 | 4.10 | ST best |
| Number Game (\(k\)=25) | 0.639 | 0.705 | 1.80 | ST best |
| Chatbot Preference Prediction (\(k\)=3) | 1.43 | 1.62 | 4.94 | ST best |
| Flight Prediction (\(k\)=9) | 1.09 | 1.32 | 4.06 | ST best |
Capability Elicitation ICL Unaffected¶
| Task Type | IT vs. PT Accuracy Change | Notes |
|---|---|---|
| General capability tasks (8 tasks) | 8/24 groups improved, 13 unchanged | IT does not harm capability elicitation |
| Steerability tasks | 35/76 groups declined | IT selectively harms steerability |
Key Findings¶
- First empirical demonstration that instruction tuning systematically impairs in-context steerability, with a trend opposite to that observed for capability elicitation ICL.
- Spectrum Tuning achieves a better balance between steerability and coverage than both PT and IT across all three model families.
- To the authors' knowledge, ST is the first post-training method to improve distributional alignment—even surpassing pretrained models.
- The steerability loss in IT models likely stems from over-optimization toward a single correct answer, resulting in an overly strong prior that resists contextual override.
Highlights & Insights¶
- The paper quantifies the "hidden cost" of post-training, turning what had been an emerging conjecture into systematic empirical evidence.
- The distinction between capability elicitation and in-context steerability is highly valuable, clarifying a long-standing conceptual conflation in the community.
- Spectrum Suite is itself a significant contribution as an evaluation framework, filling a critical gap in steerability benchmarking.
- Spectrum Tuning demonstrates that "restoring diversity while maintaining helpfulness" is achievable and not a zero-sum trade-off.
Limitations & Future Work¶
- Constructing Spectrum Tuning training data depends on defining the "target distribution," which varies across application domains.
- In safety-critical settings, improved steerability may increase susceptibility to jailbreaking—the trade-off between safety and steerability warrants deeper investigation.
- The weighting scheme across the >90 tasks is underspecified, despite large variation in steerability requirements among tasks.
- Comparison with concurrent work (e.g., Constitutional AI, ORPO) is insufficiently thorough.
Related Work & Insights¶
- RLHF/DPO: Ouyang et al., Rafailov et al.—Spectrum Tuning serves as a complement and corrective to these methods.
- Output diversity: Nucleus sampling, temperature scaling—these are inference-time solutions; Spectrum Tuning addresses the problem at training time.
- ICL theory: Min et al.—this work extends ICL from "capability elicitation" to a new dimension of "distributional control."
- Implication: Post-training should not optimize for a single metric (helpfulness alone); steerability and coverage should become standard evaluation dimensions.
Rating¶
- Novelty: ⭐⭐⭐⭐ (novel problem formulation and conceptual distinction)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (>90 tasks, multiple models and scales, highly comprehensive)
- Writing Quality: ⭐⭐⭐⭐ (clear concept definitions, coherent narrative)
- Value: ⭐⭐⭐⭐⭐ (significant implications for post-training paradigms)