Compositional Steering of Large Language Models with Steering Tokens¶
Conference: ACL 2026 arXiv: 2601.05062 Code: None Area: Model Compression Keywords: Compositional Steering, Steering Tokens, Self-Distillation, Multi-Behavior Control, Zero-Shot Composition
TL;DR¶
This paper proposes compositional steering tokens that compress behavioral instructions into input-space embedding vectors via self-distillation, and trains a dedicated composition token `<and>` that enables zero-shot composition of multiple behaviors, including unseen combinations.
Background & Motivation¶
Background: LLM deployment requires satisfying multiple simultaneous behavioral constraints (e.g., language, length, format). Fine-tuning is computationally expensive and may degrade general capabilities; arbitrary combinations of \(N\) behaviors imply \(2^N\) fine-tuning runs. Instruction-based steering is flexible but brittle—semantically equivalent prompts yield inconsistent behavior.
Limitations of Prior Work: (1) Activation-space steering methods (e.g., CAA) compose behaviors via vector addition, but directly combining independently trained modules is disruptive. (2) Gist tokens address only single-behavior compression and do not tackle compositional generalization. (3) Existing compositional steering lacks rigorous evaluation—most work provides only anecdotal evidence or omits baseline comparisons.
Key Challenge: Independently trained behavioral representations interfere when composed, yet training separately for each combination leads to combinatorial explosion. A representation that learns the concept of "composition" itself—rather than each specific combination—is needed.
Goal: Learn a universal composition token `<and>` that enables zero-shot composition of arbitrary behavior combinations without combination-specific training.
Key Insight: Placing behavioral representations in the input space (rather than the activation space) facilitates better zero-shot composition; freezing behavior tokens during composition token training forces `<and>` to learn the composition operation itself rather than behavior-specific adjustments.
Core Idea: Composition is treated as a learnable universal operator rather than a behavior-pair-specific adjustment.
Method¶
Overall Architecture¶
Training proceeds in two stages. (1) Each behavior's steering token is trained independently via self-distillation: a teacher receives the instruction text while a student receives the steering token, and training minimizes the KL divergence between their output distributions. (2) The LLM and behavior tokens are frozen, and a composition token `<and>` is trained on behavior pairs so that it captures the composition operation itself.
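The stage-1 objective can be sketched numerically. Below is a minimal NumPy illustration of the temperature-scaled KL term; the function names and toy logits are illustrative only (the paper's actual implementation operates on a full LLM's per-token output distributions):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_kl(teacher_logits, student_logits, T=10.0):
    """KL(P_teacher || P_student): the teacher sees the instruction text I_b,
    the student sees only the steering token <b>; in the actual method only
    the token embedding receives gradients."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Identical predictions give zero divergence; any mismatch is penalized.
same = self_distillation_kl([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
diff = self_distillation_kl([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The high temperature (\(T=10.0\), as stated in the Loss & Training section) flattens both distributions so the student is pushed to match the teacher's full distribution, not just its argmax.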
Key Designs¶
- Input-Space Steering Tokens:
  - Function: Compress behavioral instructions into a single input embedding vector.
  - Mechanism: A steering token \(\mathbf{e}_b \in \mathbb{R}^d\) resides in the model's input embedding space (not intermediate activations) and is trained via self-distillation by minimizing \(\text{KL}(P_{\text{teacher}}(y|x, I_b) \| P_{\text{student}}(y|x, \texttt{<b>}))\). Ten instruction paraphrases are used to prevent overfitting.
  - Design Motivation: Input-space behavioral representations are better suited for composition than activation-space ones, where direct vector addition tends to cause destructive interference.
- Universal Composition Token `<and>`:
  - Function: Learn the concept of "composition" independent of specific behaviors.
  - Mechanism: \(\mathbf{e}_{\text{<and>}}\) is trained on behavior pairs with behavior tokens frozen; this ensures that `<and>` learns the composition function itself rather than adjustments to individual behaviors. Zero initialization (to avoid bias toward any behavior) and orthogonal regularization (to prevent collapse into the behavior token space) are applied.
  - Design Motivation: If `<and>` were allowed to modify behavior tokens during training, it would learn only combination-specific adjustments rather than a general compositional capability.
- Orthogonal Regularization:
  - Function: Prevent the composition token from collapsing into the behavior token representation space.
  - Mechanism: \(\mathcal{L}_{\text{orth}} = \sum_{b \in \mathcal{B}_{\text{seen}}} \left(\frac{\mathbf{e}_{\text{<and>}} \cdot \mathbf{e}_b}{\|\mathbf{e}_{\text{<and>}}\| \cdot \|\mathbf{e}_b\|}\right)^2\), with the total loss \(\mathcal{L} = \mathcal{L}_{\text{dist}} + \lambda \cdot \mathcal{L}_{\text{orth}}\).
  - Design Motivation: Ablation experiments confirm that orthogonal regularization is critical for zero-shot compositional generalization.
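At inference time, input-space tokens compose by simple concatenation of embedding vectors. The sketch below assumes a `[e_b1, <and>, e_b2, ..., prompt]` layout, which is an illustrative guess at the sequence order rather than the paper's confirmed format; all vectors live in the model's input embedding space:

```python
import numpy as np

def compose_input(behavior_embeds, e_and, prompt_embeds):
    """Build the input embedding sequence for a multi-behavior request by
    interleaving behavior tokens with the universal <and> token, then
    appending the prompt embeddings. Layout is a hypothetical example."""
    parts = []
    for i, e_b in enumerate(behavior_embeds):
        if i > 0:
            parts.append(e_and)  # same <and> vector between every pair
        parts.append(e_b)
    return np.stack(parts + list(prompt_embeds))

d = 8
e_b1, e_b2 = np.ones(d), 2 * np.ones(d)  # toy behavior token embeddings
e_and = np.zeros(d)                      # <and> is zero-initialized
prompt = [np.random.randn(d) for _ in range(3)]
seq = compose_input([e_b1, e_b2], e_and, prompt)  # shape (6, 8)
```

Because the same `<and>` vector is reused between every pair, extending to three or more behaviors requires no new parameters, which is what makes the reported 2-behavior-to-3-behavior generalization possible.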
Loss & Training¶
The training objective combines self-distillation loss and orthogonal regularization. The LLM is fully frozen; only \(|\mathcal{B}|+1\) vectors of dimension \(d\) are learned. A temperature of \(T=10.0\) encourages matching the full output distribution. Behavior tokens are semantically initialized (as the mean of instruction token embeddings); the composition token is zero-initialized.
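The orthogonality term of the total loss \(\mathcal{L} = \mathcal{L}_{\text{dist}} + \lambda \cdot \mathcal{L}_{\text{orth}}\) can be sketched directly from its definition. This is a minimal NumPy version; the `eps` guard and the \(\lambda = 0.1\) placeholder are assumptions, not values from the paper:

```python
import numpy as np

def orth_loss(e_and, behavior_embeds, eps=1e-8):
    """Sum of squared cosine similarities between the <and> token and each
    seen behavior token; penalizes <and> drifting into the behavior subspace."""
    total = 0.0
    for e_b in behavior_embeds:
        cos = np.dot(e_and, e_b) / (np.linalg.norm(e_and) * np.linalg.norm(e_b) + eps)
        total += cos ** 2
    return total

def total_loss(dist_loss, e_and, behavior_embeds, lam=0.1):
    # lam is the regularization weight lambda; 0.1 is a placeholder value
    return dist_loss + lam * orth_loss(e_and, behavior_embeds)
```

The term is zero exactly when \(\mathbf{e}_{\text{<and>}}\) is orthogonal to every behavior token, and grows toward the number of seen behaviors as it aligns with them.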
Key Experimental Results¶
Main Results¶
2-Behavior Composition Accuracy (%) on Qwen3-8B
| Method | Seen Combinations | Unseen Combinations | Order Variance ↓ |
|---|---|---|---|
| CAA (Activation Steering) | 1.6 | 0.5 | - |
| LM-Steer | 18.1 | 13.4 | - |
| LoRA DARE | 81.5 | 44.8 | - |
| Instruction Steering | 86.2 | 67.3 | 12.3 |
| Steering Tokens | 90.5 | 75.5 | 4.8 |
| Steering Tokens + Instructions | 92.0 | 80.3 | 3.5 |
Ablation Study¶
| Configuration | Seen Combinations | Unseen Combinations | Notes |
|---|---|---|---|
| No `<and>` token | Degraded | Significantly degraded | Composition token is critical |
| No orthogonal regularization | Slightly degraded | Noticeably degraded | Orthogonality important for generalization |
| Random initialization of `<and>` | Slightly degraded | Degraded | Zero initialization is superior |
| Trained on 2-behavior only | — | Generalizes to 3-behavior | Compositional concept transfers |
Key Findings¶
- Steering tokens substantially outperform activation-space steering on both seen and unseen combinations (CAA: 1.6% vs. steering tokens: 90.5%).
- The composition token successfully generalizes to: unseen behavior combinations, combinations involving unseen behaviors, and 3-behavior compositions (despite training only on 2-behavior combinations).
- The hybrid approach (steering tokens + instructions) achieves the best performance across all settings, indicating complementarity between the two control signals.
- Composition accuracy and robustness improve with model scale (4B → 8B → 14B).
- Steering tokens exhibit substantially lower order variance than instruction steering, reflecting more stable behavioral representations.
Highlights & Insights¶
- The idea of "learning a composition operator rather than memorizing each combination" is both concise and powerful—analogous to learning a function versus storing a lookup table.
- Freezing behavior tokens during `<and>` training is the critical design decision that enables generalization rather than overfitting.
- The complementarity between steering tokens and instructions is noteworthy; compressed representations and natural language provide distinct types of control signals.
Limitations & Future Work¶
- Evaluation is restricted to automatically verifiable constraints (length, format, language); subjective behaviors such as style and tone are not covered.
- Each behavior requires independently trained steering tokens, resulting in training costs that scale linearly with the number of behaviors.
- The composition token is trained only on 2-behavior combinations; performance on larger compositions may degrade.
- The approach depends on the quality of self-distillation—if the teacher (instruction steering) itself fails to follow instructions reliably, the student cannot be trained effectively.
Related Work & Insights¶
- vs. CAA / Rimsky et al.: Activation-space steering suffers severe interference during composition (1.6%), while input-space steering tokens decisively outperform it.
- vs. Gist Tokens: Gist tokens compress only single instructions and do not address the compositional generalization problem.
- vs. LoRA Merging: LoRA DARE is competitive on seen combinations (81.5%) but generalizes poorly to unseen ones (44.8%).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The concept of a universal composition operator and the frozen-training design are highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Seven models, 15 behaviors, multiple composition settings, and over 1M evaluations—remarkably comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clear, the method is elegant, and the experimental design is rigorous.
- Value: ⭐⭐⭐⭐⭐ Provides a concise and effective new paradigm for multi-behavior controllable generation.