
Learning to Steer: Input-dependent Steering for Multimodal LLMs

Conference: NeurIPS 2025
arXiv: 2508.12815
Code: https://github.com/jayneelparekh/learn-to-steer
Area: Multimodal VLM / Model Safety / Hallucination Mitigation / Representation Steering
Keywords: steering, input-dependent, hallucination mitigation, safety enforcement, contrastive prompting

TL;DR

Existing steering methods rely on fixed direction vectors that cannot adapt to diverse inputs. To address this limitation, the paper proposes L2S (Learn-to-Steer): it first generates ideal input-specific steering vectors via contrastive prompting (P2S), then trains a lightweight 2-layer MLP to predict these vectors from the input context. The result is input-dependent behavioral steering at negligible overhead that significantly outperforms static steering baselines on both safety enforcement and hallucination mitigation.

Background & Motivation

Background: Steering guides model behavior by applying linear offsets to the latent representations of LLMs/MLLMs, serving as a lightweight post-hoc control mechanism. Mainstream approaches (e.g., CAA/mean-steering) compute the mean difference between positive and negative behavior representations as a fixed steering vector applied uniformly to all inputs.

Limitations of Prior Work: The fundamental limitation of fixed steering vectors is that the instantiation of desired behavior is input-dependent. For instance, a safe response to an illegal-activity query should be a refusal, whereas a safe response to a medical consultation should recommend seeking expert advice. These two "safe" behaviors are qualitatively different and cannot be captured by a single fixed vector.

Key Challenge: Computing the ideal input-specific steering vector (P2S) requires knowing the desired output content, yet at inference time steering is needed precisely because the answer is unknown, creating a chicken-and-egg problem.

Key Insight: Although the desired answer is unavailable at inference time, contrastive prompts from training data can be used to construct P2S vectors as "teacher signals," after which a minimal network is trained to predict these vectors from the input context.

Core Idea: A 2-layer MLP predicts input-specific steering vectors from intermediate-layer representations of the input, translating the theoretical advantages of P2S into the practically deployable L2S method.

Method

Overall Architecture

The framework consists of two phases:

  • Training phase: For each sample \(X=(I,T)\), input-specific positive/negative contrastive prompts \((T_X^+, T_X^-)\) are constructed. Under teacher forcing, the representation difference of the last token at layer \(L^*\) is extracted as the P2S vector \(z_{X,L^*}\). Simultaneously, the input context representation \(h_{X,L'}\) at layer \(L'\) is extracted. An MLP \(g_{\Theta}\) is trained such that \(g_\Theta(h_{X,L'}) \approx z_{X,L^*}\).
  • Inference phase: For a new input, \(h_{X,L'}\) is extracted, and the trained \(g_{\Theta^*}\) predicts the steering vector, which is applied to the representations of all generated tokens at layer \(L^*\).
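The training-phase extraction reduces to a last-token difference. Below is a minimal NumPy sketch that assumes the layer-\(L^*\) hidden states from the two teacher-forced passes have already been collected; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def p2s_vector(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Input-specific P2S steering vector: the difference of the
    last-token representations at layer L*, taken under teacher
    forcing over the positive prompt (I, T||T_X^+) and the
    negative prompt (I, T||T_X^-)."""
    return h_pos[-1] - h_neg[-1]

# toy hidden states, shape (sequence length, model dim)
rng = np.random.default_rng(0)
h_pos = rng.normal(size=(7, 16))   # pass over X^+
h_neg = rng.normal(size=(9, 16))   # pass over X^-

z = p2s_vector(h_pos, h_neg)       # teacher signal z_{X,L*}
print(z.shape)                     # (16,)
```

Because both prompts share the prefix \((I, T)\) and differ only in the appended behavior description, the difference isolates the behavior-specific direction for this particular input.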

Key Designs

  1. Input-Specific Contrastive Prompting (P2S):

    • Function: Generates prompt completions reflecting desired/undesired behavior for each input.
    • Mechanism: Constructs \(X^+ = (I, T||T_X^+)\) and \(X^- = (I, T||T_X^-)\), and extracts the difference of the last-token representations at layer \(L^*\) under teacher forcing: \(z_{X,L^*} = h_{L^*}^{q^+}(X^+) - h_{L^*}^{q^-}(X^-)\).
    • Design Motivation: Unlike the fixed prompt pairs in CAA, P2S allows different inputs to use different behavioral descriptions. For example, in safety scenarios, illegal activity queries use a "refusal" template while medical consultations use a "recommend expert" template.
  2. Learn-to-Steer (L2S) Auxiliary Network:

    • Function: Predicts P2S steering vectors from the input context, eliminating the need for contrastive prompts at inference time.
    • Mechanism: The input context is defined as the representation of the last input token at layer \(L'\): \(h_{X,L'} = h_{L'}^{N_V+N_T}(X)\). The training objective is mean squared error: \(\Theta^* = \arg\min_\Theta \mathbb{E}_X[\|z_{X,L^*} - g_\Theta(h_{X,L'})\|_2^2]\). At inference, the steering is applied to generated token \(p\) as \(h_{L^*}^p \leftarrow h_{L^*}^p + \alpha g_{\Theta^*}(h_{X,L'})\).
    • Design Motivation: The 2-layer MLP (hidden size 100) is extremely lightweight. Training requires only representation-space operations without loading main model gradients, making memory overhead negligible.
  3. Multi-Behavior Scenario Handling:

    • Function: Handles multiple distinct desired behaviors within the same steering framework.
    • Key Example (safety scenario): The first 9 categories of harmful content use "refusal/avoidance" templates for \((T_X^+, T_X^-)\); the remaining 3 sensitive consultation categories use "recommend expert" templates. L2S naturally supports multi-behavior by learning mappings from different inputs to different vectors.
    • Key Contrast: Mean-steering suffers interference when mixing vectors from different templates (Mean-S performs worse than Mean-S(BA)), whereas L2S handles this gracefully.
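At inference, the predicted vector is simply added to each generated token's layer-\(L^*\) representation. A minimal sketch of the auxiliary network and the steering update, with random weights standing in for trained parameters (class and function names are illustrative):

```python
import numpy as np

class SteeringMLP:
    """2-layer MLP g_Theta (hidden size 100) mapping the context
    representation h_{X,L'} to a predicted steering vector."""
    def __init__(self, d_model: int, d_hidden: int = 100, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.02, size=(d_model, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(scale=0.02, size=(d_hidden, d_model))
        self.b2 = np.zeros(d_model)

    def __call__(self, h: np.ndarray) -> np.ndarray:
        return np.maximum(h @ self.W1 + self.b1, 0.0) @ self.W2 + self.b2

def steer(h_token: np.ndarray, h_context: np.ndarray,
          g: SteeringMLP, alpha: float = 1.5) -> np.ndarray:
    """h^p_{L*} <- h^p_{L*} + alpha * g(h_{X,L'}),
    applied to every generated token p."""
    return h_token + alpha * g(h_context)

d = 32
g = SteeringMLP(d)                 # untrained stand-in for g_{Theta*}
rng = np.random.default_rng(1)
h_context = rng.normal(size=d)     # h_{X,L'} from layer L'
h_token = rng.normal(size=d)       # generated-token rep at layer L*
steered = steer(h_token, h_context, g)
print(steered.shape)               # (32,)
```

Note that \(h_{X,L'}\) is computed once per input, so the per-token overhead is a single small MLP forward pass plus a vector addition.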

Loss & Training

  • Auxiliary network: 2-layer MLP, hidden size 100
  • Trained for 100 epochs with the Adam optimizer, learning rate \(10^{-4}\) or \(5\times10^{-5}\)
  • Cosine learning rate schedule with plateau-based adaptation
  • Steering strength \(\alpha \in [1, 3.0)\) (LLaVA), ensuring response quality degradation < 10%
  • Safety task: \(L^*=15\) (steering layer), \(L'=30\) (context extraction layer)
  • Hallucination task: \(L^*=14, L'=14\)
  • Evaluated on LLaVA-v1.5-7B and Qwen2-VL-7B; runnable on a single RTX5000 (24GB) GPU
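The MSE objective above can be exercised end to end on synthetic data. In this sketch, plain full-batch gradient descent stands in for Adam with the cosine schedule, random tensors stand in for the extracted \((h_{X,L'}, z_{X,L^*})\) pairs, and constant factors in the gradients are absorbed into the learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n = 32, 100, 256

# random stand-ins for context reps h_{X,L'} and P2S targets z_{X,L*}
H = rng.normal(size=(n, d_model))
Z = rng.normal(size=(n, d_model))

# 2-layer MLP g_Theta, hidden size 100 as in the paper
W1 = rng.normal(scale=0.05, size=(d_model, d_hidden)); b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.05, size=(d_hidden, d_model)); b2 = np.zeros(d_model)

lr = 1e-2  # plain gradient descent stands in for Adam + cosine schedule
losses = []
for epoch in range(100):                      # 100 epochs, full batch
    A = np.maximum(H @ W1 + b1, 0.0)          # ReLU hidden activations
    pred = A @ W2 + b2
    err = pred - Z
    losses.append(float(np.mean(err ** 2)))   # MSE objective
    # backprop; constant factors are folded into lr
    gW2 = A.T @ err / n
    gb2 = err.mean(axis=0)
    dA = (err @ W2.T) * (A > 0)
    gW1 = H.T @ dA / n
    gb1 = dA.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(losses[0], losses[-1])  # loss decreases over training
```

Since the loss involves only the small MLP's parameters, no gradients flow through the main model, which is what keeps the memory footprint within a single 24GB GPU.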

Key Experimental Results

Safety Enforcement — MMSafetyBench (LLaVA-v1.5)

| Metric | No-steering | Prompt | Mean-S | Mean-S(BA) | L2S | P2S* |
|---|---|---|---|---|---|---|
| \(\mathbb{E}_{p\geq0.5}\)[Unsafe] ↓ | 0.276 | 0.248 | 0.161 | 0.089 | 0.082 | 0.094 |
| \(\mathbb{E}_{p\geq0.7}\)[Unsafe] ↓ | 0.234 | 0.207 | 0.129 | 0.066 | 0.057 | 0.064 |
| \(\mathbb{E}_{p\geq0.9}\)[Unsafe] ↓ | 0.204 | 0.183 | 0.102 | 0.041 | 0.034 | 0.042 |
| ED-score ↑ | 0.250 | 0.197 | 0.329 | 0.276 | 0.395 | 0.382 |
| Response quality ↑ | 6.92 | 7.34 | 6.61 | 6.42 | 6.56 | 6.49 |

Hallucination Mitigation — POPE (LLaVA-v1.5)

| Subset | Metric | No-steering | Prompt | Norm-Rnd | Mean-S | L2S | P2S* |
|---|---|---|---|---|---|---|---|
| Random | Accuracy ↑ | 82.73 | 84.91 | 82.38 | 84.29 | 86.46 | 89.26 |
| Random | F1 ↑ | 90.55 | 91.84 | 90.34 | 91.47 | 92.74 | 94.33 |
| Popular | Accuracy ↑ | 80.40 | 83.35 | 80.36 | 82.11 | 82.58 | 88.64 |
| Adversarial | Accuracy ↑ | 76.82 | 76.36 | 75.77 | 76.36 | 77.76 | 82.58 |

CHAIR Evaluation (LLaVA-v1.5, 500 COCO images)

| Method | CHAIR_s ↓ | CHAIR_i ↓ | Recall ↑ | Gemini Win Rate ↑ |
|---|---|---|---|---|
| No-steering | 17.31 | 52.80 | 71.23 | 35.80% |
| L2S | 16.10 | 51.80 | 73.50 | 64.20% |

Key Findings

  • L2S surpasses the P2S oracle on the safety task (Unsafe-score 0.082 vs. 0.094), indicating that the learned mapping generalizes better than per-sample ideal vector computation.
  • Mean-S degrades when mixing multiple behavior templates (Mean-S: 0.161 vs. Mean-S(BA): 0.089), whereas L2S handles multi-behavior settings simultaneously (ED-score 0.395, far exceeding all baselines).
  • Random-direction steering (Norm-Rnd) reduces harmful content but fails to elicit expert recommendation behavior, confirming the critical importance of steering direction precision.
  • On the hallucination task, Mean-S and Prompt fail to consistently improve across all subsets, while L2S uniformly outperforms all available baselines.
  • A Gemini Win Rate of 64.20% indicates that L2S not only reduces hallucinations but also improves description quality.

Highlights & Insights

  • The core insight of input-dependent steering is precise: desired behavior is not a fixed direction but a manifold conditioned on the input context — this is especially evident in safety scenarios (refusal vs. recommending experts vs. non-intervention).
  • The elegance of replacing teacher forcing with a 2-layer MLP lies in converting a theoretically impractical method (requiring the answer to perform steering) into a lightweight, deployable solution.
  • Training cost is minimal: only a small network in representation space needs to be trained without main model gradients, completable on a single 24GB GPU.
  • L2S outperforming the oracle P2S on the safety task suggests that the learned mapping provides a regularization effect with strong generalization.

Limitations & Future Work

  • Contrastive prompt design still requires manual effort; different application scenarios necessitate customizing distinct \((T_X^+, T_X^-)\) templates.
  • Steering is currently applied as a linear offset at a single layer \(L^*\); multi-layer or nonlinear steering may prove more effective.
  • The auxiliary network capacity (hidden size 100) may limit the modeling of complex behavioral patterns.
  • Validation is primarily conducted on LLaVA-v1.5 and Qwen2-VL; broader evaluation across more models and tasks is needed.
  • Performance is highly sensitive to the choice of \(\alpha\) (notable degradation at \(\alpha \geq 3\)); automated \(\alpha\) selection remains an open problem.
  • Misuse risk: the same methodology could be exploited to steer models toward harmful behaviors.

Comparison with Related Methods

  • vs. CAA (Contrastive Activation Addition): CAA uses a fixed mean-difference vector, suited to scenarios with a single behavioral instantiation; L2S extends this to input-dependent settings covering multi-behavior scenarios.
  • vs. CAST: CAST scales a fixed steering vector based on similarity to a condition vector, but the direction remains unchanged; in L2S, both direction and magnitude are input-dependent.
  • vs. PAI / AD-HH (attention head intervention): These methods directly manipulate attention weights, while L2S operates on residual stream representations; the two are complementary and can be combined.
  • vs. fine-tuning (SFT/RLHF): Fine-tuning is costly and may cause forgetting; L2S is a post-hoc method that leaves model weights unchanged.

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of input-dependent steering is natural yet previously underexplored; the two-stage P2S→L2S design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Two applications (safety + hallucination), two models, multiple evaluation dimensions, and comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated, examples are intuitive, and the method description is concise.
  • Value: ⭐⭐⭐⭐ Highly practical — a post-hoc behavioral control method with minimal cost, directly deployable in production environments.