# Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering
- Conference: AAAI 2026
- arXiv: 2505.12189
- Code: https://github.com/neuro-symbolic-ai/steering_content_effects
- Area: Reasoning
- Keywords: content effects, activation steering, syllogistic reasoning, formal logic, reasoning bias
## TL;DR
This paper applies activation steering to mitigate content effects in LLMs — the tendency to conflate content believability with formal logical validity. The proposed K-CAST (kNN-based Conditional Activation Steering) method achieves up to 15% improvement in formal reasoning accuracy on models unresponsive to standard static steering.
## Background & Motivation
Background: LLMs exhibit strong commonsense reasoning capabilities but suffer from "content effects" in formal logical reasoning — when syllogism content aligns with world knowledge (e.g., "All students read; some readers are professors; therefore some students are professors"), models are more inclined to judge it as logically valid, even when it is not. This mirrors content bias observed in human cognition.
Limitations of Prior Work: (a) Chain-of-thought prompting improves reasoning but cannot eliminate content effect bias; (b) fine-tuning is costly and cannot fully debias models; (c) neurosymbolic approaches require integration of external symbolic solvers, increasing system complexity.
Key Challenge: Formal reasoning requires that validity depend solely on logical form, not content. However, world knowledge acquired during pretraining "contaminates" the formal reasoning process, causing believable but logically invalid arguments to be incorrectly accepted.
Goal: (a) Identify which internal layers of LLMs encode formal validity and content believability; (b) reduce the influence of content on formal reasoning through inference-time activation intervention.
Key Insight: Activation steering is an inference-time technique that requires no model retraining; it modulates model behavior by adding or subtracting "steering vectors" to internal activations. The authors proceed progressively from probing → static steering → conditional steering.
Core Idea: This work is the first to apply activation steering to content effect mitigation. It finds that static contrastive steering is effective for most but not all models, and proposes K-CAST — a fine-grained kNN-based conditional steering method — to handle unresponsive models.
## Method
### Overall Architecture
A four-step pipeline: (1) construct a controlled dataset (16K+ syllogisms spanning all four quadrants of believable/unbelievable × valid/invalid); (2) identify relevant layers via linear probing (third quartile of the residual stream); (3) compute contrastive steering vectors and apply static steering; (4) apply K-CAST conditional steering for models unresponsive to static steering.
### Key Designs
- Controlled Dataset Construction:
  - Function: Generate ~16K English syllogisms that systematically decouple formal validity from content believability.
  - Mechanism: Twenty-four abstract syllogistic schemata are instantiated into natural language using hypernym–hyponym relations from WordNet. Each schema yields four variants: believable-valid, unbelievable-valid, believable-invalid, and unbelievable-invalid.
  - Design Motivation: Existing reasoning benchmarks do not systematically control for content believability, making it impossible to precisely measure content effects.
- Contrastive Activation Addition (CAA):
  - Function: Compute the difference vector between activations corresponding to correct reasoning and biased erroneous reasoning, then add it to model activations at inference time.
  - Mechanism: The positive vector is the mean activation over correctly predicted instances, and the negative vector is the mean over incorrectly predicted instances. The steering vector is \(\Delta\phi = \frac{1}{N}\sum_{i=1}^{N}(a_i^+ - a_i^-)\). At inference time, \(\tilde{\phi}(x) = \phi(x) + \alpha \cdot \Delta\phi\), where \(\alpha\) controls the intervention strength.
  - Design Motivation: Grounded in the "linear representation hypothesis" from representation engineering: bias directions are linearly separable in activation space, and adding or subtracting along that direction controls the bias.
- K-CAST (kNN-based Conditional Activation Steering):
  - Function: Address the failure of static steering on certain models by dynamically determining steering parameters on a per-instance basis.
  - Mechanism: Activation vectors and condition labels for each training sample are stored. At inference time, the \(k\) nearest neighbors of the new input are retrieved, and a condition label is determined by majority vote: \(\hat{y}(x) = \text{sign}(\sum_{j \in \mathcal{N}_k} y_j)\). The sign of \(\alpha\) is then adjusted dynamically: \(\tilde{\phi}(x) = \phi(x) - \hat{y}(x) \cdot \alpha \cdot \Delta\phi\).
  - Design Motivation: Standard CAST makes decisions based on aggregated conditional vectors, incurring substantial information loss. K-CAST retains per-sample activations and leverages the local structure of the activation space for finer-grained conditional judgment.
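The CAA computation reduces to a mean activation difference plus an additive intervention. Below is a minimal sketch with simulated activations; the dimensions, data, and the `steer` helper are illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size for the steered layer

# Simulated last-token activations: two clusters separated along a fixed
# "bias" direction, standing in for correctly vs. incorrectly answered items.
bias_dir = rng.normal(size=d)
bias_dir /= np.linalg.norm(bias_dir)
a_pos = rng.normal(size=(100, d)) + 2.0 * bias_dir  # correct predictions
a_neg = rng.normal(size=(100, d)) - 2.0 * bias_dir  # biased errors

# Contrastive steering vector: Delta_phi = (1/N) * sum(a_i^+ - a_i^-)
delta_phi = (a_pos - a_neg).mean(axis=0)

def steer(phi, alpha=1.0):
    """Static intervention: phi_tilde = phi + alpha * Delta_phi."""
    return phi + alpha * delta_phi

x = rng.normal(size=d) - 2.0 * bias_dir  # an activation on the "biased" side
# Steering moves the activation along the correct-reasoning direction:
print(steer(x) @ bias_dir > x @ bias_dir)  # True
```

Negating \(\alpha\) pushes the activation the opposite way, which is exactly the handle that K-CAST flips per instance.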
### Loss & Training
- No training is required; the method is purely inference-time intervention.
- Steering vectors are computed on a 2,400-sample training set; intervention is applied at the last token position at the layer in the third quartile of the residual stream.
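K-CAST's per-instance sign selection can be sketched in the same toy setting; the stored activation bank, the labels, and `kcast_steer` below are hypothetical stand-ins under a simulation, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # toy hidden size

# Assumed inputs: a steering vector Delta_phi, plus stored per-sample
# activations with condition labels y in {+1, -1} (all simulated here).
delta_phi = rng.normal(size=d)
bank_acts = rng.normal(size=(200, d))
bank_labels = np.where(bank_acts @ delta_phi > 0, 1, -1)  # toy labels

def kcast_steer(phi, k=5, alpha=1.0):
    # 1) retrieve the k nearest neighbours of phi in the activation bank
    nearest = np.argsort(np.linalg.norm(bank_acts - phi, axis=1))[:k]
    # 2) majority vote over their condition labels: y_hat = sign(sum y_j)
    y_hat = np.sign(bank_labels[nearest].sum())
    # 3) flip the steering direction per instance:
    #    phi_tilde = phi - y_hat * alpha * Delta_phi
    return phi - y_hat * alpha * delta_phi, y_hat

steered, y_hat = kcast_steer(rng.normal(size=d))
print(y_hat in (-1.0, 1.0))  # True (k is odd, so the vote never ties)
```

An odd \(k\) avoids tied votes; with an even \(k\), \(\hat{y}\) could be zero and the instance would be left unsteered.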
## Key Experimental Results
### Main Results (Static Contrastive Steering)
| Model | Size | Baseline Acc/CE | Best Steering Acc/CE | Relative Gain |
|---|---|---|---|---|
| Qwen 2.5 | 7B | 16.48 | 93.65 | +468% |
| Gemma 2 | 9B | 10.05 | 43.37 | +331% |
| Llama 3.2 | 1B | 1.32 | 11.58 | +777% |
| Llama 3.1 | 8B | 2.54 | 6.06 | +138% |
Static steering is effective for most models, but Llama 3.2 3B and Qwen 2.5 3B remain unresponsive.
### Conditional Steering (K-CAST)
| Model | Baseline Acc/CE | K-CAST Acc/CE | Relative Gain |
|---|---|---|---|
| Llama 3.2 3B | 4.45 | 22.92 | +415% |
| Qwen 2.5 3B | 11.99 | 16.42 | +37% |
For Llama 3.2 3B, the model unresponsive to static steering, K-CAST improves accuracy from 77.79% to 92.60% and reduces the content-effect (CE) score from 17.50 to 4.04.
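The paper's exact content-effect (CE) metric is not reproduced in these notes. One common way to quantify it is the accuracy gap between belief-consistent items (believable-valid, unbelievable-invalid) and belief-inconsistent items (believable-invalid, unbelievable-valid); the sketch below assumes that definition, and the field names are hypothetical:

```python
def content_effect(records):
    """CE as an accuracy gap, in percentage points (assumed definition).

    records: dicts with boolean 'believable', 'valid', and 'correct' fields.
    """
    def acc(items):
        return sum(r["correct"] for r in items) / len(items)
    consistent = [r for r in records if r["believable"] == r["valid"]]
    inconsistent = [r for r in records if r["believable"] != r["valid"]]
    return 100.0 * (acc(consistent) - acc(inconsistent))

# 90% accuracy on belief-consistent items vs. 60% on belief-inconsistent ones:
demo = (
    [{"believable": True, "valid": True, "correct": True}] * 9
    + [{"believable": True, "valid": True, "correct": False}]
    + [{"believable": True, "valid": False, "correct": True}] * 6
    + [{"believable": True, "valid": False, "correct": False}] * 4
)
print(round(content_effect(demo), 1))  # 30.0
```

Under this reading, a CE of zero would mean believability has no influence on judged validity; steering aims to push CE toward zero without hurting overall accuracy.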
### Key Findings
- Linear controllability: The sign of \(\alpha\) explicitly controls the direction of accuracy on valid vs. invalid arguments — positive \(\alpha\) improves accuracy on valid arguments, negative \(\alpha\) on invalid ones.
- Optimal layers lie in the third quartile: Linear probing consistently shows that information about formal validity and content believability is most abundant in the later layers of the residual stream.
- Minimal side effects on language modeling: Post-steering perplexity changes across English, Chinese, and German remain below 2%.
- Partial generalization to OOD reasoning tasks: Some generalization to other reasoning datasets is observed, though it varies across models.
## Highlights & Insights
- First application of activation steering to reasoning debiasing: Prior work on activation steering focused primarily on safety and toxicity control. This paper demonstrates that the same technique can mitigate reasoning bias, extending the applicability of representation engineering.
- K-CAST: fine-grained conditional steering: Replacing coarse-grained aggregation with kNN-based conditional judgment preserves the local structural information of the training set, offering a general improvement strategy for activation steering.
- Elegant dataset design: The use of WordNet hypernym–hyponym relations to systematically control content believability yields a clean, confound-free benchmark with methodological value for studying reasoning biases in LLMs.
## Limitations & Future Work
- Validation is limited to syllogistic reasoning; effectiveness on more complex reasoning forms (mathematical, causal) remains unknown.
- K-CAST requires storing training-set activation vectors, incurring storage overhead at scale.
- OOD generalization is unstable, with large variance across models.
- The optimal value of \(\alpha\) requires search; an automatic determination mechanism is lacking.
- The 16K dataset is relatively small; the stability of steering vectors at larger scales is unknown.
## Related Work & Insights
- vs. CoT / fine-tuning: CoT prompting remains susceptible to bias, and fine-tuning is costly in compute and data. Activation steering provides a lightweight inference-time intervention without modifying model parameters.
- vs. CAST (Lee et al. 2025): CAST makes decisions based on aggregated conditional vectors; K-CAST uses kNN to retain local structure. Ablations show K-CAST has significant advantages on unresponsive models.
- vs. activation steering in safety: Prior work primarily targeted toxicity and harmful output control. This paper transfers the technique to reasoning bias mitigation, validating its cross-domain applicability.
## Rating
- Novelty: ⭐⭐⭐⭐ First use of activation steering for reasoning debiasing; K-CAST is an elegant design.
- Experimental Thoroughness: ⭐⭐⭐⭐ 8 models × 2 settings; covers probing, static steering, conditional steering, robustness, and OOD evaluation.
- Writing Quality: ⭐⭐⭐⭐ The logical chain from observation to intervention is coherent and complete.
- Value: ⭐⭐⭐⭐ Practically valuable for LLM reasoning reliability; K-CAST is generalizable.
## Additional Notes
- Content effects represent an important but underappreciated source of bias in LLM reasoning — benchmarks should control for content variables and evaluate purely structural reasoning ability.
- This work makes a pioneering methodological contribution to reasoning benchmark design; future benchmarks are encouraged to adopt analogous controlled-variable paradigms.