ActivationReasoning: Logical Reasoning in Latent Activation Spaces¶
Conference: ICLR 2026 · arXiv: 2510.18184 · Code: https://github.com/ml-research/ActivationReasoning · Area: LLM Interpretability / Reasoning · Keywords: Sparse Autoencoders, Logical Reasoning, Latent Space Intervention, Concept Composition, Model Steering
TL;DR¶
This paper proposes the ActivationReasoning (AR) framework, which embeds explicit logical reasoning into the latent activation space of LLMs (via SAE-extracted features) through a three-stage pipeline: discovering concept representations → detecting activated propositions → reasoning with logical rules. The framework supports multi-hop reasoning, concept composition, and safety control, achieving 95%+ accuracy on PrOntoQA with an 8B model, surpassing GPT-4o.
Background & Motivation¶
Background: SAEs have made the hidden activations of LLMs more interpretable, exposing latent features aligned with human-understandable concepts. Reasoning-oriented LLMs (e.g., o1, R1) improve performance by extending reasoning chains, but their reasoning processes remain opaque.
Limitations of Prior Work: SAE features are passive and brittle; they may be polysemantic, unstable across contexts, or overly low-level. A critical deficiency is that SAEs lack mechanisms for compositional and higher-order reasoning. They cannot derive "Golden Gate Bridge" from "bridge" + "San Francisco" + "USA."
Key Challenge: Logical reasoning requires discrete propositional units and compositional rules, whereas LLMs rely on continuous, entangled representations. While SAEs provide approximately discrete features, they lack a formal framework for reasoning.
Goal: To embed explicit logical reasoning capabilities into the latent space of LLMs, enabling interpretable and controllable structured reasoning.
Key Insight: Treating SAE features as logical propositions, defining and applying logical rules (conjunction, disjunction, implication, negation) over them, and deriving new higher-order propositions via forward-chaining inference.
Core Idea: Treat SAE features as propositions, use user-defined logical rules as an inference engine, perform forward-chaining reasoning in the activation space, and steer LLM generation via activation guidance.
Method¶
Overall Architecture¶
A three-stage pipeline: (1) identify concept representations in the SAE space and construct a concept dictionary \(\mathcal{D}\); (2) at inference time, detect token-level activations and map them to logical propositions forming an activation matrix \(A\); (3) apply logical rules to \(A\) to derive new propositions, yielding an augmented matrix \(A'\) used for downstream analysis and LLM steering.
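To make the data flow concrete, here is a minimal sketch of stages (2) and (3), assuming single-feature concepts, global max-aggregation over tokens, and rules given as premise-conclusion pairs. The function names, the dictionary-based activation matrix, and the min-strength scoring of derived propositions are illustrative choices, not the paper's implementation.

```python
import numpy as np

def propositionalize(sae_acts: np.ndarray, concepts: dict[str, int],
                     thresholds: dict[str, float]) -> dict[str, float]:
    """Stage 2: map SAE activations to logical propositions.

    sae_acts:   (n_features, n_tokens) SAE feature activations a_{c,t}
    concepts:   {concept_name: SAE feature index} (single-feature case)
    Implements A_global[c] = max(max_t a_{c,t} - tau_c, 0).
    """
    return {name: max(float(sae_acts[idx].max()) - thresholds[name], 0.0)
            for name, idx in concepts.items()}

def forward_chain(A: dict[str, float],
                  rules: list[tuple[list[str], str]]) -> dict[str, float]:
    """Stage 3: apply rules (premises -> conclusion) until a fixed point.

    A rule fires when every premise has positive activation; the derived
    proposition inherits the weakest premise's strength (one possible
    choice for scoring derived concepts).
    """
    A_prime = dict(A)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if (all(A_prime.get(p, 0.0) > 0.0 for p in premises)
                    and A_prime.get(conclusion, 0.0) == 0.0):
                A_prime[conclusion] = min(A_prime[p] for p in premises)
                changed = True
    return A_prime

# The paper's running example: Bridge ∧ San Francisco ∧ USA → Golden Gate Bridge
rules = [(["bridge", "San Francisco", "USA"], "Golden Gate Bridge")]
A = {"bridge": 0.8, "San Francisco": 0.5, "USA": 0.9}
print(forward_chain(A, rules))  # adds "Golden Gate Bridge": 0.5
```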
Key Designs¶
- Three Forms of Concept Representation:
    - Function: Organizes SAE features into latent representations of concepts.
    - Mechanism: Single-feature \(\mathcal{R}_{single}\) (one SAE feature = one concept), multi-feature \(\mathcal{R}_{multi}\) (weighted aggregation of multiple SAE features), and relational \(\mathcal{R}_{relation}\) (a decision tree modeling structured interactions among features). Automatic extraction selects \(r_c = \arg\max_i \big(\mathbb{E}[l_{t,i} \mid y=1] - \mathbb{E}[l_{t,i} \mid y=0]\big)\), i.e., the SAE feature with the largest mean activation gap between positive and negative examples (a minimal extraction sketch appears under Loss & Training below).
    - Design Motivation: The single-feature assumption often fails due to polysemy; multi-feature representations cannot model interactions (e.g., "hate" requires co-activation of defamation and stereotyping while excluding educational contexts); relational features use decision trees to balance expressiveness and interpretability.
- Activation Propositionalization and Logical Reasoning:
    - Function: Converts token-level activations into logical propositions and performs forward-chaining inference.
    - Mechanism: Activation matrices \(A_{local}[c,t] = \max(a_{c,t} - \tau_c, 0)\) and \(A_{global}[c] = \max(\text{Agg}_{t}\, a_{c,t} - \tau_c, 0)\). Users define logical rules such as "Bridge ∧ San Francisco ∧ USA → Golden Gate Bridge," and forward chaining applies them until a fixed point is reached (illustrated in the pipeline sketch above).
    - Design Motivation: The SAE feature space may lack a direct feature for "Golden Gate Bridge," but possesses features for "bridge," "San Francisco," and "USA" — logical composition fills the expressiveness gap of SAEs.
- Activation Steering for Control:
    - Function: Uses concepts activated in the derived matrix \(A'\) to guide LLM generation.
    - Mechanism: \(h' = h + \alpha \cdot \frac{(\mathrm{SAE}_D[r_c] \cdot w)\,\|h\|_2}{\|\mathrm{SAE}_D[r_c]\|_2}\), which rescales the concept direction to the norm of the hidden state \(h\) and adds it with strength \(\alpha\), promoting (\(\alpha > 0\)) or suppressing (\(\alpha < 0\)) specific concepts (see the sketch after this list).
    - Design Motivation: Pure analysis already provides significant value, but the control capability elevates AR from an interpretability tool to an alignment tool, enabling enforcement of safety constraints at inference time.
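A compact sketch of the steering update, assuming \(h\) is the hidden state at a single token position and the SAE decoder row \(\mathrm{SAE}_D[r_c]\) serves as the concept direction; the function name, the scalar weight `w`, and the hook-based usage are assumptions for illustration.

```python
import torch

def steer(h: torch.Tensor, concept_dir: torch.Tensor,
          alpha: float, w: float = 1.0) -> torch.Tensor:
    """h' = h + alpha * (concept_dir * w) * ||h||_2 / ||concept_dir||_2

    h:           (d_model,) hidden state at one token position
    concept_dir: (d_model,) SAE decoder direction for concept r_c
    alpha > 0 promotes the concept, alpha < 0 suppresses it; rescaling by
    ||h||_2 keeps the intervention strength comparable across layers.
    """
    return h + alpha * (concept_dir * w) * h.norm(p=2) / concept_dir.norm(p=2)

# Hypothetical usage inside a forward hook at the SAE's layer:
# h = steer(h, sae_decoder[r_c], alpha=-4.0)  # suppress an unsafe concept
```

In practice this would run during generation for every concept that rules in \(A'\) mark for promotion or suppression.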
Loss & Training¶
AR requires no LLM training. Concept extraction employs simple statistical methods (mean difference, decision trees). Rules are user-defined. No additional training cost is incurred at inference time.
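For the single-feature case, extraction reduces to a mean-difference scan over SAE features; a minimal sketch assuming token-level binary labels (names are illustrative):

```python
import numpy as np

def extract_single_feature(latents: np.ndarray, y: np.ndarray) -> int:
    """r_c = argmax_i ( E[l_{t,i} | y=1] - E[l_{t,i} | y=0] )

    latents: (n_tokens, n_features) SAE activations l_t
    y:       (n_tokens,) binary labels (1 where the concept is present)
    Returns the SAE feature with the largest mean activation gap
    between positive and negative tokens.
    """
    diff = latents[y == 1].mean(axis=0) - latents[y == 0].mean(axis=0)
    return int(diff.argmax())
```

The relational variant \(\mathcal{R}_{relation}\) would instead fit a shallow decision tree (e.g., sklearn's DecisionTreeClassifier) on the same labeled activations.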
Key Experimental Results¶
Main Results¶
PrOntoQA Multi-Hop Reasoning (Accuracy% ↑):
| Model | 1-hop | 3-hop | 5-hop |
|---|---|---|---|
| Llama-3.1-8B | 51.0 | 50.8 | 50.3 |
| + AR | 95.0 | 95.6 | 95.3 |
| Gemma-2-9B | 48.5 | 47.5 | 47.9 |
| + AR | 93.5 | 93.5 | 93.5 |
| GPT-4o | 95.5 | 88.0 | 79.5 |
| DeepSeek-R1-8B | 86.0 | 79.5 | 67.5 |
Rail2Country Meta-Concept Generalization:
| Model | Explicit Concepts | Meta-Concepts (Figurative) |
|---|---|---|
| Llama-3.1-8B | 41.0 | 29.7 |
| + AR | 74.7 | 62.7 |
Ablation Study¶
| Concept Representation Type | BeaverTails Safety Detection F1 |
|---|---|
| \(\mathcal{R}_{single}\) | Lowest |
| \(\mathcal{R}_{multi}\) | Moderate |
| \(\mathcal{R}_{relation}\) | Highest |
Key Findings¶
- AR enables 8B models to surpass GPT-4o and DeepSeek-R1 on multi-hop reasoning: 8B+AR (95.3%) vs. GPT-4o (79.5%) at 5-hop.
- Critically, AR's performance does not degrade with reasoning depth, whereas all baseline models (including GPT-4o) exhibit significant accuracy drops as hop count increases.
- Meta-concept generalization (e.g., "a color like a tomato" → "red") validates AR's capability beyond literal matching.
- On the BeaverTails safety task, \(\mathcal{R}_{relation}\) outperforms \(\mathcal{R}_{single}\) and \(\mathcal{R}_{multi}\), indicating that safety concepts require structured feature interactions.
Highlights & Insights¶
- SAE Features as a Bridge to Logical Propositions: This represents the most natural connection between the continuous representations of neural networks and the discrete propositions of symbolic reasoning — SAE features are themselves designed to be approximately monosemantic, making them naturally suited as propositions.
- 8B Surpassing GPT-4o in Reasoning: This is achieved not through better training but by layering logical reasoning atop existing representations — the model already "knows" the answer but lacks the capacity for compositional inference.
- Modularity and Auditability: The entire reasoning chain is transparent — where concepts originate, how rules are applied, and how conclusions are reached are all inspectable and modifiable at every step.
- Cross-Model Transferability: The same framework is effective across Llama and Gemma, suggesting that the propositionalization of SAE features is model-agnostic.
Limitations & Future Work¶
- Logical rules must be manually defined by users; automatic rule discovery is an important direction for future work.
- Concept extraction relies on token-level labeled data, and cross-domain generalization may require new annotations.
- The current framework supports only propositional logic; extension to first-order logic (with quantifiers and variables) remains unexplored.
- SAE feature quality directly affects AR performance — if features are insufficiently monosemantic, reasoning may be unreliable.
- The computational overhead of rule application scales with the number of concepts and rules.
Related Work & Insights¶
- vs. Reasoning LLMs (o1, R1): Reasoning LLMs improve through chain-of-thought but remain opaque; AR performs reasoning in the activation space, where each step is auditable.
- vs. Neuro-Symbolic Methods (DeepProbLog): Traditional neuro-symbolic approaches require end-to-end differentiable training; AR applies rules directly at inference time without model training.
- vs. SAE Analysis (Anthropic): SAEs are typically used for passive analysis and feature visualization; AR actively leverages SAE features for reasoning and control.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The idea of embedding logical reasoning into the latent space of LLMs is both natural and powerful, representing a significant extension of SAE applications.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four complementary tasks (multi-hop reasoning / meta-concepts / natural language inference / safety) with dual-model validation.
- Writing Quality: ⭐⭐⭐⭐⭐ — The narrative flows smoothly from motivation to method to experiments; the Golden Gate Bridge running example is woven throughout the paper.
- Value: ⭐⭐⭐⭐⭐ — Provides a novel paradigm for interpretable reasoning and controllable alignment in LLMs.