ActivationReasoning: Logical Reasoning in Latent Activation Spaces¶
Conference: ICLR 2026 · arXiv: 2510.18184 · Code: https://github.com/ml-research/ActivationReasoning · Area: LLM Interpretability / Reasoning · Keywords: Sparse Autoencoders, Logical Reasoning, Latent Space Intervention, Concept Composition, Model Steering
TL;DR¶
This paper proposes the ActivationReasoning (AR) framework, which embeds explicit logical reasoning into the latent activation space of LLMs (via SAE-extracted features) through a three-stage pipeline: discovering concept representations → detecting activated propositions → reasoning with logical rules. The framework supports multi-hop reasoning, concept composition, and safety control, achieving 95%+ accuracy on PrOntoQA with an 8B model, surpassing GPT-4o.
Background & Motivation¶
Background: SAEs have made the hidden activations of LLMs more interpretable, exposing latent features aligned with human-understandable concepts. Reasoning-oriented LLMs (e.g., o1, R1) improve performance by extending reasoning chains, but their reasoning processes remain opaque.
Limitations of Prior Work: SAE features are passive and brittle; they may be polysemantic, unstable across contexts, or overly low-level. A critical deficiency is that SAEs lack mechanisms for compositional and higher-order reasoning. They cannot derive "Golden Gate Bridge" from "bridge" + "San Francisco" + "USA."
Key Challenge: Logical reasoning requires discrete propositional units and compositional rules, whereas LLMs rely on continuous, entangled representations. While SAEs provide approximately discrete features, they lack a formal framework for reasoning.
Goal: To embed explicit logical reasoning capabilities into the latent space of LLMs, enabling interpretable and controllable structured reasoning.
Key Insight: Treating SAE features as logical propositions, defining and applying logical rules (conjunction, disjunction, implication, negation) over them, and deriving new higher-order propositions via forward-chaining inference.
Core Idea: Treat SAE features as propositions, use user-defined logical rules as an inference engine, perform forward-chaining reasoning in the activation space, and steer LLM generation via activation guidance.
Method¶
Overall Architecture¶
A three-stage pipeline: (1) identify concept representations in the SAE space and construct a concept dictionary \(\mathcal{D}\); (2) at inference time, detect token-level activations and map them to logical propositions forming an activation matrix \(A\); (3) apply logical rules to \(A\) to derive new propositions, yielding an augmented matrix \(A'\) used for downstream analysis and LLM steering.
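To make the data flow concrete, here is a minimal sketch of stages (2) and (3), assuming single-feature concepts, global max-aggregation over tokens, and rules given as premise-conclusion pairs. The function names, the dictionary-based activation matrix, and the min-strength scoring of derived propositions are illustrative choices, not the paper's implementation.

```python
import numpy as np

def propositionalize(sae_acts: np.ndarray, concepts: dict[str, int],
                     thresholds: dict[str, float]) -> dict[str, float]:
    """Stage 2: map SAE activations to logical propositions.

    sae_acts:   (n_features, n_tokens) SAE feature activations a_{c,t}
    concepts:   {concept_name: SAE feature index} (single-feature case)
    Implements A_global[c] = max(max_t a_{c,t} - tau_c, 0).
    """
    return {name: max(float(sae_acts[idx].max()) - thresholds[name], 0.0)
            for name, idx in concepts.items()}

def forward_chain(A: dict[str, float],
                  rules: list[tuple[list[str], str]]) -> dict[str, float]:
    """Stage 3: apply rules (premises -> conclusion) until a fixed point.

    A rule fires when every premise has positive activation; the derived
    proposition inherits the weakest premise's strength (one possible
    choice for scoring derived concepts).
    """
    A_prime = dict(A)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if (all(A_prime.get(p, 0.0) > 0.0 for p in premises)
                    and A_prime.get(conclusion, 0.0) == 0.0):
                A_prime[conclusion] = min(A_prime[p] for p in premises)
                changed = True
    return A_prime

# The paper's running example: Bridge ∧ San Francisco ∧ USA → Golden Gate Bridge
rules = [(["bridge", "San Francisco", "USA"], "Golden Gate Bridge")]
A = {"bridge": 0.8, "San Francisco": 0.5, "USA": 0.9}
print(forward_chain(A, rules))  # adds "Golden Gate Bridge": 0.5
```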
Key Designs¶
- Three Forms of Concept Representation:
    - Function: Organizes SAE features into latent representations of concepts.
    - Mechanism: Single-feature \(\mathcal{R}_{single}\) (one SAE feature = one concept), multi-feature \(\mathcal{R}_{multi}\) (weighted aggregation of multiple SAE features), and relational \(\mathcal{R}_{relation}\) (a decision tree modeling structured interactions among features). Automatic extraction selects \(r_c = \arg\max_i \big(\mathbb{E}[l_{t,i} \mid y=1] - \mathbb{E}[l_{t,i} \mid y=0]\big)\), i.e., the SAE feature with the largest mean activation gap between positive and negative examples (a minimal extraction sketch appears under Loss & Training below).
    - Design Motivation: The single-feature assumption often fails due to polysemy; multi-feature representations cannot model interactions (e.g., "hate" requires co-activation of defamation and stereotyping while excluding educational contexts); relational features use decision trees to balance expressiveness and interpretability.
- Activation Propositionalization and Logical Reasoning:
    - Function: Converts token-level activations into logical propositions and performs forward-chaining inference.
    - Mechanism: Activation matrices \(A_{local}[c,t] = \max(a_{c,t} - \tau_c, 0)\) and \(A_{global}[c] = \max(\text{Agg}_{t}\, a_{c,t} - \tau_c, 0)\). Users define logical rules such as "Bridge ∧ San Francisco ∧ USA → Golden Gate Bridge," and forward chaining applies them until a fixed point is reached (illustrated in the pipeline sketch above).
    - Design Motivation: The SAE feature space may lack a direct feature for "Golden Gate Bridge," but possesses features for "bridge," "San Francisco," and "USA" — logical composition fills the expressiveness gap of SAEs.
- Activation Steering for Control:
    - Function: Uses concepts activated in the derived matrix \(A'\) to guide LLM generation.
    - Mechanism: \(h' = h + \alpha \cdot \frac{(\mathrm{SAE}_D[r_c] \cdot w)\,\|h\|_2}{\|\mathrm{SAE}_D[r_c]\|_2}\), which rescales the concept direction to the norm of the hidden state \(h\) and adds it with strength \(\alpha\), promoting (\(\alpha > 0\)) or suppressing (\(\alpha < 0\)) specific concepts (see the sketch after this list).
    - Design Motivation: Pure analysis already provides significant value, but the control capability elevates AR from an interpretability tool to an alignment tool, enabling enforcement of safety constraints at inference time.
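A compact sketch of the steering update, assuming \(h\) is the hidden state at a single token position and the SAE decoder row \(\mathrm{SAE}_D[r_c]\) serves as the concept direction; the function name, the scalar weight `w`, and the hook-based usage are assumptions for illustration.

```python
import torch

def steer(h: torch.Tensor, concept_dir: torch.Tensor,
          alpha: float, w: float = 1.0) -> torch.Tensor:
    """h' = h + alpha * (concept_dir * w) * ||h||_2 / ||concept_dir||_2

    h:           (d_model,) hidden state at one token position
    concept_dir: (d_model,) SAE decoder direction for concept r_c
    alpha > 0 promotes the concept, alpha < 0 suppresses it; rescaling by
    ||h||_2 keeps the intervention strength comparable across layers.
    """
    return h + alpha * (concept_dir * w) * h.norm(p=2) / concept_dir.norm(p=2)

# Hypothetical usage inside a forward hook at the SAE's layer:
# h = steer(h, sae_decoder[r_c], alpha=-4.0)  # suppress an unsafe concept
```

In practice this would run during generation for every concept that rules in \(A'\) mark for promotion or suppression.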
Loss & Training¶
AR requires no LLM training. Concept extraction employs simple statistical methods (mean difference, decision trees). Rules are user-defined. No additional training cost is incurred at inference time.
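For the single-feature case, extraction reduces to a mean-difference scan over SAE features; a minimal sketch assuming token-level binary labels (names are illustrative):

```python
import numpy as np

def extract_single_feature(latents: np.ndarray, y: np.ndarray) -> int:
    """r_c = argmax_i ( E[l_{t,i} | y=1] - E[l_{t,i} | y=0] )

    latents: (n_tokens, n_features) SAE activations l_t
    y:       (n_tokens,) binary labels (1 where the concept is present)
    Returns the SAE feature with the largest mean activation gap
    between positive and negative tokens.
    """
    diff = latents[y == 1].mean(axis=0) - latents[y == 0].mean(axis=0)
    return int(diff.argmax())
```

The relational variant \(\mathcal{R}_{relation}\) would instead fit a shallow decision tree (e.g., sklearn's DecisionTreeClassifier) on the same labeled activations.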
Key Experimental Results¶
Main Results¶
PrOntoQA Multi-Hop Reasoning (Accuracy% ↑):
| Model | 1-hop | 3-hop | 5-hop |
|---|---|---|---|
| Llama-3.1-8B | 51.0 | 50.8 | 50.3 |
| + AR | 95.0 | 95.6 | 95.3 |
| Gemma-2-9B | 48.5 | 47.5 | 47.9 |
| + AR | 93.5 | 93.5 | 93.5 |
| GPT-4o | 95.5 | 88.0 | 79.5 |
| DeepSeek-R1-8B | 86.0 | 79.5 | 67.5 |
Rail2Country Meta-Concept Generalization:
| Model | Explicit Concepts | Meta-Concepts (Figurative) |
|---|---|---|
| Llama-3.1-8B | 41.0 | 29.7 |
| + AR | 74.7 | 62.7 |
Ablation Study¶
| Concept Representation Type | BeaverTails Safety Detection F1 |
|---|---|
| \(\mathcal{R}_{single}\) | Lowest |
| \(\mathcal{R}_{multi}\) | Moderate |
| \(\mathcal{R}_{relation}\) | Highest |
Key Findings¶
- AR enables 8B models to surpass GPT-4o and DeepSeek-R1 on multi-hop reasoning: 8B+AR (95.3%) vs. GPT-4o (79.5%) at 5-hop.
- Critically, AR's performance does not degrade with reasoning depth, whereas all baseline models (including GPT-4o) exhibit significant accuracy drops as hop count increases.
- Meta-concept generalization (e.g., "a color like a tomato" → "red") validates AR's capability beyond literal matching.
- On the BeaverTails safety task, \(\mathcal{R}_{relation}\) outperforms \(\mathcal{R}_{single}\) and \(\mathcal{R}_{multi}\), indicating that safety concepts require structured feature interactions.
Highlights & Insights¶
- SAE Features as a Bridge to Logical Propositions: This represents the most natural connection between the continuous representations of neural networks and the discrete propositions of symbolic reasoning — SAE features are themselves designed to be approximately monosemantic, making them naturally suited as propositions.
- 8B Surpassing GPT-4o in Reasoning: This is achieved not through better training but by layering logical reasoning atop existing representations — the model already "knows" the answer but lacks the capacity for compositional inference.
- Modularity and Auditability: The entire reasoning chain is transparent — where concepts originate, how rules are applied, and how conclusions are reached are all inspectable and modifiable at every step.
- Cross-Model Transferability: The same framework is effective across Llama and Gemma, suggesting that the propositionalization of SAE features is model-agnostic.
Limitations & Future Work¶
- Logical rules must be manually defined by users; automatic rule discovery is an important direction for future work.
- Concept extraction relies on token-level labeled data, and cross-domain generalization may require new annotations.
- The current framework supports only propositional logic; extension to first-order logic (with quantifiers and variables) remains unexplored.
- SAE feature quality directly affects AR performance — if features are insufficiently monosemantic, reasoning may be unreliable.
- The computational overhead of rule application scales with the number of concepts and rules.
Related Work & Insights¶
- vs. Reasoning LLMs (o1, R1): Reasoning LLMs improve through chain-of-thought but remain opaque; AR performs reasoning in the activation space, where each step is auditable.
- vs. Neuro-Symbolic Methods (DeepProbLog): Traditional neuro-symbolic approaches require end-to-end differentiable training; AR applies rules directly at inference time without model training.
- vs. SAE Analysis (Anthropic): SAEs are typically used for passive analysis and feature visualization; AR actively leverages SAE features for reasoning and control.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The idea of embedding logical reasoning into the latent space of LLMs is both natural and powerful, representing a significant extension of SAE applications.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four complementary tasks (multi-hop reasoning / meta-concepts / natural language inference / safety) with dual-model validation.
- Writing Quality: ⭐⭐⭐⭐⭐ — The narrative flows smoothly from motivation to method to experiments; the Golden Gate Bridge running example is woven throughout the paper.
- Value: ⭐⭐⭐⭐⭐ — Provides a novel paradigm for interpretable reasoning and controllable alignment in LLMs.