Enhancing Automated Interpretability with Output-Centric Feature Descriptions¶
Paper Information¶
- Conference: ACL 2025
- arXiv: 2501.08319
- Code: https://github.com/yoavgur/Feature-Descriptions
- Area: Interpretability / Automated Explanation Pipelines
- Keywords: Automated Interpretability, Feature Descriptions, Output-Centric, SAE Features, Model Steering
TL;DR¶
This paper proposes two output-centric feature description methods, VocabProj and TokenChange, to address a limitation of existing automated interpretability pipelines, which rely solely on input activation examples. An ensemble that combines the input and output perspectives achieves the best overall performance on both input-side and output-side evaluation.
Background & Motivation¶
- Background: Automated interpretability pipelines (e.g., Bills et al., 2023) use LLMs to describe the concepts encoded by model features (neurons, SAE directions, etc.). Most rely on MaxAct, which collects the top-activating input examples for a feature and passes them to an LLM to generate a description.
- Limitations of Prior Work: MaxAct has three major limitations: (1) High computational cost: it requires gathering activation data across large-scale corpora; (2) Causal incompleteness: it only describes 'what inputs activate the feature' while ignoring 'how feature activation influences the output'; (3) Dataset dependence: different datasets can yield inconsistent descriptions, or even cause meaningful features to be misclassified as 'dead features'.
- Key Insight: The mechanistic role of a feature is determined by both directions of causality: how inputs activate the feature (input \(\rightarrow\) feature) and how the feature's activation affects outputs (feature \(\rightarrow\) output). For downstream applications such as model steering, feature descriptions should also capture the output side.
- Core Idea: Two efficient output-centric methods are proposed, based on vocabulary projection and token probability changes respectively, which are complementary to MaxAct.
Method¶
Overall Architecture¶
A dual-sided input-output evaluation framework is proposed. The input side evaluates the accuracy of descriptions in characterizing activation triggering conditions, whereas the output side evaluates the ability of descriptions to capture the causal effects of the features. Under this framework, three methods and their ensembles are compared.
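As a rough illustration of how the two sides of this framework could be scored, the sketch below assumes a simple rule for each side: a description passes the input side if the texts it predicts to be activating fire the feature more on average than the non-activating ones, and it passes the output side if the judge picks the text set steered by the target feature. The record layout, field names, and activation values are hypothetical and are not taken from the paper.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def input_side_hit(acts_activating, acts_neutral):
    """True if the 'activating' texts fire the feature more on average."""
    return mean(acts_activating) > mean(acts_neutral)


def output_side_hit(judge_choice, target_index):
    """True if the judge picked the text set steered by the target feature."""
    return judge_choice == target_index


# Hypothetical per-feature records (illustrative values only):
# feature activations on generated texts, plus the judge's pick among 3 sets.
records = [
    {"act_pos": [3.1, 2.4, 0.0], "act_neg": [0.0, 0.2, 0.0], "judge": 0, "target": 0},
    {"act_pos": [0.0, 0.1, 0.0], "act_neg": [0.3, 0.0, 0.4], "judge": 2, "target": 1},
    {"act_pos": [1.2, 0.9, 1.5], "act_neg": [0.1, 0.0, 0.0], "judge": 1, "target": 1},
]

input_acc = mean([input_side_hit(r["act_pos"], r["act_neg"]) for r in records])
output_acc = mean([output_side_hit(r["judge"], r["target"]) for r in records])
print(f"input-side accuracy:  {input_acc:.2f}")
print(f"output-side accuracy: {output_acc:.2f}")
```

With three candidate text sets per feature, a judge guessing at random would be right about one time in three, which is a useful baseline when reading output-side scores.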
Key Designs¶
- VocabProj (Vocabulary Projection): The feature vector \(\mathbf{v}_f\) is projected into vocabulary space via the unembedding matrix \(W_U\): \(\mathbf{w} = W_U \cdot \text{LayerNorm}(\mathbf{v}_f)\). The tokens with the highest and lowest scores in \(\mathbf{w}\) are read off as the concepts the feature promotes and suppresses, respectively. This requires only a single matrix multiplication.
- TokenChange: The original model and the feature-steered model are each run on \(k\) random prompts. The average change in each token's logit is computed, and the tokens with the largest changes are taken as the description of the feature's effect on the output. This requires no more than two inference passes.
- Dual-Sided Evaluation Framework: On the input side, an LLM is prompted to generate activating/non-activating samples based on the description, and the average activations are compared. On the output side, model steering is used to generate three sets of text (target feature versus two random features), and a judge LLM determines which set matches the description.
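The two output-centric methods above can be sketched on a toy model. This is a minimal NumPy illustration, not the authors' implementation: the matrix `W_U`, the feature direction `v_f`, the steering strength `alpha`, and the random "hidden states" are all made-up stand-ins, LayerNorm is applied without learned parameters, and steering is applied directly before the unembedding rather than at the feature's own layer inside a real forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 50                   # toy sizes, not the real models'
W_U = rng.normal(size=(vocab_size, d_model))   # hypothetical unembedding matrix
v_f = rng.normal(size=d_model)                 # hypothetical feature direction


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/bias (a simplifying assumption)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def vocab_proj(v, W_U, top_k=5):
    """Score every token as w = W_U @ LayerNorm(v); return the extremes."""
    w = W_U @ layer_norm(v)
    order = np.argsort(w)                      # ascending by score
    promoted = order[::-1][:top_k]             # highest-scoring token ids
    suppressed = order[:top_k]                 # lowest-scoring token ids
    return promoted, suppressed, w


def token_change(hidden, v, W_U, alpha=4.0, top_k=5):
    """Mean per-token logit change across prompts when adding alpha * v."""
    base = layer_norm(hidden) @ W_U.T                  # unsteered logits
    steered = layer_norm(hidden + alpha * v) @ W_U.T   # steered logits
    delta = (steered - base).mean(axis=0)              # average over prompts
    return np.argsort(delta)[::-1][:top_k], delta


promoted, suppressed, w = vocab_proj(v_f, W_U)
hidden = rng.normal(size=(8, d_model))   # stand-in for states of k = 8 prompts
changed, delta = token_change(hidden, v_f, W_U)
print("promoted token ids:  ", promoted)
print("suppressed token ids:", suppressed)
print("largest mean logit increases:", changed)
```

In a real pipeline the token ids would be decoded with the model's tokenizer and the resulting token lists passed to the explainer LLM; the toy version only shows where the cost difference comes from, since `vocab_proj` needs no forward pass at all.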
Integration Strategy¶
- Ensemble Raw: Concatenates raw data from multiple methods (activation examples, top tokens, etc.) and feeds them into the explainer LLM to generate a unified description.
- Ensemble Concat: Simply concatenates the description texts generated by each method.
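The difference between the two strategies is where the merging happens: before the explainer call (Ensemble Raw) or after it (Ensemble Concat). The sketch below makes that concrete under stated assumptions: `fake_explainer` is a placeholder for an LLM call, and the prompt wording, function names, and sample evidence are invented for illustration.

```python
from typing import Callable, Dict, List


def ensemble_concat(descriptions: Dict[str, str]) -> str:
    """Ensemble Concat: join the finished per-method descriptions."""
    return " ".join(descriptions.values())


def ensemble_raw(raw: Dict[str, List[str]], explainer: Callable[[str], str]) -> str:
    """Ensemble Raw: pool every method's raw evidence, then explain once."""
    sections = [
        f"{method} evidence:\n" + "\n".join(items) for method, items in raw.items()
    ]
    prompt = "Describe the feature given the data below.\n\n" + "\n\n".join(sections)
    return explainer(prompt)


def fake_explainer(prompt: str) -> str:
    """Placeholder for an LLM call; only reports how many sources it saw."""
    return f"description from {prompt.count('evidence:')} evidence sources"


# Hypothetical raw evidence and per-method descriptions for one feature.
raw = {
    "MaxAct": ["...the Eiffel Tower in Paris...", "...a trip to Lyon..."],
    "VocabProj": ["France", "Paris", "French"],
    "TokenChange": ["French", "Marseille"],
}
descriptions = {
    "MaxAct": "Fires on mentions of French cities.",
    "VocabProj": "Promotes France-related tokens.",
    "TokenChange": "Increases the probability of French place names.",
}

print(ensemble_raw(raw, fake_explainer))
print(ensemble_concat(descriptions))
```

Ensemble Raw costs one explainer call over a longer prompt, while Ensemble Concat reuses descriptions that already exist and needs no extra call.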
Experiments¶
Main Results: Input-Output Evaluation on Different Models/Feature Types (%, Higher is Better)¶
| Method | Gemma-2 Res. SAE (Input/Output) | Gemma-2 MLP SAE (Input/Output) | Llama-3.1 Inst. MLP (Input/Output) |
|---|---|---|---|
| MaxAct | 56.6 / 49.2 | 50.4 / 35.1 | 85.6 / 36.9 |
| VocabProj | 50.1 / 56.5 | 20.9 / 37.2 | 71.2 / 45.8 |
| TokenChange | 44.7 / 54.9 | 22.3 / 40.3 | 74.0 / 43.8 |
| EnsembleR (All) | 66.6 / 64.9 | 55.7 / 48.7 | 86.2 / 41.8 |
| EnsembleC (All) | 57.7 / 66.9 | 31.6 / 49.9 | 84.9 / 44.6 |
Ablation Study: Manual Classification of Feature-Description Relationships (100 Gemma Scope SAE Features)¶
| Relationship Type | Proportion | Explanation |
|---|---|---|
| Similar | 41% | Input and output descriptions are highly consistent |
| Composition | 23% | Describe different aspects; combining them is more comprehensive |
| Abstraction | 23% | The output description is a higher-level abstraction of the input description |
| Different | 13% | Describe different aspects with no clear correlation |
Key Findings¶
- Input and Output Perspectives are Complementary: In the main results table, MaxAct leads the output-centric methods on input evaluation by roughly 6 to 30 points, while VocabProj and TokenChange lead MaxAct on output evaluation by roughly 2 to 9 points, indicating that the two types of methods capture different aspects of feature information.
- Ensembles Are Usually Best: Ensemble Raw has the highest input score in every setting, and Ensemble Concat has the highest output score on both Gemma-2 SAE settings. The one exception in the table is output evaluation on Llama-3.1 MLP features, where VocabProj alone (45.8) edges out both ensembles (41.8 and 44.6).
- Dead Features Can Be Revived: For 1,850 "dead features" in Gemma-2, probe inputs generated based on VocabProj and TokenChange descriptions successfully activated 9.1% of MLP features and 62% of residual features.
- Layer-wise Effect: VocabProj performs poorly in early layers but improves layer-by-layer, which is consistent with existing observations of the "logit lens".
- MLP vs Residual: Output evaluation scores on MLP features are substantially lower than on residual features (for the Gemma-2 ensembles, about 49-50 vs. 65-67), possibly because the influence of MLP layers on the residual stream is incremental.
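The comparisons behind the first two findings can be recomputed directly from the main results table. The snippet below copies the table's (input, output) scores and reports, for each setting and side, the best single method, the best ensemble, and the gap between them; the helper name `best` and the dictionary layout are just conveniences for this sketch.

```python
# (input, output) scores in %, copied from the main results table.
scores = {
    "Gemma-2 Res. SAE": {
        "MaxAct": (56.6, 49.2), "VocabProj": (50.1, 56.5),
        "TokenChange": (44.7, 54.9),
        "EnsembleR": (66.6, 64.9), "EnsembleC": (57.7, 66.9),
    },
    "Gemma-2 MLP SAE": {
        "MaxAct": (50.4, 35.1), "VocabProj": (20.9, 37.2),
        "TokenChange": (22.3, 40.3),
        "EnsembleR": (55.7, 48.7), "EnsembleC": (31.6, 49.9),
    },
    "Llama-3.1 Inst. MLP": {
        "MaxAct": (85.6, 36.9), "VocabProj": (71.2, 45.8),
        "TokenChange": (74.0, 43.8),
        "EnsembleR": (86.2, 41.8), "EnsembleC": (84.9, 44.6),
    },
}
SINGLE = ("MaxAct", "VocabProj", "TokenChange")
ENSEMBLE = ("EnsembleR", "EnsembleC")


def best(setting, methods, side):
    """Return (name, score) of the top method for one setting and side."""
    idx = 0 if side == "input" else 1
    name = max(methods, key=lambda m: scores[setting][m][idx])
    return name, scores[setting][name][idx]


for setting in scores:
    for side in ("input", "output"):
        s_name, s_val = best(setting, SINGLE, side)
        e_name, e_val = best(setting, ENSEMBLE, side)
        print(
            f"{setting:20s} {side:6s} "
            f"best single: {s_name} ({s_val:.1f}) | "
            f"best ensemble: {e_name} ({e_val:.1f}) | "
            f"gap: {e_val - s_val:+.1f}"
        )
```

A positive gap means the best ensemble beats the best single method for that setting and side; a negative gap marks a case where a single method stays ahead.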
Highlights & Insights¶
- Introduces an "input-output duality" framing for feature descriptions, addressing the gap left by prior work that focused solely on the input side.
- VocabProj requires only a single matrix multiplication, which is substantially more computationally efficient than MaxAct (which requires large-scale corpus scanning).
- The dead feature revival experiments (where 62% of residual features were successfully activated) directly demonstrate the unique value of output-centric approaches.
- Versatility: The method supports multiple feature types, including SAE features, MLP neurons, and residual streams.
Limitations & Future Work¶
- Output-side evaluation contains notable noise; although alleviated by extensive sampling, there remains room for improvement.
- Output-centric methods depend on the model vocabulary and cannot easily describe non-vocabulary concepts (such as positional features).
- The direction of "promotion" vs. "suppression" of concepts by features is not explicitly distinguished.
- Comparison with other interpretability methods, such as activation patching-based approaches, is not explored.
- The ensemble methods are sensitive to prompt design.
Related Work & Insights¶
- Automated Interpretability: Bills et al. (2023), GPT-4 explaining GPT-2 neurons; Bricken et al. (2023), SAE feature explanation.
- Feature Description Improvements: Paulo et al. (2024), optimizing explainer prompts; Choi et al. (2024), Transluce description selection.
- Internal Model Representations: Geva et al. (2021, 2022), MLP layers as key-value memories; logit-lens-style analyses.
- Model Steering: Templeton et al. (2024), steering by feature clamping; Rimsky et al. (2024), steering for behavioral control.
- SAE Training: Gemma Scope (Lieberum et al., 2024); Llama Scope (He et al., 2024).
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall Rating | 8/10 |
Summary: The core contribution of this work is extending feature description from an input-only perspective to a dual-sided input-output problem. The proposed VocabProj method is very cheap to compute and performs well on output-side evaluation. The dead feature revival experiments are particularly striking and directly demonstrate the value of output-centric approaches. The study offers useful methodological insights for the interpretability research community.