Position-aware Automatic Circuit Discovery

Conference: ACL 2025
arXiv: 2502.04577
Code: https://github.com/technion-cs-nlp/PEAP
Area: Interpretability
Keywords: Circuit Discovery, Position-awareness, Mechanistic Interpretability, Attribution Patching, Schema

TL;DR

Proposes Position-aware Edge Attribution Patching (PEAP) and a dataset Schema mechanism. Together they address two problems that arise when automatic circuit discovery ignores position information, the cancellation effect and the overestimation of importance, and they yield smaller, more faithful circuits.

Background & Motivation

Circuit analysis is a core method for understanding the internal mechanisms of language models, revealing how a model works by finding the minimal computational subgraph that executes a specific task. However, existing automatic circuit discovery methods (such as EAP) suffer from a key blind spot: ignoring position information.

Specifically, position-independent methods suffer from two major issues:

Cancellation Effect (Low Recall): A component's importance score can have opposite signs at different positions. When scores are summed across positions, positive and negative values cancel, so edges that are actually important are missed. On the IOI task, the difference in the top-1% of ranked edges reaches 17.1%, against 3.9% for the control.

Importance Overestimation (Low Precision): Position-agnostic methods tend to select edges that have a small impact at many positions while overlooking edges that have a large impact at only a few positions. Here the difference in the top-1% reaches 17.5%, against 3.6% for the control.
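
Both issues can be seen in a toy example. The sketch below is illustrative only: the edge names and scores are made up, and it assumes edges are ranked by the absolute value of their score, which the summary does not state explicitly.

```python
# Toy attribution scores: one list per edge, one entry per token position.
# All numbers are invented for illustration.
scores = {
    "edge_A": [0.9, -0.8, 0.0, 0.0],  # strong but opposite-signed: cancels when summed
    "edge_B": [0.1, 0.1, 0.1, 0.1],   # weak everywhere: accumulates when summed
    "edge_C": [0.0, 0.0, 0.35, 0.0],  # strong at a single position
}


def rank_position_agnostic(scores):
    """Rank edges by |sum over positions|, ignoring where the effect occurs."""
    totals = {edge: abs(sum(per_pos)) for edge, per_pos in scores.items()}
    return sorted(totals, key=totals.get, reverse=True)


def rank_position_aware(scores):
    """Rank every (edge, position) pair on its own by |score|."""
    pairs = {
        (edge, pos): abs(value)
        for edge, per_pos in scores.items()
        for pos, value in enumerate(per_pos)
    }
    return sorted(pairs, key=pairs.get, reverse=True)


agnostic = rank_position_agnostic(scores)
aware = rank_position_aware(scores)

print(agnostic)   # ['edge_B', 'edge_C', 'edge_A']
print(aware[:3])  # [('edge_A', 0), ('edge_A', 1), ('edge_C', 2)]
```

Summing pushes `edge_A` to the bottom (cancellation) and `edge_B` to the top (overestimation), while the per-position ranking puts both positions of `edge_A` first.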

Manual circuit discovery (e.g., the IOI circuit, the Greater-Than circuit) can distinguish positions, but it does not scale and is prone to human bias. The goal of this work is to keep position sensitivity while automating the process.

Method

Overall Architecture

The method consists of two core contributions:

  1. PEAP (Position-aware Edge Attribution Patching): extends edge attribution patching to cross-position edges.
  2. Schema: defines semantic labels to handle the position alignment problem of variable-length inputs.

Key Designs

  1. Position-aware Edge Attribution Patching (PEAP):

    • The original EAP only handles edges at the same position. PEAP extends this to cross-position attention edges.
    • An attention head \(h^i_t\) at position \(t\) connects to nodes at other positions through three types of edges: value, key, and query.
    • The attribution score for each edge type is computed by patching \(v_{t'}\), \(k_{t'}\), and \(q_t\) separately.
    • A first-order linear approximation is used: \(M(x|e=e_{x'}) - M(x) \approx (z^*_{h^i_t} - z_{h^i_t})^\top \nabla_{z_{h^i_t}} M(x)\), where \(M\) is the task metric, \(x'\) is the counterfactual input, \(z_{h^i_t}\) is the head's output on the clean run, and \(z^*_{h^i_t}\) is its output once the edge \(e\) is patched.
    • Key: The edge importance at each position is evaluated separately rather than aggregated across positions.
  2. Dataset Schema:

    • Problem: Input lengths of real-world datasets vary, making direct position alignment across samples impossible.
    • Proposed Solution: Define semantic spans (e.g., "Subject", "Year"), mapping samples of different lengths to a unified abstract computation graph.
    • Mapping function \(f_\mathcal{S}^x\): Maps edges of the abstract graph to a set of edges in the specific sample's computation graph.
    • Schema-level attribution score: Summing the scores of all concrete edges mapped to the same abstract edge, and then averaging across samples.
  3. Automated Schema Generation Pipeline:

    • Using LLMs (Claude 3.5 Sonnet) to automatically generate Schemas: sampling multiple subsets to generate them separately, and then unifying them into a final version.
    • Saliency Enhancement: Using input×gradient to calculate the importance of each token position to the target metric, generating a saliency mask.
    • Providing the mask to the LLM so that it considers the model's actual computation patterns when designing the Schema.
    • Schema application is also automated by the LLM, with a validity rate of \(\geq 90\%\) considered successful.
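
The first-order score from item 1 and the schema-level aggregation from item 2 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the PEAP codebase: `first_order_score`, `schema_scores`, and the sample layout are hypothetical names chosen for illustration, the numbers are made up, and an abstract edge missing from a sample is assumed to contribute zero to the average.

```python
from collections import defaultdict


def first_order_score(z_clean, z_patched, grad):
    """(z* - z)^T grad: linear estimate of the metric change from patching one edge."""
    return sum((p - c) * g for c, p, g in zip(z_clean, z_patched, grad))


def schema_scores(samples):
    """Sum token-level edge scores per abstract (span -> span) edge, then average over samples.

    Each sample is a dict with (hypothetical layout):
      "spans":       the span label of every token position
      "edge_scores": {(src_pos, dst_pos): score} for one fixed pair of components
    """
    totals = defaultdict(float)
    for sample in samples:
        spans = sample["spans"]
        for (src, dst), score in sample["edge_scores"].items():
            # Map the concrete edge to its abstract edge and accumulate.
            totals[(spans[src], spans[dst])] += score
    # Assumption: edges absent from a sample count as zero in the average.
    return {edge: total / len(samples) for edge, total in totals.items()}


# One patched edge: activation difference dotted with the metric gradient.
approx = first_order_score([1.0, 2.0], [1.5, 1.0], [2.0, 0.5])

# Two samples of different lengths that share the same span labels.
samples = [
    {"spans": ["Subject", "Subject", "Year"],
     "edge_scores": {(0, 2): 0.5, (1, 2): 0.25, (2, 2): 1.0}},
    {"spans": ["Subject", "Year"],
     "edge_scores": {(0, 1): 0.25, (1, 1): 0.5}},
]
result = schema_scores(samples)

print(approx)  # 0.5
print(result)  # {('Subject', 'Year'): 0.5, ('Year', 'Year'): 0.75}
```

Because scores are keyed by span labels rather than token indices, the three-token and two-token samples land on the same abstract edges even though their positions do not line up.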

Loss & Training

This is an inference-time analysis method and does not involve model training. The key hyperparameter is the circuit size: a circuit is built greedily by adding edges one at a time in order of importance.
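
A greedy construction of this kind might look like the sketch below. It is illustrative only: `greedy_circuit` and `toy_faithfulness` are hypothetical names, ranking by absolute score is an assumption, and the toy faithfulness function is a stand-in for what would in practice require running the model with the circuit's complement ablated.

```python
def greedy_circuit(edge_scores, faithfulness_fn, target, max_edges=None):
    """Add edges in order of decreasing |score| until the circuit is faithful enough."""
    ranked = sorted(edge_scores, key=lambda e: abs(edge_scores[e]), reverse=True)
    circuit = []
    for edge in ranked[:max_edges]:  # max_edges=None means no size cap
        circuit.append(edge)
        if faithfulness_fn(circuit) >= target:
            break
    return circuit


# Made-up schema-level scores for four abstract edges.
edge_scores = {"a": 0.5, "b": -0.25, "c": 0.125, "d": 0.125}


def toy_faithfulness(circuit):
    """Stand-in metric: share of total |score| captured by the circuit."""
    total = sum(abs(s) for s in edge_scores.values())
    return sum(abs(edge_scores[e]) for e in circuit) / total


circuit = greedy_circuit(edge_scores, toy_faithfulness, target=0.75)
print(circuit)  # ['a', 'b']
```

Sweeping `target` (or `max_edges`) traces out the faithfulness versus circuit-size trade-off that the experiments compare across methods.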

Key Experimental Results

Main Results

On the Greater-Than task (GPT2-small), the circuits discovered by PEAP are several orders of magnitude smaller than non-positional circuits at the same level of faithfulness.

Hard Faithfulness Comparison (Multi-task, Multi-model):

| Task | Model | PEAP+Schema | Non-positional | Gain |
| --- | --- | --- | --- | --- |
| Greater-Than | GPT2-small | Smaller circuit achieves equivalent faithfulness | Larger circuit required for faithfulness | Significant |
| IOI | GPT2-small | LLM + Mask ≈ Manual Schema | Large gap for non-positional | Significant |
| IOI | Llama-3-8B | LLM Schema slightly outperforms manual | - | Significant |
| Winobias | Llama-3-8B | Mask enhancement consistently improves | - | Significant |

Ablation Study

Comparison of Schema Generation Methods:

| Method | Characteristics | Faithfulness |
| --- | --- | --- |
| Manual Schema | Gold standard | High |
| LLM + Mask | Automated + Saliency-guided | ≈ Manual |
| LLM Only | Purely automated | Slightly lower than LLM+Mask |
| No Schema (EAP) | Non-positional | Lowest |

Quantifying Cancellation and Overestimation Effects (IOI, GPT2-small):

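| Top K% | Cancellation Difference | Control Difference (Cancellation) | Overestimation Difference | Control Difference (Overestimation) |
| --- | --- | --- | --- | --- |
| 1% | 17.1% | 3.9% | 17.5% | 3.6% |
| 5% | 13.4% | 2.4% | 14.6% | 2.1% |
| 10% | 12.1% | 2.3% | 12.4% | 2.2% |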

Key Findings

  • Position-aware circuits achieve a better faithfulness-circuit size trade-off across all tasks and models.
  • LLM-generated Schema + saliency mask can match or even outperform manual designs.
  • On Llama-3-8B, the LLM-generated Schema even outperforms the manual Schema designed for GPT2-small, indicating that Schemas should be customized for specific models.
  • The cancellation effect can occur across positions within a single sample, not just across samples.

Highlights & Insights

  • Clear Theoretical Intuition: The formalization of the cancellation and overestimation issues is intuitive. Simple diagrams (Figure 2) allow readers to quickly grasp the pitfalls of ignoring position.
  • Fully Automated Pipeline: From Schema generation and application to circuit discovery, the entire process can be fully automated, drastically reducing the manual labor costs of mechanistic interpretability research.
  • Model-aware Schema Design: Allowing LLMs to "see" the model's computation pattern through input×gradient saliency scores is an elegant AI-assisted interpretability method.
  • Discovery of a Counter-intuitive Phenomenon: When transferred to Llama-3-8B, the manual Schema meticulously designed for GPT2-small underperforms the Schema generated automatically by the LLM.

Limitations & Future Work

  • Schema requires spans in all samples to appear in the same order, which limits its applicability to more free-form text formats.
  • There is a lack of a priori principles for what constitutes a "good" Schema; currently, it can only be evaluated a posteriori through downstream faithfulness.
  • Automatic Schema application by the LLM sometimes fails, so invalid applications have to be filtered out.
  • Experiments were only conducted on GPT2-small and Llama-3-8B; scalability on larger models remains to be verified.
  • This work only used Claude 3.5 Sonnet; other LLMs (Llama-3-70B, GPT-4o) failed to meet the standards for Schema application.

Related Work & Insights

  • Directly improves the methodology of EAP (Syed et al., 2023), introducing the position dimension to automatic circuit discovery.
  • Complements manual circuit discovery work (such as the IOI circuit in Wang et al., 2023, and the Greater-Than circuit in Hanna et al., 2024): PEAP provides the same position awareness automatically.
  • The concept of Schema is similar to the role labeling in the IOI dataset (IO, S1, etc.) but generalizes it into a systematic method.
  • Saliency-guided Schema generation is an interesting example of using LLM-as-agent for automated interpretability.

Rating

  • Novelty: ⭐⭐⭐⭐ Position-aware circuit discovery is a natural yet important improvement; the Schema + automated pipeline adds practical value.
  • Experimental Thoroughness: ⭐⭐⭐⭐ A comprehensive comparison across three tasks, two models, and multiple Schema generation methods.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear articulation of research motivation, well-designed diagrams, and structurally progressive method explanations.
  • Value: ⭐⭐⭐⭐ Provides more precise tools and automated pipelines for mechanistic interpretability research, with the potential to become a standard practice.