Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

Conference: CVPR 2026
arXiv: 2604.08192
Code: GitHub
Area: Interpretability
Keywords: generalization measurement, circuit discovery, Vision Transformer, distribution shift, mechanistic interpretability

TL;DR

This paper proposes two label-free metrics that predict generalization performance from model-internal circuits: Dependency Depth Bias (DDB) for pre-deployment model selection and Circuit Shift Score (CSS) for post-deployment performance monitoring. Relative to the strongest existing proxy metrics, they raise the average composite correlation score by 0.134 and 0.341 (reported as 13.4% and 34.1%), respectively.

Background & Motivation

Reliable generalization evaluation is critical for machine learning deployment, especially in high-stakes settings with scarce annotations (e.g., medical imaging). The core challenges arise from two practical scenarios:

Pre-deployment model selection: How to select the best model without labeled target data? In-distribution (ID) accuracy is unreliable (the underspecification problem: models with similar ID accuracy can exhibit vastly different OOD performance).
Post-deployment performance monitoring: How to detect performance degradation under continuous distribution shift? Confidence-based metrics are unreliable (overconfidence problem: high confidence is assigned even to incorrect predictions).

Existing proxy metrics (e.g., confidence scores, accuracy-on-the-line, RANKME) analyze only the model's external behavior (output probabilities or feature quality), ignoring the internal mechanisms that produce these outputs.

Core Idea: This paper uses circuit discovery techniques from mechanistic interpretability to extract generalization signals from internal computation paths, on the premise that how a model computes reflects its generalization ability better than what it computes.

Method

Overall Architecture

Extract continuous edge-weight circuits from ViTs via EAP-IG (edge attribution patching with integrated gradients).
Aggregate circuits into an Inter-layer Dependency Matrix (IDM).
Pre-deployment: Discover "Generalization Motifs" via CCA and design the DDB metric.
Post-deployment: Monitor performance degradation via the circuit-shift measure CSS.

Key Designs

Continuous Circuit Definition and Discovery:
- Function: Represent the ViT computation graph as a continuous edge-weight map, quantifying the causal importance of each edge to model behavior.
- Mechanism: For the ViT computation graph \(\mathcal{G}=(\mathcal{V}, \mathcal{E})\), define the circuit as an edge-weight function \(c(e) = \mathbb{E}_{x \sim \mathcal{D}}\left[\mathrm{KL}\left(\mathcal{M}_{\setminus\{e\}}(x),\, \mathcal{M}(x)\right)\right]\), i.e., the expected KL divergence between the model's output with edge \(e\) ablated and its original output. Mean ablation is adopted over interchange ablation, as it is better suited for vision tasks. EAP-IG is used to estimate these weights because it balances faithfulness and computational efficiency.
- Design Motivation: Binary circuits discard fine-grained information; continuous relaxation preserves richer structural information necessary for generalization assessment. The entire process requires no labels.
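To make the edge-weight definition concrete, here is a minimal sketch that computes \(c(e)\) by exact mean ablation on a toy model. It is an illustration only: the paper estimates these weights on real ViTs with EAP-IG, whereas this sketch brute-forces the ablation. The toy model, `forward`, `edge_scores`, and all data values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL(p, q) = sum_k p_k * log(p_k / q_k)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward(x, weights, replaced=None):
    """Toy 'model' (assumption): edge e carries activation x[e] to the logits
    through weights[e]. `replaced` maps an edge index to a substitute activation."""
    replaced = replaced or {}
    n_classes = len(weights[0])
    logits = [0.0] * n_classes
    for e, w in enumerate(weights):
        a = replaced.get(e, x[e])
        for k in range(n_classes):
            logits[k] += a * w[k]
    return softmax(logits)

def edge_scores(data, weights):
    """c(e) = mean over x of KL(M_without_e(x), M(x)), using mean ablation:
    the activation on edge e is replaced by its dataset mean. No labels needed."""
    n_edges = len(weights)
    means = [sum(x[e] for x in data) / len(data) for e in range(n_edges)]
    scores = []
    for e in range(n_edges):
        total = 0.0
        for x in data:
            total += kl(forward(x, weights, {e: means[e]}), forward(x, weights))
        scores.append(total / len(data))
    return scores

# Hypothetical inputs: edge 0 varies strongly, edge 1 weakly, edge 2 is constant.
data = [[1.0, 0.2, 0.5], [-1.0, 0.1, 0.5], [2.0, -0.1, 0.5], [-2.0, 0.0, 0.5]]
weights = [[1.5, -1.5], [0.3, -0.3], [2.0, -2.0]]
scores = edge_scores(data, weights)
```

In this toy setup the constant edge receives a score of zero despite its large weight, because replacing a constant activation with its mean changes nothing. The score therefore measures how much the output depends on input-specific information flowing through the edge.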
Dependency Depth Bias (DDB) — Pre-deployment Metric:
- Function: Quantify the model's relative reliance on deep vs. shallow features to predict OOD generalization ability.
- Mechanism: Circuit edge weights are aggregated into an inter-layer dependency matrix \(\Lambda_{ij}\). CCA is applied to discover cross-task "universal generalization motifs": well-generalizing models rely on deep paths (∇-shaped), while poorly-generalizing models rely on shallow shortcuts (Δ-shaped). DDB is defined as the log ratio of the summed deep edge weights to the summed shallow edge weights: \(\mathrm{DDB} = \log\left(\sum_{\text{deep}} / \sum_{\text{shallow}}\right)\).
- Three variants: DDB_global (global), DDB_deep (deep-to-deep connections), DDB_out (connections to output nodes).
- Design Motivation: Deep layers encode more abstract, domain-invariant semantic representations, while shallow layers capture domain-specific spurious correlations. The threshold \(\tau\), which sets the shallow/deep boundary, works best at \(\tau=0.3\) (highest SRCC and KRCC in the ablation).
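The aggregation and log-ratio steps can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation. It assumes an edge is given as `(source_layer, target_layer, weight)`, that \(\Lambda_{ij}\) sums absolute edge weights between layers \(i\) and \(j\), and that an entry counts as "shallow" when its source layer lies in the first \(\tau\) fraction of layers. The paper's exact partition rule is not given in this summary, and `dependency_matrix` and `ddb` are hypothetical names.

```python
import math

def dependency_matrix(edges, n_layers):
    """Aggregate edge weights into an inter-layer dependency matrix Lambda[i][j]
    (assumption: sum of absolute weights from layer i to layer j)."""
    lam = [[0.0] * n_layers for _ in range(n_layers)]
    for src, dst, w in edges:
        lam[src][dst] += abs(w)
    return lam

def ddb(lam, tau=0.3, eps=1e-12):
    """DDB = log(sum of deep weights / sum of shallow weights).
    Assumption: entries whose source layer index is below tau * n_layers
    are 'shallow'; all others are 'deep'. eps guards against log(0)."""
    n = len(lam)
    boundary = tau * n
    shallow = sum(lam[i][j] for i in range(n) for j in range(n) if i < boundary)
    deep = sum(lam[i][j] for i in range(n) for j in range(n) if i >= boundary)
    return math.log((deep + eps) / (shallow + eps))

# Hypothetical 10-layer circuits: one dominated by deep paths, one by shallow shortcuts.
deep_reliant = [(0, 1, 0.1), (1, 2, 0.1), (5, 8, 0.9), (8, 9, 0.9)]
shortcut_reliant = [(0, 9, 0.9), (1, 9, 0.9), (5, 8, 0.1), (8, 9, 0.1)]

ddb_good = ddb(dependency_matrix(deep_reliant, 10))
ddb_bad = ddb(dependency_matrix(shortcut_reliant, 10))
```

Under these assumptions the deep-reliant circuit gets a positive DDB and the shortcut-reliant circuit a negative one, matching the direction of the ∇-shaped and Δ-shaped motifs described above.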
Circuit Shift Score (CSS) — Post-deployment Metric:
- Function: Measure the degree of circuit shift on OOD data relative to an ID baseline, to predict performance degradation.
- Mechanism: It is observed that the inter-layer topology of circuits remains stable post-deployment, but edge rewiring increases with distribution shift. CSS is defined as \(CSS = d(\mathcal{R}(c_{ID}), \mathcal{R}(c_{OOD}))\), supporting two representation families: vectorized (cosine/ℓ2/SRCC distance) and graph-structured (Laplacian/NetLSD/Jaccard).
- Design Motivation: Post-deployment comparison is made between circuits of the same model on different data; inter-layer topology no longer provides consistent signals (CCA yields contradictory generalization motifs across datasets), necessitating fine-grained rewiring pattern measurement. CSS(v, SRCC) performs best, indicating that relative ranking changes of edge weights are more reliable than absolute magnitudes.
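A minimal sketch of the vectorized SRCC variant follows. It assumes the representation \(\mathcal{R}\) is a flat vector of edge weights in a fixed edge order, and that the SRCC distance is one minus Spearman's rank correlation. Both are assumptions, since the summary does not spell them out, and the function names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def css_srcc(c_id, c_ood):
    """CSS(v, SRCC), assumed here to be 1 - Spearman correlation between the
    ID and OOD edge-weight vectors (same model, same edge order)."""
    return 1.0 - pearson(average_ranks(c_id), average_ranks(c_ood))

# Hypothetical edge weights for one model on ID data and on two OOD batches.
c_id = [0.9, 0.5, 0.3, 0.1, 0.05]
c_mild = [0.8, 0.6, 0.2, 0.15, 0.01]    # magnitudes change, ranking preserved
c_severe = [0.1, 0.9, 0.05, 0.5, 0.3]   # edge importance is reshuffled

css_mild = css_srcc(c_id, c_mild)
css_severe = css_srcc(c_id, c_severe)
```

Because only ranks enter the score, the mild batch yields a CSS of zero even though every weight changed in magnitude. This illustrates why a rank-based distance responds to rewiring rather than to rescaling.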

Loss & Training

No training is involved. Threshold calibration for CSS uses 39 corruption domains from CIFAR10-C as proxy data to simulate distribution shift; the CSS value corresponding to the corruption domain closest to the performance threshold \(\delta\) is selected as the alert threshold \(\delta'\).
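The calibration rule described above can be sketched in a few lines. This assumes each proxy corruption domain comes with a measured accuracy (the proxy data is labeled) and a CSS value relative to the clean baseline. The domain names, numbers, and the rule "alert when CSS exceeds \(\delta'\)" are illustrative assumptions.

```python
def calibrate_alert_threshold(proxy_domains, delta):
    """Return delta': the CSS of the proxy domain whose accuracy is closest
    to the performance threshold delta."""
    closest = min(proxy_domains, key=lambda d: abs(d["accuracy"] - delta))
    return closest["css"]

def should_alert(css_value, css_threshold):
    # Assumption: a larger circuit shift signals lower accuracy,
    # so an alert fires when CSS exceeds delta'.
    return css_value > css_threshold

# Hypothetical proxy measurements (not taken from the paper).
proxy_domains = [
    {"name": "noise_sev1", "accuracy": 0.93, "css": 0.05},
    {"name": "blur_sev3", "accuracy": 0.86, "css": 0.21},
    {"name": "fog_sev5", "accuracy": 0.71, "css": 0.48},
]

delta_prime = calibrate_alert_threshold(proxy_domains, delta=0.85)
alert_on_shifted_batch = should_alert(0.30, delta_prime)
alert_on_clean_batch = should_alert(0.10, delta_prime)
```

Once calibrated, monitoring needs only the unlabeled deployment data: compute CSS on the incoming batch and compare it with \(\delta'\).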

Key Experimental Results

Main Results: Pre-deployment Model Selection

Dataset	Metric	DDB_out (Ours)	Prev. SOTA	Gain
PACS (style shift)	R²/SRCC/KRCC	0.862/0.897/0.731	0.765/0.878/0.720 (ID Acc)	+13%
Camelyon17 (institution shift)	R²/SRCC/KRCC	0.748/0.820/0.646	0.588/0.802/0.628 (ATC)	+22%
Terra Incognita (geographic shift)	R²/SRCC/KRCC	0.714/0.838/0.642	0.684/0.813/0.613 (DDB_global)	+5%
Average	Composite score	0.766±0.029	0.632±0.047 (ID Acc)	+13.4%

Main Results: Post-deployment Performance Monitoring

Dataset	Metric	CSS(v,SRCC) (Ours)	Prev. SOTA	Gain
PACS	R²/SRCC/KRCC	0.912/0.983/0.944	0.645/0.617/0.444 (ATC)	+78%
FMoW	R²/SRCC/KRCC	0.723/0.750/0.722	0.428/0.717/0.611 (MDE)	+29%
Camelyon17	R²/SRCC/KRCC	0.519/0.807/0.608	0.036/0.273/0.187 (MDE)	+187%
ImageNet	R²/SRCC/KRCC	0.953/0.961/0.855	0.942/0.957/0.861 (ATC)	+1%
Average	Composite score	0.811±0.041	0.470±0.095 (ATC)	+34.1%

Ablation Study

Configuration	R²	SRCC	KRCC	Note
τ=0.1	0.744	0.743	0.562	Shallow/deep boundary too narrow
τ=0.2	0.772	0.843	0.653	Suboptimal
τ=0.3	0.798	0.862	0.684	Best SRCC and KRCC
τ=0.4	0.801	0.849	0.671	Best R²; near-best otherwise
τ=0.5	0.772	0.838	0.673	Mid-point boundary

Key Findings

DDB closely aligns with training dynamics of OOD accuracy: well-generalizing models show DDB increasing from 2.6 to 4.1, while poorly-generalizing models stagnate around −0.9.
Vectorized CSS substantially outperforms graph-structured CSS, indicating that fine-grained edge-weight patterns are more informative than coarse-grained topological similarity.
Different datasets exhibit distinct circuit shift patterns: FMoW shows broad cross-layer changes, while Camelyon17 concentrates changes in deep layers.
CSS alert F1 improves by approximately 45% over the best baseline within the clinically acceptable performance range (0.8–0.9).

Highlights & Insights

This work pioneers the transition of mechanistic interpretability from "post-hoc explanation" to "predictive metric," establishing a new paradigm for model evaluation.
The "universal generalization motif" reveals an intuitively consistent regularity: well-generalizing models rely more on deep abstract features, while poorly-generalizing models rely on shallow surface features.
CSS detects "silent failures" without labels, offering significant practical value in high-stakes domains such as healthcare.
The transfer of circuit discovery from language models to vision models validates the existence of interpretable generalization patterns in vision Transformers.

Limitations & Future Work

The computational cost of circuit discovery is high, limiting the practicality of real-time post-deployment monitoring.
Validation is restricted to the ViT architecture; applicability to CNNs or hybrid architectures remains unknown.
Constructing the model zoo (72–144 ViTs) is costly, and such a large pool of candidate models may not be available in practice.
The threshold calibration strategy relies on artificially constructed corruption datasets; its robustness under real-world distribution shifts requires further verification.

Comparison with Existing Metrics

vs. Accuracy-on-the-Line: Accuracy-on-the-line assumes a linear ID–OOD accuracy relationship, which the underspecification phenomenon violates; DDB avoids this issue by examining internal structure rather than external behavior.
vs. Confidence-based metrics (AC/ANE/MDE): Confidence scores only reflect output probability distributions and are severely affected by overconfidence; CSS measures changes in computation paths and is therefore more reliable.
vs. RANKME/α-ReQ: Feature quality metrics assess representation space properties but do not account for how features are used by subsequent layers; DDB captures the full inter-layer dependency structure.

Rating

Novelty: ⭐⭐⭐⭐⭐ First application of circuit discovery to generalization prediction; paradigm-level innovation.
Experimental Thoroughness: ⭐⭐⭐⭐ Three pre-deployment and four post-deployment datasets, multiple baselines, thorough ablation.
Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear; the two scenarios are treated as self-contained, well-structured threads.
Value: ⭐⭐⭐⭐ Provides practical guidance for model evaluation and monitoring, though computational cost limits applicability.