Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings¶
Conference: CVPR 2026 · arXiv: 2604.08192 · Code: GitHub · Area: Interpretability
Keywords: generalization measurement, circuit discovery, Vision Transformer, distribution shift, mechanistic interpretability
TL;DR¶
This paper predicts generalization performance from model-internal circuits, proposing Dependency Depth Bias (DDB) for pre-deployment model selection and Circuit Shift Score (CSS) for post-deployment performance monitoring; the two metrics improve average correlation over existing proxy metrics by 13.4% and 34.1%, respectively.
Background & Motivation¶
Reliable generalization evaluation is critical for machine learning deployment, especially in high-stakes settings with scarce annotations (e.g., medical imaging). The core challenges arise from two practical scenarios:
- Pre-deployment model selection: how can the best model be selected without labeled target data? In-distribution (ID) accuracy is unreliable due to the underspecification problem: models with similar ID accuracy can exhibit vastly different out-of-distribution (OOD) performance.
- Post-deployment performance monitoring: how can performance degradation be detected under continuous distribution shift? Confidence-based metrics are unreliable due to the overconfidence problem: high confidence is assigned even to incorrect predictions.
Existing proxy metrics (e.g., confidence scores, accuracy-on-the-line, RANKME) analyze only the model's external behavior (output probabilities or feature quality), ignoring the internal mechanisms that produce these outputs.
Core Idea: This paper leverages circuit discovery techniques from Mechanistic Interpretability to extract generalization signals from internal computation paths — because how a model computes is more reflective of its generalization ability than what it computes.
Method¶
Overall Architecture¶
- Extract continuous edge-weight circuits from ViTs via the EAP-IG method.
- Aggregate circuits into an Inter-layer Dependency Matrix (IDM).
- Pre-deployment: Discover "Generalization Motifs" via CCA and design the DDB metric.
- Post-deployment: Monitor performance degradation via the circuit-shift measure CSS.
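The edge-scoring step of this pipeline (step 1 above) can be sketched on a toy computation graph. This is an illustrative numpy sketch with a made-up two-layer "model" and a single skip edge, not the paper's EAP-IG implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL(p || q), summed over the class axis
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Toy "model": logits = h @ W2.T + x @ Ws.T, with h = x @ W1.T.
# The skip connection x -> output is the edge e we ablate.
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(3, 4))
Ws = rng.normal(size=(3, 4))
X = rng.normal(size=(256, 4))  # batch standing in for the data distribution D

def forward(X, ablate_skip=False):
    h = X @ W1.T
    # Mean ablation: replace the edge's input with its dataset mean
    # (rather than interchanging it with another sample's activation).
    skip_in = np.tile(X.mean(axis=0), (X.shape[0], 1)) if ablate_skip else X
    return softmax(h @ W2.T + skip_in @ Ws.T)

p_full = forward(X)                       # M(x)
p_ablated = forward(X, ablate_skip=True)  # M without edge e
c_e = kl(p_ablated, p_full).mean()        # c(e) = E_x[KL(M_\{e}(x), M(x))]
print(f"edge importance c(e) = {c_e:.4f}")
```

A real ViT has many edges (attention heads, MLPs, residual streams), so EAP-IG approximates these per-edge KL scores with gradient-based attribution instead of ablating each edge separately.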
Key Designs¶
- Continuous Circuit Definition and Discovery:
- Function: Represent the ViT computation graph as a continuous edge-weight map, quantifying the causal importance of each edge to model behavior.
- Mechanism: For the ViT computation graph \(\mathcal{G}=(\mathcal{V}, \mathcal{E})\), define the circuit as an edge-weight function \(c(e) = \mathbb{E}_{x \sim \mathcal{D}}[KL(\mathcal{M}_{\setminus\{e\}}(x), \mathcal{M}(x))]\), i.e., the KL divergence of model output after removing edge \(e\). Mean ablation is adopted over interchange ablation, as it is better suited for vision tasks. The EAP-IG method balances faithfulness and computational efficiency.
- Design Motivation: Binary circuits discard fine-grained information; continuous relaxation preserves richer structural information necessary for generalization assessment. The entire process requires no labels.
- Dependency Depth Bias (DDB) — Pre-deployment Metric:
- Function: Quantify the model's relative reliance on deep vs. shallow features to predict OOD generalization ability.
- Mechanism: Circuit edge weights are aggregated into an inter-layer dependency matrix \(\Lambda_{ij}\). CCA is applied to discover cross-task "universal generalization motifs" — well-generalizing models rely on deep paths (∇-shaped), while poorly-generalizing models rely on shallow shortcuts (Δ-shaped). DDB is defined as the log ratio of the summed deep edge weights to shallow edge weights: \(DDB = \log(\sum_{deep}/\sum_{shallow})\).
- Three variants: DDB_global (global), DDB_deep (deep-to-deep connections), DDB_out (connections to output nodes).
- Design Motivation: Deep layers encode more abstract, domain-invariant semantic representations, while shallow layers capture domain-specific spurious correlations. The shallow/deep layer boundary \(\tau=0.3\) yields the best performance.
- Circuit Shift Score (CSS) — Post-deployment Metric:
- Function: Measure the degree of circuit shift on OOD data relative to an ID baseline, to predict performance degradation.
- Mechanism: It is observed that the inter-layer topology of circuits remains stable post-deployment, but edge rewiring increases with distribution shift. CSS is defined as \(CSS = d(\mathcal{R}(c_{ID}), \mathcal{R}(c_{OOD}))\), supporting two representation families: vectorized (cosine/ℓ2/SRCC distance) and graph-structured (Laplacian/NetLSD/Jaccard).
- Design Motivation: Post-deployment comparison is made between circuits of the same model on different data; inter-layer topology no longer provides consistent signals (CCA yields contradictory generalization motifs across datasets), necessitating fine-grained rewiring pattern measurement. CSS(v, SRCC) performs best, indicating that relative ranking changes of edge weights are more reliable than absolute magnitudes.
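On top of the aggregated edge weights, both metrics reduce to a few lines. A minimal numpy sketch on synthetic circuits — the exact deep/shallow edge partition and the tie handling in the rank correlation are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 12                                               # number of layers
Lam = np.triu(np.abs(rng.normal(size=(L, L))), k=1)  # dependency matrix Lambda

def ddb(Lam, tau=0.3):
    """DDB = log(sum of deep edge weights / sum of shallow edge weights)."""
    b = int(tau * Lam.shape[0])   # layers below b count as shallow
    deep = Lam[b:, b:].sum()      # edges between deep layers
    shallow = Lam.sum() - deep    # edges touching a shallow layer
    return float(np.log(deep / shallow))

def srcc(a, b):
    # Spearman rank correlation via rank transform (ignores ties).
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def css_srcc(c_id, c_ood):
    """CSS(v, SRCC): rank-correlation distance between vectorized circuits."""
    return 1.0 - srcc(c_id.ravel(), c_ood.ravel())

c_id = np.abs(rng.normal(size=100))                 # ID edge weights
c_ood = c_id + 0.5 * np.abs(rng.normal(size=100))   # mildly "rewired" OOD
print(f"DDB = {ddb(Lam):.3f}  CSS(v,SRCC) = {css_srcc(c_id, c_ood):.3f}")
```

Because SRCC compares only the rank order of edge weights, CSS(v, SRCC) is invariant to monotone rescaling of edge importances, which matches the paper's finding that relative rankings are more reliable than absolute magnitudes.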
Loss & Training¶
No training is involved. Threshold calibration for CSS uses 39 corruption domains from CIFAR10-C as proxy data to simulate distribution shift; the CSS value corresponding to the corruption domain closest to the performance threshold \(\delta\) is selected as the alert threshold \(\delta'\).
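The calibration rule above can be sketched as follows, with synthetic proxy domains standing in for CIFAR10-C (all accuracy/CSS values here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def calibrate_threshold(acc, css, delta=0.85):
    """Return delta': the CSS of the proxy domain whose accuracy is
    closest to the performance threshold delta."""
    idx = int(np.argmin(np.abs(acc - delta)))
    return float(css[idx])

# 39 synthetic corruption domains: accuracy degrades as circuits shift.
css = np.sort(rng.uniform(0.0, 1.0, size=39))
acc = 0.95 - 0.4 * css + 0.01 * rng.normal(size=39)

delta_prime = calibrate_threshold(acc, css, delta=0.85)
print(f"alert when live CSS exceeds delta' = {delta_prime:.3f}")
```

At monitoring time, an alert fires whenever the CSS measured on live data exceeds \(\delta'\), with no target labels required.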
Key Experimental Results¶
Main Results — Pre-deployment Model Selection¶
| Dataset | Metric | DDB_out (Ours) | Prev. SOTA | Gain |
|---|---|---|---|---|
| PACS (style shift) | R²/SRCC/KRCC | 0.862/0.897/0.731 | 0.765/0.878/0.720 (ID Acc) | +13% |
| Camelyon17 (institution shift) | R²/SRCC/KRCC | 0.748/0.820/0.646 | 0.588/0.802/0.628 (ATC) | +22% |
| Terra Incognita (geographic shift) | R²/SRCC/KRCC | 0.714/0.838/0.642 | 0.684/0.813/0.613 (DDB_global) | +5% |
| Average | Composite score | 0.766±0.029 | 0.632±0.047 (ID Acc) | +13.4% |
Main Results — Post-deployment Performance Monitoring¶
| Dataset | Metric | CSS(v,SRCC) (Ours) | Prev. SOTA | Gain |
|---|---|---|---|---|
| PACS | R²/SRCC/KRCC | 0.912/0.983/0.944 | 0.645/0.617/0.444 (ATC) | +78% |
| FMoW | R²/SRCC/KRCC | 0.723/0.750/0.722 | 0.428/0.717/0.611 (MDE) | +29% |
| Camelyon17 | R²/SRCC/KRCC | 0.519/0.807/0.608 | 0.036/0.273/0.187 (MDE) | +187% |
| ImageNet | R²/SRCC/KRCC | 0.953/0.961/0.855 | 0.942/0.957/0.861 (ATC) | +1% |
| Average | Composite score | 0.811±0.041 | 0.470±0.095 (ATC) | +34.1% |
Ablation Study¶
| Configuration | R² | SRCC | KRCC | Note |
|---|---|---|---|---|
| τ=0.1 | 0.744 | 0.743 | 0.562 | Shallow/deep boundary too narrow |
| τ=0.2 | 0.772 | 0.843 | 0.653 | Suboptimal |
| τ=0.3 | 0.798 | 0.862 | 0.684 | Best |
| τ=0.4 | 0.801 | 0.849 | 0.671 | Near-best |
| τ=0.5 | 0.772 | 0.838 | 0.673 | Mid-point boundary |
Key Findings¶
- DDB closely aligns with training dynamics of OOD accuracy: well-generalizing models show DDB increasing from 2.6 to 4.1, while poorly-generalizing models stagnate around −0.9.
- Vectorized CSS substantially outperforms graph-structured CSS, indicating that fine-grained edge-weight patterns are more informative than coarse-grained topological similarity.
- Different datasets exhibit distinct circuit shift patterns: FMoW shows broad cross-layer changes, while Camelyon17 concentrates changes in deep layers.
- CSS alert F1 improves by approximately 45% over the best baseline within the clinically acceptable performance range (0.8–0.9).
Highlights & Insights¶
- This work pioneers the transition of mechanistic interpretability from "post-hoc explanation" to "predictive metric," establishing a new paradigm for model evaluation.
- The "universal generalization motif" reveals an intuitively consistent regularity: well-generalizing models rely more on deep abstract features, while poorly-generalizing models rely on shallow surface features.
- CSS detects "silent failures" without labels, offering significant practical value in high-stakes domains such as healthcare.
- The transfer of circuit discovery from language models to vision models validates the existence of interpretable generalization patterns in vision Transformers.
Limitations & Future Work¶
- The computational cost of circuit discovery is high, limiting the practicality of real-time post-deployment monitoring.
- Validation is restricted to the ViT architecture; applicability to CNNs or hybrid architectures remains unknown.
- Constructing the model zoo (72–144 ViTs) is costly, and such a large pool of candidate models may not be available in practice.
- The threshold calibration strategy relies on artificially constructed corruption datasets; its robustness under real-world distribution shifts requires further verification.
Related Work & Insights¶
- vs. Accuracy-on-the-Line: this heuristic assumes a linear ID–OOD accuracy relationship, which the underspecification phenomenon violates; DDB avoids this issue by examining internal structure rather than external behavior.
- vs. Confidence-based metrics (AC/ANE/MDE): Confidence scores only reflect output probability distributions and are severely affected by overconfidence; CSS measures changes in computation paths and is therefore more reliable.
- vs. RANKME/α-ReQ: Feature quality metrics assess representation space properties but do not account for how features are used by subsequent layers; DDB captures the full inter-layer dependency structure.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First application of circuit discovery to generalization prediction; paradigm-level innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three pre-deployment and four post-deployment datasets, multiple baselines, thorough ablation.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear; the two scenarios are treated as self-contained, well-structured threads.
- Value: ⭐⭐⭐⭐ Provides practical guidance for model evaluation and monitoring, though computational cost limits applicability.