Vision Transformers Need More Than Registers¶
Conference: CVPR 2026 arXiv: 2602.22394 Code: GitHub Area: Visual Representation Learning / Vision Transformer Analysis Keywords: ViT artifacts, lazy aggregation, patch score, foreground aggregation, register tokens
TL;DR¶
This paper systematically analyzes the artifact phenomenon widely observed in ViTs across fully supervised, text-supervised, and self-supervised paradigms, revealing that the root cause is "lazy aggregation"—ViTs exploit semantically irrelevant background patches as shortcuts to represent global semantics. The authors propose LaSt-ViT (LazyStrike ViT), which anchors the CLS token to foreground regions via frequency-aware selective channel aggregation, consistently eliminating artifacts and improving performance across 12 benchmarks.
Background & Motivation¶
Background: ViTs have become the de facto standard for image recognition and, more importantly, serve as general-purpose feature extractors (frozen foundation models) for diverse downstream tasks. ViTs trained under different supervision paradigms each have their strengths: fully supervised and text-supervised models (e.g., CLIP) excel at open-vocabulary tasks and serve as visual encoders for LVLMs, while self-supervised models (e.g., DINO) are well-suited for unsupervised segmentation and object discovery.
Limitations of Prior Work: 1. DINO identifies attention deficit issues in fully supervised ViTs. 2. CLIPSelf finds that dense features from text-supervised ViTs are misaligned with textual cues. 3. The Register paper discovers that self-supervised ViTs (DINOv2) produce high-norm token artifacts that impair object localization. 4. These phenomena suggest a shared underlying problem in ViTs, yet no unified explanation or solution has been proposed.
Key Challenge: ViTs achieve excellent image-level classification performance, but patch-level dense feature quality is poor—top-scoring patches fall on background rather than foreground regions, and register tokens only suppress the high-norm phenomenon without addressing the fundamental issue (PiB even degrades).
Goal: To define, analyze, and resolve the artifact problem in ViTs across different supervision paradigms from first principles in a unified manner.
Key Insight: The authors introduce Patch Score (CLS-patch cosine similarity) and Point-in-Box (PiB) as unified metrics for quantifying artifacts, and identify the root cause as lazy aggregation—global attention combined with coarse-grained supervision leads ViTs to take shortcuts by encoding global semantics through background patches.
Core Idea: Distinguish foreground from background patches via frequency-domain stability analysis, and selectively aggregate stable patches into the CLS token to eliminate lazy aggregation.
Method¶
Overall Architecture¶
The core of LaSt-ViT is replacing the original CLS token aggregation mechanism in ViTs (Attention Pooling or GAP) with a channel-wise frequency-stability-based Top-K selective aggregation. The intuition is that foreground patch features are more homogeneous along the channel dimension (semantically consistent) and are therefore more stable under low-pass filtering.
Key Designs¶
- Patch Score and Point-in-Box (PiB) as Unified Metrics:
- Patch Score is defined as the cosine similarity between each patch feature and the CLS token: \(\mathcal{S}_p = \frac{\mathbf{x}_{\text{patch}} \cdot \mathcal{Q}_{\text{CLS}}}{\|\mathbf{x}_{\text{patch}}\|_2 \|\mathcal{Q}_{\text{CLS}}\|_2}\)
- PiB measures whether the highest-scoring patch falls within a foreground ground-truth bounding box, serving as an indicator of artifact severity.
- Experiments show that ViT achieves a PiB of only 42.7, far below ResNet's 68.4; while register tokens eliminate the high-norm phenomenon, PiB actually drops to 41.5.
- These metrics are agnostic to the supervision paradigm and apply uniformly across fully supervised, text-supervised, and self-supervised ViTs.
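These two metrics can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's code: the `(N, D)` feature layout, the patch-grid/box coordinate convention, and the `eps` guard are assumptions.

```python
import numpy as np

def patch_scores(patch_feats: np.ndarray, cls_query: np.ndarray) -> np.ndarray:
    """Patch Score: cosine similarity between each patch feature and the CLS query.

    patch_feats: (N, D) patch features; cls_query: (D,) CLS token/query.
    Returns an (N,) array of scores S_p.
    """
    eps = 1e-8  # guard against zero-norm features (assumption, not from the paper)
    num = patch_feats @ cls_query
    denom = np.linalg.norm(patch_feats, axis=1) * np.linalg.norm(cls_query) + eps
    return num / denom

def point_in_box(scores: np.ndarray, grid: int, box: tuple) -> bool:
    """PiB indicator: does the top-scoring patch land inside the foreground box?

    Patches are assumed to form a grid x grid layout in row-major order;
    box = (r0, c0, r1, c1) in patch-grid coordinates, inclusive.
    """
    idx = int(np.argmax(scores))
    r, c = divmod(idx, grid)
    r0, c0, r1, c1 = box
    return bool(r0 <= r <= r1 and c0 <= c <= c1)
```

Averaging the `point_in_box` indicator over a labeled dataset yields the PiB percentages reported in the tables below.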
- Validating the Lazy Aggregation Hypothesis:
- Masking probe: Removing the top 50% of patches by Patch Score has negligible or slightly positive effect on ImageNet accuracy (+1.2%), whereas removing low-scoring patches causes a sharp accuracy drop (−60% at 70% masking), confirming that high-scoring patches are semantically irrelevant background.
- Training dynamics: ViT's PiB remains low (~42) from the very beginning of training and hardly changes, indicating that lazy aggregation is an intrinsic behavior established early in training.
- Factor disentanglement: (1) Increasing the patch size to 28 (yielding fewer background tokens) raises PiB from 44 to 52 but hurts accuracy (roughly −10%); (2) Replacing global attention with window attention raises PiB to 59.8 but drops accuracy from 72.3 to 63.9.
- Conclusion: coarse-grained supervision (image-level labels) combined with global dependencies (long-range attention) jointly induces lazy aggregation.
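The masking probe reduces to a simple ranking step. A minimal sketch (the function name, the fraction-handling, and the drop/keep convention are assumptions; in the actual probe, the kept patches would be fed through the frozen classifier to re-measure accuracy):

```python
import numpy as np

def masking_probe_keep(scores: np.ndarray, drop_frac: float, drop_top: bool) -> np.ndarray:
    """Indices of patches kept after dropping a fraction of them by Patch Score.

    drop_top=True removes the highest-scoring patches (the hypothesized
    background shortcuts); drop_top=False removes the lowest-scoring ones.
    """
    n_drop = int(round(drop_frac * len(scores)))
    if n_drop == 0:
        return np.arange(len(scores))
    order = np.argsort(scores)  # ascending by Patch Score
    dropped = order[-n_drop:] if drop_top else order[:n_drop]
    return np.setdiff1d(np.arange(len(scores)), dropped)
```

Under the lazy-aggregation hypothesis, classifying on `masking_probe_keep(scores, 0.5, drop_top=True)` should barely hurt (or even help) accuracy, while `drop_top=False` should be catastrophic, matching the numbers above.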
- LaSt-ViT: Frequency-Aware Selective Aggregation:
- Stability Score computation: Apply a channel-wise 1D FFT to patch features → multiply by a Gaussian low-pass filter \(\mathbf{g}\) → apply inverse FFT to obtain low-frequency features \(\hat{\mathbf{x}}_{\text{patch}}\).
- Stability score: \(\mathbf{S}_{i,j} = \frac{\hat{\mathbf{x}}_{\text{patch}}[i,j]}{|\hat{\mathbf{x}}_{\text{patch}}[i,j] - \mathbf{x}_{\text{patch}}[i,j]| + \varepsilon}\), where a higher score indicates that patch \(i\) is more stable in channel \(j\) (and thus more likely to be foreground).
- Channel-wise Top-K Pooling: For each channel \(j\), select the \(K\) patches with the highest stability scores and compute their mean as the corresponding channel value of the CLS token: \(\mathcal{Q}_{\text{CLS}}[j] = \frac{1}{K} \sum_{i \in \mathcal{I}_K(j)} \mathbf{x}_{\text{patch}}[i,j]\)
- Vote Count visualization: Define \(v_i = \sum_{j=1}^D \mathbf{1}\{i \in \mathcal{I}_K(j)\}\); patches with high vote counts are strongly aligned with foreground regions.
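The pipeline above can be sketched end-to-end in NumPy. Several details are assumptions on my part rather than the paper's specification: the 1D FFT is taken along the channel dimension of each patch (matching the homogeneity intuition), the Gaussian filter width `sigma` is an illustrative hyperparameter, and the sign handling of the stability ratio follows the formula as written.

```python
import numpy as np

def last_vit_cls(patch_feats: np.ndarray, k: int, sigma: float = 8.0):
    """Sketch of LaSt-ViT's frequency-aware selective aggregation.

    patch_feats: (N, D) patch features. Returns (cls, votes): the aggregated
    CLS vector of shape (D,) and the per-patch vote counts v_i of shape (N,).
    """
    n, d = patch_feats.shape
    # 1) Channel-wise 1D FFT -> Gaussian low-pass -> inverse FFT.
    freqs = np.fft.fftfreq(d)
    g = np.exp(-0.5 * (freqs * d / sigma) ** 2)  # Gaussian low-pass (assumed form)
    low = np.fft.ifft(np.fft.fft(patch_feats, axis=1) * g, axis=1).real
    # 2) Stability score: low-frequency component over the filtered-out residual.
    eps = 1e-6
    stability = low / (np.abs(low - patch_feats) + eps)
    # 3) Channel-wise Top-K pooling: per channel, average the K most stable patches.
    topk = np.argsort(stability, axis=0)[-k:]          # (K, D) patch indices
    cls = np.take_along_axis(patch_feats, topk, axis=0).mean(axis=0)
    # 4) Vote count: how many channels selected patch i.
    votes = np.bincount(topk.ravel(), minlength=n)
    return cls, votes
```

Note that setting `k = n` makes every channel average over all patches, i.e. the method degenerates to GAP, consistent with the ablation on \(K\) reported below.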
Loss & Training¶
- LaSt-ViT introduces no additional loss functions; it only replaces the CLS token aggregation mechanism.
- It is applicable to any ViT pre-training pipeline (fully supervised, CLIP, DINO) as a drop-in replacement.
- The hyperparameter \(K\) controls the number of patches selected per channel; the optimal value is approximately 50% of the total patch count (e.g., 98 out of 196 patches for ViT-B/16).
- The training procedure is identical to that of the original ViT, requiring no additional data or hyperparameter tuning.
Key Experimental Results¶
Main Results¶
Artifact elimination (Patch Score / PiB):
| Method | High Norm | Point-in-Box (PiB) |
|---|---|---|
| ResNet | ✗ | 68.4 |
| ViT | ✓ | 42.7 |
| ViT + Register | ✗ | 41.5 |
| ViT + LazyStrike | ✗ | 55.1 (+12.4) |
| DINO-ViT | ✗ | 44.5 |
| DINO + LazyStrike | ✗ | 69.7 (+25.2) |
| CLIP-ViT | ✓ | 39.8 |
| CLIP + LazyStrike | ✗ | 50.1 (+10.3) |
Zero-shot semantic segmentation (mIoU %, CLIP ViT-L/14):
| Method | VOC20 | ADE20K | Cityscapes | COCO-Stf. |
|---|---|---|---|---|
| CLIP | 17.1 | 1.6 | 2.7 | 3.2 |
| CLIP + LazyStrike | 72.4 (+55.3) | 8.4 (+6.8) | 12.3 (+9.6) | 11.9 (+8.7) |
Ablation Study¶
CLS aggregation method comparison (OpenCLIP ViT-B/16):
| Method | ImageNet Top-1 | VOC20 (seg) | COCO-Stf. (seg) |
|---|---|---|---|
| Attention-Pool | 55.8 | 49.0 | 7.2 |
| Max-Pool | 53.1 | 71.9 | 12.2 |
| LazyStrike K=1 | 53.5 | 72.7 | 13.5 |
| LazyStrike K=49 | 55.8 | 75.8 | 18.5 |
| LazyStrike K=98 | 56.2 | 75.9 | 18.0 |
| LazyStrike K=196 (Full) | 55.3 | 13.5 | 4.8 |
Unsupervised object discovery (CorLoc, DINO ViT-S):
| Method | VOC07 | VOC12 | COCO | FPS |
|---|---|---|---|---|
| DINO-seg | 45.8 | 46.2 | 42.1 | 29.4 |
| LOST | 61.9 | 64.0 | 50.7 | 29.4 |
| DINO + LazyStrike | 64.4 | 67.6 | 51.6 | 55.9 |
Key Findings¶
- Register tokens merely relocate the high-norm phenomenon from the feature map to the register tokens; PiB actually decreases (41.5 < 42.7), substantiating the claim that "Vision Transformers Need More Than Registers."
- LazyStrike simultaneously eliminates both the high-norm and patch score artifacts, as both are distinct manifestations of the same underlying lazy aggregation behavior.
- On CLIP ViT-L/14, zero-shot segmentation mIoU on VOC20 jumps from 17.1% to 72.4% (+55.3%), demonstrating that dense feature quality improves substantially once artifacts are eliminated.
- LazyStrike endows fully supervised ViTs with emergent segmentation capability (evidenced by PCA visualizations), a property previously considered exclusive to self-supervised models such as DINO.
- Setting \(K\) to the full patch count degrades to GAP and hurts performance; setting \(K\) too small discards too much information; the optimal value is \(K \approx N/2\).
Highlights & Insights¶
- Analytical rigor: The investigation proceeds from first principles through masking probes, training dynamics tracking, and factor disentanglement experiments, yielding a rigorous hypothesis-validation workflow.
- Unified perspective: Diverse artifact phenomena observed under three supervision paradigms are attributed to a single root cause (lazy aggregation), offering a new conceptual framework for understanding ViT behavior.
- Simplicity and effectiveness: Replacing only the CLS aggregation mechanism—without additional modules, data, or losses—consistently improves performance across 12 benchmarks.
- Counterintuitive finding: Emergent segmentation is not exclusive to self-supervised learning; fully supervised ViTs exhibit the same capability once lazy aggregation is eliminated.
Limitations & Future Work¶
- The frequency-domain stability assumption (foreground patches are more stable) may not hold in certain scenarios (e.g., texture-rich foreground against a uniform background).
- Channel-wise Top-K selection requires additional FFT/IFFT computation; while lightweight, it still introduces overhead for real-time inference.
- Validation is limited to ViT-S/B/L; larger-scale models (e.g., ViT-G) have not been evaluated.
- The value of \(K\) is fixed during training; adaptive or learnable selection mechanisms remain unexplored.
- The paper title implies that register tokens are insufficient, yet the combination of LazyStrike and register tokens is not thoroughly investigated.
Related Work & Insights¶
- Register Tokens (Darcet et al.): Absorb high-norm artifacts via additional tokens, but do not address the root cause.
- CLIPSelf: Repairs CLIP dense features through additional alignment training, constituting a post-hoc solution.
- MaskCLIP: First demonstrates that CLIP features can be applied to zero-shot semantic segmentation.
- F-ViT: Performs open-vocabulary detection with a frozen CLIP backbone, directly benefiting from LazyStrike.
- Insight: Shortcut learning in pre-trained models is a pervasive problem; understanding model-internal behavior enables simple yet highly effective remediation.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Practicality | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐⭐ |