PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation

Conference: CVPR 2026
arXiv: 2603.21528
Code: https://github.com/PGSmall/PEARL
Area: Semantic Segmentation / Open-Vocabulary
Keywords: Open-vocabulary semantic segmentation, training-free, Procrustes alignment, Laplacian propagation, CLIP

TL;DR

PEARL proposes a two-step inference framework based on Procrustes alignment and text-aware Laplacian propagation. Without introducing any additional training or auxiliary backbone networks, it corrects the geometric mismatch between keys and queries in the final self-attention layer of CLIP and leverages textual semantics to guide label propagation, achieving new state-of-the-art performance on training-free open-vocabulary semantic segmentation.

Background & Motivation

Background: Open-vocabulary semantic segmentation (OVSS) enables models to classify each pixel according to a category set specified via natural language at inference time. Mainstream approaches fall into two paradigms: training-based methods enhance CLIP's dense prediction capability by learning decoders or lightweight adapters, while training-free methods keep the backbone frozen and only modify the inference procedure.

Limitations of Prior Work: Training-free methods face two core challenges. First, CLIP's contrastive pre-training emphasizes global image-text alignment rather than dense prediction, causing the top-layer self-attention of the visual encoder to be dominated by a few background-driven directions. This results in a severe geometric mismatch between patch feature structure and text prototypes, leading to unstable patch-text similarity. Second, text is typically used only as a classifier and rarely governs the exchange of information between pixels—despite the fact that inter-class relationships in text space can indicate which categories should mutually reinforce or remain separated.

Key Challenge: Existing methods often apply post-hoc smoothing (e.g., DenseCLIP, PAMR), but this addresses symptoms rather than root causes—when the source geometry is already incorrect, downstream smoothing can only alleviate but not eliminate the problem. Other methods introduce auxiliary visual backbones (e.g., DINOv2), increasing complexity and latency.

Goal: (1) Correct the token geometry at the source where attention scores are formed; (2) transform text from a simple labeler into a structural prior that guides semantic propagation across pixels.

Key Insight: The authors observe that, on CLIP ViT-B/16, vanilla CLIP produces diffuse attention maps biased toward the background, while NACLIP improves on this but suffers from severe fragmentation. Applying an orthogonal Procrustes rotation to the keys in the final self-attention layer aligns them with the query subspace, which in turn yields better-aligned output features.

Core Idea: First repair attention geometry via orthogonal Procrustes alignment, then diffuse semantic consistency across the entire image through text-aware Laplacian propagation—an align-then-propagate paradigm.

Method

Overall Architecture

PEARL inserts two modules into the frozen CLIP ViT inference pipeline. After the input image is encoded by the ViT, a Procrustes Alignment (PA) step is inserted into the final self-attention block: an orthogonal rotation is applied to the keys of each head to align them with the query subspace, yielding corrected patch features. These features are then compared with class prototypes generated by the frozen text encoder via cosine similarity to produce an initial logit field \(\widetilde{Z}\). Subsequently, Text-aware Laplacian Propagation (TLP) performs confidence-weighted, text-guided graph Laplacian solving on a downsampled grid, outputting refined scores \(F\), which are upsampled to full resolution and passed through argmax to obtain the final segmentation.
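As a rough illustration of the classification step in this pipeline, the sketch below computes cosine-similarity logits between patch features and class prototypes and takes a per-patch argmax. The function names (`cosine_logits`, `segment_patches`), the array shapes, and the use of NumPy are assumptions made for illustration; the PA and TLP steps and the final upsampling are left out.

```python
import numpy as np


def cosine_logits(patch_feats, text_protos, eps=1e-8):
    """Cosine similarity between N patch features and C class prototypes.

    patch_feats: (N, D) array of patch features (after PA in the full pipeline).
    text_protos: (C, D) array of class prototypes from the text encoder.
    Returns an (N, C) logit field.
    """
    p = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + eps)
    t = text_protos / (np.linalg.norm(text_protos, axis=1, keepdims=True) + eps)
    return p @ t.T


def segment_patches(patch_feats, text_protos, grid_hw):
    """Assign each patch its best-matching class and reshape to the patch grid."""
    h, w = grid_hw
    logits = cosine_logits(patch_feats, text_protos)
    return logits.argmax(axis=1).reshape(h, w)
```

Because both sides are L2-normalized, the logits depend only on feature direction, which is why directional consistency of the patch features matters for this step.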

Key Designs

Procrustes Alignment (PA):
- Function: Corrects the basis mismatch between keys and queries in the final self-attention layer.
- Mechanism: For each head, query-norm-weighted centroids are computed for both queries and keys, which are then mean-centered. The orthogonal Procrustes problem \(R^* = \arg\min_{R \in O(d)} \|K_c R - Q_c\|_F^2\) is then solved; the closed-form solution is the orthogonal factor \(UV^\top\) from the SVD of the cross-covariance matrix \(K_c^\top Q_c\). Only the keys are rotated—values remain unchanged—and attention scores and outputs are recomputed within the same attention block. A Newton-Schulz iteration can also be used as an SVD-free alternative.
- Design Motivation: Weighted mean-centering suppresses the influence of high-norm background tokens and the CLS token; the orthogonal mapping preserves local magnitudes. The method repairs the directional consistency of patch features in the query subspace, stabilizing downstream cosine similarity. The additional cost per head is limited to one \(d \times d\) SVD and two \(N \times d\) matrix multiplications.
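The PA steps above can be sketched for a single attention head as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the weights are taken to be proportional to the query norms, the rotation is applied to the uncentered keys, and the helper names (`procrustes_rotation`, `newton_schulz_rotation`, `aligned_attention`) are hypothetical.

```python
import numpy as np


def procrustes_rotation(Q, K, eps=1e-8):
    """Closed-form orthogonal Procrustes rotation for one attention head.

    Q, K: (N, d) query and key matrices of a single head.
    Returns R in O(d) minimizing ||K_c R - Q_c||_F over orthogonal matrices.
    """
    # Query-norm weights (assumed form: proportional to ||q_i||, summing to ~1).
    w = np.linalg.norm(Q, axis=1)
    w = w / (w.sum() + eps)
    # Subtract the weighted centroids from queries and keys.
    Q_c = Q - w @ Q
    K_c = K - w @ K
    # SVD of the d x d cross-covariance K_c^T Q_c; the minimizer is U V^T.
    U, _, Vt = np.linalg.svd(K_c.T @ Q_c)
    return U @ Vt


def newton_schulz_rotation(M, iters=30):
    """SVD-free approximation of the orthogonal polar factor of a d x d matrix."""
    # Frobenius scaling keeps all singular values at or below 1.
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def aligned_attention(Q, K, V):
    """Recompute single-head attention with rotated keys; values are unchanged."""
    d = Q.shape[1]
    R = procrustes_rotation(Q, K)
    scores = Q @ (K @ R).T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ V
```

The Newton-Schulz variant converges to the same orthogonal factor \(UV^\top\) when the cross-covariance is full rank; the iteration count of 30 is an arbitrary choice for this sketch and would need tuning for ill-conditioned inputs.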
Text-aware Laplacian Propagation (TLP):
- Function: Performs graph-based smooth refinement of the initial logit field on a grid.
- Mechanism: The logit field is downsampled to an \(H_g \times W_g\) grid and a 4-connected graph is constructed. Let \(p_i\) denote the post-softmax class-probability vector at node \(i\) and \(G\) the inter-prototype similarity matrix. Each node's data-fidelity weight \(\rho_i\) is jointly determined by the peak probability \(\max_c p_{i,c}\) and the text-prior consistency \(u_i = p_i^\top G p_i\). Edge weights \(a_{ij}\) combine an image-gradient-based edge term \(b_{ij}^{img}\) with a text-consistency gate \(g_{ij} = p_i^\top G p_j\). Refined logits are obtained by minimizing the convex quadratic objective \(\mathcal{L}(F_g) = \frac{1}{2}\sum_i \rho_i \|F_{g,i} - Z_{g,i}\|^2 + \frac{\tau}{2}\sum_{(i,j)} a_{ij}\|F_{g,i} - F_{g,j}\|^2\), where \(Z_g\) is the downsampled logit field and \(\tau\) weights the smoothness term; the problem is solved efficiently on the small grid via conjugate gradient.
- Design Motivation: The inter-prototype similarity matrix \(G\) encodes class co-occurrence relationships, allowing semantically related categories to mutually reinforce each other during propagation; image gradients protect boundaries. This is more targeted than category-agnostic smoothing and more compact than multi-backbone pipelines.
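Setting the gradient of the quadratic objective to zero gives the linear system \((\mathrm{diag}(\rho) + \tau L)\,F_g = \mathrm{diag}(\rho)\,Z_g\), where \(L\) is the graph Laplacian built from the edge weights \(a_{ij}\). The sketch below solves it with a matrix-free conjugate gradient in NumPy. Several pieces are assumptions for illustration only: the fidelity weight \(\rho_i = \max_c p_{i,c} \cdot u_i\), the boundary term \(b_{ij}^{img} = \exp(-|I_i - I_j| / \kappa)\), the clipping of \(G\) to non-negative values, the default parameter values, and all function names.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def laplacian_apply(F, a_h, a_v):
    """Apply the weighted 4-connected graph Laplacian to an (H, W, C) field."""
    out = np.zeros_like(F)
    dh = (F[:, :-1] - F[:, 1:]) * a_h[..., None]  # horizontal edges
    out[:, :-1] += dh
    out[:, 1:] -= dh
    dv = (F[:-1] - F[1:]) * a_v[..., None]        # vertical edges
    out[:-1] += dv
    out[1:] -= dv
    return out


def conjugate_gradient(apply_A, b, x0, iters=300, tol=1e-8):
    """Plain CG for a symmetric positive-definite operator on array unknowns."""
    x = x0.copy()
    r = b - apply_A(x)
    p = r.copy()
    rs = np.sum(r * r)
    for _ in range(iters):
        if np.sqrt(rs) < tol:
            break
        Ap = apply_A(p)
        alpha = rs / np.sum(p * Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = np.sum(r * r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


def tlp_refine(Z, gray, text_protos, tau=1.0, kappa=0.1, temp=0.05):
    """Refine an (H, W, C) logit grid; gray is an (H, W) image on the same grid."""
    P = softmax(Z / temp)
    T = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)
    G = np.clip(T @ T.T, 0.0, None)             # inter-prototype similarity
    PG = P @ G
    u = np.sum(PG * P, axis=-1)                 # u_i = p_i^T G p_i
    rho = np.maximum(P.max(axis=-1) * u, 1e-6)  # assumed fidelity weight
    # Edge weight = image-boundary term * text-consistency gate (assumed forms).
    b_h = np.exp(-np.abs(gray[:, :-1] - gray[:, 1:]) / kappa)
    b_v = np.exp(-np.abs(gray[:-1] - gray[1:]) / kappa)
    a_h = b_h * np.sum(PG[:, :-1] * P[:, 1:], axis=-1)  # g_ij = p_i^T G p_j
    a_v = b_v * np.sum(PG[:-1] * P[1:], axis=-1)

    def apply_A(F):
        return rho[..., None] * F + tau * laplacian_apply(F, a_h, a_v)

    return conjugate_gradient(apply_A, rho[..., None] * Z, Z)
```

With non-negative edge weights and strictly positive \(\rho\), the system matrix is symmetric positive definite, so conjugate gradient applies; each refined logit is then a convex combination of the initial logits of the same class across the grid.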
Sliding Window Inference and Fusion:
- Function: Handles high-resolution images.
- Mechanism: The image is covered by overlapping windows; each window independently executes the PA + TLP pipeline to produce upsampled logits, which are then merged into a global coordinate system via weighted fusion.
- Design Motivation: Since ViT patch size is fixed (e.g., 16×16), directly processing large images yields insufficient resolution. Sliding windows preserve patch-level detail while overlapping regions smooth the seams.
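A minimal sketch of overlapping-window inference with fusion is shown below. It assumes uniform averaging in the overlap regions, which is only one possible choice of fusion weights, and a hypothetical callback `predict_window` that stands in for the per-window PA + TLP pipeline and returns logits at the crop's resolution. The window size and stride defaults are placeholders.

```python
import numpy as np


def window_starts(size, win, stride):
    """Start offsets that cover [0, size) with windows of length win."""
    if size <= win:
        return [0]
    starts = list(range(0, size - win + 1, stride))
    if starts[-1] != size - win:
        starts.append(size - win)  # final window sits flush with the border
    return starts


def sliding_window_logits(image, predict_window, num_classes, win=224, stride=112):
    """Fuse per-window logits into one (H, W, num_classes) field by averaging."""
    assert 0 < stride <= win, "stride must not exceed the window size"
    H, W = image.shape[:2]
    acc = np.zeros((H, W, num_classes))
    weight = np.zeros((H, W, 1))
    for top in window_starts(H, win, stride):
        for left in window_starts(W, win, stride):
            crop = image[top:top + win, left:left + win]
            logits = predict_window(crop)  # (h, w, num_classes)
            h, w = logits.shape[:2]
            acc[top:top + h, left:left + w] += logits
            weight[top:top + h, left:left + w] += 1.0
    return acc / weight
```

The final segmentation would then be `sliding_window_logits(...).argmax(axis=-1)`. Replacing the constant `1.0` with a center-weighted mask is a common alternative when seams remain visible.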

Loss & Training

PEARL requires no training: its hyperparameters (temperature \(\tau_s\), grid size \(H_g \times W_g\), edge detection threshold \(\kappa\), etc.) are preset constants rather than learned parameters, and the method is plug-and-play at inference time.

Key Experimental Results

Main Results

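Comparison of mIoU (%) on 8 standard OVSS benchmarks; all methods use CLIP ViT-B/16 without additional backbones. Seven of the eight benchmarks are listed; the Average row appears to be taken over all eight, since the listed rows alone average 45.6 for PEARL: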

| Dataset | PEARL | NACLIP | SFP | SCLIP | ClearCLIP |
|---|---|---|---|---|---|
| V21 (w/ bg) | 64.1 | 58.9 | 56.8 | 59.1 | 51.8 |
| PC60 (w/ bg) | 35.1 | 32.2 | 32.3 | 30.4 | 32.6 |
| Object (w/ bg) | 37.3 | 33.2 | 32.1 | 30.5 | 33.0 |
| V20 | 86.9 | 79.7 | 83.4 | 80.4 | 80.9 |
| PC59 | 38.6 | 35.2 | 36.0 | 34.1 | 35.9 |
| City | 37.6 | 35.5 | 34.1 | 32.2 | 30.0 |
| ADE | 19.4 | 17.4 | 18.1 | 16.1 | 16.7 |
| Average | 43.2 | 39.4 | 39.6 | 38.2 | 38.1 |

Compared to methods using auxiliary backbones: PEARL (43.2) outperforms CASS+DINOv3 (42.2) without requiring any auxiliary model.

Ablation Study

| Configuration | Avg. mIoU | V21 | PC59 | City |
|---|---|---|---|---|
| Vanilla CLIP (no PA, no TLP) | 13.8 | 18.6 | 11.2 | 6.7 |
| PA only | 40.6 | 59.2 | 35.3 | 35.0 |
| TLP only | 29.3 | 35.4 | 25.0 | 20.5 |
| PA + TLP (full) | 43.2 | 64.1 | 38.6 | 37.6 |

Effect of applying TLP as a plug-and-play module to other methods:

| Method | Original Avg. mIoU | +TLP Avg. mIoU | Gain |
|---|---|---|---|
| SCLIP | 38.2 | 42.2 | +4.0 |
| NACLIP | 39.4 | 42.3 | +2.9 |
| SFP | 39.6 | 41.5 | +1.9 |

Key Findings

- PA is the largest contributor, improving average mIoU from 13.8 to 40.6 (+26.8), which indicates that correcting the geometry at the attention source is critical.
- TLP is complementary to PA, adding a further 2.6 points on top of PA (40.6 to 43.2). Applied as a plug-and-play module to SCLIP, NACLIP, and SFP, it yields gains of 1.9 to 4.0 points.
- Even without any auxiliary backbone, PEARL surpasses methods that rely on DINOv2/DINOv3 (e.g., CASS at 42.2).
- On pixel accuracy (pAcc), PEARL achieves the best result among auxiliary-backbone-free methods at 67.2%, even surpassing CASS+DINOv3 (67.0%).
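The deltas quoted in these findings follow directly from the ablation and plug-and-play tables. The short check below recomputes them; the dictionary layout and variable names are illustrative only, and the numbers are copied from the tables above.

```python
# Average mIoU values copied from the ablation and plug-and-play tables.
ablation = {"vanilla": 13.8, "pa_only": 40.6, "tlp_only": 29.3, "full": 43.2}
plug_in = {"SCLIP": (38.2, 42.2), "NACLIP": (39.4, 42.3), "SFP": (39.6, 41.5)}

# Gain from PA alone, and from adding TLP on top of PA.
pa_gain = round(ablation["pa_only"] - ablation["vanilla"], 1)
tlp_on_top_of_pa = round(ablation["full"] - ablation["pa_only"], 1)

# Gain from TLP alone over vanilla CLIP, for comparison.
tlp_alone_gain = round(ablation["tlp_only"] - ablation["vanilla"], 1)

# Gain from attaching TLP to other training-free methods.
plug_in_gains = {
    name: round(after - before, 1) for name, (before, after) in plug_in.items()
}
```

Note that TLP alone adds far more over vanilla CLIP than it adds on top of PA, so the two modules overlap in what they repair rather than contributing additively.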

Highlights & Insights

- Procrustes Alignment is the central contribution of the paper: it applies an orthogonal rotation directly at the attention computation step, at negligible cost (one \(d \times d\) SVD per head) and with substantial impact (+26.8 average mIoU over vanilla CLIP). This paradigm of closed-form orthogonal alignment in feature space, as opposed to learning parameters, deserves broader exploration in other vision-language dense prediction tasks.
- Text as more than a classifier: PEARL repurposes the inter-prototype similarity matrix as a structural constraint for graph propagation. This insight is transferable to other dense prediction tasks, for example using text category relationships to constrain NMS or post-processing in open-vocabulary detection.
- The entire method is training-free, requires no external data, and uses no auxiliary models. It consists of only two steps with preset constants, which makes the design remarkably compact.

Limitations & Future Work

- A performance gap remains on fine-grained stuff categories in datasets such as ADE20K (19.4 vs. 19.7 for ProxyCLIP); CLIP's generic prompts have limited discriminative power for rare stuff categories.
- When semantically similar categories such as "tree" and "mountain" share low-frequency textures, the absence of depth or shape cues can cause confusion.
- The grid size must be manually tuned per dataset (224×224 for Cityscapes, 80×80 for others); adaptive grid scale selection could further improve performance.
- Procrustes alignment applies a single global orthogonal mapping per head, which may be too coarse for region-specific key-query mismatches; region-adaptive alignment could be more effective.

Comparison with Related Methods

- vs. NACLIP: NACLIP enhances locality by modifying the attention neighborhood mask, but suffers from severe fragmentation. PEARL addresses the geometric root cause directly, achieving better results without imposing artificial locality constraints.
- vs. CASS (CVPR'25): CASS employs DINOv2/DINOv3 auxiliary backbones for visual context graph construction, resulting in a complex pipeline that requires additional models. PEARL surpasses CASS+DINOv3 on average mIoU using only a single CLIP model in a far more compact design.
- vs. ProxyCLIP: ProxyCLIP also relies on DINOv2 for region grouping; it performs better on ADE but is weaker overall than PEARL.

Rating

Novelty: ⭐⭐⭐⭐ Applying orthogonal Procrustes alignment to correct CLIP self-attention is a genuinely novel angle; text-aware Laplacian propagation is also a meaningful contribution.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Eight benchmarks, comprehensive ablations, plug-and-play validation across methods, and supplementary pixel accuracy reporting—extremely complete.
Writing Quality: ⭐⭐⭐⭐⭐ A clear logical chain from observation → insight → method; mathematical derivations are rigorous and conceptually well-explained.
Value: ⭐⭐⭐⭐ Advances training-free OVSS performance beyond methods with auxiliary backbones, with high practical applicability.