PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation

Conference: CVPR 2026
arXiv: 2603.17520
Code: https://github.com/NUST-Machine-Intelligence-Laboratory/PCA-Seg
Area: Segmentation / Open-Vocabulary Segmentation
Keywords: open-vocabulary segmentation, parallel cost aggregation, expert-driven perception, feature orthogonalization, vision-language models

TL;DR

This paper revisits cost aggregation strategies and proposes PCA-Seg, a parallel architecture that replaces the conventional serial design. It integrates class-semantic and spatial-contextual information via an Expert-driven Perception Learning (EPL) module, and employs a Feature Orthogonalization Decoupling (FOD) strategy to reduce redundancy. PCA-Seg achieves state-of-the-art performance on 8 benchmarks with only 0.35M additional parameters per block.

Background & Motivation

Background: CLIP-based open-vocabulary semantic and part segmentation methods extract vision-language alignment cues from cost volumes through spatial and class aggregation.
Limitations of Prior Work: Existing methods adopt a serial structure, performing spatial aggregation before class aggregation (or vice versa). This causes knowledge interference: the earlier aggregation step alters the input distribution seen by the later step, and spatial aggregation may distort class semantics.
Key Challenge: Class-semantic and spatial-structural information must be captured simultaneously, yet serial processing allows one type of information to contaminate the other.
Goal: Design a parallel architecture to eliminate knowledge interference introduced by serial processing.
Key Insight: Class semantics and spatial structure represent knowledge along two orthogonal dimensions and should be processed independently before fusion.
Core Idea: Parallel cost aggregation + Expert-driven Perception Learning (EPL) for dual-branch fusion + Feature Orthogonalization Decoupling (FOD) to reduce redundancy.

Method

Overall Architecture

The cost volume is constructed as a similarity matrix between CLIP visual and text encoder features. Parallel spatial aggregation and class aggregation branches process the cost volume independently; the EPL module fuses their outputs, and the FOD strategy enforces orthogonality between the two feature streams.
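
To make the cost-volume step concrete, here is a minimal dependency-free sketch. It assumes the similarity is cosine similarity between L2-normalized embeddings, which is the usual choice for CLIP features but is not stated in this summary. The names `pixel_feats`, `text_feats`, and `cost_volume` are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cost_volume(pixel_feats, text_feats):
    """Cosine similarity between every pixel embedding and every class embedding.

    pixel_feats: H x W x D nested lists (hypothetical dense visual features)
    text_feats:  C x D nested lists (hypothetical per-class text embeddings)
    returns:     H x W x C nested lists, one similarity score per pixel and class
    """
    text_unit = [l2_normalize(t) for t in text_feats]
    volume = []
    for row in pixel_feats:
        out_row = []
        for pix in row:
            p = l2_normalize(pix)
            out_row.append([sum(a * b for a, b in zip(p, t)) for t in text_unit])
        volume.append(out_row)
    return volume

# Toy example: a 1x2 "image" with 2-D features and two classes.
pixels = [[[1.0, 0.0], [0.0, 2.0]]]
classes = [[3.0, 0.0], [0.0, 1.0]]
cv = cost_volume(pixels, classes)
```

In the toy example the first pixel aligns with the first class and the second pixel with the second class, so each pixel scores 1 for its matching class and 0 for the other.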

Key Designs

EPL (Expert-driven Perception Learning Module): A set of expert parsers extracts complementary features from multiple perspectives, while a coefficient mapper adaptively learns pixel-wise weights to integrate knowledge from both branches.
FOD (Feature Orthogonalization Decoupling Strategy): An orthogonalization loss constrains the cosine similarity between class-semantic features and spatial-structural features toward zero, ensuring the two knowledge streams remain non-redundant.
Parallel Architecture: Spatial aggregation and class aggregation operate on the cost volume independently, avoiding cascading effects.
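
The difference between the serial and parallel layouts can be sketched with toy stand-ins. Everything below is an illustrative assumption and not the paper's implementation. The spatial branch is a simple neighbourhood average and the class branch is a per-pixel mean subtraction over classes. The pixel-wise `gate` plays the role of the coefficient mapper's output but is passed in by hand instead of being learned.

```python
def spatial_aggregate(cost):
    """Stand-in spatial branch: average each class map over a pixel and its 4 neighbours."""
    H, W, C = len(cost), len(cost[0]), len(cost[0][0])
    out = [[[0.0] * C for _ in range(W)] for _ in range(H)]
    for i in range(H):
        for j in range(W):
            nbrs = [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            nbrs = [(a, b) for a, b in nbrs if 0 <= a < H and 0 <= b < W]
            for c in range(C):
                out[i][j][c] = sum(cost[a][b][c] for a, b in nbrs) / len(nbrs)
    return out

def class_aggregate(cost):
    """Stand-in class branch: centre each pixel's scores across the class dimension."""
    return [[[v - sum(pix) / len(pix) for v in pix] for pix in row] for row in cost]

def fuse(spatial, semantic, gate):
    """Pixel-wise convex combination; gate[i][j] in [0, 1] weights the spatial branch."""
    return [
        [
            [g * s + (1.0 - g) * k for s, k in zip(s_pix, k_pix)]
            for s_pix, k_pix, g in zip(s_row, k_row, g_row)
        ]
        for s_row, k_row, g_row in zip(spatial, semantic, gate)
    ]

def serial_block(cost):
    """Serial layout: the class branch only ever sees spatially aggregated input."""
    return class_aggregate(spatial_aggregate(cost))

def parallel_block(cost, gate):
    """Parallel layout: both branches read the same unmodified cost volume, then fuse."""
    return fuse(spatial_aggregate(cost), class_aggregate(cost), gate)

cost = [[[1.0, 3.0], [3.0, 5.0]]]   # 1 x 2 pixels, 2 classes
fused = parallel_block(cost, gate=[[0.5, 0.5]])
```

In `serial_block` the class branch receives whatever the spatial branch produced, whereas in `parallel_block` neither branch can alter the other's input. That is the property the parallel design relies on.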

Loss & Training

The training objective combines a segmentation loss with the orthogonalization decoupling loss, which pushes the cosine similarity between the two feature streams toward 0.
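
One plausible form of this objective is sketched below. The summary only states that the cosine similarity is driven toward zero. The squared penalty, the averaging over feature pairs, and the weight `lam` with its default value are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def orthogonalization_loss(class_feats, spatial_feats):
    """Mean squared cosine similarity over paired feature vectors.

    The value is 0 when every pair is orthogonal and 1 when every pair is
    parallel or anti-parallel. The squared form is an assumption.
    """
    sims = [cosine_similarity(c, s) for c, s in zip(class_feats, spatial_feats)]
    return sum(x * x for x in sims) / len(sims)

def total_loss(seg_loss, class_feats, spatial_feats, lam=0.1):
    """Segmentation loss plus a weighted orthogonalization term (lam is hypothetical)."""
    return seg_loss + lam * orthogonalization_loss(class_feats, spatial_feats)

orthogonal = orthogonalization_loss([[1.0, 0.0]], [[0.0, 1.0]])
redundant = orthogonalization_loss([[1.0, 0.0]], [[2.0, 0.0]])
combined = total_loss(0.5, [[1.0, 0.0]], [[2.0, 0.0]], lam=0.1)
```

Squaring penalizes positively and negatively correlated features alike. An absolute-value penalty would serve the same purpose.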

Key Experimental Results

Main Results

Dataset	Metric	PCA-Seg	DeCLIP	H-CLIP	PartCATSeg
A-150	mIoU	SOTA	2nd	3rd	—
PAS-20b	mIoU	SOTA	—	—	2nd
ADE20K-Part	mIoU	SOTA	—	—	2nd

Ablation Study

Configuration	A-150 mIoU (change vs. full model)	Note
Full PCA-Seg	SOTA (reference)	Parallel + EPL + FOD
Serial baseline	−1.5%	Knowledge interference
w/o FOD	−0.9%	Redundant dual-branch features
Single convolution replacing EPL	−0.2%	Insufficient fusion

Parameter Efficiency Analysis

Component	Extra Parameters	GPU Memory	mIoU Contribution
Parallel branches	0.25M	0.72G	+0.8%
EPL	0.08M	0.18G	+0.5%
FOD	0M (loss only)	0.06G	+0.9%
Total / block	0.35M	0.96G	+2.2%
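
The totals row can be checked against the component rows with a few lines of arithmetic. The figures below are copied from the table as given here. The dictionary keys are hypothetical labels.

```python
# Rows copied from the Parameter Efficiency Analysis table.
components = {
    "Parallel branches": {"params_M": 0.25, "memory_G": 0.72, "miou_gain": 0.8},
    "EPL":               {"params_M": 0.08, "memory_G": 0.18, "miou_gain": 0.5},
    "FOD":               {"params_M": 0.00, "memory_G": 0.06, "miou_gain": 0.9},
}
reported_total = {"params_M": 0.35, "memory_G": 0.96, "miou_gain": 2.2}

def column_sums(rows):
    """Sum each column over all component rows, rounded to two decimals."""
    keys = next(iter(rows.values())).keys()
    return {k: round(sum(r[k] for r in rows.values()), 2) for k in keys}

sums = column_sums(components)

# Columns where the component rows do not add up to the reported total.
mismatches = {
    k: (sums[k], reported_total[k])
    for k in sums
    if abs(sums[k] - reported_total[k]) > 1e-9
}
```

The memory and mIoU columns add up to the reported totals. The listed parameter counts sum to 0.33M against the reported 0.35M per block, and this summary does not say where the remaining 0.02M comes from.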

Key Findings

FOD yields a +0.9% mIoU gain, which suggests that the orthogonalization constraint reduces redundancy between the two branches.
Each parallel block introduces only 0.35M additional parameters and 0.96G GPU memory.
State-of-the-art performance is achieved on both semantic segmentation and part segmentation tasks.

Highlights & Insights

Identifying "serial knowledge interference" gives a concrete explanation for why the ordering of spatial and class aggregation matters in cost aggregation.
The FOD orthogonalization constraint is concise yet effective, and could plausibly be applied to other multi-branch architectures.
The minimal parameter overhead (0.35M per block) makes the method feasible for practical deployment.

Limitations & Future Work

The parallel architecture introduces a modest increase in computational cost.
The orthogonality assumption may be too strong: in some scenarios class semantics and spatial information are genuinely correlated, and enforcing orthogonality may discard useful information.
Validation is limited to open-vocabulary segmentation; instance segmentation and panoptic segmentation remain untested.
Cost volume construction depends on the quality of CLIP features, so CLIP's inherent limitations propagate to downstream tasks.
The weight of the FOD orthogonalization loss requires careful tuning; an excessively large value may suppress useful information.
The selection of the number of expert parsers in EPL lacks theoretical guidance.
Integration with recent SAM-based open-vocabulary segmentation methods has not been explored.

Comparison with Related Work

vs. CATSeg / PartCATSeg: These methods adopt serial architectures that suffer from knowledge interference; PCA-Seg's parallel design is intended to avoid it.
vs. DeCLIP: DeCLIP fine-tunes CLIP attention layers, whereas PCA-Seg innovates at the cost aggregation level.

Additional Discussion

The core contribution is reframing cost aggregation from a single serial pipeline into two dimensions, class-semantic and spatial-contextual, that are processed in parallel and then fused.
The experiments cover both semantic and part segmentation across 8 benchmarks, with comparisons against DeCLIP, H-CLIP, and PartCATSeg.
The modular design of the method facilitates extension to related tasks and new datasets.
The publicly released code supports reproducibility and follow-up research.
Compared to concurrent works, this paper demonstrates greater depth in problem formulation and comprehensiveness in experimental analysis.
The paper is logically structured, forming a complete loop from problem definition to method design to experimental validation.

Rating

Novelty: ⭐⭐⭐⭐ Parallel cost aggregation and FOD strategy are original contributions.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 8 benchmarks.
Writing Quality: ⭐⭐⭐⭐ Motivation figures are clear and problem definition is precise.
Value: ⭐⭐⭐⭐ Makes a practical contribution to open-vocabulary segmentation.