# CLIPSym: Delving into Symmetry Detection with CLIP
- Conference: ICCV 2025
- arXiv: 2508.14197
- Code: https://github.com/timyoung2333/CLIPSym
- Area: Multimodal VLM
- Keywords: Symmetry Detection, CLIP, Rotation Equivariance, Semantics-Aware Prompt Grouping, G-Convolution
## TL;DR
This paper proposes CLIPSym, the first method to leverage the multimodal understanding capability of pretrained CLIP for reflection and rotation symmetry detection. It introduces a Semantics-Aware Prompt Grouping (SAPG) strategy to integrate textual semantic cues and a decoder with theoretical rotation equivariance guarantees, achieving state-of-the-art results on three benchmarks: DENDI, SDRW, and LDRS.
## Background & Motivation
Symmetry is one of the most fundamental geometric cues in computer vision, with broad applications in object recognition, scene understanding, and image matching. Nevertheless, symmetry detection remains challenging due to the complexity and variability of real-world scenes.
Early methods rely on keypoint matching (e.g., SIFT descriptors) and perform poorly on complex symmetry patterns or under noise. Deep learning approaches (PMCNet, EquiSym) have made progress using equivariant convolutions, but the potential of learning-based methods remains underexplored because annotated symmetry datasets are small.
A key observation motivating this work: approximately 10% of image captions in the LAION-400M dataset contain words conveying shape/symmetry cues (e.g., "rectangle," "circle," "oval"), suggesting that CLIP's vision-language representations may encode useful symmetry-related knowledge. This raises the central question: how can pretrained vision-language models be leveraged to assist symmetry detection?
## Method
### Overall Architecture
- Input image \(I\) → CLIP image encoder extracts patch tokens \(Z_I\)
- Text prompt set \(\mathcal{T}\) → CLIP text encoder extracts text tokens \(Z_\mathcal{T}\)
- Decoder: FiLM modulation → Transformer + aggregation → rotation-equivariant upsampling → symmetry heatmap \(\hat{S}_I\)
- Output: a per-pixel probability map indicating, for each location, the likelihood of lying on a reflection axis or at a rotation center (a minimal pipeline sketch follows below)
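To make the data flow concrete, here is a minimal PyTorch-style sketch of the forward pass. The module names, constructor interface, and tensor shapes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CLIPSymSketch(nn.Module):
    """Minimal sketch of the CLIPSym forward pass (not the authors' code).

    clip_image / clip_text stand in for the pretrained CLIP encoders;
    film, transformer, and upsampler mirror the three decoder stages
    described above.
    """

    def __init__(self, clip_image, clip_text, film, transformer, upsampler, num_prompts):
        super().__init__()
        self.clip_image = clip_image      # image -> patch tokens Z_I
        self.clip_text = clip_text        # prompt set T -> text tokens Z_T
        self.film = film                  # FiLM: condition Z_I on one text token
        self.transformer = transformer    # per-prompt spatial reasoning
        self.upsampler = upsampler        # rotation-equivariant G-conv upsampler
        self.prompt_logits = nn.Parameter(torch.zeros(num_prompts))  # learned w_t

    def forward(self, image, prompts):
        z_img = self.clip_image(image)                    # (B, P, C) patch tokens
        z_txt = self.clip_text(prompts)                   # (M, C), one token per prompt
        # One text-conditioned stream per prompt, then weighted aggregation.
        streams = torch.stack(
            [self.transformer(self.film(z_img, z_t)) for z_t in z_txt], dim=0
        )                                                 # (M, B, P, C)
        w = torch.softmax(self.prompt_logits, dim=0)      # prompt weights w_t
        fused = torch.einsum("m,mbpc->bpc", w, streams)   # weighted average
        return self.upsampler(fused)                      # (B, 1, H, W) heatmap S_hat
```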
### Key Designs
- Semantics-Aware Prompt Grouping (SAPG):
- Challenge: "Symmetry" is a highly abstract concept, and CLIP training data is unlikely to contain descriptions such as "symmetry axes."
- Strategy: Construct a prompt set \(\mathcal{T} = \{t_1, t_2, ..., t_M\}\), where each prompt \(t_m\) is composed of \(K\) frequently occurring object category names in the dataset.
- Example: \(t_m\) = "apple cloud table" (\(K=3\))
- All images share the same prompt set (symmetry is a universal concept).
- Three design motivations:
- Frequent objects provide better initialization (CLIP has good alignment for common objects)
- Multi-prompt aggregation provides complementary semantic cues
- Fixed prompts provide consistent semantic anchors (embeddings are continuously updated during training to capture symmetry features)
- Optimal configuration: \(M=25, K=4\) (25 prompts, each with 4 words; a prompt-construction sketch appears after this list)
- Rotation-Equivariant Decoder:
- ① FiLM Modulation Block: Text tokens modulate image features via \(z_{p_{ij}|t} = \gamma(z_t) \odot z_{p_{ij}} + \beta(z_t)\)
- ② Transformer + Aggregation: Each text-conditioned stream independently passes through a Transformer to learn spatial dependencies, then all prompt tokens are aggregated via weighted averaging: \(\bar{z}_{p_{ij}} = \sum_{t \in \mathcal{T}} w_t \hat{z}_{p_{ij}|t}\)
- ③ Rotation-Equivariant Upsampler:
- Aggregated tokens are rearranged into 2D feature maps and lifted to the rotation-translation group \(\mathbb{Z}_M^2 \rtimes C_n\)
- Three layers of G-Conv followed by 4× bilinear upsampling
- Final heatmap is generated by mean pooling along the rotation dimension \(\theta\) (see the FiLM and G-Conv sketches after this list)
- Theoretical Equivariance Guarantee:
- Theorem: Decoder \(D\) is rotation-equivariant with respect to the \(C_4\) group: \(D(T_\theta Z_I, Z_\mathcal{T}) = R_\theta \hat{S}_I, \forall \theta \in C_4\)
- Proof proceeds in three steps: FiLM block is an element-wise operation (permutation equivariant) → Transformer is equivariant to permutation of token order → G-Conv is equivariant to the \(C_n\) group
- In practice, large-scale CLIP pretraining additionally confers robustness to rotations that are not multiples of 90° (i.e., outside \(C_4\))
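To make SAPG concrete, the sketch below assembles a fixed prompt set by sampling \(K\) object names per prompt. The object vocabulary and the random-sampling scheme are illustrative assumptions; the paper builds prompts from the most frequent object category names in the dataset:

```python
import random

def build_sapg_prompts(frequent_objects, m=25, k=4, seed=0):
    """Build the fixed, dataset-wide SAPG prompt set (illustrative sketch).

    Each of the m prompts concatenates k frequently occurring object
    category names, e.g. "apple cloud table" for k=3.
    """
    rng = random.Random(seed)
    return [" ".join(rng.sample(frequent_objects, k)) for _ in range(m)]

# Hypothetical vocabulary of frequent categories; the paper's reported
# optimum is M=25 prompts with K=4 words each.
objects = ["apple", "cloud", "table", "window", "wheel",
           "door", "bottle", "clock", "mirror", "leaf"]
prompt_set = build_sapg_prompts(objects, m=25, k=4)
```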
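The FiLM modulation block can likewise be sketched directly from the formula \(z_{p_{ij}|t} = \gamma(z_t) \odot z_{p_{ij}} + \beta(z_t)\); the linear layers for \(\gamma\) and \(\beta\) are an assumption about the parameterization:

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """FiLM modulation: z' = gamma(z_t) * z_p + beta(z_t) (sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Linear(dim, dim)
        self.beta = nn.Linear(dim, dim)

    def forward(self, patch_tokens, text_token):
        # patch_tokens: (B, P, C); text_token: (C,)
        g, b = self.gamma(text_token), self.beta(text_token)
        # Element-wise over tokens, hence permutation-equivariant,
        # which is the first step of the equivariance proof above.
        return g * patch_tokens + b
```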
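Finally, a simplified stand-in for the rotation-equivariant upsampler: a single \(C_4\) lifting G-Conv plus mean pooling over the rotation dimension, with a quick numerical check of the theorem's 90°-rotation equivariance. This sketches the general G-Conv technique, not the paper's three-layer upsampler:

```python
import torch
import torch.nn.functional as F

def c4_lifting_conv(x, weight):
    """Lift a planar feature map to the C4 group (simplified G-conv).

    x: (B, C_in, H, W); weight: (C_out, C_in, k, k) with odd k.
    Applies the kernel at all four 90° rotations -> (B, 4, C_out, H, W).
    """
    pad = weight.shape[-1] // 2
    outs = [F.conv2d(x, torch.rot90(weight, r, dims=(-2, -1)), padding=pad)
            for r in range(4)]
    return torch.stack(outs, dim=1)

def heatmap_from_lifted(x_lifted):
    """Mean-pool over the rotation dimension theta, as in step 3."""
    return x_lifted.mean(dim=1)

# C4 equivariance check: rotating the input 90° rotates the heatmap 90°.
x = torch.randn(1, 3, 16, 16)
w = torch.randn(8, 3, 3, 3)
out = heatmap_from_lifted(c4_lifting_conv(x, w))
out_rot = heatmap_from_lifted(c4_lifting_conv(torch.rot90(x, 1, dims=(-2, -1)), w))
print(torch.allclose(torch.rot90(out, 1, dims=(-2, -1)), out_rot, atol=1e-4))
```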
### Loss & Training
- An \(\alpha\)-focal loss is used to handle foreground/background pixel class imbalance: \(\mathcal{L}_{focal}(I) = \sum_{x,y} -\alpha'_{I_{xy}}(1-\hat{S}'_{I_{xy}})^\lambda \log(\hat{S}'_{I_{xy}})\) (a minimal implementation sketch follows this list)
- Both CLIP image and text encoders are fine-tuned (experiments show that fine-tuning both yields the best performance)
- Training for 500 epochs with the Adam optimizer; inputs are resized to 417×417
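A minimal sketch of the \(\alpha\)-focal loss above, assuming the standard focal-loss convention that the primed quantities select the predicted probability of each pixel's true class; the \(\alpha\) and \(\lambda\) values are illustrative, not the paper's:

```python
import torch

def alpha_focal_loss(pred, target, alpha=0.95, lam=2.0, eps=1e-6):
    """Alpha-balanced focal loss over the symmetry heatmap (sketch).

    pred:   (B, 1, H, W) predicted probabilities S_hat in [0, 1]
    target: (B, 1, H, W) binary ground-truth symmetry map
    S' = pred on foreground pixels and 1 - pred on background;
    alpha' up-weights the rare foreground class.
    """
    p = torch.where(target > 0.5, pred, 1.0 - pred).clamp(min=eps)
    a = torch.where(target > 0.5,
                    torch.full_like(pred, alpha),
                    torch.full_like(pred, 1.0 - alpha))
    return (-a * (1.0 - p) ** lam * torch.log(p)).sum(dim=(1, 2, 3)).mean()
```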
## Key Experimental Results
### Main Results
DENDI Dataset F1-score (%):
| Method | Pretraining | Reflection F1 | Rotation F1 |
|---|---|---|---|
| SymResNet | ImageNet | 30.7 | 11.9 |
| PMCNet | ImageNet | 53.8±0.5 | - |
| EquiSym | ImageNet | 61.7±0.6 | 22.0±0.7 |
| CLIPSym^{no-text} | CLIP | 63.7±0.3 | 17.7±0.2 |
| CLIPSym^{non-eq.} | CLIP | 62.9±0.2 | 24.2±0.1 |
| CLIPSym | CLIP | 66.5±0.2 | 25.1±0.1 |
SDRW + LDRS Dataset Reflection F1 (%):
| Method | SDRW | LDRS | Mixed |
|---|---|---|---|
| PMCNet | 40.8±0.4 | 30.5±0.5 | 33.8±0.2 |
| EquiSym | 48.2±0.1 | 37.7±0.1 | 41.1±0.1 |
| CLIPSym | 51.8±0.3 | 39.5±0.1 | 42.8±0.1 |
### Ablation Study
Prompt initialization strategies (DENDI Reflection F1):
| Prompt Type | Config | F1 |
|---|---|---|
| Single prompt: "reflection axis" | M=1 | 64.4 |
| Single prompt: "symmetry axes in the image" | M=1 | 64.8 |
| Single prompt: frequent objects | M=1, K=25 | 65.8 |
| Multi-prompt: frequent objects | M=25, K=1 | 65.3 |
| Multi-prompt: frequent objects | M=25, K=4 | 66.5 |
| Multi-prompt: frequent objects | M=25, K=16 | 65.9 |
| Multi-prompt: frequent objects | M=50, K=4 | 65.4 |
Impact of trainable components (DENDI Reflection F1; ✓ = fine-tuned, ✗ = frozen):
| Text Encoder | Image Encoder | F1 |
|---|---|---|
| ✗ | ✗ | 59.4 |
| ✓ | ✗ | 58.9 |
| ✗ | ✓ | 65.3 |
| ✓ | ✓ | 66.5 |
Equivariance evaluation (DENDI Reflection, ±45° random rotation):
| Method | Robustness↑ | Consistency↓ |
|---|---|---|
| PMCNet | 52.2 | 0.417 |
| EquiSym | 57.1 | 0.244 |
| CLIPSym^{non-eq.} | 58.3 | 0.093 |
| CLIPSym | 59.7 | 0.082 |
### Key Findings
- CLIP pretraining is critical: Training from scratch yields only F1 = 32.1, while CLIP pretraining boosts this to 66.5.
- Text encoder is effective: CLIPSym outperforms the no-text variant by 2.8 F1, demonstrating that textual semantics genuinely aids symmetry understanding.
- Image encoder is more critical: Freezing the image encoder causes a substantial drop (F1 falls to 58.9), whereas freezing the text encoder instead costs much less (65.3).
- Equivariant decoder is effective but not the primary contributor: The non-equivariant variant is already strong (62.9/24.2), largely due to the robustness conferred by CLIP pretraining.
- CLIP pretraining provides robustness beyond geometric equivariance: Even compared to the fully equivariant EquiSym, CLIPSym achieves superior consistency and robustness.
- 25 prompts × 4 words is the optimal configuration; more prompts (M=50) or longer prompts (K=16) both degrade performance.
## Highlights & Insights
- Novel use of cross-modal knowledge for geometric tasks: Symmetry is a purely geometric concept, yet it benefits from language modality knowledge, challenging the assumption that "geometric tasks require only geometric methods."
- Counter-intuitive SAPG design: Using object names (rather than symmetry-related words) as prompts proves more effective, because CLIP has better vision-semantic alignment for concrete objects.
- Theoretical equivariance guarantee: Strict \(C_4\) equivariance is proven mathematically rather than validated solely empirically.
- Implicit geometric knowledge in CLIP: Experiments demonstrate that CLIP does learn symmetry-related visual features through large-scale training.
## Limitations & Future Work
- Equivariance is only guaranteed for \(C_4\) (multiples of 90°); equivariance to arbitrary angles relies on the implicit robustness of CLIP pretraining.
- Computational cost (148.8 GFLOPs) is higher than EquiSym (114.0), though lower than PMCNet (167.7).
- Rotation detection F1 remains low (25.1), indicating that rotational symmetry detection is still an open problem.
- Only ViT-B/16 is used as the backbone; larger models (e.g., ViT-L/14) exhibit a slight performance decrease, which is not thoroughly analyzed.
- Fixed prompt sets may lack flexibility in specialized domains (e.g., medical imaging).
## Related Work & Insights
- The SAPG strategy of using frequent object names as prompts can be generalized to other scenarios that leverage CLIP priors.
- The hybrid architecture combining FiLM modulation, Transformer, and G-Conv is a valuable design reference.
- Geometric knowledge encoded in CLIP visual representations (symmetry, spatial relations, etc.) warrants further exploration.
- The equivariant decoder design pattern can be applied to other dense prediction tasks requiring equivariance.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First application of CLIP to symmetry detection; SAPG prompt design is highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, multiple baselines, and ablation variants; dataset scale is somewhat limited.
- Writing Quality: ⭐⭐⭐⭐⭐ Mathematical derivations are rigorous, equivariance proofs are complete, and the narrative is coherent.
- Value: ⭐⭐⭐⭐ Symmetry detection has practical applications in industrial vision and robotics.