Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning
Conference: ICLR 2026
Code: https://github.com/Longin-Yu/GeoPerceive
Area: multimodal_vlm
Keywords: Geometric Perception, Visual Language Models, Domain-Specific Language, Reinforcement Learning, DPO
TL;DR
This work introduces the GEOPERCEIVE benchmark (geometric perception evaluation based on unambiguous DSL) and the GEODPO framework (translator-guided reinforcement learning). It enables VLMs to maintain natural language output while utilizing an NL→DSL translator to calculate fine-grained reward signals, significantly enhancing geometric primitive perception and downstream reasoning capabilities.
Background & Motivation
Background: Geometric Problem Solving (GPS) is a critical application scenario for multimodal VLMs. Mainstream methods reason end to end over diagrams and text and measure model capability by final-answer accuracy. However, even state-of-the-art models like GPT-o3 and Qwen3-235B make basic perception errors, such as misidentifying "tangency" as "intersection" or missing key intersection points.
Limitations of Prior Work: First, existing GPS benchmarks (e.g., MathVista, GeoQA) conflate perception errors with reasoning errors, so they cannot assess geometric perception independently. Second, the DSLs used by AlphaGeometry and Inter-GPS suffer from "one-image-multiple-programs" ambiguity: the same figure corresponds to multiple semantically equivalent DSL programs, which makes precise program-level evaluation impossible. Furthermore, direct Supervised Fine-Tuning (SFT) on DSL output faces a permutation-equivalence explosion and pulls the model away from its pre-trained natural language distribution.
Key Challenge: Geometric perception needs an evaluation and training system that is unambiguous and can be generated automatically at scale. It also needs a training paradigm that bypasses the limitations of SFT while still using fine-grained DSL-level rewards for VLM alignment.
Goal: To independently measure and enhance VLM perception of geometric primitives (points, lines, circles) and their spatial relationships, rather than optimizing for end-to-end final answers.
Core Idea: Design a normalized DSL (GEODSL) as the unique formal representation of geometric figures. Use a pipeline of "VLM generates NL description → NL2DSL translator maps description back to DSL → Calculate F1 score against ground truth DSL" as the reward function. Apply DPO for preference alignment, ensuring the model outputs in natural language space while receiving DSL-level fine-grained supervision.
Method
Overall Architecture
The GEODPO system consists of three synergetic components: the GEOPERCEIVE data engine for automated (image, DSL) pair generation; an NL2DSL translator trained on synthetic corpora to map VLM natural language output back to DSL; and a DPO trainer that constructs preference pairs using fine-grained scores from the translator for VLM reinforcement alignment.
flowchart LR
A[GEODSL Generation Engine\nRandom Sampling Geometric Programs] --> B[Graphic Solver Engine\nGradient Descent Pixel Rendering]
B --> C[(GEOPERCEIVE\nImage+DSL Dataset)]
C --> D[VLM Generates\nNL Description]
C --> E[NL2DSL Translator\nQwen2.5-7B + LoRA]
D --> E
E --> F[DSL-level F1 Score]
F --> G[Preference Pair Construction\nwinner / loser]
G --> H[DPO Loss\nAlign VLM]
H --> D
Key Designs
1. GEODSL: Unambiguous Normalized Geometric DSL
Existing DSLs (AlphaGeometry, Inter-GPS, etc.) suffer from the "one-image-multiple-programs" issue—the same geometric relationship can be expressed with different construction sequences, leading to non-unique ground truths. GEODSL adopts a relational rather than constructive approach, representing figures as 4-tuples \(G = \langle P, L, C, R \rangle\) (Points, Lines, Circles, Relations). Point-curve associations are embedded within curve declarations, ensuring each image corresponds to a unique DSL program. Complexity is controllable, as program length scales linearly with element count.
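
As a rough illustration of why a relational encoding removes order ambiguity, the sketch below stores a figure as unordered sets. The class `Figure`, the helper `make_figure`, the curve names, and the relation label `perpendicular` are all hypothetical; the actual GEODSL syntax is not reproduced here, and only symmetric relations are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Figure:
    """Hypothetical container for G = <P, L, C, R>; not the real GEODSL syntax."""
    points: frozenset     # point names
    lines: frozenset      # (line name, frozenset of point names on the line)
    circles: frozenset    # (circle name, center name, frozenset of points on it)
    relations: frozenset  # (relation kind, frozenset of curve names), symmetric only


def make_figure(points, lines, circles, relations):
    """Store every declaration in a frozenset so declaration order is irrelevant."""
    return Figure(
        points=frozenset(points),
        lines=frozenset((name, frozenset(on)) for name, on in lines),
        circles=frozenset((name, center, frozenset(on)) for name, center, on in circles),
        relations=frozenset((kind, frozenset(args)) for kind, args in relations),
    )


# The same figure declared in two different orders.
g1 = make_figure(
    points=["A", "B", "C", "O"],
    lines=[("l1", ["A", "B"]), ("l2", ["B", "C"])],
    circles=[("c1", "O", ["A", "C"])],
    relations=[("perpendicular", ["l1", "l2"])],
)
g2 = make_figure(
    points=["O", "C", "B", "A"],
    lines=[("l2", ["C", "B"]), ("l1", ["B", "A"])],
    circles=[("c1", "O", ["C", "A"])],
    relations=[("perpendicular", ["l2", "l1"])],
)
same_figure = g1 == g2  # both orderings collapse to one representation
```

Points on a curve live inside that curve's declaration, mirroring the description above, so no separate "point lies on line" statement can be reordered or duplicated.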
2. GEOPERCEIVE Metrics: Weighted F1 based on Hungarian Matching
Given ground truth \(G\) and prediction \(\hat{G}\), similarity matrices are constructed for each primitive category (Point/Line/Circle/Relation). The maximum weight bipartite matching is solved via the Hungarian algorithm to calculate the F1 score for that category. The final score \(Score(G, \hat{G})\) is the equal-weighted average of the four category F1 scores. This metric is naturally robust to permutation equivalence.
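
The scoring step can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: it replaces the Hungarian algorithm with a brute-force search over permutations (same optimum, practical only for small sets), and it assumes precision and recall are the matched similarity mass divided by the number of predicted and ground-truth elements respectively. The function names and the toy similarity values are made up for the example.

```python
from itertools import permutations


def best_matching_weight(sim, n_gt, n_pred):
    """Maximum total similarity over one-to-one matchings (rows: GT, cols: prediction)."""
    if n_gt == 0 or n_pred == 0:
        return 0.0
    if n_gt <= n_pred:
        # Assign each ground-truth element to a distinct predicted element.
        return max(sum(sim[i][j] for i, j in enumerate(p))
                   for p in permutations(range(n_pred), n_gt))
    # More ground-truth than predicted elements: assign predictions to GT rows.
    return max(sum(sim[i][j] for j, i in enumerate(p))
               for p in permutations(range(n_gt), n_pred))


def category_f1(sim, n_gt, n_pred):
    """Soft F1 for one primitive category (assumed definition, see lead-in)."""
    if n_gt == 0 and n_pred == 0:
        return 1.0  # nothing to find, nothing predicted
    if n_gt == 0 or n_pred == 0:
        return 0.0
    matched = best_matching_weight(sim, n_gt, n_pred)
    precision, recall = matched / n_pred, matched / n_gt
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overall_score(per_category):
    """Equal-weighted mean of the per-category F1 scores."""
    f1s = [category_f1(sim, n_gt, n_pred) for sim, n_gt, n_pred in per_category.values()]
    return sum(f1s) / len(f1s)


# Toy example: each entry is (similarity matrix, number of GT items, number of predicted items).
example = {
    "point": ([[1.0, 0.0, 0.0], [0.0, 0.5, 0.0]], 2, 3),
    "line": ([[1.0]], 1, 1),
    "circle": ([], 0, 0),
    "relation": ([[0.0]], 1, 1),
}
score = overall_score(example)

# Reordering the predicted points does not change the point-level F1.
shuffled_points = [[0.0, 0.0, 1.0], [0.0, 0.5, 0.0]]
```

Because the matching is recomputed for every prediction, listing the same primitives in a different order yields the same score, which is the permutation robustness described above.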
3. Gradient Descent Graphic Solver Engine
Given a GEODSL program, the engine parameterizes geometric primitives (point coordinates, line coefficients, circle center/radius) and translates all geometric constraints into loss functions (e.g., squared distance of a point to a line). It adds regularization terms for visual plausibility (density, distribution, scale/boundary penalties), minimizes the total loss by iterative optimization in PyTorch, and renders the solved figure as a pixel image.
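
The constraint-to-loss idea can be shown on a single point that must lie on two lines. The sketch below is dependency-free Python with hand-derived gradients rather than the PyTorch autograd setup described above; it omits the regularization terms and the rendering step, and the names `point_line_loss` and `solve_point`, the learning rate, and the step count are illustrative choices.

```python
def point_line_loss(point, line):
    """Squared distance from a point (x, y) to the line a*x + b*y + c = 0."""
    x, y = point
    a, b, c = line
    residual = a * x + b * y + c
    return residual * residual / (a * a + b * b)


def solve_point(start, lines, lr=0.1, steps=500):
    """Plain gradient descent on the sum of squared point-to-line distances."""
    x, y = start
    for _ in range(steps):
        grad_x = grad_y = 0.0
        for a, b, c in lines:
            norm = a * a + b * b
            residual = a * x + b * y + c
            # d/dx and d/dy of residual^2 / norm
            grad_x += 2.0 * residual * a / norm
            grad_y += 2.0 * residual * b / norm
        x -= lr * grad_x
        y -= lr * grad_y
    return x, y


# Two constraints: the point lies on x - y = 0 and on x + y - 2 = 0.
constraints = [(1.0, -1.0, 0.0), (1.0, 1.0, -2.0)]
solved = solve_point((3.0, -2.0), constraints)
total_loss = sum(point_line_loss(solved, line) for line in constraints)
```

Starting from an arbitrary position, the point settles at the intersection of the two lines, which is the only location where both losses vanish. A full solver would optimize all primitives jointly and add the plausibility penalties to the same objective.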
4. Translator-Guided DPO Preference Alignment
The core strategy is that the VLM does not learn to output DSL directly (avoiding distribution drift and permutation explosion). Instead, it maintains NL output, and a separately trained NL2DSL translator (Qwen2.5-7B + LoRA, rank=4) maps the description back to DSL for scoring. For each training image, \(N_\text{samples}=10\) descriptions are sampled from the reference VLM and ranked by DSL scores. The top half are winners and the bottom half are losers. A minimum margin \(\delta_\text{min}=0.3\) is required to filter out uninformative pairs. The standard DPO loss is applied:
\[\mathcal{L}_\text{DPO} = -\mathbb{E}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(S_w \mid D)}{\pi_\text{ref}(S_w \mid D)} - \beta\log\frac{\pi_\theta(S_l \mid D)}{\pi_\text{ref}(S_l \mid D)}\right)\right]\]
to align the VLM towards generating geometrically accurate natural language descriptions.
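
A minimal sketch of the pair construction and the per-pair loss follows. Several details are assumptions rather than facts from the paper: every winner is paired with every loser that clears the margin (the exact pairing rule is not stated here), the default `beta=0.1` is a placeholder, and the log-probabilities are plain floats standing in for summed token log-likelihoods from the policy and reference models.

```python
import math


def build_pairs(scored, min_margin=0.3):
    """scored: list of (description, dsl_score). Returns (winner, loser) description pairs."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    half = len(ranked) // 2
    winners, losers = ranked[:half], ranked[-half:] if half else []
    # Assumed pairing rule: all winner/loser combinations with a large enough score gap.
    return [(w[0], l[0]) for w in winners for l in losers if w[1] - l[1] >= min_margin]


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log(sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])) for one pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Stable softplus(-margin): max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Four sampled descriptions with translator-derived scores (toy values).
samples = [("a", 0.9), ("b", 0.7), ("c", 0.5), ("d", 0.2)]
pairs = build_pairs(samples)

# When the policy equals the reference, the implicit reward margin is zero.
loss_at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the winner's likelihood under the policy lowers the loss.
loss_after_update = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

The margin filter drops pairs such as ("b", "c"), whose scores differ by only 0.2, so the preference signal comes from descriptions the translator scores as clearly better or worse.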
Key Experimental Results
Main Results (GEOPERCEIVE Main-test, In-domain Perception)
| Model | Method | Overall Score | Relative Δ vs Raw |
|---|---|---|---|
| Qwen2.5-VL 7B | Raw | 57.96 | — |
| Qwen2.5-VL 7B | SFT | 64.02 | +10.46% |
| Qwen2.5-VL 7B | GEODPO | 66.19 | +14.20% |
| InternVL3 8B | Raw | 58.44 | — |
| InternVL3 8B | SFT | 62.71 | +7.31% |
| InternVL3 8B | GEODPO | 67.41 | +15.35% |
| LLaVA-Next 7B | Raw | 41.01 | — |
| LLaVA-Next 7B | SFT | 51.10 | +24.60% |
| LLaVA-Next 7B | GEODPO | 51.86 | +26.46% |
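
The Δ column reports relative change with respect to the Raw score, not a difference in score points. The short check below recomputes it from the table values; `relative_gain` is an illustrative helper name.

```python
def relative_gain(raw, tuned):
    """Relative improvement over the raw model, in percent."""
    return (tuned - raw) / raw * 100.0


# (raw score, tuned score) taken from the table above.
rows = {
    ("Qwen2.5-VL 7B", "SFT"): (57.96, 64.02),
    ("Qwen2.5-VL 7B", "GEODPO"): (57.96, 66.19),
    ("InternVL3 8B", "GEODPO"): (58.44, 67.41),
    ("LLaVA-Next 7B", "GEODPO"): (41.01, 51.86),
}
gains = {key: round(relative_gain(raw, tuned), 2) for key, (raw, tuned) in rows.items()}
```

For example, Qwen2.5-VL 7B with GEODPO gains 8.23 points, which is about 14.2% of its raw score of 57.96.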
OOD and Downstream Reasoning
| Dataset | Model | Raw | GEODPO | Relative Δ |
|---|---|---|---|---|
| GEOPERCEIVE-OOD (Perception) | Qwen2.5-VL 7B | 58.14 | 60.28 | +3.68% |
| GEOPERCEIVE-OOD (Perception) | InternVL3 8B | 58.74 | 60.91 | +3.69% |
| MathVista Geometry (Reasoning) | Qwen2.5-VL 7B | — | — | +39.0% (overall gain as reported) |
Key Findings
- SFT leads to a performance decrease in the Constraint category (InternVL3 constraint F1 dropped by 6.32%), while GEODPO consistently improves it by +9.9% to +19.27%, indicating GEODPO's robustness for "fragile" relations.
- SFT yields almost zero gain on OOD sets (+0.46% or even −0.29%), whereas GEODPO maintains consistent positive gains, demonstrating the stronger generalization of RL preference alignment.
- The translator's performance drops significantly for circles and constraints as geometric complexity increases (iterations 4–5), identifying it as the main performance bottleneck.
Highlights & Insights
- Decoupling Perception and Reasoning: An independent perception benchmark distinguishes "VLM cannot see the image clearly" from "VLM cannot reason correctly," providing a new diagnostic tool.
- Translator as a Reward Bridge: Utilizing an NL→DSL translator to "graft" structured formal scoring onto natural language output avoids distribution drift. This is a general "cross-modal reward injection" idea applicable to other formal language tasks like chemical structures or code.
- Automated Data Pipeline: The generation and solver engines require zero human annotation and can generate geometry of arbitrary complexity, facilitating large-scale pre-training.
- SFT Negative Transfer: The observation that SFT can degrade performance in specific categories (constraints) serves as a warning against the assumption that generic SFT is suitable for all tasks.
Limitations & Future Work
- Translator F1 scores drop on high-complexity figures (many circles/constraints), directly limiting reward signal quality. Stronger translators or using the VLM itself as a translator could be explored.
- GEODSL currently covers Euclidean configurations; support for non-standard geometry (projective, coordinate geometry with values) is missing.
- The OOD dataset is small (100 samples annotated by 10 students); statistical confidence needs further verification.
- The DPO paradigm is sensitive to \(N_\text{samples}\), and the computational overhead of sampling (10 descriptions per image) may become a bottleneck in large-scale training.
Related Work & Insights
- vs SFT on DSL: Direct SFT to DSL output faces permutation explosion and distribution drift. GEODPO bypasses this by maintaining NL output and using external scoring.
- vs AlphaGeometry / Inter-GPS: These DSLs have "one-image-multiple-programs" ambiguity; the normalized GEODSL gives each figure a single ground-truth program, which is the basis for precise evaluation.
- vs MathVista / GeoQA (End-to-End): Existing benchmarks only consider the final answer; this work shows that perception is an independent bottleneck.
- vs RLVR (Verifiable Reward): Following the trend of "answer verification as reward" in math reasoning, this work extends the paradigm to geometric perception using "DSL-level structural matching."
Rating
- Novelty: ⭐⭐⭐⭐ The combination of unambiguous DSL and translator reward bridge is novel, filling a gap in geometric perception evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison across three model series, in-domain/OOD/downstream reasoning, and detailed ablation.
- Writing Quality: ⭐⭐⭐⭐ Clear problem definition, well-defined contributions, and rich visualizations.
- Value: ⭐⭐⭐⭐ The perception-reasoning decoupling framework and translator-guided reward approach offer strong insights for multimodal training.