Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning

Conference: ICLR 2026
Code: https://github.com/Longin-Yu/GeoPerceive
Area: multimodal_vlm
Keywords: Geometric Perception, Visual Language Models, Domain-Specific Language, Reinforcement Learning, DPO

TL;DR

This work introduces the GEOPERCEIVE benchmark (geometric perception evaluation based on unambiguous DSL) and the GEODPO framework (translator-guided reinforcement learning). It enables VLMs to maintain natural language output while utilizing an NL→DSL translator to calculate fine-grained reward signals, significantly enhancing geometric primitive perception and downstream reasoning capabilities.

Background & Motivation

Background: Geometric Problem Solving (GPS) is a critical application scenario for multimodal VLMs. Current mainstream methods employ end-to-end reasoning on graphics and text, measuring model capability via final answer accuracy. However, even state-of-the-art models like GPT-o3 and Qwen3-235B exhibit basic perception errors, such as misidentifying "tangency" as "intersection" or missing key intersection points.

Limitations of Prior Work: First, existing GPS benchmarks (e.g., MathVista, GeoQA) conflate perception errors with reasoning errors, failing to independently assess geometric perception. Second, existing DSLs like AlphaGeometry or Inter-GPS suffer from "one-image-multiple-programs" ambiguity, where the same graphic corresponds to multiple semantically equivalent DSLs, making precise program-level evaluation impossible. Furthermore, direct Supervised Fine-Tuning (SFT) on DSLs faces permutation equivalence explosion and causes models to deviate from pre-trained natural language distributions.

Key Challenge: There is a need for a geometric perception evaluation and training system that is both unambiguous and automatically generatable at scale. Simultaneously, a training paradigm is required to bypass SFT limitations and utilize fine-grained DSL-level rewards for VLM alignment.

Goal: To independently measure and enhance VLM perception of geometric primitives (points, lines, circles) and their spatial relationships, rather than optimizing for end-to-end final answers.

Core Idea: Design a normalized DSL (GEODSL) as the unique formal representation of geometric figures. Use a pipeline of "VLM generates NL description → NL2DSL translator maps description back to DSL → Calculate F1 score against ground truth DSL" as the reward function. Apply DPO for preference alignment, ensuring the model outputs in natural language space while receiving DSL-level fine-grained supervision.

Method

Overall Architecture

The GEODPO system consists of three synergetic components: the GEOPERCEIVE data engine for automated (image, DSL) pair generation; an NL2DSL translator trained on synthetic corpora to map VLM natural language output back to DSL; and a DPO trainer that constructs preference pairs using fine-grained scores from the translator for VLM reinforcement alignment.

flowchart LR
    A[GEODSL Generation Engine\nRandom Sampling Geometric Programs] --> B[Graphic Solver Engine\nGradient Descent Pixel Rendering]
    B --> C[(GEOPERCEIVE\nImage+DSL Dataset)]
    C --> D[VLM Generates\nNL Description]
    C --> E[NL2DSL Translator\nQwen2.5-7B + LoRA]
    D --> E
    E --> F[DSL-level F1 Score]
    F --> G[Preference Pair Construction\nwinner / loser]
    G --> H[DPO Loss\nAlign VLM]
    H --> D

Key Designs

1. GEODSL: Unambiguous Normalized Geometric DSL
Existing DSLs (AlphaGeometry, Inter-GPS, etc.) suffer from the "one-image-multiple-programs" issue—the same geometric relationship can be expressed with different construction sequences, leading to non-unique ground truths. GEODSL adopts a relational rather than constructive approach, representing figures as 4-tuples \(G = \langle P, L, C, R \rangle\) (Points, Lines, Circles, Relations). Point-curve associations are embedded within curve declarations, ensuring each image corresponds to a unique DSL program. Complexity is controllable, as program length scales linearly with element count.
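
To illustrate why a relational representation removes ordering ambiguity, here is a minimal sketch of the 4-tuple as order-insensitive sets. The concrete GEODSL syntax is not shown in these notes, so the `Figure` class, the `make_figure` helper, the field layout, and the treatment of every relation as symmetric are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Figure:
    """Hypothetical relational figure G = <P, L, C, R> (layout is an assumption)."""
    points: frozenset      # point names, e.g. {"A", "B"}
    lines: frozenset       # each line is the frozenset of points lying on it
    circles: frozenset     # each circle is (center, frozenset of points on it)
    relations: frozenset   # each relation is (name, frozenset of its arguments)


def make_figure(points, lines, circles, relations):
    """Normalize inputs into sets so declaration order never matters."""
    return Figure(
        points=frozenset(points),
        lines=frozenset(frozenset(line) for line in lines),
        circles=frozenset((center, frozenset(on)) for center, on in circles),
        # Simplification: arguments are stored as a set, i.e. relations are
        # assumed symmetric (true for "perpendicular", not for every relation).
        relations=frozenset(
            (name, frozenset(frozenset(arg) for arg in args))
            for name, args in relations
        ),
    )


# The same figure declared in two different orders.
fig_a = make_figure(
    points=["A", "B", "C", "O"],
    lines=[["A", "B"], ["B", "C"]],
    circles=[("O", ["A", "B", "C"])],
    relations=[("perpendicular", [["A", "B"], ["B", "C"]])],
)
fig_b = make_figure(
    points=["O", "C", "B", "A"],
    lines=[["C", "B"], ["B", "A"]],
    circles=[("O", ["C", "A", "B"])],
    relations=[("perpendicular", [["C", "B"], ["B", "A"]])],
)
same_figure = fig_a == fig_b  # both orders normalize to one representation
```

Note that the point-curve associations live inside the line and circle entries, mirroring the description above, so no separate "point lies on curve" statements are needed.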

2. GEOPERCEIVE Metrics: Weighted F1 based on Hungarian Matching
Given ground truth \(G\) and prediction \(\hat{G}\), similarity matrices are constructed for each primitive category (Point/Line/Circle/Relation). The maximum weight bipartite matching is solved via the Hungarian algorithm to calculate the F1 score for that category. The final score \(Score(G, \hat{G})\) is the equal-weighted average of the four category F1 scores. This metric is naturally robust to permutation equivalence.
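
The scoring idea can be sketched as follows. This is an illustrative approximation, not the paper's implementation: a brute-force search over assignments stands in for the Hungarian algorithm (it finds the same optimum on small inputs), and the way matched weight is turned into precision and recall, the handling of empty categories, and the function names are assumptions.

```python
from itertools import permutations


def best_matching_weight(sim, n_gt, n_pred):
    """Maximum total similarity over one-to-one matchings.

    sim[i][j] is the similarity between ground-truth item i and predicted
    item j. Brute force replaces the Hungarian algorithm for this sketch.
    """
    if n_gt == 0 or n_pred == 0:
        return 0.0
    if n_gt <= n_pred:
        # Assign each ground-truth item i to a distinct prediction p[i].
        return max(
            sum(sim[i][j] for i, j in enumerate(p))
            for p in permutations(range(n_pred), n_gt)
        )
    # More ground-truth items than predictions: assign each prediction j
    # to a distinct ground-truth item p[j].
    return max(
        sum(sim[i][j] for j, i in enumerate(p))
        for p in permutations(range(n_gt), n_pred)
    )


def category_f1(sim, n_gt, n_pred):
    """F1 for one primitive category from the matched similarity weight."""
    if n_gt == 0 and n_pred == 0:
        return 1.0  # assumption: an empty category predicted as empty is perfect
    matched = best_matching_weight(sim, n_gt, n_pred)
    if matched == 0:
        return 0.0
    precision = matched / n_pred
    recall = matched / n_gt
    return 2 * precision * recall / (precision + recall)


def overall_score(f1_by_category):
    """Equal-weighted average over the category F1 scores."""
    return sum(f1_by_category.values()) / len(f1_by_category)


f1s = {
    # 3 ground-truth points, 3 predicted, 2 of them correct.
    "point": category_f1([[1, 0, 0], [0, 1, 0], [0, 0, 0]], 3, 3),
    # Same 2 lines, listed in a different order.
    "line": category_f1([[0, 1], [1, 0]], 2, 2),
    # No circles in either program.
    "circle": category_f1([], 0, 0),
    # 1 ground-truth relation, 2 predicted, 1 correct.
    "relation": category_f1([[0, 1]], 1, 2),
}
score = overall_score(f1s)
```

Because the matching searches over assignments, listing the same lines in a different order still yields a line F1 of 1.0, which is the permutation robustness described above.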

3. Gradient Descent Graphic Solver Engine
Given a GEODSL program, the engine parameterizes geometric primitives (point coordinates, line coefficients, circle center/radius) and translates all geometric constraints into loss functions (e.g., squared distance of a point to a line). It incorporates regularization terms for visual plausibility (density, distribution, scale/boundary penalties) and renders pixel images via iterative optimization in PyTorch.
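
The constraint-to-loss idea can be shown on a toy problem. The sketch below is not the paper's engine: it uses plain Python with finite-difference gradients in place of PyTorch autograd, omits the regularization terms and the rendering step, and the specific constraints (a point on the line y = x and on a circle of radius 2 centred at the origin), the learning rate, and the step count are arbitrary choices for illustration.

```python
import math


def loss(params):
    """Sum of squared constraint violations for a single point P = (x, y)."""
    x, y = params
    on_line = (x - y) ** 2 / 2.0                 # squared distance from P to the line y = x
    on_circle = (math.hypot(x, y) - 2.0) ** 2    # squared gap between |P| and the radius 2
    return on_line + on_circle


def numeric_grad(f, params, eps=1e-6):
    """Central-difference gradient (stand-in for automatic differentiation)."""
    grad = []
    for i in range(len(params)):
        hi = list(params)
        lo = list(params)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def solve(f, init, lr=0.1, steps=500):
    """Plain gradient descent on the constraint loss."""
    params = list(init)
    for _ in range(steps):
        grad = numeric_grad(f, params)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


# Start from an arbitrary point in the positive quadrant.
x, y = solve(loss, [1.5, 0.2])
final_loss = loss([x, y])
```

Starting from this initial point, the solution settles near (√2, √2), where both constraints hold. In a full engine, every primitive's parameters would be optimized jointly, and the visual-plausibility penalties would be added to the same loss.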

4. Translator-Guided DPO Preference Alignment
The core strategy is that the VLM does not learn to output DSL directly (avoiding distribution drift and permutation explosion). Instead, it maintains NL output, and a separately trained NL2DSL translator (Qwen2.5-7B + LoRA, rank=4) maps the description back to DSL for scoring. For each training image, \(N_\text{samples}=10\) descriptions are sampled from the reference VLM and ranked by DSL scores. The top half are winners and the bottom half are losers. A minimum margin \(\delta_\text{min}=0.3\) is required to filter out uninformative pairs. The standard DPO loss is applied:

\[\mathcal{L}_\text{DPO} = -\mathbb{E}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(S_w|D)}{\pi_\text{ref}(S_w|D)} - \beta\log\frac{\pi_\theta(S_l|D)}{\pi_\text{ref}(S_l|D)}\right)\right]\]

where \(S_w\) and \(S_l\) are the winner and loser descriptions for the input \(D\), \(\pi_\theta\) is the policy being trained, and \(\pi_\text{ref}\) is the frozen reference VLM. This aligns the VLM towards generating geometrically accurate natural language descriptions.
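
A minimal sketch of the pair construction and the per-pair loss follows. Several details are assumptions rather than facts from the paper: winners are paired with losers exhaustively, the margin is compared on raw DSL scores, `beta=0.1` is a placeholder value, and the function names are hypothetical. Log-probabilities are passed in as plain floats instead of being computed by a model.

```python
import math


def build_pairs(scores, min_margin=0.3):
    """Rank samples by DSL score and pair top-half winners with bottom-half losers.

    Returns (winner_index, loser_index) tuples whose score gap is at least
    min_margin. Pairing every winner with every loser is an assumption.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    half = len(ranked) // 2
    winners, losers = ranked[:half], ranked[half:]
    return [
        (w, l)
        for w in winners
        for l in losers
        if scores[w] - scores[l] >= min_margin
    ]


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(margin)), in a numerically stable form."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Four sampled descriptions with translator-derived scores (made-up numbers).
scores = [0.9, 0.5, 0.7, 0.2]
pairs = build_pairs(scores)

# Loss when the policy equals the reference, and after the policy raises
# the winner's log-probability (made-up log-probabilities).
loss_at_init = dpo_loss(-5.0, -6.0, -5.0, -6.0)
loss_after_gain = dpo_loss(-4.0, -6.0, -5.0, -6.0)
```

In this toy run the pair (2, 1) is dropped because its score gap of 0.2 is below the margin, and the loss starts at log 2 when the policy and reference agree, then falls as the winner becomes more likely under the policy.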

Key Experimental Results

Main Results (GEOPERCEIVE Main-test, In-domain Perception)

| Model | Method | Overall Score | Δ vs Raw |
| --- | --- | --- | --- |
| Qwen2.5-VL 7B | Raw | 57.96 | |
| Qwen2.5-VL 7B | SFT | 64.02 | +10.46% |
| Qwen2.5-VL 7B | GEODPO | 66.19 | +14.2% |
| InternVL3 8B | Raw | 58.44 | |
| InternVL3 8B | SFT | 62.71 | +7.31% |
| InternVL3 8B | GEODPO | 67.41 | +15.35% |
| LLaVA-Next 7B | Raw | 41.01 | |
| LLaVA-Next 7B | SFT | 51.10 | +24.60% |
| LLaVA-Next 7B | GEODPO | 51.86 | +26.46% |
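
The Δ column reads as a relative gain over the Raw score rather than a difference in points, which can be checked from the reported numbers. The helper name below is hypothetical; the scores are taken from the table above.

```python
def relative_gain(raw, tuned):
    """Relative improvement over the raw score, in percent."""
    return (tuned - raw) / raw * 100.0


# Qwen2.5-VL 7B: Raw 57.96, SFT 64.02, GEODPO 66.19
qwen_sft = relative_gain(57.96, 64.02)
qwen_dpo = relative_gain(57.96, 66.19)

# InternVL3 8B: Raw 58.44, GEODPO 67.41
intern_dpo = relative_gain(58.44, 67.41)
```

These values agree with the reported +10.46%, +14.2%, and +15.35% to within rounding, so the absolute gains are smaller than the percentages suggest (for example, about 8.2 points for Qwen2.5-VL 7B with GEODPO).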

OOD and Downstream Reasoning

| Dataset | Model | Raw | GEODPO | Δ |
| --- | --- | --- | --- | --- |
| GEOPERCEIVE-OOD (Perception) | Qwen2.5-VL 7B | 58.14 | 60.28 | +3.68% |
| GEOPERCEIVE-OOD (Perception) | InternVL3 8B | 58.74 | 60.91 | +3.69% |
| MathVista Geometry (Reasoning) | Qwen2.5-VL 7B | | | +39.0% (Overall report) |

Key Findings

  • SFT lowers performance in the Constraint category (InternVL3 constraint F1 drops by 6.32%), while GEODPO consistently improves it by 9.9% to 19.27%, indicating that GEODPO is more robust on "fragile" relations.
  • SFT yields almost zero gain on OOD sets (+0.46% or even −0.29%), whereas GEODPO maintains consistent positive gains, demonstrating the stronger generalization of RL preference alignment.
  • The translator's performance drops significantly for circles and constraints as geometric complexity increases (iterations 4–5), identifying it as the main performance bottleneck.

Highlights & Insights

  • Decoupling Perception and Reasoning: The independent perception benchmark separates "the VLM cannot see the image clearly" from "the VLM cannot reason correctly," providing a new diagnostic tool.
  • Translator as a Reward Bridge: Utilizing an NL→DSL translator to "graft" structured formal scoring onto natural language output avoids distribution drift. This is a general "cross-modal reward injection" idea applicable to other formal language tasks like chemical structures or code.
  • Automated Data Pipeline: The generation and solver engines require zero human annotation and can generate geometry of arbitrary complexity, facilitating large-scale pre-training.
  • SFT Negative Transfer: The observation that SFT can degrade performance in specific categories (constraints) serves as a warning against the assumption that generic SFT is suitable for all tasks.

Limitations & Future Work

  • Translator F1 scores drop on high-complexity figures (many circles/constraints), directly limiting reward signal quality. Stronger translators or using the VLM itself as a translator could be explored.
  • GEODSL currently covers Euclidean configurations; support for non-standard geometry (projective, coordinate geometry with values) is missing.
  • The OOD dataset is small (100 samples annotated by 10 students); statistical confidence needs further verification.
  • The DPO paradigm is sensitive to \(N_\text{samples}\), and the computational overhead (10 samples per image) may become a bottleneck in large-scale training.

Related Work & Insights

  • vs SFT on DSL: Direct SFT to DSL output faces permutation explosion and distribution drift. GEODPO bypasses this by maintaining NL output and using external scoring.
  • vs AlphaGeometry / Inter-GPS: These DSLs have "one-image-multiple-programs" ambiguity; this work removes it with the normalized GEODSL, which makes precise program-level evaluation possible.
  • vs MathVista / GeoQA (End-to-End): Existing benchmarks only consider the final answer; this work shows that perception is an independent bottleneck.
  • vs RLVR (Verifiable Reward): Following the trend of "answer verification as reward" in math reasoning, this work extends the paradigm to geometric perception using "DSL-level structural matching."

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of unambiguous DSL and translator reward bridge is novel, filling a gap in geometric perception evaluation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison across three model series, in-domain/OOD/downstream reasoning, and detailed ablation.
  • Writing Quality: ⭐⭐⭐⭐ Clear problem definition, well-defined contributions, and rich visualizations.
  • Value: ⭐⭐⭐⭐ The perception-reasoning decoupling framework and translator-guided reward approach offer strong insights for multimodal training.