# Chart Deep Research in LVLMs via Parallel Relative Policy Optimization
- Conference: ICLR 2026
- arXiv: 2603.06677
- Code: To be confirmed
- Area: Others
- Keywords: chart understanding, deep research, RLHF, policy optimization, benchmark
## TL;DR
This paper proposes PRPO (Parallel Relative Policy Optimization), which addresses two GRPO training bottlenecks, interference among multi-dimensional reward signals and gradient conflicts across heterogeneous data, through two-level parallel decoupled optimization across reward dimensions and data types. It also introduces MCDR-Bench, which leverages an "error uniqueness principle" to transform subjective generation evaluation into objective error identification, enabling quantitative assessment of chart deep research capabilities.
## Background & Motivation
Background: Chart understanding has evolved from simple data extraction to reasoning and analysis. Existing methods (ChartQA, PlotQA, etc.) primarily handle shallow tasks — visual recognition and factual QA — while capabilities for genuine "deep research" (trend analysis, causal reasoning, strategic recommendations) remain severely underdeveloped.
Limitations of Prior Work: (a) Training bottleneck — Chart deep research requires simultaneous mastery of background knowledge integration, fact extraction, relation construction, deep reasoning, and predictive planning, yet GRPO compresses multi-dimensional rewards into a single scalar, causing signal interference and mutual cancellation; gradient conflicts from heterogeneous data allow simple tasks to dominate training. (b) Evaluation bottleneck — Existing benchmarks only assess factual QA and cannot evaluate end-to-end analytical reasoning; subjective generation tasks incur high annotation costs and exhibit large answer diversity.
Key Challenge: Tension between coordinated multi-dimensional capability development and single-objective optimization — GRPO aggregates all dimensional rewards into one scalar, compressing variance and weakening the discriminative power of optimization signals, preventing balanced development across dimensions.
Goal: (a) How to achieve balanced training under multi-dimensional rewards and heterogeneous data? (b) How to objectively evaluate chart deep research capabilities?
Key Insight: Introducing the concept of "parallelism" into policy optimization — parallel optimization across reward dimensions and data capability partitions — to decouple the sources of conflict. On the evaluation side, controlled error injection transforms subjective generation into objective classification.
Core Idea: Building on GRPO, PRPO introduces two-level parallel decoupling (Reward-PRPO decomposing reward dimensions + Data-PRPO partitioning data types) to eliminate signal interference and gradient conflicts in multi-dimensional training.
## Method
### Overall Architecture
PRPO is a unified framework combining Reward-PRPO and Data-PRPO. Given a chart and question as input, the model generates a deep analysis. During training: (1) Data-PRPO partitions training samples by capability dimension (visual understanding, logical reasoning, data analysis, etc.) and computes advantages independently within each partition; (2) Reward-PRPO separately computes advantages for each reward dimension (background knowledge, factual accuracy, relation construction, reasoning depth, prediction quality) within each partition and performs weighted optimization. For evaluation, MCDR-Bench transforms generation evaluation into error identification through a 5-stage annotation pipeline with controlled error injection.
### Key Designs
- Reward-PRPO (Reward Dimension Parallelism):
    - Function: Decomposes optimization along reward dimensions, computing advantages independently per dimension.
    - Mechanism: For \(K\) reward dimensions, advantages are computed as \(\hat{A}_i^{(k)} = (R_i^{(k)} - \bar{R}^{(k)}) / \sigma^{(k)}\), then combined with weighted aggregation: \(J_{\text{Reward-PRPO}} = \sum_{k=1}^K \lambda_k \mathbb{E}[\cdots L_{\text{clip}}(r_{i,t}, \hat{A}_i^{(k)})]\)
    - Design Motivation: GRPO compresses multi-dimensional rewards as \(R_i = \sum_k R_i^{(k)}\), causing advantages in some dimensions to be cancelled by disadvantages in others. Reward-PRPO preserves independent optimization signals per dimension, allowing the model to learn each dimension separately.
- Data-PRPO (Data Type Parallelism):
    - Function: Partitions data by capability dimension, normalizing advantages independently within each partition.
    - Mechanism: A `capability_uid` assigns samples to \(M\) capability partitions \(\{P(Q^{(m)})\}_{m=1}^M\), with partition-level statistics for normalization: \(\hat{A}_i^{(m)} = (R_i - \bar{R}^{(m)}) / \sigma^{(m)}\)
    - Outlier Handling: Iteratively detects samples where \(|\hat{A}_i^{(t)}| > \tau\) and demotes them to rollout-level individual optimization, preventing outliers from corrupting partition statistics.
    - Design Motivation: Reward distributions vary drastically across capability dimensions (simple visual recognition vs. complex causal reasoning). Global normalization allows high-variance simple tasks to dominate gradients; partition-level normalization ensures each capability type competes within its own scale.
- MCDR-Bench (Evaluation Framework):
    - Function: A benchmark for quantitative evaluation of chart deep research capabilities.
    - Construction Pipeline: Phase 1 — 5-stage multi-agent annotation (background acquisition → fact extraction → relation construction → deep research report → predictive planning) with human review; Phase 2 — controlled error injection based on the "error uniqueness principle," transforming subjective generation into objective error identification.
    - Scale: 1,021 high-complexity charts → 3,084 high-difficulty samples covering 5 capability dimensions.
    - Design Motivation: Subjective generation tasks are difficult to score objectively. By injecting a single known error into an otherwise correct report and asking the model to identify it, the task becomes objectively decidable and enables precise diagnosis of capability deficiencies per dimension (see the sketch after this list).
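A minimal sketch of the error-injection protocol, under stated assumptions: the verified report is modeled as a list of sentences, and `inject_error`, `perturb_number`, and `score_sample` are illustrative names rather than the authors' pipeline (the paper's actual error taxonomy and prompt format may differ).

```python
import random
import re

def perturb_number(sentence: str) -> str:
    """One illustrative corruption: shift the first number in the sentence."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 5), sentence, count=1)

def inject_error(report_sentences, corrupt_fn, rng=random):
    """Error uniqueness principle: corrupt exactly one sentence of a
    human-verified report, so the injected error is the unique ground truth."""
    gold_idx = rng.randrange(len(report_sentences))
    corrupted = list(report_sentences)
    corrupted[gold_idx] = corrupt_fn(corrupted[gold_idx])
    return corrupted, gold_idx

def score_sample(predicted_idx, gold_idx):
    """Subjective generation becomes objective classification: the model is
    asked which sentence is wrong; accuracy is exact match on the index."""
    return float(predicted_idx == gold_idx)

# Example: build one evaluation sample from a verified two-sentence report.
report = ["Revenue increased 12% year over year.", "Margins held at 30%."]
corrupted, gold = inject_error(report, perturb_number)
# The evaluated model reads the chart plus `corrupted` and returns an index;
# here we pretend it answered correctly.
print(score_sample(gold, gold))  # 1.0
```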
### Loss & Training
The unified PRPO objective: for partition \(m\) and reward dimension \(k\), the advantage is \(\hat{A}_i^{(k,m)} = (R_i^{(k)} - \bar{R}^{(k,m)}) / \sigma^{(k,m)}\), and the total objective is a two-level weighted sum: \(J_{\text{PRPO}} = \sum_m \lambda_m \sum_k \lambda_k \mathbb{E}[\cdots L_{\text{clip}}(r_{i,t}, \hat{A}_i^{(k,m)})]\). The base model is Qwen2.5-VL-7B-Instruct. A minimal sketch of this two-level advantage computation follows.
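The sketch below assumes `rewards` is an \(N \times K\) array of per-rollout, per-dimension rewards and `partition_ids` holds the `capability_uid` labels. The demotion step re-normalizes flagged rollouts against the whole batch, which is one plausible reading of "rollout-level individual optimization," and partition weights \(\lambda_m\) are taken as uniform for brevity; this follows the formulas above, not the authors' released code.

```python
import numpy as np

def prpo_advantages(rewards, partition_ids, lambda_k, tau=3.0, eps=1e-6):
    """Two-level parallel advantages.
    rewards:       (N, K) rewards, one column per reward dimension k
    partition_ids: (N,)  capability partition index m per rollout (capability_uid)
    lambda_k:      (K,)  per-dimension weights
    Returns the (N,) weighted advantages sum_k lambda_k[k] * A_i^{(k,m)}."""
    N, K = rewards.shape
    adv = np.zeros_like(rewards, dtype=float)
    for m in np.unique(partition_ids):
        idx = np.where(partition_ids == m)[0]   # Data-PRPO: partition-level stats
        for k in range(K):                      # Reward-PRPO: dimension-level stats
            r = rewards[idx, k]
            adv[idx, k] = (r - r.mean()) / (r.std() + eps)
        # Outlier demotion (illustrative reading): rollouts with |A| > tau in any
        # dimension are re-normalized against the full batch so they stop
        # distorting the partition statistics.
        out = idx[np.abs(adv[idx]).max(axis=1) > tau]
        for k in range(K):
            r_all = rewards[:, k]
            adv[out, k] = (rewards[out, k] - r_all.mean()) / (r_all.std() + eps)
    return adv @ np.asarray(lambda_k)

def grpo_advantages(rewards, eps=1e-6):
    """GRPO baseline for contrast: dimensions are summed into one scalar and
    normalized once globally, so a gain in one dimension can be cancelled by
    a loss in another before normalization."""
    r = rewards.sum(axis=1)
    return (r - r.mean()) / (r.std() + eps)
```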
## Key Experimental Results
### Main Results (MCDR-Bench)
Column abbreviations follow the five capability dimensions of the annotation pipeline: BG = background knowledge, FE = fact extraction, RL = relation construction, DR = deep research, F/P = forecasting/planning.

| Model | BG | FE | RL | DR | F/P | Overall |
|---|---|---|---|---|---|---|
| GPT-4o | 27.2 | 21.9 | 41.0 | 47.5 | 60.0 | 35.8 |
| Claude-3.7 Sonnet | 68.8 | 57.3 | 89.5 | 85.0 | 87.0 | 75.0 |
| Gemini-2.5-Pro | 81.2 | 87.3 | 91.4 | 93.8 | 93.0 | 89.3 |
| Qwen2.5-VL-7B (base) | 23.4 | 39.4 | 51.0 | 37.6 | 45.6 | 40.0 |
| + GRPO | 41.2 | 51.7 | 75.4 | 66.1 | 77.4 | 61.7 |
| + PRPO | 50.7 | 61.4 | 81.8 | 72.8 | 84.0 | 69.6 |
| + PRPO Think | 62.9 | 65.2 | 88.9 | 80.9 | 87.2 | 76.3 |
### Ablation Study (ChartQAPRO Cross-Validation)
| Configuration | Factoid | MCQ | Conv. | FactChk | Hypo. | Overall |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B base | 27.5 | 37.9 | 55.2 | 46.7 | 44.4 | 36.3 |
| + ChartReasoner-GRPO | - | - | - | - | - | 40.0 |
| + PRPO | 36.2 | 50.5 | 49.6 | 53.3 | 53.7 | 43.0 |
### Key Findings
- PRPO comprehensively outperforms GRPO: On MCDR-Bench, PRPO exceeds GRPO by +7.91% (direct) and +13.26% (Think), with consistent improvements across all 5 dimensions.
- Think mode amplifies gains: PRPO + Think further improves over direct PRPO by +6.64%, indicating that PRPO-trained models benefit more from chain-of-thought reasoning.
- 7B model approaches proprietary large models: PRPO Think's 76.3% surpasses Claude-3.7 Sonnet (75.0%) and trails Gemini-2.5-Pro (89.3%) by 13 points, despite the base model being 10–100× smaller.
- Cross-benchmark generalization: PRPO also transfers to ChartQAPRO, lifting the base model from 36.3 to 43.0 overall and beating ChartReasoner-GRPO (40.0), ruling out overfitting to MCDR-Bench.
- Largest gain in FE (Fact Extraction): From 39.4 → 61.4 (+22.0), indicating that PRPO's dimension-wise optimization most significantly benefits information extraction.
## Highlights & Insights
- "Parallel decoupling" is a general strategy for multi-dimensional optimization conflicts: Reward-PRPO decouples along reward dimensions; Data-PRPO decouples along data types. This design philosophy is transferable to any multi-objective RL scenario (e.g., correctness vs. efficiency vs. safety in code generation).
- Error injection evaluation paradigm is elegant: Transforming subjective generation into objective classification reduces annotation costs while enabling fine-grained diagnosis. This evaluation approach generalizes to any long-form generation task (e.g., RAG accuracy, report quality assessment).
- Outlier demotion mechanism is practically useful: Data-PRPO is not a rigid partition — samples detected as unsuitable for their current partition are automatically demoted to individual optimization, balancing partition efficiency with per-instance fairness.
## Limitations & Future Work
- Single base model: All experiments use Qwen2.5-VL-7B. Effectiveness on larger models (72B+) or different architectures remains unvalidated.
- Manually defined capability partitions: Data-PRPO's `capability_uid` requires predefined capability categories; automatic discovery of capability partitions is a promising direction for improvement.
- Sensitivity of reward dimension weights \(\lambda_k\): The paper does not thoroughly discuss weight sensitivity. Adaptive adjustment of dimension weights (e.g., based on per-dimension convergence rates) may yield further gains; a toy sketch follows this list.
- Chart domain only: Although PRPO's parallel optimization idea is general, experiments are limited to charts — validation in general multi-task VLM training is warranted.
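As a purely hypothetical sketch of the adaptive-weighting idea suggested above (not from the paper): downweight reward dimensions whose mean reward is still improving quickly, shifting optimization signal toward slow-converging capabilities. The function name and update rule are invented for illustration.

```python
import numpy as np

def adapt_lambdas(lambda_k, mean_reward_prev, mean_reward_now, lr=0.1):
    """Hypothetical rule: dimensions with fast recent progress are
    exponentially downweighted; weights are renormalized to sum to 1."""
    progress = np.maximum(
        np.asarray(mean_reward_now) - np.asarray(mean_reward_prev), 0.0
    )
    new = np.asarray(lambda_k) * np.exp(-lr * progress)
    return new / new.sum()
```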
## Related Work & Insights
- vs. GRPO/DAPO: GRPO uses group-level normalization but a single reward scalar. DAPO addresses entropy collapse but does not handle multi-dimensional conflicts. PRPO's key addition is "two-level parallelism" — across dimensions and data types.
- vs. ChartReasoner: ChartReasoner applies SFT+GRPO for structured reasoning. PRPO does not modify the reasoning structure, only the optimization strategy — making it more lightweight and general.
- vs. PPO/DPO: PPO requires an additional value model; DPO avoids the reward model but does not naturally accommodate multi-dimensional rewards. PRPO natively supports multiple dimensions within the GRPO framework.
## Rating
- Novelty: ⭐⭐⭐⭐ Dual innovation of parallel decoupled training + error injection evaluation, though Reward-PRPO is essentially standard multi-objective optimization decomposition.
- Experimental Thoroughness: ⭐⭐⭐⭐ Dual validation on MCDR-Bench + ChartQAPRO with comprehensive comparisons to proprietary and open-source models, but lacks experiments on additional base models.
- Writing Quality: ⭐⭐⭐⭐ Problem analysis is clear and mathematical derivations are rigorous, though Sections 3–4 are slightly redundant in structure.
- Value: ⭐⭐⭐⭐ PRPO's parallel optimization idea offers general reference value for multi-dimensional RLHF training; MCDR-Bench fills the evaluation gap for chart deep research.