Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning

Conference: ICLR 2026
arXiv: 2602.20197
Code: https://github.com/zhh6425/CalibRL
Area: Multimodal VLM / Reinforcement Learning
Keywords: RLVR, hybrid-policy optimization, multimodal reasoning, entropy collapse, controllable exploration

TL;DR

CalibRL reframes expert data as a distribution calibration baseline (rather than a strict imitation target), and achieves fine-grained control over the exploration–exploitation trade-off in MLLM reasoning training via asymmetric LeakyReLU activation combined with advantage weighting. This addresses entropy collapse in RLVR and substantially outperforms GRPO/DAPO on tasks such as geometric reasoning.

Background & Motivation

Background: RLVR has become the dominant paradigm for enhancing MLLM reasoning capabilities (e.g., DeepSeek-R1), yet performance gains are often accompanied by a significant drop in policy entropy—entropy exhaustion has become a bottleneck for further improvement.

Limitations of Prior Work:
- Conventional entropy regularization encourages randomness but provides no directional guidance → low exploration efficiency
- In the SFT-then-RL paradigm, SFT anchors the policy to a static demonstration distribution → undermines subsequent RL exploration
- Directly injecting SFT supervision into hybrid-policy frameworks → distributional mismatch between the current policy and expert trajectories → high bias-variance → accelerated entropy collapse

Key Challenge: Expert data is a double-edged sword—it provides useful guidance but also compresses the policy distribution. Maximizing \(\pi_\theta(\tau^{expert})\) inevitably reduces the probability of other trajectories → overall entropy decreases.
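The compression effect can be illustrated with a toy example. This is a minimal sketch, assuming a policy over only four candidate trajectories that starts uniform, with trajectory 0 playing the role of the expert; the helper names `entropy` and `boost_expert` are hypothetical and not taken from the paper or its code.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def boost_expert(probs, expert_index, new_expert_prob):
    """Set the expert trajectory's probability and rescale the others proportionally."""
    rest = 1.0 - probs[expert_index]
    scale = (1.0 - new_expert_prob) / rest
    return [new_expert_prob if i == expert_index else p * scale
            for i, p in enumerate(probs)]

# Toy assumption: four trajectories, uniform start, index 0 is the expert.
policy = [0.25, 0.25, 0.25, 0.25]
for target in (0.25, 0.50, 0.75, 0.95):
    updated = boost_expert(policy, 0, target)
    print(f"pi(expert)={target:.2f}  entropy={entropy(updated):.3f}")
```

In this toy setup, every increase in the expert's probability takes mass away from the remaining trajectories, so the printed entropy shrinks step by step.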

Key Insight: Expert data should not be "imitated" as an absolute target, but rather used as a reference baseline for relative calibration—under-represented correct reasoning paths are reinforced, while overconfident incorrect predictions are suppressed.

Core Idea: Transform expert supervision from a rigid imitation signal into a fine-grained calibration mechanism, achieving directed and regulated exploration via a log-probability gap combined with asymmetric LeakyReLU gating.

Method

Overall Architecture

A controllable exploration loss term \(\mathcal{L}_{exploration}\) is introduced on top of GRPO. The key innovation lies in using the log-probability gap \(\Delta\ell_i = \log\pi_\theta(\tau_i^{policy}) - \log\pi_\theta(\tau_i^{expert})\) to measure the model's relative preference for its own responses versus expert responses, with asymmetric LeakyReLU activation controlling the magnitude of reinforcement and suppression.

Key Designs

Log-Probability Gap:
- Function: Measures the model's relative confidence in its own responses versus expert responses.
- Mechanism: \(\Delta\ell_i = \log\frac{\pi_\theta(\tau_i^{policy})}{\pi_\theta(\tau_i^{expert})}\). A positive value indicates the model favors its own answer; a negative value indicates lower confidence relative to the expert. This signal determines whether reinforcement or suppression is applied.
Asymmetric LeakyReLU Activation:
- Function: Asymmetrically controls the gradient for reinforcement and suppression.
- Mechanism: \(\mathcal{L}_{exploration} = |\hat{A}_i| \cdot \text{LeakyReLU}(-s_i \cdot \Delta\ell_i, \alpha)\), where \(s_i = +1\) for a correct response and \(-1\) for an incorrect one. When the activation input \(-s_i \cdot \Delta\ell_i\) is negative, the gradient is scaled by \(\alpha < 1\) → once the response probability crosses the expert baseline (above it for correct responses, below it for incorrect ones), the magnitude of further reinforcement or suppression is attenuated rather than cut off.
- Design Motivation: Pure ReLU completely cuts off gradients → useful signals are discarded; linear activation → cannot distinguish regions that do or do not require reinforcement; LeakyReLU strikes a balance between the two.
Advantage Weighting:
- Function: Scales updates by group-wise rarity.
- Mechanism: \(|\hat{A}_i|\) (absolute advantage value) serves as the weight. A rare correct response when most responses are incorrect receives a large weight → reinforced as an exploration signal. By modulating the update magnitude, rare but informative deviations are emphasized.
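The three designs above can be combined in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes sequence-level log-probabilities given as plain floats, a single expert response per prompt, binary correctness rewards, and a GRPO-style group-normalized advantage; the function names and the example numbers are hypothetical.

```python
import math

def leaky_relu(x, alpha):
    """LeakyReLU: identity for positive inputs, slope alpha for the rest."""
    return x if x > 0.0 else alpha * x

def group_advantages(rewards, eps=1e-6):
    """Assumed GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def exploration_loss(logp_policy, logp_expert, correct, advantage, alpha=0.5):
    """|A_i| * LeakyReLU(-s_i * delta_i, alpha) for one sampled response."""
    delta = logp_policy - logp_expert   # log-probability gap
    s = 1.0 if correct else -1.0        # +1 correct, -1 incorrect
    return abs(advantage) * leaky_relu(-s * delta, alpha)

# One prompt, four sampled responses; only the first one is correct (rare success).
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
logp_policy = [-12.0, -6.0, -9.0, -7.0]   # hypothetical log pi_theta(tau_i^policy)
logp_expert = -8.0                        # hypothetical log pi_theta(tau^expert)

for lp, r, a in zip(logp_policy, rewards, advantages):
    loss = exploration_loss(lp, logp_expert, r > 0.5, a)
    print(f"logp={lp:6.1f}  correct={r > 0.5}  |A|={abs(a):.3f}  loss={loss:.3f}")
```

In this example the single correct response has the largest \(|\hat{A}_i|\) and sits below the expert baseline, so it receives the strongest reinforcement; the incorrect response with log-probability -9.0 is already below the baseline, so its suppression term is scaled by \(\alpha\).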

Loss & Training

Total objective = GRPO clipped surrogate + \(\lambda \cdot \mathcal{L}_{exploration}\)
Key hyperparameters: \(\alpha=0.5\) (LeakyReLU slope), \(\lambda=0.1\) (exploration weight)
Baseline choice: using expert responses as the calibration baseline outperforms using the reference policy as the baseline (confirmed by ablation)
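As a rough sketch of how the two terms might be combined, the snippet below writes everything as a loss to be minimized. Several details are assumptions not stated in this summary: the clip range `clip_eps=0.2`, the negation of the clipped surrogate, and the plain mean over the group. The function names are hypothetical.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO/GRPO-style clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def total_loss(ratios, advantages, exploration_terms, lam=0.1, clip_eps=0.2):
    """Negated mean clipped surrogate plus lam times the mean exploration term."""
    n = len(ratios)
    policy_loss = -sum(
        clipped_surrogate(r, a, clip_eps) for r, a in zip(ratios, advantages)
    ) / n
    exploration = sum(exploration_terms) / n
    return policy_loss + lam * exploration

# Hypothetical values for a group of two responses.
print(total_loss(ratios=[1.5, 0.5], advantages=[1.0, -1.0],
                 exploration_terms=[4.0, -0.5]))
```

With \(\lambda=0.1\), the exploration term acts as a small correction on top of the GRPO update rather than a replacement for it.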

Key Experimental Results

Main Results (Geometric Reasoning, In-Domain Benchmarks)

Method	GeoEval↑	Geo3K↑	GeoQA↑	Avg.↑
GRPO	26.15	39.77	52.52	39.48
SFT+GRPO	6.00	18.64	40.98	21.87
DAPO	25.19	40.93	52.52	39.55
CalibRL	33.44	40.60	60.74	44.93
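The Avg. column and the in-domain gap to GRPO can be re-derived from the three benchmark scores. This is a small arithmetic check that only uses the numbers in the table above; the variable names are illustrative.

```python
# Scores copied from the table: (GeoEval, Geo3K, GeoQA).
scores = {
    "GRPO": (26.15, 39.77, 52.52),
    "SFT+GRPO": (6.00, 18.64, 40.98),
    "DAPO": (25.19, 40.93, 52.52),
    "CalibRL": (33.44, 40.60, 60.74),
}

# Unweighted mean over the three benchmarks, rounded as in the table.
averages = {method: round(sum(vals) / len(vals), 2) for method, vals in scores.items()}
gain_over_grpo = round(averages["CalibRL"] - averages["GRPO"], 2)

for method, avg in averages.items():
    print(f"{method:9s} Avg. = {avg:.2f}")
print(f"CalibRL - GRPO = {gain_over_grpo:+.2f} points")
```

The recomputed means match the Avg. column, and the gap between CalibRL and GRPO is a difference in average score (points), not a relative percentage.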

Ablation Study

LeakyReLU \(\alpha\)	Effect
0.3	Aggressive early exploration but unstable, entropy oscillation
0.5	Balanced entropy growth, no oscillation
0.8	Over-constrained, rapid entropy decay

Key Findings

SFT+GRPO performs worst: Directly combining SFT and RL leads to severe entropy collapse—supporting the necessity of the "imitation → calibration" paradigm shift.
CalibRL achieves the best performance both in-domain and out-of-domain: It outperforms GRPO by +5.45 points on the in-domain average (44.93 vs. 39.48), with better generalization as well.
Stable entropy curve: Policy entropy grows steadily throughout CalibRL training, whereas other methods exhibit continuous decline.
\(\alpha=0.5\) is the sweet spot: Too small → instability; too large → exploration is constrained.

Highlights & Insights

"Calibration over imitation" represents a profound reconceptualization of hybrid-policy RL—expert data should not be treated as a mandatory target to reach, but rather as a reference coordinate for measuring the current policy's deviation.
LeakyReLU provides an elegant gradient gating mechanism—a simple choice of activation function realizes adaptive control that reinforces when necessary and attenuates when sufficient.
The finding that "SFT+GRPO performs worst" carries important practical implications—naively stacking SFT and RL may cause them to cancel each other out.

Limitations & Future Work

Validation is limited to geometric reasoning—scenarios such as mathematical and code reasoning remain to be explored.
The LeakyReLU slope \(\alpha\) requires tuning—a more adaptive activation function design may be preferable.
The quality of expert data affects the reliability of the calibration baseline.
Integration with other RL variants (Dr.GRPO, CPPO) has not been explored.

vs. GRPO/DAPO: Standard policy optimization does not address entropy collapse; CalibRL maintains exploration through the calibration mechanism.
vs. LUFFY: Also a hybrid-policy framework but still relies on imitation-style supervision → distributional mismatch persists.
vs. SFT+GRPO: Direct sequential combination leads to catastrophic interference; CalibRL's calibration paradigm avoids this issue.

Rating

Novelty: ⭐⭐⭐⭐ The "calibration over imitation" paradigm is insightful, and the application of LeakyReLU is elegant.
Experimental Thoroughness: ⭐⭐⭐⭐ Ablations are thorough, but the task scope is narrow (primarily geometric reasoning).
Writing Quality: ⭐⭐⭐⭐ Problem analysis is in-depth, and theoretical motivation is clearly articulated.
Value: ⭐⭐⭐⭐ A practical solution to entropy collapse in RLVR, with broader implications for hybrid-policy training.