COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control¶
- Conference: AAAI 2026
- arXiv: 2601.06122
- Code: Unavailable (as of 2026-03)
- Area: Multimodal VLM / Agent / Reinforcement Learning
- Keywords: VLM-RL collaborative optimization, visual reinforcement learning, knowledge distillation, dynamic data filtering, autonomous driving
TL;DR¶
This paper proposes COVR, a bidirectional collaborative optimization framework for VLMs and RL agents: high-quality interaction data generated by RL is used to fine-tune the VLM, while the enhanced VLM in turn guides RL policy learning via action priors, achieving SOTA performance on CARLA and DMControl.
Background & Motivation¶
Visual RL suffers from poor sample efficiency under high-dimensional observation spaces. Existing VLM-assisted RL methods fall into two categories: (1) directly fine-tuning the VLM as a policy network (VPF), which incurs high computational cost and deployment difficulties; (2) freezing the VLM for knowledge distillation (DPL/APL/DGC), transferring VLM priors to lightweight policy networks. However, a critical limitation of the latter is that the VLM itself may lack sufficient domain knowledge for the target task, and a frozen VLM can propagate inaccurate reasoning, leading to negative guidance.
The core insight of this paper is that VLMs and RL agents possess highly complementary strengths — VLMs offer semantic reasoning and generalization, while RL agents can discover high-quality state-action pairs in specific scenarios. Accordingly, the paper argues for establishing a bidirectional enhancement loop rather than unidirectional knowledge transfer.
Core Problem¶
How can VLMs and RL agents mutually reinforce each other during training? Specifically: 1. RL training data is noisy and inconsistent — how can high-quality samples be selected for effective VLM fine-tuning? 2. Random RL exploration leads to vastly different high-reward actions under similar observations (e.g., "accelerate forward" vs. "decelerate and turn right" on a straight road) — how can such inconsistency be prevented from misleading VLM supervised fine-tuning?
Method¶
Overall Architecture¶
COVR is an iterative bidirectional optimization framework consisting of two alternating phases:
Phase 1 (VLM-Guided RL): The VLM receives the current visual observation and task prompt, infers action semantics, and converts them via string parsing into a continuous action \(a_{v,t}\). The RL policy network generates the raw action \(a_{r,t}\). During training, only \(a_{r,t}\) interacts with the environment, while \(a_{v,t}\) serves as an auxiliary supervision signal. At test time, the VLM is not required; only the lightweight policy network is used for inference, satisfying real-time requirements.
Phase 2 (RL-Tuned VLM): Trajectory data collected from RL interactions is filtered by EDDF and weighted by RALW, then used to fine-tune the VLM via LoRA, enhancing its semantic reasoning capability on the target task. The fine-tuned VLM subsequently provides more accurate action priors for the next round of RL training.
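The sketch below illustrates this alternating structure in Python. It is a schematic, not the authors' implementation (no code is released): `env`, `policy`, and `vlm` are duck-typed placeholders, `select_and_weight` stands in for the EDDF/RALW pipeline described next, and the numeric answer format assumed by `parse_vlm_action` is invented for illustration.

```python
import re


def parse_vlm_action(text, dims=2):
    """Convert the VLM's textual answer into a continuous action a_{v,t}.
    The numeric answer format (e.g. "steer=0.10, throttle=0.60") is an
    assumption for illustration; the paper's exact prompt and parser are not released."""
    nums = [float(x) for x in re.findall(r"-?\d+\.?\d*", text)]
    return nums[:dims] if len(nums) >= dims else None


def covr_loop(env, policy, vlm, task_prompt, total_steps,
              vlm_query_every=10, finetune_interval=5_000, cold_start_rounds=2):
    """Schematic COVR loop; env/policy/vlm are duck-typed placeholders and
    finetune_interval (in environment steps) is a guessed default."""
    buffer_f, a_v = [], None                 # D_f and the cached VLM action prior
    round_idx, next_ft = 0, finetune_interval
    obs = env.reset()
    for step in range(total_steps):
        # Phase 1: VLM-guided RL -- query the VLM sparsely, after a cold-start delay.
        if round_idx >= cold_start_rounds and step % vlm_query_every == 0:
            a_v = parse_vlm_action(vlm.generate(obs, prompt=task_prompt))
        a_r = policy.sample(obs)                         # raw RL action a_{r,t}
        next_obs, reward, done, _ = env.step(a_r)        # only a_r touches the environment
        policy.update(obs, a_r, reward, next_obs, vlm_prior=a_v)  # SAC loss + prior regularizer
        buffer_f.append((obs, a_r, reward))              # returns g_i are accumulated per episode
        obs = env.reset() if done else next_obs

        # Phase 2: RL-tuned VLM -- filter (EDDF), weight (RALW), then LoRA fine-tune.
        if step + 1 == next_ft:
            samples, weights = select_and_weight(buffer_f, policy)  # see EDDF/RALW sketch below
            vlm.lora_finetune(samples, weights)
            buffer_f.clear()                             # D_f is cleared after each round
            round_idx += 1
            next_ft += finetune_interval * (1 + round_idx)  # progressively longer interval
```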
Key Designs¶
- Exploration-Driven Dynamic Filter (EDDF): A dynamic data filtering module based on the degree of exploration. A dedicated buffer \(\mathcal{D}_f\) stores trajectory data \((o_i, a_{r,i}, g_i)\). The core procedure is: (a) apply Z-score normalization to return values; (b) dynamically adjust the filtering threshold \(\tau = \text{Median}(\mathcal{G}_z) + \text{Sigmoid}(\varepsilon_t) \cdot \text{IQR}(\mathcal{G}_z)\) based on policy entropy \(\varepsilon_t\). When policy entropy is high in early training, the threshold is relaxed to retain more potentially valuable low-return samples; as entropy decreases in later stages, the threshold tightens to prioritize high-return trajectories. This design is more flexible than fixed top-k filtering and adapts to different stages of RL training. (A code sketch of both EDDF and RALW follows this list.)
- Return-Aware Adaptive Loss Weight (RALW): An adaptive loss weighting module conditioned on returns. Returns of filtered samples are normalized to \([-1, 1]\); samples whose normalized return is negative receive zero weight (excluded from learning), while positive-return samples receive higher weights. Formally: \(\mathcal{L}_{\text{RALW}} = \frac{1}{N_v}\sum_{b=1}^{B} w_b \sum_{t=1}^{T} -\log p(y_{b,t} | \mathbf{x}_{b,<t})\), where \(w_b = \max(\bar{g}_b, 0)\). This encourages the model to prioritize learning high-reward behaviors while preserving the VLM's original capabilities.
- Adaptive Progressive Fine-Tuning: A progressive fine-tuning strategy in which the fine-tuning interval \(\psi_c\) grows with the iteration count \(c\): \(\psi_{c+1} = \psi_c + \psi_c \cdot c\). The VLM is updated frequently when the policy is unstable in early training, and less frequently as the policy converges. The buffer \(\mathcal{D}_f\) is cleared after each fine-tuning round.
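Both EDDF and RALW reduce to a few lines of array arithmetic. The following is a minimal NumPy/PyTorch sketch under stated assumptions: samples at or above the threshold \(\tau\) are kept, the \([-1, 1]\) normalization is implemented as min-max scaling, and \(N_v\) is taken to be the number of positively weighted samples; none of these details are confirmed beyond the formulas summarized above.

```python
import numpy as np
import torch
import torch.nn.functional as F


def eddf_select(returns, policy_entropy):
    """EDDF: keep samples whose Z-scored return clears a dynamic threshold.
    Assumes samples at or above tau are retained (keep rule is an assumption)."""
    g = np.asarray(returns, dtype=np.float64)
    g_z = (g - g.mean()) / (g.std() + 1e-8)                       # Z-score normalization
    q1, q3 = np.percentile(g_z, [25, 75])
    tau = np.median(g_z) + (1.0 / (1.0 + np.exp(-policy_entropy))) * (q3 - q1)
    return g_z >= tau, g_z


def ralw_weights(kept_returns):
    """RALW sample weights: normalize kept returns to [-1, 1] (min-max scaling
    is an assumption), then w_b = max(g_bar_b, 0)."""
    g = np.asarray(kept_returns, dtype=np.float64)
    g_bar = 2.0 * (g - g.min()) / (g.max() - g.min() + 1e-8) - 1.0
    return np.maximum(g_bar, 0.0)


def ralw_loss(token_logits, target_ids, sample_weights, label_smoothing=0.0):
    """Return-weighted autoregressive NLL for VLM fine-tuning.
    token_logits: (B, T, V); target_ids: (B, T); sample_weights: (B,).
    N_v is taken here as the number of positively weighted samples (assumption)."""
    nll = F.cross_entropy(token_logits.transpose(1, 2), target_ids,
                          reduction="none", label_smoothing=label_smoothing)  # (B, T)
    per_sample = sample_weights * nll.sum(dim=1)                  # w_b * sum_t NLL
    n_v = (sample_weights > 0).sum().clamp(min=1)
    return per_sample.sum() / n_v
```

A typical call chain would be `keep, _ = eddf_select(returns, entropy)` followed by `w = ralw_weights(np.asarray(returns)[keep])`, with `w` passed to `ralw_loss` during the LoRA fine-tuning step.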
Loss & Training¶
- RL policy loss: Standard SAC loss + VLM regularization term \(\tilde{\mathcal{L}}_\pi = \mathcal{L}_\pi + \lambda \|a_{v,t} - a_{r,t}\|_2^2\), with \(\lambda=2.0\) (CARLA) / \(1.0\) (DMControl); a code sketch follows this list
- VLM fine-tuning loss: RALW-weighted autoregressive NLL loss + label smoothing
- LoRA fine-tuning: rank=128, alpha=256 (CARLA) / 16 (DMControl), at most 8.26% of parameters are trainable
- Cold-start strategy: In CARLA, VLM guidance is enabled only after a 2-round delay; in DMControl, after 4 rounds
- VLM inference frequency: One inference per 10 frames during training to reduce latency
- VLM backbone: Qwen2.5-VL-3B
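On the RL side, the VLM prior enters only as an L2 penalty on the actor objective. Below is a minimal PyTorch sketch of that combined loss; the SAC term is written as the standard reparameterized actor objective, argument names are illustrative, and the handling of missing VLM guidance is an assumption.

```python
import torch


def covr_actor_loss(policy_action, log_prob, q_value, vlm_action, alpha, lam=2.0):
    """SAC actor loss plus COVR's VLM-prior regularizer:
    L~_pi = L_pi + lambda * ||a_{v,t} - a_{r,t}||_2^2, lambda = 2.0 (CARLA) / 1.0 (DMControl).
    vlm_action is None when no fresh VLM guidance is available (assumption)."""
    sac_loss = (alpha * log_prob - q_value).mean()                # standard SAC actor objective
    if vlm_action is None:
        return sac_loss
    reg = ((vlm_action.detach() - policy_action) ** 2).sum(dim=-1).mean()
    return sac_loss + lam * reg
```

Because the regularizer vanishes whenever no VLM action is available, the update degrades gracefully to vanilla SAC during the cold-start phase and between the sparse (every-10-frame) VLM queries.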
Key Experimental Results¶
CARLA (#HW Highway Scenario)¶
| Method Type | Method | Episode Reward | Driving Distance |
|---|---|---|---|
| Vanilla RL | SAC | 69 ± 46 | 91 ± 56 |
| Vanilla RL | ResAct | 227 ± 36 | 236 ± 40 |
| VLM-assisted | DGC | 208 ± 13 | 234 ± 15 |
| VLM-assisted | DPL | 113 ± 63 | 124 ± 67 |
| VLM-only | VBE | -11 ± 5 | 11 ± 4 |
| Ours | COVR | 248 ± 81 | 259 ± 85 |
CARLA (#GP Ghost Pedestrian Scenario)¶
| Method Type | Method | Episode Reward | Driving Distance |
|---|---|---|---|
| Vanilla RL | ResAct | 212 ± 54 | 216 ± 55 |
| VLM-assisted | DGC | 146 ± 14 | 169 ± 18 |
| Ours | COVR | 235 ± 89 | 237 ± 89 |
DMControl (6 Standard Tasks, 100K Steps)¶
| Task | COVR | ResAct | SVEA | DrQ |
|---|---|---|---|---|
| Cartpole, Swingup | 872 ± 2 | 819 ± 44 | 727 ± 86 | 759 ± 92 |
| Reacher, Easy | 969 ± 18 | 917 ± 59 | 811 ± 115 | 601 ± 213 |
| Cheetah, Run | 504 ± 13 | 503 ± 42 | 375 ± 54 | 344 ± 67 |
| Walker, Walk | 802 ± 25 | 772 ± 65 | 747 ± 65 | 612 ± 164 |
| Finger, Spin | 976 ± 9 | 974 ± 42 | 859 ± 77 | 901 ± 104 |
| Ball in cup, Catch | 960 ± 23 | 948 ± 44 | 915 ± 71 | 913 ± 53 |
DMControl Hard Tasks (500K Steps)¶
| Task | COVR | ResAct | TACO |
|---|---|---|---|
| Hopper, Hop | 188 ± 9 | 99 ± 49 | 112 ± 42 |
| Walker, Run | 485 ± 25 | 467 ± 27 | 355 ± 89 |
| Pendulum, Swingup | 792 ± 82 | 618 ± 380 | 485 ± 167 |
Cross-Baseline Compatibility (Applying COVR on Different Bases)¶
| Baseline | Cartpole | Cheetah | Walker |
|---|---|---|---|
| SAC | 237→740 | 118→156 | 95→194 |
| DeepMDP | 389→793 | 306→352 | 384→397 |
| RAD | 694→872 | 364→504 | 552→802 |
Ablation Study¶
On the #HW scenario, where full COVR reaches an episode reward (ER) of 248 and a driving distance (DD) of 259:
- Removing EDDF (random filtering): ER drops to 144 (−104), demonstrating the critical importance of dynamic filtering
- Fixed top-80%/90%/95% replacing EDDF: ER of 204/217/192, respectively, all inferior to the dynamic approach
- Removing Z-score normalization: ER drops to 210
- Using immediate reward instead of cumulative return: ER drops to 221
- Using Q-value instead of return: ER drops to 200 (Q-values are unstable in early training)
- Removing RALW: ER drops to 204 (−44)
- Random weights instead of return-based weights: ER drops to 184
- Training RL directly with high-return samples (no VLM): ER=183/175, confirming that the VLM's generalization guidance is irreplaceable
Iterative VLM Performance Improvement¶
VLM inference performance improves progressively across fine-tuning iterations: from Iteration 0 to 5, ER increases from −13 to 97; the RL policy performance correspondingly improves from 19 (Iter 1) to 248 (final).
Comparison Across Different VLMs¶
| VLM | ER | DD |
|---|---|---|
| Qwen2-VL-2B | 236 ± 69 | 246 ± 72 |
| Qwen2.5-VL-3B | 248 ± 81 | 259 ± 85 |
| LLaVA-1.5-7B | 228 ± 115 | 244 ± 116 |
Notably, the 7B LLaVA underperforms the 3B Qwen2.5-VL, suggesting that the VLM's fundamental visual understanding capability matters more than parameter count.
Highlights & Insights¶
- The bidirectional optimization loop is the core contribution: it breaks the existing "frozen VLM → unidirectional distillation" paradigm, enabling VLMs and RL agents to mutually reinforce each other in a virtuous cycle. This idea is both elegant and effective.
- No VLM required at test time: the framework leverages rich VLM knowledge during training while deploying only a lightweight policy network at inference, balancing performance and efficiency (policy network: 10 MB memory and 0.0012 s per inference, vs. 8344 MB and 4.4 s for the VLM).
- Exploration-aware design in EDDF: using policy entropy to dynamically adjust the filtering threshold is clever — in early training with high entropy, more samples are retained to avoid discarding potentially valuable low-return data; in later stages, stricter filtering is applied.
- Progressive fine-tuning reduces computational overhead: the steadily growing fine-tuning interval is simple yet effective, avoiding unnecessary frequent updates in later training stages.
- Strong cross-baseline compatibility: COVR can be applied as a plug-and-play enhancement to diverse baselines such as SAC, DeepMDP, and RAD.
Limitations & Future Work¶
- High variance: COVR's standard deviation on CARLA (±81/±89) is notably larger than DGC (±13/±14) and ResAct (±36/±40); despite achieving the highest mean, its stability is insufficient, with considerable variation across seeds.
- Limited VLM scale: due to computational constraints, only VLMs up to 3B parameters are used. The authors acknowledge that larger VLMs may provide richer priors.
- Action-level guidance only: the framework extracts only action-level guidance from the VLM without exploiting its internal chain of reasoning, potentially underutilizing the VLM's intermediate reasoning capabilities.
- No temporal modeling: VLM inference is based on single frames without modeling consecutive observation sequences, which may produce inconsistent reasoning in dynamic scenarios.
- Dependence on initial exploration quality: despite cold-start strategies, early RL data inevitably contains noise, and LoRA-fine-tuned VLMs remain imperfect.
- Limited evaluation environments: validation is primarily conducted on CARLA and DMControl, lacking evaluation on real physical tasks such as robotic manipulation.
- Code not released: reproducibility remains uncertain.
Related Work & Insights¶
| Dimension | COVR | DGC (CVPR25) | VPF (Direct VLM Fine-tuning) |
|---|---|---|---|
| VLM Role | Iterative fine-tuning + RL guidance | Frozen VLM distillation | VLM serves as the policy network |
| VLM Updated? | ✅ LoRA fine-tuned with RL data | ❌ Frozen | ✅ Fine-tuned with RL loss |
| Test-time Inference Cost | Low (policy network only) | Low (policy network only) | High (VLM required online) |
| Data Utilization | Bidirectional: RL→VLM→RL | Unidirectional: VLM→RL | Internal VLM training |
| #HW ER | 248 | 208 | 91 |
The key distinction from DGC is that COVR dynamically updates the VLM, progressively enriching its domain knowledge as RL training proceeds. The key distinction from VPF is that COVR ultimately deploys a lightweight policy network rather than the VLM itself.
The bidirectional collaborative optimization paradigm generalizes to other VLM + downstream task settings — e.g., VLM-guided robotic grasping, where success/failure data is fed back to fine-tune the VLM. The policy-entropy-based threshold control in EDDF is broadly applicable to any setting requiring sample selection from noisy data. The return-weighted loss in RALW resembles advantage weighting in RL but represents an interesting cross-domain application to VLM SFT.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The bidirectional collaborative optimization paradigm is a meaningful contribution, though individual components (LoRA fine-tuning, return-weighted loss, dynamic thresholding) are not particularly novel on their own
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Experiments are comprehensive, covering CARLA, DMControl, and CarRacing, with ablations, cross-baseline, cross-scenario generalization, VLM comparison, and parameter analysis
- Writing Quality: ⭐⭐⭐⭐ — Structure is clear and formulations are complete, though some notation and prose are slightly verbose
- Value: ⭐⭐⭐⭐ — Provides a concise and effective paradigm for VLM-RL integration, with high variance and unavailable code as drawbacks