COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control¶
- Conference: AAAI 2026
- arXiv: 2601.06122
- Code: Unavailable (as of 2026-03)
- Area: Multimodal VLM / Agent / Reinforcement Learning
- Keywords: VLM-RL collaborative optimization, visual reinforcement learning, knowledge distillation, dynamic data filtering, autonomous driving
TL;DR¶
This paper proposes COVR, a bidirectional collaborative optimization framework for VLMs and RL agents: high-quality interaction data generated by RL is used to fine-tune the VLM, while the enhanced VLM in turn guides RL policy learning via action priors, achieving SOTA performance on CARLA and DMControl.
Background & Motivation¶
Visual RL suffers from poor sample efficiency under high-dimensional observation spaces. Existing VLM-assisted RL methods fall into two categories: (1) directly fine-tuning the VLM as a policy network (VPF), which incurs high computational cost and deployment difficulties; (2) freezing the VLM for knowledge distillation (DPL/APL/DGC), transferring VLM priors to lightweight policy networks. However, a critical limitation of the latter is that the VLM itself may lack sufficient domain knowledge for the target task, and a frozen VLM can propagate inaccurate reasoning, leading to negative guidance.
The core insight of this paper is that VLMs and RL agents possess highly complementary strengths — VLMs offer semantic reasoning and generalization, while RL agents can discover high-quality state-action pairs in specific scenarios. Accordingly, the paper argues for establishing a bidirectional enhancement loop rather than unidirectional knowledge transfer.
Core Problem¶
How can VLMs and RL agents mutually reinforce each other during training? Specifically: 1. RL training data is noisy and inconsistent — how can high-quality samples be selected for effective VLM fine-tuning? 2. Random RL exploration leads to vastly different high-reward actions under similar observations (e.g., "accelerate forward" vs. "decelerate and turn right" on a straight road) — how can such inconsistency be prevented from misleading VLM supervised fine-tuning?
Method¶
Overall Architecture¶
COVR is an iterative bidirectional optimization framework consisting of two alternating phases:
Phase 1 (VLM-Guided RL): The VLM receives the current visual observation and task prompt, infers action semantics, and converts them via string parsing into a continuous action \(a_{v,t}\). The RL policy network generates the raw action \(a_{r,t}\). During training, only \(a_{r,t}\) interacts with the environment, while \(a_{v,t}\) serves as an auxiliary supervision signal. At test time, the VLM is not required; only the lightweight policy network is used for inference, satisfying real-time requirements.
Phase 2 (RL-Tuned VLM): Trajectory data collected from RL interactions is filtered by EDDF and weighted by RALW, then used to fine-tune the VLM via LoRA, enhancing its semantic reasoning capability on the target task. The fine-tuned VLM subsequently provides more accurate action priors for the next round of RL training.
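The sketch below illustrates this alternating structure in Python. It is a schematic, not the authors' implementation (no code is released): `env`, `policy`, and `vlm` are duck-typed placeholders, `select_and_weight` stands in for the EDDF/RALW pipeline described next, and the numeric answer format assumed by `parse_vlm_action` is invented for illustration.

```python
import re


def parse_vlm_action(text, dims=2):
    """Convert the VLM's textual answer into a continuous action a_{v,t}.
    The numeric answer format (e.g. "steer=0.10, throttle=0.60") is an
    assumption for illustration; the paper's exact prompt and parser are not released."""
    nums = [float(x) for x in re.findall(r"-?\d+\.?\d*", text)]
    return nums[:dims] if len(nums) >= dims else None


def covr_loop(env, policy, vlm, task_prompt, total_steps,
              vlm_query_every=10, finetune_interval=5_000, cold_start_rounds=2):
    """Schematic COVR loop; env/policy/vlm are duck-typed placeholders and
    finetune_interval (in environment steps) is a guessed default."""
    buffer_f, a_v = [], None                 # D_f and the cached VLM action prior
    round_idx, next_ft = 0, finetune_interval
    obs = env.reset()
    for step in range(total_steps):
        # Phase 1: VLM-guided RL -- query the VLM sparsely, after a cold-start delay.
        if round_idx >= cold_start_rounds and step % vlm_query_every == 0:
            a_v = parse_vlm_action(vlm.generate(obs, prompt=task_prompt))
        a_r = policy.sample(obs)                         # raw RL action a_{r,t}
        next_obs, reward, done, _ = env.step(a_r)        # only a_r touches the environment
        policy.update(obs, a_r, reward, next_obs, vlm_prior=a_v)  # SAC loss + prior regularizer
        buffer_f.append((obs, a_r, reward))              # returns g_i are accumulated per episode
        obs = env.reset() if done else next_obs

        # Phase 2: RL-tuned VLM -- filter (EDDF), weight (RALW), then LoRA fine-tune.
        if step + 1 == next_ft:
            samples, weights = select_and_weight(buffer_f, policy)  # see EDDF/RALW sketch below
            vlm.lora_finetune(samples, weights)
            buffer_f.clear()                             # D_f is cleared after each round
            round_idx += 1
            next_ft += finetune_interval * (1 + round_idx)  # progressively longer interval
```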
Key Designs¶
- Exploration-Driven Dynamic Filter (EDDF): A dynamic data filtering module based on the degree of exploration. A dedicated buffer \(\mathcal{D}_f\) stores trajectory data \((o_i, a_{r,i}, g_i)\). The core procedure is: (a) apply Z-score normalization to return values; (b) dynamically adjust the filtering threshold \(\tau = \text{Median}(\mathcal{G}_z) + \text{Sigmoid}(\varepsilon_t) \cdot \text{IQR}(\mathcal{G}_z)\) based on policy entropy \(\varepsilon_t\). When policy entropy is high in early training, the threshold is relaxed to retain more potentially valuable low-return samples; as entropy decreases in later stages, the threshold tightens to prioritize high-return trajectories. This design is more flexible than fixed top-k filtering and adapts to different stages of RL training. (A code sketch of both EDDF and RALW follows this list.)
- Return-Aware Adaptive Loss Weight (RALW): An adaptive loss weighting module conditioned on returns. Returns of filtered samples are normalized to \([-1, 1]\); samples whose normalized return is negative receive zero weight (excluded from learning), while positive-return samples receive higher weights. Formally: \(\mathcal{L}_{\text{RALW}} = \frac{1}{N_v}\sum_{b=1}^{B} w_b \sum_{t=1}^{T} -\log p(y_{b,t} | \mathbf{x}_{b,<t})\), where \(w_b = \max(\bar{g}_b, 0)\). This encourages the model to prioritize learning high-reward behaviors while preserving the VLM's original capabilities.
- Adaptive Progressive Fine-Tuning: A progressive fine-tuning strategy in which the fine-tuning interval \(\psi_c\) grows with the iteration count \(c\): \(\psi_{c+1} = \psi_c + \psi_c \cdot c\). The VLM is updated frequently when the policy is unstable in early training, and less frequently as the policy converges. The buffer \(\mathcal{D}_f\) is cleared after each fine-tuning round.
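Both EDDF and RALW reduce to a few lines of array arithmetic. The following is a minimal NumPy/PyTorch sketch under stated assumptions: samples at or above the threshold \(\tau\) are kept, the \([-1, 1]\) normalization is implemented as min-max scaling, and \(N_v\) is taken to be the number of positively weighted samples; none of these details are confirmed beyond the formulas summarized above.

```python
import numpy as np
import torch
import torch.nn.functional as F


def eddf_select(returns, policy_entropy):
    """EDDF: keep samples whose Z-scored return clears a dynamic threshold.
    Assumes samples at or above tau are retained (keep rule is an assumption)."""
    g = np.asarray(returns, dtype=np.float64)
    g_z = (g - g.mean()) / (g.std() + 1e-8)                       # Z-score normalization
    q1, q3 = np.percentile(g_z, [25, 75])
    tau = np.median(g_z) + (1.0 / (1.0 + np.exp(-policy_entropy))) * (q3 - q1)
    return g_z >= tau, g_z


def ralw_weights(kept_returns):
    """RALW sample weights: normalize kept returns to [-1, 1] (min-max scaling
    is an assumption), then w_b = max(g_bar_b, 0)."""
    g = np.asarray(kept_returns, dtype=np.float64)
    g_bar = 2.0 * (g - g.min()) / (g.max() - g.min() + 1e-8) - 1.0
    return np.maximum(g_bar, 0.0)


def ralw_loss(token_logits, target_ids, sample_weights, label_smoothing=0.0):
    """Return-weighted autoregressive NLL for VLM fine-tuning.
    token_logits: (B, T, V); target_ids: (B, T); sample_weights: (B,).
    N_v is taken here as the number of positively weighted samples (assumption)."""
    nll = F.cross_entropy(token_logits.transpose(1, 2), target_ids,
                          reduction="none", label_smoothing=label_smoothing)  # (B, T)
    per_sample = sample_weights * nll.sum(dim=1)                  # w_b * sum_t NLL
    n_v = (sample_weights > 0).sum().clamp(min=1)
    return per_sample.sum() / n_v
```

A typical call chain would be `keep, _ = eddf_select(returns, entropy)` followed by `w = ralw_weights(np.asarray(returns)[keep])`, with `w` passed to `ralw_loss` during the LoRA fine-tuning step.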
Loss & Training¶
- RL policy loss: Standard SAC loss + VLM regularization term \(\tilde{\mathcal{L}}_\pi = \mathcal{L}_\pi + \lambda \|a_{v,t} - a_{r,t}\|_2^2\), with \(\lambda=2.0\) (CARLA) / \(1.0\) (DMControl); a code sketch follows this list
- VLM fine-tuning loss: RALW-weighted autoregressive NLL loss + label smoothing
- LoRA fine-tuning: rank=128, alpha=256 (CARLA) / 16 (DMControl), at most 8.26% of parameters are trainable
- Cold-start strategy: In CARLA, VLM guidance is enabled only after a 2-round delay; in DMControl, after 4 rounds
- VLM inference frequency: One inference per 10 frames during training to reduce latency
- VLM backbone: Qwen2.5-VL-3B
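On the RL side, the VLM prior enters only as an L2 penalty on the actor objective. Below is a minimal PyTorch sketch of that combined loss; the SAC term is written as the standard reparameterized actor objective, argument names are illustrative, and the handling of missing VLM guidance is an assumption.

```python
import torch


def covr_actor_loss(policy_action, log_prob, q_value, vlm_action, alpha, lam=2.0):
    """SAC actor loss plus COVR's VLM-prior regularizer:
    L~_pi = L_pi + lambda * ||a_{v,t} - a_{r,t}||_2^2, lambda = 2.0 (CARLA) / 1.0 (DMControl).
    vlm_action is None when no fresh VLM guidance is available (assumption)."""
    sac_loss = (alpha * log_prob - q_value).mean()                # standard SAC actor objective
    if vlm_action is None:
        return sac_loss
    reg = ((vlm_action.detach() - policy_action) ** 2).sum(dim=-1).mean()
    return sac_loss + lam * reg
```

Because the regularizer vanishes whenever no VLM action is available, the update degrades gracefully to vanilla SAC during the cold-start phase and between the sparse (every-10-frame) VLM queries.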
Key Experimental Results¶
CARLA (#HW Highway Scenario)¶
| Method Type | Method | Episode Reward | Driving Distance |
|---|---|---|---|
| Vanilla RL | SAC | 69 ± 46 | 91 ± 56 |
| Vanilla RL | ResAct | 227 ± 36 | 236 ± 40 |
| VLM-assisted | DGC | 208 ± 13 | 234 ± 15 |
| VLM-assisted | DPL | 113 ± 63 | 124 ± 67 |
| VLM-only | VBE | -11 ± 5 | 11 ± 4 |
| Ours | COVR | 248 ± 81 | 259 ± 85 |
CARLA (#GP Ghost Pedestrian Scenario)¶
| Method Type | Method | Episode Reward | Driving Distance |
|---|---|---|---|
| Vanilla RL | ResAct | 212 ± 54 | 216 ± 55 |
| VLM-assisted | DGC | 146 ± 14 | 169 ± 18 |
| Ours | COVR | 235 ± 89 | 237 ± 89 |
DMControl (6 Standard Tasks, 100K Steps)¶
| Task | COVR | ResAct | SVEA | DrQ |
|---|---|---|---|---|
| Cartpole, Swingup | 872 ± 2 | 819 ± 44 | 727 ± 86 | 759 ± 92 |
| Reacher, Easy | 969 ± 18 | 917 ± 59 | 811 ± 115 | 601 ± 213 |
| Cheetah, Run | 504 ± 13 | 503 ± 42 | 375 ± 54 | 344 ± 67 |
| Walker, Walk | 802 ± 25 | 772 ± 65 | 747 ± 65 | 612 ± 164 |
| Finger, Spin | 976 ± 9 | 974 ± 42 | 859 ± 77 | 901 ± 104 |
| Ball in cup, Catch | 960 ± 23 | 948 ± 44 | 915 ± 71 | 913 ± 53 |
DMControl Hard Tasks (500K Steps)¶
| Task | COVR | ResAct | TACO |
|---|---|---|---|
| Hopper, Hop | 188 ± 9 | 99 ± 49 | 112 ± 42 |
| Walker, Run | 485 ± 25 | 467 ± 27 | 355 ± 89 |
| Pendulum, Swingup | 792 ± 82 | 618 ± 380 | 485 ± 167 |
Cross-Baseline Compatibility (Applying COVR on Different Bases)¶
| Baseline | Cartpole | Cheetah | Walker |
|---|---|---|---|
| SAC | 237→740 | 118→156 | 95→194 |
| DeepMDP | 389→793 | 306→352 | 384→397 |
| RAD | 694→872 | 364→504 | 552→802 |
Ablation Study¶
On the #HW scenario, where full COVR reaches an episode reward (ER) of 248 and a driving distance (DD) of 259:
- Removing EDDF (random filtering): ER drops to 144 (−104), demonstrating the critical importance of dynamic filtering
- Fixed top-80%/90%/95% replacing EDDF: ER of 204/217/192, respectively, all inferior to the dynamic approach
- Removing Z-score normalization: ER drops to 210
- Using immediate reward instead of cumulative return: ER drops to 221
- Using Q-value instead of return: ER drops to 200 (Q-values are unstable in early training)
- Removing RALW: ER drops to 204 (−44)
- Random weights instead of return-based weights: ER drops to 184
- Training RL directly with high-return samples (no VLM): ER=183/175, confirming that the VLM's generalization guidance is irreplaceable
Iterative VLM Performance Improvement¶
VLM inference performance improves progressively across fine-tuning iterations: from Iteration 0 to 5, ER increases from −13 to 97; the RL policy performance correspondingly improves from 19 (Iter 1) to 248 (final).
Comparison Across Different VLMs¶
| VLM | ER | DD |
|---|---|---|
| Qwen2-VL-2B | 236 ± 69 | 246 ± 72 |
| Qwen2.5-VL-3B | 248 ± 81 | 259 ± 85 |
| LLaVA-1.5-7B | 228 ± 115 | 244 ± 116 |
Notably, the 7B LLaVA underperforms the 3B Qwen2.5-VL, suggesting that the VLM's fundamental visual understanding capability matters more than parameter count.
Highlights & Insights¶
- The bidirectional optimization loop is the core contribution: it breaks the existing "frozen VLM → unidirectional distillation" paradigm, enabling VLMs and RL agents to mutually reinforce each other in a virtuous cycle. This idea is both elegant and effective.
- No VLM required at test time: the framework leverages rich VLM knowledge during training while deploying only a lightweight policy network at inference, balancing performance and efficiency (policy network: 10 MB memory and 0.0012 s per inference, vs. 8344 MB and 4.4 s for the VLM).
- Exploration-aware design in EDDF: using policy entropy to dynamically adjust the filtering threshold is clever — in early training with high entropy, more samples are retained to avoid discarding potentially valuable low-return data; in later stages, stricter filtering is applied.
- Progressive fine-tuning reduces computational overhead: the steadily growing fine-tuning interval is simple yet effective, avoiding unnecessary frequent updates in later training stages.
- Strong cross-baseline compatibility: COVR can be applied as a plug-and-play enhancement to diverse baselines such as SAC, DeepMDP, and RAD.
Limitations & Future Work¶
- High variance: COVR's standard deviation on CARLA (±81/±89) is notably larger than DGC (±13/±14) and ResAct (±36/±40); despite achieving the highest mean, its stability is insufficient, with considerable variation across seeds.
- Limited VLM scale: due to computational constraints, only VLMs up to 3B parameters are used. The authors acknowledge that larger VLMs may provide richer priors.
- Action-level guidance only: the framework extracts only action-level guidance from the VLM without exploiting its internal chain of reasoning, potentially underutilizing the VLM's intermediate reasoning capabilities.
- No temporal modeling: VLM inference is based on single frames without modeling consecutive observation sequences, which may produce inconsistent reasoning in dynamic scenarios.
- Dependence on initial exploration quality: despite cold-start strategies, early RL data inevitably contains noise, and LoRA-fine-tuned VLMs remain imperfect.
- Limited evaluation environments: validation is primarily conducted on CARLA and DMControl, lacking evaluation on real physical tasks such as robotic manipulation.
- Code not released: reproducibility remains uncertain.
Related Work & Insights¶
| Dimension | COVR | DGC (CVPR25) | VPF (Direct VLM Fine-tuning) |
|---|---|---|---|
| VLM Role | Iterative fine-tuning + RL guidance | Frozen VLM distillation | VLM serves as the policy network |
| VLM Updated? | ✅ LoRA fine-tuned with RL data | ❌ Frozen | ✅ Fine-tuned with RL loss |
| Test-time Inference Cost | Low (policy network only) | Low (policy network only) | High (VLM required online) |
| Data Utilization | Bidirectional: RL→VLM→RL | Unidirectional: VLM→RL | Internal VLM training |
| #HW ER | 248 | 208 | 91 |
The key distinction from DGC is that COVR dynamically updates the VLM, progressively enriching its domain knowledge as RL training proceeds. The key distinction from VPF is that COVR ultimately deploys a lightweight policy network rather than the VLM itself.
The bidirectional collaborative optimization paradigm generalizes to other VLM + downstream task settings — e.g., VLM-guided robotic grasping, where success/failure data is fed back to fine-tune the VLM. The policy-entropy-based threshold control in EDDF is broadly applicable to any setting requiring sample selection from noisy data. The return-weighted loss in RALW resembles advantage weighting in RL but represents an interesting cross-domain application to VLM SFT.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The bidirectional collaborative optimization paradigm is a meaningful contribution, though individual components (LoRA fine-tuning, return-weighted loss, dynamic thresholding) are not particularly novel on their own
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Experiments are comprehensive, covering CARLA, DMControl, and CarRacing, with ablations, cross-baseline, cross-scenario generalization, VLM comparison, and parameter analysis
- Writing Quality: ⭐⭐⭐⭐ — Structure is clear and formulations are complete, though some notation and prose are slightly verbose
- Value: ⭐⭐⭐⭐ — Provides a concise and effective paradigm for VLM-RL integration, with high variance and unavailable code as drawbacks