
Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual-Inertial Odometry

Conference: CVPR 2026 | arXiv: 2511.21083 | Code: To be confirmed | Area: Video Understanding / Visual Odometry | Keywords: Visual-Inertial Odometry, Reinforcement Learning, Adaptive Fusion, Computation Scheduling, IMU Bias Estimation, PPO

TL;DR

This paper proposes a dual-agent reinforcement learning framework comprising a Select Agent, which decides from IMU signals alone whether to activate the visual front-end, and a Fusion Agent, which adaptively fuses the visual and inertial states. Rather than removing visual-inertial bundle adjustment (VIBA) entirely, the framework sharply reduces its invocation frequency and computational overhead, achieving a superior accuracy–efficiency–memory trade-off.

Background & Motivation

Core Trade-off in Classical VIO: Filter-based methods (MSCKF, ROVIO) are efficient but drift significantly; optimization-based methods (VINS-Mono, ORB-SLAM3) achieve high accuracy but incur heavy VIBA computation, making deployment on resource-constrained edge devices difficult.

Limitations of Prior Work — End-to-End Deep Learning: Methods such as VINet and SelfVIO directly regress poses but remain inferior to mature optimization frameworks in accuracy and generalizability.

Hybrid Methods Do Not Resolve the Core Bottleneck: Works such as iSLAM and DPVO introduce learning modules to enhance feature matching or optimization, yet are still bottlenecked by VIBA computation.

Inefficient Keyframe Selection: Conventional strategies must process images before determining frame utility, precluding skip decisions prior to visual computation.

Inflexible Fixed Fusion Weights: Static EKF or fixed-gain fusion cannot dynamically adjust the trust in visual versus inertial measurements based on motion intensity and sensor reliability.

Shallow Application of RL in Odometry: Prior work (Messikommer et al.) applies RL only to optimize internal heuristics within VO, without considering high-level scheduling of whether to activate the entire VO pipeline.

Method

Overall Architecture

The system consists of four decoupled modules:

  • IMU Preprocess: IMU bias encoder + pre-integration, outputting inter-frame inertial states \((\Delta\mathbf{p}, \Delta\mathbf{q}, \Delta\mathbf{v}, \Delta t)\)
  • Select Agent: An RL policy that decides whether to activate the VO module based solely on IMU states
  • Visual Odometry: Patch-based recurrent optimization built on DPVO, activated only when the Select Agent outputs 1
  • Fusion Agent: A composite module consisting of MLP1 (supervised velocity estimation) and MLP2 (an RL policy for adaptive full-state fusion)
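To make the control flow concrete, below is a minimal sketch of the per-frame scheduling loop implied by this four-module split. All names and interfaces here are illustrative placeholders rather than the authors' actual API, and the Select Agent is stubbed as a linear scorer for brevity (the paper uses an MLP policy trained with PPO).

```python
import numpy as np

def select_action(s_sel: np.ndarray, policy_w: np.ndarray) -> int:
    """Stub Select Agent: linear scoring of the IMU-only state (placeholder
    for the paper's PPO-trained MLP policy)."""
    return int(float(s_sel @ policy_w) > 0.0)  # 1 = run VO, 0 = skip

def step_frame(s_sel, imu_state, run_vo, fuse, policy_w):
    """One frame: schedule from IMU alone, optionally run VO, then fuse."""
    if select_action(s_sel, policy_w):
        vo_state = run_vo()               # visual front-end, only on demand
        return fuse(imu_state, vo_state)  # Fusion Agent combines the two
    return imu_state                      # skipped frame: pure IMU propagation
```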

Key Designs

1. IMU Bias Estimator

  • Two lightweight encoder networks are trained: \(f_{bias}^g\) (gyroscope) and \(f_{bias}^a\) (accelerometer), taking raw sensor sequences and noise parameters as input and outputting three-axis bias estimates.
  • A fixed bias is estimated rather than stochastic noise, since the fixed-bias model better captures the slowly varying dominant error pattern.
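A toy stand-in for one such encoder is sketched below; the window length, hidden sizes, and the exact form of the noise-parameter input are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BiasEncoder(nn.Module):
    """Toy bias encoder: raw 3-axis IMU window + noise params -> 3-axis bias."""
    def __init__(self, window: int = 200, noise_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * window + noise_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 3),  # constant per-window bias estimate
        )

    def forward(self, raw: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        # raw: (B, window, 3) gyro or accel samples; noise: (B, noise_dim)
        return self.net(torch.cat([raw.flatten(1), noise], dim=1))

# One instance per sensor: f_bias^g (gyroscope) and f_bias^a (accelerometer).
gyro_bias = BiasEncoder()(torch.randn(1, 200, 3), torch.randn(1, 2))
```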

2. Select Agent (RL Scheduling)

  • State space: A compact IMU-only state \(s_t^{sel} = \{\Delta\mathbf{p}_t, \Delta\mathbf{q}_t, \Delta\mathbf{v}_t, \Delta t_t^{vo}\}\), requiring no visual features.
  • Action space: Binary decision \(a_t^{sel} \in \{0, 1\}\), where 0 = skip VO and 1 = run VO.
  • Reward function: A terminal ATE reward plus a dense per-step penalty and a VO invocation cost; at the episode level, \(R_{episode} = \frac{A}{ATE + \epsilon} - B \cdot N_f\), where \(N_f\) is the number of VO invocations.
  • Trained with PPO using an MLP-parameterized policy.
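The episode-level reward translates directly into code; the coefficient values below are placeholders for the paper's tuning constants, and the dense per-step penalty is omitted.

```python
def select_agent_reward(ate: float, n_vo_calls: int,
                        A: float = 1.0, B: float = 0.01,
                        eps: float = 1e-6) -> float:
    """R_episode = A / (ATE + eps) - B * N_f: an accuracy term that grows as
    trajectory error shrinks, minus a cost per VO invocation."""
    return A / (ate + eps) - B * n_vo_calls
```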

3. Fusion Agent (Adaptive Fusion)

  • MLP1 (supervised): Estimates metric velocity from the scaled VO pose and IMU pre-integration.
  • MLP2 (RL policy): Outputs a 7-dimensional per-axis fusion weight \(\mathbf{w} \in [0,1]^7\), performing a convex combination over position and velocity and SLERP interpolation over orientation.
  • When VO is skipped, \(\mathbf{w}\) defaults to zero, yielding pure IMU propagation.
  • Reward: \(r_k = -\|\mathbf{p}_k - \mathbf{p}_{gt}\|_2^2 - \lambda \, \text{Tr}(\mathbf{\Sigma}_k)\), jointly driving low position error and well-calibrated uncertainty.
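A minimal sketch of the fusion step follows, assuming the 7 weights split as 3 for position, 3 for velocity, and 1 for orientation (a split consistent with, but not explicitly stated in, the description above).

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def fuse(p_imu, v_imu, q_imu, p_vo, v_vo, q_vo, w):
    """Blend IMU and VO states with per-axis weights w in [0,1]^7.
    Quaternions use scipy's xyzw convention."""
    w = np.asarray(w)
    p = w[:3] * p_vo + (1.0 - w[:3]) * p_imu      # convex position blend
    v = w[3:6] * v_vo + (1.0 - w[3:6]) * v_imu    # convex velocity blend
    slerp = Slerp([0.0, 1.0], Rotation.from_quat([q_imu, q_vo]))
    q = slerp(w[6]).as_quat()                      # SLERP toward VO by w[6]
    # w = 0 recovers pure IMU propagation, matching the skip-VO case.
    return p, v, q
```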

4. Scale Initialization

  • The world coordinate frame is aligned with the initial IMU body frame. An overdetermined linear system is constructed from IMU pre-integration and VO relative translations within a sliding window, and the global metric scale \(s\) is solved via least squares.
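A heavily simplified closed-form version of this solve is sketched below; the paper's actual linear system is built from pre-integration terms within the window and may include additional unknowns (e.g., velocity or gravity), so the scale-only form here is an assumption for illustration.

```python
import numpy as np

def init_scale(vo_translations, imu_translations):
    """Least-squares scale: stack s * t_vo_i ~= dp_imu_i over the window and
    solve min_s ||s*a - b||^2 in closed form, s = (a.b) / (a.a)."""
    a = np.concatenate([np.ravel(t) for t in vo_translations])
    b = np.concatenate([np.ravel(t) for t in imu_translations])
    return float(a @ b) / float(a @ a)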

Loss & Training

  • Two-stage training: supervised pre-training of MLP1, followed by supervised initialization then PPO fine-tuning of MLP2.
  • PPO is trained in a Gym-style replay environment using real IMU–VO pairs, naturally incorporating sensor drift and noise.
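As a sketch of what such a replay environment might look like (gymnasium API; the dataset layout, observation contents, and reward placement are assumptions):

```python
import gymnasium as gym
import numpy as np

class VIOReplayEnv(gym.Env):
    """Replays logged IMU-derived states; actions gate the (precomputed) VO."""
    def __init__(self, imu_states: np.ndarray):
        self.imu_states = imu_states
        self.observation_space = gym.spaces.Box(
            -np.inf, np.inf, shape=(imu_states.shape[1],), dtype=np.float64)
        self.action_space = gym.spaces.Discrete(2)  # 0 = skip VO, 1 = run VO

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self.imu_states[self.t], {}

    def step(self, action):
        # Observations replay real sensor logs, so drift and noise are
        # present by construction -- no simulator is involved.
        self.t += 1
        terminated = self.t >= len(self.imu_states) - 1
        reward = 0.0  # dense penalty / terminal ATE reward supplied by the task
        return self.imu_states[self.t], reward, terminated, False, {}
```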

Key Experimental Results

Main Results

EuRoC MAV Dataset (Table 1): ATE in meters, compared against classical CPU-based VIO methods

| Method    | MH2   | MH3   | MH4   | MH5   | V11   | V12   | V13   | V21   | V22   | V23   | Avg   |
|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| MSCKF     | 0.45  | 0.23  | 0.37  | 0.48  | 0.34  | 0.20  | 0.67  | 0.10  | 0.16  | 1.13  | 0.413 |
| VINS-Mono | 0.15  | 0.22  | 0.32  | 0.30  | 0.079 | 0.11  | 0.18  | 0.080 | 0.16  | 0.27  | 0.187 |
| DM-VIO    | 0.044 | 0.097 | 0.102 | 0.096 | 0.048 | 0.045 | 0.069 | 0.029 | 0.050 | 0.114 | 0.069 |
| ORB-SLAM3 | 0.037 | 0.046 | 0.075 | 0.057 | 0.049 | 0.015 | 0.037 | 0.042 | 0.021 | 0.027 | 0.041 |
| Ours      | 0.064 | 0.119 | 0.112 | 0.112 | 0.047 | 0.125 | 0.073 | 0.055 | 0.036 | 0.179 | 0.092 |

GPU-Based Method Comparison (Table 4): Joint evaluation of accuracy, efficiency, and memory

| Method   | Avg ATE (m) | FPS | VRAM (GB) |
|----------|-------------|-----|-----------|
| DPVO     | 0.106       | 22  | 4.92      |
| iSLAM    | 0.529       | 31  | 6.47      |
| DROID-VO | 0.188       | 14  | 8.63      |
| Ours     | 0.092       | 39  | 4.37      |

Ablation Study

  1. IMU Bias Estimator: Jointly estimating gyroscope and accelerometer biases (Omega+Accel) yields the best performance; the fixed bias model outperforms the stochastic noise model.
  2. Select Agent: Under aggressive frame-skipping (75%–87.5%), the IMU-only prior scheduling exhibits a more gradual degradation curve compared to heuristic and RL-gating (KF) baselines; at a 50% skip target, the IMU-only policy achieves significantly higher FPS with less than 3% increase in ATE.
  3. Fusion Agent: RL fusion (ATE 0.112 m) outperforms EKF fusion (0.127 m) and fixed-weight fusion (0.143 m); the fusion policy remains effective when switching to a DROID-VO front-end (0.399→0.237 m), demonstrating cross-backend generalization.
  4. Robustness: Under 5%/10% image blur degradation, the proposed method consistently achieves lower ATE (0.138/0.153 m) than DPVO (0.174/0.192 m).

Key Findings

  • The proposed method achieves the best average ATE (0.092 m) among GPU-based methods, while reaching 39 FPS (1.77× that of DPVO) and consuming only 4.37 GB VRAM (49.4% less than DROID-VO).
  • CPU-side BA/VIBA time is reduced from 121 ms (ORB-SLAM3) to 12.77 ms, structurally lowering optimization overhead.
  • Accuracy remains acceptable relative to classical optimization-based VIO (Avg 0.092 vs. DM-VIO 0.069), with a substantial computational efficiency advantage.
  • On the TUM-VI dataset, the average ATE is 0.80 m, marginally higher than DM-VIO's 0.77 m, while remaining competitive.

Highlights & Insights

  • IMU-only prior scheduling is the core innovation: skip decisions are made before visual computation, saving substantially more computation than optimizing only internal VO heuristics.
  • The dual-agent design decouples scheduling and fusion, with each agent optimizing a distinct objective, yielding a clean architecture.
  • The Fusion Agent generalizes across VO backends (DPVO→DROID-VO), demonstrating that the learned policy depends on physical quantities rather than architecture-specific features.
  • Training PPO via Gym-style replay on real data avoids the sim-to-real gap.

Limitations & Future Work

  • Accuracy remains noticeably below top optimization-based methods such as ORB-SLAM3 (Avg 0.092 vs. 0.041).
  • Scale initialization relies on a sliding window with sufficient motion; stationary start-up scenarios may fail.
  • Evaluation is limited to two indoor datasets (EuRoC and TUM-VI); outdoor and driving scenarios are not assessed.
  • The robustness of the Select Agent's frame-skipping strategy under extreme conditions (e.g., rapid consecutive rotations) is not thoroughly discussed.
  • Training requires ground-truth sequences, and zero-shot generalization to new environments is not validated.
Related Work

  • Classical VIO: MSCKF, ROVIO (filter-based); VINS-Mono, ORB-SLAM3, DM-VIO (optimization-based).
  • Deep Learning VO/VIO: VINet, SelfVIO (end-to-end); DPVO, DROID-SLAM (hybrid); iSLAM (learning-enhanced classical pipeline).
  • Adaptive Visual Selection: VS-VIO and similar methods dynamically reweight features but still run the visual encoder on every frame.
  • RL in VO: Messikommer et al. apply RL to replace internal keyframe selection heuristics within VO; this paper elevates RL to high-level pipeline scheduling.
  • Most Relevant: This work is the first to model "whether to run the entire VO pipeline" as a prior RL decision.

Rating

  • Novelty: ⭐⭐⭐⭐ — The dual-agent RL architecture and IMU-only prior scheduling are conceptually novel, though RL in robotics is not entirely new.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Ablations are comprehensive; cross-backend generalization and robustness are both evaluated, though only on two indoor datasets.
  • Writing Quality: ⭐⭐⭐⭐ — Structure is clear, motivation is well articulated, and derivations are complete.
  • Value: ⭐⭐⭐⭐ — The accuracy–efficiency trade-off is practically meaningful for deployment, though the accuracy gap relative to SOTA optimization methods limits applicability in high-precision scenarios.