Adaptive Vision-Language Model Routing for Computer Use Agents

Conference: CVPR 2026 · arXiv: 2603.12823 · Code: GitHub
Area: Multimodal VLM / GUI Agents / Model Routing
Keywords: VLM routing, CUA, confidence probing, memory augmentation, cost optimization

TL;DR

This paper proposes the Adaptive VLM Routing (AVR) framework, which inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. Through three mechanisms — multimodal difficulty classification, logprob confidence probing, and historical memory injection — AVR dynamically selects the most cost-efficient model for each action, reducing inference cost by up to 78% with an accuracy drop of no more than 2 percentage points.

Background & Motivation

Background: Computer Use Agents (CUA) interpret screenshots and execute GUI actions (clicks, typing, scrolling) via VLMs. Systems such as OpenAI CUA, Claude Computer Use, and UFO2 are representative examples. Current systems apply a single fixed VLM to all operations; a 20-step task accumulates approximately 400K input tokens, costing $0.10–$0.40.

Limitations of Prior Work: ScreenSpot-Pro data reveals that GPT-4o (~1.8T parameters) achieves only 0.8% accuracy on GUI grounding, whereas OS-Atlas (7B) reaches 18.9%. Scaling Qwen2.5-VL from 3B to 72B (24× parameters) improves accuracy only from 24.2% to 43.6% (1.8×). Model size is thus an unreliable predictor of grounding accuracy.

Key Challenge: CUA action difficulty varies dramatically — clicking a large button is trivial, while localizing a small icon in a dense IDE toolbar is highly challenging. Yet current systems apply the same model uniformly, wasting compute on easy actions and potentially failing on hard ones. Cross-application accuracy variance (>35% for VS Code vs. <15% for Premiere Pro) far exceeds inter-model variance.

Goal: Reformulate CUA inference as a dynamic model routing problem, selecting the most cost-efficient yet sufficiently reliable VLM for each individual action.

Key Insight: Routing is formalized as a cost–accuracy constrained optimization problem: \(\min_\pi \sum c_{\pi(i)}\) s.t. accuracy \(\geq \tau_{acc}\), introducing latent difficulty variables and threshold-based policies.

Core Idea: A three-stage routing pipeline — difficulty estimation, confidence probing, and memory injection — directs the majority of CUA actions to smaller models, escalating to larger models only for difficult or high-risk operations.

Method

Overall Architecture

AVR operates as a transparent proxy layer that intercepts each tool call and sequentially performs: safety check (Visual Confused Deputy guardrail) → difficulty classification (multimodal embeddings) → small-model confidence probing (logprob) → routing decision. This yields a three-tier policy: low difficulty + high confidence → small model (~78% of traffic); high difficulty or low confidence → large model (~17%); high risk → large model + guardrail verification (~5%).
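The three-tier policy can be condensed into a single decision function. This is a minimal illustration, not the paper's implementation: the `estimate_difficulty` and `probe_confidence` callables and the threshold defaults are stand-ins for the components detailed under Key Designs.

```python
def route(action, estimate_difficulty, probe_confidence,
          d_easy=0.3, d_hard=0.7, tau=0.90):
    """Three-tier routing: small model, large model, or large model
    plus guardrail verification. The helper callables stand in for
    the difficulty classifier and the small-model logprob probe."""
    if action.get("high_risk"):
        return "large+guardrail"            # ~5% of traffic
    d = estimate_difficulty(action)
    if d >= d_hard:
        return "large"                      # clearly hard: skip the probe
    if d <= d_easy:
        return "small"                      # clearly easy: skip the probe
    conf = probe_confidence(action)         # one cheap small-model call
    return "small" if conf >= tau else "large"
```

Note that the probe only runs in the intermediate-difficulty band, which is what keeps the pre-screening stage cheap.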

Key Designs

  1. Multimodal Difficulty Classifier
     • Function: Estimates the difficulty score of each GUI action as \(d(t_i) = \max(d_{vis}, d_{sem})\).
     • Mechanism: Crops a 100×100-pixel region around the predicted coordinates and encodes it into a 384-dimensional shared space using a 120M-parameter model (SigLIP + MiniLM-L6-v2). Visual embeddings are matched against a difficulty knowledge base (easy/hard UI element prototypes derived from ScreenSpot-Pro) via nearest-neighbor cosine similarity; text descriptions are matched similarly. The maximum of the two scores is taken as a conservative estimate.
     • Design Motivation: The lightweight (120M) pre-screening stage allows clearly easy actions to bypass probing and route directly to the small model, while clearly hard actions route directly to the large model.
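The prototype-matching step can be sketched as follows, assuming the visual and semantic embeddings have already been produced by the SigLIP/MiniLM encoders. The `nn_difficulty` scoring rule is a hypothetical choice for illustration; the paper specifies only nearest-neighbor cosine matching against easy/hard prototypes and the max-combination.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def nn_similarity(query, prototypes):
    """Cosine similarity to the nearest prototype embedding."""
    return max(cosine(query, p) for p in prototypes)

def nn_difficulty(query, easy_protos, hard_protos):
    """Map nearest-neighbor similarities to a [0, 1] difficulty score.
    (Hypothetical scoring rule: relative closeness to hard prototypes.)"""
    s_easy = nn_similarity(query, easy_protos)
    s_hard = nn_similarity(query, hard_protos)
    return s_hard / (s_easy + s_hard + 1e-8)

def action_difficulty(vis_emb, sem_emb, kb):
    # d(t_i) = max(d_vis, d_sem): conservative estimate, as in the paper
    d_vis = nn_difficulty(vis_emb, kb["easy_vis"], kb["hard_vis"])
    d_sem = nn_difficulty(sem_emb, kb["easy_sem"], kb["hard_sem"])
    return max(d_vis, d_sem)
```

With a ScreenSpot-Pro-derived knowledge base, a crop resembling a large labeled button would score near 0, while a crop resembling a dense toolbar icon would score near 1.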

  2. Logprob Confidence Probing with Adaptive Thresholds
     • Function: Invokes the small VLM in non-streaming mode to obtain output token log-probabilities and computes a normalized confidence score: \(\text{conf}(t_i) = (\bar{\ell}(t_i) + |\ell_{min}|) / |\ell_{min}|\).
     • Mechanism: Routing decisions combine difficulty-adaptive thresholds — easy actions (\(\hat{d}<0.3\)) use a lower threshold \(\tau_{easy}=0.80\); hard actions (\(\hat{d}>0.7\)) use a higher threshold \(\tau_{hard}=0.92\); intermediate values are linearly interpolated. Actions exceeding the threshold remain on the small model; others are escalated.
     • Design Motivation: A fixed threshold either under-escalates on hard actions (when set too low) or over-escalates on easy ones (when set too high). Adaptive thresholds align the escalation decision with each action's difficulty profile.
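The normalization and interpolation can be written out directly (a minimal sketch; the value of the logprob floor \(\ell_{min}\) is an assumed placeholder, not taken from the paper):

```python
def normalized_confidence(logprobs, l_min=-5.0):
    """conf = (mean_logprob + |l_min|) / |l_min|, clipped to [0, 1].
    l_min is the logprob floor used for normalization (assumed value)."""
    mean_lp = sum(logprobs) / len(logprobs)
    conf = (mean_lp + abs(l_min)) / abs(l_min)
    return max(0.0, min(1.0, conf))

def adaptive_threshold(d, tau_easy=0.80, tau_hard=0.92, d_lo=0.3, d_hi=0.7):
    """Linearly interpolate the escalation threshold between the
    easy and hard regimes described above."""
    if d <= d_lo:
        return tau_easy
    if d >= d_hi:
        return tau_hard
    frac = (d - d_lo) / (d_hi - d_lo)
    return tau_easy + frac * (tau_hard - tau_easy)

def keep_on_small_model(logprobs, difficulty):
    # stay on the small model only if confidence clears the
    # difficulty-adjusted bar; otherwise escalate
    return normalized_confidence(logprobs) >= adaptive_threshold(difficulty)
```

For example, a mean token logprob of -0.2 yields conf = 0.96, which clears the easy-action bar (0.80) and, in this sketch, also the hard-action bar (0.92).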

  3. Memory-Compensated Routing (Warm Start)
     • Function: For agents with existing interaction history, relevant memories (UI element positions, navigation paths, toolbar layouts) are injected into the small VLM probing prompt.
     • Mechanism: Memory injection yields far greater confidence gains for the small model than for the large model (\(\Delta\text{conf}_S(\mathcal{M}) \gg \Delta\text{conf}_L(\mathcal{M})\)). In OpenClaw experiments, small-model confidence rises from 0.83 to 0.96, retaining all actions on the small model.
     • Design Motivation: Large models possess sufficient internal knowledge, so memory provides only marginal gains; small models lack domain knowledge, and explicit context bridges the capability gap — forming a virtuous cycle in which the system becomes cheaper as usage accumulates.
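Memory injection itself can be as simple as prepending retrieved facts to the probe prompt. This is a hypothetical sketch: the paper does not specify the prompt format, and memory retrieval/ranking is omitted here.

```python
def build_probe_prompt(instruction, memories, max_items=5):
    """Prepend retrieved interaction memories (UI element positions,
    navigation paths, toolbar layouts) to the small-model probe prompt.
    Prompt wording is illustrative, not taken from the paper."""
    lines = ["You are operating a GUI. Relevant facts from past sessions:"]
    lines += [f"- {m}" for m in memories[:max_items]]
    lines += ["", f"Current step: {instruction}",
              "Respond with the action and target coordinates."]
    return "\n".join(lines)
```

The same template works cold-start by passing an empty memory list, which is what makes the warm/cold confidence comparison straightforward to run.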

Loss & Training

No end-to-end training is required. Routing is derived from threshold-based policies. The cost model is: \(E[c] = (1-\alpha) c_S + \alpha (c_S^{probe} + c_L)\). When the small model is 10× cheaper (\(c_S/c_L=0.1\)) and only 20% of actions are escalated, a 70% cost reduction is achievable.
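The cost model is easy to verify numerically, normalizing the large-model cost to 1 and assuming the probe costs the same as a small-model call:

```python
def expected_cost(alpha, c_small, c_large, c_probe=None):
    """E[c] = (1 - alpha) * c_S + alpha * (c_S_probe + c_L).
    An escalated action pays for both the probe and the large call."""
    if c_probe is None:
        c_probe = c_small   # the probe is itself a small-model call
    return (1 - alpha) * c_small + alpha * (c_probe + c_large)

# c_S / c_L = 0.1 with a 20% escalation rate:
cost = expected_cost(alpha=0.2, c_small=0.1, c_large=1.0)   # ≈ 0.30
# relative to always calling the large model (cost 1.0): ≈ 70% savings
```

The double payment on escalated actions is why very high escalation rates (α → 1) make routing strictly worse than calling the large model directly.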

Key Experimental Results

Main Results

| Routing Scenario | Escalation Rate α | Effective Accuracy | Cost per Call | Savings |
|---|---|---|---|---|
| Full 72B (baseline) | 1.0 | 43.6% | $0.27 | |
| Cold-start AVR | 0.35 | 42.1% | $0.13 | 52% |
| Warm-start AVR | 0.15 | 41.3% | $0.08 | 70% |
| Warm-start + difficulty classification | 0.10 | 42.8% | $0.06 | 78% |

Note: The above figures are analytically derived from ScreenSpot-Pro accuracy data and OpenClaw confidence distributions, not obtained via end-to-end empirical evaluation.

Ablation Study

| Analysis Dimension | Key Metric | Description |
|---|---|---|
| Memory injection effect | 7B confidence: 0.83 → 0.96 | Creates a bimodal distribution; cold-state below threshold, warm-state well above |
| OpenClaw cost | 86% savings (warm start) | 100% of actions remain on 7B, matching 139B quality |
| Application warm-up curve | Largest gains within first 5–10 interactions | Logarithmic shape with diminishing returns |
| Threshold sensitivity | Default 0.93 → tuned 0.85 | Agent-workload confidence is compressed into a narrow band, requiring threshold reduction |

Key Findings

  • Model size is a weak predictor of GUI grounding accuracy: GPT-4o (1.8T) achieves only 0.8%, while OS-Atlas-7B reaches 18.9%.
  • Memory injection has an asymmetric effect — far more beneficial for small models than large ones — making memory a "model size equalizer."
  • Safety, cost, and accuracy objectives can be unified within a single routing layer; the Visual Confused Deputy guardrail achieves F1 = 0.915 with zero additional overhead by reusing the same multimodal encoder.

Highlights & Insights

  • Reframes CUA inference from "fixed cost" to "adaptive resource allocation" — a conceptually novel perspective.
  • The Memory Equalization Hypothesis carries notable theoretical depth as a finding.
  • The analysis is intellectually honest: the paper explicitly distinguishes empirically measured results from analytically derived estimates, avoiding overclaiming.
  • The unified routing framework addressing three objectives simultaneously (cost + accuracy + safety) is elegantly designed.

Limitations & Future Work

  • Core CUA grounding cost savings are extrapolated from OpenClaw text-only tasks rather than validated via end-to-end CUA evaluation.
  • For very short tasks (2–3 steps), probing overhead may offset routing gains.
  • The difficulty knowledge base must cover target applications; routing effectiveness for new applications during cold start remains uncertain.
  • Memory benefits may vary substantially across applications with different UI complexity levels; application-category-level analysis is lacking.
Comparison with Related Work

  • vs. FrugalGPT: Text-level cascade routing; AVR extends this paradigm to multimodal CUA scenarios, additionally accounting for visual grounding uncertainty and action risk.
  • vs. HybridLLM: Trains a router to predict difficulty for dispatch; AVR further introduces memory compensation and safety override mechanisms.
  • vs. Visual Confused Deputy: A pure post-hoc safety filter; AVR integrates safety signals into the upstream routing decision.
  • vs. ScreenSpot-Pro: Provides model grounding-capability data without offering a routing framework; AVR leverages its data to construct routing strategies.

Rating

  • Novelty: ⭐⭐⭐⭐ — The unified framework combining routing, memory, and safety is genuinely novel; the Memory Equalization concept carries theoretical depth.
  • Experimental Thoroughness: ⭐⭐⭐ — Analysis is detailed, but core CUA cost savings are analytically derived rather than end-to-end validated; OpenClaw involves text-only tasks.
  • Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear, mathematical derivations are complete, and limitations are candidly discussed.
  • Value: ⭐⭐⭐ — Practically significant for large-scale CUA deployment; the routing framework generalizes to other multi-model scheduling scenarios.