A4VL: A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning¶
Conference: CVPR 2026 arXiv: 2603.14052 Code: https://github.com/git-disl/A4VL Area: Multimodal VLM / Video Understanding Keywords: Multi-Agent Alliance, Long Video Reasoning, Perception-Action Exploration, Cross-Examination Consensus, Event-Driven Segmentation, Agent Pruning
TL;DR¶
This paper proposes A4VL, a training-free multi-agent perception-action alliance framework in which multiple heterogeneous VLM agents perform iterative perception exploration (event-based segmentation + CLIP-guided clue alignment for keyframe localization) and action exploration (independent reasoning → cross-scoring → consensus/pruning). A4VL comprehensively outperforms 18 VLMs and 11 long-video-specialized methods across 5 VideoQA benchmarks, with significantly lower inference latency (74s vs. GPT-4o's 127s on MLVU).
Background & Motivation¶
Long video reasoning faces dual challenges of efficiency and quality:

- Computational overhead: Videos contain many frames, and the memory and time complexity of attention mechanisms scales quadratically with frame count. GPT-4o requires over 150s per question on average on Video-MME.
- Information sparsity: Critical information is scattered across long sequences; naively increasing sampling density introduces redundant noise and dilutes attention with irrelevant frames.
- Single-agent limitations: Existing agent methods (e.g., VideoAgent) typically rely on a single MLLM for decision-making, do not support multi-agent collaboration, and depend on video grounding models that perform poorly on complex queries. VideoAgent requires over 10 minutes to process a one-hour video.
Core Problem¶
How to efficiently process real-world long videos under a limited frame budget while maintaining high-quality video reasoning?
Method¶
Overall Architecture¶
A4VL consists of three core components: Agent Teaming, Perception Exploration, and Action Exploration. The latter two are executed iteratively until consensus is reached.
1. Agent Teaming¶
From a pool of 8 candidate MLLMs (open-source models ranging from 7B to 78B), the 3 most collaborative agents are selected (see the sketch below):

- \(K\) unlabeled video-question pairs are randomly sampled.
- Each agent independently runs the perception and answering pipeline, recording the selection frequency \(f_{qr}\) of each option \(r\) for each question \(q\).
- Each agent's score is the average, over the \(K\) questions, of how frequently its chosen option was selected across all agents.
- The top-3 agents by score form the team.
Key insight: No labels are required; inter-agent agreement serves as a proxy signal. Different benchmarks yield different teams (e.g., NeXT-QA, EgoSchema, and LongVideoBench all select InternVL3-78B + InternVL3.5-38B + Qwen2.5-VL-72B, while MLVU substitutes LLaVA-Video-72B).
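To make the unlabeled scoring concrete, here is a minimal Python sketch of the teaming step. The `agent(video, question)` callable interface is an assumption for illustration, standing in for each MLLM's full perception-and-answer pipeline; it is not the paper's actual API.

```python
from collections import Counter

def select_team(agents, samples, team_size=3):
    """Pick the most 'collaborative' agents via unlabeled inter-agent agreement.

    agents:  list of callables; agent(video, question) -> chosen option
             (assumed interface, wrapping each MLLM's full pipeline)
    samples: K unlabeled (video, question) pairs
    """
    # Each agent answers every question independently.
    answers = {a: [a(v, q) for v, q in samples] for a in agents}

    scores = {}
    for a in agents:
        total = 0.0
        for i in range(len(samples)):
            # f_qr: fraction of all agents that picked option r for question q.
            votes = Counter(answers[b][i] for b in agents)
            # Credit agent a with the frequency of its own chosen option.
            total += votes[answers[a][i]] / len(agents)
        # An agent's score is its average agreement frequency over the K samples.
        scores[a] = total / len(samples)

    # Keep the top-`team_size` agents by agreement score.
    return sorted(agents, key=scores.get, reverse=True)[:team_size]
```

Since agreement with the majority is the only signal, an agent that consistently sides with the pool's consensus scores highest; this is what makes the selection label-free.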
2. Perception Exploration¶
Consists of two stages:
Stage 1 — Clue Generation: Each agent randomly samples \(N_1 = 4\) frames from the full video as a preview, then, conditioned on the question and options, generates a textual perception clue describing the key visual content to locate in the video.

- Random sampling is preferred over event-based sampling here: only coarse-grained coverage is needed, and uniform random sampling provides higher temporal coverage.
Stage 2 — Clue-Guided Block-Aligned Sampling:

- Event-driven segmentation partitions the video into at most \(B\) semantic blocks: DINOv2 embeddings combined with HSV/motion/sharpness pixel cues detect scene changes; KTS, PELT, and SSM-novelty generate candidate boundaries; NMS deduplicates them, retaining the top \(B-1\) boundaries (completed in under 2s for most videos).
- CLIP computes the similarity between each block and each agent's perception clue.
- If every block's similarity falls below \(\rho = 0.8\), all \(N_2 = 16\) frames are sampled from the single most relevant block.
- Otherwise, blocks with similarity \(> \rho\) are retained, and the \(N_2 = 16\) frames are allocated across them proportionally via softmax-normalized similarity scores (see the sketch after this list).
- Each agent may receive a different set of sampled frames, since agents' perception clues differ.
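The frame-allocation rule can be sketched as follows, assuming the per-block CLIP similarities are already computed; function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def allocate_frames(block_sims, n2=16, rho=0.8):
    """Allocate an N2-frame budget across semantic blocks by clue similarity.

    block_sims: 1-D array of CLIP similarities between each block and the
                agent's perception clue.
    Returns: per-block frame counts summing to n2.
    """
    sims = np.asarray(block_sims, dtype=float)
    counts = np.zeros(len(sims), dtype=int)

    relevant = sims > rho
    if not relevant.any():
        # Fallback: no block clears the threshold, so spend the whole
        # budget on the single most similar block.
        counts[sims.argmax()] = n2
        return counts

    # Softmax-normalize the retained blocks' similarities and split the
    # budget proportionally.
    weights = np.exp(sims[relevant] - sims[relevant].max())
    weights /= weights.sum()
    alloc = np.floor(weights * n2).astype(int)
    # Hand out frames lost to flooring, largest weights first.
    for i in np.argsort(-weights)[: n2 - alloc.sum()]:
        alloc[i] += 1
    counts[np.where(relevant)[0]] = alloc
    return counts
```

For example, with similarities `[0.9, 0.85, 0.3]` and \(\rho = 0.8\), the third block is dropped and the 16-frame budget splits 9/7 in favor of the most similar block.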
3. Action Exploration¶
Consists of two stages:
Stage 1 — Independent Reasoning: Each agent independently generates an answer \(a_{i,j}\) and reasoning rationale \(R_{i,j}\) based on its own \(N_2\) frames.
Stage 2 — Consensus and Pruning (a code sketch follows this list):

- Consensus check: If all agents agree (Full Consensus), a summarizer aggregates all reasoning processes and outputs the final answer with an explanation.
- Cross-scoring: If consensus is not reached, each agent scores all agents' answers (including its own) on a scale of 1–10.
- Agent pruning: The agent with the lowest total score is removed from the team.
- Clue refinement: The remaining agents generate more precise new clues \(P_{i,j+1}\) based on the current round's answer set, reasoning set, and the pruned agent's output.
- The process returns to Perception Exploration Stage 2 for a new round (up to 3 rounds, given 3 agents).
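One plausible shape for this loop, as a minimal sketch: `agent.answer`, `agent.score`, and `agent.refine_clue` are hypothetical methods wrapping prompted MLLM calls, and `perception_stage2` is a hypothetical hook that resamples frames against a refined clue; none of these names come from the paper.

```python
def action_exploration(team, frames, question, perception_stage2, max_rounds=3):
    """Consensus / cross-scoring / pruning loop (illustrative sketch).

    team:              list of agent objects (assumed interface, see lead-in)
    frames:            dict mapping each agent to its N2 sampled frames
    perception_stage2: callable(clue) -> new frame set for the next round
    """
    for _ in range(max_rounds):
        # Stage 1: each agent reasons independently on its own frames,
        # returning an answer plus a rationale.
        results = {a: a.answer(frames[a], question) for a in team}
        answers = {a: r["answer"] for a, r in results.items()}

        # Full consensus: all remaining agents agree -> done (a summarizer
        # would aggregate the rationales here).
        if len(set(answers.values())) == 1:
            return next(iter(answers.values())), results

        # Stage 2: every agent scores every answer, including its own (1-10).
        totals = {b: sum(a.score(results[b], question) for a in team)
                  for b in team}

        # Prune the agent with the lowest total score from the team.
        pruned = min(team, key=totals.get)
        team = [a for a in team if a is not pruned]

        # Remaining agents refine their clues from this round's answers,
        # rationales, and the pruned agent's output, then resample frames
        # via Perception Exploration Stage 2.
        clues = {a: a.refine_clue(results, pruned) for a in team}
        frames = {a: perception_stage2(clues[a]) for a in team}

    # No consensus within the round budget: fall back to the best-scored answer.
    return answers[max(totals, key=totals.get)], results
```

With 3 agents and one pruning per round, the loop naturally terminates within 3 rounds: a lone surviving agent trivially satisfies full consensus.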
Ablation Validation of Design Choices (EgoSchema)¶
| Design Dimension | Option | Accuracy | Notes |
|---|---|---|---|
| Sampling strategy | RR (fully random) | 80.2% | |
| | RE (random perception + event action) | 82.2% | A4VL default |
| | ER (event perception + random action) | 79.6% | |
| Consensus condition | Majority consensus | 81.4% (26s) | |
| | Full consensus | 82.2% (37s) | A4VL default; more rounds yield higher confidence |
| Pruning | No pruning (Sum) | 80.8% (60s) | |
| | No pruning (Maj) | 79.4% (60s) | |
| | A4VL pruning | 82.2% (37s) | Pruning improves both accuracy and efficiency |
Key Experimental Results¶
Main Results (5 Benchmarks, 28+ Methods)¶
| Benchmark | A4VL | Strongest Baseline | Gain |
|---|---|---|---|
| NeXT-QA | 85.1% | InternVL3-78B 84.0% | +1.1 |
| EgoSchema | 82.2% | LVAgent 78.4% | +3.8 |
| LongVideoBench | 72.2% | GPT-4o 66.7% | +5.5 |
| MLVU-Test | 58.0% | InternVL3.5-38B 56.1% | +1.9 |
| Video-MME (w/o sub) | 77.2% | Gemini 1.5 Pro 75.0% | +2.2 |
- The largest gain is on LongVideoBench (+5.5), where A4VL surpasses GPT-4o using only open-source models.
- On EgoSchema, A4VL is the only method to break 80%.
Inference Efficiency (Average Time per Sample)¶
| Method | NeXT-QA | EgoSchema | MLVU |
|---|---|---|---|
| GPT-4o | 23s | 54s | 127s |
| InternVL3-78B | 15s | 50s | 204s |
| VideoAgent | 20s | 83s | 175s |
| TraveLER | 101s | 94s | 450s |
| A4VL | 18s | 37s | 74s |
On MLVU, A4VL is 42% faster than GPT-4o (74s vs. 127s) and roughly 6× faster than TraveLER (74s vs. 450s). The advantage grows with video length.
Multi-Round Collaboration Statistics¶
Harder datasets lead agents to engage in more rounds of collaboration. On MLVU, approximately 40% of questions require 3 rounds; on NeXT-QA, approximately 13%.
Highlights & Insights¶
- Heterogeneous multi-agent collaboration: Leverages complementary strengths of different MLLMs combined with cross-verification, yielding more reliable outputs than any single model.
- Perception-action decoupling: Generating clues from only 4 preview frames before precise localization avoids the enormous overhead of processing entire videos.
- Event-driven segmentation: DINOv2-based unsupervised scene partitioning is semantically meaningful and extremely fast (<2s).
- Dynamic pruning: Not only improves accuracy by removing erroneous agents but also reduces computational cost in subsequent rounds.
- Fully training-free: Requires no fine-tuning of any model; directly combines existing open-source VLMs.
Limitations & Future Work¶
- Requires simultaneous deployment of multiple large models (experiments use 6× H200 GPUs), imposing high hardware requirements.
- The agent teaming stage requires a small amount of task data (albeit unlabeled), limiting cold-start applicability.
- Only visual information from video is utilized; the audio modality is not exploited.
- CLIP as a clue-block similarity model is relatively simple; stronger cross-modal matching could further improve localization quality.
- Fixed frame budgets of \(N_1 = 4\) and \(N_2 = 16\) may not be sufficiently adaptive for videos of varying length or complexity.
Rating¶
- Novelty: ⭐⭐⭐⭐ The multi-agent perception-action alliance design is novel; the pipeline of perception clues → event segmentation → CLIP alignment is elegantly constructed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive comparison across 5 benchmarks and 28+ methods; ablations cover sampling, consensus, pruning, and round count; efficiency data are complete.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, rigorous formal descriptions, and intuitive case visualizations in Figure 4.
- Value: ⭐⭐⭐⭐ Training-free, with open-source models surpassing GPT-4o; a practically deployable long video reasoning solution.