A4VL: A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning

Conference: CVPR 2026 arXiv: 2603.14052 Code: https://github.com/git-disl/A4VL Area: Multimodal VLM / Video Understanding Keywords: Multi-Agent Alliance, Long Video Reasoning, Perception-Action Exploration, Cross-Examination Consensus, Event-Driven Segmentation, Agent Pruning

TL;DR

This paper proposes A4VL, a training-free multi-agent perception-action alliance framework in which multiple heterogeneous VLM agents perform iterative perception exploration (event-based segmentation + CLIP-guided clue alignment for keyframe localization) and action exploration (independent reasoning → cross-scoring → consensus/pruning). A4VL comprehensively outperforms 18 VLMs and 11 long-video-specialized methods across 5 VideoQA benchmarks, with significantly lower inference latency (74s vs. GPT-4o's 127s on MLVU).

Background & Motivation

Long video reasoning faces dual challenges of efficiency and quality:

  • Computational overhead: Videos contain many frames; the memory and time complexity of attention mechanisms scales quadratically with frame count. GPT-4o requires 150s+ per question on average on Video-MME.
  • Information sparsity: Critical information is scattered across long sequences; naively increasing sampling density introduces redundant noise and dilutes attention with irrelevant frames.
  • Single-agent limitations: Existing agent methods (e.g., VideoAgent) typically rely on a single MLLM for decision-making, do not support multi-agent collaboration, and depend on video grounding models that perform poorly on complex queries. VideoAgent requires 10+ minutes to process a one-hour video.

Core Problem

How to efficiently process real-world long videos under a limited frame budget while maintaining high-quality video reasoning?

Method

Overall Architecture

A4VL consists of three core components: Agent Teaming, Perception Exploration, and Action Exploration. The latter two are executed iteratively until consensus is reached.

1. Agent Teaming

From a pool of 8 candidate MLLMs (open-source models ranging from 7B to 78B), the 3 most collaborative agents are selected:

  • \(K\) unlabeled video-question pairs are randomly sampled.
  • Each agent independently runs the perception and answering pipeline, recording the selection frequency \(f_{qr}\) of each option for each question.
  • Each agent's score is computed as the average frequency with which its chosen options are selected across all agents.
  • The top-3 agents by score form the team.

Key insight: No labels are required; inter-agent agreement serves as a proxy signal. Different benchmarks yield different teams (e.g., NeXT-QA/EgoSchema/LongVideoBench all select InternVL3-78B + InternVL3.5-38B + Qwen2.5-VL-72B, while MLVU substitutes LLaVA-Video-72B).
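The label-free scoring rule above can be sketched as follows. This is a minimal illustration, not the paper's code; the function names (`team_scores`, `select_team`) and the dict-of-answers representation are assumptions for the sketch. An agent's score is the average, over the \(K\) sampled questions, of the frequency with which its chosen option is also chosen by the other agents:

```python
from collections import Counter

def team_scores(answers):
    """Label-free agent scoring via inter-agent agreement (sketch).

    answers: dict agent_name -> list of chosen options, one per sampled question.
    Returns dict agent_name -> average frequency of that agent's picks across all agents.
    """
    agents = list(answers)
    n_agents = len(agents)
    n_questions = len(answers[agents[0]])
    scores = {}
    for name in agents:
        total = 0.0
        for q in range(n_questions):
            votes = Counter(answers[a][q] for a in agents)
            # f_{qr}: fraction of agents that picked this agent's option on question q
            total += votes[answers[name][q]] / n_agents
        scores[name] = total / n_questions
    return scores

def select_team(answers, k=3):
    """Keep the top-k agents by agreement score."""
    s = team_scores(answers)
    return sorted(s, key=s.get, reverse=True)[:k]
```

Agents whose answers frequently coincide with the rest of the pool score high and are retained; an outlier agent that rarely agrees with anyone scores low and is excluded.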

2. Perception Exploration

Consists of two stages:

Stage 1 — Clue Generation: Each agent randomly samples \(N_1 = 4\) frames from the full video for a preview, then generates a textual perception clue in conjunction with the question and options, describing the key visual content to be located in the video.

  • Random sampling is preferred over event-based sampling here, as only coarse-grained coverage is needed; uniform random sampling provides higher temporal coverage.

Stage 2 — Clue-Guided Block-Aligned Sampling:

  • Event-driven segmentation partitions the video into at most \(B\) semantic blocks: DINOv2 embeddings combined with HSV/motion/sharpness pixel cues detect scene changes; KTS, PELT, and SSM-novelty generate candidate boundaries; NMS deduplicates them, retaining the top \(B-1\) boundaries (completed in under 2s for most videos).
  • CLIP computes the similarity between each block and each agent's perception clue.
  • If all block similarities fall below \(\rho = 0.8\), all \(N_2 = 16\) frames are sampled from the single most relevant block.
  • Otherwise, blocks with similarity \(> \rho\) are retained, and frame counts are allocated proportionally via softmax-normalized scores, totaling \(N_2 = 16\) frames.
  • Each agent may receive a different set of sampled frames due to differing perception clues.
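The threshold-then-softmax allocation step can be sketched as below. This assumes the per-block CLIP similarities have already been computed; the function name `allocate_frames` and the rounding-drift correction are illustrative assumptions, not the paper's implementation:

```python
import math

def allocate_frames(block_sims, n2=16, rho=0.8):
    """Clue-guided frame allocation across semantic blocks (sketch).

    block_sims: CLIP similarity of each block to one agent's perception clue.
    If no block clears rho, spend the whole N2 budget on the single best block;
    otherwise split N2 frames across passing blocks via softmax weights.
    Returns dict: block index -> number of frames to sample there.
    """
    passing = [i for i, s in enumerate(block_sims) if s > rho]
    if not passing:  # all similarities below threshold: most relevant block only
        best = max(range(len(block_sims)), key=lambda i: block_sims[i])
        return {best: n2}
    exps = {i: math.exp(block_sims[i]) for i in passing}
    z = sum(exps.values())
    alloc = {i: round(n2 * exps[i] / z) for i in passing}
    # correct rounding drift so allocations sum to exactly n2 (assumption)
    drift = n2 - sum(alloc.values())
    if drift:
        alloc[max(passing, key=lambda i: block_sims[i])] += drift
    return alloc
```

Because each agent's perception clue differs, the same video can yield a different allocation per agent, which is what gives the team complementary views of the footage.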

3. Action Exploration

Consists of two stages:

Stage 1 — Independent Reasoning: Each agent independently generates an answer \(a_{i,j}\) and reasoning rationale \(R_{i,j}\) based on its own \(N_2\) frames.

Stage 2 — Consensus and Pruning:

  • Consensus check: If all agents agree (Full Consensus), a summarizer aggregates all reasoning processes and outputs the final answer with explanation.
  • Cross-scoring: If consensus is not reached, each agent scores all agents' answers (including its own) on a scale of 1–10.
  • Agent pruning: The agent with the lowest total score is removed from the team.
  • Clue refinement: Remaining agents generate more precise new clues \(P_{i,j+1}\) based on the current round's answer set, reasoning set, and information from the pruned agent.
  • The process returns to Perception Exploration Stage 2 for a new round (up to 3 rounds, given 3 agents).
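The control flow of one consensus/pruning round can be sketched as follows. This is a simplified illustration: `consensus_round` and `score_fn` are hypothetical names, and the real system has each VLM produce the 1–10 scores, whereas here `score_fn` stands in for that call:

```python
def consensus_round(answers, score_fn):
    """One Action Exploration round (sketch).

    answers:  dict agent -> answer for the current round.
    score_fn: score_fn(rater, ratee) -> 1-10 rating of ratee's answer by rater
              (stand-in for a VLM cross-scoring call).
    Returns (final_answer, pruned_agent); exactly one of the two is None.
    """
    if len(set(answers.values())) == 1:      # Full Consensus reached
        return next(iter(answers.values())), None
    # Cross-scoring: every agent rates every answer, including its own
    totals = {a: sum(score_fn(r, a) for r in answers) for a in answers}
    pruned = min(totals, key=totals.get)     # drop the lowest-scored agent
    return None, pruned
```

With 3 agents and one agent pruned per disagreeing round, the loop terminates after at most 3 rounds, matching the paper's bound.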

Ablation Validation of Design Choices (EgoSchema)

| Design Dimension | Option | Accuracy | Notes |
|---|---|---|---|
| Sampling strategy | RR (fully random) | 80.2% | |
| | RE (random perception + event action) | 82.2% | A4VL default |
| | ER (event perception + random action) | 79.6% | |
| Consensus condition | Majority consensus | 81.4% (26s) | |
| | Full consensus | 82.2% (37s) | A4VL default; more rounds yield higher confidence |
| Pruning | No pruning (Sum) | 80.8% (60s) | |
| | No pruning (Maj) | 79.4% (60s) | |
| | A4VL pruning | 82.2% (37s) | Pruning improves both accuracy and efficiency |

Key Experimental Results

Main Results (5 Benchmarks, 28+ Methods)

| Benchmark | A4VL | Strongest Baseline | Gain |
|---|---|---|---|
| NeXT-QA | 85.1% | InternVL3-78B, 84.0% | +1.1 |
| EgoSchema | 82.2% | LVAgent, 78.4% | +3.8 |
| LongVideoBench | 72.2% | GPT-4o, 66.7% | +5.5 |
| MLVU-Test | 58.0% | InternVL3.5-38B, 56.1% | +1.9 |
| Video-MME (w/o sub) | 77.2% | Gemini 1.5 Pro, 75.0% | +2.2 |
  • The largest gain is on LongVideoBench (+5.5), surpassing GPT-4o entirely with open-source models.
  • On EgoSchema, A4VL is the only method to break 80%.

Inference Efficiency (Average Time per Sample)

| Method | NeXT-QA | EgoSchema | MLVU |
|---|---|---|---|
| GPT-4o | 23s | 54s | 127s |
| InternVL3-78B | 15s | 50s | 204s |
| VideoAgent | 20s | 83s | 175s |
| TraveLER | 101s | 94s | 450s |
| A4VL | 18s | 37s | 74s |

On MLVU, A4VL is 42% faster than GPT-4o and 6× faster than TraveLER. The advantage grows with video length.

Multi-Round Collaboration Statistics

Harder datasets lead agents to engage in more rounds of collaboration. On MLVU, approximately 40% of questions require 3 rounds; on NeXT-QA, approximately 13%.

Highlights & Insights

  • Heterogeneous multi-agent collaboration: Leverages complementary strengths of different MLLMs combined with cross-verification, yielding more reliable outputs than any single model.
  • Perception-action decoupling: Generating clues from very few frames (4 frames) before precise localization avoids the enormous overhead of processing entire videos.
  • Event-driven segmentation: DINOv2-based unsupervised scene partitioning is semantically meaningful and extremely fast (<2s).
  • Dynamic pruning: Not only improves accuracy by removing erroneous agents but also reduces computational cost in subsequent rounds.
  • Fully training-free: Requires no fine-tuning of any model; directly combines existing open-source VLMs.

Limitations & Future Work

  • Requires simultaneous deployment of multiple large models (experiments use 6× H200 GPUs), imposing high hardware requirements.
  • The agent teaming stage requires a small amount of task data (albeit unlabeled), limiting cold-start applicability.
  • Only visual information from video is utilized; the audio modality is not exploited.
  • CLIP as a clue-block similarity model is relatively simple; stronger cross-modal matching could further improve localization quality.
  • Fixed frame budgets of \(N_1 = 4\) and \(N_2 = 16\) may not be sufficiently adaptive for videos of varying length or complexity.

Rating

  • Novelty: ⭐⭐⭐⭐ The multi-agent perception-action alliance design is novel; the pipeline of perception clues → event segmentation → CLIP alignment is elegantly constructed.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive comparison across 5 benchmarks and 28+ methods; ablations cover sampling, consensus, pruning, and round count; efficiency data are complete.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, rigorous formal descriptions, and intuitive case visualizations in Figure 4.
  • Value: ⭐⭐⭐⭐ Training-free, open-source models surpassing GPT-4o; a practically deployable long video reasoning solution.