👥 Multi-Agent

📷 CVPR2026 · 9 paper notes

🔥 Top topics: Agents ×9 · Few-/Zero-Shot Learning ×2 · Reasoning ×2

Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection

An LLM-driven multi-agent system plays the roles of both forgers and social-network observers, simulating the complete life cycle of a face forgery from creation to propagation. It synthesizes training data with text-image consistency annotations, which yields significant performance gains for deepfake detectors in cross-domain and cross-algorithm real-world scenarios (e.g., Celeb-DF AUC improves from the 70% range to 87.1%).

AgentDet: A Shared-Blackboard Multi-Agent Framework for Zero-/Few-Shot Object Detection

AgentDet decomposes zero-/few-shot object detection across four LLM agents: Scout, Pinner, Curator, and Judge. The agents collaborate through a "Shared Blackboard" and a patch-level "Knowledge Base" (KB). The framework stores visual evidence in the KB as fragments, assembles those fragments into holistic textual clues for LLM-based box prediction, and trains only the Judge agent. It achieves competitive results on PASCAL VOC and COCO for both ZSOD and FSOD tasks.
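The shared-blackboard pattern mentioned above can be illustrated with a minimal sketch. This is not AgentDet's implementation: the `Blackboard` class, `run_round`, and the stub agents are hypothetical, and the stubs only report how many earlier notes they could read, since the summary does not describe what each of the four agents actually does.

```python
from dataclasses import dataclass, field
from typing import Callable

# An agent reads every note posted so far and returns one new note.
Agent = Callable[[list[str]], str]


@dataclass
class Blackboard:
    """Hypothetical shared state that all agents read from and write to."""
    notes: list[tuple[str, str]] = field(default_factory=list)

    def post(self, author: str, text: str) -> None:
        self.notes.append((author, text))

    def read_all(self) -> list[str]:
        return [f"{author}: {text}" for author, text in self.notes]


def make_stub_agent(name: str) -> Agent:
    """Stand-in for an LLM call; it only counts the notes it can see."""
    def agent(visible_notes: list[str]) -> str:
        return f"{name} saw {len(visible_notes)} earlier note(s)"
    return agent


def run_round(board: Blackboard, agents: list[tuple[str, Agent]]) -> Blackboard:
    """Let each agent read the board in turn and post its contribution."""
    for name, agent in agents:
        board.post(name, agent(board.read_all()))
    return board


# Agent names are taken from the summary; their behavior here is a placeholder.
names = ["Scout", "Pinner", "Curator", "Judge"]
board = run_round(Blackboard(), [(n, make_stub_agent(n)) for n in names])
for line in board.read_all():
    print(line)
```

The point of the pattern is that agents never call each other directly: every exchange goes through the board, so later agents see everything posted before them.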

MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

Addressing the gap in structured annotations for "inferring deep mental states from observable behaviors," this paper constructs a multimodal dataset, MOTOR-dataset, from real classroom collaborative learning scenarios (1,440 video clips with behavioral/cognitive/emotional labels). It proposes MOTOR-MAS, a reasoning-based multi-agent framework grounded in Self-Regulated Learning (SRL) theory. Three specialized agents perform cascaded reasoning in the order of "Behavior → Cognition → Emotion," using predictions from previous stages as anchors for subsequent stages. MOTOR-MAS achieves a Macro-F1 of 42.77 under zero-shot settings, outperforming the strongest single-model baseline by 15.93 points.
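The cascaded "Behavior → Cognition → Emotion" ordering can be sketched as a loop in which each stage receives the predictions of all earlier stages. The function names, label strings, and rule-based stub predictors below are assumptions made for illustration; the real stages would be agents prompted with the video clip.

```python
from typing import Callable

# A stage maps (clip description, earlier predictions) to a label.
Stage = Callable[[str, dict[str, str]], str]


def run_cascade(clip: str, stages: list[tuple[str, Stage]]) -> dict[str, str]:
    """Run stages in order, passing earlier predictions forward as anchors."""
    anchors: dict[str, str] = {}
    for name, predict in stages:
        # Each stage gets a copy, so it cannot overwrite earlier anchors.
        anchors[name] = predict(clip, dict(anchors))
    return anchors


# Hypothetical stub predictors standing in for the three agents.
def predict_behavior(clip: str, anchors: dict[str, str]) -> str:
    return "note-taking" if "writes" in clip else "idle"


def predict_cognition(clip: str, anchors: dict[str, str]) -> str:
    return "organizing" if anchors["behavior"] == "note-taking" else "off-task"


def predict_emotion(clip: str, anchors: dict[str, str]) -> str:
    return "engaged" if anchors["cognition"] == "organizing" else "bored"


result = run_cascade(
    "student writes in notebook",
    [
        ("behavior", predict_behavior),
        ("cognition", predict_cognition),
        ("emotion", predict_emotion),
    ],
)
print(result)
```

Ordering matters in this design: the emotion stage never sees raw behavior without the cognition prediction that was anchored on it.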

Paper2Figure: A Multi-Agent Collaborative System for Figure Generation Towards Academic Research Paper

Paper2Figure uses a dual multi-agent system comprising "Generator Agents + Refiner Agents." It first translates text descriptions of papers into FigScript, a self-developed structured intermediate language used for rendering. A closed-loop Critic-Refine agent system then performs self-correction. Coupled with an interactive Web editor that returns control to the author, the system outperforms SVG/Mermaid code generation and text-to-image baselines on the self-built Paper2Figure Bench in accuracy, aesthetics, and completeness (+14.1% overall).
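A closed critic-refine loop of the kind described above can be sketched as follows. Everything here is hypothetical: the draft is a plain list of component names rather than FigScript (whose syntax the summary does not give), and the critic is a rule that checks for missing components instead of an LLM.

```python
from typing import Callable

Draft = list[str]


def critic_refine(
    draft: Draft,
    critic: Callable[[Draft], list[str]],
    refine: Callable[[Draft, list[str]], Draft],
    max_rounds: int = 3,
) -> tuple[Draft, int]:
    """Alternate critique and refinement until the critic has no issues.

    Returns the final draft and the number of refinement rounds used.
    """
    for rounds_used in range(max_rounds):
        issues = critic(draft)
        if not issues:
            return draft, rounds_used
        draft = refine(draft, issues)
    return draft, max_rounds


# Hypothetical requirement: the figure must contain these components.
REQUIRED = {"encoder", "decoder"}


def missing_components(draft: Draft) -> list[str]:
    return sorted(REQUIRED - set(draft))


def add_first_missing(draft: Draft, issues: list[str]) -> Draft:
    # Fix one issue per round so the loop visibly iterates.
    return draft + [issues[0]]


final_draft, rounds = critic_refine([], missing_components, add_first_missing)
print(final_draft, rounds)
```

The `max_rounds` cap is a design choice in this sketch: it bounds cost when the critic and refiner fail to converge.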

Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

Refer-Agent decomposes Referring Video Object Segmentation (RVOS) into a step-by-step reasoning pipeline of "frame selection → intent analysis → object localization → mask generation." It adds a two-stage Chain-of-Reflection (Existence Reflection + Consistency Reflection), carried out by a Questioner-Responder pair, that alternates between reasoning and reflection for self-correction. Without any training and using only a 9B open-source MLLM, it outperforms SFT methods and GPT-4o-based zero-shot methods across five RVOS benchmarks.

SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

SciEducator transforms the Deming Cycle (Plan–Do–Study–Act) from management science into a self-evolving multi-agent closed loop. By iteratively performing "planning–execution–review–improvement," the system understands scientific experiment videos and generates multi-modal educational handbooks for children. On the self-constructed SciVBench, it significantly outperforms closed-source MLLMs like GPT-4o and Gemini, as well as existing video agents.

Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding

Symphony mimics human cognition by decomposing long-video understanding into multiple specialized agents based on "capability dimensions" (Planning, Reflection, Grounding, Caption, and Visual Perception). It employs an Actor-Critic-style reflection-enhanced dynamic collaboration mechanism to iteratively correct reasoning and introduces a grounding agent that "expands queries first, then scores with VLM" for complex problems. It achieves SOTA on LVBench, LongVideoBench, Video-MME, and MLVU, outperforming the previous best on LVBench by 5.0%.
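The "expand queries first, then score" idea can be sketched with toy components. The synonym table, the word-overlap scorer, and the segment captions below are all assumptions made for illustration; in the described system the scoring would be done by a VLM on video content, not by word overlap on captions.

```python
def expand_query(query: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the query plus variants with single words swapped for synonyms."""
    words = query.split()
    variants = [query]
    for word, alternatives in synonyms.items():
        if word in words:
            for alt in alternatives:
                variants.append(" ".join(alt if w == word else w for w in words))
    return variants


def overlap_score(query: str, caption: str) -> float:
    """Toy stand-in for a VLM relevance score: fraction of query words matched."""
    query_words = set(query.split())
    return len(query_words & set(caption.split())) / len(query_words)


def ground(query: str, segments: dict[str, str],
           synonyms: dict[str, list[str]], top_k: int = 2) -> list[str]:
    """Rank segments by their best score over all expanded query variants."""
    variants = expand_query(query, synonyms)
    best = {
        seg_id: max(overlap_score(v, caption) for v in variants)
        for seg_id, caption in segments.items()
    }
    return sorted(segments, key=lambda seg_id: best[seg_id], reverse=True)[:top_k]


# Hypothetical segment captions keyed by segment id.
segments = {
    "seg0": "a man opens the door",
    "seg1": "a dog runs outside",
    "seg2": "the person closes the window",
}
top = ground("person opens door", segments, {"person": ["man"]})
print(top)
```

In this toy run the expanded variant "man opens door" matches `seg0` fully, whereas the original query would only match two of its three words, which is the effect expansion is meant to have.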

Tackling Model Bias via Game-theoretic Multi-agent Collaboration Framework for Hateful Meme Classification

GECO organizes three Large Multimodal Models (LMMs), one learnable agent, and one primary decision agent into a regularized game. Driven by a "hybrid reward" system to achieve consensus on correct labels, it suppresses both individual and inter-model cognitive biases, achieving new SOTA results on five hateful meme benchmarks.

Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling

MACT decomposes "monolithic single-model" visual document QA into four agents with distinct roles: planning, execution, judging, and answering. It adaptively allocates test-time compute according to the cognitive load of each agent rather than uniformly increasing parameters. On 15 benchmarks, it consistently ranks in the top three with fewer than 30B parameters, achieving an average improvement of 9.9–11.5% over the base models.
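One simple way to picture agent-wise allocation of test-time compute is to split a fixed sampling budget in proportion to a per-agent load weight. This is only an illustrative sketch: the proportional rule, the `allocate_budget` function, and the load weights are assumptions, and the summary does not say how MACT actually measures load or distributes compute.

```python
def allocate_budget(total_samples: int, load: dict[str, float]) -> dict[str, int]:
    """Split an integer budget proportionally to load (largest-remainder rule)."""
    total_load = sum(load.values())
    raw = {agent: total_samples * weight / total_load for agent, weight in load.items()}
    # Start from the rounded-down shares.
    allocation = {agent: int(share) for agent, share in raw.items()}
    leftover = total_samples - sum(allocation.values())
    # Hand remaining samples to the agents with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda agent: raw[agent] - allocation[agent], reverse=True)
    for agent in by_remainder[:leftover]:
        allocation[agent] += 1
    return allocation


# Hypothetical load weights for the four roles named in the summary.
load = {"planning": 3, "execution": 1, "judging": 2, "answering": 1}
budget = allocate_budget(16, load)
print(budget)
```

Uniform scaling would give each agent 4 of the 16 samples; the weighted split concentrates the budget on the roles assumed to be harder.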