Guideline-Consistent Segmentation via Multi-Agent Refinement¶
Conference: AAAI 2026 | arXiv: 2509.04687 | Code: Project Page | Area: Segmentation
Keywords: guideline-consistent segmentation, multi-agent, VLM, reinforcement learning, training-free framework
TL;DR¶
A training-free multi-agent framework is proposed that achieves guideline-consistent semantic segmentation through an iterative Worker (segmentation execution) and Supervisor (guideline verification) loop coupled with an RL-based adaptive stopping strategy, surpassing prior SOTA by 8.61 and 5.5 gIoU on Waymo and ReasonSeg, respectively.
Background & Motivation¶
Semantic segmentation requires not only precise pixel-level masks but also strict adherence to text-defined annotation guidelines. However:
Guidelines are complex and lengthy: Waymo's pedestrian guidelines contain more than ten rules—"a person riding a scooter counts as a pedestrian," "exclude mannequins/statues/reflections," "a pedestrian holding an object smaller than 2 m counts as a single label," etc.
Existing methods fail on long-form text: Open-vocabulary segmentation methods such as LISA and GroundedSAM perform well under phrase-level prompts but degrade sharply when faced with paragraph-level guidelines (LISA-7B drops from 43.42 to 23.41 gIoU).
Human annotations are also inconsistent: Waymo's official ground truth itself contains violations of its own guidelines (inconsistent annotation across different timestamps of the same scene).
Retraining is impractical: Annotation rules may be updated, making task-specific retraining infeasible.
Core Goal: Design a training-free framework capable of handling complex paragraph-level guidelines and ensuring guideline consistency through iterative verification.
Method¶
Overall Architecture¶
Pipeline: Input image + text guidelines → Context construction (filtering relevant rules) → Worker–Supervisor iterative loop (segment → verify → correct) → Adaptive stopping controller decides termination → Output guideline-consistent segmentation.
Core components:
- Context Construction: Enricher (scene description) + Retriever (guideline retrieval) + Smart Crop
- Worker: VLM-driven detection + SAM segmentation
- Supervisor: dual-agent evaluation (Agent 1: evaluation / Agent 2: box generation) + SigLIP verification
- AiRC: RL-driven adaptive iterative refinement controller
Key Designs¶
Context Construction¶
Different images require only a relevant subset of rules; feeding all rules causes information overload:
- Enricher: Uses the lightweight multimodal model Gemma3-4B to generate image captions, constructing a query \(Q = \{P, \text{<caption>}, H \times W\}\) that combines the prompt and image resolution.
- Retriever: Encodes the query as a vector using SentenceTransformer and retrieves the top-\(k=8\) most relevant guidelines from a FAISS index.
- Smart Crop: Splits the image into two regions based on object distribution, avoiding object bisection while balancing the number of objects on each side. A 0.8× downsampled image is first passed through OWLv2 to obtain coarse bounding boxes, after which the optimal split line is determined from the spatial distribution.
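As a rough sketch, the Retriever step amounts to embedding the query and running top-\(k\) nearest-neighbor search over rule embeddings. The snippet below substitutes plain NumPy cosine similarity for SentenceTransformer + FAISS, with toy 2-D embeddings; all names and vectors here are illustrative, not the paper's code:

```python
import numpy as np

def retrieve_guidelines(query_vec, rule_vecs, rules, k=8):
    """Return the k rules whose embeddings best match the query.

    Stand-in for the paper's SentenceTransformer + FAISS retriever:
    embeddings are assumed to be pre-computed row vectors.
    """
    q = query_vec / np.linalg.norm(query_vec)
    R = rule_vecs / np.linalg.norm(rule_vecs, axis=1, keepdims=True)
    sims = R @ q                    # cosine similarity per rule
    top = np.argsort(-sims)[:k]     # indices of the k best matches
    return [rules[i] for i in top]

# Toy demo: the query vector lies closest to rule 0, then rule 2.
rules = ["pedestrian on scooter", "exclude mannequins", "held objects < 2 m"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.4]])
print(retrieve_guidelines(np.array([1.0, 0.1]), vecs, rules, k=2))
# → ['pedestrian on scooter', 'held objects < 2 m']
```

In the paper the query additionally packs the prompt, caption, and image resolution into a single string before embedding; here only the similarity ranking is shown.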
Worker–Supervisor Iterative Loop¶
Worker:
- Initialization: A VLM (Gemini-2.5-flash) detects bounding boxes for target categories, assigns unique IDs, and passes them to the frozen SAM2.1 to generate masks.
- Iteration: Incorporates Supervisor feedback to correct outputs: adding missed objects, removing false detections, and adjusting imprecise boxes.
Supervisor (dual-agent design):
- Agent 1 (Supervisor_eval): Given the image, the Worker's output, and the relevant guidelines, identifies three types of issues: (i) missed objects, (ii) false positives violating exclusion rules, and (iii) masks requiring fine-grained adjustment; outputs structured JSON.
- Agent 2 (Supervisor_boxgen): Generates candidate bounding boxes based on Agent 1's critique.
- SigLIP Validator: Crops candidate regions (with a contextual buffer) and verifies them via SigLIP image–text matching; a candidate is accepted if the sigmoid probability is \(\geq 0.5\).
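The control flow above can be summarized as a minimal loop skeleton, with the agents stubbed as plain callables. All names and interfaces here are assumptions for illustration; in the paper the agents are Gemini-2.5-flash prompts and the segmenter is SAM2.1:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def refine(image, guidelines, worker, supervisor_eval, supervisor_boxgen,
           siglip_logit, max_iters=4):
    """Worker–Supervisor loop skeleton (illustrative interfaces)."""
    masks = worker(image, guidelines, feedback=None)        # initial detect + segment
    for _ in range(max_iters):
        issues = supervisor_eval(image, masks, guidelines)  # structured critique
        if not issues:
            break                                           # nothing left to fix
        candidates = supervisor_boxgen(image, issues)       # boxes addressing the critique
        # SigLIP-style gate: keep a candidate only if sigmoid(logit) >= 0.5
        accepted = [b for b in candidates if sigmoid(siglip_logit(image, b)) >= 0.5]
        masks = worker(image, guidelines, feedback=(issues, accepted))
    return masks

# Toy run: the supervisor flags one missed object on the first pass only.
state = {"checked": False}
def worker(img, g, feedback):
    return {"masks": 2 if feedback else 1}
def sup_eval(img, masks, g):
    if state["checked"]:
        return []
    state["checked"] = True
    return ["missed object"]
def sup_box(img, issues):
    return [(10, 10, 40, 40)]
out = refine("img", ["rule"], worker, sup_eval, sup_box, lambda i, b: 2.0)
print(out)  # {'masks': 2}
```

The real Worker consumes the accepted boxes to add, remove, or adjust masks; here the stub merely records that feedback arrived.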
Adaptive Iterative Refinement Controller (AiRC)¶
Iterative control is formulated as a finite-horizon MDP solved with tabular Q-learning:
- State space: 6 states \(s = 2d + v\), where \(d \in \{0,1,2\}\) (scene density bucket) and \(v \in \{0,1\}\) (whether unresolved violations exist).
- Action space: \(\{\text{STOP},\ \text{CONTINUE}\}\)
- Issue count: \(I_t = I_{\text{miss}} + I_{\text{false}} + 0.1 \cdot I_{\text{ref}}\) (genuine errors count as 1; minor refinement suggestions count as 0.1).
- Reward design: step cost \(c = 0.02\), early-stop penalty \(p = 2.0\), clean-scene bonus \(b = 1.0\).
- Q-table is persisted across runs; \(\epsilon\)-greedy exploration with \(\epsilon = 0.02\).
- Constraints: MIN_ITERS = 2, MAX_ITERS = 4.
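Under these definitions, the controller fits in a few lines. The constants (step cost, penalties, \(\epsilon\), iteration bounds, issue weights) come from the paper; the exact reward composition, learning rate, and discount factor below are my assumptions for the sketch:

```python
import random

# Constants stated in the paper's AiRC description.
STEP_COST, EARLY_STOP_PEN, CLEAN_BONUS = 0.02, 2.0, 1.0
MIN_ITERS, MAX_ITERS, EPS = 2, 4, 0.02
STOP, CONTINUE = 0, 1

def state_id(density, violations_left):
    """Encode s = 2d + v, with d in {0,1,2} and v in {0,1} -> 6 states."""
    return 2 * density + (1 if violations_left else 0)

def issue_count(miss, false_pos, refinements):
    """I_t = I_miss + I_false + 0.1 * I_ref."""
    return miss + false_pos + 0.1 * refinements

def choose_action(Q, s, t):
    if t < MIN_ITERS:
        return CONTINUE          # always run at least MIN_ITERS iterations
    if t >= MAX_ITERS:
        return STOP              # hard cap on refinement rounds
    if random.random() < EPS:    # epsilon-greedy exploration
        return random.choice([STOP, CONTINUE])
    return STOP if Q[s][STOP] >= Q[s][CONTINUE] else CONTINUE

def reward(action, violations_left):
    r = -STEP_COST               # every step costs a little
    if action == STOP:
        r += CLEAN_BONUS if not violations_left else -EARLY_STOP_PEN
    return r

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update (alpha, gamma assumed)."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
```

Because the Q-table has only 6 states and 2 actions, persisting it across runs (as the paper does) is trivial, and the table converges quickly despite the tiny exploration rate.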
Loss & Training¶
Training-free framework: Both the VLM (Gemini-2.5-flash) and SAM2.1 are frozen; all adaptation is achieved through strategic prompting and context control.
- Worker temperature \(T = 0.5\) (detection flexibility); Supervisor_eval temperature \(T = 0.3\) (deterministic reasoning).
- Average of 2.6 iterations per sample; per-sample cost approximately $0.0088 (Gemini-2.5-flash API).
- To account for VLM non-determinism, each experiment is run 3 times with different random seeds and mean ± standard deviation is reported.
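The seed-averaging protocol is the standard mean ± sample standard deviation; a minimal sketch with made-up numbers (not the paper's raw runs):

```python
import statistics

# Aggregating one metric over 3 seeded runs (illustrative values).
runs = [80.1, 80.9, 80.7]
mean = statistics.mean(runs)
std = statistics.stdev(runs)   # sample (n-1) standard deviation
print(f"{mean:.2f} ± {std:.2f}")  # → 80.57 ± 0.42
```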
Key Experimental Results¶
Main Results¶
Waymo Guideline-Consistent Dataset (101 manually verified samples, full-length guidelines):
| Method | gIoU | cIoU | mPr | mRec | mDice |
|---|---|---|---|---|---|
| LISA-7B | 23.41 | 18.32 | 24.38 | 79.79 | 33.17 |
| GroundedSAM | 20.43 | 19.36 | 29.35 | 28.11 | 25.63 |
| Gemini-2.5 | 69.02 | 74.24 | 80.91 | 79.89 | 78.75 |
| SegZero | 71.96 | 73.72 | 86.34 | 78.45 | 80.88 |
| Ours | 80.57 | 86.70 | 91.06 | 84.78 | 87.20 |
ReasonSeg val dataset:
| Method | gIoU | cIoU |
|---|---|---|
| Gemini-2.5 | 55.5 | 44.9 |
| SegZero | 62.6 | 62.0 |
| Ours | 68.1 | 66.4 |
Key observation: As guideline length increases (word → phrase → full text), competing methods suffer substantial performance degradation, whereas Gemini-2.5 and the proposed method actually benefit from complete guidelines.
Ablation Study¶
Component ablation (Waymo):
| Configuration | gIoU | cIoU | mPr | mRec |
|---|---|---|---|---|
| Worker only | 69.02 | 74.24 | 80.91 | 79.89 |
| + Context Construction | 73.87 | 78.40 | 88.36 | 80.81 |
| + Context + Supervisor (Full) | 80.57 | 86.70 | 91.06 | 84.78 |
AiRC effect: Dynamic stopping resolves 110% more violations per crop than a fixed 2-iteration schedule (0.61 vs. 0.29), while triggering additional iterations on only 48% of crops.
Key Findings¶
- Methods such as LISA collapse to gIoU values in the low 20s under full-length guidelines, as they cannot process complex rules embedded in long contexts.
- High recall (84.78) indicates successful recovery of missed objects (e.g., umbrellas, backpacks); high precision (91.06) indicates effective removal of false positives (e.g., cyclists, reflections).
- The adaptive stopping controller benefits dense scenes most, as violations are most likely to occur in crowded scenarios.
Highlights & Insights¶
- Novel problem formulation: The paper is the first to explicitly define "guideline-consistent segmentation," elevating annotation rule adherence to a first-order task objective, and reveals inconsistencies in existing ground-truth annotations.
- RL-based stopping strategy: A compact 6-state Q-table realizes adaptive iterative control that is both more efficient and more effective than fixed-iteration alternatives.
- Dual-agent division of labor: Separating evaluation and box generation into distinct agents proves more effective than assigning all responsibilities to a single agent.
- Scalability: Both the VLM and SAM are frozen; guideline updates require no retraining—only modification of the text input.
Limitations & Future Work¶
- Reliance on a closed-source VLM (Gemini-2.5 API) limits local deployment and reproducibility.
- Quantitative constraint interpretation is difficult: the VLM struggles to precisely judge "whether an object exceeds 2 m," which may cause misclassification.
- Bounding boxes serve as the primary SAM input, limiting fine-grained segmentation (e.g., fine structures such as eyelashes).
- Accuracy is prioritized over latency; multiple API calls make the framework unsuitable for real-time applications.
- The method supports only semantic segmentation and has not been extended to instance segmentation.
Related Work & Insights¶
- Multi-agent collaboration paradigm: Inherits ideas from multi-agent frameworks such as MetaGPT and AutoGen—multiple specialized agents collaborating outperforms a single generalist agent.
- VLM segmentation progress: From single-pass inference in LISA to iterative refinement in this work, representing a paradigm shift for VLMs in segmentation from "answering" to "verifying."
- Self-correction mechanism: The Worker–Supervisor loop can be viewed as self-refinement for visual tasks, augmented with an explicit rule-verification step.
- Inspiration: The guideline-consistency paradigm is extensible to domains requiring strict rule adherence, such as medical image annotation and autonomous driving perception annotation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Novel problem formulation; unique training-free solution combining multi-agent collaboration and RL.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Validated on Waymo and ReasonSeg with a manually curated high-quality test set and sufficient ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with rich visualizations, though some content is redundant.
- Value: ⭐⭐⭐⭐ — Guideline consistency addresses a genuine pain point in practical annotation pipelines, offering high applied value.