Guideline-Consistent Segmentation via Multi-Agent Refinement¶
Conference: AAAI 2026 | arXiv: 2509.04687 | Code: Project Page | Area: Segmentation
Keywords: guideline-consistent segmentation, multi-agent, VLM, reinforcement learning, training-free framework
TL;DR¶
A training-free multi-agent framework is proposed that achieves guideline-consistent semantic segmentation through an iterative Worker (segmentation execution) and Supervisor (guideline verification) loop coupled with an RL-based adaptive stopping strategy, surpassing prior SOTA by 8.61 and 5.5 gIoU on Waymo and ReasonSeg, respectively.
Background & Motivation¶
Semantic segmentation requires not only precise pixel-level masks but also strict adherence to text-defined annotation guidelines. However:
Guidelines are complex and lengthy: Waymo's pedestrian guidelines contain more than ten rules—"a person riding a scooter counts as a pedestrian," "exclude mannequins/statues/reflections," "a pedestrian holding an object smaller than 2 m counts as a single label," etc.
Existing methods fail on long-form text: Open-vocabulary segmentation methods such as LISA and GroundedSAM perform well under phrase-level prompts but degrade sharply when faced with paragraph-level guidelines (LISA-7B drops from 43.42 to 23.41 gIoU).
Human annotations are also inconsistent: Waymo's official ground truth itself contains violations of its own guidelines (inconsistent annotation across different timestamps of the same scene).
Retraining is impractical: Annotation rules may be updated, making task-specific retraining infeasible.
Core Goal: Design a training-free framework capable of handling complex paragraph-level guidelines and ensuring guideline consistency through iterative verification.
Method¶
Overall Architecture¶
Pipeline: Input image + text guidelines → Context construction (filtering relevant rules) → Worker–Supervisor iterative loop (segment → verify → correct) → Adaptive stopping controller decides termination → Output guideline-consistent segmentation.
Core components:
- Context Construction: Enricher (scene description) + Retriever (guideline retrieval) + Smart Crop
- Worker: VLM-driven detection + SAM segmentation
- Supervisor: dual-agent evaluation (Agent 1: evaluation / Agent 2: box generation) + SigLIP verification
- AiRC: RL-driven adaptive iterative refinement controller
Key Designs¶
Context Construction¶
Different images require only a relevant subset of rules; feeding all rules causes information overload:
- Enricher: Uses the lightweight multimodal model Gemma3-4B to generate image captions, constructing a query \(Q = \{P, \text{<caption>}, H \times W\}\) that combines the prompt and image resolution.
- Retriever: Encodes the query as a vector using SentenceTransformer and retrieves the top-\(k=8\) most relevant guidelines from a FAISS index.
- Smart Crop: Splits the image into two regions based on object distribution, avoiding object bisection while balancing the number of objects on each side. A 0.8× downsampled image is first passed through OWLv2 to obtain coarse bounding boxes, after which the optimal split line is determined from the spatial distribution.
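As a rough sketch, the Retriever step amounts to embedding the query and running top-\(k\) nearest-neighbor search over rule embeddings. The snippet below substitutes plain NumPy cosine similarity for SentenceTransformer + FAISS, with toy 2-D embeddings; all names and vectors here are illustrative, not the paper's code:

```python
import numpy as np

def retrieve_guidelines(query_vec, rule_vecs, rules, k=8):
    """Return the k rules whose embeddings best match the query.

    Stand-in for the paper's SentenceTransformer + FAISS retriever:
    embeddings are assumed to be pre-computed row vectors.
    """
    q = query_vec / np.linalg.norm(query_vec)
    R = rule_vecs / np.linalg.norm(rule_vecs, axis=1, keepdims=True)
    sims = R @ q                    # cosine similarity per rule
    top = np.argsort(-sims)[:k]     # indices of the k best matches
    return [rules[i] for i in top]

# Toy demo: the query vector lies closest to rule 0, then rule 2.
rules = ["pedestrian on scooter", "exclude mannequins", "held objects < 2 m"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.4]])
print(retrieve_guidelines(np.array([1.0, 0.1]), vecs, rules, k=2))
# → ['pedestrian on scooter', 'held objects < 2 m']
```

In the paper the query additionally packs the prompt, caption, and image resolution into a single string before embedding; here only the similarity ranking is shown.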
Worker–Supervisor Iterative Loop¶
Worker:
- Initialization: A VLM (Gemini-2.5-flash) detects bounding boxes for target categories, assigns unique IDs, and passes them to the frozen SAM2.1 to generate masks.
- Iteration: Incorporates Supervisor feedback to correct outputs: adding missed objects, removing false detections, and adjusting imprecise boxes.
Supervisor (dual-agent design):
- Agent 1 (Supervisor_eval): Given the image, the Worker's output, and the relevant guidelines, identifies three types of issues: (i) missed objects, (ii) false positives violating exclusion rules, and (iii) masks requiring fine-grained adjustment; outputs structured JSON.
- Agent 2 (Supervisor_boxgen): Generates candidate bounding boxes based on Agent 1's critique.
- SigLIP Validator: Crops candidate regions (with a contextual buffer) and verifies them via SigLIP image–text matching; a candidate is accepted if the sigmoid probability is \(\geq 0.5\).
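The control flow above can be summarized as a minimal loop skeleton, with the agents stubbed as plain callables. All names and interfaces here are assumptions for illustration; in the paper the agents are Gemini-2.5-flash prompts and the segmenter is SAM2.1:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def refine(image, guidelines, worker, supervisor_eval, supervisor_boxgen,
           siglip_logit, max_iters=4):
    """Worker–Supervisor loop skeleton (illustrative interfaces)."""
    masks = worker(image, guidelines, feedback=None)        # initial detect + segment
    for _ in range(max_iters):
        issues = supervisor_eval(image, masks, guidelines)  # structured critique
        if not issues:
            break                                           # nothing left to fix
        candidates = supervisor_boxgen(image, issues)       # boxes addressing the critique
        # SigLIP-style gate: keep a candidate only if sigmoid(logit) >= 0.5
        accepted = [b for b in candidates if sigmoid(siglip_logit(image, b)) >= 0.5]
        masks = worker(image, guidelines, feedback=(issues, accepted))
    return masks

# Toy run: the supervisor flags one missed object on the first pass only.
state = {"checked": False}
def worker(img, g, feedback):
    return {"masks": 2 if feedback else 1}
def sup_eval(img, masks, g):
    if state["checked"]:
        return []
    state["checked"] = True
    return ["missed object"]
def sup_box(img, issues):
    return [(10, 10, 40, 40)]
out = refine("img", ["rule"], worker, sup_eval, sup_box, lambda i, b: 2.0)
print(out)  # {'masks': 2}
```

The real Worker consumes the accepted boxes to add, remove, or adjust masks; here the stub merely records that feedback arrived.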
Adaptive Iterative Refinement Controller (AiRC)¶
Iterative control is formulated as a finite-horizon MDP solved with tabular Q-learning:
- State space: 6 states \(s = 2d + v\), where \(d \in \{0,1,2\}\) (scene density bucket) and \(v \in \{0,1\}\) (whether unresolved violations exist).
- Action space: \(\{\text{STOP},\ \text{CONTINUE}\}\)
- Issue count: \(I_t = I_{\text{miss}} + I_{\text{false}} + 0.1 \cdot I_{\text{ref}}\) (genuine errors count as 1; minor refinement suggestions count as 0.1).
- Reward design: step cost \(c = 0.02\), early-stop penalty \(p = 2.0\), clean-scene bonus \(b = 1.0\).
- Q-table is persisted across runs; \(\epsilon\)-greedy exploration with \(\epsilon = 0.02\).
- Constraints: MIN_ITERS = 2, MAX_ITERS = 4.
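Under these definitions, the controller fits in a few lines. The constants (step cost, penalties, \(\epsilon\), iteration bounds, issue weights) come from the paper; the exact reward composition, learning rate, and discount factor below are my assumptions for the sketch:

```python
import random

# Constants stated in the paper's AiRC description.
STEP_COST, EARLY_STOP_PEN, CLEAN_BONUS = 0.02, 2.0, 1.0
MIN_ITERS, MAX_ITERS, EPS = 2, 4, 0.02
STOP, CONTINUE = 0, 1

def state_id(density, violations_left):
    """Encode s = 2d + v, with d in {0,1,2} and v in {0,1} -> 6 states."""
    return 2 * density + (1 if violations_left else 0)

def issue_count(miss, false_pos, refinements):
    """I_t = I_miss + I_false + 0.1 * I_ref."""
    return miss + false_pos + 0.1 * refinements

def choose_action(Q, s, t):
    if t < MIN_ITERS:
        return CONTINUE          # always run at least MIN_ITERS iterations
    if t >= MAX_ITERS:
        return STOP              # hard cap on refinement rounds
    if random.random() < EPS:    # epsilon-greedy exploration
        return random.choice([STOP, CONTINUE])
    return STOP if Q[s][STOP] >= Q[s][CONTINUE] else CONTINUE

def reward(action, violations_left):
    r = -STEP_COST               # every step costs a little
    if action == STOP:
        r += CLEAN_BONUS if not violations_left else -EARLY_STOP_PEN
    return r

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update (alpha, gamma assumed)."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
```

Because the Q-table has only 6 states and 2 actions, persisting it across runs (as the paper does) is trivial, and the table converges quickly despite the tiny exploration rate.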
Loss & Training¶
Training-free framework: Both the VLM (Gemini-2.5-flash) and SAM2.1 are frozen; all adaptation is achieved through strategic prompting and context control.
- Worker temperature \(T = 0.5\) (detection flexibility); Supervisor_eval temperature \(T = 0.3\) (deterministic reasoning).
- Average of 2.6 iterations per sample; per-sample cost approximately $0.0088 (Gemini-2.5-flash API).
- To account for VLM non-determinism, each experiment is run 3 times with different random seeds and mean ± standard deviation is reported.
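The seed-averaging protocol is the standard mean ± sample standard deviation; a minimal sketch with made-up numbers (not the paper's raw runs):

```python
import statistics

# Aggregating one metric over 3 seeded runs (illustrative values).
runs = [80.1, 80.9, 80.7]
mean = statistics.mean(runs)
std = statistics.stdev(runs)   # sample (n-1) standard deviation
print(f"{mean:.2f} ± {std:.2f}")  # → 80.57 ± 0.42
```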
Key Experimental Results¶
Main Results¶
Waymo Guideline-Consistent Dataset (101 manually verified samples, full-length guidelines):
| Method | gIoU | cIoU | mPr | mRec | mDice |
|---|---|---|---|---|---|
| LISA-7B | 23.41 | 18.32 | 24.38 | 79.79 | 33.17 |
| GroundedSAM | 20.43 | 19.36 | 29.35 | 28.11 | 25.63 |
| Gemini-2.5 | 69.02 | 74.24 | 80.91 | 79.89 | 78.75 |
| SegZero | 71.96 | 73.72 | 86.34 | 78.45 | 80.88 |
| Ours | 80.57 | 86.70 | 91.06 | 84.78 | 87.20 |
ReasonSeg val dataset:
| Method | gIoU | cIoU |
|---|---|---|
| Gemini-2.5 | 55.5 | 44.9 |
| SegZero | 62.6 | 62.0 |
| Ours | 68.1 | 66.4 |
Key observation: As guideline length increases (word → phrase → full text), competing methods suffer substantial performance degradation, whereas Gemini-2.5 and the proposed method actually benefit from complete guidelines.
Ablation Study¶
Component ablation (Waymo):
| Configuration | gIoU | cIoU | mPr | mRec |
|---|---|---|---|---|
| Worker only | 69.02 | 74.24 | 80.91 | 79.89 |
| + Context Construction | 73.87 | 78.40 | 88.36 | 80.81 |
| + Context + Supervisor (Full) | 80.57 | 86.70 | 91.06 | 84.78 |
AiRC effect: Dynamic stopping resolves 110% more violations per crop than a fixed 2-iteration schedule (0.61 vs. 0.29), while triggering additional iterations on only 48% of crops.
Key Findings¶
- Methods such as LISA collapse to gIoU values in the low 20s under full-length guidelines, as they cannot process complex rules embedded in long contexts.
- High recall (84.78) indicates successful recovery of missed objects (e.g., umbrellas, backpacks); high precision (91.06) indicates effective removal of false positives (e.g., cyclists, reflections).
- The adaptive stopping controller benefits dense scenes most, as violations are most likely to occur in crowded scenarios.
Highlights & Insights¶
- Novel problem formulation: The paper is the first to explicitly define "guideline-consistent segmentation," elevating annotation rule adherence to a first-order task objective, and reveals inconsistencies in existing ground-truth annotations.
- RL-based stopping strategy: A compact 6-state Q-table realizes adaptive iterative control that is both more efficient and more effective than fixed-iteration alternatives.
- Dual-agent division of labor: Separating evaluation and box generation into distinct agents proves more effective than assigning all responsibilities to a single agent.
- Scalability: Both the VLM and SAM are frozen; guideline updates require no retraining—only modification of the text input.
Limitations & Future Work¶
- Reliance on a closed-source VLM (Gemini-2.5 API) limits local deployment and reproducibility.
- Quantitative constraint interpretation is difficult: the VLM struggles to precisely judge "whether an object exceeds 2 m," which may cause misclassification.
- Bounding boxes serve as the primary SAM input, limiting fine-grained segmentation (e.g., fine structures such as eyelashes).
- Accuracy is prioritized over latency; multiple API calls make the framework unsuitable for real-time applications.
- The method supports only semantic segmentation and has not been extended to instance segmentation.
Related Work & Insights¶
- Multi-agent collaboration paradigm: Inherits ideas from multi-agent frameworks such as MetaGPT and AutoGen—multiple specialized agents collaborating outperforms a single generalist agent.
- VLM segmentation progress: From single-pass inference in LISA to iterative refinement in this work, representing a paradigm shift for VLMs in segmentation from "answering" to "verifying."
- Self-correction mechanism: The Worker–Supervisor loop can be viewed as self-refinement for visual tasks, augmented with an explicit rule-verification step.
- Inspiration: The guideline-consistency paradigm is extensible to domains requiring strict rule adherence, such as medical image annotation and autonomous driving perception annotation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Novel problem formulation; unique training-free solution combining multi-agent collaboration and RL.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Validated on Waymo and ReasonSeg with a manually curated high-quality test set and sufficient ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with rich visualizations, though some content is redundant.
- Value: ⭐⭐⭐⭐ — Guideline consistency addresses a genuine pain point in practical annotation pipelines, offering high applied value.