ESCA: Contextualizing Embodied Agents via Scene-Graph Generation¶
Conference: NeurIPS 2025 (Spotlight)
arXiv: 2510.15963
Code: SGClip / ESCA
Area: Graph Learning / Embodied AI
Keywords: Scene graph generation, embodied agents, CLIP, vision-language models, neuro-symbolic learning
TL;DR¶
This paper proposes ESCA, a framework that supplies MLLM-driven embodied agents with structured visual context via open-vocabulary scene graph generation (the SGClip model), substantially reducing perception errors and improving task success rates.
Background & Motivation¶
Multimodal large language models (MLLMs) have seen rapid adoption in embodied agents, yet they exhibit fundamental deficiencies in the following areas:
Insufficient fine-grained visual-semantic grounding: MLLMs struggle to reliably establish connections between low-level visual features and high-level textual semantics, resulting in weak spatial and temporal visual localization.
Perception errors as the primary failure cause: Empirical analysis shows that up to 69% of agent failures stem from perception errors, such as object hallucination, entity misidentification, and incorrect spatial relationships.
Limitations of existing visual augmentation modules: Object detection models such as Grounding DINO and YOLO focus primarily on object recognition while neglecting semantic attributes, inter-object relationships, and temporal consistency.
Method¶
Overall Architecture¶
ESCA (Embodied and Scene-Graph Contextualized Agent) provides context to MLLMs through four modular stages (a minimal code sketch follows the list):
- Selective Concept Extraction: The MLLM extracts structured concepts—including entity categories (e.g., car, knife), attributes (e.g., red, small), and relations (e.g., behind, cutting)—based on instructions and interaction history.
- Object Identification: A Grounding DINO + SAM2 pipeline grounds extracted concepts to specific image regions, producing precise segmentation masks.
- Scene Graph Prediction: The SGClip model generates a probabilistic scene graph containing unary facts (object attributes) and binary facts (inter-object relations).
- Visual Summarization and Verification: The scene graph is converted into natural language descriptions, and consistency between visual feedback and the scene graph is verified.
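To make the control flow concrete, here is a minimal Python sketch of one ESCA step. Every name in it (`extract_concepts`, `ground_concepts`, `predict_scene_graph`, `summarize_and_verify`, `plan_action`) is a hypothetical placeholder for the corresponding component described above, not a real ESCA API.

```python
# Minimal sketch of the four-stage ESCA loop. All callables are
# hypothetical placeholders for the paper's components (the MLLM,
# the Grounding DINO + SAM2 grounding pipeline, and SGClip).
from typing import Any, Callable

def esca_step(
    extract_concepts: Callable,      # stage 1: MLLM selective concept extraction
    ground_concepts: Callable,       # stage 2: Grounding DINO + SAM2 grounding
    predict_scene_graph: Callable,   # stage 3: SGClip probabilistic scene graph
    summarize_and_verify: Callable,  # stage 4: NL summary + consistency check
    plan_action: Callable,           # the MLLM planner consuming the summary
    image: Any,
    instruction: str,
    history: list,
) -> Any:
    concepts = extract_concepts(instruction, history)            # stage 1
    regions = ground_concepts(image, concepts)                   # stage 2
    scene_graph = predict_scene_graph(image, regions, concepts)  # stage 3
    summary = summarize_and_verify(scene_graph, image)           # stage 4
    return plan_action(instruction, history, summary)
```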
Key Designs¶
SGClip model architecture: a CLIP-based scene graph generation model supporting three inference modes (approximated in the sketch below):
- Entity category inference: applies softmax normalization over candidate categories.
- Attribute inference: constructs attribute–negation pairs (e.g., "red" vs. "not red") for binary contrastive probability estimation.
- Binary relation inference: colors the subject and object regions to mark their roles, and augments the relation phrase with entity categories (e.g., "(robot, cutting, cabbage)").
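For intuition, the three modes can be approximated with a stock CLIP checkpoint. The sketch below uses HuggingFace's `openai/clip-vit-base-patch32` as a stand-in for the fine-tuned SGClip weights; the prompts and dummy crops are illustrative, not the paper's exact templates.

```python
# SGClip-style inference sketch with a stock CLIP checkpoint standing
# in for the fine-tuned SGClip weights. Prompts are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_probs(image: Image.Image, prompts: list) -> torch.Tensor:
    """Softmax-normalized image-text similarity over candidate prompts."""
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, len(prompts))
    return logits.softmax(dim=-1).squeeze(0)

region_crop = Image.new("RGB", (224, 224), "gray")  # placeholder object crop
pair_crop = Image.new("RGB", (224, 224), "gray")    # placeholder pair crop

# 1) Entity categories: softmax over candidate category prompts.
category_probs = text_probs(region_crop,
                            ["a photo of a car", "a photo of a knife"])

# 2) Attributes: binary contrast between an attribute and its negation.
p_red, p_not_red = text_probs(region_crop,
                              ["a red object", "a not red object"])

# 3) Relations: subject/object roles would be marked by color overlays in
#    the crop (omitted here), and the phrase embeds the entity categories.
relation_probs = text_probs(pair_crop, ["a robot cutting a cabbage",
                                        "a robot holding a cabbage"])
```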
ESCA-Video-87K dataset: constructed from LLaVA-Video-178K, containing 87K video data points, each represented as a quintuple \((\bar{I}, L_{cap}, \Sigma, \bar{c}, \phi)\) comprising the video, its caption, object trajectories, the concept set, and a spatiotemporal procedural specification.
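The quintuple maps naturally onto a small record type; below is a hypothetical sketch where the field names are mine, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class EscaVideoSample:
    """One ESCA-Video-87K data point (hypothetical field names)."""
    frames: list          # \bar{I}: the sampled video frames
    caption: str          # L_cap: the video caption
    trajectories: dict    # \Sigma: per-object trajectories (track id -> boxes)
    concepts: list        # \bar{c}: the extracted concept set
    specification: str    # \phi: spatiotemporal procedural specification
```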
Transfer Protocol: Downstream adaptation to different embodied environments is achieved via two customized prompt templates—a concept extraction prompt and a visual summarization prompt—without retraining the core system.
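For intuition, the two templates might look roughly like the strings below; these are illustrative paraphrases, not the paper's actual prompt wording.

```python
# Illustrative transfer-protocol templates (paraphrased, not the
# paper's actual prompts).
CONCEPT_EXTRACTION_PROMPT = """\
Task instruction: {instruction}
Interaction history: {history}
List only the entity categories, attributes, and relations relevant
to completing this task, as JSON with keys "categories",
"attributes", and "relations"."""

VISUAL_SUMMARIZATION_PROMPT = """\
Scene graph facts with confidence scores: {facts}
Summarize the current scene in natural language and flag any fact
that conflicts with the latest visual observation."""

# Adapting to a new environment means filling these templates, not retraining.
prompt = CONCEPT_EXTRACTION_PROMPT.format(
    instruction="put the red mug in the cabinet",
    history="[]",
)
```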
Loss & Training¶
SGClip is trained with a neuro-symbolic learning pipeline comprising three losses (see the sketch after this list):
- Contrastive loss: distinguishes matched from unmatched video–specification pairs, using a chunked event training strategy (at most 3 events per chunk).
- Temporal loss: improves the precision of event-to-video-segment temporal alignment.
- Semantic loss: leverages commonsense negation knowledge (e.g., a bed is unlikely in an outdoor scene) by sampling semantically distant words from the top 5,000 high-frequency keywords as negative examples.
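As rough intuition for the contrastive term, here is an InfoNCE-style sketch over batched video and specification embeddings. The actual pipeline scores video–specification alignment through differentiable symbolic execution (via Scallop) rather than plain embedding similarity, so treat this as a simplified stand-in; the temporal and semantic terms would be added on top.

```python
# InfoNCE-style stand-in for the contrastive term. The paper's actual
# objective scores video-specification alignment via differentiable
# symbolic execution, not raw embedding similarity.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor,
                     spec_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """The i-th video should match the i-th specification in the batch;
    all other pairings in the batch act as negatives."""
    video_emb = F.normalize(video_emb, dim=-1)
    spec_emb = F.normalize(spec_emb, dim=-1)
    logits = video_emb @ spec_emb.T / temperature  # (B, B) pairwise scores
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)

# Example: a batch of 4 paired embeddings.
loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
```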
Training configuration: learning rate \(1 \times 10^{-6}\), batch size 2, 1 FPS sampling, trained for 3 epochs on 10 H100 GPUs (~10 days).
Key Experimental Results¶
Main Results¶
EB-Navigation environment (success rate %):
| Model | Base | + Grounding DINO | + ESCA |
|---|---|---|---|
| InternVL-2.5 | 47.33 | 47.67 | 51.66 |
| Gemini-2.0 | 40.68 | 40.53 | 42.00 |
| Qwen2.5 | 44.99 | 48.27 | 49.33 |
| GPT-4o | 51.33 | 53.33 | 54.67 |
EB-Manipulation environment (success rate %):
| Model | Base | + YOLO | + ESCA |
|---|---|---|---|
| InternVL-2.5 | 19.31 | 19.30 | 24.30 |
| GPT-4o | 23.47 | 28.48 | 34.44 |
Key finding: InternVL-2.5 + ESCA surpasses the performance of vanilla GPT-4o on EB-Navigation.
Ablation Study¶
SGClip zero-shot generalization (recall):
- SGClip consistently outperforms vanilla CLIP on three out-of-domain datasets: OpenPVSG, Action Genome, and VidVRD.
- Performance improves steadily as training data scales from 1K to 10K to 87K samples.
ActivityNet action recognition:
| Method | Training Data Used | Accuracy |
|---|---|---|
| SGClip (zero-shot) | 0% | 76.34% |
| CLIP (zero-shot) | 0% | 74.37% |
| SGClip (few-shot) | 5% | 92.10% |
| InternVL-6B (full) | 100% | 95.90% |
With only 5% of training data, SGClip approaches the performance of fully supervised InternVL-6B.
VidVRD scene graph relation annotation (after fine-tuning):
| Model | P@1 | R@1 | P@5 | R@5 | P@10 | R@10 |
|---|---|---|---|---|---|---|
| SGClip-CLIP | 0.469 | 0.085 | 0.321 | 0.250 | 0.246 | 0.353 |
| SGClip | 0.495 | 0.087 | 0.350 | 0.270 | 0.278 | 0.385 |
Key Findings¶
- Error decomposition analysis: ESCA reduces InternVL's perception error rate on EB-Navigation from 69% to 30%.
- Cross-environment generalization: ESCA yields consistent improvements on EB-Habitat and EB-Alfred as well.
- Comparison with Grounding DINO/YOLO: Although both detectors improve over the baseline, ESCA provides significant additional gains beyond them.
Highlights & Insights¶
- Selective scene graphs: Rather than injecting a complete scene graph (which may degrade performance), the MLLM first identifies the most instruction-relevant concept subset, then generates a targeted scene graph.
- Probabilistic prediction: Each fact in the scene graph is associated with a confidence score, enabling uncertainty capture.
- Model-driven self-supervision: Learning signals are derived from GPT-4-generated captions and spatiotemporal specifications, requiring no human annotation.
- Elegant Transfer Protocol design: Adaptation to four distinct embodied environments is achieved through only two prompt templates.
Limitations & Future Work¶
- Insufficient real-time capability: High-level LLM planning introduces latency, making the framework unsuitable for low-level real-time control.
- 2D-only input: The absence of 3D representations (e.g., point clouds) limits depth reasoning and spatial precision.
- Lack of state verification: No formal mechanism is provided to verify intermediate and final states during execution.
Related Work & Insights¶
- ESCA shares a neuro-symbolic learning pipeline with LASER (ICLR 2025); SGClip can be viewed as an application of that pipeline to the embodied domain.
- The Scallop programming language enables differentiable symbolic alignment and serves as a key tool in the neuro-symbolic paradigm.
- The design philosophy of the Transfer Protocol generalizes to other visual understanding systems requiring cross-task adaptation.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of selective scene graphs and neuro-symbolic self-supervision is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four embodied environments × four MLLMs, plus independent scene graph evaluation.
- Practicality: ⭐⭐⭐⭐ — A plug-and-play framework applicable to diverse MLLMs.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with rich illustrations.