# Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention
Conference: NeurIPS 2025 · arXiv: 2512.24323 · Code: N/A · Area: Video Understanding / Segmentation · Keywords: Egocentric video segmentation, causal inference, backdoor adjustment, frontdoor adjustment, depth guidance
## TL;DR
This paper proposes CERES, a framework that addresses language bias and visual confusion in egocentric referring video object segmentation (Ego-RVOS) via dual-modal causal intervention — language backdoor adjustment to eliminate dataset statistical bias, and depth-guided visual frontdoor adjustment to construct causal mediators — achieving SOTA on VISOR/VOST/VSCOS.
## Background & Motivation
Background: Egocentric referring video object segmentation (Ego-RVOS) requires segmenting objects participating in specific actions based on natural language descriptions (e.g., "knife used to cut carrot") in egocentric video. Existing methods such as ActionVOS fine-tune pre-trained RVOS models with action descriptions to distinguish positive from negative objects.
Limitations of Prior Work: Existing methods tend to learn spurious correlations rather than causal relationships, resulting in poor robustness. This manifests at two levels: (1) Language bias — frequent co-occurrence of object-action pairs (e.g., "knife-cut") in datasets leads models to rely on statistical shortcuts rather than genuinely understanding instructions; (2) Visual confusion — the inherent rapid motion, frequent occlusion, and perspective distortion in egocentric video mislead models, especially given the domain shift from third-person pre-training data.
Key Challenge: Language bias stems from observable dataset-level statistical confounders (amenable to backdoor adjustment), while visual confusion arises from unobservable intrinsic factors (requiring frontdoor adjustment). Existing causal learning frameworks address either backdoor or frontdoor adjustment, but not both in a unified manner.
Goal: To simultaneously address two qualitatively different confounders — observable language bias and unobservable visual bias — within a unified framework.
Key Insight: The authors formulate Ego-RVOS as a causal graph with two classes of confounders via structural causal models (SCMs), applying distinct causal interventions to the textual and visual pathways. The key insight is leveraging depth information to construct mediating variables that are more robust to egocentric visual distortions.
Core Idea: Apply backdoor adjustment to eliminate linguistic statistical bias, and depth-guided frontdoor adjustment to bypass unobservable visual confounders, jointly achieving robust egocentric video segmentation.
## Method

### Overall Architecture
CERES is a plug-and-play causal framework built on top of pre-trained RVOS models (e.g., ReferFormer). Given an egocentric video frame sequence and a natural language query, it produces per-frame segmentation masks. The framework comprises two core modules: (1) a Language Backdoor Deconfounder (LBD) operating on text representations, and (2) a Visual Frontdoor Deconfounder (VFD) operating on visual features. After independently performing causal intervention on the textual and visual pathways, the debiased multimodal features are fed into the downstream components of the RVOS model (classification head, positive-sample discrimination head, mask decoder) to produce final segmentation results.
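To make the data flow concrete, here is a minimal skeleton of a CERES-style forward pass. All names (encode_text, encode_image, decode_mask, and the two deconfounder modules sketched under Key Designs below) are hypothetical stand-ins for illustration, not the authors' API.

```python
import torch

def ceres_forward(frames, query_tokens, rvos, lbd, vfd):
    """CERES-style pass over a clip (hypothetical interfaces).

    frames:       (T, 3, H, W) egocentric frame sequence
    query_tokens: tokenized referring expression
    rvos:         pre-trained RVOS backbone (e.g., ReferFormer-style)
    lbd, vfd:     language backdoor / visual frontdoor deconfounders
    """
    # Debias the text pathway once per query (backdoor adjustment).
    text_feat = lbd(rvos.encode_text(query_tokens))
    masks = []
    for frame in frames:
        # vfd internally runs the depth encoder, DAttn, MAttn, and gated fusion.
        vis_feat = vfd(rvos.encode_image(frame))
        # Downstream heads of the RVOS model consume the debiased features.
        masks.append(rvos.decode_mask(vis_feat, text_feat))
    return torch.stack(masks)  # per-frame segmentation masks
```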
### Key Designs
- Language Backdoor Deconfounder (LBD):
    - Function: Eliminate spurious language-segmentation associations caused by dataset statistical bias.
    - Mechanism: Based on Pearl's backdoor adjustment, the interventional distribution \(P(\mathcal{Y}|\text{do}(\mathcal{T}))\) is estimated. All unique verb-noun pairs (e.g., cut-knife) in the training set are enumerated as a confounder dictionary \(\{z_i\}_{i=1}^K\); their embeddings \(\mathbf{f}_\mathcal{Z}(z_i)\) are obtained from a text encoder and averaged under the empirical frequencies \(P(z_i)\) to yield a fixed vector \(\bar{\mathbf{f}}_\mathcal{Z}\). At inference, the debiased representation is obtained by adding this constant bias to the original text features: \(\mathbf{f}'_\mathcal{T}(t) = \mathbf{f}_\mathcal{T}(t) + \bar{\mathbf{f}}_\mathcal{Z}\). The softmax expectation is approximated via the Normalized Weighted Geometric Mean (NWGM), assuming the prediction score decomposes additively (see the sketch below).
    - Design Motivation: The confounder \(\mathcal{Z}\) is observable (derived from training-set statistics), so backdoor adjustment directly severs the spurious path \(\mathcal{T} \leftarrow \mathcal{Z} \rightarrow \mathcal{Y}\). The implementation is lightweight: one precomputation and one addition.
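A minimal PyTorch sketch of the LBD step as described above; the class name, tensor shapes, and interface are illustrative assumptions, since no official code is available.

```python
import torch
import torch.nn as nn

class LanguageBackdoorDeconfounder(nn.Module):
    """Backdoor adjustment on text features (hypothetical interface).

    Precomputes the frequency-weighted mean of the confounder embeddings,
    sum_i P(z_i) * f_Z(z_i), and adds it to every query representation,
    implementing f'_T(t) = f_T(t) + bar{f}_Z under the NWGM approximation.
    """
    def __init__(self, confounder_embs: torch.Tensor, freqs: torch.Tensor):
        super().__init__()
        # confounder_embs: (K, D) text-encoder embeddings of the K verb-noun pairs
        # freqs:           (K,) empirical frequencies P(z_i), summing to 1
        mean_conf = (freqs.unsqueeze(-1) * confounder_embs).sum(dim=0)  # (D,)
        self.register_buffer("mean_conf", mean_conf)  # fixed; computed once

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        # text_feat: (..., D) original query features f_T(t)
        return text_feat + self.mean_conf
```

Because the intervention reduces to one buffered vector and one addition, it adds essentially no inference cost, consistent with the lightweight claim above.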
- Visual Frontdoor Deconfounder (VFD) — Depth-Guided Attention (DAttn):
    - Function: Construct causal mediating variables that are robust to egocentric visual distortions.
    - Mechanism: Two feature streams are extracted from the visual input \(\mathcal{X}\): semantic visual features \(\mathbf{M}_v\) (from the RGB encoder) and geometric depth features \(\mathbf{M}_d\) (from a frozen Depth Anything V2 encoder). The core operation is cross-modal attention with depth features as Query and visual features as Key/Value, \(\hat{\mathbf{M}}(x) = \text{Attn}(Q=\mathbf{M}_d, K=V=\mathbf{M}_v)\), which the authors show is an MMSE (minimum mean-squared-error) estimator within the attention linear family (see the sketch below). Depth information is inherently more robust to motion blur and occlusion, so using it to guide semantic feature aggregation reduces the influence of the confounder \(\mathcal{U}\).
    - Design Motivation: The visual confounder \(\mathcal{U}\) (rapid motion, occlusion, and other intrinsic egocentric properties) is unobservable, precluding backdoor adjustment. Frontdoor adjustment requires a well-chosen mediator: a purely visual mediator readily inherits the influence of \(\mathcal{U}\), whereas geometric depth information is more stable under these distortions and serves as a reliable "guide" for selecting informative semantic features.
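A sketch of DAttn under the same caveats; the use of nn.MultiheadAttention and the assumption that both streams are already projected to a shared dimension are implementation guesses.

```python
import torch.nn as nn

class DepthGuidedAttention(nn.Module):
    """DAttn (hypothetical interface): M_hat = Attn(Q = M_d, K = V = M_v)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, depth_feat, vis_feat):
        # depth_feat: (B, N, D) tokens from the frozen depth encoder (stable geometry)
        # vis_feat:   (B, N, D) semantic RGB tokens M_v
        # Depth queries select which semantic features to aggregate,
        # yielding a mediator less exposed to the visual confounder U.
        mediator, _ = self.attn(query=depth_feat, key=vis_feat, value=vis_feat)
        return mediator  # M_hat: the depth-guided causal mediator
```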
- Visual Frontdoor Deconfounder (VFD) — Temporal Memory Attention (MAttn):
    - Function: Estimate the general contextual distribution \(\mathbb{E}_{\mathcal{X}'}[\mathbf{X}']\) over visual inputs, completing the second expectation in the frontdoor adjustment formula.
    - Mechanism: A sliding-window memory bank \(\mathcal{B}_t = \{x_{t-\tau}\}_{\tau=1}^W\) (with \(W=5\)) is maintained. Under a short-term stationarity assumption, the frame distribution within the window is approximately stationary, so the expectation is approximated by attention-weighted aggregation over the current and buffered frames (see the sketch below). This suits dynamic egocentric video streams better than static global-dictionary approaches and, by the law of large numbers, converges to the true expectation.
    - Design Motivation: Standard frontdoor-adjustment implementations precompute visual context expectations with a static global dictionary, which is impractical for long, dynamically varying egocentric video. The sliding-window scheme preserves theoretical consistency while adapting to temporal dynamics.
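A sketch of MAttn with the summary's \(W=5\) window; buffering detached per-frame token maps in a deque is an assumption about how the memory bank is realized.

```python
from collections import deque

import torch
import torch.nn as nn

class TemporalMemoryAttention(nn.Module):
    """MAttn (hypothetical interface): estimate E_{X'}[X'] over a sliding window."""
    def __init__(self, dim: int, window: int = 5, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bank = deque(maxlen=window)  # B_t = {x_{t-tau}}, tau = 1..W

    def forward(self, frame_feat: torch.Tensor) -> torch.Tensor:
        # frame_feat: (B, N, D) tokens of the current frame
        # Attention-weighted aggregation over current + buffered frames approximates
        # the context expectation under the short-term stationarity assumption.
        context = torch.cat([frame_feat, *self.bank], dim=1) if self.bank else frame_feat
        expectation, _ = self.attn(query=frame_feat, key=context, value=context)
        self.bank.append(frame_feat.detach())  # no gradients through the memory
        return expectation  # X_hat_t: per-token estimate of E[X']
```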
### Loss & Training
The overall loss employs standard segmentation objectives. During training, auxiliary segmentation losses are computed on features from the last three layers of the RGB encoder to provide a richer training signal; at inference, only the debiased visual features from the final layer are used. Visual and mediator features are fused via a gated residual connection (sketched below): \(\mathbf{f}'_\mathcal{X}(x_t) = \sigma \cdot \text{MLP}([\hat{\mathbf{M}}; \hat{\mathbf{X}}_t]) + (1-\sigma) \cdot \mathbf{X}_t\).
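A sketch of this fusion step, reading \(\sigma\) as a learned sigmoid gate predicted from the concatenated mediator and context features; the summary does not specify how \(\sigma\) is computed, so the gate head here is an assumption.

```python
import torch
import torch.nn as nn

class GatedResidualFusion(nn.Module):
    """f'_X = sigma * MLP([M_hat; X_hat_t]) + (1 - sigma) * X_t (gate head assumed)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.gate = nn.Linear(2 * dim, 1)  # assumption: one scalar gate per token

    def forward(self, mediator, context, vis_feat):
        # mediator: M_hat from DAttn; context: X_hat_t from MAttn; vis_feat: raw X_t
        cat = torch.cat([mediator, context], dim=-1)   # (B, N, 2D)
        sigma = torch.sigmoid(self.gate(cat))          # (B, N, 1), broadcast over D
        return sigma * self.mlp(cat) + (1.0 - sigma) * vis_feat
```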
## Key Experimental Results

### Main Results
| Dataset | Method | mIoU⊕↑ | cIoU⊕↑ | gIoU↑ | Acc↑ |
|---|---|---|---|---|---|
| VISOR (R101) | ActionVOS | 59.9 | 67.2 | 69.9 | 73.4 |
| VISOR (R101) | CERES | 64.0 | 72.8 | 72.4 | 76.3 |
| VISOR (SwinL) | ActionVOS | 66.3 | 71.9 | 68.7 | 73.4 |
| VISOR (SwinL) | CERES | 67.0 | 73.6 | 71.8 | 75.2 |
| VISOR-Novel | ActionVOS | 55.3 | 62.8 | 65.8 | 69.4 |
| VISOR-Novel | CERES | 60.0 | 69.9 | 67.9 | 72.2 |
| VSCOS | ActionVOS | 52.5 | 57.7 | — | — |
| VSCOS | CERES | 55.3 | 62.5 | — | — |
| VOST | ActionVOS | 30.2 | 17.6 | — | — |
| VOST | CERES | 32.0 | 21.7 | — | — |
### Ablation Study
| Configuration | mIoU⊕↑ | mIoU⊖↓ | gIoU↑ | Acc↑ |
|---|---|---|---|---|
| Baseline (ActionVOS) | 59.9 | 16.3 | 69.9 | 73.4 |
| + LBD only | 61.2 | 16.0 | 71.4 | 74.8 |
| + DAttn (MLP depth) | 62.1 | 17.5 | 70.5 | 73.6 |
| + DAttn (cross-attn) | 63.3 | 15.8 | 71.8 | 75.3 |
| + DAttn + MAttn | 63.1 | 14.9 | 72.1 | 76.1 |
| Full CERES | 64.0 | 15.3 | 72.4 | 76.3 |
### Key Findings
- DAttn (cross-attention depth fusion) contributes the most, improving mIoU⊕ by 3.4 points (59.9 → 63.3) when introduced alone and outperforming MLP-based depth fusion, supporting the causal-mediator design.
- Adding MAttn reduces mIoU⊖ to its lowest value (14.9%), indicating that temporal context helps better discriminate positive from negative samples.
- On the rare-concept subset, CERES outperforms ActionVOS by 3.9 points mIoU⊕ (62.3 vs. 58.4), confirming LBD's effectiveness at eliminating statistical bias.
- A temporal window of \(W=5\) is the optimal trade-off; further increases yield diminishing returns.
## Highlights & Insights
- Unified dual-modal causal framework: A single framework simultaneously addresses two qualitatively distinct confounders — observable language bias (backdoor) and unobservable visual confusion (frontdoor) — offering a more comprehensive treatment than prior causal methods that handle only one type.
- Depth information as causal mediator: Rather than naively concatenating RGB and depth features, the authors ground the use of depth in causal theory, demonstrating its robustness to visual confounders and using it to guide semantic feature aggregation. This paradigm is transferable to any visual task requiring domain-shift handling.
- Plug-and-play design: CERES functions as a modular plugin compatible with multiple RVOS backbones (R101/VSwinB/SwinL), offering strong practical utility.
## Limitations & Future Work
- The confounder dictionary in LBD relies on verb-noun pair statistics from the training set, which may lack flexibility in truly open-vocabulary scenarios.
- The short-term stationarity assumption underlying frontdoor adjustment may break down during rapid scene transitions.
- Validation is limited to kitchen-scene datasets; generalization to other egocentric settings (e.g., outdoor activities, industrial operations) remains unexplored.
- Depth information depends on a pre-trained monocular depth estimation model (Depth Anything V2), which may be unreliable under extreme conditions.
## Related Work & Insights
- vs. ActionVOS: ActionVOS incorporates action descriptions but still learns spurious correlations; CERES eliminates bias at its source via causal intervention, consistently outperforming it across all metrics.
- vs. GOAT: GOAT applies causal methods to handle confounders from vision, language, and action history, but is not designed for RVOS. CERES is the first to introduce dual-modal causal intervention into RVOS with depth-based mediation.
- vs. single-adjustment causal methods: Prior work employs either backdoor adjustment (e.g., visual captioning) or frontdoor adjustment (e.g., VQA) in isolation; CERES is the first to unify both within a single framework.
## Rating
- Novelty: ⭐⭐⭐⭐ The dual-modal causal intervention framework is novel, though backdoor/frontdoor adjustment individually have prior applications in other tasks.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, three backbones, and detailed ablations; lacks validation on more diverse scenarios.
- Writing Quality: ⭐⭐⭐⭐ The causal modeling derivations are clear, with explicit correspondence between theory and implementation.
- Value: ⭐⭐⭐⭐ Practically valuable for egocentric video understanding; the causal framework design is transferable.