
SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

Conference: CVPR 2026 arXiv: 2603.12764 Code: GitHub Area: Video Understanding Keywords: Cross-view, Imitation error detection, Adaptive sampling, View embedding, Bidirectional cross-attention

TL;DR

This paper formalizes the Ego→Exo imitation error detection task and proposes the SAVA-X (Align–Fuse–Detect) framework, which jointly addresses three core challenges—temporal misalignment, video redundancy, and cross-view domain gap—through three modules: adaptive sampling, scene-adaptive view embeddings, and bidirectional cross-attention fusion.

Background & Motivation

Error detection is critical in scenarios such as industrial training, medical procedure assessment, and assembly quality control. A common real-world setup involves evaluating first-person (ego) imitation execution against third-person (exo) demonstrations. However, existing methods mostly assume a single-view setting and cannot handle cross-view scenarios.

Core Challenges:

  1. Temporal Misalignment: Ego/Exo videos are recorded asynchronously with different durations and execution tempos (though duration differences do not constitute errors).
  2. Severe Redundancy: Long videos contain large amounts of uninformative content that dilutes attention mechanisms and amplifies false positives.
  3. Significant Cross-View Domain Gap: The ego view emphasizes local hand-object interactions, while the exo view captures global body pose and scene layout; their appearance and motion statistics differ substantially, making direct fusion unreliable.

Existing dense video captioning (DVC) and temporal action localization (TAL) baselines struggle in such cross-view settings.

Method

Overall Architecture

SAVA-X adopts a unified Align–Fuse–Detect framework design:

  1. A frozen video encoder extracts frame-level features from Exo/Ego streams separately.
  2. Adaptive sampling selects key temporal segments to reduce redundancy.
  3. Scene-adaptive view embeddings inject view-conditioned information to mitigate the domain gap.
  4. Bidirectional cross-attention fusion aligns and aggregates complementary cues.
  5. The fused sequence is fed into a deformable Transformer encoder-decoder to generate first-person temporal segments and imitation correctness predictions.

TSP (pre-trained on ActivityNet) is used as the frozen feature extractor with feature dimension \(d=512\).
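For intuition, here is a deliberately simplified, runnable PyTorch skeleton of the Align–Fuse–Detect pipeline. It uses stand-ins for the paper's actual modules (plain Top-K instead of gated Gumbel sampling, fixed view tokens instead of the scene-adaptive dictionary, a vanilla transformer decoder instead of the deformable one); the class name, head counts, and all dimensions other than d = 512 are assumptions, and the real modules are sketched individually under Key Designs below.

```python
import torch
import torch.nn as nn

class MiniAlignFuseDetect(nn.Module):
    """Simplified Align-Fuse-Detect pipeline for intuition only (not the paper's modules)."""

    def __init__(self, d=512, k=64, num_queries=10):
        super().__init__()
        self.k = k
        self.score = nn.Linear(d, 1)                    # saliency scorer for frame selection
        self.view_embed = nn.Embedding(2, d)            # 0 = ego, 1 = exo (fixed tokens here)
        self.cross = nn.MultiheadAttention(d, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, 8, batch_first=True), num_layers=2)
        self.queries = nn.Parameter(torch.randn(num_queries, d))
        self.head = nn.Linear(d, 3)                     # (start, end, imitation-correctness logit)

    def forward(self, z_ego, z_exo):                    # frozen TSP features, (B, T, d), T >= k
        # 1) Align: keep the k most salient frames per stream (stand-in for adaptive sampling)
        def sample(z):
            idx = self.score(z).squeeze(-1).topk(self.k, dim=1).indices
            idx = idx.sort(dim=1).values                # restore temporal order of kept frames
            return torch.gather(z, 1, idx.unsqueeze(-1).expand(-1, -1, z.size(-1)))
        z_ego, z_exo = sample(z_ego), sample(z_exo)
        # 2) Inject view-conditioned information, then fuse ego with exo context
        z_ego = z_ego + self.view_embed.weight[0]
        z_exo = z_exo + self.view_embed.weight[1]
        fused, _ = self.cross(z_ego, z_exo, z_exo)
        # 3) Detect: query-based decoding of segments and correctness predictions
        q = self.queries.unsqueeze(0).expand(z_ego.size(0), -1, -1)
        return self.head(self.decoder(q, fused))        # (B, num_queries, 3)
```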

Key Designs

  1. Gated Adaptive Sampling (see the sketch after this item):
     • The Exo stream computes saliency scores via self-attention and an FFN; the Ego stream uses Exo-conditioned cross-attention scores, leveraging the demonstration as a keyframe reference.
     • During training, hard indices are generated via a Gumbel Top-K straight-through estimator, with a residual gate providing a differentiable gradient path: \(\mathbf{g}^{exo} = \mathbf{1} + \alpha(\text{Norm}(\boldsymbol{s}_x) - \mathbf{1})\)
     • Downstream modules process only the small set of hard-selected keyframes, while gradients are propagated through the soft scores.
     • A selection entropy regularizer \(\mathcal{L}_{sel}\) prevents selection collapse, and a VICReg-style regularizer \(\mathcal{L}_{vic}\) suppresses dimensional collinearity.
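A minimal PyTorch sketch of this hard-selection-with-soft-gradient idea, assuming a single stream's (T, d) frame features and precomputed saliency logits; the choice of Norm(·) (mean-one normalization here) and the values of alpha, tau, and k are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def gumbel_topk_select(z, scores, k, alpha=0.5, tau=1.0):
    """Hard Top-K frame selection with a straight-through-style gradient path.

    z:      (T, d) frame features.
    scores: (T,)   saliency logits from the stream's scorer.
    Returns (k, d) selected features; gradients reach the scorer through the
    residual gate g = 1 + alpha * (Norm(s) - 1) rather than the hard indices.
    """
    # Gumbel perturbation makes the Top-K selection stochastic during training
    gumbel = -torch.log(-torch.log(torch.rand_like(scores) + 1e-9) + 1e-9)
    topk_idx = ((scores + gumbel) / tau).topk(k).indices   # hard indices (no gradient)

    soft = torch.softmax(scores, dim=0)                    # differentiable scores
    s_norm = soft * soft.numel()                           # mean-one normalization (assumed Norm)
    gate = 1.0 + alpha * (s_norm - 1.0)                    # residual gate, centred at 1

    z_gated = z * gate.unsqueeze(-1)                       # soft gradient path to the scorer
    return z_gated[topk_idx]                               # downstream modules see only k frames
```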

  2. Scene-aware Dictionary View Embeddings (see the sketch after this item):
     • A shared view-scene dictionary \(\mathbf{D} \in \mathbb{R}^{M \times d}\) is maintained, whose row vectors capture common view sub-factors (e.g., "close-range hand-object interaction," "whole-body motion structure").
     • Per-frame features query the dictionary via temperature-scaled multi-head cross-attention: \(\mathbf{VE}^u = \text{CrossAttn}(\hat{\mathbf{Z}}^u / \tau, \mathbf{D})\)
     • Embeddings are injected at two points: before fusion (intra-domain alignment) and at each encoder layer (multi-level modulation).
     • An attention entropy regularizer prevents overly peaked dictionary queries: \(\mathcal{L}_\text{view-ent} = \frac{1}{\log M} \mathbb{E}_t \left[ \mathrm{KL}(\boldsymbol{\alpha}_t \,\|\, U_M) \right]\)
     • A dictionary diversity regularizer suppresses prototype redundancy: \(\mathcal{L}_\text{dict-div} = \|\hat{\mathbf{D}} \hat{\mathbf{D}}^\top - \mathbf{I}_M\|_F^2\)
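A minimal PyTorch sketch of the dictionary view embedding and its two regularizers, assuming (B, T, d) frame features, M = 16 prototypes, and a temperature of 0.07; the class and function names are hypothetical, chosen only to mirror the formulas above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictionaryViewEmbedding(nn.Module):
    """Per-frame features query a shared view-scene dictionary D (M x d)."""

    def __init__(self, d=512, m=16, heads=8, tau=0.07):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(m, d) * 0.02)   # rows = view sub-factors
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.tau = tau

    def forward(self, z):                                  # z: (B, T, d) frame features
        d_mat = self.dictionary.unsqueeze(0).expand(z.size(0), -1, -1)
        ve, alpha = self.attn(z / self.tau, d_mat, d_mat)  # VE: (B, T, d), alpha: (B, T, M)
        return ve, alpha

def view_entropy_reg(alpha, eps=1e-9):
    """(1 / log M) * E_t[ KL(alpha_t || Uniform_M) ]: discourages peaked dictionary queries."""
    m = alpha.size(-1)
    kl = (alpha * ((alpha + eps) * m).log()).sum(-1)       # KL to the uniform distribution
    return kl.mean() / math.log(m)

def dict_diversity_reg(dictionary):
    """|| D_hat D_hat^T - I_M ||_F^2 over row-normalized prototypes: suppresses redundancy."""
    d_hat = F.normalize(dictionary, dim=-1)
    gram = d_hat @ d_hat.t()
    identity = torch.eye(gram.size(0), device=gram.device)
    return ((gram - identity) ** 2).sum()
```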

  3. Bidirectional Cross-Attention Fusion (see the sketch after this item):
     • Symmetric bidirectional cross-attention is computed in parallel: the Ego stream retrieves global boundary/step cues from Exo, while the Exo stream retrieves hand-object details and local causal relationships from Ego.
     • A learnable gated residual mixture prevents either stream from dominating: \(\mathbf{F}^{ego} = (1-\boldsymbol{\gamma}^e)\tilde{\mathbf{Z}}^{ego} + \boldsymbol{\gamma}^e \mathbf{E}^\star\)
     • The gate \(\boldsymbol{\gamma}^e = \sigma(\mathbf{W}_e[\tilde{\mathbf{Z}}^{ego}; \mathbf{E}^\star])\) opens wider at action boundaries and key interactions, so fusion relies more on cross-view evidence exactly where it matters.
     • Final fusion: \(\tilde{\mathbf{Z}}^{fused} = \frac{1}{2}(\mathbf{F}^{ego} + \mathbf{F}^{exo})\)
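A minimal PyTorch sketch of the fusion step, assuming both streams have already been sampled to the same number of frames and share dimension d; the module name, head count, and the elementwise gate parameterization are assumptions consistent with the formulas above.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Symmetric cross-attention with a learnable gated residual mixture."""

    def __init__(self, d=512, heads=8):
        super().__init__()
        self.ego_from_exo = nn.MultiheadAttention(d, heads, batch_first=True)
        self.exo_from_ego = nn.MultiheadAttention(d, heads, batch_first=True)
        self.gate_ego = nn.Linear(2 * d, d)                 # gamma^e = sigma(W_e [Z_ego; E*])
        self.gate_exo = nn.Linear(2 * d, d)

    def forward(self, z_ego, z_exo):                        # both (B, K, d) after sampling
        # Ego retrieves global boundary/step cues from the Exo demonstration
        e_star, _ = self.ego_from_exo(z_ego, z_exo, z_exo)
        # Exo retrieves hand-object details from the Ego execution
        x_star, _ = self.exo_from_ego(z_exo, z_ego, z_ego)

        g_e = torch.sigmoid(self.gate_ego(torch.cat([z_ego, e_star], dim=-1)))
        g_x = torch.sigmoid(self.gate_exo(torch.cat([z_exo, x_star], dim=-1)))

        f_ego = (1 - g_e) * z_ego + g_e * e_star            # gated residual mixture
        f_exo = (1 - g_x) * z_exo + g_x * x_star
        return 0.5 * (f_ego + f_exo)                        # final fused sequence
```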

Loss & Training

Two losses are jointly optimized:

  • DVC loss \(\mathcal{L}_{DVC}\): dense video captioning loss following the PDVC configuration.
  • Imitation discrimination loss \(\mathcal{L}_{Imit}\): weighted by \(\lambda_{Imit} = 0.5\).

Training uses Hungarian set matching to establish one-to-one correspondence between predictions and ground-truth segments. Optimizer: AdamW; batch size: 16; learning rate: \(10^{-4}\). Regularization weights are in the range \([0.01, 0.05]\).
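A hedged sketch of the set matching and joint objective described above; the plain L1 segment cost, the helper names, and the single shared regularizer weight are illustrative assumptions (the actual matching cost follows PDVC, and the reported regularizer weights lie in [0.01, 0.05]).

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_segs, gt_segs):
    """One-to-one assignment between predicted and ground-truth segments.

    pred_segs: (N, 2), gt_segs: (M, 2) tensors of [start, end] times.
    Uses an L1 cost for illustration; the paper's cost follows the PDVC setup.
    """
    cost = torch.cdist(pred_segs, gt_segs, p=1)             # (N, M) pairwise cost
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(rows.tolist(), cols.tolist()))

def total_loss(l_dvc, l_imit, reg_losses, lambda_imit=0.5, reg_w=0.02):
    """Joint objective: DVC loss + weighted imitation loss + regularizer terms."""
    return l_dvc + lambda_imit * l_imit + reg_w * sum(reg_losses)

# Reported training configuration
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # batch size 16
```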

Key Experimental Results

Main Results

Evaluated on the EgoMe dataset (7,902 asynchronous Exo-Ego video pairs, approximately 82.8 hours):

| Method | Category | Val AUPRC@0.3 | Val AUPRC@0.5 | Val AUPRC@0.7 | Val Mean | Val tIoU | Test Mean | Test tIoU |
|---|---|---|---|---|---|---|---|---|
| PDVC | DVC | 28.21 | 20.48 | 7.95 | 18.88 | 58.58 | 16.20 | 57.98 |
| Exo2EgoDVC | DVC | 31.33 | 20.27 | 7.49 | 19.69 | 59.06 | 15.99 | 58.15 |
| ActionFormer | TAL | 31.37 | 15.41 | 2.63 | 16.47 | 48.89 | 14.08 | 48.25 |
| TriDet | TAL | 30.04 | 14.61 | 2.44 | 15.70 | 49.05 | 13.77 | 49.02 |
| PDVC (Ego only) | DVC | 19.35 | 13.91 | 5.11 | 12.79 | 57.63 | 13.94 | 57.19 |
| SAVA-X | Ours | 33.56 | 24.04 | 9.48 | 22.36 | 59.31 | 18.50 | 58.32 |

Ablation Study

| AS | SVE | BiX | AUPRC@0.3 | AUPRC@0.5 | AUPRC@0.7 | Mean | tIoU |
|---|---|---|---|---|---|---|---|
|   |   |   | 28.21 | 20.48 | 7.95 | 18.88 | 58.58 |
| ✓ |   |   | 30.90 | 22.60 | 9.21 | 20.90 | 58.88 |
|   | ✓ |   | 31.64 | 22.87 | 9.37 | 21.29 | 59.27 |
|   |   | ✓ | 33.08 | 21.86 | 8.23 | 21.06 | 58.27 |
| ✓ | ✓ |   | 30.89 | 24.26 | 10.32 | 21.82 | 58.96 |
| ✓ |   | ✓ | 29.98 | 22.27 | 8.70 | 20.32 | 58.14 |
|   | ✓ | ✓ | 35.09 | 22.58 | 9.31 | 22.33 | 58.76 |
| ✓ | ✓ | ✓ | 33.56 | 24.04 | 9.48 | 22.36 | 59.31 |

Key Findings

  1. Complementarity of three modules: Each module independently yields a relative improvement of +10.7%–+12.8%; their combination achieves the best performance, demonstrating that redundancy removal, domain gap mitigation, and bidirectional fusion address distinct bottlenecks.
  2. SVE+BiX is the strongest pairing: Reducing the domain gap combined with bidirectional cross-validation yields the best result.
  3. Unidirectional ablation: The Exo→Ego direction performs comparably to the full bidirectional setting, while Ego→Exo is weaker—consistent with the task objective of detecting errors in the Ego stream, which relies on boundary and ordering cues from the demonstration to guide imitation assessment.
  4. Ego-only input leads to substantial degradation (Mean AUPRC 12.79 vs. 18.88), validating the necessity of third-person demonstration signals for reducing false positives.
  5. Frame rate and Top-K analysis: At low frame rates, more frames must be retained to avoid information loss; at high frame rates, retaining a small number of high-scoring frames suffices.
  6. SVE domain gap analysis: After injecting SVE, the cross-view similarity distribution shifts rightward and becomes more concentrated, effectively mitigating the domain gap.

Highlights & Insights

  • Task formalization contribution: This work is the first to systematically formalize the Ego→Exo imitation error detection task, clearly defining inputs, outputs, and evaluation protocols.
  • Module design precisely maps to challenges: AS→redundancy, SVE→domain gap, BiX→cross-view fusion; each design directly targets one of the three core challenges.
  • Gumbel Top-K + residual gating: The approach elegantly combines the sparsity of discrete selection with the gradient signal of a continuous path, resolving the sparse-gradient problem inherent in hard sampling.
  • Dictionary view embeddings outperform fixed tokens: Attention-driven dictionary queries adapt to different scenes, whereas fixed learned tokens yield limited gains.
  • Highly thorough ablations: Beyond module-level ablations, fine-grained analyses cover frame rate, Top-K ratio, dictionary size, regularization weights, and fusion variants.

Limitations & Future Work

  1. Evaluated on EgoMe only: Generalization to other cross-view datasets remains unknown.
  2. Frozen feature extractor: End-to-end fine-tuning may yield further improvements at the cost of increased computational overhead.
  3. No semantic/textual information utilized: Incorporating step descriptions could enable more precise error categorization.
  4. Single-layer cross-attention: Stacking multiple layers may improve cross-view alignment quality.
  5. Limited tIoU improvement: Gains in temporal localization quality are relatively modest compared to AUPRC improvements; boundary prediction remains a direction for further optimization.

Related Methods

  • PDVC is the dominant DVC method; SAVA-X adopts its encoder-decoder configuration.
  • ActionFormer / TriDet are strong TAL baselines but struggle in cross-view settings.
  • Exo2EgoDVC is a pioneering work on cross-view captioning, employing view-invariant adversarial learning.
  • Insight: In cross-view tasks, naively concatenating Ego/Exo features is insufficient; explicit domain-gap modeling and information-interaction mechanisms are required.

Rating

| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Theoretical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Practical Value | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐ |