SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion¶
Conference: CVPR 2026 arXiv: 2603.12764 Code: GitHub Area: Video Understanding Keywords: Cross-view, Imitation error detection, Adaptive sampling, View embedding, Bidirectional cross-attention
TL;DR¶
This paper formalizes the Ego→Exo imitation error detection task and proposes the SAVA-X (Align–Fuse–Detect) framework, which jointly addresses three core challenges—temporal misalignment, video redundancy, and cross-view domain gap—through three modules: adaptive sampling, scene-adaptive view embeddings, and bidirectional cross-attention fusion.
Background & Motivation¶
Error detection is critical in scenarios such as industrial training, medical procedure assessment, and assembly quality control. A common real-world setup involves evaluating first-person (ego) imitation execution against third-person (exo) demonstrations. However, existing methods mostly assume a single-view setting and cannot handle cross-view scenarios.
Core Challenges:
- Temporal Misalignment: Ego/Exo videos are recorded asynchronously with different durations and execution tempos (though duration differences do not constitute errors).
- Severe Redundancy: Long videos contain large amounts of uninformative content that dilutes attention mechanisms and amplifies false positives.
- Significant Cross-View Domain Gap: The ego view emphasizes local hand-object interactions, while the exo view captures global body pose and scene layout; their appearance and motion statistics differ substantially, making direct fusion unreliable.
Existing dense video captioning (DVC) and temporal action localization (TAL) baselines struggle in such cross-view settings.
Method¶
Overall Architecture¶
SAVA-X adopts a unified Align–Fuse–Detect framework design:
- A frozen video encoder extracts frame-level features from Exo/Ego streams separately.
- Adaptive sampling selects key temporal segments to reduce redundancy.
- Scene-adaptive view embeddings inject view-conditioned information to mitigate the domain gap.
- Bidirectional cross-attention fusion aligns and aggregates complementary cues.
- The fused sequence is fed into a deformable Transformer encoder-decoder to generate first-person temporal segments and imitation correctness predictions.
TSP (pre-trained on ActivityNet) is used as the frozen feature extractor with feature dimension \(d=512\).
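To make the data flow concrete, below is a minimal PyTorch sketch of the Align–Fuse–Detect pipeline. Every class, layer, and variable name is a placeholder of mine rather than the authors' released code: `nn.Identity`, a linear layer, and a single cross-attention stand in for the three SAVA-X modules detailed in the next subsection, and a plain Transformer decoder stands in for the deformable PDVC-style head.

```python
import torch
import torch.nn as nn

class SAVAXSkeleton(nn.Module):
    """Align-Fuse-Detect data flow; simple stubs stand in for the SAVA-X modules."""
    def __init__(self, d=512, num_queries=10):
        super().__init__()
        self.sample = nn.Identity()                                  # stub: adaptive sampling
        self.view_embed = nn.Linear(d, d)                            # stub: scene-adaptive view embedding
        self.fuse = nn.MultiheadAttention(d, 8, batch_first=True)    # stub: cross-view fusion
        dec_layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)  # stand-in for the deformable encoder-decoder
        self.queries = nn.Parameter(torch.randn(num_queries, d))
        self.seg_head = nn.Linear(d, 2)      # (start, end) of each predicted ego segment
        self.cls_head = nn.Linear(d, 2)      # imitation correct vs. erroneous

    def forward(self, exo, ego):
        # exo, ego: (B, T, 512) frame features from the frozen TSP encoder
        exo_k, ego_k = self.sample(exo), self.sample(ego)            # Align: keep salient frames
        exo_k = exo_k + self.view_embed(exo_k)                       # inject view-conditioned info
        ego_k = ego_k + self.view_embed(ego_k)
        fused, _ = self.fuse(ego_k, exo_k, exo_k)                    # Fuse: ego queries exo (one direction only here)
        q = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        dec = self.decoder(q, fused)                                 # Detect: query-based decoding
        return self.seg_head(dec), self.cls_head(dec)

# toy usage
model = SAVAXSkeleton()
segs, correctness = model(torch.randn(2, 64, 512), torch.randn(2, 80, 512))
```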
Key Designs¶
- Gated Adaptive Sampling:
- The Exo stream computes saliency scores via self-attention and FFN; the Ego stream uses Exo-conditioned cross-attention scores (leveraging the demonstration as a keyframe reference).
- During training, hard indices are generated via a Gumbel Top-K straight-through estimator, with a residual gate providing a differentiable gradient path: \(\mathbf{g}^{exo} = \mathbf{1} + \alpha(\text{Norm}(\boldsymbol{s}_x) - \mathbf{1})\)
- Downstream modules process only the small set of hard-selected keyframes, while gradients are propagated through soft scores.
- A selection entropy regularizer \(\mathcal{L}_{sel}\) prevents selection collapse, and a VICReg-style regularizer \(\mathcal{L}_{vic}\) suppresses dimensional collinearity (the selection mechanism is sketched below).
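A minimal sketch of Gumbel Top-K selection with a straight-through estimator and the residual gate \(\mathbf{g} = \mathbf{1} + \alpha(\text{Norm}(\boldsymbol{s}) - \mathbf{1})\). The shapes, the sigmoid used for Norm, and the values of \(\alpha\), \(\tau\), and K are my assumptions; the saliency scores themselves (self-attention for Exo, Exo-conditioned cross-attention for Ego) are computed upstream and not shown.

```python
import torch

def gumbel_topk_select(feats, scores, k, alpha=0.5, tau=1.0, training=True):
    """feats: (B, T, d) frame features; scores: (B, T) saliency logits."""
    if training:
        u = torch.rand_like(scores).clamp_(1e-6, 1.0 - 1e-6)
        noisy = (scores - torch.log(-torch.log(u))) / tau          # Gumbel-perturbed scores
    else:
        noisy = scores
    topk_idx = noisy.topk(k, dim=1).indices                        # hard Top-K indices
    hard = torch.zeros_like(scores).scatter(1, topk_idx, 1.0)      # one-hot selection mask
    soft = torch.softmax(noisy, dim=1)                             # differentiable surrogate
    mask = hard + soft - soft.detach()                             # straight-through estimator
    # Residual gate g = 1 + alpha * (Norm(s) - 1): a second, fully differentiable
    # gradient path through the raw scores (sigmoid as Norm is an assumption here).
    gate = 1.0 + alpha * (torch.sigmoid(scores) - 1.0)
    gated = feats * (mask * gate).unsqueeze(-1)
    # Downstream modules only see the K selected frames.
    selected = torch.gather(gated, 1,
                            topk_idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
    return selected, topk_idx

# toy usage: both streams pass through the same selector with their own scores
feats, scores = torch.randn(2, 128, 512), torch.randn(2, 128)
kept, idx = gumbel_topk_select(feats, scores, k=32)
```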
- Scene-aware Dictionary View Embeddings (a sketch follows this design's bullets):
- A shared view-scene dictionary \(\mathbf{D} \in \mathbb{R}^{M \times d}\) is maintained, whose row vectors capture common view sub-factors (e.g., "close-range hand-object interaction," "whole-body motion structure").
- Per-frame features query the dictionary via temperature-scaled multi-head cross-attention: \(\mathbf{VE}^u = \text{CrossAttn}(\hat{\mathbf{Z}}^u / \tau, \mathbf{D})\)
- Embeddings are injected at two points: before fusion (intra-domain alignment) and at each encoder layer (multi-level modulation).
- An attention entropy regularizer prevents overly peaked dictionary queries: \(\mathcal{L}_\text{view-ent} = \frac{1}{\log M} \mathbb{E}_t [\mathrm{KL}(\alpha_t \,\|\, U_M)]\)
- A dictionary diversity regularizer suppresses prototype redundancy: \(\mathcal{L}_\text{dict-div} = \|\hat{\mathbf{D}} \hat{\mathbf{D}}^\top - \mathbf{I}_M\|_F^2\)
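A minimal sketch of the dictionary-based view embedding and its two regularizers, assuming a single cross-attention layer; the dictionary size M, temperature \(\tau\), and head count here are my guesses rather than the paper's settings.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictViewEmbedding(nn.Module):
    def __init__(self, d=512, M=16, heads=8, tau=0.5):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(M, d))    # shared view-scene prototypes
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.tau = tau

    def forward(self, z):
        # z: (B, T, d) frame features from one view (ego or exo)
        B, M = z.size(0), self.dictionary.size(0)
        D = self.dictionary.unsqueeze(0).expand(B, -1, -1)    # (B, M, d)
        # Temperature-scaled queries control how sharply frames pick prototypes.
        ve, attn_w = self.attn(z / self.tau, D, D, average_attn_weights=True)
        # Attention-entropy regularizer: (1/log M) * E_t[KL(alpha_t || U_M)] = 1 - H/log M,
        # which keeps per-frame queries from collapsing onto a single prototype.
        p = attn_w.clamp_min(1e-9)
        entropy = -(p * p.log()).sum(-1)                      # (B, T)
        loss_view_ent = 1.0 - entropy.mean() / math.log(M)
        # Dictionary-diversity regularizer: push normalized rows toward orthogonality.
        Dn = F.normalize(self.dictionary, dim=-1)
        loss_dict_div = ((Dn @ Dn.t() - torch.eye(M, device=Dn.device)) ** 2).sum()
        return ve, loss_view_ent, loss_dict_div
```

In the full model the returned embedding would be added to the frame features both before fusion and again at each encoder layer, as described above.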
- Bidirectional Cross-Attention Fusion (a sketch follows this design's bullets):
- Symmetric bidirectional cross-attention is computed in parallel: the Ego stream retrieves global boundary/step cues from Exo, while the Exo stream retrieves hand-object details and local causal relationships from Ego.
- A learnable gated residual mixture prevents either stream from dominating: \(\mathbf{F}^{ego} = (1-\boldsymbol{\gamma}^e)\tilde{\mathbf{Z}}^{ego} + \boldsymbol{\gamma}^e \mathbf{E}^\star\)
- The gate \(\boldsymbol{\gamma}^e = \sigma(\mathbf{W}_e[\tilde{\mathbf{Z}}^{ego}; \mathbf{E}^\star])\) lets the model lean more heavily on cross-view evidence at action boundaries and key interactions.
- Final fusion: \(\tilde{\mathbf{Z}}^{fused} = \frac{1}{2}(\mathbf{F}^{ego} + \mathbf{F}^{exo})\)
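A minimal sketch of the bidirectional cross-attention fusion with the gated residual mixture. Layer names and head counts are my assumptions, and the final averaging presumes both streams were sampled to the same length K by the adaptive sampler.

```python
import torch
import torch.nn as nn

class BiCrossFusion(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.ego_from_exo = nn.MultiheadAttention(d, heads, batch_first=True)
        self.exo_from_ego = nn.MultiheadAttention(d, heads, batch_first=True)
        self.gate_e = nn.Linear(2 * d, d)   # gamma^e = sigmoid(W_e [Z_ego ; E*])
        self.gate_x = nn.Linear(2 * d, d)   # symmetric gate for the exo stream

    def forward(self, z_ego, z_exo):
        # z_ego, z_exo: (B, K, d) sampled features after view-embedding injection
        e_star, _ = self.ego_from_exo(z_ego, z_exo, z_exo)   # ego queries exo: boundary/step cues
        x_star, _ = self.exo_from_ego(z_exo, z_ego, z_ego)   # exo queries ego: hand-object detail
        g_e = torch.sigmoid(self.gate_e(torch.cat([z_ego, e_star], dim=-1)))
        g_x = torch.sigmoid(self.gate_x(torch.cat([z_exo, x_star], dim=-1)))
        f_ego = (1 - g_e) * z_ego + g_e * e_star             # gated residual mixing
        f_exo = (1 - g_x) * z_exo + g_x * x_star
        return 0.5 * (f_ego + f_exo)                         # final fused sequence
```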
Loss & Training¶
Two losses are jointly optimized:
- DVC loss \(\mathcal{L}_{DVC}\): dense video captioning loss following the PDVC configuration.
- Imitation discrimination loss \(\mathcal{L}_{Imit}\): weighted by \(\lambda_{Imit} = 0.5\).
Training uses Hungarian set matching to establish one-to-one correspondence between predictions and ground-truth segments. Optimizer: AdamW; batch size: 16; learning rate: \(10^{-4}\). Regularization weights are in the range \([0.01, 0.05]\).
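A minimal sketch of how the matching and the total objective could be wired up, assuming an L1 cost over segment boundaries for the Hungarian matching and a single shared weight for the regularizers; the internals of \(\mathcal{L}_{DVC}\) and \(\mathcal{L}_{Imit}\) follow PDVC and are abstracted away here.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_segs, gt_segs):
    # pred_segs: (Q, 2) predicted (start, end); gt_segs: (G, 2) ground-truth segments
    cost = torch.cdist(pred_segs, gt_segs, p=1)               # (Q, G) L1 matching cost
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    return row, col                                            # one-to-one assignment

def total_loss(l_dvc, l_imit, l_sel, l_vic, l_view_ent, l_dict_div,
               lam_imit=0.5, lam_reg=0.02):
    # lam_reg stands in for the per-regularizer weights reported in [0.01, 0.05].
    return (l_dvc + lam_imit * l_imit
            + lam_reg * (l_sel + l_vic + l_view_ent + l_dict_div))
```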
Key Experimental Results¶
Main Results¶
Evaluated on the EgoMe dataset (7,902 asynchronous Exo-Ego video pairs, approximately 82.8 hours):
| Method | Category | Val AUPRC@0.3 | Val AUPRC@0.5 | Val AUPRC@0.7 | Val Mean | Val tIoU | Test Mean | Test tIoU |
|---|---|---|---|---|---|---|---|---|
| PDVC | DVC | 28.21 | 20.48 | 7.95 | 18.88 | 58.58 | 16.20 | 57.98 |
| Exo2EgoDVC | DVC | 31.33 | 20.27 | 7.49 | 19.69 | 59.06 | 15.99 | 58.15 |
| ActionFormer | TAL | 31.37 | 15.41 | 2.63 | 16.47 | 48.89 | 14.08 | 48.25 |
| TriDet | TAL | 30.04 | 14.61 | 2.44 | 15.70 | 49.05 | 13.77 | 49.02 |
| PDVC (Ego only) | DVC | 19.35 | 13.91 | 5.11 | 12.79 | 57.63 | 13.94 | 57.19 |
| SAVA-X | Ours | 33.56 | 24.04 | 9.48 | 22.36 | 59.31 | 18.50 | 58.32 |
Ablation Study¶
| AS | SVE | BiX | AUPRC@0.3 | AUPRC@0.5 | AUPRC@0.7 | Mean | tIoU |
|---|---|---|---|---|---|---|---|
| | | | 28.21 | 20.48 | 7.95 | 18.88 | 58.58 |
| ✓ | | | 30.90 | 22.60 | 9.21 | 20.90 | 58.88 |
| | ✓ | | 31.64 | 22.87 | 9.37 | 21.29 | 59.27 |
| | | ✓ | 33.08 | 21.86 | 8.23 | 21.06 | 58.27 |
| ✓ | ✓ | | 30.89 | 24.26 | 10.32 | 21.82 | 58.96 |
| ✓ | | ✓ | 29.98 | 22.27 | 8.70 | 20.32 | 58.14 |
| | ✓ | ✓ | 35.09 | 22.58 | 9.31 | 22.33 | 58.76 |
| ✓ | ✓ | ✓ | 33.56 | 24.04 | 9.48 | 22.36 | 59.31 |
Key Findings¶
- Complementarity of the three modules: Each module on its own yields a relative improvement of 10.7%–12.8% in Mean AUPRC over the baseline; their combination achieves the best performance, demonstrating that redundancy removal, domain gap mitigation, and bidirectional fusion address distinct bottlenecks.
- SVE+BiX is the strongest pairing: Reducing the domain gap combined with bidirectional cross-validation yields the best result.
- Unidirectional ablation: The Exo→Ego direction performs comparably to the full bidirectional setting, while Ego→Exo is weaker—consistent with the task objective of detecting errors in the Ego stream, which relies on boundary and ordering cues from the demonstration to guide imitation assessment.
- Ego-only input leads to substantial degradation (Mean AUPRC 12.79 vs. 18.88), validating the necessity of third-person demonstration signals for reducing false positives.
- Frame rate and Top-K analysis: At low frame rates, more frames must be retained to avoid information loss; at high frame rates, retaining a small number of high-scoring frames suffices.
- SVE domain gap analysis: After injecting SVE, the cross-view similarity distribution shifts rightward and becomes more concentrated, effectively mitigating the domain gap.
Highlights & Insights¶
- Task formalization contribution: This work is the first to systematically formalize the Ego→Exo imitation error detection task, clearly defining inputs, outputs, and evaluation protocols.
- Module design precisely maps to challenges: AS→redundancy, SVE→domain gap, BiX→cross-view fusion; each design directly targets one of the three core challenges.
- Gumbel Top-K + residual gating: The approach elegantly combines the sparsity of discrete selection with the gradient signal of a continuous path, resolving the sparse-gradient problem inherent in hard sampling.
- Dictionary view embeddings outperform fixed tokens: Attention-driven dictionary queries adapt to different scenes, whereas fixed learned tokens yield limited gains.
- Highly thorough ablations: Beyond module-level ablations, fine-grained analyses cover frame rate, Top-K ratio, dictionary size, regularization weights, and fusion variants.
Limitations & Future Work¶
- Evaluated on EgoMe only: Generalization to other cross-view datasets remains unknown.
- Frozen feature extractor: End-to-end fine-tuning may yield further improvements at the cost of increased computational overhead.
- No semantic/textual information utilized: Incorporating step descriptions could enable more precise error categorization.
- Single-layer cross-attention: Stacking multiple layers may improve cross-view alignment quality.
- Limited tIoU improvement: Gains in temporal localization quality are relatively modest compared to AUPRC improvements; boundary prediction remains a direction for further optimization.
Related Work & Insights¶
- PDVC is the dominant DVC method; SAVA-X adopts its encoder-decoder configuration.
- ActionFormer / TriDet are strong TAL baselines but struggle in cross-view settings.
- Exo2EgoDVC is a pioneering work on cross-view captioning, employing view-invariant adversarial learning.
- Insight: In cross-view tasks, naively concatenating Ego/Exo features is insufficient; explicit domain gap modeling and information interaction mechanisms are required.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Theoretical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Practical Value | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐ |