SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion
Conference: CVPR 2026 arXiv: 2603.12764 Code: jack1ee/SAVAX Area: Video Understanding Keywords: cross-view imitation error detection, adaptive sampling, scene-aware view embedding, bidirectional cross-attention fusion, egocentric-exocentric video
TL;DR
This paper proposes SAVA-X, a framework built from three complementary modules (adaptive sampling, scene-aware view embedding, and bidirectional cross-attention fusion) for temporal imitation error detection in the exocentric-demonstration-to-egocentric-imitation setting. On the EgoMe benchmark, SAVA-X improves over existing baselines across all reported metrics.
Background & Motivation
Strong practical demand: In industrial assembly, medical training, and robot imitation learning, operators (first-person/ego) must execute actions based on third-person (exo) demonstrations, making error detection critical for quality control.
Prior methods limited to single-view: Existing error detection approaches (e.g., PREGO) assume single-view input and cannot handle the realistic scenario where demonstrations and executions come from different viewpoints.
Temporal misalignment: Ego/Exo videos are recorded asynchronously with different durations and pacing; direct feature alignment causes false positives, as duration differences alone do not constitute errors.
Severe redundancy interference: Long videos contain substantial irrelevant content that dilutes attention mechanisms and increases false positives; experiments show that baseline methods degrade in performance as the number of input frames increases.
Significant viewpoint domain gap: The ego view focuses on local hand-object interactions while the exo view captures whole-body posture and scene layout; their appearance and motion statistics differ substantially, making direct feature fusion unreliable.
Lack of unified evaluation protocol: The task had not been formally defined, and the absence of standardized baselines and evaluation frameworks has impeded research progress.
Method
Overall Architecture: Align–Fuse–Detect
SAVA-X employs a frozen video encoder (TSP, pretrained on ActivityNet) to extract per-frame features, which are then processed sequentially through three core modules and a deformable Transformer encoder-decoder for final prediction:
- Adaptive Sampling (AS) → redundancy removal + temporal alignment
- Scene-aware View-dictionary Embedding (SVE) → reducing the cross-view domain gap
- Bidirectional Cross-attention Fusion (BiX) → complementary evidence aggregation
- Decoder outputs egocentric temporal segments with imitation correctness judgments
Key Design 1: Gated Adaptive Sampling (AS)
- Exo side: Computes saliency scores via self-attention + FFN; retains keyframes via Gumbel Top-K hard selection.
- Ego side: Scores frames via cross-attention with already-sampled Exo features as Key/Value, making Ego sampling sensitive to demonstration keypoints.
- Residual gating: Introduces soft gating alongside hard selection via \(\mathbf{g} = \mathbf{1} + \alpha(\text{Norm}(\mathbf{s}) - \mathbf{1})\) to provide stable gradients to the scorer.
- Regularization: A selection-entropy term \(\mathcal{L}_{\text{sel}}\) prevents selection collapse, and a VICReg-style term \(\mathcal{L}_{\text{vic}}\) prevents dimensional collinearity; a minimal sketch of the sampling-and-gating step follows this list.
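The PyTorch-style sketch below shows one way to realize Gumbel Top-K selection with the residual soft gate. The scorer, the stand-in for the paper's Norm(·), and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch

def gated_adaptive_sampling(feats, scorer, k, alpha=0.5, tau=1.0):
    """Minimal sketch of gated adaptive sampling (AS), assuming:
      feats:  (B, T, d) per-frame features from the frozen encoder
      scorer: callable mapping (B, T, d) -> (B, T) saliency logits
              (self-attention + FFN on the Exo side; cross-attention against
               the already-sampled Exo features on the Ego side)
      k:      number of frames to keep
    """
    s = scorer(feats)                                           # (B, T) saliency scores
    # Gumbel Top-K hard selection: perturb scores with Gumbel noise, keep the top k.
    gumbel = -torch.log(-torch.log(torch.rand_like(s) + 1e-9) + 1e-9)
    idx = torch.topk((s + gumbel) / tau, k, dim=1).indices      # (B, k)
    kept = torch.gather(feats, 1, idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))

    # Residual soft gate g = 1 + alpha * (Norm(s) - 1); sigmoid stands in for the
    # unspecified Norm(.). The gate keeps gradients flowing to the scorer even
    # though the hard top-k selection itself is non-differentiable.
    s_kept = torch.gather(s, 1, idx)                            # (B, k)
    gate = 1.0 + alpha * (torch.sigmoid(s_kept) - 1.0)
    return kept * gate.unsqueeze(-1), idx
```

In this reading, the hard selection decides which frames survive while the gate modulates their magnitude, so the scorer receives a gradient signal from every retained frame.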
Key Design 2: Scene-Aware View-Dictionary Embedding (SVE)
- Maintains a shared view-scene dictionary \(\mathbf{D} \in \mathbb{R}^{M \times d}\), whose row vectors capture common view sub-factors (e.g., "close-range hand-object interaction," "whole-body motion structure").
- Each view stream retrieves adaptive view embeddings from the dictionary via temperature-scaled cross-attention: \(\mathbf{VE}^u = \text{CrossAttn}(\hat{\mathbf{Z}}^u / \tau, \mathbf{D})\).
- Two-stage injection: Injected once into each Ego/Exo stream before fusion, and again at multiple temporal levels within the encoder.
- Attention entropy regularization \(\mathcal{L}_{\text{view-ent}}\): Prevents overly sharp attention distributions and encourages uniform dictionary coverage.
- Dictionary diversity regularization \(\mathcal{L}_{\text{dict-div}}\): Enforces approximate orthogonality among normalized dictionary rows. A sketch of the retrieval and both regularizers follows this list.
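The sketch below illustrates the dictionary retrieval \(\mathbf{VE}^u = \text{CrossAttn}(\hat{\mathbf{Z}}^u / \tau, \mathbf{D})\) together with plausible forms of the two regularizers; the single-head attention, dictionary size, residual injection, and exact loss forms are assumptions (only the first of the two injection stages is shown).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneViewEmbedding(nn.Module):
    """Minimal sketch of the scene-aware view-dictionary embedding (SVE)."""
    def __init__(self, d=512, num_entries=16, tau=0.5):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(num_entries, d) * 0.02)  # D in R^{M x d}
        self.tau = tau

    def forward(self, z):
        # z: (B, T, d) per-frame features of one view stream (ego or exo).
        # Temperature-scaled cross-attention: frames query the shared dictionary.
        attn = torch.softmax((z / self.tau) @ self.dictionary.t() / z.size(-1) ** 0.5, dim=-1)
        view_emb = attn @ self.dictionary          # (B, T, d) adaptive view embedding
        return z + view_emb, attn                  # residual injection into the stream

    def regularizers(self, attn):
        # Attention-entropy term: minimizing -entropy discourages overly sharp
        # attention and encourages broader dictionary coverage.
        ent = -(attn.clamp_min(1e-9).log() * attn).sum(-1).mean()
        l_view_ent = -ent
        # Dictionary-diversity term: push normalized rows toward orthogonality.
        d_norm = F.normalize(self.dictionary, dim=-1)
        gram = d_norm @ d_norm.t()
        off_diag = gram - torch.eye(gram.size(0), device=gram.device)
        l_dict_div = (off_diag ** 2).mean()
        return l_view_ent, l_dict_div
```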
Key Design 3: Bidirectional Cross-Attention Fusion (BiX)
- Symmetric bidirectional cross-attention: Ego→Exo and Exo→Ego computed in parallel.
- Learnable gated residual mixing: \(\mathbf{F}^{ego} = (1-\boldsymbol{\gamma}^e)\tilde{\mathbf{Z}}^{ego} + \boldsymbol{\gamma}^e \mathbf{E}^\star\), with gate values generated by sigmoid over concatenated features.
- Final fusion: \(\tilde{\mathbf{Z}}^{fused} = \frac{1}{2}(\mathbf{F}^{ego} + \mathbf{F}^{exo})\).
- The Exo→Ego direction provides boundary and step-ordering cues, while the Ego→Exo direction contributes hand-object details and local causal information; a minimal fusion sketch follows this list.
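A sketch consistent with the gated-residual equations above; the head count, single-layer structure, and gate parameterization are assumptions, and both streams are assumed to share the same length after adaptive sampling.

```python
import torch
import torch.nn as nn

class BiXFusion(nn.Module):
    """Minimal sketch of bidirectional cross-attention fusion (BiX)."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.ego_from_exo = nn.MultiheadAttention(d, heads, batch_first=True)  # Ego queries Exo
        self.exo_from_ego = nn.MultiheadAttention(d, heads, batch_first=True)  # Exo queries Ego
        self.gate_ego = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())
        self.gate_exo = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

    def forward(self, z_ego, z_exo):
        # z_ego, z_exo: (B, T, d), assumed aligned to a common length T by AS.
        e_star, _ = self.ego_from_exo(z_ego, z_exo, z_exo)  # Exo->Ego evidence (boundaries, step order)
        x_star, _ = self.exo_from_ego(z_exo, z_ego, z_ego)  # Ego->Exo evidence (hand-object detail)

        # Learnable gated residual mixing; gates come from concatenated features.
        g_e = self.gate_ego(torch.cat([z_ego, e_star], dim=-1))
        g_x = self.gate_exo(torch.cat([z_exo, x_star], dim=-1))
        f_ego = (1 - g_e) * z_ego + g_e * e_star
        f_exo = (1 - g_x) * z_exo + g_x * x_star

        # Final fusion: average of the two gated streams.
        return 0.5 * (f_ego + f_exo)
```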
Loss & Training
- DVC loss \(\mathcal{L}_{\text{DVC}}\) (Hungarian set prediction, following PDVC configuration)
- Imitation discrimination loss \(\mathcal{L}_{\text{Imit}}\) (weight \(\lambda_{\text{Imit}}=0.5\))
- Regularization terms: \(\mathcal{L}_{\text{sel}}\), \(\mathcal{L}_{\text{vic}}\), \(\mathcal{L}_{\text{view-ent}}\), \(\mathcal{L}_{\text{dict-div}}\) (weights in the 0.01–0.05 range); a plausible combined objective is written out below.
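Assuming the standard weighted-sum composition (the paper's exact grouping is not reproduced above), the training objective would read:

\[
\mathcal{L} = \mathcal{L}_{\text{DVC}}
+ \lambda_{\text{Imit}}\,\mathcal{L}_{\text{Imit}}
+ \lambda_{\text{sel}}\,\mathcal{L}_{\text{sel}}
+ \lambda_{\text{vic}}\,\mathcal{L}_{\text{vic}}
+ \lambda_{\text{view-ent}}\,\mathcal{L}_{\text{view-ent}}
+ \lambda_{\text{dict-div}}\,\mathcal{L}_{\text{dict-div}},
\]

with \(\lambda_{\text{Imit}} = 0.5\) and the remaining weights in the 0.01–0.05 range.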
Key Experimental Results
Dataset & Setup
- EgoMe dataset: 7,902 asynchronous Exo-Ego video pairs (~82.8 hours); train/val/test = 4,777/997/2,128
- Feature extraction: TSP (frozen), feature dimension \(d=512\)
- Optimizer: AdamW, learning rate 1e-4, batch size 16
Main Results on the EgoMe Validation Set (Table 1)
| Method | AUPRC@0.3 | AUPRC@0.5 | AUPRC@0.7 | Mean AUPRC | tIoU |
|---|---|---|---|---|---|
| PDVC | 28.21 | 20.48 | 7.95 | 18.88 | 58.58 |
| Exo2EgoDVC | 31.33 | 20.27 | 7.49 | 19.69 | 59.06 |
| ActionFormer | 31.37 | 15.41 | 2.63 | 16.47 | 48.89 |
| TriDet | 30.04 | 14.61 | 2.44 | 15.70 | 49.05 |
| PDVC (Ego-only) | 19.35 | 13.91 | 5.11 | 12.79 | 57.63 |
| SAVA-X | 33.56 | 24.04 | 9.48 | 22.36 | 59.31 |
SAVA-X achieves a Mean AUPRC improvement of +2.67 (+13.56%) over the strongest baseline Exo2EgoDVC, attaining top performance at all thresholds.
Ablation Study (Table 2)
| AS | SVE | BiX | Mean AUPRC | tIoU |
|---|---|---|---|---|
|  |  |  | 18.88 | 58.58 |
| ✓ |  |  | 20.90 | 58.88 |
|  | ✓ |  | 21.29 | 59.27 |
|  |  | ✓ | 21.06 | 58.27 |
| ✓ | ✓ |  | 21.82 | 58.96 |
| ✓ |  | ✓ | 20.32 | 58.14 |
|  | ✓ | ✓ | 22.33 | 58.76 |
| ✓ | ✓ | ✓ | 22.36 | 59.31 |
Key Findings
- The three modules are complementary: relative to the 18.88 no-module baseline, AS, SVE, and BiX individually yield Mean AUPRC gains of +10.7%, +12.8%, and +11.6%, respectively, and their combination achieves the best performance.
- SVE+BiX is the strongest pairwise combination: Among two-module combinations, SVE+BiX performs best, indicating that domain gap reduction combined with bidirectional verification is most critical.
- AS+BiX is relatively weak: Direct fusion without view adaptation is susceptible to domain shift and noise.
- Single-view input degrades substantially: Ego-only PDVC drops to a Mean AUPRC of 12.79, confirming the necessity of exocentric demonstration information.
- Adaptive sampling is more effective at higher frame rates: Greater redundancy at high frame rates means retaining a small number of high-scoring frames suffices to improve performance.
- SVE outperforms fixed view embeddings: Fixed learnable tokens yield limited gains, whereas the adaptive dictionary covers cross-scene variation.
- Exo→Ego direction is more critical: Unidirectional ablations show that Exo→Ego alone approaches the performance of the full bidirectional design, as the task objective is error detection on the ego stream.
Highlights & Insights
- The paper is the first to formally define the ego-to-exo imitation error detection task and establish a unified evaluation protocol.
- Each of the three modules addresses one core challenge (redundancy / domain gap / fusion) in an orthogonal and complementary manner.
- The scene-aware dictionary view embedding is a creative design that achieves cross-scene adaptability via a learnable dictionary.
- Gated adaptive sampling balances the efficiency of hard selection with the gradient stability of soft gating.
- The ablation and component analyses are highly thorough, covering frame rate, Top-K ratio, dictionary size, fusion variants, and domain gap visualization.
Limitations & Future Work
- Validation is limited to the single EgoMe dataset; generalizability remains unknown.
- The frozen TSP feature extractor (pretrained on ActivityNet) may not be sufficiently adapted to egocentric video.
- Absolute performance remains low (Mean AUPRC of only 22.36), leaving a substantial gap to practical deployment.
- Large-scale video foundation models (e.g., InternVideo, VideoMAE v2) are not explored for feature extraction.
- Dictionary size and regularization weights require manual tuning, with no automatic selection mechanism proposed.
- Inference speed and computational cost are not discussed.
Related Work & Insights
- Temporal action localization: TAL methods such as ActionFormer and TriDet perform poorly in the cross-view setting (Mean AUPRC only 14–16).
- Dense video captioning: PDVC serves as a strong base architecture; Exo2EgoDVC introduces view-invariant adversarial learning.
- Ego-Exo transfer: Ego-Exo (Li et al. 2021) investigates representation transfer from third-person to first-person perspectives.
- Procedural error detection: PREGO (single-view online error detection); Lee et al. 2024 (error-free prototype-based detection).
- Adaptive frame selection: Buch et al. 2025 propose flexible frame selection for efficient video inference.
Rating
- Novelty: ⭐⭐⭐⭐ (novel task formulation, targeted three-module design)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (highly detailed ablations, comprehensive component analysis)
- Writing Quality: ⭐⭐⭐⭐ (clear structure, motivation and methodology well articulated)
- Value: ⭐⭐⭐⭐ (clear practical application scenarios, though absolute performance is limited)