Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency¶
Conference: CVPR2026
arXiv: 2603.09798
Code: ZhaofengSHI/DCPGN
Area: Robotics / Ego-Exo View Adaptation
Keywords: test-time adaptation, Ego-Exo, Action Anticipation, Multi-Label, Prototype Learning, CLIP, dual-clue consistency
TL;DR¶
This paper introduces the Test-time Ego-Exo Adaptation for Action Anticipation (TE2A3) task and proposes DCPGN, a network that combines multi-label prototype growing with dual-clue (visual + textual) consistency. At test time, it adapts a source-view-trained model online to the target view for action anticipation, substantially outperforming existing TTA methods.
Background & Motivation¶
- Human cross-view capability: Humans can seamlessly switch between egocentric (Ego) and exocentric (Exo) perspectives via mirror neurons to anticipate future actions—a capability critical for human-robot collaboration and embodied AI.
- Existing methods require target-view data: Most Ego-Exo adaptation approaches (pretrain-finetune / UDA) access target-view data during training, incurring additional computational and data collection costs.
- Single-view model degradation across views: Models trained on one viewpoint suffer significant performance drops when directly applied to the other, due to large differences in camera angle and visual style.
- Opportunity for test-time adaptation (TTA): TTA methods can online-adjust models without labeled target-view data, yet existing TTA methods are designed for single-label tasks and struggle with multi-action candidate scenarios.
- Multi-action candidate challenge: Real-world events often involve multiple concurrent atomic actions, whereas entropy-based TTA methods tend to focus on the single highest-confidence class, leading to suboptimal performance.
- Ego-Exo spatiotemporal gap: The two viewpoints differ substantially in spatial layout (inconsistent scenes and distractors) and temporal dynamics (asynchronous action progression), making naive domain adaptation insufficient.
Method¶
Overall Architecture¶
DCPGN (Dual-Clue enhanced Prototype Growing Network) comprises two core modules:
- ML-PGM (Multi-Label Prototype Growing Module): Progressively accumulates multi-label knowledge to learn unbiased prototypes.
- DCCM (Dual-Clue Consistency Module): Integrates visual and textual clues to enforce dual-clue consistency.
During training, a BCE loss is used on labeled source-view data. At test time, the visual encoder (CLIP ViT-L/14) is frozen, while learnable prompts and prototypes are updated online.
ML-PGM¶
- Multi-label pseudo-label assignment: Top-K predictions from the logits of each test sample are selected as pseudo-labels (K=3 on EgoMe-anti; K=5 on EgoExoLearn), avoiding overconfidence induced by single-label strategies.
- Entropy-priority queue strategy: A memory bank of capacity N=500 is maintained per class. When the bank is full, only the N samples with the lowest entropy (least uncertainty) are retained, ensuring reliability increases over time.
- Confidence-weighted prototype computation: Class prototypes are computed as confidence-weighted sums over stored representations: \(p_i^T = \sum_{k=1}^{N'} \eta(l_{i,k}^T) \cdot \bar{f}_{v,k}^T\), suppressing noise from negative-class samples.
- Prototype-based classification: Similarities between test sample representations and all class prototypes yield prototype logits \(L_p\).
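The three ML-PGM steps above (top-K pseudo-labels, entropy-priority queue, confidence-weighted prototypes) can be sketched in plain Python. This is a minimal stdlib illustration, not the paper's implementation: the names `topk_pseudo_labels` and `ClassBank` are this sketch's own, the heap is one possible realization of the entropy-priority queue, and the confidence weights are normalized here for illustration (the paper writes the prototype as a weighted sum with weights \(\eta(l_{i,k}^T)\)).

```python
import heapq
import itertools
import math

def topk_pseudo_labels(logits, k=3):
    """Select the top-K classes of a test sample as multi-label pseudo-labels."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def entropy(probs):
    """Shannon entropy of a probability vector (uncertainty measure)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class ClassBank:
    """Per-class memory bank with an entropy-priority queue: once full,
    only the `capacity` lowest-entropy (most certain) samples are kept."""

    def __init__(self, capacity=500):
        self.capacity = capacity
        self._tie = itertools.count()  # tie-breaker so the heap never compares features
        self.heap = []                 # max-heap on entropy via sign flip

    def add(self, ent, conf, feat):
        item = (-ent, next(self._tie), conf, feat)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif -self.heap[0][0] > ent:   # new sample is more certain than the worst stored one
            heapq.heapreplace(self.heap, item)

    def prototype(self):
        """Confidence-weighted average of stored features (suppresses noisy samples)."""
        if not self.heap:
            return None
        total = sum(c for _, _, c, _ in self.heap)
        proto = [0.0] * len(self.heap[0][3])
        for _, _, c, f in self.heap:
            for d in range(len(proto)):
                proto[d] += (c / total) * f[d]
        return proto
```

Prototype logits \(L_p\) would then come from similarities between a test representation and each class's `prototype()`.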
DCCM¶
- Visual clue: The last frame of the observed video is extracted as a visual clue, carrying object-level spatial information about the scene.
- Textual clue: A lightweight narrator (GRU + attention) generates descriptive text from frame features, serving as a temporal indicator of action progression. The narrator is trained on open-source video-text pairs and frozen at test time.
- CLIP inference: Frozen CLIP visual and text encoders extract features from visual and textual clues respectively; similarities with learnable-prompt action class features yield visual logits \(L_v\) and textual logits \(L_t\).
- Dual-clue consistency loss: A symmetric KL divergence is applied between the softmax distributions of \(L_v\) and \(L_t\): \(L_C = KL(P_v \| P_t) + KL(P_t \| P_v)\), constraining cross-modal clue consistency and explicitly bridging the Ego-Exo spatiotemporal gap.
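The dual-clue consistency loss is simply a symmetric KL divergence between the two softmax distributions. A minimal stdlib Python sketch (function names are this sketch's own, not the paper's):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dual_clue_consistency(l_v, l_t):
    """L_C = KL(P_v || P_t) + KL(P_t || P_v), where P_v and P_t are the
    softmax distributions of the visual and textual logits."""
    p_v, p_t = softmax(l_v), softmax(l_t)
    return kl_div(p_v, p_t) + kl_div(p_t, p_v)
```

By construction the loss is symmetric in its two arguments, zero when the visual and textual distributions agree, and positive otherwise, which is what lets it pull the two clues toward each other regardless of which one is more reliable for a given sample.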
Final Prediction and Loss¶
At test time, the learnable prompts and prototypes are optimized online via SGD, without any data augmentation.
Key Experimental Results¶
Main Results (class-mean Top-5 recall)¶
| Method | EgoMe-anti Exo2Ego Noun/Verb | EgoMe-anti Ego2Exo Noun/Verb | EgoExoLearn Exo2Ego Noun/Verb | EgoExoLearn Ego2Exo Noun/Verb |
|---|---|---|---|---|
| No Adaptation | 71.94 / 32.46 | 64.24 / 30.07 | 31.91 / 34.36 | 35.28 / 33.03 |
| ML-TTA | 77.11 / 36.92 | 69.46 / 34.39 | 36.35 / 37.67 | 42.96 / 40.43 |
| DCPGN (Ours) | 79.03 / 43.84 | 72.01 / 40.10 | 46.26 / 42.98 | 48.48 / 46.51 |
On EgoExoLearn Exo2Ego, DCPGN outperforms ML-TTA by 9.91 percentage points on Noun and 5.31 on Verb.
Ablation Study¶
| Configuration | EgoMe-anti Exo2Ego Noun | EgoExoLearn Exo2Ego Noun |
|---|---|---|
| Full DCPGN | 79.03 | 46.26 |
| w/o consistency loss \(L_C\) | 78.67 (−0.36) | 44.80 (−1.46) |
| w/o visual clue | 76.92 (−2.11) | 41.32 (−4.94) |
| w/o textual clue | 77.56 (−1.47) | 41.94 (−4.32) |
| w/o entire DCCM | 76.11 (−2.92) | 38.43 (−7.83) |
| w/o confidence weighting | 74.63 (−4.40) | 37.76 (−8.50) |
| single-label assignment only | 72.74 (−6.29) | 34.70 (−11.56) |
Key findings: visual clues contribute more to Noun prediction, while textual clues are more critical for Verb prediction; multi-label assignment yields substantial gains over single-label assignment.
Model Complexity¶
| Component | FLOPs (G) | Params (MB) |
|---|---|---|
| Baseline | 367.55 | 251.18 |
| ML-PGM | +0.00 | +8.54 |
| Narrator | +0.03 | +2.38 |
| Textual clue encoding | +4.06 | +54.04 |
ML-PGM introduces virtually zero additional computation, and the narrator is extremely lightweight.
Highlights & Insights¶
- First TE2A3 task: This work is the first to introduce TTA into cross-view Ego-Exo action anticipation, requiring no target-view training data.
- Multi-label prototype growing mechanism: Top-K pseudo-label assignment combined with entropy-priority queues and confidence weighting effectively addresses the class-balance problem in multi-action candidate scenarios.
- Complementary dual-clue design: Visual clues capture spatial object information while textual clues capture temporal action progression; KL divergence consistency explicitly bridges the Ego-Exo spatiotemporal gap.
- New benchmark EgoMe-anti: A new benchmark suitable for this task is constructed based on the EgoMe dataset.
- Significant performance gains: DCPGN surpasses the second-best method (ML-TTA) by 9.91 percentage points on Noun in the EgoExoLearn Exo2Ego setting, with thorough experiments and in-depth analysis.
Limitations & Future Work¶
- Narrator requires additional training data: The narrator must be pretrained on open-source video-text pairs, introducing a prerequisite dependency.
- Manual tuning of K: The optimal K differs across datasets (3 vs. 5), lacking an adaptive selection mechanism.
- Fixed memory bank capacity: N=500 is manually set and may be suboptimal when class-wise data distributions vary widely.
- Only Noun/Verb classification evaluated: Finer-grained temporal localization or full event prediction is not addressed.
- Lack of real-time analysis: Despite claiming online adaptation, inference latency and practical deployment feasibility are not reported.
Related Work & Insights¶
- vs. Tent/TPT/TDA and other traditional TTA: These methods are designed for single-label tasks and over-focus on high-confidence classes in multi-action candidate scenarios; DCPGN's multi-label mechanism addresses this fundamental limitation.
- vs. ML-TTA: Although ML-TTA targets multi-label settings, it is designed for image-level classification and lacks video-level spatiotemporal modeling and Ego-Exo viewpoint gap handling.
- vs. UDA methods (Sync, GCEAN): UDA methods require access to unlabeled target-view data during training, whereas DCPGN adapts entirely online at test time.
- vs. pretrain-finetune methods (AE2, Exo2EgoDVC): These methods require labeled target-view data for finetuning, which DCPGN does not.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First to introduce the TE2A3 task; the combination of multi-label prototype growing and dual-clue consistency is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Two benchmarks, four settings, comprehensive ablations, and visualization analyses are all well covered.
- Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear, method description is rigorous, and figures and tables are well presented.
- Value: ⭐⭐⭐⭐ — Provides a practical paradigm for cross-view online adaptation in human-robot collaboration and embodied AI.