Spherical World-Locking for Audio-Visual Localization in Egocentric Videos¶
Conference: ECCV 2024
arXiv: 2408.05364
Code: Available
Area: Audio & Speech
Keywords: Egocentric Video, Audio-Visual Localization, Spherical World-Locking, Multisensory Fusion, Transformer
TL;DR¶
This paper proposes the Spherical World-Locking (SWL) framework, which implicitly transforms multimodal perception streams into a world-locked spherical coordinate system to eliminate the challenges posed by self-motion, thereby achieving more precise audio-visual localization in egocentric videos.
Background & Motivation¶
Egocentric videos provide comprehensive contextual information from a personal perspective, but their most prominent characteristic—self-motion—poses significant challenges:
- Coordinate Frame Shifts: Head motion causes continuous changes in the target's relative position within the head-locked coordinate system.
- Field-of-View (FoV) Limitations: A limited field-of-view combined with frequent motion leads to targets of interest frequently entering and exiting the frame.
- Motion Drift: Accumulated motion makes temporal alignment challenging.
However, self-motion is also a crucial cue for scene understanding; humans can effectively stabilize perception and use head motion to direct attention. Most existing works treat self-motion purely as a challenge, overlooking its potential as a behavioral proxy. This paper proposes relocating audio-visual streams to a global reference frame via Spherical World-Locking (SWL), which not only removes the interference of self-motion but also leverages motion information to enhance scene understanding.
Method¶
Overall Architecture¶
The framework is built around two core concepts:
| Concept | Head-Locked | Spherical World-Locking (SWL) |
|---|---|---|
| Coordinate Frame | Head-centric 2D plane | World-centric 3D sphere |
| Self-motion Processing | Model must learn to compensate from raw data | Implicitly compensated via IMU measurements |
| Spatial Synchronization | Difficult spatial alignment across different modalities | Natural multimodal spatial alignment |
| Computational Overhead | No extra overhead | Negligible extra overhead |
Two SWL implementations are proposed:
- Explicit SWL: Maps videos to a 360° panorama.
- Implicit SWL (recommended): Retains original inputs and decouples semantic and positional information via spherical positional embeddings.
Key Designs¶
Implicit Spherical World-Locking: Instead of directly transforming the input data, spherical positional coordinates are assigned to each patch/token, maintaining the decoupling of semantic and positional embeddings.
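A minimal sketch of how world-locked positions could be assigned to image patches, under several assumptions that are not taken from the paper: a pinhole camera with a known horizontal field of view, square patches, a head frame with x right, y up, z forward, and an IMU orientation given as a unit quaternion `(w, x, y, z)` mapping head coordinates to world coordinates. The function names `quat_rotate` and `patch_directions` are hypothetical.

```python
import math

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * v[2] - z * v[1])
    ty = 2.0 * (z * v[0] - x * v[2])
    tz = 2.0 * (x * v[1] - y * v[0])
    # v' = v + w * t + q_vec x t
    return (
        v[0] + w * tx + (y * tz - z * ty),
        v[1] + w * ty + (z * tx - x * tz),
        v[2] + w * tz + (x * ty - y * tx),
    )

def patch_directions(n_rows, n_cols, hfov_deg, q_head_to_world):
    """Return one world-locked unit direction per patch centre (row-major)."""
    half_w = math.tan(math.radians(hfov_deg) / 2.0)
    half_h = half_w * n_rows / n_cols  # assumes square patches
    dirs = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Patch centre on the image plane z = 1 (head frame).
            u = ((c + 0.5) / n_cols * 2.0 - 1.0) * half_w
            v = (1.0 - (r + 0.5) / n_rows * 2.0) * half_h
            norm = math.sqrt(u * u + v * v + 1.0)
            head_dir = (u / norm, v / norm, 1.0 / norm)
            # The pixel content is untouched; only its position is re-expressed.
            dirs.append(quat_rotate(q_head_to_world, head_dir))
    return dirs

# Head turned 90 degrees about the vertical axis (hypothetical IMU reading).
s = math.sqrt(0.5)
dirs = patch_directions(3, 3, 90.0, (s, 0.0, s, 0.0))
```

In this sketch the semantic content of each patch stays in the head-locked image, while the returned directions would feed the positional embedding, which is the decoupling the implicit variant relies on.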
Multi-Classification Tokens (Multi-CLS): Multiple classification tokens are deployed, each parameterized as a point on the sphere \(c_i = \mathbf{W}_c p_i + \mathbf{b}_c\) to capture semantic information around that location. A sparse \(5 \times 10\) grid is used during training.
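The Multi-CLS parameterization \(c_i = \mathbf{W}_c p_i + \mathbf{b}_c\) can be illustrated with a small sketch. The source only states that the grid is \(5 \times 10\); the even spacing in elevation and azimuth, the exclusion of the poles, the embedding size of 8, and the random initialization below are all assumptions for illustration, and `sphere_grid` and `cls_embeddings` are hypothetical names.

```python
import math
import random

def sphere_grid(n_elev=5, n_azim=10):
    """Unit vectors on an elevation x azimuth grid (layout is an assumption)."""
    points = []
    for i in range(n_elev):
        # Elevations spread symmetrically about the equator, poles excluded.
        elev = math.pi * ((i + 1) / (n_elev + 1) - 0.5)
        for j in range(n_azim):
            azim = 2.0 * math.pi * j / n_azim
            points.append((
                math.cos(elev) * math.sin(azim),  # x: right
                math.sin(elev),                   # y: up
                math.cos(elev) * math.cos(azim),  # z: forward
            ))
    return points

def cls_embeddings(points, weight, bias):
    """c_i = W_c p_i + b_c for every grid point p_i."""
    return [
        [sum(w * x for w, x in zip(row, p)) + b for row, b in zip(weight, bias)]
        for p in points
    ]

random.seed(0)
dim = 8  # illustrative embedding size, not the paper's
W_c = [[random.gauss(0.0, 0.02) for _ in range(3)] for _ in range(dim)]
b_c = [0.0] * dim

grid = sphere_grid()
tokens = cls_embeddings(grid, W_c, b_c)
print(len(tokens), len(tokens[0]))  # 50 8
```

Because each token is a function of its position only, the same learned \(\mathbf{W}_c\) and \(\mathbf{b}_c\) could in principle be evaluated on a different grid at inference time.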
MuST Encoder (Multisensory Spherical World-Locked Transformer) incorporates two key innovations:
- Spherical Spatial Similarity: Computes pairwise spatial relations between multimodal embeddings based on rotation quaternions:

$$\mathbf{P}_{ij}^l = \text{Linear}(\text{GELU}(\text{Linear}([1 + p_i \cdot p_j,\; p_i \times p_j])))$$

  The rotation quaternions only need to be computed once and can be reused across all layers.
- Modality-Aware Operations (M-ops): Applies independent LayerNorm and q/k/v projections to each modality (M-LN + M-Attn) to facilitate cross-modal interaction while preserving modality-specific characteristics.
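The spherical spatial similarity term can be sketched in plain Python. For unit vectors, \([1 + p_i \cdot p_j,\; p_i \times p_j]\) is the unnormalized quaternion of the shortest rotation from \(p_i\) to \(p_j\), so it is computed once in `pairwise_quats` and then passed through a separate two-layer MLP per layer in `layer_bias`. The function names, the weight shapes, and the output size are assumptions for illustration, not the authors' implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def quat_feature(p_i, p_j):
    """[1 + p_i . p_j, p_i x p_j] for unit vectors p_i and p_j."""
    return [1.0 + dot(p_i, p_j), *cross(p_i, p_j)]

def pairwise_quats(points):
    """Computed once per input and reused by every layer."""
    return [[quat_feature(p_i, p_j) for p_j in points] for p_i in points]

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    return [dot(row, x) + b for row, b in zip(weight, bias)]

def layer_bias(quats, w1, b1, w2, b2):
    """P[i][j] = Linear(GELU(Linear(q_ij))) with this layer's own weights."""
    return [
        [linear([gelu(h) for h in linear(q, w1, b1)], w2, b2) for q in row]
        for row in quats
    ]

points = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
quats = pairwise_quats(points)
print(quats[0][1])  # [1.0, 0.0, 1.0, 0.0]
```

The printed feature is proportional to \((\cos 45°, 0, \sin 45°, 0)\), a 90° rotation about the y-axis, which is the rotation taking the first point to the second.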
MuST Decoder: Supports three flexible decoding strategies:
- Sparse Decoding: Uses an MLP to predict for each CLS Token.
- Dense Decoding: Generates a dense output map using a lightweight deconvolutional network.
- Horizon Decoding: Employs only the tokens near the equator, leveraging the gravity-alignment property of SWL.
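As a rough illustration of horizon decoding, the sketch below keeps only the CLS tokens whose grid position lies near the equator of a gravity-aligned sphere. The row elevations, the 10° threshold, and the name `horizon_indices` are hypothetical; the source specifies only that tokens near the equator are used.

```python
# Hypothetical 5 x 10 grid described by (elevation, azimuth) in degrees.
ELEVATIONS = [-60.0, -30.0, 0.0, 30.0, 60.0]
AZIMUTHS = [36.0 * j for j in range(10)]
grid = [(e, a) for e in ELEVATIONS for a in AZIMUTHS]

def horizon_indices(grid, max_abs_elev_deg=10.0):
    """Indices of grid points within max_abs_elev_deg of the equator."""
    return [i for i, (elev, _) in enumerate(grid) if abs(elev) <= max_abs_elev_deg]

keep = horizon_indices(grid)
print(len(grid), len(keep))  # 50 10
```

With this layout, keeping one row out of five reduces the number of decoded tokens by a factor of five.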
Loss & Training¶
- Trained using Binary Cross-Entropy (BCE) loss.
- Adam optimizer with a learning rate of 1e-4 and no learning rate scheduler.
- End-to-end training for 10 epochs.
- The model architecture is based on DeiT-S (a small ViT variant), with slightly fewer parameters than ResNet-50.
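For reference, a minimal stdlib sketch of the binary cross-entropy objective on per-token logits, together with the hyperparameters listed above collected in a dictionary. The function name `bce_with_logits`, the mean reduction, and the dictionary layout are assumptions; the source does not describe how the loss is reduced or how training is configured in code.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy; targets are 0.0 or 1.0."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s(x)) + (1-y)*log(1-s(x))],
        # where s is the sigmoid function.
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# Values taken from the list above; the key names are illustrative.
TRAIN_CONFIG = {"optimizer": "Adam", "lr": 1e-4, "scheduler": None, "epochs": 10}

loss = bce_with_logits([0.0, 4.0, -4.0], [1.0, 1.0, 0.0])
```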
Key Experimental Results¶
Main Results¶
EasyCom Dataset: Audio-Visual Active Speaker Localization:
| Method | mAP↑ |
|---|---|
| DOA (Signal Processing Method) | 52.62 |
| TalkNet | 69.13 |
| AVLN | 85.11 |
| MAVASL_C+E | 86.32 |
| MuST | 89.88 |
| Oracle (Upper Bound) | 91.03 |
MuST outperforms the previous state of the art (MAVASL_C+E, 86.32) by about 3.6 mAP points and comes within about 1.2 points of the Oracle upper bound (91.03), which uses near-field microphones.
Ablation Study¶
Ablation of Encoder Components (EasyCom):
| Variant | mAP↑ |
|---|---|
| MuST w/o pose (No pose information) | 87.76 |
| MuST w/o rotation (No rotation information) | 88.83 |
| MuST w/o M-ops (No modality-aware operations) | 88.53 |
| MuST M-LN only | 89.67 |
| MuST M-LN+M-Attn (Full version) | 89.88 |
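The cost of each ablation relative to the full model can be read off the table with a few subtractions. The short script below only restates the table values; the variable names are illustrative.

```python
FULL_MAP = 89.88  # MuST M-LN+M-Attn (full version)

ablations = {
    "w/o pose": 87.76,
    "w/o rotation": 88.83,
    "w/o M-ops": 88.53,
    "M-LN only": 89.67,
}

# mAP lost by each variant compared with the full model.
drops = {name: round(FULL_MAP - score, 2) for name, score in ablations.items()}
for name, drop in drops.items():
    print(f"{name}: -{drop:.2f} mAP")
```

Removing pose costs the most (2.12 points), while dropping only M-Attn and keeping M-LN costs the least (0.21 points).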
Analysis of Modality Contributions:
| Input Modality Combination | mAP↑ |
|---|---|
| Pose only | 47.95 |
| Pose + Monaural Audio | 68.57 |
| Pose + Visual | 68.78 |
| Pose + Multichannel Audio + Visual | 89.88 |
Key Findings¶
- Pose Information from SWL Contributes Significantly: Removing pose input lowers mAP from 89.88 to 87.76 (about 2.1 points), and removing only the rotation information lowers it to 88.83.
- Multichannel Microphone Array is the Largest Driver of Performance: From monaural to multichannel, the mAP leaps from 73.47 to 89.88.
- Modality-Aware Operations are Crucial: Performance drops below the visual-free counterpart when they are removed.
- Superior Performance in Spherical Audio Localization: On the RLR-CHAT dataset, the Mean Absolute Error (MAE) drops from 44.90° (second best) to 12.67°.
- Efficiency of Horizon Decoding: Using only the equatorial tokens (a 5x reduction in token count) degrades performance by only about 1°.
Highlights & Insights¶
- Paradigm Shift: Transforms self-motion from "noise requiring learned compensation" to a "directly exploitable valuable signal," achieved via IMU with near-zero overhead.
- Unified Multimodal Representation: Audio, visual, and behavioral information are naturally aligned on the world-locked sphere, eliminating the need for complex coordinate transformations.
- Clever Application of Rotation Quaternions: The spatial relation between two points on the sphere is encoded as the rotation quaternion between them, which is both mathematically consistent and computationally efficient.
- Flexible Decoder Design: Multi-CLS combined with various decoding strategies (sparse, dense, horizon) flexibly adapts to different task requirements.
Limitations & Future Work¶
- Only rotation is considered, ignoring translation, which limits applicability in translation-heavy mobile scenarios.
- Short temporal windows (<1 second) are utilized, leaving long-term dependencies unmodeled.
- Reliance on IMU data restricts deployment on devices without IMU sensors.
- Explicit SWL suffers from distortion issues during panoramic projection.
- Extending SWL to other egocentric video tasks (such as action recognition and object interaction) warrants exploration.
Related Work & Insights¶
- Unlike traditional 360° video spherical representations, SWL handles the implicit representation of planar videos on a sphere.
- The Multi-CLS Token design can inspire other vision tasks requiring spatialized predictions.
- Modality-aware operations (M-LN, M-Attn) offer effective engineering practices for multimodal Transformer configurations.
Rating¶
| Dimension | Rating (1-5) | Explanation |
|---|---|---|
| Novelty | 4.5 | The concept of spherical world-locking is novel, with an excellent idea of transforming self-motion into useful signals. |
| Technical Depth | 4.5 | The designs of quaternion spherical encoding and modality-aware operations are rigorous and elegant. |
| Experimental Thoroughness | 4.5 | Three different tasks/datasets along with rich ablations fully validate the effectiveness and generalizability of the proposed method. |
| Practicality | 4 | Relies on IMU data, though modern AR/VR devices are commonly equipped with them. |
| Overall | 4.5 | A paradigm-shifting work in the field of egocentric video understanding. |