🎵 Audio & Speech

🎞️ ECCV2024 · 8 paper notes

🔥 Top topics: Speech & Audio ×4

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos: This paper proposes AV-LDM, which implicitly disentangles foreground action sounds from background ambient sounds by conditioning training on audio taken from a different time segment of the same video. At inference, retrieval-augmented generation (RAG) selects a suitable ambient sound condition. AV-LDM significantly outperforms existing methods on Ego4D and EPIC-KITCHENS.
Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation: This paper proposes Beat-It, a framework for beat-synchronized, keyframe-controllable 3D dance generation. It decouples beat conditions from the music and combines the conditions through a hierarchical multi-condition fusion mechanism, significantly outperforming existing methods on AIST++.
CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing: This paper proposes CoLeaF, a dual-branch learning framework that explicitly optimizes the integration of cross-modal context through event-aware contrastive learning, achieving an average F-score improvement of 1.9% on the weakly supervised audio-visual video parsing task.
ControlLLM: Augment Language Models with Tools by Searching on Graphs: This paper proposes the ControlLLM framework, which plans multimodal tool execution by performing graph search (Thoughts-on-Graph) on a pre-built Tool Graph. This significantly improves the accuracy of tool selection and parameter assignment in complex tasks.
Label-Anticipated Event Disentanglement for Audio-Visual Video Parsing: This paper proposes the LEAP (Label semantic-based Projection) decoding paradigm, which uses the text embeddings of event categories as semantic anchors. A cross-modal attention mechanism disentangles the potentially overlapping event semantics in the audio and visual latent features into independent label embeddings. Combined with an EIoU-based audio-visual semantic similarity loss, LEAP achieves state-of-the-art (SOTA) performance on the audio-visual video parsing (AVVP) task.
Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics: This paper proposes the Latent-INR framework. By learning an implicit latent code for each video frame and combining it with a hypernetwork for low-rank weight modulation, the framework decouples the spatial and temporal modeling of video INR. While maintaining competitive compression performance, it equips video representations with semantic discriminative capabilities, supporting various downstream tasks such as retrieval, video frame interpolation, and arbitrary-resolution inference.
Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation: This paper proposes CSTS (Contrastive Spatial-Temporal Separable), an audio-visual fusion method that introduces audio signals to egocentric gaze anticipation for the first time. It models spatial co-occurrence and temporal correlation of audio-visual signals separately through spatial and temporal separable fusion modules, and enhances representations using post-fusion contrastive learning, surpassing SOTA on Ego4D and Aria datasets.
Siamese Vision Transformers are Scalable Audio-Visual Learners: This paper proposes AVSiam, which uses a single weight-shared ViT backbone to process both audio and visual inputs. Combined with a multi-ratio random masking strategy and a dual-objective pre-training scheme (contrastive plus reconstruction), AVSiam achieves state-of-the-art performance on audio-visual classification and retrieval at very low cost (28.9 times faster than MAViL).
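
As a rough illustration of the multi-ratio random masking mentioned in the AVSiam note above, the sketch below draws a mask ratio from a small candidate set and keeps a random subset of tokens. It is only one plausible reading of the idea, not the authors' implementation: the function name `multi_ratio_random_mask`, the candidate ratios, and the token counts are all assumptions made for this example.

```python
import random


def multi_ratio_random_mask(tokens, ratios=(0.25, 0.5, 0.75), rng=None):
    """Keep a random subset of tokens, with the mask ratio drawn from `ratios`.

    The candidate ratios are illustrative assumptions, not values from the paper.
    Returns (kept_tokens, kept_indices, ratio).
    """
    rng = rng or random.Random()
    ratio = rng.choice(ratios)                      # pick one mask ratio for this sample
    num_keep = max(1, round(len(tokens) * (1.0 - ratio)))
    keep_idx = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in keep_idx], keep_idx, ratio


# Hypothetical token sequences: 64 audio patches and 196 image patches.
rng = random.Random(0)
audio_tokens = [[rng.random() for _ in range(8)] for _ in range(64)]
visual_tokens = [[rng.random() for _ in range(8)] for _ in range(196)]

# Both modalities go through the same masking routine, mirroring the idea of
# a single shared backbone that sees variable-length visible token sets.
kept_audio, audio_idx, audio_ratio = multi_ratio_random_mask(audio_tokens, rng=rng)
kept_visual, visual_idx, visual_ratio = multi_ratio_random_mask(visual_tokens, rng=rng)
```

Varying the ratio per sample means the encoder sees different numbers of visible tokens during pre-training. Higher ratios also leave fewer tokens to process, which is one common way masking lowers training cost.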