# Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras
- Conference: NeurIPS 2025
- arXiv: 2507.17664
- Code: None
- Area: Robotics
- Keywords: event camera, visual grounding, multimodal, mixture of experts, autonomous driving, benchmark
## TL;DR
Talk2Event introduces the first large-scale visual grounding benchmark for event cameras (30,690 referring expressions, each annotated with four grounding attributes), and proposes the EventRefer framework, which employs a Mixture of Event-Attribute Experts (MoEE) to dynamically fuse appearance, status, viewer-relation, and inter-object-relation features. EventRefer surpasses existing methods across all three evaluation settings: event-only, frame-only, and fusion.
## Background & Motivation
- Event cameras lack language grounding: Event cameras offer microsecond-level latency and robustness to motion blur, making them well-suited for dynamic scene perception; however, connecting asynchronous event streams with natural language remains an unexplored problem.
- No event-domain datasets for visual grounding: Existing visual grounding benchmarks (RefCOCO, ScanRefer, etc.) are built on RGB frames or point clouds, covering only static scenes and lacking temporally dynamic information under high-speed or low-light conditions.
- Absence of attribute-level annotations: Most existing benchmarks focus solely on appearance attributes and lack structured descriptions of object motion states, viewer-relative spatial relations, and inter-object relations, limiting interpretability.
- Difficulty of grounding in dynamic scenes: Fast-moving objects (pedestrians, cyclists) are prone to blur in conventional frames; while event cameras capture motion effectively, they lack texture information, necessitating cross-modal fusion solutions.
- Event-based detection methods lack language capability: Existing object detection methods for event cameras (RVT, LEOD, etc.) produce only category predictions and do not support free-text referring expression comprehension.
- Rigid fusion strategies: Prior event-frame fusion methods lack adaptive mechanisms for exploiting multi-attribute information and cannot dynamically reallocate attention based on scene dynamics.
## Method
### Overall Architecture
- Function: Construct the Talk2Event benchmark and propose the EventRefer model to localize target objects from event streams (with optional RGB frame fusion) based on natural language descriptions.
- Design Motivation: Event cameras hold unique advantages in dynamic scenes, yet both a dataset and a method for language-guided grounding are absent; this work addresses both gaps simultaneously.
- Mechanism: A structured annotation pipeline is built upon the DSEC driving dataset, leveraging temporal context frames to generate rich referring expressions. An attribute-aware grounding framework is designed by introducing mixture-of-experts fusion on top of a DETR-style architecture.
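As a rough structural sketch of this pipeline (not the authors' implementation), the PyTorch code below outlines a DETR-style forward pass: an event backbone, a text encoder, a joint cross-modal encoder, an attribute-expert fusion stage, and a query-based decoder that regresses boxes. All module choices here (the convolutional stand-in backbone, the embedding-table text encoder, the 10-channel voxelized event input, the query count) are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of an EventRefer-style forward pass, assuming a DETR-style layout.
# All module names and shapes below are illustrative placeholders, not the paper's code.
import torch
import torch.nn as nn

class EventReferSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        self.event_backbone = nn.Conv2d(10, d_model, 3, padding=1)   # stand-in for an event/voxel backbone
        self.text_encoder = nn.Embedding(30522, d_model)             # stand-in for a pretrained text encoder
        self.cross_encoder = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
        self.moee = nn.Identity()                                     # placeholder for the attribute-expert fusion (see Key Designs)
        self.decoder = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.box_head = nn.Linear(d_model, 4)                         # (cx, cy, w, h) per query

    def forward(self, event_voxels, token_ids):
        # event_voxels: (B, 10, H, W) voxelized event stream; token_ids: (B, L) expression tokens
        vis = self.event_backbone(event_voxels).flatten(2).transpose(1, 2)   # (B, HW, D)
        txt = self.text_encoder(token_ids)                                    # (B, L, D)
        hidden = self.cross_encoder(torch.cat([vis, txt], dim=1))             # joint visual-text tokens
        fused = self.moee(hidden)                                             # attribute-aware fusion
        q = self.queries.unsqueeze(0).expand(event_voxels.size(0), -1, -1)
        dec = self.decoder(q, fused)                                          # (B, num_queries, D)
        return self.box_head(dec).sigmoid()                                   # normalized boxes
```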
### Key Designs
#### Four-Attribute Annotation Scheme
- Function: Each referring expression is annotated with four grounding attributes—Appearance, Status, Relation-to-Viewer, and Relation-to-Others.
- Design Motivation: Appearance descriptions alone are insufficient for precise grounding in dynamic scenes; motion state ("turning left"), egocentric spatial relations ("front-left"), and inter-object relations ("next to the bus") provide critical disambiguation cues.
- Mechanism: Two context frames within \(t_0 \pm 200\text{ms}\) are used; Qwen2-VL generates three distinct referring expressions per instance (averaging 34.1 words each). Attribute labels are assigned via fuzzy matching combined with a language model and validated by human annotators.
#### Positive Word Matching (PWM)
- Function: Maps attribute-relevant text spans within referring expressions to token-level binary positive maps.
- Design Motivation: Manual annotation of token ranges is infeasible across four attribute types; PWM enables attribute-aware supervision at the token level during training.
- Mechanism: For each attribute's cue phrases (e.g., "moving left" for Status), a fuzzy matcher locates all matching positions within the expression, projects character-level spans to token indices, and constructs a softmax-normalized positive map \(m_i\).
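As a rough illustration of this step, the sketch below builds a token-level positive map from attribute cue phrases, using whitespace tokenization and exact substring matching in place of the paper's subword tokenizer and fuzzy matcher; the example cue phrase and the uniform softmax normalization over matched tokens are simplifying assumptions.

```python
# Minimal sketch of Positive Word Matching (PWM), assuming whitespace tokenization
# and exact substring matching instead of the paper's fuzzy matcher.
import torch

def positive_map(expression: str, cue_phrases: list[str]) -> torch.Tensor:
    # Tokenize by whitespace, recording each token's character span.
    spans, pos = [], 0
    for tok in expression.split():
        start = expression.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)

    # Mark tokens whose character span overlaps any matched cue phrase.
    mask = torch.zeros(len(spans))
    lowered = expression.lower()
    for phrase in cue_phrases:
        start = lowered.find(phrase.lower())
        while start != -1:
            end = start + len(phrase)
            for i, (s, e) in enumerate(spans):
                if s < end and e > start:          # character-span overlap
                    mask[i] = 1.0
            start = lowered.find(phrase.lower(), end)

    # Softmax-normalize over matched tokens to obtain the positive map m_i
    # (uniform weight over matched positions, zero elsewhere).
    if mask.sum() == 0:
        return mask                                # no match: all-zero map
    logits = torch.where(mask.bool(), torch.zeros_like(mask), torch.full_like(mask, float("-inf")))
    return torch.softmax(logits, dim=0)

# Example: a "Status" attribute map for a driving expression.
m_status = positive_map("the cyclist moving left ahead of the bus", ["moving left"])
```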
#### MoEE: Mixture of Event-Attribute Experts
- Function: Four attribute-specific experts independently extract attribute-aware features, which are then adaptively weighted and fused into a final representation.
- Design Motivation: The informativeness of each attribute varies across scenes (motion cues dominate at night; appearance is more useful in daylight), and static fusion cannot adapt to such variation.
- Mechanism: (1) Attribute-aware masking: binary masks \(m_i^{\text{att}}\) are applied to encoder hidden states, retaining attribute-relevant tokens alongside shared context; (2) each attribute's features are projected via an FFN to produce expert features \(H_i^{\text{exp}}\); (3) mean-pooled descriptors from all four experts are concatenated and passed through a learnable projection \(W\) to generate gating weights \(\lambda\); (4) Gaussian noise is injected to prevent expert collapse, and the final fused representation is \(H^{\text{fuse}} = \sum \lambda_i \cdot H_i^{\text{exp}}\).
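A compact PyTorch sketch of this fusion step follows, assuming pre-computed binary attribute masks over the encoder tokens; the expert width, noise scale, and the simplified handling of shared-context tokens (plain masking here) are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the Mixture of Event-Attribute Experts (MoEE) fusion,
# assuming 4 attributes and pre-computed attribute token masks.
import torch
import torch.nn as nn

class MoEE(nn.Module):
    def __init__(self, d_model: int = 256, num_attrs: int = 4, noise_std: float = 0.1):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
            for _ in range(num_attrs)
        ])
        self.gate = nn.Linear(num_attrs * d_model, num_attrs)   # learnable projection W
        self.noise_std = noise_std

    def forward(self, hidden, attr_masks):
        # hidden: (B, T, D) encoder states; attr_masks: (B, A, T) binary masks, one per attribute
        expert_feats, descriptors = [], []
        for i, expert in enumerate(self.experts):
            mask = attr_masks[:, i].unsqueeze(-1)                # (B, T, 1)
            h_i = expert(hidden * mask)                          # attribute-aware expert features H_i^exp
            expert_feats.append(h_i)
            # Mean-pool over the attribute-relevant tokens to get a per-attribute descriptor.
            denom = mask.sum(dim=1).clamp(min=1.0)
            descriptors.append((h_i * mask).sum(dim=1) / denom)  # (B, D)

        gate_logits = self.gate(torch.cat(descriptors, dim=-1))  # (B, A)
        if self.training:                                        # Gaussian noise discourages expert collapse
            gate_logits = gate_logits + torch.randn_like(gate_logits) * self.noise_std
        lam = torch.softmax(gate_logits, dim=-1)                 # gating weights lambda

        # H_fuse = sum_i lambda_i * H_i^exp
        fused = sum(lam[:, i, None, None] * f for i, f in enumerate(expert_feats))
        return fused, lam
```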
#### Multi-Attribute Fusion Training and Inference
- Function: Treats the four attributes as co-grounding pseudo-targets for multi-target matching during training; at inference, scores from all four attributes are aggregated to select the final predicted box.
- Design Motivation: This exploits supervision signals from all attributes without increasing decoder complexity, ensuring dense supervision at every attribute dimension.
- Mechanism: During training, the GT box is replicated four times (one per attribute) and assigned to queries via Hungarian matching. At inference, each query computes a softmax dot-product score against all four attribute maps, and the box with the highest score is selected as the prediction.
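The inference-time aggregation can be sketched roughly as below, assuming the decoder exposes per-query embeddings, token-level text embeddings, and the four PWM positive maps; the dot-product alignment head and the plain summation across attributes are assumptions for illustration, not the paper's exact scoring function.

```python
# Minimal sketch of multi-attribute inference scoring.
import torch

def select_box(query_feats, text_feats, boxes, positive_maps):
    """Pick the final box by aggregating each query's scores over all four attribute maps.

    query_feats:   (Q, D) decoder query embeddings
    text_feats:    (L, D) text token embeddings
    boxes:         (Q, 4) one predicted box per query
    positive_maps: (A, L) softmax-normalized positive maps, one row per attribute
    """
    # Query-to-token alignment, softmaxed over the text tokens.
    align = torch.softmax(query_feats @ text_feats.T, dim=-1)    # (Q, L)
    # Score each query against each attribute's positive map, then aggregate.
    per_attr = align @ positive_maps.T                           # (Q, A)
    scores = per_attr.sum(dim=-1)                                # combine the four attributes
    best = scores.argmax()
    return boxes[best], scores[best]
```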
## Key Experimental Results
### Main Results (Val Set, mAcc%)
| Method | Modality | mAcc | Ped | Rider | Car | Bus | Truck | mIoU |
|---|---|---|---|---|---|---|---|---|
| BUTD-DETR | Frame | 48.91 | 22.66 | 20.44 | 61.94 | 33.93 | 35.93 | 84.30 |
| GroundingDINO | Frame | 44.50 | 15.62 | 8.62 | 57.70 | 32.52 | 41.20 | 68.67 |
| EventRefer | Frame | 55.47 | 27.64 | 51.10 | 65.76 | 47.02 | 32.22 | 85.76 |
| EvRT-DETR | Event | 29.34 | 15.45 | 5.50 | 39.24 | 7.74 | 9.26 | 75.66 |
| EventRefer | Event | 31.96 | 12.09 | 25.00 | 40.83 | 15.48 | 16.30 | 76.46 |
| FlexEvent | Fusion | 59.40 | 30.39 | 33.50 | 71.34 | 47.85 | 38.58 | 86.83 |
| EventRefer | Fusion | 61.82 | 31.15 | 44.23 | 73.85 | 41.07 | 41.70 | 87.32 |
Key Findings: EventRefer achieves the best mAcc and mIoU across all three modality settings. In the frame-only setting, the Rider category shows the most significant improvement (+24.4%), indicating that attribute-level reasoning is particularly effective for small, dynamic objects. Event-only performance remains lower than frame-based methods (due to limited texture), but fusion achieves the highest overall mAcc of 61.82%.
### Ablation Study (Event-only, Val Set, mAcc%)
| Configuration | mAcc |
|---|---|
| Baseline (w/o PWM / MAF / MoEE) | 22.07 |
| + PWM | 26.38 (+4.31) |
| + MAF | 27.01 (+4.94) |
| + PWM + MAF | 29.66 (+7.59) |
| + PWM + MAF + MoEE | 31.96 (+9.89) |
Key Findings: All three components contribute independently and complementarily—PWM provides token-level attribute supervision, MAF enables independent attribute reasoning, and MoEE contributes an additional +2.3% through adaptive fusion. Among fusion strategies, MoEE (31.96%) substantially outperforms attention-based fusion (29.66%), additive fusion (28.39%), and concatenation fusion (27.50%).
### Attribute Contribution Analysis
Single-attribute experiments show Status (28.90%) > Appearance (27.98%) > Viewer (27.03%) > Others (26.97%), while combining all four attributes yields the optimal 31.96%, confirming their complementarity. Visualization of MoEE gating weights reveals that Rider/Bike categories rely more heavily on Status cues, while Bus/Truck categories favor Appearance and Viewer Relation, demonstrating the adaptive nature of the mechanism.
## Highlights & Insights
- First visual grounding benchmark for event cameras: 5,567 scenes and 30,690 referring expressions averaging 34.1 words each, making it one of the most linguistically rich grounding datasets.
- The four-attribute annotation scheme provides structured and interpretable grounding dimensions covering spatiotemporal and relational reasoning.
- The adaptive gating mechanism in MoEE dynamically adjusts attribute weights according to scene dynamics and lighting conditions, offering both performance gains and interpretability.
- Support for three modality settings (event-only, frame-only, fusion) affords broad research flexibility.
## Limitations & Future Work
- The dataset is built exclusively on DSEC (urban driving in Switzerland), limiting scene diversity and excluding extreme weather, nighttime, and crowded scenarios.
- Event-only performance remains substantially below frame-based methods (31.96% vs. 55.47%), with insufficient semantic information in event streams being the fundamental bottleneck.
- Referring expressions are generated by Qwen2-VL and subsequently validated by human annotators, which may introduce generation bias.
- No direct comparison is conducted against recent large-scale vision-language models (e.g., GPT-4V, Qwen2.5-VL) on this task.
## Related Work & Insights
| Dimension | Talk2Event | RefCOCO / RefCOCOg |
|---|---|---|
| Sensor | Event camera (optional RGB frame) | RGB frame |
| Scene Type | Dynamic driving scenes | Static images |
| Attribute Annotation | 4 types (Appearance + Status + Viewer + Others) | Appearance only |
| Expression Length | 34.1 words (rich) | 3.5–8.4 words (short) |
| Temporal Information | ✓ (motion, trajectory) | ✗ |

| Dimension | Talk2Event | ScanRefer |
|---|---|---|
| Sensor | Event camera + RGB | RGB-D point cloud |
| Scene Type | Dynamic outdoor driving | Static indoor |
| Attribute Annotation | 4 attribute types | Appearance + inter-object relation only |
| Dataset Scale | 30,690 expressions | 51,583 expressions |
| Dynamic Support | ✓ (motion state, temporal reasoning) | ✗ |
## Rating
- ⭐⭐⭐⭐⭐ Novelty: First introduction of language grounding to the event camera domain; the four-attribute design is highly original.
- ⭐⭐⭐⭐ Technical Quality: MoEE is well-motivated; ablation studies are thorough; evaluation spans three modality settings comprehensively.
- ⭐⭐⭐⭐ Value: Directly advances multimodal perception research in autonomous driving and robotics.
- ⭐⭐⭐⭐ Writing Quality: The paper is clearly structured with informative figures and complete mathematical derivations.