Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

Conference: NeurIPS 2025 · arXiv: 2507.17664 · Code: None · Area: Robotics · Keywords: event camera, visual grounding, multimodal, mixture of experts, autonomous driving, benchmark

TL;DR

This work introduces Talk2Event, the first large-scale visual grounding benchmark for event cameras (30,690 annotated referring expressions spanning four grounding attributes), and proposes EventRefer, a framework that employs a Mixture of Event-Attribute Experts (MoEE) to adaptively fuse appearance, status, viewer-relation, and inter-object-relation features. EventRefer surpasses existing methods in all three evaluation settings: event-only, frame-only, and event-frame fusion.

Background & Motivation

  1. Event cameras lack language grounding: Event cameras offer microsecond-level latency and robustness to motion blur, making them well-suited for dynamic scene perception; however, connecting asynchronous event streams with natural language remains an unexplored problem.
  2. No event-domain datasets for visual grounding: Existing visual grounding benchmarks (RefCOCO, ScanRefer, etc.) are built on RGB frames or point clouds, covering only static scenes and lacking temporally dynamic information under high-speed or low-light conditions.
  3. Absence of attribute-level annotations: Most existing benchmarks focus solely on appearance attributes and lack structured descriptions of object motion states, viewer-relative spatial relations, and inter-object relations, limiting interpretability.
  4. Difficulty of grounding in dynamic scenes: Fast-moving objects (pedestrians, cyclists) are prone to blur in conventional frames; while event cameras capture motion effectively, they lack texture information, necessitating cross-modal fusion solutions.
  5. Event-based detection methods lack language capability: Existing object detection methods for event cameras (RVT, LEOD, etc.) produce only category predictions and do not support free-text referring expression comprehension.
  6. Rigid fusion strategies: Prior event-frame fusion methods lack adaptive mechanisms for exploiting multi-attribute information and cannot dynamically reallocate attention based on scene dynamics.

Method

Overall Architecture

  • Function: Construct the Talk2Event benchmark and propose the EventRefer model to localize target objects from event streams (with optional RGB frame fusion) based on natural language descriptions.
  • Design Motivation: Event cameras hold unique advantages in dynamic scenes, yet both a dataset and a method for language-guided grounding are absent; this work addresses both gaps simultaneously.
  • Mechanism: A structured annotation pipeline is built upon the DSEC driving dataset, leveraging temporal context frames to generate rich referring expressions. An attribute-aware grounding framework is designed by introducing mixture-of-experts fusion on top of a DETR-style architecture.

Key Designs

Four-Attribute Annotation Scheme

  • Function: Each referring expression is annotated with four grounding attributes—Appearance, Status, Relation-to-Viewer, and Relation-to-Others.
  • Design Motivation: Appearance descriptions alone are insufficient for precise grounding in dynamic scenes; motion state ("turning left"), egocentric spatial relations ("front-left"), and inter-object relations ("next to the bus") provide critical disambiguation cues.
  • Mechanism: Two context frames within \(t_0 \pm 200\text{ms}\) are used; Qwen2-VL generates three distinct referring expressions per instance (averaging 34.1 words each). Attribute labels are assigned via fuzzy matching combined with a language model and validated by human annotators.
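
A hypothetical sketch (in Python) of what a single annotated instance might look like under this four-attribute scheme; the field names and values are invented for illustration and do not reflect the released data format.

```python
# Illustrative annotation record; keys and values are assumptions, not the official schema.
annotation = {
    "scene_id": "dsec_zurich_city_000",   # hypothetical DSEC scene identifier
    "timestamp_us": 51_200_000,           # t_0; context frames drawn from t_0 +/- 200 ms
    "bbox": [412, 188, 468, 290],         # target box in the image plane
    "expression": "the cyclist in a dark jacket turning left at the front-left "
                  "of the viewer, next to the parked bus",
    "attributes": {
        "appearance":         "cyclist in a dark jacket",
        "status":             "turning left",
        "relation_to_viewer": "front-left of the viewer",
        "relation_to_others": "next to the parked bus",
    },
}
```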

Positive Word Matching (PWM)

  • Function: Maps attribute-relevant text spans within referring expressions to token-level binary positive maps.
  • Design Motivation: Manual annotation of token ranges is infeasible across four attribute types; PWM enables attribute-aware supervision at the token level during training.
  • Mechanism: For each attribute's cue phrases (e.g., "moving left" for Status), a fuzzy matcher locates all matching positions within the expression, projects character-level spans to token indices, and constructs a softmax-normalized positive map \(m_i\).
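
A minimal sketch of how such a positive map could be built, assuming a plain whitespace tokenizer and Python's difflib as stand-ins for the actual subword tokenizer and fuzzy matcher; the cue phrase is illustrative, not the authors' exact cue list.

```python
import difflib
import numpy as np

def positive_map(expression: str, cue_phrase: str) -> np.ndarray:
    """Return a normalized positive map over tokens overlapping the matched cue phrase."""
    # Fuzzy-locate the cue phrase inside the expression (character level).
    sm = difflib.SequenceMatcher(None, expression.lower(), cue_phrase.lower())
    m = sm.find_longest_match(0, len(expression), 0, len(cue_phrase))
    char_start, char_end = m.a, m.a + m.size

    # Whitespace tokenization with character offsets (stand-in for a subword tokenizer).
    offsets, pos = [], 0
    for tok in expression.split():
        start = expression.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)

    # Project the character-level span onto token indices -> binary positive map.
    mask = np.array([1.0 if s < char_end and e > char_start else 0.0
                     for s, e in offsets], dtype=np.float32)

    # Normalize over the positive positions (softmax restricted to matched tokens).
    return mask / mask.sum() if mask.sum() > 0 else mask

m_status = positive_map("the cyclist moving left in front of the bus", "moving left")
# -> non-zero weight only on the tokens "moving" and "left"
```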

MoEE: Mixture of Event-Attribute Experts

  • Function: Four attribute-specific experts independently extract attribute-aware features, which are then adaptively weighted and fused into a final representation.
  • Design Motivation: The informativeness of each attribute varies across scenes (motion cues dominate at night; appearance is more useful in daylight), and static fusion cannot adapt to such variation.
  • Mechanism: (1) Attribute-aware masking: binary masks \(m_i^{\text{att}}\) are applied to encoder hidden states, retaining attribute-relevant tokens alongside shared context; (2) each attribute's features are projected via an FFN to produce expert features \(H_i^{\text{exp}}\); (3) mean-pooled descriptors from all four experts are concatenated and passed through a learnable projection \(W\) to generate gating weights \(\lambda\); (4) Gaussian noise is injected to prevent expert collapse, and the final fused representation is \(H^{\text{fuse}} = \sum \lambda_i \cdot H_i^{\text{exp}}\).
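
A minimal PyTorch sketch of this gating and fusion step under assumed tensor shapes; the expert FFN sizes, noise scale, and mask layout are illustrative placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MoEE(nn.Module):
    def __init__(self, d_model: int = 256, num_experts: int = 4, noise_std: float = 0.1):
        super().__init__()
        # One FFN expert per grounding attribute: appearance, status, viewer, others.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))
            for _ in range(num_experts)
        ])
        # Learnable projection W: concatenated expert descriptors -> gating logits.
        self.gate = nn.Linear(num_experts * d_model, num_experts)
        self.noise_std = noise_std

    def forward(self, hidden: torch.Tensor, attr_masks: torch.Tensor):
        # hidden: (B, T, D) encoder hidden states
        # attr_masks: (num_experts, B, T) binary attribute-aware masks m_i^att
        expert_feats, descriptors = [], []
        for expert, mask in zip(self.experts, attr_masks):
            h_i = expert(hidden * mask.unsqueeze(-1))   # attribute-masked expert features H_i^exp
            expert_feats.append(h_i)                    # (B, T, D)
            descriptors.append(h_i.mean(dim=1))         # mean-pooled descriptor, (B, D)

        logits = self.gate(torch.cat(descriptors, dim=-1))   # (B, num_experts)
        if self.training:
            # Gaussian noise on the gating logits to discourage expert collapse.
            logits = logits + torch.randn_like(logits) * self.noise_std
        lam = logits.softmax(dim=-1)                          # gating weights lambda

        # H^fuse = sum_i lambda_i * H_i^exp
        fused = sum(lam[:, i].view(-1, 1, 1) * expert_feats[i]
                    for i in range(len(expert_feats)))
        return fused, lam

# Example usage with toy shapes.
model = MoEE(d_model=256)
hidden = torch.randn(2, 40, 256)                    # batch of 2, 40 tokens
masks = torch.randint(0, 2, (4, 2, 40)).float()     # one binary mask per attribute
fused, weights = model(hidden, masks)               # fused: (2, 40, 256), weights: (2, 4)
```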

Multi-Attribute Fusion (MAF) Training and Inference

  • Function: Treats the four attributes as co-grounding pseudo-targets for multi-target matching during training; at inference, scores from all four attributes are aggregated to select the final predicted box.
  • Design Motivation: This exploits supervision signals from all attributes without increasing decoder complexity, ensuring dense supervision at every attribute dimension.
  • Mechanism: During training, the GT box is replicated four times (one per attribute) and assigned to queries via Hungarian matching. At inference, each query computes a softmax dot-product score against all four attribute maps, and the box with the highest score is selected as the prediction.
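
A minimal sketch of the inference-time aggregation, assuming the decoder exposes per-query boxes and query-to-token similarity logits; the tensor names and shapes are illustrative, not the authors' implementation.

```python
import torch

def select_box(pred_boxes: torch.Tensor,
               query_text_sim: torch.Tensor,
               positive_maps: torch.Tensor) -> torch.Tensor:
    """
    pred_boxes:     (Q, 4)  boxes predicted by the decoder queries
    query_text_sim: (Q, L)  query-to-text-token similarity logits
    positive_maps:  (4, L)  one normalized positive map per attribute
    """
    token_probs = query_text_sim.softmax(dim=-1)     # softmax over text tokens, (Q, L)
    attr_scores = token_probs @ positive_maps.t()    # dot product with each attribute map, (Q, 4)
    best_query = attr_scores.sum(dim=-1).argmax()    # aggregate the four attribute scores
    return pred_boxes[best_query]

# Example usage with toy shapes.
boxes = torch.rand(300, 4)                           # 300 decoder queries
sim = torch.randn(300, 64)                           # 64 text tokens
maps = torch.rand(4, 64)
maps = maps / maps.sum(dim=-1, keepdim=True)
best_box = select_box(boxes, sim, maps)
```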

Key Experimental Results

Main Results (Val Set, %)

| Method | Modality | mAcc | Ped | Rider | Car | Bus | Truck | mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BUTD-DETR | Frame | 48.91 | 22.66 | 20.44 | 61.94 | 33.93 | 35.93 | 84.30 |
| GroundingDINO | Frame | 44.50 | 15.62 | 8.62 | 57.70 | 32.52 | 41.20 | 68.67 |
| EventRefer | Frame | 55.47 | 27.64 | 51.10 | 65.76 | 47.02 | 32.22 | 85.76 |
| EvRT-DETR | Event | 29.34 | 15.45 | 5.50 | 39.24 | 7.74 | 9.26 | 75.66 |
| EventRefer | Event | 31.96 | 12.09 | 25.00 | 40.83 | 15.48 | 16.30 | 76.46 |
| FlexEvent | Fusion | 59.40 | 30.39 | 33.50 | 71.34 | 47.85 | 38.58 | 86.83 |
| EventRefer | Fusion | 61.82 | 31.15 | 44.23 | 73.85 | 41.07 | 41.70 | 87.32 |

Key Findings: EventRefer achieves the best mAcc and mIoU across all three modality settings. In the frame-only setting, the Rider category shows the most significant improvement (+24.4%), indicating that attribute-level reasoning is particularly effective for small, dynamic objects. Event-only performance remains lower than frame-based methods (due to limited texture), but fusion achieves the highest overall mAcc of 61.82%.

Ablation Study (Event-only, Val Set, mAcc%)

| Configuration | mAcc |
| --- | --- |
| Baseline (w/o PWM / MAF / MoEE) | 22.07 |
| + PWM | 26.38 (+4.31) |
| + MAF | 27.01 (+4.94) |
| + PWM + MAF | 29.66 (+7.59) |
| + PWM + MAF + MoEE | 31.96 (+9.89) |

Key Findings: All three components contribute independently and complementarily—PWM provides token-level attribute supervision, MAF enables independent attribute reasoning, and MoEE contributes an additional +2.3% through adaptive fusion. Among fusion strategies, MoEE (31.96%) substantially outperforms attention-based fusion (29.66%), additive fusion (28.39%), and concatenation fusion (27.50%).

Attribute Contribution Analysis

Single-attribute experiments show Status (28.90%) > Appearance (27.98%) > Viewer (27.03%) > Others (26.97%), while combining all four attributes yields the optimal 31.96%, confirming their complementarity. Visualization of MoEE gating weights reveals that Rider/Bike categories rely more heavily on Status cues, while Bus/Truck categories favor Appearance and Viewer Relation, demonstrating the adaptive nature of the mechanism.

Highlights & Insights

  • First visual grounding benchmark for event cameras: 5,567 scenes and 30,690 referring expressions averaging 34.1 words each, making it one of the most linguistically rich grounding datasets.
  • The four-attribute annotation scheme provides structured and interpretable grounding dimensions covering spatiotemporal and relational reasoning.
  • The adaptive gating mechanism in MoEE dynamically adjusts attribute weights according to scene dynamics and lighting conditions, offering both performance gains and interpretability.
  • Support for three modality settings (event-only, frame-only, fusion) affords broad research flexibility.

Limitations & Future Work

  • The dataset is built exclusively on DSEC (urban driving in Switzerland), limiting scene diversity and excluding extreme weather, nighttime, and crowded scenarios.
  • Event-only performance remains substantially below frame-based methods (31.96% vs. 55.47%), with insufficient semantic information in event streams being the fundamental bottleneck.
  • Referring expressions are generated by Qwen2-VL and subsequently validated by human annotators, which may introduce generation bias.
  • No direct comparison is conducted against recent large-scale vision-language models (e.g., GPT-4V, Qwen2.5-VL) on this task.

Benchmark Comparison

| Dimension | Talk2Event | RefCOCO / RefCOCOg |
| --- | --- | --- |
| Sensor | Event camera (optional RGB frame) | RGB frame |
| Scene Type | Dynamic driving scenes | Static images |
| Attribute Annotation | 4 types (Appearance + Status + Viewer + Others) | Appearance only |
| Expression Length | 34.1 words (rich) | 3.5–8.4 words (short) |
| Temporal Information | ✓ (motion, trajectory) | ✗ |

| Dimension | Talk2Event | ScanRefer |
| --- | --- | --- |
| Sensor | Event camera + RGB | RGB-D point cloud |
| Scene Type | Dynamic outdoor driving | Static indoor |
| Attribute Annotation | 4 attribute types | Appearance + inter-object relation only |
| Dataset Scale | 30,690 expressions | 51,583 expressions |
| Dynamic Support | ✓ (motion state, temporal reasoning) | ✗ |

Rating

  • ⭐⭐⭐⭐⭐ Novelty: First introduction of language grounding to the event camera domain; the four-attribute design is highly original.
  • ⭐⭐⭐⭐ Technical Quality: MoEE is well-motivated; ablation studies are thorough; evaluation spans three modality settings comprehensively.
  • ⭐⭐⭐⭐ Value: Directly advances multimodal perception research in autonomous driving and robotics.
  • ⭐⭐⭐⭐ Writing Quality: The paper is clearly structured with informative figures and complete mathematical derivations.