EventFlash: Towards Efficient MLLMs for Event-Based Vision

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=QuvGqzLwf6
Code: https://github.com/XduSyL/EventFlash
Area: Multimodal Large Language Models / Event-based Vision
Keywords: Event Camera, MLLM, Spatiotemporal Token Sparsification, Long-sequence Understanding, Curriculum Learning

TL;DR

EventFlash leverages the inherent spatiotemporal sparsity of event streams by designing two token sparsification modules: Adaptive Time Window Aggregation and Sparsity-Driven Guided Attention. These modules increase inference throughput by 12.4× and extend the processable event bin capacity from 5 (in EventGPT) to 1000.

Background & Motivation

Background: Event cameras output pixel-level brightness changes asynchronously with microsecond temporal resolution and high dynamic range, making them ideal for high-speed motion and low-light scenarios. Current methods for extending MLLMs to event vision typically convert event streams into dense, image-like representations before feeding them into off-the-shelf models like LLaVA or Qwen (e.g., EventGPT, EventVL, LLaFEA).

Limitations of Prior Work: This "densification" paradigm completely ignores the spatiotemporal sparsity of event streams—many pixel locations contain no events, and forcing them into dense tokens introduces massive redundancy. This leads to two consequences: extreme computational overhead and severe limitations on the length of processable event sequences (EventGPT can only handle 5 bins, effectively capturing only a "moment" and precluding long-range understanding).

Key Challenge: Redundancy in event streams differs from video. Video redundancy stems from spatial repetition on a regular patch grid, whereas event streams consist of irregularly distributed, sparse spatiotemporal points with high density variance. Redundancy arises from non-uniform temporal sampling, making existing video token sparsification methods expensive and ineffective.

Goal: Rather than prioritizing peak accuracy, this work targets three obstacles to efficient event MLLMs: (i) Temporal inefficiency: microsecond resolution produces an enormous number of tokens over long recordings; (ii) Spatial inefficiency: sparsity leaves many empty tokens that still receive uniform attention; (iii) Data gap: existing instruction datasets are private, cover few scenarios, and contain only short sequences.

Core Idea: Density-aware Spatiotemporal Token Sparsification—compressing tokens via adaptive window aggregation in the temporal dimension and filtering empty/low-density regions via density-guided attention in the spatial dimension, coupled with a short-to-long curriculum learning strategy and a self-constructed dataset of 500,000 instructions named EventMind.

Method

Overall Architecture

EventFlash is an end-to-end event MLLM pipeline with five modules. Raw event streams are first processed by Adaptive Time Window Aggregation (ATWA), which slices the continuous stream into fine bins and adaptively merges them based on similarity and density. The compressed bins are sent to an Event Encoder (CLIP-ViT) to extract semantic embeddings. Sparsity-Driven Guided Attention (SDGA) then operates on the resulting patch-level features, keeping informative tokens and suppressing blank areas in the spatial dimension. An Event-Language Projector aligns the remaining event tokens to the text space, and they are fed into the LLM Decoder (Qwen2.5) together with the text tokens for generation. The entire system is trained with a three-stage curriculum that progresses from short to long sequences.

flowchart LR
    A[Raw Event Stream] --> B[ATWA<br/>Adaptive Time Window Aggregation<br/>Temporal Compression]
    B --> C[Event Encoder<br/>CLIP-ViT]
    C --> D[SDGA<br/>Sparsity-Driven Guided Attention<br/>Spatial Filtering]
    D --> E[Event-Language Projector]
    F[Text Instruction] --> G[LLM Decoder<br/>Qwen2.5]
    E --> G
    G --> H[Multimodal Generation]

Key Designs

1. Adaptive Time Window Aggregation (ATWA): Eliminating temporal redundancy via two-level density-guided merging. ATWA compresses the explosion of temporal tokens from microsecond streams while preserving motion dynamics. The first level performs "Density-Guided Physical Merging": slicing the stream into fine bins, treating each as a spatiotemporal point process with polarity, characterized by an intensity function \(\lambda_B(x,y,t,p)=\sum_{e_n\in B} f(p_n)\cdot \exp\!\big(-\frac{(x-x_n)^2}{2\sigma_x^2}-\frac{(y-y_n)^2}{2\sigma_y^2}-\frac{(t-t_n)^2}{2\sigma_t^2}\big)\). The distance between adjacent bins is the norm of the intensity difference \(D(B_i,B_{i+1})=\lVert\lambda_{B_i}-\lambda_{B_{i+1}}\rVert_2\); bins are iteratively merged into meta-windows if the distance is below a threshold \(\tau\). The second level performs "Semantic-Aware Merging": each window passes through ViT to obtain a CLS token \(z_i\), and cosine similarity \(S_i=\frac{z_i^\top z_{i+1}}{\lVert z_i\rVert\lVert z_{i+1}\rVert}\) is calculated. A normalized density factor \(r_i\) yields the final merging score \(A_i=S_i\cdot\exp(-\alpha\cdot r_i)\)—windows with higher semantic similarity and lower density are more likely to be merged.

2. Sparsity-Driven Guided Attention (SDGA): Modulating attention by density to eliminate blank tokens. After temporal compression, spatial redundancy remains due to non-uniform event distribution. SDGA applies standard multi-head attention \(\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^\top}{\sqrt{d_k}})V\) to patch-level features \(\{x_j\}\), but injects density signals into the scores: the event density \(D_j\) of each token region is transformed via a "Linear + GELU" density encoding unit into a soft modulation signal \(f(D_j)=\text{GELU}(\text{Linear}(D_j))\). This is added to the attention scores \(\tilde A_{ij}=\frac{Q_iK_j^\top}{\sqrt{d_k}}+f(D_j)\), biasing the model toward denser regions. Finally, a Token Selector ranks responses and discards low-importance tokens: \(\hat x_i=\text{TokenSelector}\big(\sum_j \text{softmax}(\tilde A_{ij})\cdot V_j\big)\).

3. Short-to-Long Curriculum Learning: Scaling from alignment to long-range reasoning. Unlike the stage-wise modular training of EventVL/EventGPT, EventFlash progresses by sequence duration. Stage 1 trains the event-language alignment (learning rate \(2\times10^{-3}\), batch size 64) on 200k short sequences of 0–50 ms for basic cross-modal understanding. Stage 2 unfreezes all parameters (learning rate \(2\times10^{-5}\)) and uses 110k medium sequences of 50–5000 ms to learn complex actions and event QA. Stage 3 uses 190k long sequences of 5000–20000 ms for rich scene description and open-ended generation.

4. EventMind Dataset: Closing the gap in long-sequence, multi-scenario instruction data. To support curriculum training, a large-scale dataset of 500k instructions across 7 task types was constructed. Raw events come from real captures (DSEC, N-ImageNet, HARDVS, E2VID) and from synthesis (the V2E simulator converts Kinetics-700 and UCF-101 videos, which are pre-filtered for quality with GPT-4o based on their captions). Text instructions are derived via two paths: (a) existing labels are refined with GPT-4o to remove static attributes (texture, color), and (b) new instructions are generated automatically from video with Qwen-VL-Max and then pass human quality control.
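
The scoring rules in designs 1 and 2 can be sketched in a few lines of NumPy. This is a minimal, single-head illustration under stated assumptions, not the authors' implementation: the function names, the way \(r_i\) is computed (mean event count of the two adjacent windows, normalized by the maximum), the scalar weights `w_d` and `b_d` of the density encoder, and the use of received attention as the Token Selector's ranking criterion are all hypothetical choices made for the example.

```python
import numpy as np


def atwa_merge_scores(cls_tokens, event_counts, alpha=1.0):
    """Second-level ATWA score A_i = S_i * exp(-alpha * r_i) for adjacent windows.

    cls_tokens:   (N, d) CLS embeddings z_i, one per window.
    event_counts: (N,) events per window (assumed proxy for density).
    """
    z = np.asarray(cls_tokens, dtype=float)
    counts = np.asarray(event_counts, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = np.sum(z[:-1] * z[1:], axis=1)            # cosine similarity S_i
    pair_density = 0.5 * (counts[:-1] + counts[1:])
    r = pair_density / max(pair_density.max(), 1e-12)  # assumed normalization to [0, 1]
    return sim * np.exp(-alpha * r)


def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def sdga_select(x, density, w_q, w_k, w_v, w_d, b_d, keep):
    """Density-biased attention followed by a simple top-k token selector.

    x:       (N, d) patch-level features.
    density: (N,) event density D_j of each token region.
    keep:    number of tokens to retain (hypothetical parameter).
    """
    x = np.asarray(x, dtype=float)
    density = np.asarray(density, dtype=float)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    bias = gelu(w_d * density + b_d)                 # f(D_j) = GELU(Linear(D_j))
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias[None, :]  # biased scores
    scores = scores - scores.max(axis=1, keepdims=True)      # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)          # row-wise softmax
    out = attn @ v
    importance = attn.sum(axis=0)                    # attention received per token (assumption)
    keep_idx = np.sort(np.argsort(-importance)[:keep])
    return out[keep_idx], keep_idx
```

In this sketch, adjacent windows whose score exceeds a chosen threshold would be merged, and `keep` controls how many spatial tokens reach the projector. Neither value is specified in this summary, so both would need to be taken from the paper or the released code.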

Key Experimental Results

Main Results

Comparison with video MLLMs and event MLLMs on EventMind and EventChat-Sub (selection):

Model Parameters Max Bins Throughput (Token/s) GDC FGQA HAQA MCQA
Qwen2.5-VL 3B 768 20.6 41.7 23.8 34.6
VideoChat2-Flash 7B 1000 36.2 41.9 18.9 48.2
EventGPT-7B 7B 5 42.2
EventFlash-Zero 3B 1000 2.3 45.3 60.4 85.0 58.2
EventFlash-3B 3B 1000 28.5 46.8 61.1 84.9 60.0
EventFlash-7B 7B 1000 24.0 52.3 64.2 87.6 63.1

EventFlash outperforms all video MLLMs and EventGPT across all four tasks. Where EventGPT is limited to 5 bins, EventFlash handles 1000 bins (200× the capacity). The 3B version's throughput of 28.5 Token/s is 12.4× that of the non-sparse EventFlash-Zero (2.3 Token/s).

Ablation Study

Component ablation (S=Spatial Sparsity, T=Temporal Sparsity):

Model     S  T  Token/s       GDC   FGQA  HAQA  MCQA
Baseline  ✗  ✗  2.3           45.3  60.4  85.0  58.2
A         ✓  ✗  5.3 (2.3×)    46.3  61.2  85.1  59.6
B         ✗  ✓  14.0 (6.1×)   47.1  60.6  83.8  60.3
Ours      ✓  ✓  28.5 (12.4×)  46.8  61.1  84.9  60.0

Temporal sparsification contributes more to the speedup (6.1×) than spatial sparsification (2.3×). A 10 ms aggregation interval gives the best trade-off between accuracy and efficiency.
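
The speedup factors quoted above follow directly from the Token/s column of the ablation table. The short check below recomputes them and also compares the combined gain with the product of the two individual gains. The variable names are illustrative, and the mapping of rows A and B to spatial-only and temporal-only sparsification is taken from the sentence above.

```python
baseline = 2.3  # Token/s without sparsification (ablation baseline)
throughput = {
    "A (spatial only)": 5.3,
    "B (temporal only)": 14.0,
    "Ours (spatial + temporal)": 28.5,
}

# Speedup of each variant relative to the non-sparse baseline.
speedup = {name: tps / baseline for name, tps in throughput.items()}
for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x")

# If the two modules composed perfectly, the combined gain would be the product.
product = speedup["A (spatial only)"] * speedup["B (temporal only)"]
print(f"product of individual speedups: {product:.1f}x")
```

This prints 2.3x, 6.1x, and 12.4x for the three variants and 14.0x for the product. The combined 12.4× is somewhat below 14.0×, so under these numbers the two modules compose nearly, but not fully, multiplicatively.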

Key Findings

  • Temporal redundancy is the primary bottleneck in event streams; combining temporal and spatial sparsification yields a 12.4× gain.
  • Massive token compression does not degrade accuracy and sometimes improves it, suggesting the removed tokens were indeed redundant.
  • EventFlash maintains fine-grained descriptive accuracy in extreme scenarios where frame cameras fail, such as high-speed puppet impacts or night-time streets.

Highlights & Insights

  • Recognizing the fundamental difference between event and video redundancy is the key insight; building sparsification on event density rather than regular grids is crucial.
  • Combining physical intensity modeling with semantic similarity for temporal merging is more principled than simple frame-based merging.
  • The pragmatic focus on efficiency over pure accuracy is valuable. Long-range event understanding (1000 bins) is out of reach for models like EventGPT, which is limited to 5 bins; treating throughput and sequence length as primary objectives opens up real-world applications such as surveillance and autonomous driving.

Limitations & Future Work

  • Evaluation primarily relies on the self-constructed EventMind dataset; while EventChat-Sub results are reported, the lack of more independent third-party benchmarks limits absolute comparability.
  • Open-ended tasks (GDC, FGQA) use GPT-4o as an LLM-Judge, which may introduce model bias.
  • The 7B version throughput (24.0) is lower than the 3B version (28.5), and the relationship between sparsification gains and model scale requires more discussion.
  • Synthetic events from V2E may have gaps compared to the noise/trigger characteristics of real event cameras.

Related Work & Insights

  • Event MLLMs: EventGPT (first event MLLM, dense tokens), EventVL (RGB+Event fusion), LLaFEA (frame-event region grounding). EventFlash differs by treating "sparsity" as a feature rather than a burden.
  • MLLM Token Sparsification: While video methods assume grid redundancy, this work demonstrates that irregular event streams require custom sparsification.
  • Insight: When input modalities have strong structural priors (like spatiotemporal sparsity), it is better to embed these priors directly into token design rather than using generic densification pipelines.

Rating

  • Novelty: ⭐⭐⭐⭐ Systematically embedding event sparsity into tokenization (physical + semantic + density criteria) is a solid, well-positioned approach.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Includes 3B/7B comparisons and video vs. event MLLM benchmarks, though depends heavily on self-constructed data.
  • Writing Quality: ⭐⭐⭐⭐ Clear mapping between challenges and modules; complete formulas and diagrams.
  • Value: ⭐⭐⭐⭐ Addresses the real bottleneck of long-sequence understanding (5 bin → 1000 bin) with significant throughput gains.