EventGPT: Event Stream Understanding with Multimodal Large Language Models

Conference: CVPR 2025
arXiv: 2412.00832
Code: https://github.com/EventGPT (TBD)
Area: Multimodal VLM
Keywords: Event Camera, MLLM, Spatio-Temporal Aggregation, Three-Stage Training, Event-Language Alignment

TL;DR

EventGPT is the first MLLM designed specifically for event camera streams. A three-stage progressive training paradigm (vision-language alignment \(\to\) event-language alignment \(\to\) instruction tuning) bridges the large domain gap between asynchronous event data and language, and the resulting model substantially outperforms general MLLMs on event scene description and VQA.

Background & Motivation

Background: Event cameras record intensity changes asynchronously with high temporal resolution, which gives them unique advantages under high-speed motion and extreme lighting. However, event data is represented very differently from conventional RGB images, so existing MLLMs (such as LLaVA and Qwen2-VL) perform poorly when applied directly to event data.

Limitations of Prior Work: General MLLMs score only about 1.5 to 2.4 out of 5.0 on detailed description of event data, because they are pre-trained on RGB image-text pairs and do not capture the spatio-temporal structure of event streams. Moreover, event data lacks paired language annotations, so event-language models cannot be trained directly.

Key Challenge: A representation gap exists between event camera data and natural language. Events are asynchronous, sparse spike signals, fundamentally different from any visual data seen during MLLM pre-training, and this gap must be bridged with minimal annotated data.

Goal: To design an event-specific MLLM architecture and training strategy that enables the model to comprehend event streams and perform scene description, reasoning, and visual question answering (VQA) using natural language.

Key Insight: Leveraging RGB images as an intermediate bridge—first aligning the vision-language space on RGB image-text pairs (reusing existing data), then aligning event features to the same space using synthetic event-text pairs, and finally fine-tuning on real-world event data.

Core Idea: Aligning the representation spaces of event camera data and language models through a three-stage progressive training paradigm ("Image \(\to\) Event \(\to\) Instruction") and a spatio-temporal aggregation module.

Method

Overall Architecture

The event stream is split into \(T\) temporal windows to form an event tensor, which is encoded by OpenCLIP ViT-L/14 into spatio-temporal features \(\mathcal{Z} \in \mathbb{R}^{T \times S \times D}\). The spatio-temporal aggregator applies average pooling and max pooling along the temporal and spatial dimensions separately and concatenates them to obtain a fused representation. This is then mapped to the input space of the LLM (Vicuna-v1.5) via a linear projection layer and an event-language adapter.
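
The windowing step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the source does not say how events inside a window are rendered, so the sketch assumes events arrive as `(x, y, t, p)` arrays and that each window is accumulated into a two-channel per-polarity count frame. The function name `events_to_windows` and all of its parameters are hypothetical.

```python
import numpy as np

def events_to_windows(x, y, t, p, num_windows, height, width):
    """Bin events into `num_windows` equal-duration windows of per-polarity counts.

    ASSUMPTION: count-frame rendering; the source only says the stream is
    split into T temporal windows. Returns shape (num_windows, 2, height, width).
    """
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    t = np.asarray(t, dtype=np.float64)
    p = (np.asarray(p) > 0).astype(np.int64)  # map polarity to channel 0 or 1

    frames = np.zeros((num_windows, 2, height, width), dtype=np.float32)
    if t.size == 0:
        return frames

    t0 = t.min()
    span = max(t.max() - t0, 1e-9)  # guard against a zero-length stream
    # Window index in [0, num_windows - 1]; the latest event goes to the last window.
    w = np.minimum(((t - t0) / span * num_windows).astype(np.int64), num_windows - 1)

    # np.add.at accumulates correctly when several events share a pixel.
    np.add.at(frames, (w, p, y, x), 1.0)
    return frames
```

Each of the resulting frames would then be passed through the image encoder to produce one row of \(S\) spatial tokens, giving the \(T \times S \times D\) feature tensor described above.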

Key Designs

Spatio-Temporal Aggregator:
- Function: Extract joint spatio-temporal representations from multi-window event features.
- Mechanism: Average pool \(\mathcal{Z} \in \mathbb{R}^{T \times S \times D}\) along the temporal dimension to obtain \(\mathcal{Z}_T^{avg} \in \mathbb{R}^{S \times D}\) (aggregating temporal information while retaining spatial structure) and along the spatial dimension to obtain \(\mathcal{Z}_S^{avg} \in \mathbb{R}^{T \times D}\) (aggregating spatial information while retaining temporal structure), then concatenate them with their corresponding max-pooled representations into \(\overline{\mathcal{Z}} \in \mathbb{R}^{(T+S) \times D}\).
- Design Motivation: The uniqueness of event data lies in its high temporal resolution, requiring simultaneous modeling of spatial characteristics (what is changing) and temporal characteristics (how it changes). Independent dual-path pooling preserves more structural information than direct flattening.
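
The dual-path pooling can be sketched in NumPy as below. This is an illustrative sketch under a stated assumption: the description does not say how the average-pooled and max-pooled features are merged while keeping the \((T+S) \times D\) output shape, so the code assumes element-wise addition. The function name `spatio_temporal_aggregate` is hypothetical.

```python
import numpy as np

def spatio_temporal_aggregate(z):
    """z: encoder features of shape (T, S, D). Returns tokens of shape (T + S, D)."""
    z = np.asarray(z, dtype=np.float32)
    if z.ndim != 3:
        raise ValueError("expected features of shape (T, S, D)")

    # Pool over time (axis 0): temporal information is aggregated, S tokens remain.
    z_t_avg, z_t_max = z.mean(axis=0), z.max(axis=0)  # each (S, D)
    # Pool over space (axis 1): spatial information is aggregated, T tokens remain.
    z_s_avg, z_s_max = z.mean(axis=1), z.max(axis=1)  # each (T, D)

    # ASSUMPTION: average and max paths are merged by element-wise addition
    # so that the output keeps the stated (T + S, D) shape.
    temporal_tokens = z_s_avg + z_s_max  # (T, D)
    spatial_tokens = z_t_avg + z_t_max   # (S, D)
    return np.concatenate([temporal_tokens, spatial_tokens], axis=0)
```

With \(T\) windows and \(S\) spatial tokens per window, the LLM receives \(T+S\) event tokens instead of the \(T \cdot S\) tokens that direct flattening would produce.
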

Three-Stage Progressive Training:
- Function: Progressively bridge the modality gap of RGB \(\to\) Event \(\to\) Language.
- Mechanism: Stage 1 (Vision-Language Alignment): Train the linear projection layer using LLaVA-Pretrain 558K RGB image-text pairs, with the encoder and LLM frozen. Stage 2 (Event-Language Alignment): Train the spatio-temporal aggregator and the event-language adapter using 1 million synthetic event-text pairs from N-ImageNet-Chat, with the rest frozen. Stage 3 (Instruction Tuning): Fine-tune all parameters using 120k real-world annotations from Event-Chat.
- Design Motivation: Direct training on event-text pairs is ineffective due to the immense modality gap. Utilizing RGB as an intermediate bridge, synthetic data for pre-training, and real data for fine-tuning progressively narrows this gap.
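
The freeze/train schedule described above can be written down as a small configuration table. This is a sketch only: the module names (`event_encoder`, `aggregator`, `projector`, `adapter`, `llm`) and the helper `untrained_modules` are hypothetical labels chosen for illustration, and modules that are not yet in use at a given stage are simply listed as not trained.

```python
MODULES = ("event_encoder", "aggregator", "projector", "adapter", "llm")

# Which modules receive gradient updates in each stage, per the description above.
STAGES = {
    "stage1_vision_language": {
        "data": "LLaVA-Pretrain (558K RGB image-text pairs)",
        "trainable": {"projector"},
    },
    "stage2_event_language": {
        "data": "N-ImageNet-Chat (1M synthetic event-text pairs)",
        "trainable": {"aggregator", "adapter"},
    },
    "stage3_instruction_tuning": {
        "data": "Event-Chat (120k real-world annotations)",
        "trainable": set(MODULES),  # the summary states all parameters are fine-tuned
    },
}

def untrained_modules(stage):
    """Modules that are frozen (or not yet used) in the given stage."""
    return sorted(set(MODULES) - STAGES[stage]["trainable"])
```

For example, `untrained_modules("stage2_event_language")` lists the encoder, the projector, and the LLM, matching "with the rest frozen" in the Stage 2 description.
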

Event-Language Adapter:
- Function: Provide additional cross-modal alignment beyond the linear projection layer.
- Mechanism: A linear layer introduced in Stage 2 to further map event features into the representation space of the LLM. In the ablation study, adding the adapter alone raises the DC score by 3.24% relative to the baseline (3.40 to 3.51), more than adding the spatio-temporal aggregator alone (+2.35%, 3.40 to 3.48), which suggests that cross-modal alignment is a larger bottleneck than temporal modeling.
- Design Motivation: Due to the massive distribution gap between event and RGB data, the projection layer initially trained for RGB is insufficient for alignment, necessitating an additional adapter layer.

Loss & Training

Standard next-token prediction cross-entropy loss is employed. In terms of datasets, two new datasets are constructed: N-ImageNet-Chat (1 million synthetic event-text pairs, generated from N-ImageNet event emulation data) and Event-Chat (120k real-world annotations, sourced from driving scene data in DSEC and E2VID).
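
For reference, the next-token cross-entropy objective can be sketched in NumPy as follows. The function name `next_token_loss` is hypothetical, and the optional `loss_mask` (restricting the loss to answer tokens, as is common in instruction tuning) is an assumption that the summary does not confirm.

```python
import numpy as np

def next_token_loss(logits, token_ids, loss_mask=None):
    """Mean cross-entropy where position i predicts token i + 1.

    logits: (L, V) unnormalized scores; token_ids: (L,) integer ids.
    loss_mask: optional (L,) array of 0/1 weights on target tokens (ASSUMPTION).
    """
    logits = np.asarray(logits, dtype=np.float64)[:-1]   # drop the last position
    targets = np.asarray(token_ids, dtype=np.int64)[1:]  # shift targets left by one

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(targets.size), targets]

    if loss_mask is None:
        return float(nll.mean())
    mask = np.asarray(loss_mask, dtype=np.float64)[1:]
    return float((nll * mask).sum() / max(mask.sum(), 1.0))
```

With uniform logits over a vocabulary of size \(V\), this loss equals \(\ln V\), which is a quick sanity check for any implementation.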

Key Experimental Results

Main Results

| Model | LLM | N-ImageNet (DC/CR/VQA) | Event-Chat (DC/CR/VQA) |
|---|---|---|---|
| LLaVA-7B | Vicuna | 1.54/1.07/1.88 | 2.20/4.04/3.26 |
| Qwen2-VL-7B | Qwen2 | 1.74/1.46/1.91 | 2.38/4.02/2.91 |
| InternVL2-8B | InternLM | 1.51/1.87/2.08 | 2.37/4.00/3.71 |
| EventGPT-7B | Vicuna | 2.39/2.57/2.23 | 3.52/4.09/4.29 |
| EventGPT-13B | Vicuna | 2.41/2.81/2.40 | 3.40/4.13/4.26 |

Ablation Study

| Configuration | DC | CR | VQA | Relative DC gain vs. baseline |
|---|---|---|---|---|
| Baseline (no aggregator, no adapter) | 3.40 | 3.97 | 4.15 | (reference) |
| +Spatio-Temporal Aggregator | 3.48 | 4.02 | 4.20 | +2.35% |
| +Event-Language Adapter | 3.51 | 4.05 | 4.25 | +3.24% |
| +Both (Full) | 3.52 | 4.09 | 4.29 | +3.53% |
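
The percentages in the last column are consistent with relative improvements of the DC score over the baseline row. A short check, using only the DC values from the table (the helper `relative_gain` is a hypothetical name introduced for illustration):

```python
baseline_dc = 3.40
dc_scores = {
    "+Spatio-Temporal Aggregator": 3.48,
    "+Event-Language Adapter": 3.51,
    "+Both (Full)": 3.52,
}

def relative_gain(score, baseline):
    """Relative improvement over the baseline, in percent, rounded to 2 decimals."""
    return round((score - baseline) / baseline * 100, 2)

gains = {name: relative_gain(score, baseline_dc) for name, score in dc_scores.items()}
# gains == {"+Spatio-Temporal Aggregator": 2.35,
#           "+Event-Language Adapter": 3.24,
#           "+Both (Full)": 3.53}
```

Note that these are relative changes on a 1-5 score, so the absolute differences are small (0.08 to 0.12 points).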

Key Findings

General MLLMs perform poorly on event data: On Event-Chat detailed description (DC), the strongest general baselines in the results table reach only 2.38/5.0 (Qwen2-VL-7B) and 2.37/5.0 (InternVL2-8B), whereas EventGPT-7B reaches 3.52/5.0 (about +48% relative).
The event-language adapter matters more than the spatio-temporal aggregator: Relative to the baseline DC score, the adapter alone contributes +3.24% and the aggregator alone +2.35%, suggesting that cross-modal alignment is the larger bottleneck.
The optimal number of temporal windows is \(N_w=5\): Too few windows (3) lose temporal detail, while too many (>7) leave each window with too sparse a distribution of events.
Strong downstream transfer capability: Text descriptions generated by EventGPT can directly drive GroundingDINO for object detection and GroundedSAM for instance segmentation.

Highlights & Insights

The "image as a bridge" training strategy cleverly addresses the lack of language annotations for event data. This progressive alignment concept can be extended to align other emerging sensor modalities (e.g., radar, haptics) with language.
New Dataset Construction: N-ImageNet-Chat (1 million) and Event-Chat (120k) provide the first large-scale language-annotated datasets for the event camera community.
Practical Value: Event cameras possess irreplaceable advantages in autonomous driving (tunnels, night-time) and high-speed motion scenarios; EventGPT enables natural language interaction within these environments.

Limitations & Future Work

The event encoder utilizes OpenCLIP pre-trained on RGB, where the event-RGB domain gap may limit feature quality. Event-specific pre-training could be explored in future work.
The spatio-temporal aggregator uses simple average/max pooling. More sophisticated sequence modeling (e.g., Mamba, temporal transformers) might yield better performance.
Event-Chat is primarily derived from driving scenarios; generalization to indoor, industrial, and sports environments remains unverified.
The reliability and consistency of the evaluation metrics (1-5 GPT scores) require further validation.

vs E2VID + LLaVA Pipeline: Two-stage methods that convert events to RGB and then apply MLLM lose the high temporal resolution advantages of event data. EventGPT processes data end-to-end to preserve temporal information.
vs Event-Specific Vision Models (e.g., event object detection, optical flow estimation): These task-specific models cannot generalize to novel tasks. EventGPT achieves open-ended capabilities through a language interface.
vs General MLLMs: Experiments show that applying general MLLMs directly to event data is ineffective; the domain gap is too large to close without dedicated training.

Rating

Novelty: ⭐⭐⭐⭐ The first event camera MLLM, filling an important gap, though the architectural components themselves (aggregator, adapter) are relatively standard.
Experimental Thoroughness: ⭐⭐⭐ Comparisons with general MLLMs are clear, but there is a lack of quantitative comparisons with event-specific methods on downstream tasks.
Writing Quality: ⭐⭐⭐⭐ The motivation behind the three-stage training is well-explained, and the dataset construction is meticulously described.
Value: ⭐⭐⭐⭐ Pioneering significance for the event camera community, with valuable dataset contributions.