
SEAL: Segment Any Events with Language

Conference: ICLR 2026
arXiv: 2601.23159
Code: https://0nandon.github.io/SEAL (coming soon)
Area: Autonomous Driving
Keywords: Event camera, open-vocabulary instance segmentation, SAM, CLIP, multimodal fusion, annotation-free training

TL;DR

This paper introduces the open-vocabulary event instance segmentation (OV-EIS) task and proposes the SEAL framework. Through multimodal hierarchical semantic guidance (MHSG) and a lightweight multimodal fusion network, SEAL segments and classifies event streams at multiple granularities (instance-level and part-level) using only event–image pairs, with no dense annotations. It achieves the highest AP among the compared baselines on every reported benchmark and the lowest reported inference time (22.28 ms).

Background & Motivation

Advantages of event cameras: Event cameras offer extremely high temporal resolution, ultra-low latency, high dynamic range, and low power consumption, remaining effective in scenarios where conventional cameras fail, such as low-light and overexposed conditions.

Limitations of existing event segmentation: Existing event semantic segmentation (ESS) methods are confined to closed-set vocabularies and cannot recognize objects outside the training categories, nor can they distinguish between different instances of the same class.

Open-vocabulary event understanding is nascent: OpenESS achieves only open-vocabulary semantic segmentation without instance-level recognition; EventSAM supports event instance segmentation but lacks semantic recognition capability.

Absence of evaluation benchmarks: No multi-semantic benchmark dataset for event instance segmentation has previously existed.

Efficiency requirements: Event cameras are commonly deployed on edge devices, necessitating parameter-efficient and fast-inference model designs.

Domain gap: Directly applying image-domain pretrained models to event streams suffers from a large domain gap. The gap remains even when the events are first reconstructed into images via E2VID, because the reconstructions contain noise and artifacts.

Method

Overall Architecture

SEAL belongs to the annotation-free domain adaptation (AF-DA) category:
- During training: Only event–image pairs \((I^{evt}, I^{img})\) are used; no dense event annotations are required.
- During inference: Only event embeddings \(I^{evt}\) are input; masks and category predictions are generated given user-provided visual prompts (points/boxes).
- The framework consists of two core components: the MHSG module (providing multimodal hierarchical semantic supervision) and the multimodal fusion network (a lightweight mask classifier).

Key Design 1: Multimodal Hierarchical Semantic Guidance (MHSG)

Hierarchical visual guidance:
- SAM is applied to paired images to generate segmentation maps at three granularities: semantic-level \(M_s^{img}\), instance-level \(M_i^{img}\), and part-level \(M_p^{img}\).
- Pixel-level features are extracted via the CLIP visual encoder and then RoI-pooled to obtain visual features for each level of masks.
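The pooling step above can be illustrated with a minimal NumPy sketch. It uses masked average pooling as a simple stand-in for the RoI pooling mentioned in the text; the function name `mask_pool`, the toy feature map, and the toy masks are all hypothetical and are not taken from the paper's code.

```python
import numpy as np

def mask_pool(features: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Average per-pixel features inside each binary mask.

    features: (H, W, C) dense feature map (stand-in for CLIP pixel features).
    masks:    (K, H, W) boolean masks (stand-in for SAM masks at one level).
    returns:  (K, C), one feature vector per mask; empty masks give zeros.
    """
    m = masks.astype(features.dtype)                 # (K, H, W)
    summed = np.einsum("khw,hwc->kc", m, features)   # feature sum inside each mask
    area = m.sum(axis=(1, 2))[:, None]               # (K, 1) pixel counts
    return summed / np.maximum(area, 1.0)            # guard against empty masks

# Toy example: a 4x4 map with 2 channels and two masks.
feats = np.zeros((4, 4, 2))
feats[:2, :, 0] = 1.0    # top half activates channel 0
feats[2:, :, 1] = 1.0    # bottom half activates channel 1
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :] = True   # mask 0 covers the top half only
masks[1, 1:3, :] = True  # mask 1 straddles both halves
pooled = mask_pool(feats, masks)
print(pooled)
```

Running the same function once per granularity (semantic, instance, part) would give one set of mask-level visual features per level.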

Hierarchical textual guidance:
- A LLaMA-based MLLM generates rich textual descriptions for each mask.
- These descriptions are encoded via the CLIP text encoder to form hierarchical textual guidance signals.
- Unlike OpenESS, this approach does not rely on predefined category names but instead leverages the MLLM to produce diverse and rich vocabulary.

Key Design 2: Multimodal Fusion Network (Three Components)

① Backbone Feature Enhancer:
- Six multimodal fusion modules (self-attention + cross-attention + FFN) are stacked on top of the EventSAM backbone features.
- During training, textual guidance \(M_l^{text}\) serves as the key/value in cross-attention.
- During inference, dataset class names or user-defined language inputs are used instead.
- Mask features are pooled from language-fused features via RoI-Align.
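A rough sketch of one such fusion module, stacked six times, is given below. It is a deliberately simplified illustration: single-head attention, no learned query/key/value projections, random FFN weights, and post-norm residuals are all assumptions made for brevity, and every name (`fusion_block`, `event_tokens`, `text_tokens`) is hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention: q is (N, D), k and v are (M, D)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def fusion_block(event_tokens, text_tokens, w1, w2):
    """Self-attention over event tokens, cross-attention to text, then an FFN."""
    x = layer_norm(event_tokens + attention(event_tokens, event_tokens, event_tokens))
    x = layer_norm(x + attention(x, text_tokens, text_tokens))  # text is key/value
    ffn = np.maximum(x @ w1, 0.0) @ w2                          # two-layer ReLU MLP
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
D = 16
event_tokens = rng.normal(size=(64, D))  # e.g. an 8x8 grid of backbone features
text_tokens = rng.normal(size=(5, D))    # e.g. 5 text embeddings
# Six stacked blocks, as described in the text; weights are random placeholders.
weights = [(0.1 * rng.normal(size=(D, 4 * D)), 0.1 * rng.normal(size=(4 * D, D)))
           for _ in range(6)]

fused = event_tokens
for w1, w2 in weights:
    fused = fusion_block(fused, text_tokens, w1, w2)
print(fused.shape)
```

Because the text embeddings enter only as key/value, swapping the training-time guidance for class names at inference changes the inputs, not the block itself.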

② Spatial Encoding:
- Addresses two problems: dead masks (masks of small objects vanish after downsampling, yielding zero vectors) and semantic conflict (masks of different semantics project onto the same region on low-resolution feature maps).
- SAM mask decoder mask tokens are used to encode spatial priors such as shape and location.
- Spatial features \(G_l^{evt}\) and semantic features \(S_l^{evt}\) are concatenated and projected: \(M_l^{evt} = \text{proj}(\text{concat}(G_l^{evt}, S_l^{evt}))\)
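Both failure modes, and the effect of concatenating a spatial feature, can be reproduced on toy data. In the sketch below, the stride-16 nearest-neighbour downsampling, the random vectors standing in for SAM mask tokens, and the random projection matrix `W` are all assumptions for illustration; only the concat-then-project form follows the equation above.

```python
import numpy as np

def downsample_mask(mask, stride):
    """Nearest-neighbour downsampling: keep every `stride`-th pixel."""
    return mask[::stride, ::stride]

def pooled_semantic(features, mask):
    """Mean feature inside a low-resolution mask; zeros if the mask is empty."""
    if not mask.any():
        return np.zeros(features.shape[-1])
    return features[mask].mean(axis=0)

def make_mask(rows, cols, size=64):
    m = np.zeros((size, size), dtype=bool)
    m[rows, cols] = True
    return m

def fuse(G, S, W):
    """M = proj(concat(G, S)), with proj modelled as a single linear map."""
    return np.concatenate([G, S]) @ W

rng = np.random.default_rng(0)
C, stride = 8, 16
feat_lr = rng.normal(size=(4, 4, C))             # 64x64 input at stride 16

# Dead mask: a 3x3 object disappears at stride 16.
small = make_mask(slice(5, 8), slice(5, 8))
S_small = pooled_semantic(feat_lr, downsample_mask(small, stride))

# Semantic conflict: two different masks land on the same low-res cell.
mask_a = make_mask(slice(0, 3), slice(0, 3))
mask_b = make_mask(slice(0, 10), slice(0, 2))
S_a = pooled_semantic(feat_lr, downsample_mask(mask_a, stride))
S_b = pooled_semantic(feat_lr, downsample_mask(mask_b, stride))

# Hypothetical spatial tokens (stand-ins for SAM mask-decoder tokens).
G_small, G_a, G_b = rng.normal(size=(3, C))
W = rng.normal(size=(2 * C, C))
M_small, M_a, M_b = fuse(G_small, S_small, W), fuse(G_a, S_a, W), fuse(G_b, S_b, W)

print("dead mask is zero:", not S_small.any(), "| fused is nonzero:", M_small.any())
print("conflict in S:", np.allclose(S_a, S_b), "| resolved in M:", not np.allclose(M_a, M_b))
```

In this toy setup the semantic feature alone cannot separate the cases, while the fused feature can, which is the motivation the text gives for the spatial branch.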

③ Mask Feature Enhancer:
- Further enhances semantic and spatial priors in mask features through masked cross-attention layers.
- Language-fused backbone features (with positional encoding) serve as key/value, constraining attention to foreground regions.
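Masked cross-attention can be sketched as ordinary cross-attention whose scores are set to negative infinity outside each mask's foreground. The version below is a minimal single-head illustration without positional encoding or learned projections. The fallback to full attention for an empty mask is an assumption borrowed from common masked-attention practice, not something the source states.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, fg_mask):
    """Cross-attention restricted to each query's foreground tokens.

    queries: (K, D) mask features.
    keys, values: (N, D) backbone tokens.
    fg_mask: (K, N) bool, True where token n lies inside mask k.
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    # Assumption: a mask with no foreground tokens attends everywhere,
    # so its softmax row stays well defined.
    empty = ~fg_mask.any(axis=1, keepdims=True)
    allowed = fg_mask | empty
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)                      # exp(-inf) == 0 for background
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
K, N, D = 2, 6, 4
queries = rng.normal(size=(K, D))
keys = rng.normal(size=(N, D))
values = rng.normal(size=(N, D))
fg_mask = np.zeros((K, N), dtype=bool)
fg_mask[0, :3] = True        # mask 0 covers tokens 0..2; mask 1 is empty

out, weights = masked_cross_attention(queries, keys, values, fg_mask)
print(np.round(weights, 3))
```

Row 0 of `weights` places all of its mass on the first three tokens, while row 1 (the empty mask) spreads over all six.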

Loss & Training

Two-stage training: Stage 1 trains EventSAM following its original protocol; Stage 2 freezes EventSAM and trains only the fusion network.
Training data: Mixed-24K (merging DDD17-Seg and DSEC-Semantic training sets, totaling 24,032 pairs).
Loss function: Cosine similarity distillation loss that aligns event mask features with both visual and textual guidance simultaneously:

\[\mathcal{L}_{distill} = \sum_{l \in \{s,i,p\}} \frac{1}{K_l}(1 - \cos(\hat{M}_l^{evt}, M_l^{img})) + \sum_{l \in \{s,i,p\}} \frac{1}{K_l}(1 - \cos(\hat{M}_l^{evt}, M_l^{text}))\]
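The loss can be written in a few lines of NumPy. This sketch assumes that \(K_l\) is the number of masks at level \(l\) and that the cosine term is summed over those masks before dividing by \(K_l\); the formula above leaves the per-mask sum implicit. The dictionary layout and function names are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (K, D) arrays."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / np.maximum(den, eps)

def distill_loss(evt, img, txt):
    """evt, img, txt: dicts mapping level ('s', 'i', 'p') to (K_l, D) arrays."""
    total = 0.0
    for level in ("s", "i", "p"):
        k = evt[level].shape[0]                                   # K_l (assumed)
        total += (1.0 - cosine(evt[level], img[level])).sum() / k  # visual term
        total += (1.0 - cosine(evt[level], txt[level])).sum() / k  # textual term
    return float(total)

rng = np.random.default_rng(0)
sizes = {"s": 3, "i": 5, "p": 7}                 # toy mask counts per level
target = {l: rng.normal(size=(k, 16)) for l, k in sizes.items()}
flipped = {l: -v for l, v in target.items()}

loss_aligned = distill_loss(target, target, target)    # perfectly aligned features
loss_opposite = distill_loss(flipped, target, target)  # anti-aligned features
print(loss_aligned, loss_opposite)
```

With three levels and two guidance terms, the loss ranges from 0 (perfect alignment) to 12 (every pair anti-aligned) under these assumptions.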

Key Experimental Results

Four Evaluation Benchmarks

| Benchmark | Source | Test Size | Resolution | # Classes | Evaluation Dimension |
|---|---|---|---|---|---|
| DDD17-Ins | DDD17-Seg | 3,890 | 352×200 | 6 | Coarse-grained instance segmentation |
| DSEC11-Ins | DSEC-Semantic | 2,809 | 640×440 | 11 | Medium-grained instance segmentation |
| DSEC19-Ins | DSEC-Semantic | 2,809 | 640×440 | 19 | Fine-grained instance segmentation |
| DSEC-Part | DSEC-Semantic | 2,809 | 640×440 | 9 (5+4) | Part-level segmentation |

Main Results (Table 1: Closed-Set Instance Segmentation, Box Prompt AP)

| Method | Category | DDD17-Ins AP | DSEC11-Ins AP | DSEC19-Ins AP | Inference Time (ms) | Params (M) |
|---|---|---|---|---|---|---|
| OVSAM | AR-CDG | 21.6 | 22.2 | 11.6 | 102.27 | 314.7 |
| OpenSeg | Hybrid | 35.0 | 23.6 | 13.0 | 427.01 | 228.4 |
| MaskCLIP++ | Hybrid | 32.8 | 25.4 | 14.1 | 394.61 | 301.7 |
| frame2recon | AF-DA | 34.8 | 21.2 | 10.5 | 278.35 | 141.7 |
| frame2voxel | AF-DA | 33.6 | 21.3 | 11.3 | 88.19 | 109.1 |
| SEAL (Ours) | AF-DA | 38.2 | 28.8 | 14.8 | 22.28 | 99.1 |
| Gain | - | +3.2 | +3.4 | +0.7 | - | - |
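The Gain row can be re-derived from the other rows. The short check below assumes that each gain is SEAL's AP minus the best baseline AP in the same column, which is an interpretation that happens to match the reported numbers; the values are copied from the table above.

```python
# AP per method on (DDD17-Ins, DSEC11-Ins, DSEC19-Ins), copied from Table 1.
baselines = {
    "OVSAM": (21.6, 22.2, 11.6),
    "OpenSeg": (35.0, 23.6, 13.0),
    "MaskCLIP++": (32.8, 25.4, 14.1),
    "frame2recon": (34.8, 21.2, 10.5),
    "frame2voxel": (33.6, 21.3, 11.3),
}
seal = (38.2, 28.8, 14.8)

# Best baseline per benchmark, then SEAL's margin over it.
best = tuple(max(ap[i] for ap in baselines.values()) for i in range(3))
gains = tuple(round(s - b, 1) for s, b in zip(seal, best))
print(best, gains)
```

Note that the strongest baseline differs by column (OpenSeg on DDD17-Ins, MaskCLIP++ on both DSEC benchmarks), so the gains are not measured against a single competitor.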

Part Segmentation Results (Table 2: DSEC-Part)

| Method | Point AP | Box AP |
|---|---|---|
| VLPart | 12.9 | 16.1 |
| SEAL | 13.6 | 18.3 |
| Gain | +0.7 | +2.2 |

Ablation Study: Hierarchical Semantic Guidance (Table 3)

Removing part-level guidance → part segmentation AP decreases (DSEC-Part Box: 14.4–15.4 vs. 18.3).
Removing instance/semantic-level guidance → instance segmentation AP decreases.
Using all three granularities yields the best performance, validating the necessity of hierarchical guidance.

Ablation Study: Model Architecture (Table 5)

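| Fusion | SE (Spatial Encoding) | MFE (Mask Feature Enhancer) | DDD17 Box AP | DSEC-Part Box AP |
|---|---|---|---|---|
| ✓ | | | 35.5 | 14.9 |
| ✓ | ✓ | | 35.7 | 15.7 |
| ✓ | | ✓ | 38.1 | 16.6 |
| ✓ | ✓ | ✓ | 38.2 | 18.3 |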

Efficiency Advantage

SEAL achieves an inference time of 22.28 ms, the lowest among all compared methods; the next-fastest baseline, frame2voxel, takes 88.19 ms (approximately 4× slower).
With 99.1M parameters, SEAL has the fewest parameters among the methods listed in Table 1. The frame2spike baseline (not listed above) is slightly smaller at 95.9M parameters but performs much worse.
The single-backbone architecture avoids the redundancy of baseline methods that require two separate backbones for mask generation and classification.
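The efficiency figures can be checked directly against the Table 1 numbers. The snippet below only redoes the arithmetic on the reported inference times and parameter counts; it does not measure anything.

```python
# (inference time in ms, parameters in millions), copied from Table 1.
methods = {
    "OVSAM": (102.27, 314.7),
    "OpenSeg": (427.01, 228.4),
    "MaskCLIP++": (394.61, 301.7),
    "frame2recon": (278.35, 141.7),
    "frame2voxel": (88.19, 109.1),
    "SEAL": (22.28, 99.1),
}

seal_ms, seal_params = methods["SEAL"]
# How many times slower and larger each baseline is relative to SEAL.
slowdown = {m: ms / seal_ms for m, (ms, _) in methods.items() if m != "SEAL"}
size_ratio = {m: p / seal_params for m, (_, p) in methods.items() if m != "SEAL"}

fastest_baseline = min(slowdown, key=slowdown.get)
for name in sorted(slowdown, key=slowdown.get):
    print(f"{name:12s} {slowdown[name]:5.2f}x slower, {size_ratio[name]:4.2f}x params")
```

The smallest ratio belongs to frame2voxel at roughly 3.96×, which is where the "approximately 4×" figure comes from; the slowest listed baseline, OpenSeg, is about 19× slower.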

Highlights & Insights

First definition of the OV-EIS task: Advances open-vocabulary event understanding from the semantic level to the instance level, filling a research gap.
Elegant hierarchical semantic guidance design: Leverages SAM's inherent three-level mask mechanism to construct part/instance/semantic three-granularity supervision in a natural and effective manner.
Annotation-free training framework: Requires only event–image pairs without any manual dense annotations; supervision signals are automatically generated via CLIP and MLLM.
Dual advantage in efficiency and performance: Inference is about 4× faster than the fastest baseline in Table 1, the parameter count is among the lowest, and AP is the highest on all three instance benchmarks, which makes SEAL well-suited for low-power edge deployment of event cameras.
Spatial encoding module resolves dead masks and semantic conflicts: Spatial priors from SAM mask tokens compensate for semantic features, with UMAP visualizations clearly demonstrating improvements in the feature space.
Four self-constructed evaluation benchmarks: Cover label granularity (6/11/19 classes) and semantic granularity (instance/part), providing a complete evaluation framework for future research.

Limitations & Future Work

Dependency on event–image paired data: Training still requires temporally synchronized event–image pairs, so the method cannot be trained on datasets that contain events only.
Visual prompts still required: Inference requires user-provided point/box prompts; the prompt-free SEAL++ variant is only briefly mentioned in the appendix.
Benchmark limitations: All four benchmarks are drawn from driving scenarios (DDD17/DSEC), lacking validation across diverse settings such as indoor and industrial environments.
Limited number of categories: The closed-set evaluation covers at most 19 classes, and truly large-scale open-vocabulary capability has not been demonstrated.
Dependence on paired image quality: The MHSG hierarchical guidance is derived from the paired images, so its quality may degrade in extreme conditions (such as low light or overexposure) where conventional frames are unreliable.
Two-stage training: Requires training EventSAM before training the fusion network, resulting in a relatively complex training pipeline.

| Direction | Representative Works | Relation to This Paper |
|---|---|---|
| Event semantic segmentation | EV-SegNet, ESS, HALSIE, HMNet | Prior work; performs semantic segmentation only |
| Event instance segmentation | EventSAM | Base model of this work; performs class-agnostic segmentation only |
| Open-vocabulary event understanding | OpenESS, EventCLIP, EventBind | Semantic-level only; this paper advances to the instance level |
| Image open-vocabulary segmentation | CLIP, MaskCLIP, OpenSeg, OVSeg | Used as mask classifiers in baselines |
| SAM and variants | SAM, OVSAM, Mask-Adapter | Provide spatial priors and baseline comparisons |

Rating

Novelty: ⭐⭐⭐⭐ — First to define the OV-EIS task; MHSG hierarchical guidance design is original and effective.
Experimental Thoroughness: ⭐⭐⭐⭐ — 4 benchmarks, 11 baseline comparisons, 3 ablation studies, with thorough visualization analysis.
Writing Quality: ⭐⭐⭐⭐ — Clear structure, rigorous problem definition, and well-motivated exposition.
Value: ⭐⭐⭐⭐ — Opens a new direction for open-world understanding in event vision; the framework is efficient and practically applicable.