SEAL: Segment Any Events with Language¶
- Conference: ICLR 2026
- arXiv: 2601.23159
- Code: https://0nandon.github.io/SEAL (coming soon)
- Area: Autonomous Driving
- Keywords: Event camera, open-vocabulary instance segmentation, SAM, CLIP, multimodal fusion, annotation-free training
TL;DR¶
This paper introduces the open-vocabulary event instance segmentation (OV-EIS) task for the first time and proposes the SEAL framework. Through multimodal hierarchical semantic guidance (MHSG) and a lightweight multimodal fusion network, SEAL achieves multi-granularity (instance-level + part-level) semantic segmentation of event streams using only event–image pairs (without dense annotations), substantially outperforming all baselines while delivering the fastest inference speed.
Background & Motivation¶
Advantages of event cameras: Event cameras offer extremely high temporal resolution, ultra-low latency, high dynamic range, and low power consumption, remaining effective in scenarios where conventional cameras fail, such as low-light and overexposed conditions.
Limitations of existing event segmentation: Existing event semantic segmentation (ESS) methods are confined to closed-set vocabularies and cannot recognize objects outside the training categories, nor can they distinguish between different instances of the same class.
Open-vocabulary event understanding is nascent: OpenESS achieves only open-vocabulary semantic segmentation without instance-level recognition; EventSAM supports event instance segmentation but lacks semantic recognition capability.
Absence of evaluation benchmarks: No multi-semantic benchmark dataset for event instance segmentation has previously existed.
Efficiency requirements: Event cameras are commonly deployed on edge devices, necessitating parameter-efficient and fast-inference model designs.
Domain gap: Directly applying image-domain pretrained models to event streams introduces a large domain gap due to noise and artifacts, even when images are reconstructed via E2VID.
Method¶
Overall Architecture¶
SEAL belongs to the annotation-free domain adaptation (AF-DA) category:

- During training: only event–image pairs \((I^{evt}, I^{img})\) are used; no dense event annotations are required.
- During inference: only the event input \(I^{evt}\) is needed; masks and category predictions are generated from user-provided visual prompts (points/boxes).
- The framework consists of two core components: the MHSG module (providing multimodal hierarchical semantic supervision) and the multimodal fusion network (a lightweight mask classifier). The resulting inference interface is sketched below.
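A minimal sketch of what this inference interface might look like in PyTorch; the class, module names, and signatures below are illustrative assumptions rather than the authors' implementation:

```python
# Illustrative sketch of SEAL's inference interface (assumed structure, not the
# authors' code): class-agnostic masks come from a frozen EventSAM, and the
# lightweight fusion network classifies each mask against CLIP text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SEALInference(nn.Module):
    def __init__(self, event_backbone, mask_decoder, fusion_net, text_encoder):
        super().__init__()
        self.event_backbone = event_backbone  # frozen EventSAM encoder (trained in stage 1)
        self.mask_decoder = mask_decoder      # frozen prompt-based EventSAM mask decoder
        self.fusion_net = fusion_net          # multimodal fusion network (trained in stage 2)
        self.text_encoder = text_encoder      # CLIP text encoder for class names / user text

    @torch.no_grad()
    def forward(self, events, prompts, class_names):
        feats = self.event_backbone(events)                      # dense event features
        masks, mask_tokens = self.mask_decoder(feats, prompts)   # class-agnostic masks
        text_emb = self.text_encoder(class_names)                # (C, D) open-vocab class embeddings
        mask_emb = self.fusion_net(feats, masks, mask_tokens, text_emb)  # (N, D) mask features
        sim = F.normalize(mask_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
        return masks, sim.argmax(dim=-1)                         # masks + a class index per mask
```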
Key Designs 1: Multimodal Hierarchical Semantic Guidance (MHSG)¶
Hierarchical visual guidance:

- SAM is applied to the paired images to generate segmentation maps at three granularities: semantic-level \(M_s^{img}\), instance-level \(M_i^{img}\), and part-level \(M_p^{img}\).
- Pixel-level features are extracted via the CLIP visual encoder and then RoI-pooled to obtain visual features for each level of masks (a pooling sketch follows).
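A minimal stand-in for this step, assuming masked average pooling in place of the paper's RoI pooling and that the SAM masks have already been resized to the CLIP feature resolution:

```python
# Masked average pooling of CLIP pixel features under SAM masks: a simple
# stand-in for the RoI pooling described above (shapes and names are assumptions).
import torch
import torch.nn.functional as F


def pool_mask_features(pixel_feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """pixel_feats: (D, H, W) CLIP visual features; masks: (N, H, W) binary SAM masks."""
    masks = masks.float()
    area = masks.sum(dim=(1, 2)).clamp(min=1.0)               # (N,) pixels per mask
    feats = torch.einsum("nhw,dhw->nd", masks, pixel_feats)   # sum features inside each mask
    return F.normalize(feats / area[:, None], dim=-1)         # (N, D) guidance vector per mask


# Example: visual guidance for 3 masks over an 8x8 grid of 512-d CLIP features.
guidance = pool_mask_features(torch.randn(512, 8, 8), torch.rand(3, 8, 8) > 0.5)
```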
Hierarchical textual guidance:

- A LLaMA-based MLLM generates rich textual descriptions for each mask.
- These descriptions are encoded via the CLIP text encoder to form hierarchical textual guidance signals (see the encoding sketch below).
- Unlike OpenESS, this approach does not rely on predefined category names but instead leverages the MLLM to produce diverse and rich vocabulary.
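A small sketch of the text-guidance encoding using the openai-clip package; the captions are made-up placeholders standing in for the MLLM's per-mask descriptions:

```python
# Encoding per-mask captions (e.g. produced by an MLLM) into CLIP text embeddings.
# Minimal sketch with the openai-clip package; the captions are invented examples.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

captions = [
    "a silver sedan driving on a wet road at night",  # instance-level description
    "the left front wheel of a car",                  # part-level description
]
with torch.no_grad():
    tokens = clip.tokenize(captions).to(device)
    text_guidance = model.encode_text(tokens)                      # (2, 512) text features
    text_guidance = text_guidance / text_guidance.norm(dim=-1, keepdim=True)
```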
Key Designs 2: Multimodal Fusion Network (Three Components)¶
① Backbone Feature Enhancer:

- Six multimodal fusion modules (self-attention + cross-attention + FFN) are stacked on top of the EventSAM backbone features (a schematic block is sketched below).
- During training, the textual guidance \(M_l^{text}\) serves as the key/value in cross-attention.
- During inference, dataset class names or user-defined language inputs are used instead.
- Mask features are pooled from the language-fused features via RoI-Align.
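A schematic PyTorch sketch of one such fusion block and a six-block stack; the dimensions, pre-norm placement, and other details are assumptions, not the paper's exact configuration:

```python
# One multimodal fusion module (self-attention + cross-attention to text + FFN),
# sketched schematically; widths and norm placement are assumptions.
import torch
import torch.nn as nn


class MultimodalFusionBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, event_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # event_feats: (B, HW, D) flattened backbone features; text_feats: (B, T, D) guidance.
        x = event_feats
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        # Text guidance (training) or class-name embeddings (inference) act as key/value.
        x = x + self.cross_attn(self.norm2(x), text_feats, text_feats)[0]
        x = x + self.ffn(self.norm3(x))
        return x


# Stacking six blocks on top of (frozen) EventSAM backbone features.
fusion = nn.ModuleList([MultimodalFusionBlock() for _ in range(6)])
feats, text = torch.randn(1, 32 * 32, 256), torch.randn(1, 19, 256)  # 19 class names at inference
for block in fusion:
    feats = block(feats, text)
```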
② Spatial Encoding:

- Addresses two problems: dead masks (masks of small objects vanish after downsampling, yielding zero vectors) and semantic conflict (masks with different semantics project onto the same region of the low-resolution feature map).
- Mask tokens from the SAM mask decoder are used to encode spatial priors such as shape and location.
- Spatial features \(G_l^{evt}\) and semantic features \(S_l^{evt}\) are concatenated and projected: \(M_l^{evt} = \text{proj}(\text{concat}(G_l^{evt}, S_l^{evt}))\) (see the sketch below).
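A tiny sketch of the concatenation-and-projection step; the feature widths are assumptions:

```python
# Fusing spatial priors (from SAM mask tokens) with semantic mask features,
# following M^evt = proj(concat(G^evt, S^evt)); dimensions are illustrative.
import torch
import torch.nn as nn

proj = nn.Linear(256 + 256, 256)            # projection back to the mask-feature width

G = torch.randn(10, 256)                    # spatial features from SAM mask tokens (10 masks)
S = torch.randn(10, 256)                    # semantic features pooled from fused backbone maps
M = proj(torch.cat([G, S], dim=-1))         # (10, 256); a mask keeps a non-zero signature even
                                            # if its pooled semantic feature collapses to zero
```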
③ Mask Feature Enhancer:

- Further enhances semantic and spatial priors in the mask features through masked cross-attention layers (sketched below).
- Language-fused backbone features (with positional encoding) serve as key/value, and attention is constrained to foreground regions.
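A schematic sketch of masked cross-attention with `nn.MultiheadAttention`; shapes, the single layer, and the random foreground masks are assumptions:

```python
# Masked cross-attention: each mask feature (query) attends only to its own
# foreground region of the language-fused backbone features (key/value).
import torch
import torch.nn as nn

dim, heads = 256, 8
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

mask_feats = torch.randn(1, 10, dim)          # (B, N, D): 10 mask features as queries
fused_feats = torch.randn(1, 32 * 32, dim)    # (B, HW, D): language-fused features + pos. enc.
fg = torch.rand(10, 32 * 32) > 0.7            # (N, HW): per-mask foreground regions

# In nn.MultiheadAttention a True entry in attn_mask blocks attention, so the
# foreground mask is inverted: background positions become True (blocked).
attn_mask = ~fg
out, _ = attn(mask_feats, fused_feats, fused_feats, attn_mask=attn_mask)
```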
Loss & Training¶
- Two-stage training: Stage 1 trains EventSAM following its original protocol; Stage 2 freezes EventSAM and trains only the fusion network.
- Training data: Mixed-24K (merging DDD17-Seg and DSEC-Semantic training sets, totaling 24,032 pairs).
- Loss function: A cosine similarity distillation loss aligns the event mask features with both the visual and the textual guidance simultaneously; a plausible form is given below.
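A plausible form, inferred from the description above rather than copied from the paper (the level-wise sum and equal weighting are assumptions, and \(M_l^{img}\) denotes the pooled visual guidance feature at level \(l\)):

$$
\mathcal{L} \;=\; \sum_{l \in \{s,\, i,\, p\}} \Big[\big(1 - \cos(M_l^{evt},\, M_l^{img})\big) \;+\; \big(1 - \cos(M_l^{evt},\, M_l^{text})\big)\Big]
$$

where \(M_l^{evt}\) is the event mask feature at level \(l\), \(M_l^{text}\) is the corresponding textual guidance, and \(\cos(\cdot,\cdot)\) denotes cosine similarity.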
Key Experimental Results¶
Four Evaluation Benchmarks¶
| Benchmark | Source | Test Size | Resolution | # Classes | Evaluation Dimension |
|---|---|---|---|---|---|
| DDD17-Ins | DDD17-Seg | 3,890 | 352×200 | 6 | Coarse-grained instance segmentation |
| DSEC11-Ins | DSEC-Semantic | 2,809 | 640×440 | 11 | Medium-grained instance segmentation |
| DSEC19-Ins | DSEC-Semantic | 2,809 | 640×440 | 19 | Fine-grained instance segmentation |
| DSEC-Part | DSEC-Semantic | 2,809 | 640×440 | 9 (5+4) | Part-level segmentation |
Main Results (Table 1: Closed-Set Instance Segmentation, Box Prompt AP)¶
| Method | Category | DDD17-Ins AP | DSEC11-Ins AP | DSEC19-Ins AP | Inference Time (ms) | Params (M) |
|---|---|---|---|---|---|---|
| OVSAM | AR-CDG | 21.6 | 22.2 | 11.6 | 102.27 | 314.7 |
| OpenSeg | Hybrid | 35.0 | 23.6 | 13.0 | 427.01 | 228.4 |
| MaskCLIP++ | Hybrid | 32.8 | 25.4 | 14.1 | 394.61 | 301.7 |
| frame2recon | AF-DA | 34.8 | 21.2 | 10.5 | 278.35 | 141.7 |
| frame2voxel | AF-DA | 33.6 | 21.3 | 11.3 | 88.19 | 109.1 |
| SEAL (Ours) | AF-DA | 38.2 | 28.8 | 14.8 | 22.28 | 99.1 |
| Gain | - | +3.2 | +3.4 | +0.7 | - | - |
Part Segmentation Results (Table 2: DSEC-Part)¶
| Method | Point AP | Box AP |
|---|---|---|
| VLPart | 12.9 | 16.1 |
| SEAL | 13.6 | 18.3 |
| Gain | +0.7 | +2.2 |
Ablation Study — Hierarchical Semantic Guidance (Table 3)¶
- Removing part-level guidance → part segmentation AP decreases (DSEC-Part Box: 14.4–15.4 vs. 18.3).
- Removing instance/semantic-level guidance → instance segmentation AP decreases.
- Using all three granularities yields the best performance, validating the necessity of hierarchical guidance.
Ablation Study — Model Architecture (Table 5)¶
| Fusion | SE | MFE | DDD17 Box AP | DSEC-Part Box AP |
|---|---|---|---|---|
| ✓ |  |  | 35.5 | 14.9 |
| ✓ | ✓ |  | 35.7 | 15.7 |
| ✓ |  | ✓ | 38.1 | 16.6 |
| ✓ | ✓ | ✓ | 38.2 | 18.3 |
Efficiency Advantage¶
- SEAL achieves an inference time of 22.28 ms, far below all baselines (the next-fastest baseline, frame2voxel, takes 88.19 ms, roughly 4× slower).
- With 99.1M parameters, SEAL is highly parameter-efficient; frame2spike is slightly smaller (95.9M parameters) but performs much worse.
- The single-backbone architecture avoids the redundancy of baseline methods that require two separate backbones for mask generation and classification.
Highlights & Insights¶
- First definition of the OV-EIS task: Advances open-vocabulary event understanding from the semantic level to the instance level, filling a research gap.
- Elegant hierarchical semantic guidance design: Leverages SAM's inherent three-level mask mechanism to construct part/instance/semantic three-granularity supervision in a natural and effective manner.
- Annotation-free training framework: Requires only event–image pairs without any manual dense annotations; supervision signals are automatically generated via CLIP and MLLM.
- Dual advantage in efficiency and performance: Inference speed is 4× faster than the fastest baseline, parameter count is the lowest, and AP is comprehensively highest — well-suited for low-power edge deployment of event cameras.
- Spatial encoding module resolves dead masks and semantic conflicts: Spatial priors from SAM mask tokens compensate for semantic features, with UMAP visualizations clearly demonstrating improvements in the feature space.
- Four self-constructed evaluation benchmarks: Cover label granularity (6/11/19 classes) and semantic granularity (instance/part), providing a complete evaluation framework for future research.
Limitations & Future Work¶
- Dependency on event–image paired data: Training still requires temporally synchronized event–image pairs, limiting applicability to purely event-based data.
- Visual prompts still required: Inference requires user-provided point/box prompts; the prompt-free SEAL++ variant is only briefly mentioned in the appendix.
- Benchmark limitations: All four benchmarks are drawn from driving scenarios (DDD17/DSEC), lacking validation across diverse settings such as indoor and industrial environments.
- Limited number of categories: The closed-set evaluation covers at most 19 classes, and truly large-scale open-vocabulary capability has not been demonstrated.
- E2VID reconstruction quality: The MHSG hierarchical guidance depends on paired image quality, which may degrade under extreme event conditions.
- Two-stage training: Requires training EventSAM before training the fusion network, resulting in a relatively complex training pipeline.
Related Work & Insights¶
| Direction | Representative Works | Relation to This Paper |
|---|---|---|
| Event semantic segmentation | EV-SegNet, ESS, HALSIE, HMNet | Prior work; performs semantic segmentation only |
| Event instance segmentation | EventSAM | Base model of this work; performs class-agnostic segmentation only |
| Open-vocabulary event understanding | OpenESS, EventCLIP, EventBind | Semantic-level only; this paper advances to the instance level |
| Image open-vocabulary segmentation | CLIP, MaskCLIP, OpenSeg, OVSeg | Used as mask classifiers in baselines |
| SAM and variants | SAM, OVSAM, Mask-Adapter | Provide spatial priors and baseline comparisons |
Rating¶
- Novelty: ⭐⭐⭐⭐ — First to define the OV-EIS task; MHSG hierarchical guidance design is original and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 4 benchmarks, 11 baseline comparisons, 3 ablation studies, with thorough visualization analysis.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, rigorous problem definition, and well-motivated exposition.
- Value: ⭐⭐⭐⭐ — Opens a new direction for open-world understanding in event vision; the framework is efficient and practically applicable.