PASS: Path-Selective State Space Model for Event-Based Recognition¶
Conference: NeurIPS 2025 arXiv: 2409.16953 Code: GitHub Area: Video Understanding / Event Camera Keywords: Event Camera, State Space Model, Frequency Generalization, Long-Sequence Modeling, Mamba
TL;DR¶
PASS proposes the Path-selective Event Aggregation and Scan (PEAS) module and the Multi-faceted Selection Guiding (MSG) loss, leveraging the linear complexity and frequency generalization capability of SSMs to perform event-based recognition across a broad distribution of event lengths from \(10^6\) to \(10^9\), while limiting performance degradation to only 8.62% under varying inference frequencies (compared to 20.69% for the baseline).
Background & Motivation¶
Event cameras are bio-inspired sensors that asynchronously capture brightness changes, offering high temporal resolution, high dynamic range, and low latency. However, existing event-based recognition methods face two critical challenges:
Limited event-length distribution: Existing datasets cover event lengths only in the \(10^6\)–\(10^7\) range, whereas high-speed or long-duration event streams require handling a much wider range (\(10^6\)–\(10^9\)); the quadratic complexity of Transformers creates a computational bottleneck at large event volumes.
Poor inference-frequency generalization: Although event cameras have a natural advantage for high-speed dynamic scenes, model performance degrades significantly (up to −20.69%) when the inference sampling frequency deviates from the training frequency, preventing full exploitation of the high temporal resolution.
Existing model architectures each have their own drawbacks:
- Parallel (attention-based) structures: support parallel processing but incur quadratic attention complexity on long event sequences.
- Recurrent structures: cannot be parallelized and tend to forget early information.
Core Idea: Exploit the linear complexity and input-frequency generalization properties of SSMs (Mamba), combined with an adaptive event-frame selection mechanism, to handle broad event distributions and generalize across different inference frequencies.
Method¶
Overall Architecture¶
Event stream → Fixed-event-length sampling + frame aggregation → PEAS module (selective scan encoding into fixed-dimension features) → MSG loss-guided optimization → SSM spatio-temporal modeling module → Classification prediction
Key Designs¶
- Event Sampling and Frame Aggregation (sketched below):
- Samples are taken at fixed time windows \(1/f\) (where \(f\) is the sampling frequency), with a fixed number \(G\) of events per sample.
- This yields \(P = Tf\) event groups, converted into event-frame representations \(F \in \mathbb{R}^{P \times H \times W \times 3}\).
- Fixed-event-count aggregation is more robust than fixed-time-window aggregation.
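A minimal sketch of this fixed-event-count aggregation, assuming the event stream is an `(N, 4)` tensor of `(t, x, y, polarity)` tuples; the three-channel frame layout (positive counts, negative counts, total counts) is an illustrative assumption rather than the paper's exact encoding:

```python
import torch

def aggregate_events(events, T, f, G, H, W):
    """Fixed-event-count aggregation: split the stream into P = T*f groups
    of G events each and rasterize every group into an H x W x 3 frame."""
    P = int(T * f)
    frames = torch.zeros(P, H, W, 3)
    for p in range(P):
        group = events[p * G:(p + 1) * G]        # fixed number of events per group
        if group.shape[0] == 0:                  # trailing groups may be empty (zero-padded)
            continue
        x, y, pol = group[:, 1].long(), group[:, 2].long(), group[:, 3]
        pos, neg = (pol > 0).float(), (pol <= 0).float()
        zeros = torch.zeros_like(x)
        frames[p].index_put_((y, x, zeros), pos, accumulate=True)      # channel 0: positive events
        frames[p].index_put_((y, x, zeros + 1), neg, accumulate=True)  # channel 1: negative events
        frames[p].index_put_((y, x, zeros + 2), torch.ones_like(pos), accumulate=True)  # channel 2: counts
    return frames                                # F in R^{P x H x W x 3}
```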
- PEAS Module (Path-selective Event Aggregation and Scan; sketched below):
- Selection mask prediction: A two-layer 3D convolution followed by an activation function generates a selection mask \(M \in \mathbb{R}^{K \times P}\) from event frames \(F\) (\(K\) = number of selected frames, \(P\) = total frames).
- Differentiable selection: Gumbel Softmax is used during training for differentiable frame selection; standard Softmax is used at inference.
- Matrix-multiplication selection: Einsum multiplies the mask with original frames to obtain selected frames \(F' \in \mathbb{R}^{K \times H \times W \times 3}\).
- Bidirectional event scanning: Selected frames are unrolled into a 1D sequence in spatio-temporal order (following VideoMamba's spatio-temporal scanning), concatenated left-to-right and top-to-bottom.
- Core value: adaptively compresses variable-length event streams (\(10^6\)–\(10^9\)) into fixed-dimension features in an end-to-end learnable manner.
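A minimal PyTorch sketch of the PEAS selection path described above. The mask-prediction head, hidden width, and the hard Gumbel Softmax settings are illustrative assumptions, and the bidirectional spatio-temporal scan is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PEAS(nn.Module):
    """Sketch: predict a K x P selection mask from P event frames, sample it
    differentiably with Gumbel Softmax during training (plain Softmax at
    inference), and select K frames via an einsum over the mask."""

    def __init__(self, num_selected: int, num_frames: int, hidden: int = 16):
        super().__init__()
        self.K, self.P = num_selected, num_frames
        self.mask_net = nn.Sequential(            # two 3D convolutions + activation
            nn.Conv3d(3, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv3d(hidden, 1, kernel_size=3, padding=1),
        )
        self.proj = nn.Linear(num_frames, num_selected * num_frames)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, P, H, W, 3) -> (B, 3, P, H, W) for Conv3d
        x = frames.permute(0, 4, 1, 2, 3)
        scores = self.mask_net(x).mean(dim=(1, 3, 4))          # (B, P): one score per frame
        logits = self.proj(scores).view(-1, self.K, self.P)    # (B, K, P) selection logits
        if self.training:
            mask = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)  # one-hot rows, soft gradients
        else:
            mask = logits.softmax(dim=-1)
        # each of the K output frames is a mask-weighted combination of the P inputs
        return torch.einsum('bkp,bphwc->bkhwc', mask, frames)  # F' in R^{B x K x H x W x 3}
```

With `hard=True`, each row of the mask is one-hot in the forward pass (a discrete frame choice) while gradients flow through the soft probabilities, which is what makes the selection end-to-end trainable.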
- MSG Loss (Multi-faceted Selection Guiding; sketched below):
- WEIE Loss (Within-Frame Event Information Entropy):
- Computes the grayscale histogram entropy of each selected frame.
- Maximizing this loss encourages selection of information-rich frames and reduces the randomness of selecting empty (padded) frames.
- \(\mathcal{L}_{WEIE} = -\frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{N} P_i^k \log P_i^k\), where \(P_i^k\) is the probability of the \(i\)-th gray level in the histogram of the \(k\)-th selected frame.
- IEMI Loss (Inter-frame Event Mutual Information):
- Computes the joint-distribution mutual information between adjacent selected frames (including spatial position information).
- Minimizing this loss reduces redundancy among selected frames, ensuring each frame carries unique information.
- Total objective: \(\mathcal{L}_{total} = \mathcal{L}_{IEMI} - \mathcal{L}_{WEIE} + \mathcal{L}_{CLS}\)
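A simplified sketch of the two MSG terms on a stack of selected frames of shape `(K, H, W, 3)`; hard binning via `torch.histc` is only an approximation for illustration, since the paper's exact (differentiable) histogram and mutual-information estimators may differ:

```python
import torch

def weie_loss(selected, num_bins: int = 16):
    """Within-frame event information entropy, averaged over the K frames.
    Maximizing it favors information-rich (non-empty) frames."""
    entropies = []
    for frame in selected:                                # frame: (H, W, 3)
        gray = frame.mean(dim=-1).flatten()               # crude grayscale conversion
        p = torch.histc(gray, bins=num_bins)
        p = p / p.sum().clamp(min=1e-8)
        entropies.append(-(p * (p + 1e-8).log()).sum())   # Shannon entropy of the histogram
    return torch.stack(entropies).mean()

def iemi_loss(selected, num_bins: int = 16):
    """Inter-frame event mutual information between adjacent selected frames,
    estimated from a joint 2D histogram. Minimizing it reduces redundancy."""
    if selected.shape[0] < 2:
        return torch.zeros(())
    mis = []
    for a, b in zip(selected[:-1], selected[1:]):
        ga, gb = a.mean(dim=-1).flatten(), b.mean(dim=-1).flatten()
        ia = (ga / (ga.max() + 1e-8) * (num_bins - 1)).long()
        ib = (gb / (gb.max() + 1e-8) * (num_bins - 1)).long()
        joint = torch.zeros(num_bins, num_bins)
        joint.index_put_((ia, ib), torch.ones_like(ga), accumulate=True)
        pxy = joint / joint.sum()
        px, py = pxy.sum(1, keepdim=True), pxy.sum(0, keepdim=True)
        mis.append((pxy * ((pxy + 1e-8) / (px * py + 1e-8)).log()).sum())
    return torch.stack(mis).mean()

# Total objective as stated above, with cls_loss the cross-entropy term:
# total = iemi_loss(selected) - weie_loss(selected) + cls_loss
```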
- Event Spatio-Temporal Modeling Module (sketched below):
- 3D convolution (\(1\times16\times16\)) for patch embedding.
- Concatenation of a learnable CLS token + spatial positional embeddings + temporal embeddings.
- Input to \(L\) stacked B-Mamba blocks (bidirectional Mamba).
- CLS token extracted, passed through layer normalization and a linear classification head for final prediction.
- Initialized with VideoMamba pretrained weights.
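A skeleton of this module, assuming a bidirectional Mamba block implementation (e.g., VideoMamba's B-Mamba) is available; `block_cls` is a placeholder defaulting to `nn.Identity` purely so the sketch runs, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn


class EventSSMClassifier(nn.Module):
    """Sketch: 3D-conv patch embedding (1 x 16 x 16), CLS token plus spatial and
    temporal position embeddings, L stacked (bidirectional Mamba) blocks, then
    layer norm and a linear classification head on the CLS token."""

    def __init__(self, num_classes, num_selected, img_size=224, patch=16,
                 dim=192, depth=12, block_cls=nn.Identity):
        super().__init__()
        self.patch_embed = nn.Conv3d(3, dim, kernel_size=(1, patch, patch),
                                     stride=(1, patch, patch))
        n_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))   # spatial positions (+CLS)
        self.time_embed = nn.Parameter(torch.zeros(1, num_selected, dim))   # temporal positions
        self.blocks = nn.ModuleList([block_cls() for _ in range(depth)])    # stand-in for B-Mamba blocks
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames):
        # frames: (B, K, H, W, 3) selected event frames from PEAS
        x = self.patch_embed(frames.permute(0, 4, 1, 2, 3))   # (B, dim, K, H/16, W/16)
        B, _, K, h, w = x.shape
        x = x.flatten(3).permute(0, 2, 3, 1)                  # (B, K, h*w, dim)
        x = x + self.pos_embed[:, 1:] + self.time_embed[:, :K, None]
        x = x.flatten(1, 2)                                   # unrolled spatio-temporal token sequence
        cls = self.cls_token.expand(B, -1, -1) + self.pos_embed[:, :1]
        x = torch.cat([cls, x], dim=1)
        for blk in self.blocks:                               # bidirectional Mamba blocks in practice
            x = blk(x)
        return self.head(self.norm(x[:, 0]))                  # classify from the CLS token
```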
Loss & Training¶
- Total loss = IEMI (minimized) − WEIE (maximized) + cross-entropy classification loss.
- Model scales: Tiny (7M), Small (25M), Middle (74M).
- The number of selected frames \(K\) is a hyperparameter, with different values per dataset (1/2/8/16/32).
- Self-constructed datasets: ArDVS100 (100 action classes converted to events, lengths 1 s–256 s), TemArDVS100 (with fine-grained temporal annotations), Real-ArDVS10 (10-class real-world dataset).
Key Experimental Results¶
Main Results¶
| Dataset | Event Scale | Metric | Ours | Prev. SOTA | Gain |
|---|---|---|---|---|---|
| N-Caltech101 | ~\(10^6\) | Top-1 | 94.60% | EventDance: 92.35% | +2.25% |
| N-ImageNet | ~\(10^6\) | Top-1 | 61.32% | MEM: 57.89% | +3.43% |
| PAF | ~\(10^7\) | Top-1 | 98.28% | ExACT: 94.83% | +3.45% |
| SeAct | ~\(10^7\) | Top-1 | 66.38% | ExACT: 66.07% | +0.31% |
| HARDVS | ~\(10^7\) | Top-1 | 98.41% | S5-ViT: 95.98% | +2.43% |
| ArDVS100 | ~\(10^9\) | Top-1 | 97.35% | S5-ViT: 93.39% | +3.96% |
| TemArDVS100 | ~\(10^9\) | Top-1 | 89.00% | S5-ViT: 79.62% | +9.38% |
| Real-ArDVS10 | ~\(10^9\) | Top-1 | 100% | S5-ViT: 93.33% | +6.67% |
Ablation Study¶
| Configuration | PAF Top-1 | ArDVS100 Top-1 | Notes |
|---|---|---|---|
| No sampling | 92.90% | 92.31% | All frames used directly |
| Random sampling | 92.98% | 92.23% | Random selection of \(K\) frames |
| PEAS | 93.33% | 92.84% | +0.35% / +0.61% over random sampling |
| PEAS + MSG | 94.83% | 93.85% | +1.85% / +1.62% over random sampling |
| Method | Performance Drop (Train 60 Hz → Infer 100 Hz) |
|---|---|
| Time-window baseline | −20.69% |
| Event-count baseline | ~−15% |
| PASS | −8.62% |
Key Findings¶
- Although PEAS compresses the number of frames, the selected frames retain task-critical information (on PAF, PEAS alone outperforms the no-sampling baseline by 0.43%).
- The two components of the MSG loss are complementary: IEMI reduces redundancy (+0.77%), while WEIE further reduces selection randomness (an additional +1.08%).
- PASS maintains strong performance (97.35%) on \(10^9\)-scale events, where baseline methods struggle with such long sequences.
- Frequency generalization is a core advantage of PASS: regardless of whether training is conducted at 20 Hz, 60 Hz, or 100 Hz, cross-frequency inference performance degrades by at most 8.62%.
Highlights & Insights¶
- Natural fit between SSM and event cameras: The linear complexity and frequency generalization of SSMs align perfectly with the high temporal resolution characteristics of event streams.
- Information-theoretic frame selection: Using information entropy and mutual information as selection guidance signals is more principled than heuristic rules.
- End-to-end selection via Gumbel Softmax: Differentiable frame selection enables end-to-end training of PEAS, avoiding the complexity of two-stage training pipelines.
- Self-constructed long-sequence datasets: ArDVS100 and TemArDVS100 fill the gap in \(10^9\)-scale event recognition benchmarks.
- Practical significance of frequency generalization: In real-world deployment, inference frequencies often differ from training frequencies; the strong generalization of PASS substantially reduces deployment difficulty.
Limitations & Future Work¶
- Larger-scale VideoMamba models exhibit overfitting, necessitating better regularization strategies.
- The number of selected frames \(K\) is set manually and cannot be determined adaptively.
- Event-frame representation is only one of many event representations; comparisons with voxel grids, time surfaces, and other representations are insufficient.
- The self-constructed datasets are synthesized by concatenation, which may introduce distribution gaps relative to real-world continuous long-duration event streams.
Related Work & Insights¶
- vs. ExACT: ExACT brute-forces the problem with a 471M-parameter model, whereas PASS achieves superior results more efficiently with 74M parameters.
- vs. S5-ViT: S5-ViT is the first to introduce SSMs into event-based detection but addresses frequency generalization via a low-pass constraint loss; PASS tackles frequency generalization more fundamentally through frame selection.
- vs. VideoMamba: PASS builds upon VideoMamba but introduces the PEAS module and MSG loss tailored to event-stream characteristics, rather than directly applying the original framework.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The motivation for applying SSMs to event-based recognition is natural; PEAS + MSG exhibits originality, though it does not represent a breakthrough contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Five public datasets plus three self-constructed datasets, comprehensive frequency generalization experiments, and thorough ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear and figures are abundant, though some notation in the equations is slightly ambiguous.
- Value: ⭐⭐⭐⭐ — Provides an efficient long-sequence modeling solution for event-based recognition; the frequency generalization property offers strong practical utility.