EventGait: Towards Robust Gait Recognition with Event Streams¶

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/QUEAHREN/EventGait
Area: Human Understanding / Gait Recognition
Keywords: Gait Recognition, Event Camera, Spiking Neural Networks, Cross-modal Structural Alignment, Dual-stream Network

TL;DR¶

EventGait performs gait recognition using event cameras and proposes a dual-stream framework: a dynamic stream uses a Mixture of Spiking Experts (MoSE) with varying membrane time constants to adaptively capture multi-scale motion, while a static stream employs Cross-modal Structural Alignment (CroSA) with DINOv2 as a teacher to distill dense shape priors into sparse events. It matches camera-based methods in normal light and significantly outperforms them in low light (+37.3% at night) on both synthetic and real event gait benchmarks.

Background & Motivation¶

Background: Gait recognition is a non-intrusive, privacy-preserving biometric modality capable of long-distance operation. Prevailing modalities include silhouettes, skeletons, parsing maps, RGB images, and point clouds; recent methods have shifted from intermediate representations to end-to-end RGB approaches.

Limitations of Prior Work: RGB cameras suffer from low temporal resolution ($\approx 30$ ms) and narrow dynamic range ($< 80$ dB), leading to degraded visual cues under low light, occlusion, or motion blur. While LiDAR point clouds are robust to illumination, they are prohibitively expensive and energy-intensive (approx. $\$75$K per LiDAR vs. $\$1.5$K per event camera), hindering large-scale deployment.

Key Challenge: Image-based methods and 3D sensors sit at opposite ends of a robustness-versus-cost trade-off. Gait recognition requires both a stable static body shape and a high-frequency dynamic motion rhythm. Event cameras, with microsecond-level temporal resolution ($< 3$ µs) and high dynamic range ($> 120$ dB), naturally suppress identity-irrelevant texture (clothing, color). However, prior methods (e.g., EV-Gait) aggregate event streams into grid-like "event frames" over long windows. This aggregation loses two things: ① high-frequency temporal cues are smoothed out, erasing the dynamic rhythm; ② the spatial representation is too sparse for deep networks to extract discriminative appearance features.

Goal: To fully exploit the potential of event cameras without degrading temporal precision or spatial density, creating a robust gait recognizer that matches RGB performance in normal light and surpasses it in low light.

Key Insight: Robust event-based gait recognition should not only encode static shapes but also preserve fine-grained dynamics from high-resolution events. Thus, "dynamics" and "shapes" should be explicitly decoupled and modeled via specialized mechanisms.

Core Idea: Use "short-term segments + Mixture of Spiking Experts" to model high-frequency motion and "long-term segments + VFM distillation" to recover dense shapes. This dual-stream approach preserves temporal advantages while compensating for spatial sparsity.

Method¶

Overall Architecture¶

EventGait is an end-to-end dual-stream framework. The input event stream is represented at two temporal scales over a long window $T$: short-term segments $\mathbf{E}_d$ of width $\Delta T = T/K$ feed the dynamic stream, and a single long-term segment $\mathbf{E}_s$ aggregating the entire window $T$ feeds the static stream. The dynamic stream is a Mixture of Spiking Experts (MoSE) that outputs motion features $\mathbf{F}_d$; the static stream is a CNN encoder, supervised through Cross-modal Structural Alignment (CroSA) by a frozen DINOv2 teacher, that outputs dense shape features $\mathbf{F}_s$. The two feature sets are fused via $\Phi(\cdot)$ into a unified gait descriptor $\mathbf{F}_{gait}$ for downstream decoding and classification. The RGB teacher branch of CroSA exists only during training; at inference, the static stream processes events alone.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Event Stream"] --> B["Dual-scale Representation<br/>Split short Ed + long Es"]
    B -->|"Short-term Ed"| C["MoSE Dynamic Stream<br/>Multi-tau Spiking Experts"]
    B -->|"Long-term Es"| D["CroSA Static Stream<br/>VFM Distill Dense Shape"]
    C -->|"Motion Features Fd"| E["Fusion + Classification<br/>Unified Gait Descriptor"]
    D -->|"Shape Features Fs"| E
    E --> F["Identity Output"]

Key Designs¶

1. Dual-scale Event Representation + Dual-stream Decomposition

Event cameras asynchronously record log-intensity changes $e_i=(x_i,y_i,t_i,p_i)$. When changes exceed a threshold $c$, an event with polarity $p_i\in\{+1,-1\}$ is triggered. For deep network compatibility, the window $T$ is divided into $K$ bins using a linear interpolation kernel:

\[E_p(x,y,k)=\sum_{e_i\in E_p}\max\!\Big(0,\,1-\frac{|t_i-t_k|}{\Delta T}\Big),\quad \Delta T=T/K,\]

where the sum runs over the events of polarity $p$ at pixel $(x,y)$ and $t_k$ is the reference time of bin $k$, resulting in $\mathbf{E}\in\mathbb{R}^{2\times K\times H\times W}$. The dynamic stream uses the short-term $\mathbf{E}_d$ (high-frequency motion), while the static stream uses the long-term $\mathbf{E}_s$ (stable shape), so the conflict between temporal precision and spatial stability is resolved by assigning each requirement to its own stream.
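The binning rule above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function name `events_to_voxel`, the choice of bin centers as the reference times $t_k$, and the channel order (index 0 for negative, index 1 for positive polarity) are all assumptions made for the example.

```python
import numpy as np

def events_to_voxel(x, y, t, p, K, H, W, t_start, t_end):
    """Accumulate events into a (2, K, H, W) grid with a triangular temporal kernel.

    Assumptions (not from the paper): t_k is the center of bin k, and
    channel 0 holds p = -1 events while channel 1 holds p = +1 events.
    """
    x, y = np.asarray(x, dtype=np.int64), np.asarray(y, dtype=np.int64)
    t, p = np.asarray(t, dtype=np.float64), np.asarray(p)
    grid = np.zeros((2, K, H, W), dtype=np.float64)
    dT = (t_end - t_start) / K                       # bin width, Delta T = T / K
    centers = t_start + (np.arange(K) + 0.5) * dT    # assumed reference times t_k
    pol = (p > 0).astype(np.int64)                   # polarity -> channel index
    for k in range(K):
        # Triangular weight max(0, 1 - |t_i - t_k| / Delta T) for every event.
        w = np.maximum(0.0, 1.0 - np.abs(t - centers[k]) / dT)
        m = w > 0
        kk = np.full(int(m.sum()), k, dtype=np.int64)
        # np.add.at accumulates correctly when several events hit the same cell.
        np.add.at(grid, (pol[m], kk, y[m], x[m]), w[m])
    return grid

# Two events in a 2 x 2 sensor over a window of length 2, split into K = 2 bins.
voxel = events_to_voxel(x=[0, 1], y=[0, 1], t=[0.5, 1.0], p=[-1, 1],
                        K=2, H=2, W=2, t_start=0.0, t_end=2.0)
print(voxel.shape)
```

An event that falls between two bin centers contributes to both bins in proportion to its temporal distance, which is how the kernel keeps some sub-bin timing information instead of hard-assigning each event to one bin.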

2. MoSE (Mixture of Spiking Experts)

Dense CNNs and RNNs do not naturally exploit the sparsity of events, so the authors use Spiking Neural Networks (SNNs). The membrane potential $U(t)$ of a leaky integrate-and-fire (LIF) neuron integrates sparse input spikes over time:

\[\tau\frac{dU(t)}{dt}=-U(t)+R\cdot I(t),\qquad S(t)=\Theta(U(t)-U_{th}),\]

where the membrane time constant $\tau$ controls decay. Small $\tau$ ("fast" neurons) capture high-frequency bursts in bright scenes, while large $\tau$ ("slow" neurons) accumulate sparse signals in low light. MoSE uses $N$ parallel spiking experts $\{E_1,\dots,E_N\}$ with different initial $\tau_i$. A spiking gating network $\mathcal{G}(\cdot)$ adaptively calculates mixing coefficients $\alpha_i$:

\[\hat{E}_t=\sum_{i=1}^{N}\alpha_i\,E_i(E_t).\]
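To make the role of $\tau$ concrete, the sketch below discretizes the LIF equation with a forward-Euler step and mixes several LIF "experts" with softmax weights. It is a toy under stated assumptions: the hard reset to zero after a spike, the step size `dt`, the softmax over externally supplied `gate_logits` (the paper uses a spiking gating network), and the reduction of each expert to a single LIF layer are all simplifications introduced here.

```python
import numpy as np

def lif_forward(inputs, tau, dt=1.0, R=1.0, u_th=1.0):
    """Forward-Euler LIF: U <- U + (dt / tau) * (-U + R * I), spike when U >= u_th."""
    inputs = np.asarray(inputs, dtype=np.float64)    # shape (T, ...)
    u = np.zeros_like(inputs[0])
    spikes = np.zeros_like(inputs)
    for step in range(inputs.shape[0]):
        u = u + (dt / tau) * (-u + R * inputs[step])
        fired = u >= u_th
        spikes[step] = fired
        u = np.where(fired, 0.0, u)                  # hard reset (an assumption)
    return spikes

def softmax(z):
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()

def mose_forward(inputs, taus, gate_logits):
    """Weighted sum of expert outputs: sum_i alpha_i * E_i(input)."""
    alphas = softmax(gate_logits)                    # stand-in for the gating network
    outs = np.stack([lif_forward(inputs, tau) for tau in taus])
    return np.tensordot(alphas, outs, axes=1), alphas

# Constant drive for 20 steps: the fast neuron fires far more often than the slow one.
drive = [[1.5]] * 20
print(lif_forward(drive, tau=2.0).sum(), lif_forward(drive, tau=10.0).sum())
mixed, alphas = mose_forward(drive, taus=[2.0, 10.0], gate_logits=[0.0, 0.0])
```

With the same input, a small $\tau$ reacts within a couple of steps while a large $\tau$ needs many steps of accumulation before firing, which is the behavior the mixture is meant to select between.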

3. CroSA (Cross-modal Structural Alignment)

To address the sparsity of $\mathbf{E}_s$, CroSA employs cross-modal distillation. A frozen DINOv2 teacher $F_{teacher}$ processes grayscale versions $\mathbf{I}_g$ of the synchronized RGB frames to produce $z_{img}$. A CNN-based static event encoder $F_{student}$, followed by an alignment layer $\mathcal{A}(\cdot)$, produces $z_{evs}$. A squared $\ell_2$ loss pulls the two embeddings together:

\[z_{img}=F_{teacher}(\mathbf{I}_g),\quad z_{evs}=\mathcal{A}(F_{student}(\mathbf{E}_s)),\qquad \mathcal{L}_{align}=\|z_{evs}-z_{img}\|_2^2.\]
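The alignment term can be sketched as follows. This is an illustration under assumptions, not the released code: the alignment layer is modeled as a single linear map (`project`), the loss is averaged over the batch, and random arrays stand in for the DINOv2 and CNN features, whose real dimensions are not given here.

```python
import numpy as np

def project(student_feat, W, b):
    """Hypothetical linear alignment layer A(.): maps student features to teacher width."""
    return np.asarray(student_feat, dtype=np.float64) @ np.asarray(W) + np.asarray(b)

def align_loss(z_evs, z_img):
    """Squared l2 distance ||z_evs - z_img||^2, averaged over the batch (an assumption)."""
    diff = np.asarray(z_evs, dtype=np.float64) - np.asarray(z_img, dtype=np.float64)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

rng = np.random.default_rng(0)
student_feat = rng.normal(size=(4, 16))   # placeholder for F_student(E_s)
z_img = rng.normal(size=(4, 32))          # placeholder for frozen teacher output
W, b = rng.normal(size=(16, 32)) * 0.1, np.zeros(32)

z_evs = project(student_feat, W, b)
print(align_loss(z_evs, z_img))
```

Because the teacher is frozen, `z_img` acts as a fixed target: in a real training loop only the student encoder and the alignment layer would receive gradients from this term.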

4. Synthetic Pipeline + Benchmarks

Due to the lack of massive event gait data, the authors developed a pipeline to synthesize event streams from RGB videos (using frame interpolation and v2e models). This created CCGR-Mini-E (970 IDs, 53 covariates) and SUSTech1K-E (1,050 IDs, multi-modal pairs).
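The principle behind frame-to-event synthesis can be illustrated with a heavily simplified model: a pixel emits one event each time its log intensity moves by the contrast threshold $c$. The sketch below is not v2e and omits frame interpolation, noise, and per-pixel reference memory; the function name, the threshold `c=0.2`, and the stabilizing constant `eps` are assumptions chosen for the example.

```python
import numpy as np

def frames_to_event_counts(prev, curr, c=0.2, eps=1e-3):
    """Count positive and negative events per pixel between two grayscale frames.

    Simplified model: the number of events is floor(|delta log I| / c), with the
    sign of the change giving the polarity.
    """
    prev = np.asarray(prev, dtype=np.float64)
    curr = np.asarray(curr, dtype=np.float64)
    delta = np.log(curr + eps) - np.log(prev + eps)  # log-intensity change
    n = np.floor(np.abs(delta) / c).astype(np.int64)
    pos = np.where(delta > 0, n, 0)                  # brightening -> p = +1
    neg = np.where(delta < 0, n, 0)                  # darkening   -> p = -1
    return pos, neg

prev = [[1.0, 1.0], [1.0, 1.0]]
curr = [[2.0, 0.5], [1.0, 1.1]]   # brighter, darker, unchanged, sub-threshold change
pos, neg = frames_to_event_counts(prev, curr)
print(pos)
print(neg)
```

Static or slowly varying pixels produce no events in this model, which is why event data emphasizes moving body contours and is sparse elsewhere.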

Loss & Training¶

The total loss combines classification and alignment:

\[\mathcal{L}_{total}=\mathcal{L}_{ce}+\mathcal{L}_{tri}+\lambda_d\mathcal{L}_{align},\]

where $\mathcal{L}_{ce}$ is cross-entropy and $\mathcal{L}_{tri}$ is triplet loss. $\lambda_d$ (default 0.2) balances the alignment term.
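How the three terms combine can be sketched with plain NumPy losses. The triplet formulation (Euclidean distance, margin 0.2, pre-selected triplets) is an assumption for illustration; the source only states that a triplet loss is used and that $\lambda_d$ defaults to 0.2.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of identity logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(a, p) - d(a, n) + margin; the margin value is an assumption."""
    a, p, n = (np.asarray(v, dtype=np.float64) for v in (anchor, positive, negative))
    d_ap = np.linalg.norm(a - p, axis=1)
    d_an = np.linalg.norm(a - n, axis=1)
    return float(np.maximum(0.0, d_ap - d_an + margin).mean())

def total_loss(l_ce, l_tri, l_align, lambda_d=0.2):
    """L_total = L_ce + L_tri + lambda_d * L_align."""
    return l_ce + l_tri + lambda_d * l_align

l_ce = cross_entropy([[2.0, 0.5, 0.1]], [0])
l_tri = triplet_loss([[0.0, 0.0]], [[0.3, 0.0]], [[0.4, 0.0]])
print(total_loss(l_ce, l_tri, l_align=1.5))
```

Keeping $\lambda_d$ small means the alignment term shapes the static encoder without dominating the identity objectives.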

Key Experimental Results¶

Main Results¶

In-domain evaluation on SUSTech1K-E (Rank-1, %). EventGait (4.6M parameters) outperforms state-of-the-art camera-based methods by 5.2 points overall and slightly exceeds the LiDAR-based LidarGait++ on clothing change and overall accuracy, though not at night:

| Input | Method | Params | CL (Clothing) | NT (Night) | Overall |
|---|---|---|---|---|---|
| Point Cloud | LidarGait++ (CVPR25) | 4.4M | 92.4 | 92.2 | 92.7 |
| Silhouette | GaitBase (CVPR23) | 8.0M | 49.6 | 25.9 | 76.1 |
| Event | EV-Gait (CVPR19) | 45.2M | 67.8 | 78.7 | 65.4 |
| Event | Ours | 4.6M | 93.3 | 84.8 | 92.8 |

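Against camera-based baselines, the reported gains reach 18.4 points under clothing change and 37.3 points at night; relative to the GaitBase row above, the gaps are larger still (43.7 points on CL and 58.9 points on NT).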

Ablation Study¶

The dual-stream architecture is critical, and three MoSE experts (the default) clearly outperform a single expert (SUSTech1K-E, Rank-1 %):

| Configuration | NM | CL | NT | Overall | Note |
|---|---|---|---|---|---|
| Static Only | 82.6 | 61.6 | 76.9 | 82.0 | Missing motion |
| Dynamic Only | 74.5 | 52.0 | 71.7 | 72.4 | Missing structure |
| Dual-stream | 92.5 | 78.1 | 84.8 | 92.8 | Full model |
| MoSE (1 expert) | — | — | — | 88.4 | Standard SNN |
| MoSE (3 experts) | — | — | — | 92.8 | Default |

Key Findings¶

Dual-stream complementarity: Removing either stream costs more than 10 points overall (static only: 82.0, dynamic only: 72.4, vs. 92.8 for the full model).
Multi-tau MoSE: Using 3 experts (92.8) gains 4.4 points over a single-expert SNN (88.4).
Moderate CroSA supervision: $\lambda_d=0.2$ is optimal; excessive weighting introduces identity-irrelevant textures from RGB.
Low-light superiority: Performance drops only $9.6\%$ across light conditions, whereas silhouette methods drop $30\%+$.

Highlights & Insights¶

Explicit decoupling of dynamics and shape allows each mechanism to focus on its strength, avoiding trade-offs in temporal/spatial resolution.
MoSE applies MoE to the temporal domain of spiking neurons, letting the network hold different "temporal filters" (membrane constants) for varying lighting/motion.
CroSA utilizes VFM knowledge transfer without increasing inference cost, distilling dense human structure priors into sparse event encoders.
Cost-efficiency: Achieving LiDAR-level performance at a fraction of the hardware cost ($1/50$).

Limitations & Future Work¶

Reliance on synthetic data: The large-scale benchmarks (CCGR-Mini-E, SUSTech1K-E) are synthesized from RGB video, introducing a sim-to-real gap.
Training-time RGB: CroSA requires synchronized RGB frames for distillation, which may not always be available in pure event-collection pipelines.
Cross-domain challenges: Rank-1 on CCGR-Mini remains low ($3.7\%$ in Table 3), indicating that high covariate and viewpoint variance is still unresolved.

vs. EV-Gait: EventGait preserves sub-bin temporal precision and uses a dual-stream design, achieving higher accuracy (92.8 vs. 65.4 overall) with roughly 10x fewer parameters (4.6M vs. 45.2M).
vs. Silhouette methods: These fail under low light; EventGait remains robust due to the high dynamic range of event cameras.
vs. LiDAR methods: EventGait matches LidarGait++ in overall accuracy (92.8 vs. 92.7) with much cheaper sensors, although LiDAR still leads at night (92.2 vs. 84.8). This suggests that 2D event streams can provide sufficient spatio-temporal dynamics.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Decoupling + MoSE + CroSA provides a fresh paradigm for event gait recognition.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive evaluation across 5 categories with comprehensive ablations.
Writing Quality: ⭐⭐⭐⭐ Clear motivation; some implementation details are relegated to the appendix.
Value: ⭐⭐⭐⭐⭐ High potential for low-cost, robust surveillance and security applications.