GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences¶
Conference: ICLR2026 arXiv: 2508.07782 Code: To be confirmed Area: Human Understanding Keywords: gait recognition, snippet paradigm, temporal modeling, silhouette, 2D convolution
TL;DR¶
This paper proposes a Snippet paradigm that organizes gait silhouette sequences into several "snippets," each formed by randomly sampling frames from a contiguous interval. This design captures both short-range temporal context and long-range temporal dependencies, achieving 77.5% Rank-1 on Gait3D with a 2D convolution backbone, surpassing all 3D convolution methods.
Background & Motivation¶
Gait recognition takes silhouette sequences as input. Two mainstream modeling paradigms currently exist:
Unordered Set: Exemplified by GaitSet, this approach treats all frames as an unordered set, extracts features independently via 2D convolution, and aggregates them through Set Pooling. While efficient and robust to frame-order perturbations, independent per-frame processing discards short-range temporal context between adjacent frames.
Ordered Sequence: Exemplified by GaitGL and DeepGaitV2-3D, this approach jointly models spatiotemporal features using 3D/P3D convolutions. Although capable of capturing local temporal patterns, training typically samples only ~30 consecutive frames, making it difficult to model long-range temporal dependencies (real-world sequences can exceed 200 frames).
Core Problem: Can a new paradigm simultaneously achieve short-range temporal awareness and long-range temporal coverage?
The authors draw inspiration from human cognition: recognizing a person typically requires observing only a few key actions, not necessarily a complete gait cycle. They therefore propose viewing gait as a combination of individualized actions, with each action represented by a snippet.
Method¶
Overall Architecture¶
The silhouette sequence is divided into equal-length segments; frames are randomly sampled from each segment to form a snippet (representing a local action), and the overall gait feature is obtained by aggregating features from multiple snippets.
3.1 Snippet Sampling¶
Training phase:
- Divide the sequence into \(K\) equal-length segments (length \(L=16\), approximately one gait cycle)
- Randomly select \(M=4\) segments; randomly sample \(N=8\) frames from each segment to form one snippet
- Total sampled frames \(S = M \times N = 32\)
- The length of the first segment \(L_1\) is randomly drawn from \(\{1,\ldots,L\}\) to increase sampling diversity
Inference phase:
- All frames are used: all frames within each segment form a snippet, with \(M=K, N=L\)
- The first segment length is fixed at \(L\) to ensure prediction stability
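The sampling procedure above can be sketched in plain Python (function names are hypothetical; the handling of segments shorter than \(N\), via sampling with replacement, is an assumption not specified in the summary):

```python
import random

def sample_snippets_train(seq_len, L=16, M=4, N=8, rng=random):
    """Training-time snippet sampling (sketch of the paper's scheme).

    Splits the sequence into length-L segments, with the first segment's
    length L1 drawn from {1, ..., L} for diversity, then picks M segments
    at random and samples N frames from each to form M snippets.
    """
    L1 = rng.randint(1, L)
    starts = [0] + list(range(L1, seq_len, L))
    segments = [list(range(s, min(s + L, seq_len))) for s in starts]
    segments = [seg for seg in segments if seg]
    # Pick M distinct segments; fall back to replacement if too few exist.
    if len(segments) >= M:
        chosen = rng.sample(segments, M)
    else:
        chosen = [rng.choice(segments) for _ in range(M)]
    snippets = []
    for seg in chosen:
        # Sample N frames per segment (with replacement for short segments
        # -- an assumption, not stated in the summary), kept in time order.
        if len(seg) >= N:
            frames = sorted(rng.sample(seg, N))
        else:
            frames = sorted(rng.choices(seg, k=N))
        snippets.append(frames)
    return snippets  # M snippets of N frame indices each

def sample_snippets_test(seq_len, L=16):
    """Inference: every length-L segment becomes one snippet; all frames used."""
    return [list(range(s, min(s + L, seq_len))) for s in range(0, seq_len, L)]
```

At inference the segment grid is fixed (no random \(L_1\)), matching the paper's note that the first segment length is pinned to \(L\) for prediction stability.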
3.2 Snippet Modeling¶
The model, termed GaitSnippet, addresses three sub-problems:
(1) Intra-Snippet Modeling¶
Goal: capture local temporal context within a snippet to enhance frame-level features. The Snippet Block is designed as follows:
- Gathering: Treats frames within a snippet as an unordered set and aggregates them into a snippet-level representation via Temporal Max Pooling (non-parametric)
- Smoothing: Applies a \(1\times1\) convolution to the aggregated features to suppress noise and reduce the semantic gap between frame-level and snippet-level representations
- Residual: Fuses snippet-level contextual information with frame-level features via residual connection
The Snippet Block is embedded between the two spatial convolution layers of a standard 2D residual block, forming the Residual Snippet Block as the backbone's basic building unit. The design is inspired by P3D, enabling frame-level features to continuously perceive local temporal context throughout the hierarchical feature extraction process.
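The three steps can be sketched in numpy, assuming features shaped `[C, S, H, W]` with \(S = M \times N\) frames grouped snippet-by-snippet (this layout, and the dense channel-mixing matrix standing in for the \(1\times1\) convolution, are illustrative assumptions):

```python
import numpy as np

def snippet_block(x, W_smooth, M, N):
    """Sketch of the Snippet Block on features x of shape [C, M*N, H, W].

    Gathering: non-parametric temporal max pooling within each snippet.
    Smoothing: a 1x1 convolution, i.e. channel mixing at every location.
    Residual:  add the snippet-level context back to every frame.
    """
    C, S, H, W = x.shape
    assert S == M * N
    frames = x.reshape(C, M, N, H, W)
    # Gathering: max over the N frames of each snippet -> [C, M, H, W]
    snippet = frames.max(axis=2)
    # Smoothing: 1x1 conv == per-pixel channel mixing with W_smooth [C, C]
    smoothed = np.einsum('dc,cmhw->dmhw', W_smooth, snippet)
    # Residual: broadcast snippet context onto its N frames
    out = frames + smoothed[:, :, None]
    return out.reshape(C, S, H, W)
```

Because Gathering and the residual add are cheap and the only parameters sit in the \(1\times1\) convolution, the block slots between the two spatial convolutions of a 2D residual block at little extra cost.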
(2) Cross-Snippet Modeling¶
- At the backbone output, Intra-Snippet Gathering is first applied to frame-level features to obtain snippet-level representations
- All snippets are then treated as an unordered set and aggregated into a sequence-level representation via Temporal Max Pooling
- This forms a hierarchical unordered set structure: frames → snippets → sequence
Key distinction: Although Set Pooling is applied at both levels, the overall model is not frame-level permutation-invariant: the Snippet Block has already injected local temporal context into the frame-level features before any pooling takes place.
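The frames → snippets → sequence hierarchy can be sketched with two stages of max pooling (shapes and names are assumptions, matching the layout used above):

```python
import numpy as np

def cross_snippet_aggregate(x, M, N):
    """Hierarchical set pooling sketch: frames -> snippets -> sequence.

    x: backbone frame-level features of shape [C, M*N, H, W],
    grouped snippet-by-snippet along the temporal axis.
    """
    C, S, H, W = x.shape
    assert S == M * N
    frames = x.reshape(C, M, N, H, W)
    # Intra-Snippet Gathering: max over the N frames of each snippet
    snippet_feats = frames.max(axis=2)   # [C, M, H, W]
    # Cross-snippet: treat snippets as an unordered set and max-pool again
    seq_feat = snippet_feats.max(axis=1)  # [C, H, W]
    return snippet_feats, seq_feat
```

Note that a max of per-snippet maxima equals a global max over all frames; the value of the hierarchy lies in exposing the intermediate snippet-level features, which feed the snippet-level supervision branch.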
(3) Snippet-Level Supervision¶
The snippet paradigm naturally produces two levels of representation (sequence-level and snippet-level). An additional supervision branch is introduced for snippet-level features:
- Sequence-level loss: Triplet Loss \(\mathcal{L}_{tp}\) + Cross-Entropy Loss \(\mathcal{L}_{ce}\) (with BNNeck)
- Snippet-level loss: \(\mathcal{L}_{tp}^{\star}\) + \(\mathcal{L}_{ce}^{\star}\), constructing positive/negative pairs at the snippet granularity
- Total loss: \(\mathcal{L}_{all} = \mathcal{L}_{tp} + \mathcal{L}_{ce} + \alpha(\mathcal{L}_{tp}^{\star} + \mathcal{L}_{ce}^{\star})\), with \(\alpha=0.75\)
- The snippet-level branch is used only during training and incurs no inference overhead
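As a quick arithmetic check of the total loss, a one-line helper (hypothetical name; the loss terms themselves would come from the triplet and cross-entropy heads):

```python
def total_loss(l_tp, l_ce, l_tp_snip, l_ce_snip, alpha=0.75):
    """L_all = L_tp + L_ce + alpha * (L_tp* + L_ce*), per the paper."""
    return l_tp + l_ce + alpha * (l_tp_snip + l_ce_snip)
```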
Backbone¶
Built upon DeepGaitV2-2D (a ResNet-style 2D convolution backbone), with standard residual blocks replaced by Residual Snippet Blocks. Horizontal Pyramid Mapping is adopted to extract multi-granularity local representations.
Key Experimental Results¶
Main Results (In-the-Wild Datasets)¶
| Method | Type | Backbone | Gait3D R1 | Gait3D mAP | GREW R1 | GREW R5 |
|---|---|---|---|---|---|---|
| GaitSet | Set | 2D | 36.7 | 30.0 | 48.4 | 63.6 |
| GaitBase | Set | 2D | 64.6 | 55.3 | 60.1 | 75.5 |
| DeepGaitV2-2D | Set | 2D | 68.2 | 60.4 | 68.6 | 82.0 |
| DeepGaitV2-3D | Seq | 3D | 72.8 | 63.9 | 79.4 | 88.9 |
| VPNet | Seq | 3D | 75.4 | — | 80.0 | 89.4 |
| SwinGait-3D | Seq | Swin3D | 75.0 | 67.2 | 79.3 | 88.9 |
| GaitSnippet | Snippet | 2D | 77.5 | 69.4 | 81.7 | 90.9 |
Key findings:
- GaitSnippet with a 2D convolution backbone comprehensively outperforms all 3D convolution methods
- Compared to DeepGaitV2-2D with the same backbone: Gait3D R1 improves by 9.3 points (68.2% → 77.5%) and mAP by 9.0 points (60.4% → 69.4%)
- Achieves AVG 95.1% on CCPG (clothing-change scenario) and R1 42.4% on CCGR-MINI, both state-of-the-art
Ablation Study (Gait3D)¶
Effect of Snippet Sampling:
- Simply replacing DeepGaitV2-2D's sampling strategy from Set to Snippet raises R1 from 68.2% to 69.5% (+1.3 points), indicating that snippet sampling itself provides a regularization effect
- Optimal hyperparameters: \(L=16, M=4, N=8\)
Snippet Block components:
- Removing Gathering: R1 drops to 73.3% (snippet-level supervision becomes infeasible)
- Removing Smoothing: R1 drops to 74.8% (increased noise and semantic gap)
- Removing Residual: R1 drops to 72.5% (loss of frame-level fine-grained information; largest impact)
Snippet-Level Supervision:
- \(\alpha=0\) (no snippet-level supervision) still yields competitive performance, confirming the effectiveness of the Snippet Block itself
- Adding snippet-level supervision (\(\alpha=0.75\)) further improves performance
Highlights & Insights¶
Highlights:
- Proposes a third paradigm between sets and sequences, with a clear concept grounded in cognitive science
- Outperforms all 3D methods using only a 2D convolution backbone at lower computational cost
- Snippet Block design is concise (non-parametric pooling + \(1\times1\) conv + residual) and easy to integrate into existing architectures
- Hierarchical supervision (sequence-level + snippet-level) fully exploits the structural advantages of the snippet paradigm
- Achieves state-of-the-art performance in both controlled (CCPG) and unconstrained (Gait3D/GREW) scenarios
Limitations & Future Work¶
- The snippet length \(L\) is fixed at 16 frames, which may not be adaptive enough for individuals with highly variable cadence
- Cross-Snippet Modeling relies solely on Max Pooling for aggregation; more sophisticated inter-snippet relationship modeling (e.g., Transformer) remains unexplored
- During inference, all frames across all snippets must be processed, which may still incur high computational cost for long sequences
- Validation is limited to the silhouette modality; extension to skeleton, RGB, and other modalities has not been explored
Rating¶
- Novelty: ⭐⭐⭐⭐ Proposes the snippet paradigm with a well-defined conceptual contribution
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four datasets with comprehensive ablation studies
- Value: ⭐⭐⭐⭐ 2D backbone surpassing 3D methods demonstrates strong practical utility