GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences

Conference: ICLR 2026 · arXiv: 2508.07782 · Code: To be confirmed · Area: Human Understanding · Keywords: gait recognition, snippet paradigm, temporal modeling, silhouette, 2D convolution

TL;DR

This paper proposes a Snippet paradigm that organizes gait silhouette sequences into several "snippets," each formed by randomly sampling frames from a contiguous interval. This design captures both short-range temporal context and long-range temporal dependencies, achieving 77.5% Rank-1 on Gait3D with a 2D convolution backbone, surpassing all 3D convolution methods.

Background & Motivation

Gait recognition takes silhouette sequences as input. Two mainstream modeling paradigms currently exist:

Unordered Set: Exemplified by GaitSet, this approach treats all frames as an unordered set, extracts features independently via 2D convolution, and aggregates them through Set Pooling. While efficient and robust to frame-order perturbations, independent per-frame processing discards short-range temporal context between adjacent frames.

Ordered Sequence: Exemplified by GaitGL and DeepGaitV2-3D, this approach jointly models spatiotemporal features using 3D/P3D convolutions. Although capable of capturing local temporal patterns, training typically samples only ~30 consecutive frames, making it difficult to model long-range temporal dependencies (real-world sequences can exceed 200 frames).

Core Problem: Can a new paradigm simultaneously achieve short-range temporal awareness and long-range temporal coverage?

The authors draw inspiration from human cognition—recognizing a person typically requires observing only a few key actions (not necessarily a complete gait cycle)—and propose viewing gait as a combination of individualized actions, with each action represented by a snippet.

Method

Overall Architecture

The silhouette sequence is divided into equal-length segments; frames are randomly sampled from each segment to form a snippet (representing a local action), and the overall gait feature is obtained by aggregating features from multiple snippets.

3.1 Snippet Sampling

Training phase:

  • Divide the sequence into \(K\) equal-length segments (length \(L=16\), approximately one gait cycle)
  • Randomly select \(M=4\) segments; randomly sample \(N=8\) frames from each segment to form one snippet
  • Total sampled frames \(S = M \times N = 32\)
  • The length of the first segment \(L_1\) is randomly drawn from \(\{1,\ldots,L\}\) to increase sampling diversity

Inference phase:

  • All frames within each segment form one snippet (\(M=K\), \(N=L\)), so every frame is used
  • The first segment length is fixed at \(L\) to ensure prediction stability
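The two sampling regimes above can be sketched as follows. This is a hypothetical implementation based on the description, not the authors' code; boundary handling for the final (possibly short) segment, and within-segment sampling with replacement, are simplifying assumptions.

```python
import random

def sample_snippets(num_frames, L=16, M=4, N=8, training=True):
    """Return lists of frame indices, one list per snippet (illustrative sketch)."""
    # First segment length: random in {1..L} during training (diversity),
    # fixed at L during inference (stability).
    L1 = random.randint(1, L) if training else L
    # Segment boundaries: [0, L1), then consecutive length-L intervals.
    bounds, start, end = [], 0, L1
    while start < num_frames:
        bounds.append((start, min(end, num_frames)))
        start, end = end, end + L
    if training:
        # Pick M segments, then N frames from each (with replacement here,
        # a simplification so segments shorter than N still work).
        chosen = random.sample(bounds, M) if len(bounds) >= M else random.choices(bounds, k=M)
        return [sorted(random.choices(range(s, e), k=N)) for s, e in chosen]
    # Inference: every segment contributes all of its frames as one snippet.
    return [list(range(s, e)) for s, e in bounds]
```

With the paper's defaults, a training call yields \(M \times N = 32\) sampled frames regardless of sequence length, which is what keeps the temporal coverage long-range while each snippet stays local.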

3.2 Snippet Modeling

The model, termed GaitSnippet, addresses three sub-problems:

(1) Intra-Snippet Modeling

Goal: capture local temporal context within a snippet to enhance frame-level features. The Snippet Block is designed as follows:

  • Gathering: Treats frames within a snippet as an unordered set and aggregates them into a snippet-level representation via Temporal Max Pooling (non-parametric)
  • Smoothing: Applies a \(1\times1\) convolution to the aggregated features to suppress noise and reduce the semantic gap between frame-level and snippet-level representations
  • Residual: Fuses snippet-level contextual information with frame-level features via residual connection

The Snippet Block is embedded between the two spatial convolution layers of a standard 2D residual block, forming the Residual Snippet Block as the backbone's basic building unit. The design is inspired by P3D, enabling frame-level features to continuously perceive local temporal context throughout the hierarchical feature extraction process.
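A minimal sketch of the Gathering → Smoothing → Residual steps, with spatial dimensions folded away for brevity. On per-location features a \(1\times1\) convolution is just a linear map over channels, so it appears here as a matrix multiply; all names and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def snippet_block(x, w, b):
    """x: (M, N, C) frame-level features for M snippets of N frames each.
    w: (C, C), b: (C,) -- weights of the 1x1 'smoothing' convolution.
    Returns frame-level features enriched with snippet-level context."""
    snippet = x.max(axis=1)          # Gathering: temporal max pooling (non-parametric)
    ctx = snippet @ w.T + b          # Smoothing: 1x1 conv == channel mixing per location
    return x + ctx[:, None, :]       # Residual: broadcast context back onto each frame
```

Because the output keeps the frame-level shape, the block can be dropped between the two spatial convolutions of a standard 2D residual block, as the paper describes.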

(2) Cross-Snippet Modeling

  • At the backbone output, Intra-Snippet Gathering is first applied to frame-level features to obtain snippet-level representations
  • All snippets are then treated as an unordered set and aggregated into a sequence-level representation via Temporal Max Pooling
  • This forms a hierarchical unordered set structure: frames → snippets → sequence

Key distinction: Although Set Pooling is applied at both levels, the temporal modeling within each snippet via the Snippet Block means the overall model is not frame-level permutation-invariant—local temporal information has been incorporated into the frame-level features.
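The frames → snippets → sequence hierarchy can be expressed as two stages of set pooling. A sketch under the assumption that spatial pooling has already reduced each frame to a feature vector (shapes are illustrative):

```python
import numpy as np

def frames_to_sequence(frame_feats, M, N):
    """frame_feats: (M*N, C) backbone outputs for one walking sequence.
    Returns (snippet-level, sequence-level) representations."""
    frames = frame_feats.reshape(M, N, -1)
    snippets = frames.max(axis=1)     # Intra-Snippet Gathering (temporal max pooling)
    sequence = snippets.max(axis=0)   # cross-snippet Set Pooling over the snippet set
    return snippets, sequence
```

Note that two nested max pools equal one global max over frames; the discriminative difference comes from the Snippet Blocks upstream, which have already mixed local temporal context into the frame features.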

(3) Snippet-Level Supervision

The snippet paradigm naturally produces two levels of representation (sequence-level and snippet-level). An additional supervision branch is introduced for snippet-level features:

  • Sequence-level loss: Triplet Loss \(\mathcal{L}_{tp}\) + Cross-Entropy Loss \(\mathcal{L}_{ce}\) (with BNNeck)
  • Snippet-level loss: \(\mathcal{L}_{tp}^{\star}\) + \(\mathcal{L}_{ce}^{\star}\), constructing positive/negative pairs at the snippet granularity
  • Total loss: \(\mathcal{L}_{all} = \mathcal{L}_{tp} + \mathcal{L}_{ce} + \alpha(\mathcal{L}_{tp}^{\star} + \mathcal{L}_{ce}^{\star})\), with \(\alpha=0.75\)
  • The snippet-level branch is used only during training and incurs no inference overhead
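The total objective is a direct weighted sum of the four terms above:

```python
def total_loss(l_tp, l_ce, l_tp_star, l_ce_star, alpha=0.75):
    # L_all = L_tp + L_ce + alpha * (L_tp* + L_ce*), with alpha = 0.75.
    return l_tp + l_ce + alpha * (l_tp_star + l_ce_star)
```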

Backbone

Built upon DeepGaitV2-2D (a ResNet-style 2D convolution backbone), with standard residual blocks replaced by Residual Snippet Blocks. Horizontal Pyramid Mapping is adopted to extract multi-granularity local representations.

Key Experimental Results

Main Results (In-the-Wild Datasets)

| Method | Type | Backbone | Gait3D R1 | Gait3D mAP | GREW R1 | GREW R5 |
|---|---|---|---|---|---|---|
| GaitSet | Set | 2D | 36.7 | 30.0 | 48.4 | 63.6 |
| GaitBase | Set | 2D | 64.6 | 55.3 | 60.1 | 75.5 |
| DeepGaitV2-2D | Set | 2D | 68.2 | 60.4 | 68.6 | 82.0 |
| DeepGaitV2-3D | Seq | 3D | 72.8 | 63.9 | 79.4 | 88.9 |
| VPNet | Seq | 3D | 75.4 | – | 80.0 | 89.4 |
| SwinGait-3D | Seq | Swin3D | 75.0 | 67.2 | 79.3 | 88.9 |
| GaitSnippet | Snippet | 2D | 77.5 | 69.4 | 81.7 | 90.9 |

Key findings:

  • GaitSnippet with a 2D convolution backbone comprehensively outperforms all 3D convolution methods
  • Compared to DeepGaitV2-2D with the same backbone: Gait3D R1 improves by +9.3% and mAP by +9.0%
  • Achieves AVG 95.1% on CCPG (clothing-change scenario) and R1 42.4% on CCGR-MINI, both state-of-the-art

Ablation Study (Gait3D)

Effect of Snippet Sampling:

  • Simply replacing DeepGaitV2-2D's sampling strategy from Set to Snippet raises R1 from 68.2% to 69.5% (+1.3%), indicating that snippet sampling itself provides a regularization effect
  • Optimal hyperparameters: \(L=16, M=4, N=8\)

Snippet Block components:

  • Removing Gathering: R1 drops to 73.3% (snippet-level supervision becomes infeasible)
  • Removing Smoothing: R1 drops to 74.8% (increased noise and semantic gap)
  • Removing Residual: R1 drops to 72.5% (loss of frame-level fine-grained information; largest impact)

Snippet-Level Supervision:

  • \(\alpha=0\) (no snippet-level supervision) still yields competitive performance, confirming the effectiveness of the Snippet Block itself
  • Adding snippet-level supervision (\(\alpha=0.75\)) further improves performance

Highlights & Insights

Highlights:

  • Proposes a third paradigm between sets and sequences, with a clear concept grounded in cognitive science
  • Outperforms all 3D methods using only a 2D convolution backbone at lower computational cost
  • Snippet Block design is concise (non-parametric pooling + \(1\times1\) conv + residual) and easy to integrate into existing architectures
  • Hierarchical supervision (sequence-level + snippet-level) fully exploits the structural advantages of the snippet paradigm
  • Achieves state-of-the-art performance in both controlled (CCPG) and unconstrained (Gait3D/GREW) scenarios

Limitations & Future Work

  • The snippet length \(L\) is fixed at 16 frames, which may not be adaptive enough for individuals with highly variable cadence
  • Cross-Snippet Modeling relies solely on Max Pooling for aggregation; more sophisticated inter-snippet relationship modeling (e.g., Transformer) remains unexplored
  • During inference, all frames across all snippets must be processed, which may still incur high computational cost for long sequences
  • Validation is limited to the silhouette modality; extension to skeleton, RGB, and other modalities has not been explored

Rating

  • Novelty: ⭐⭐⭐⭐ Proposes the snippet paradigm with a well-defined conceptual contribution
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four datasets with comprehensive ablation studies
  • Value: ⭐⭐⭐⭐ 2D backbone surpassing 3D methods demonstrates strong practical utility