
Walking Further: Semantic-aware Multimodal Gait Recognition Under Long-Range Conditions

Conference: AAAI 2026 arXiv: 2603.14189 Code: github.com/O-VIGIA/LRGait Area: Autonomous Driving / Gait Recognition Keywords: Multimodal Gait Recognition, Long-Range Recognition, LiDAR-Camera Fusion, CLIP Semantic Guidance, Cross-Distance Retrieval

TL;DR

This paper introduces LRGait — the first LiDAR-Camera multimodal gait dataset targeting long-range (10–50m) cross-distance scenarios — and proposes EMGaitNet, an end-to-end framework that achieves 2D-3D cross-modal feature fusion via Semantic Mining (SeMi), Semantic-Guided Alignment (SGA), and Symmetric Cross-Attention Fusion (SCAF) modules, reaching state-of-the-art performance on multiple benchmarks.

Background & Motivation

State of the Field

Gait recognition is a non-intrusive and difficult-to-spoof biometric technology with important applications in intelligent surveillance and remote identity verification. While recent methods have achieved strong results in controlled environments, significant challenges remain under long-range and multimodal conditions.

Limitations of Prior Work

| Dataset | Sensor | Max Distance | Cross-Distance | Day/Night |
|---|---|---|---|---|
| CASIA-B | Camera | 2–4 m | ✗ | — |
| SUSTech1K | LiDAR+Camera | 8–12 m | ✗ | — |
| FreeGait | LiDAR+Camera | 25 m | ✗ | — |
| LRGait (Ours) | LiDAR+Camera | 10–50 m | ✓ | ✓ |

Core issues:

  1. Existing datasets extend to at most 25m, insufficient to cover real-world surveillance needs (e.g., 50m).
  2. No cross-distance samples for the same identity at varying distances (e.g., 50m→10m retrieval).
  3. Most methods support only a single modality, failing to exploit the complementary advantages of LiDAR and RGB.

Root Cause

Modality Gap: The representation spaces of RGB images and LiDAR point clouds differ substantially, making direct fusion ineffective.

Preprocessing Loss: Existing methods typically use depth maps (projected from point clouds) or silhouette maps (extracted from RGB) as inputs, discarding fine-grained geometric/textural details.

Long-Range Degradation: At long distances, point clouds become extremely sparse and RGB images become blurry, greatly exacerbating the degradation of preprocessing-based methods.

Method

Overall Architecture

EMGaitNet is an end-to-end framework that directly processes raw RGB video and raw point cloud sequences:

  1. Feature Extraction: ResNet9 extracts 2D visual features; PointGNN extracts 3D geometric features.
  2. SeMi Module: CLIP-based semantic mining that extracts body-part-aware semantic cues.
  3. SGA Module: Semantic-Guided Alignment that bridges the 2D-3D modality gap using semantic features.
  4. SCAF Module: Symmetric Cross-Attention Fusion that hierarchically integrates 2D-3D features.
  5. ST Module: Spatio-temporal module that captures global gait dynamics.

Key Designs

1. Dual-Stream Feature Extraction

2D Branch: Uses OpenGait's lightweight ResNet9 backbone to extract per-frame visual features \(F_{i,j}^{2d} \in \mathbb{R}^{h \times w \times d}\).

3D Branch: Employs the PointGNN backbone to process raw point clouds, progressively capturing local and global geometric patterns via graph convolution layers:

  • A local neighborhood graph is constructed based on feature cosine similarity (each point connects to its TopK most similar points): \(\mathcal{N}_{P_i^j}(P_i^j[k]) = \underset{u \neq k}{\text{TopK}}\left(\cos(F_{i,j}^{3d}[k], F_{i,j}^{3d}[u])\right)\)

  • Edge features are computed and aggregated: \(F_{i,j}^{3d}[k] = \text{Maxpool}_{u \in \mathcal{N}}(\text{MLP}(e_{k,u}))\)

Design Motivation: Graph-based representations in PointGNN are better suited for sparse point clouds (long-range scenarios) than PointNet++. Constructing graphs based on feature similarity rather than pure spatial distance mitigates the impact of point cloud sparsity.
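To make the graph construction concrete, here is a minimal PyTorch sketch of one feature-similarity graph-conv layer. The EdgeConv-style edge feature \(e_{k,u}\) (center feature concatenated with the neighbor offset), the neighborhood size, and all dimensions are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSimilarityGraphConv(nn.Module):
    """One graph-conv layer: kNN graph built from feature cosine similarity,
    edge MLP, then max-pool aggregation over neighbors (assumed dims)."""
    def __init__(self, dim: int, k: int = 16):
        super().__init__()
        self.k = k
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, d) per-frame point features
        normed = F.normalize(feats, dim=-1)
        sim = normed @ normed.t()                  # (N, N) cosine similarity
        sim.fill_diagonal_(float("-inf"))          # exclude self (u != k)
        idx = sim.topk(self.k, dim=-1).indices     # (N, K) neighbor indices
        neighbors = feats[idx]                     # (N, K, d)
        center = feats.unsqueeze(1).expand_as(neighbors)
        edges = torch.cat([center, neighbors - center], dim=-1)  # e_{k,u}
        return self.edge_mlp(edges).max(dim=1).values  # max-pool over neighbors
```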

2. CLIP Semantic Mining Module (SeMi)

Function: Leverages CLIP to extract body-part-level semantic cues, serving as an intermediate bridge for cross-modal alignment.

Core Idea:

  • Constructs body-part prompts: "A photo of the [PART] of a [X] person", where [PART] ∈ {"head", "arms", "torso", "legs", "feet"}.
  • Extracts global visual embeddings \(v = \text{CLIP}_v(I_i^j)\) using the CLIP visual encoder.
  • Maps visual features \(v\) to the text space via an Inversion Network to obtain \(v^*\), replacing the [X] placeholder.
  • Feeds the modified prompts into the CLIP text encoder to produce identity-aware semantic features \(t^* \in \mathbb{R}^{5 \times d}\).

Design Motivation:

  • Class-level semantics (e.g., "a person's legs") are insufficient; gait recognition requires instance-level fine-grained semantics.
  • The Inversion Network injects individual-specific visual information into the prompts, personalizing the semantic cues.
  • Body-part decomposition provides a natural intermediate representation for subsequent cross-modal alignment.
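The following is a structural sketch of SeMi. The CLIP encoders are abstracted as injected callables (assumed interfaces), and the two-layer inversion MLP is an assumption; a real implementation must splice \(v^*\) into the tokenized prompt at the [X] position before text encoding:

```python
import torch.nn as nn

BODY_PARTS = ["head", "arms", "torso", "legs", "feet"]

class SemanticMining(nn.Module):
    """Sketch of SeMi: invert a CLIP image embedding into a pseudo-word v*
    and build identity-aware per-part semantic features t*."""
    def __init__(self, clip_visual, clip_text, dim=512):
        super().__init__()
        self.clip_visual = clip_visual   # assumed: frames -> (B, dim), frozen CLIP
        self.clip_text = clip_text       # assumed: (prompts, v*) -> (B, 5, dim)
        self.inversion = nn.Sequential(  # Inversion Network: visual -> text space
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, frames):
        v = self.clip_visual(frames)                 # global visual embedding v
        v_star = self.inversion(v)                   # pseudo-word embedding v*
        prompts = [f"A photo of the {part} of a [X] person" for part in BODY_PARTS]
        return self.clip_text(prompts, v_star)       # t* in R^{5 x d}
```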

3. Semantic-Guided Alignment Module (SGA)

Function: Uses the multi-granularity semantic cues produced by SeMi as an intermediate bridge to align RGB and point cloud features.

Core Idea: Applies cross-attention with 2D/3D features as Query and semantic features \(t^*\) as Key/Value:

\[\text{CA}(\bar{F}_{i,j}^{2d}, t^*) = \text{Softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V\]

The attended features are first combined with the input via a residual connection, then refined by an FFN:

\[\tilde{F}_{i,j}^{2d} = \text{LayerNorm}(\bar{F}_{i,j}^{2d} + \text{CA}(\bar{F}_{i,j}^{2d}, t^*))\]

\[\hat{F}_{i,j}^{2d} = \text{LayerNorm}(\tilde{F}_{i,j}^{2d} + \text{FFN}(\tilde{F}_{i,j}^{2d}))\]

The same operation is applied to 3D features \(F_{i,j}^{3d}\), yielding aligned \(\hat{F}_{i,j}^{3d}\).

Design Motivation: Semantic features serve as shared alignment anchors, aligning features from different modalities in a common semantic space while suppressing background noise (e.g., spurious associations between LiDAR point clouds and irrelevant RGB background regions) via the attention mechanism.
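A minimal PyTorch sketch of one SGA block follows, directly mirroring the equations above (Query = modality features, Key/Value = \(t^*\)); the dimensions and head count are assumptions:

```python
import torch.nn as nn

class SemanticGuidedAlignment(nn.Module):
    """Sketch of SGA: modality features (flattened 2D patches or 3D points)
    attend to the semantic cues t* as Key/Value."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, feats, t_star):
        # feats: (B, N, d) modality features; t_star: (B, 5, d) semantic cues
        attended, _ = self.cross_attn(feats, t_star, t_star)  # Q=feats, K=V=t*
        x = self.norm1(feats + attended)                      # residual + norm
        return self.norm2(x + self.ffn(x))                    # FFN refinement
```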

4. Symmetric Cross-Attention Fusion Module (SCAF)

Function: Hierarchically integrates complementary information in the aligned shared space through symmetric dual-stream cross-attention.

Core Idea: 2D and 3D features alternately serve as Query while attending to the other as Key/Value, enabling bidirectional alignment and mutual information fusion:

\[F_{i,j}^{2d'} = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O\]

where each attention head is: \(\text{head}_k = \text{Softmax}\left(\frac{Q_k K_k^\top}{\sqrt{d_h}}\right) V_k\)

\(Q_k\) is derived from 2D features and \(K_k, V_k\) from 3D features (and vice versa).

Design Motivation: The symmetric structure ensures both modalities contribute equally; the multi-head design captures complementary information across different subspaces; hierarchical fusion progressively refines cross-modal features.
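A sketch of one symmetric fusion layer is shown below (the paper stacks such layers hierarchically; `nn.MultiheadAttention` supplies the per-head projections and \(W^O\) internally, and the dimensions are assumptions):

```python
import torch.nn as nn

class SymmetricCrossAttentionFusion(nn.Module):
    """Sketch of one SCAF layer: bidirectional multi-head cross-attention
    between aligned 2D and 3D features, with residual fusion."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn_2d_to_3d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_3d_to_2d = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f2d, f3d):
        # f2d: (B, HW, d) image tokens; f3d: (B, N, d) point tokens
        f2d_up, _ = self.attn_2d_to_3d(f2d, f3d, f3d)  # 2D queries attend to 3D
        f3d_up, _ = self.attn_3d_to_2d(f3d, f2d, f2d)  # 3D queries attend to 2D
        return f2d + f2d_up, f3d + f3d_up              # residual fusion
```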

5. Spatio-Temporal Module (ST)

Function: Aggregates global gait dynamics.

  • Temporal Pooling: MaxPool over the 2D feature sequence along the temporal dimension → \(F_i^{tp} \in \mathbb{R}^{h \times w \times d}\)
  • Spatial Pooling: AvgPool over 3D features along the point dimension → \(F_i^{sp} \in \mathbb{R}^{n \times d}\)
  • Spatio-Temporal Cross-Attention Fusion:
\[\tilde{F}_i^{sp} = \text{CA}(F_i^{sp}, F_i^{tp}) + F_i^{sp}\]

\[F_i^{fusion} = \text{MLP}(\text{CA}(F_i^{tp}, \tilde{F}_i^{sp}) + F_i^{tp})\]

Horizontal Pyramid Pooling (HPP) is then applied for part-level matching.
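A minimal sketch of the ST module follows, mirroring the pooling and cross-attention equations above; the tensor shapes, dimensions, and MLP shape are illustrative assumptions:

```python
import torch.nn as nn

class SpatioTemporalModule(nn.Module):
    """Sketch of ST: temporal max-pool on the 2D stream, point-wise avg-pool
    on the 3D stream, then two cross-attention steps."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.ca_sp = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ca_tp = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, f2d_seq, f3d_seq):
        # f2d_seq: (B, T, HW, d) 2D features; f3d_seq: (B, T, N, d) 3D features
        f_tp = f2d_seq.max(dim=1).values           # temporal max-pool -> (B, HW, d)
        f_sp = f3d_seq.mean(dim=2)                 # point avg-pool    -> (B, T, d)
        sp_ref, _ = self.ca_sp(f_sp, f_tp, f_tp)   # CA(F^sp, F^tp)
        f_sp = f_sp + sp_ref                       # residual -> refined F^sp
        fused, _ = self.ca_tp(f_tp, f_sp, f_sp)    # CA(F^tp, refined F^sp)
        return self.mlp(f_tp + fused)              # F^fusion, then fed to HPP
```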

Loss & Training

\[\mathcal{L} = \alpha \mathcal{L}_{tri} + \beta \mathcal{L}_{ce}\]
  • \(\mathcal{L}_{tri}\): Triplet loss (\(\alpha=1.0\)) for learning discriminative embedding spaces.
  • \(\mathcal{L}_{ce}\): Cross-entropy loss (\(\beta=2.0\)) for classification supervision.
  • L2 distance is used at inference to measure similarity between probe and gallery samples.
  • Adam optimizer with weight decay 0.0005; MultiStepLR decay at 15K and 30K iterations.
  • 10 RGB frames and the corresponding point clouds are randomly sampled per sequence during training; training runs on 2× RTX 3090 GPUs. A minimal sketch of the objective follows this list.
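This hedged sketch shows the combined objective and the inference metric; `gait_loss` and the triplet margin are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

# Combined objective L = alpha * L_tri + beta * L_ce
# (alpha=1.0, beta=2.0 from the paper; the margin value is an assumption).
triplet = nn.TripletMarginLoss(margin=0.2)
ce = nn.CrossEntropyLoss()

def gait_loss(anchor, positive, negative, logits, labels, alpha=1.0, beta=2.0):
    return alpha * triplet(anchor, positive, negative) + beta * ce(logits, labels)

# Inference: L2 distance between probe and gallery embeddings.
def l2_dist(probe, gallery):
    return torch.cdist(probe, gallery, p=2)  # (num_probe, num_gallery)
```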

Key Experimental Results

Main Results

SUSTech1K (Overall Rank-1 Accuracy):

| Method | Modality | NM | BG | CL | OC | Overall R-1 | Overall R-5 |
|---|---|---|---|---|---|---|---|
| GaitBase | sil | 81.3 | 77.3 | 49.6 | 81.4 | 76.0 | 89.1 |
| LidarGait++ | pc | 94.2 | 93.9 | 79.7 | 91.9 | 92.7 | 98.2 |
| LiCAF | depth+sil | 95.8 | 95.7 | 82.7 | 96.6 | 93.9 | 98.8 |
| EMGaitNet | pc+rgb | 98.2 | 96.4 | 81.7 | 99.6 | 96.0 | 99.0 |

LRGait (Cross-Distance Cross-View Rank-1, gallery=D-10):

| Method | Modality | D-20 | D-30 | D-40 | D-50 | N-20 | N-30 | Overall R-1 |
|---|---|---|---|---|---|---|---|---|
| GaitBase | sil | 67.9 | 53.9 | 48.5 | 33.8 | 41.6 | 33.4 | 46.8 |
| LiCAF | depth+sil | 74.8 | 71.6 | 60.4 | 65.3 | 42.5 | 27.8 | 59.6 |
| EMGaitNet | pc+rgb | 88.5 | 82.4 | 80.8 | 74.4 | 38.2 | 31.7 | 68.9 |

FreeGait:

| Method | R-1 | R-5 | mAP |
|---|---|---|---|
| LidarGait++ | 82.0 | 93.6 | 87.2 |
| HMRNet | 80.8 | 93.6 | 86.5 |
| EMGaitNet | 85.2 | 96.8 | 89.0 |

Ablation Study

Contribution of Each Module (LRGait dataset):

| Configuration | Overall R-1 | Overall R-5 |
|---|---|---|
| Baseline | 52.3 | 70.2 |
| +SGA | 58.5 | 75.9 |
| +SGA+SeMi | 64.2 | 80.7 |
| +SGA+SeMi+ST | 68.9 | 85.8 |

Incremental Rank-1 contribution of each module:

  • SGA: +6.2 points (bridging the modality gap)
  • SeMi: +5.7 points (semantic guidance)
  • ST: +4.7 points (spatio-temporal modeling)

Key Findings

  1. End-to-end outperforms preprocessing: EMGaitNet with raw pc+rgb surpasses LiCAF with depth+sil by 2.1 points in overall Rank-1 on SUSTech1K (96.0% vs. 93.9%).
  2. Clear advantage at long range: At D-50, EMGaitNet outperforms the second-best method by 9.1 points (74.4% vs. 65.3%, roughly a 14% relative improvement), demonstrating the superiority of end-to-end approaches at far distances.
  3. Strong occlusion robustness: Achieves 99.6% accuracy under OC conditions, indicating that multimodal priors effectively compensate for missing occluded information.
  4. Nighttime remains challenging: All methods degrade substantially at night; EMGaitNet achieves only 21.9% at N-40.
  5. SeMi's semantics are pivotal: body-part-level semantic cues supply the intermediate representation that makes cross-modal alignment effective.

Highlights & Insights

  1. First 50m long-range multimodal gait dataset: Fills the gap in long-range gait recognition data, covering cross-distance, day/night, and multi-weather conditions.
  2. CLIP-guided cross-modal fusion: An inversion network injects visual information into the text space to generate instance-level semantic anchors — a novel cross-modal alignment strategy.
  3. End-to-end processing of raw inputs: Avoids information loss caused by feature preprocessing (depth mapping, silhouette extraction), especially in long-range and nighttime scenarios.
  4. PointGNN for gait recognition: First application of a graph-based 3D backbone to gait tasks; constructing graphs via feature similarity rather than spatial distance adapts well to sparse point clouds.
  5. LiDAR long-range pedestrian detection benchmark: Additional annotation of 4,500 frames for long-range pedestrian detection provides independent research value.

Limitations & Future Work

  1. Severe day-to-night domain shift: Nighttime performance drops substantially; dedicated multimodal domain adaptation methods are needed.
  2. Frozen CLIP: The possibility of fine-tuning CLIP on gait data remains unexplored.
  3. Computational overhead: Graph construction in PointGNN and CLIP encoding increase inference latency.
  4. Only 101 identities: Dataset scale remains small compared to GREW (26K identities).
  5. Single-person assumption: Multi-person scenarios that arise in real-world surveillance are not addressed.

Discussion

  • Comparison with LiCAF: LiCAF performs asymmetric fusion via simple cross-attention, whereas EMGaitNet employs semantically guided symmetric fusion, offering a more principled design.
  • Relationship to SUSTech1K: LRGait extends the distance range from 12m to 50m, representing a major advancement in the distance coverage of gait datasets.
  • Insight: The semantic capabilities of VLMs such as CLIP can effectively bridge heterogeneous modalities; this strategy is generalizable to other multimodal fusion tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ — Outstanding dataset contribution; CLIP-guided fusion is creative, though individual modules are relatively conventional.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three datasets, detailed ablations, and comparisons across diverse modalities and conditions.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure with thorough dataset description.
  • Value: ⭐⭐⭐⭐ — The dataset and baselines provide an important foundation for long-range gait recognition research.