Mahalanobis Distance-Based Multi-View Optimal Transport for Multi-View Crowd Localization¶

Conference: ECCV 2024
arXiv: 2409.01726
Code: Yes (Project Page)
Area: Other
Keywords: Multi-view crowd localization, Optimal transport, Mahalanobis distance, Density map, Point supervision

TL;DR¶

The paper proposes a Mahalanobis distance-based multi-view optimal transport loss (M-MVOT) whose transport cost adapts to the line-of-sight direction and the target-to-camera distance. It is the first to apply point-supervised optimal transport to multi-view crowd localization, and it clearly outperforms methods trained with an MSE loss on density maps (e.g., MODA 43.5 vs. 14.2 for MVDet on CVCS).

Background & Motivation¶

Multi-view crowd localization aims to fuse information from multiple camera views to predict the positions of all individuals on the ground plane of a scene, with wide applications in crowd analysis, autonomous driving, and public transportation management.

Limitations of Prior Work:

Limitations of Density Map Supervision: Existing methods rely on density maps generated by fixed-size Gaussian kernels as supervision signals and are trained with MSE losses. In crowded areas, when Gaussian kernels overlap significantly, local peaks are smoothed out, making it impossible to locate each individual accurately.

Single-View OT Not Extended to Multi-View: Optimal transport (OT) loss has demonstrated clear advantages in single-image crowd localization (e.g., the Generalized Loss (GL) method), but it has not yet been explored for multi-view crowd localization.

Multi-View Specific Challenges: When features are projected from camera views to the ground plane, streak artifacts are generated along the line-of-sight direction due to the unknown 3D height of objects, which degrades the localization accuracy of density maps.

Method¶

Overall Architecture¶

The model architecture follows a standard multi-view crowd localization pipeline: single-view feature extraction → projection to the ground plane → multi-view fusion and decoding. The innovation lies in replacing the MSE loss of the ground plane density map with the proposed M-MVOT loss.

Key Designs¶

1. Mahalanobis Distance Transport Cost (Replacing Euclidean Distance)

Standard OT uses the Euclidean distance in its transport cost: $C_{ij} = \exp(\|\mathbf{x}_i - \mathbf{y}_j\|)$

This work utilizes Mahalanobis distance to define a cost function with elliptical contours: $$C_{ij} = \exp(\sqrt{(\mathbf{x}_i - \mathbf{y}_j)^T \mathbf{S}^{-1} (\mathbf{x}_i - \mathbf{y}_j)})$$

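The covariance matrix is $\mathbf{S} = \mathbf{R} \boldsymbol{\Sigma} \mathbf{R}^{-1}$, where the rotation matrix $\mathbf{R}$ is determined by the line-of-sight direction and $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$ holds the variances along the two axes of the ellipse.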
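The cost above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the `direction` argument, and the convention that $\sigma_1^2$ is the variance along `direction` while $\sigma_2^2$ is the variance perpendicular to it are assumptions made for the example.

```python
import numpy as np

def rotation_from_direction(direction):
    """2D rotation matrix whose first column is the unit vector along `direction`."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return np.array([[d[0], -d[1]],
                     [d[1],  d[0]]])

def mahalanobis_cost(x, y, direction, sigma1_sq, sigma2_sq):
    """Cost matrix C[i, j] = exp(sqrt((x_i - y_j)^T S^{-1} (x_i - y_j))).

    x: (N, 2) points, y: (M, 2) points on the ground plane.
    Assumption: sigma1_sq acts along `direction`, sigma2_sq perpendicular to it.
    """
    R = rotation_from_direction(direction)
    # For a rotation matrix R^{-1} equals R^T, so S = R Sigma R^T.
    S = R @ np.diag([sigma1_sq, sigma2_sq]) @ R.T
    S_inv = np.linalg.inv(S)
    diff = x[:, None, :] - y[None, :, :]                  # (N, M, 2)
    sq = np.einsum("nmi,ij,nmj->nm", diff, S_inv, diff)   # squared Mahalanobis distance
    return np.exp(np.sqrt(np.maximum(sq, 0.0)))

# Hypothetical usage: one location, two targets, line of sight along the x-axis.
x = np.array([[0.0, 0.0]])
y = np.array([[1.0, 0.0], [0.0, 1.0]])
C = mahalanobis_cost(x, y, direction=[1.0, 0.0], sigma1_sq=1.0, sigma2_sq=1.2)
print(C)
```

Setting both variances to 1 reduces the function to the Euclidean cost, which makes it easy to compare the two formulations on the same points.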

2. Line-of-Sight Direction Guidance (MV-OT)

The minor axis of the ellipse is along the line-of-sight direction ($\sigma_1^2=1$), and the major axis is perpendicular to the line-of-sight direction ($\sigma_2^2=1.2$). Because a smaller variance yields a larger Mahalanobis distance for the same displacement, prediction errors along the line-of-sight direction are penalized more heavily than errors perpendicular to it, which counteracts the streak artifacts that projection produces along the line of sight.

3. Distance Adaptive Adjustment (ED-OT)

Points far from the cameras tend to have larger prediction errors. The variance is adjusted by the distance from the target to the camera: $$\sigma_1^2 = \sigma_2^2 = 1/\exp(\alpha \cdot \text{MinMaxNorm}(d_{cam}))$$

The farther the point is from the camera, the smaller the variance, and hence the larger the cost for the same displacement.
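A minimal sketch of this variance schedule is shown below. The source does not state over which set of points MinMaxNorm is computed, so normalizing over the distances passed to the function is an assumption, as are the function names.

```python
import numpy as np

def min_max_norm(d):
    """Rescale distances to [0, 1]; returns zeros if all distances are equal."""
    d = np.asarray(d, dtype=float)
    span = d.max() - d.min()
    return np.zeros_like(d) if span == 0 else (d - d.min()) / span

def ed_ot_variance(d_cam, alpha):
    """Isotropic ED-OT variance: sigma_1^2 = sigma_2^2 = 1 / exp(alpha * MinMaxNorm(d_cam)).

    Assumption: normalization is taken over the distances in `d_cam`.
    """
    return 1.0 / np.exp(alpha * min_max_norm(d_cam))

# Hypothetical distances (in ground-plane units) from three targets to a camera.
variances = ed_ot_variance([1.0, 2.0, 3.0], alpha=1.0)
print(variances)
```

With an isotropic variance $s$, the Mahalanobis distance is the Euclidean distance divided by $\sqrt{s}$, so the shrinking variance directly scales up the cost for distant targets.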

4. Joint Line-of-Sight + Distance (M-OT)

Combining both mechanisms: $$\sigma_1^2 = 1, \quad \sigma_2^2 = \exp(\alpha \cdot \text{MinMaxNorm}(d_{cam}))$$

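With $\sigma_1^2$ fixed at 1, the cost along the line of sight is the same at every distance, while the growing $\sigma_2^2$ relaxes the cost perpendicular to the line of sight for distant points. Errors at distant points are therefore penalized relatively more heavily along the line-of-sight direction than across it.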

5. Multi-View Extension (M-MVOT)

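Based on a distance selection strategy, the transport cost for each ground-truth point is calculated using the M-OT of the nearest camera: $$C_{ij} = \sum_{k=1}^K \mathbb{1}(d_{cam}^k) \exp(\sqrt{(\mathbf{x}_i - \mathbf{y}_j)^T \mathbf{S}_k^{-1} (\mathbf{x}_i - \mathbf{y}_j)})$$

Here $K$ is the number of cameras, $\mathbf{S}_k$ is the M-OT covariance defined with respect to camera $k$, and the indicator $\mathbb{1}(d_{cam}^k)$ equals 1 when camera $k$ is the nearest camera to the ground-truth point and 0 otherwise, so exactly one term of the sum is active.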
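The nearest-camera selection can be sketched as below. Several details are assumptions for illustration and are not confirmed by the source: cameras are represented by 2D positions on the ground plane, $\mathbf{y}_j$ are the ground-truth points, and MinMaxNorm is taken over the nearest-camera distances of the ground-truth points passed in. The sketch uses the fact that for $\mathbf{S} = \mathbf{R}\boldsymbol{\Sigma}\mathbf{R}^{-1}$ the squared Mahalanobis distance splits into a line-of-sight term divided by $\sigma_1^2$ plus a perpendicular term divided by $\sigma_2^2$.

```python
import numpy as np

def m_mvot_cost(pred, gt, cams, alpha):
    """Sketch of the M-MVOT cost with nearest-camera selection.

    pred: (N, 2) ground-plane locations x_i
    gt:   (M, 2) ground-truth points y_j (must not coincide with a camera)
    cams: (K, 2) camera positions on the ground plane (assumed representation)
    Returns C of shape (N, M).
    """
    pred, gt, cams = (np.asarray(a, dtype=float) for a in (pred, gt, cams))
    rays = gt[:, None, :] - cams[None, :, :]       # (M, K, 2) camera-to-target vectors
    dist = np.linalg.norm(rays, axis=-1)           # (M, K)
    nearest = dist.argmin(axis=1)                  # index of the nearest camera per target
    idx = np.arange(len(gt))
    d = dist[idx, nearest]                         # (M,) distance to the nearest camera
    u = rays[idx, nearest] / d[:, None]            # unit line-of-sight direction
    v = np.stack([-u[:, 1], u[:, 0]], axis=1)      # unit perpendicular direction

    span = d.max() - d.min()
    d_norm = np.zeros_like(d) if span == 0 else (d - d.min()) / span
    sigma1_sq = np.ones_like(d)                    # M-OT: fixed along the line of sight
    sigma2_sq = np.exp(alpha * d_norm)             # M-OT: grows with distance

    diff = pred[:, None, :] - gt[None, :, :]       # (N, M, 2)
    along = np.einsum("nmi,mi->nm", diff, u)       # component along the line of sight
    across = np.einsum("nmi,mi->nm", diff, v)      # component perpendicular to it
    maha = np.sqrt(along**2 / sigma1_sq + across**2 / sigma2_sq)
    return np.exp(maha)

# Hypothetical scene: one camera at the origin, two targets on the x-axis.
C = m_mvot_cost(pred=[[3.0, 0.0], [2.0, 1.0], [4.0, 1.0]],
                gt=[[2.0, 0.0], [4.0, 0.0]],
                cams=[[0.0, 0.0]],
                alpha=1.0)
print(C.shape)
```

Weighting the per-camera costs instead of selecting a single camera would only change how `nearest` is used, which is the variation the limitations section hints at.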

Loss & Training¶

An unbalanced optimal transport (UOT) formulation is adopted and solved using Sinkhorn iterations. The total loss consists of the M-MVOT loss combined with an auxiliary 2D density map loss. The hyperparameter $\tau$ is set to 1 for CVCS and Wildtrack, and 20 for MultiviewX; $\alpha$ is set to 1 for CVCS and 0.05 for others.
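For readers unfamiliar with unbalanced Sinkhorn, the sketch below shows the standard scaling iterations for entropic OT with KL-relaxed marginals. It is a generic textbook form, not the paper's solver: the parameter names `eps` (entropic regularization) and `rho` (marginal relaxation strength) are assumptions and are not claimed to correspond to the paper's $\tau$.

```python
import numpy as np

def sinkhorn_uot(a, b, C, eps=1.0, rho=1.0, n_iters=200):
    """Entropic unbalanced OT via scaling iterations (KL-relaxed marginals).

    a: (N,) source mass, b: (M,) target mass, C: (N, M) cost matrix.
    eps: entropic regularization, rho: marginal relaxation (both assumed names).
    Returns the transport plan P of shape (N, M).
    """
    a, b, C = (np.asarray(t, dtype=float) for t in (a, b, C))
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)            # exponent < 1 softens the marginal constraints
    for _ in range(n_iters):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]

# Hypothetical toy problem: two sources, two targets, cheap diagonal matches.
a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
P = sinkhorn_uot(a, b, C, eps=0.1, rho=1e6)
print(P.sum(axis=1), P.sum(axis=0))
```

A very large `rho` approaches balanced OT, where the plan's marginals match `a` and `b`; smaller values let the plan create or drop mass. Because the costs in this work are exponentials of distances, a practical implementation would likely need log-domain updates to avoid underflow in `K`.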

Key Experimental Results¶

Main Results¶

CVCS Dataset (Cross-Scene):

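| Method | MODA↑ | MODP↑ | Precision↑ | Recall↑ | F1↑ |
|---|---|---|---|---|---|
| M-MVOT (Ours) | 43.5 | 74.1 | 85.5 | 52.3 | 64.9 |
| E-MVOT (Ours) | 43.1 | 74.3 | 85.6 | 51.8 | 64.5 |
| SHOT | 31.7 | 72.1 | 94.5 | 33.6 | 49.6 |
| MVDeTr | 24.9 | 79.6 | 98.1 | 25.4 | 40.4 |
| 3DROM | 20.1 | 74.2 | 84.1 | 23.7 | 37.0 |
| MVDet | 14.2 | 59.3 | 85.0 | 17.3 | 28.7 |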
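As a sanity check on the CVCS table, F1 can be recomputed from the reported precision and recall with the usual harmonic-mean formula. The snippet below is a small illustrative check; the dictionary simply copies the numbers from the table above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the CVCS table.
cvcs = {
    "M-MVOT": (85.5, 52.3, 64.9),
    "E-MVOT": (85.6, 51.8, 64.5),
    "SHOT":   (94.5, 33.6, 49.6),
    "MVDeTr": (98.1, 25.4, 40.4),
    "3DROM":  (84.1, 23.7, 37.0),
    "MVDet":  (85.0, 17.3, 28.7),
}

for name, (p, r, reported) in cvcs.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.2f}, reported = {reported}")
```

The recomputed values agree with the reported ones up to rounding, and they make the pattern in the table explicit: the baselines have high precision but low recall, and the OT-based losses gain F1 mainly through recall.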

MultiviewX Dataset:

| Method | MODA↑ | MODP↑ | F1↑ |
|---|---|---|---|
| M-MVOT (Ours) | 96.7 | 86.1 | 98.3 |
| 3DROM | 95.0 | 84.9 | 97.5 |
| MVDeTr | 93.7 | 91.3 | 97.8 |

Ablation Study¶

| Loss Function | Line-of-Sight | Distance Adjustment | MODA↑ | F1↑ |
|---|---|---|---|---|
| MSE | ✗ | ✗ | 14.2 | 28.7 |
| E-MVOT | ✗ | ✗ | 43.1 | 64.5 |
| MV-MVOT | ✓ | ✗ | 42.2 | 63.4 |
| ED-MVOT | ✗ | ✓ | 38.7 | 62.3 |
| M-MVOT | ✓ | ✓ | 43.5 | 64.9 |

Key Findings¶

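Switching from MSE to OT brings by far the largest gain (MODA from 14.2 to 43.1, F1 from 28.7 to 64.5), indicating that point supervision is better suited to localization than density map supervision.
M-MVOT outperforms E-MVOT on MODA, Recall, and F1 on CVCS (e.g., MODA 43.5 vs. 43.1), supporting the Mahalanobis distance cost function, although the margin is small and E-MVOT is marginally higher on MODP and Precision. In the ablation, line-of-sight guidance alone (42.2 MODA) and distance adjustment alone (38.7 MODA) each fall below E-MVOT; only their combination improves on it.
The gain is largest on the cross-scene CVCS dataset (+11.8 MODA over SHOT, versus +1.7 MODA over 3DROM on MultiviewX), which suggests that the loss generalizes well to unseen scenes.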
Visualizations show that M-MVOT achieves better localization in both crowded and distant regions.
M-MVOT reduces false detections and streak artifacts.

Highlights & Insights¶

First to Introduce OT Point Supervision to Multi-View Localization: Fills a research gap; moving from density map supervision to point supervision accounts for most of the reported gain.
Physics-Driven Cost Function Design: The elliptical contours of the Mahalanobis distance elegantly correspond to the directional characteristics of projection artifacts.
Joint Modeling of Line-of-Sight Direction + Distance: Fully utilizes the geometric information of multi-view systems.
Plug-and-Play Loss Function: The loss replaces the ground-plane density map MSE loss, so it can be combined with existing multi-view localization models without changing their architecture.

Limitations & Future Work¶

M-MVOT underperforms 3DROM (a dedicated data augmentation method) on the small-scale, single-scene Wildtrack dataset, which may point to overfitting.
The selection strategy of the nearest camera is relatively simple; weighting contributions from multiple cameras could be considered.
The hyperparameter $\alpha$ requires different settings on different datasets, which increases the difficulty of parameter tuning.
The integration with detection frameworks (e.g., DETR-like architectures) has not been explored.
Learning an adaptive covariance matrix instead of manual configuration could be considered.

Related Work & Insights¶

The GL method proved that MSE loss and Bayesian loss are special, sub-optimal cases of unbalanced OT loss, providing a theoretical foundation for this work.
Mahalanobis distance is widely used in traditional statistics; this work ingeniously combines it with camera geometry.
The one-to-one matching paradigm of P2PNet provides an alternative strategy to density maps.
The data augmentation in 3DROM is orthogonal to the loss improvement in this work, and they can be combined.

Rating¶

Novelty: ★★★★☆ — First to introduce OT into multi-view localization, with physical intuition in the design of Mahalanobis distance.
Practicality: ★★★★☆ — Plug-and-play loss function, easy to integrate.
Experimental Thoroughness: ★★★★★ — Three datasets + detailed ablation + visualization analysis.
Writing Quality: ★★★★☆ — Clear mathematical derivations, progressive reasoning.