Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance¶

Conference: ECCV 2024
arXiv: 2405.10589
Code: To be released, according to the paper
Area: Human Understanding
Keywords: Crowd Counting, Crowd Localization, Point Supervision, Matching Instability, Implicit Feature Interpolation

TL;DR¶

Proposes an Auxiliary Point Guidance (APG) strategy and an Implicit Feature Interpolation (IFI) module to stabilize proposal-target matching in point-based crowd counting methods. APG explicitly generates auxiliary positive and negative samples near ground-truth points, and the resulting model achieves state-of-the-art results on multiple datasets.

Background & Motivation¶

Crowd counting and localization methods can be broadly categorized into three types: density map-based (map-based), detection-based, and point-based. Point-based methods (e.g., P2PNet, CLTR, PET) have attracted attention due to being end-to-end trainable and requiring no complex post-processing, but they suffer from a key problem: unstable proposal-target matching.

Specifically, from one training epoch to the next, a large number of target points are matched to different point proposals, which blurs the learning objective of each proposal. The authors define the Instability Rate (IR) to quantify this effect and find that existing matching schemes (e.g., the Hungarian Matcher) maintain a high IR throughout training. The root cause is the lack of a learning strategy that guides the network to consistently select the most appropriate proposal, which ultimately leads to regional underestimation or overestimation.
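
To make the idea of an instability measure concrete, here is a minimal sketch. It assumes IR is the fraction of ground-truth points whose matched proposal changes between two consecutive epochs; the paper's exact formula is not reproduced in this note, and the function name and the dictionary representation are hypothetical.

```python
def instability_rate(prev_match, curr_match):
    """Fraction of ground-truth points whose matched proposal changed.

    prev_match / curr_match map a ground-truth id to the index of the
    proposal it was matched to in the previous / current epoch.
    ASSUMPTION: this definition is illustrative, not the paper's formula.
    """
    shared = prev_match.keys() & curr_match.keys()
    if not shared:
        return 0.0
    changed = sum(1 for gt in shared if prev_match[gt] != curr_match[gt])
    return changed / len(shared)


# Toy matching results for four ground-truth points in two epochs.
epoch_1 = {0: 12, 1: 40, 2: 7, 3: 93}
epoch_2 = {0: 12, 1: 41, 2: 7, 3: 90}
print(instability_rate(epoch_1, epoch_2))
```

In the toy data, two of the four targets switch proposals, so a measure of this kind stays high whenever the matcher keeps reshuffling assignments.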

Method¶

Overall Architecture¶

APGCC uses VGG-16 as the backbone network and extracts features from its conv3 and conv4 layers. An ASPP module enhances the multi-scale representation, and implicit feature interpolation (IFI) then produces feature representations at arbitrary positions. Finally, prediction heads take these features and output a confidence score and a coordinate offset for each point. The overall loss is:

\[\mathcal{L}_{overall} = \mathcal{L}_{point} + \lambda_5 \mathcal{L}_{APG}\]

Key Designs¶

Auxiliary Point Guidance (APG): Core innovation, designed to solve the matching instability problem. For each ground-truth point \((x,y)\), auxiliary positive samples \(A_{pos}\) and auxiliary negative samples \(A_{neg}\) are generated in its vicinity:
- Auxiliary Positive Samples: Randomly generated with offsets within the range of \([-n_{pos}, n_{pos}]\) near the ground-truth coordinates. The training objective is a confidence score close to 1, and the predicted offset should steer back to the ground-truth location.
- Auxiliary Negative Samples: Generated with offsets within the range of \([-n_{neg}, -n_{pos}] \cup [n_{pos}, n_{neg}]\). The training objective is a confidence score close to 0 and a predicted offset converging to zero, so that a negative sample cannot drift onto the ground truth through its offset.
- Design Motivation: By explicitly teaching the network that "proposals close to the ground truth should be positive and those far away should be negative," the matching process is guided to be stable, enabling the same ground-truth point to be consistently matched to the same proposal across different epochs.
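
The sampling ranges above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the ranges are applied to the x and y offsets independently, and the function name `sample_auxiliary_points` and its signature are hypothetical.

```python
import random


def sample_auxiliary_points(gt, n_pos=2.0, n_neg=8.0, k_pos=2, k_neg=2, rng=None):
    """Return (positives, negatives) auxiliary points around one GT point.

    ASSUMPTION: offsets are drawn per axis; the paper may sample differently.
    """
    if rng is None:
        rng = random.Random()
    x, y = gt

    def ring_offset():
        # Magnitude in [n_pos, n_neg] with a random sign, i.e. a value in
        # [-n_neg, -n_pos] U [n_pos, n_neg].
        return rng.choice((-1.0, 1.0)) * rng.uniform(n_pos, n_neg)

    # Positives: offsets in [-n_pos, n_pos] on each axis.
    positives = [
        (x + rng.uniform(-n_pos, n_pos), y + rng.uniform(-n_pos, n_pos))
        for _ in range(k_pos)
    ]
    # Negatives: offsets in the outer band on each axis.
    negatives = [(x + ring_offset(), y + ring_offset()) for _ in range(k_neg)]
    return positives, negatives


pos, neg = sample_auxiliary_points((100.0, 50.0), rng=random.Random(0))
print(pos)
print(neg)
```

The defaults mirror the reported settings \((n_{pos}, n_{neg}) = (2, 8)\) and \((k_{pos}, k_{neg}) = (2, 2)\).
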
Implicit Feature Interpolation (IFI): Since auxiliary points are located at arbitrary coordinates rather than grid points, traditional bilinear interpolation is not flexible enough. IFI utilizes implicit functions for continuous feature representation:
- For any position \((x,y)\), the four nearest latent features \(Z_i^*\) are located and the distances \(\delta_i^*\) are calculated.
- The features and distances are concatenated and fed into an MLP \(f_\theta\), and positional encoding \(\phi(\cdot)\) is introduced to enhance the capture of high-frequency information.
- The final feature is obtained via area-weighted summation: \(F_{proposal}(x,y) = \sum_{i=1}^{4} \frac{S_i}{S} f_\theta(Z_i^*, \delta_i^*, \phi(\delta_i^*))\)
- Compared to traditional upsampling, IFI achieves more precise feature learning with fewer parameters.
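
The three IFI steps can be sketched in plain Python. Several details here are assumptions rather than facts from the paper: the area weight \(S_i / S\) is taken in the LIIF style (the rectangle spanned by the query and the diagonally opposite corner), the positional encoding uses two sin/cos frequencies, and `make_mlp` is a randomly initialised stand-in for the learned \(f_\theta\). All function names are hypothetical.

```python
import math
import random


def positional_encoding(delta, num_freqs=2):
    """phi(delta): sin/cos features of the offset at increasing frequencies."""
    out = []
    for k in range(num_freqs):
        for d in delta:
            out.append(math.sin((2 ** k) * math.pi * d))
            out.append(math.cos((2 ** k) * math.pi * d))
    return out


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Tiny random two-layer ReLU MLP standing in for the learned f_theta."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def forward(v):
        h = [max(0.0, sum(w * a for w, a in zip(row, v))) for row in w1]
        return [sum(w * a for w, a in zip(row, h)) for row in w2]

    return forward


def ifi_feature(feat, x, y, mlp):
    """Interpolate a feature at continuous grid position (x, y).

    feat[r][c] is the latent vector at integer column c (x) and row r (y);
    the grid must be at least 2x2 and (x, y) must lie inside it.
    """
    h, w = len(feat), len(feat[0])
    x0 = min(max(int(math.floor(x)), 0), w - 2)
    y0 = min(max(int(math.floor(y)), 0), h - 2)
    out = None
    for cx, cy in ((x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)):
        delta = (x - cx, y - cy)
        # ASSUMPTION: LIIF-style area weight; the four weights sum to 1
        # because the enclosing grid cell has unit area (S = 1).
        weight = (1 - abs(delta[0])) * (1 - abs(delta[1]))
        pred = mlp(feat[cy][cx] + list(delta) + positional_encoding(delta))
        pred = [weight * p for p in pred]
        out = pred if out is None else [a + b for a, b in zip(out, pred)]
    return out


# 3x4 grid of 3-channel latent features; input = 3 channels + 2 offsets + 8 PE.
feat = [[[float(r), float(c), 1.0] for c in range(4)] for r in range(3)]
mlp = make_mlp(in_dim=3 + 2 + 8, hidden_dim=16, out_dim=5)
print(len(ifi_feature(feat, 1.3, 0.6, mlp)))
```

Unlike plain bilinear interpolation, each of the four neighbours is first transformed by the MLP conditioned on its offset to the query, and only then blended.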

Loss & Training¶

Point Loss \(\mathcal{L}_{point}\): Contains the Cross Entropy classification loss \(\mathcal{L}_{cls}\) and the Euclidean regression loss \(\mathcal{L}_{loc}\).
APG Positive Loss \(\mathcal{L}_{APG}^{pos}\): Maximizes the confidence of auxiliary positive samples and minimizes the distance between their predicted locations and the ground truth.
APG Negative Loss \(\mathcal{L}_{APG}^{neg}\): Minimizes the confidence of auxiliary negative samples and constrains their offsets to approach zero.
Training Details: Adam optimizer, learning rate \(10^{-4}\) (backbone \(10^{-5}\)), batch size 8, stride \(s=8\), auxiliary sample counts \((k_{pos}, k_{neg}) = (2, 2)\), random ranges \((n_{pos}, n_{neg}) = (2, 8)\).
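
A rough sketch of how the two APG terms could be computed for one ground-truth point is given below. The concrete forms are assumptions: the note only states what each term maximizes or minimizes, so binary cross-entropy for confidence, Euclidean distance for location, unit weights, and averaging over samples are all illustrative choices. The function `apg_loss` and its input layout are hypothetical.

```python
import math


def apg_loss(gt, positives, negatives, eps=1e-7):
    """Illustrative APG loss for one ground-truth point.

    positives / negatives: lists of (aux_point, predicted_offset, confidence).
    ASSUMPTION: BCE + Euclidean terms with unit weights, averaged over samples.
    """
    loss_pos = 0.0
    for (px, py), (dx, dy), conf in positives:
        # Push confidence toward 1 ...
        loss_pos += -math.log(max(conf, eps))
        # ... and the refined location (aux point + offset) toward the GT.
        loss_pos += math.hypot(px + dx - gt[0], py + dy - gt[1])

    loss_neg = 0.0
    for _, (dx, dy), conf in negatives:
        # Push confidence toward 0 ...
        loss_neg += -math.log(max(1.0 - conf, eps))
        # ... and the offset toward zero so the sample stays where it is.
        loss_neg += math.hypot(dx, dy)

    count = max(len(positives) + len(negatives), 1)
    return (loss_pos + loss_neg) / count


gt = (10.0, 10.0)
good = apg_loss(gt, [((11.0, 9.0), (-1.0, 1.0), 1.0)], [((16.0, 10.0), (0.0, 0.0), 0.0)])
bad = apg_loss(gt, [((11.0, 9.0), (2.0, 2.0), 0.2)], [((16.0, 10.0), (-6.0, 0.0), 0.9)])
print(good, bad)
```

The `bad` case shows the behaviour the negative term is meant to penalize: a far-away sample with high confidence whose offset carries it onto the ground truth.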

Key Experimental Results¶

Main Results¶

Dataset	Metric	APGCC	PET (Prev. SOTA)	Δ vs. PET (lower is better)
SHHA	MAE/MSE	48.8/76.7	49.3/78.7	-0.5/-2.0
SHHB	MAE/MSE	5.6/8.7	6.1/9.6	-0.5/-0.9
UCF_CC_50	MAE/MSE	154.8/205.5	159.9/223.7	-5.1/-18.2
UCF-QNRF	MSE	136.6	144.3	-7.7
JHU-Crowd++	MAE/MSE	54.3/225.9	58.5/238.0	-4.2/-12.1
NWPU	MAE/MSE	71.7/284.4	74.4/328.5	-2.7/-44.1
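
For reference, the note does not define the two counting metrics. Assuming the usual convention in the crowd-counting literature, where "MSE" denotes the root of the mean squared error, they would read as follows for \(N\) test images with predicted count \(\hat{c}_i\) and ground-truth count \(c_i\):

\[\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{c}_i - c_i \right|, \qquad \text{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{c}_i - c_i \right)^2}\]

Both are errors, so the negative entries in the last column are improvements; for example, on SHHA the MAE difference is \(48.8 - 49.3 = -0.5\).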

Localization Performance (NWPU):

Method	F1(σ_l)	P(σ_l)	F1(σ_s)
PET	74.2%	75.2%	67.5%
APGCC	76.4%	79.2%	68.9%

Localization Performance (SHHA):

Method	F1(σ=4)	F1(σ=8)
CLTR	43.2%	74.2%
APGCC	48.7%	78.4%
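
The localization metrics count a predicted point as correct when it lies within a distance threshold σ of an unmatched ground-truth point. The sketch below illustrates this with greedy one-to-one matching; the benchmarks' official matching protocols may differ, so treat the function `localization_f1` as a hypothetical illustration rather than the evaluation code.

```python
import math


def localization_f1(preds, gts, sigma):
    """Precision, recall and F1 with greedy one-to-one matching within sigma.

    ASSUMPTION: greedy nearest-pair matching; official protocols may differ.
    """
    # All candidate pairs, closest first.
    pairs = sorted(
        (math.dist(p, g), i, j)
        for i, p in enumerate(preds)
        for j, g in enumerate(gts)
    )
    used_preds, used_gts, true_pos = set(), set(), 0
    for d, i, j in pairs:
        if d > sigma:
            break  # remaining pairs are even farther apart
        if i in used_preds or j in used_gts:
            continue
        used_preds.add(i)
        used_gts.add(j)
        true_pos += 1

    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(gts) if gts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


preds = [(10, 10), (20, 21), (50, 50)]
gts = [(11, 10), (20, 20), (80, 80), (90, 90)]
print(localization_f1(preds, gts, sigma=4))
```

A smaller σ accepts fewer matches, which is why F1 at σ=4 is lower than at σ=8 for both methods in the table.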

Ablation Study¶

Configuration	MAE / Outcome	Description
Matcher only	Baseline	Hungarian matching only
Nearest Point	Severe underestimation	Multiple ground truths mapped to the same proposal
APG only	Suboptimal	The network over-relies on auxiliary points, which are not available as references during inference
Matcher + APG	48.8	Balances allocation and guidance
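
The failure mode of the "Nearest Point" row can be reproduced on a toy example. The sketch below contrasts nearest-proposal assignment with a minimum-cost one-to-one assignment, computed here by brute force as a stand-in for Hungarian matching (feasible only for tiny inputs). The coordinates and function names are made up for illustration.

```python
import math
from itertools import permutations


def nearest_assignment(gts, proposals):
    """Each GT independently picks its closest proposal (may collide)."""
    return [
        min(range(len(proposals)), key=lambda j: math.dist(g, proposals[j]))
        for g in gts
    ]


def one_to_one_assignment(gts, proposals):
    """Brute-force minimum-cost one-to-one assignment.

    Stand-in for Hungarian matching; needs len(proposals) >= len(gts).
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(proposals)), len(gts)):
        cost = sum(math.dist(g, proposals[j]) for g, j in zip(gts, perm))
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best


gts = [(10, 10), (12, 10)]
proposals = [(11, 10), (20, 10), (0, 10)]
print(nearest_assignment(gts, proposals))     # both GTs pick proposal 0
print(one_to_one_assignment(gts, proposals))  # each GT gets its own proposal
```

With nearest assignment, two people share one positive proposal, so only one point is counted, which matches the underestimation reported in the table.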

IFI Ablation:

Configuration	MAE / Outcome	Description
Nearest neighbor (w/o MLP)	Poor	Insufficient feature context
Bilinear interpolation (w/o MLP)	Moderate	Lacks continuous distance transformation
IFI with single reference point	Suboptimal	Insufficient reference points
IFI w/o positional encoding	Suboptimal	Lost high-frequency information
IFI (Full)	Optimal	All components synergistic

Key Findings¶

APG significantly reduces the Instability Rate (IR), and the setting \((k_{pos}, k_{neg}) = (2, 2)\) for the auxiliary positive and negative samples is sufficient.
APG is only used during training and adds no inference overhead.
IFI achieves better performance with fewer parameters compared to traditional upsampling.
Under a strict localization threshold (σ=4) on SHHA, the F1 score of APGCC is 5.5 percentage points higher than that of CLTR (48.7% vs. 43.2%).

Highlights & Insights¶

Precise Problem Definition: The paper quantifies the long-neglected matching instability in point-based methods using the Instability Rate.
Ingenious Auxiliary Point Approach: Instead of changing the matching mechanism itself, it "teaches" the network how to make the correct choices through additional supervision signals.
Decoupled Training/Inference: APG only functions during training, incurring zero overhead during inference, which is engineering-friendly.
IFI provides a general feature extraction scheme for arbitrary positions, which can be transferred to other tasks requiring non-grid positional features.

Limitations & Future Work¶

The random ranges \((n_{pos}, n_{neg})\) of auxiliary points require tuning based on the specific dataset.
The stride is fixed at 8, lacking an adaptive multi-scale stride mechanism.
The application of the APG strategy to set prediction frameworks like the DETR series detectors has not been explored.
Although the MLP parameter size of implicit feature interpolation is small, it still introduces some computational cost.

Related Work & Insights¶

P2PNet first proposed the point-based crowd counting framework but did not resolve the matching instability problem.
Hungarian matching in DETR also exhibits early instability; the concept of APG could inspire potential solutions there.
The success of implicit functions (e.g., LIIF, NeRF) in continuous representation inspired the design of IFI.

Rating¶

Novelty: ⭐⭐⭐⭐ — The APG strategy is novel with clear intuition, and the analysis of matching instability is thorough.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation on 6 datasets across dual tasks of counting and localization, with exhaustive ablation studies.
Writing Quality: ⭐⭐⭐⭐ — Clear structure and well-elaborated problem motivation.
Value: ⭐⭐⭐⭐ — Provides substantial improvements for point-based methods, and the APG concept is generalizable.