AFreeCA: Annotation-Free Counting for All
Conference: ECCV 2024
arXiv: 2403.04943
Code: https://github.com/adrian-dalessandro/AFreeCA
Area: Object Detection / Counting
Keywords: Object Counting, Unsupervised, Diffusion Model, Synthetic Data, Density Estimation
TL;DR
This work uses Stable Diffusion to generate synthetic sorting and counting data, then follows a two-stage strategy: first learn to sort images by object count, then anchor the sorted features to actual counts. Combined with density-guided image partitioning, this yields the first annotation-free counting method applicable to arbitrary object categories, and it outperforms existing unsupervised methods on most crowd counting benchmarks.
Background & Motivation
Background: Object counting (especially crowd counting) traditionally relies on density map supervision, which makes annotation extremely expensive (e.g., annotating the 5,109 images of the NWPU-Crowd dataset took about 3,000 human hours). Unsupervised methods attempt to eliminate this annotation burden but are currently limited to crowd counting (e.g., CSS-CCNN, CrowdCLIP) and cannot generalize to other categories.
Limitations of Prior Work:
- Fully supervised methods require point-wise annotations, leading to high costs and fixed categories.
- Existing unsupervised methods (e.g., CSS-CCNN) require prior knowledge (such as maximum count and power-law distribution parameters).
- Although CrowdCLIP utilizes CLIP for zero-annotation, it is still limited by the inadequate quantity comprehension of the CLIP text encoder.
- There is no general, annotation-free counting method applicable to arbitrary categories.
Key Challenge: While LDMs (such as Stable Diffusion) can generate high-quality synthetic images, their text encoders lack accurate comprehension of specific quantities—a prompt for "20 people" might generate an image containing 15 or 25 people. This label noise intensifies as the quantity increases, preventing direct supervision using synthetic data.
Goal: Bypass the inaccurate quantity comprehension of LDMs while still exploiting their image generation capabilities, in order to build an annotation-free object counting system.
Key Insight: Although the absolute counts generated by LDMs are inaccurate, the relative sorting signals they provide when adding or subtracting objects are highly reliable (99% accuracy). Therefore, one can first learn sorting features using this reliable signal, and then perform lightweight anchoring with the noisy counting data.
Core Idea: Decompose the counting problem into "learning sorting first (reliable signals) \(\rightarrow\) anchoring counts next (noisy but sufficient) \(\rightarrow\) performing density-guided partitioning for high-density areas", using LDM synthetic data throughout without manual annotations.
Method
Overall Architecture
A three-stage pipeline: (1) Use Stable Diffusion's image-to-image and outpainting to generate sorting triplet data, pre-training a sorting network to learn features related to object quantity; (2) Use SD text-to-image to generate synthetic images with noisy counting labels, fine-tuning only a linear layer to anchor sorting features to specific counts; (3) Train a density classifier to partition high-density regions during inference to improve accuracy.
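Stage (2) of the pipeline above, anchoring frozen sorting features to counts with a single linear layer under an MSE objective, can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, the toy 1-D "feature" stands in for the frozen backbone output, and a closed-form least-squares solve replaces gradient-based training (both minimize the same MSE objective for a linear layer).

```python
import numpy as np


def fit_linear_anchor(features: np.ndarray, noisy_counts: np.ndarray) -> np.ndarray:
    """Fit counts ~ [features, 1] @ w by least squares (i.e. minimum MSE).

    `features` plays the role of frozen backbone outputs; only `w` is learned.
    """
    design = np.hstack([features, np.ones((features.shape[0], 1))])  # add bias column
    w, *_ = np.linalg.lstsq(design, noisy_counts, rcond=None)
    return w


def predict_counts(features: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Apply the fitted linear layer to (frozen) features."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    return design @ w


# Toy data (assumption): a frozen feature that is monotone in the true count,
# and synthetic labels whose noise grows with the count, as with LDM prompts.
rng = np.random.default_rng(0)
true_counts = rng.integers(0, 200, size=500).astype(float)
features = (0.01 * true_counts)[:, None]
noisy_counts = true_counts + rng.normal(0.0, 0.1 * (1.0 + true_counts))

w = fit_linear_anchor(features, noisy_counts)
mae = np.mean(np.abs(predict_counts(features, w) - true_counts))
print(f"MAE against the true counts: {mae:.2f}")
```

Because the features already order images correctly, the linear layer only has to estimate a scale and an offset, which is why zero-mean label noise largely averages out in this toy setting.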
Key Designs
- Synthetic Sorting Data Generation + Sorting Network Pre-training:
  - Function: Starting from a reference image, use SD with image-to-image to remove objects and outpainting to add objects, producing sorting triplets \((x^{syn-}, x^{ref}, x^{syn+})\).
  - Mechanism: For each reference image, generate 4 incremented versions and 4 decremented versions to obtain 16 ordered triplets. Train a ResNet-50 backbone using a RankSim-style sorting loss, aligning sorting relationships in both the feature space \(\ell_{sort}^z\) and the prediction space \(\ell_{sort}^y\).
  - Design Motivation: The sorting signals from SD (relative magnitude relationships after adding/subtracting objects) are much more accurate than absolute counts (99% reliable). Learning quality features with this reliable signal first avoids training from scratch with noisy counting labels.
  - Key Verification: Feature map visualization demonstrates that the sorting network accurately focuses on target object regions.
- Synthetic Counting Data + Count Anchoring:
  - Function: Use SD to generate synthetic images with counting prompts (e.g., "20 people"), fine-tuning only a linear layer to map sorting features to specific count values.
  - Mechanism: Generate synthetic images with various counts in the range of 1–1000 (150 images per category + 800 zero-object images). Freeze the backbone and train only the linear layer \(g_\Phi\) using the MSE loss \(\mathcal{L}_{count}\).
  - Noise Filtering: Adopt the CleanNet approach to compute characteristic prototype vectors for each counting category, filtering out samples inconsistent with their respective category prototypes.
  - Design Motivation: Fine-tuning only the linear layer is intentional: it protects the pre-trained sorting features from being corrupted by noisy synthetic counting data. Ablation studies show that full-network fine-tuning performs significantly worse (MAE 43.9 vs 35.0 on SHB).
- Density Classifier Guided Partitioning (DCGP):
  - Function: Partition dense images during inference, counting each sub-patch individually and aggregating the results.
  - Mechanism: (a) Use SD to generate synthetic images across three density levels (none/sparse/dense) to train the density classification head \(h_\phi\); (b) During inference, generate a count map and a density map, directly summing the count map for sparse regions, and cropping patches from the original high-resolution image for dense regions to recount through the network.
  - Design Motivation: The counting network is more accurate at lower counts (as SD's label noise scales with quantity), and partitioning ensures the number of objects per patch falls within a reliable range. Using original high-resolution patches also avoids feature loss caused by resizing.
  - DCGP vs Simple Partitioning: A naive 3×3 partition is worse than no partitioning at all on SHB (MAE rises from 42.1 to 47.1) because cutting through sparse regions introduces boundary noise. DCGP, which partitions only the dense regions, reaches an MAE of 35.0.
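The DCGP inference step described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's implementation: the count map and density map are assumed to share one coarse grid, `count_patch` is a hypothetical callable standing in for a second pass of the counting network, and cells labelled "none" are assumed to contribute their count-map value just like sparse cells.

```python
import numpy as np
from typing import Callable

NONE, SPARSE, DENSE = 0, 1, 2  # density classes predicted per grid cell


def dcgp_count(
    image: np.ndarray,
    count_map: np.ndarray,
    density_map: np.ndarray,
    count_patch: Callable[[np.ndarray], float],
) -> float:
    """Aggregate a count by re-counting only the dense grid cells.

    image:       H x W (x C) original high-resolution image.
    count_map:   G_r x G_c per-cell counts from the first forward pass.
    density_map: G_r x G_c per-cell labels in {NONE, SPARSE, DENSE}.
    count_patch: hypothetical re-counting function applied to a cropped patch.
    """
    rows, cols = density_map.shape
    height, width = image.shape[:2]
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            if density_map[i, j] == DENSE:
                # Crop the matching region from the original image and recount it.
                y0, y1 = i * height // rows, (i + 1) * height // rows
                x0, x1 = j * width // cols, (j + 1) * width // cols
                total += float(count_patch(image[y0:y1, x0:x1]))
            else:
                # Non-dense cells: trust the first-pass count map directly.
                total += float(count_map[i, j])
    return total


# Toy usage: a 6x6 "image" of ones and a patch counter that sums pixels.
toy_image = np.ones((6, 6))
toy_counts = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_density = np.array([[SPARSE, DENSE], [NONE, SPARSE]])
print(dcgp_count(toy_image, toy_counts, toy_density, lambda patch: patch.sum()))
```

Only the dense cell is cropped and recounted, so sparse regions are never cut by patch boundaries, which is the failure mode attributed to uniform partitioning.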
Loss & Training
- Pre-training phase: \(\mathcal{L}_{sort} = \ell_{sort}^y + 5.0 \cdot \ell_{sort}^z\), joint sorting loss.
- Count anchoring phase: \(\mathcal{L}_{count}\) MSE loss, training only the linear layer.
- Density classification phase: \(\mathcal{L}_{dense}\) cross-entropy loss, training only the classification head.
- Sequential training across three phases, with the backbone frozen during the latter two stages.
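To make the pre-training objective concrete, the sketch below enumerates the ordered triplets obtained from 4 decremented and 4 incremented images and combines two loss terms with the 5.0 weight from \(\mathcal{L}_{sort}\). The hinge-style ranking loss is a deliberately simple stand-in for the RankSim-style loss used in the paper, and all function names and the margin value are illustrative assumptions.

```python
from itertools import product
from typing import Any, List, Sequence, Tuple


def enumerate_triplets(
    decremented: Sequence[Any], reference: Any, incremented: Sequence[Any]
) -> List[Tuple[Any, Any, Any]]:
    """Pair every fewer-object image with every more-object image around one reference."""
    return [(lo, reference, hi) for lo, hi in product(decremented, incremented)]


def hinge_sort_loss(
    pred_minus: float, pred_ref: float, pred_plus: float, margin: float = 1.0
) -> float:
    """Stand-in ranking loss: zero once pred_minus < pred_ref < pred_plus by `margin`."""
    lower = max(0.0, margin - (pred_ref - pred_minus))
    upper = max(0.0, margin - (pred_plus - pred_ref))
    return lower + upper


def joint_sort_loss(loss_y: float, loss_z: float, z_weight: float = 5.0) -> float:
    """L_sort = l_y + 5.0 * l_z (prediction-space plus weighted feature-space term)."""
    return loss_y + z_weight * loss_z


triplets = enumerate_triplets(["dec1", "dec2", "dec3", "dec4"], "ref",
                              ["inc1", "inc2", "inc3", "inc4"])
print(len(triplets))                    # number of ordered triplets per reference
print(hinge_sort_loss(3.0, 3.0, 5.0))   # reference not yet separated from x^{syn-}
```

The loss depends only on the relative order of the three predictions, never on the prompted count, which is why the noisy absolute labels do not enter this stage.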
Key Experimental Results
Main Results (Crowd Counting)
| Dataset | Metric | AFreeCA | CrowdCLIP | CSS-CCNN++ | MAE change vs. best prior |
|---|---|---|---|---|---|
| ShanghaiTech B | MAE | 35.0 | 69.3 | - | -49.5% |
| JHU-Crowd++ | MAE | 173.8 | 213.7 | 197.9 | -12.2% |
| ShanghaiTech A | MAE | 152.7 | 146.1 | 195.6 | +4.5% (CrowdCLIP is better) |
| UCF-QNRF | MAE | 283.1 | 283.3 | 414.0 | -0.1% (on par with CrowdCLIP, far below CSS-CCNN++) |
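The percentages in the last column of the table above are consistent with the relative MAE change against the best prior unsupervised result on each dataset (lower MAE is better, so a negative value is an improvement). A short check, using only the numbers from the table; the helper name is illustrative:

```python
def relative_change(ours: float, baseline: float) -> float:
    """Relative MAE change in percent; negative means our error is lower."""
    return 100.0 * (ours - baseline) / baseline


# (AFreeCA MAE, MAEs of prior unsupervised methods) taken from the table above.
results = {
    "ShanghaiTech B": (35.0, [69.3]),
    "JHU-Crowd++": (173.8, [213.7, 197.9]),
    "ShanghaiTech A": (152.7, [146.1, 195.6]),
    "UCF-QNRF": (283.1, [283.3, 414.0]),
}

for dataset, (ours, priors) in results.items():
    best_prior = min(priors)  # strongest previous method on this dataset
    print(f"{dataset}: {relative_change(ours, best_prior):+.1f}%")
```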
Object Counting (CARPK Vehicle Counting)
| Method | Type | MAE | MSE |
|---|---|---|---|
| BMNet+ | Few-shot | 10.44 | 13.77 |
| CLIP-Count | Zero-shot | 11.96 | 16.61 |
| AFreeCA | Unsupervised | 9.35 | 12.29 |
Ablation Study
| Configuration | SHB MAE | SHB MSE | Description |
|---|---|---|---|
| ImageNet pretrain | 68.1 | 104.7 | No sorting pre-training |
| Intra-Image Rank | 52.8 | 76.3 | Intra-image crop sorting only |
| \(\ell_{sort}^y\) only | 38.2 | 58.0 | Prediction space sorting loss only |
| \(\ell_{sort}^y + \ell_{sort}^z\) (Ours) | 35.0 | 50.7 | Joint feature + prediction sorting |
| Full network finetune | 43.9 | 74.6 | Full-network fine-tuning performs worse |
| Last layer finetune | 35.0 | 50.7 | Training only the linear layer is better |
Key Findings
- Sorting pre-training is crucial: Replacing ImageNet pre-training with sorting pre-training reduces SHB MAE from 68.1 to 35.0 (roughly halved).
- Fine-tuning only the linear layer is clearly better than full-network fine-tuning: 35.0 vs 43.9 MAE on SHB, which supports the strategy of protecting sorting features from corruption by noisy data.
- DCGP significantly outperforms fixed partitioning: A fixed 3×3 partition performs worse than no partitioning on certain datasets, whereas DCGP selectively processes regions, yielding overall improvement.
- Outperforms few-shot methods on CARPK vehicle counting: MAE of 9.35 vs BMNet+'s 10.44, demonstrating the category-agnostic nature of the method.
- ShanghaiTech A is the only dataset where the method does not lead: Likely because the ultra-high-density scenes in SHA exceed the partitioning capabilities of DCGP.
Highlights & Insights
- Turning LDM's "weakness" into an advantage: The inaccurate quantity comprehension of SD is a recognized issue, but this work cleverly identifies that its sorting signals from adding/subtracting objects are highly reliable. This "step-back" philosophy—prioritizing reliable sorting relationships over precise count labels—is highly elegant.
- The strategy of freezing the backbone and training only the linear layer: Simple yet extremely effective, this serves as a general strategy for handling synthetic data domain gaps and label noise. This insight is valuable for other tasks utilizing synthetic data for training.
- DCGP adaptive partitioning: Partitioning selectively on dense regions rather than uniformly. This avoids boundary noise caused by over-segmenting sparse regions, forming a practical engineering design.
Limitations & Future Work
- The density classifier utilizes only three categories: The granularity of none/sparse/dense may be insufficient; multi-level density classification or regression might be superior.
- Performance on SHA (ultra-high-density) falls short of CrowdCLIP: In extreme high-density scenarios, 3×3 partitioning might still be too coarse, necessitating finer-grained recursive partitioning strategies.
- Synthetic data uses only simple prompts: No complex prompt engineering or ControlNet was utilized to improve synthetic data quality.
- Fully freezing the backbone might be overly conservative: Selectively fine-tuning some layers (e.g., adapters or LoRA) might yield a better balance between protecting features and adapting to the task.
- No utilization of LDM spatial information: Attention maps from SD/SDXL may contain object location information, which could serve as weakly supervised spatial signals.
Related Work & Insights
- vs CrowdCLIP: CrowdCLIP utilizes CLIP features for sorting and text matching, whereas this work uses sorting triplets generated by SD for pre-training—the latter learns features that are more focused on target objects (independent of CLIP's semantic features).
- vs CSS-CCNN++: CSS-CCNN++ requires prior knowledge (e.g., maximum count, power-law parameters), while this work is entirely prior-free.
- vs CLIP-Count (Zero-shot): CLIP-Count achieves an MAE of 45.7 on SHB, while ours achieves 35.0; ours also performs better on CARPK. The unsupervised method outperforms the zero-shot approach.
Rating
- Novelty: ⭐⭐⭐⭐ The perspective of using LDM sorting signals for counting pre-training is novel, presenting the first annotation-free counting method for arbitrary categories.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across multiple datasets with detailed ablation studies, though it lacks quantitative evaluations on more non-crowd categories.
- Writing Quality: ⭐⭐⭐⭐ Clear logic, sequentially progressing through a three-stage pipeline, with well-designed diagrams.
- Value: ⭐⭐⭐⭐ Opens up a new direction for annotation-free counting using generative models, indicating high practical application potential.