# CaMiT: A Time-Aware Car Model Dataset for Classification and Generation
Conference: NeurIPS 2025 | arXiv: 2510.17626 | Code: GitHub | Area: Image Generation | Keywords: temporal dataset, fine-grained classification, continual learning, time-aware generation, car model recognition
## TL;DR
This paper introduces the CaMiT dataset (787K labeled + 5.1M unlabeled car images, 2005–2023) to systematically study temporal drift in fine-grained visual categories, providing benchmarks across four settings: static pre-training, time-incremental pre-training, time-incremental classifier learning, and time-aware image generation.
## Background & Motivation
Existing large-scale visual datasets (ImageNet, LAION, DataComp) follow a "train once" paradigm, neglecting the appearance drift that visual categories undergo over time. This drift is particularly pronounced in technical artifacts such as automobiles, where design iterations, new model releases, and the retirement of older models cause the visual representation of the same category to diverge increasingly across years.
Prior work either focuses on coarse-grained visual categories (VCT-107, CLEAR) or emphasizes large-scale VLM continual pre-training (TIC-DataComp), leaving a gap in temporally-aware datasets for fine-grained technical artifacts over long time horizons. Existing car datasets such as StanfordCars and CompCars include production year metadata but lack image upload timestamps and contain fewer than 150 samples per class, making them unsuitable for long-range temporal analysis.
CaMiT collects nearly 20 years of car images via the Flickr API and is the first dataset to address a key question: how should visual models capture the appearance evolution of fine-grained technical artifacts over extended time periods?
## Method
### Overall Architecture
CaMiT is constructed in three stages — data collection → data filtering → data annotation — and is evaluated across four experimental settings:
- SPT (Static Pre-Training): analyzes temporal drift effects without any temporal mitigation
- TIP (Time-Incremental Pre-Training): updates the backbone model on a yearly basis
- TICL (Time-Incremental Classifier Learning): freezes the backbone and updates only the classification head
- TAIG (Time-Aware Image Generation): incorporates temporal metadata into training captions
### Key Designs
Dataset construction pipeline:
- Collection: Flickr API queries over 425 car sub-category/brand/model combinations, up to 5,000 images per query per year, yielding 7.5M initial images
- Filtering: CLIP embedding deduplication (threshold 0.9) → YOLOv11x vehicle detection (confidence ≥ 0.6, bbox ≥ 64px) → Qwen2.5-7B filtering of non-exterior images → face blurring → SAM 2 overlap detection, resulting in 5.87M vehicle crops
- Annotation: semi-automatic pipeline — Qwen2.5-7B open-set prediction → GPT-4o focused confirmation → discriminator ensemble trained with DeiT weak labels; thresholds determined after human verification of 20K samples, achieving 99.6% annotation accuracy across 190 classes
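The CLIP-embedding deduplication step (threshold 0.9) can be sketched as a greedy filter on cosine similarity. This is an illustrative implementation; the paper's exact procedure may differ:

```python
import numpy as np

def dedup_by_clip_similarity(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedy near-duplicate removal: keep an image only if its cosine
    similarity to every previously kept image is below `threshold`."""
    # L2-normalize so dot products are cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, emb in enumerate(normed):
        if not kept or np.max(normed[kept] @ emb) < threshold:
            kept.append(i)
    return kept

# Toy example: the third vector is a near-duplicate of the first
embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.99, 0.05]])
kept = dedup_by_clip_similarity(embs)  # the near-duplicate is dropped
```

In practice a vector index (e.g. FAISS) would replace the brute-force comparison at the 7.5M-image scale described above.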
Temporal drift analysis: KID (Kernel Inception Distance) between years is computed using CLIP ViT-B embeddings, confirming that larger temporal gaps correspond to greater embedding divergence.
Classification experiment design:
- SPT: compares DINOv2, CLIP (general pre-training), and MoCo v3 (domain-specific pre-training) with LoRA adaptation
- TIP: Reservoir update vs. LoRA annual adaptation vs. their combination
- TICL: frozen backbone + NCM / FeCAM / RanPAC / RanDumb incremental classifiers
- TAIG: SD1.5 + LoRA with temporal metadata embedded in captions
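As a concrete example of the TICL setting, a minimal NCM classifier over frozen-backbone features can be updated year by year from running class sums alone, never revisiting past data. This is a sketch; feature extraction by the frozen backbone is assumed to happen upstream:

```python
import numpy as np

class NCMClassifier:
    """Nearest Class Mean over frozen-backbone features.
    Per-class running means are updated incrementally, which is
    what makes NCM a natural TICL baseline."""

    def __init__(self):
        self.sums, self.counts = {}, {}

    def partial_fit(self, feats: np.ndarray, labels: np.ndarray) -> None:
        # Accumulate per-class sums/counts; one call per yearly batch
        for f, y in zip(feats, labels):
            self.sums[y] = self.sums.get(y, 0.0) + f
            self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, feats: np.ndarray) -> np.ndarray:
        classes = sorted(self.sums)
        means = np.stack([self.sums[c] / self.counts[c] for c in classes])
        # Assign each feature to the nearest class mean
        dists = np.linalg.norm(feats[:, None, :] - means[None], axis=-1)
        return np.array([classes[i] for i in dists.argmin(axis=1)])

clf = NCMClassifier()
clf.partial_fit(np.array([[0.0, 0.0], [1.0, 1.0]]), np.array([0, 1]))  # year t
clf.partial_fit(np.array([[0.2, 0.0], [0.8, 1.0]]), np.array([0, 1]))  # year t+1
preds = clf.predict(np.array([[0.05, 0.05], [0.9, 0.9]]))
```

FeCAM, RanDumb, and RanPAC refine this recipe with covariance-aware distances or randomized feature expansions, at modest extra cost.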
### Loss & Training
Classification experiments employ NCM (Nearest Class Mean) or the original objectives of the respective TICL methods. Generation experiments use the standard Stable Diffusion 1.5 diffusion loss with LoRA fine-tuning:

\[
\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\| \epsilon - \epsilon_\theta\!\left(z_t, t, c_{\text{time}}\right) \right\|_2^2\right]
\]

where \(c_{\text{time}}\) denotes the text condition encoding year information.
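The time-conditioning in TAIG amounts to injecting the upload year into each training caption before LoRA fine-tuning; a minimal sketch (the caption template here is an assumption, not the paper's exact wording):

```python
def time_aware_caption(brand: str, model: str, year: int) -> str:
    """Build a caption carrying temporal metadata, so the frozen text
    encoder exposes year information to the diffusion model as part
    of the condition c_time. Template is illustrative only."""
    return f"a photo of a {brand} {model}, {year}"

caption = time_aware_caption("BMW", "M3", 2012)
```

Dropping the year from the template recovers the time-agnostic baseline, so the two variants differ only in the text condition.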
## Key Experimental Results
### Main Results
| Pre-training | Model | \(A_{avg}\)↑ | \(A_{crt}\)↑ | \(A_{bck}\)↑ | \(A_{fwd}\)↑ |
|---|---|---|---|---|---|
| SPT | DINOv2 ViT-B | 26.1 | 32.6 | 26.1 | 25.3 |
| SPT | CLIP+Li ViT-B | 65.6 | 74.0 | 63.9 | 66.3 |
| SPT | MoCo v3+Li ViT-B | 66.0 | 76.5 | 63.2 | 67.4 |
| TIP | MoCo v3+R+La ViT-S | 78.5 | 90.2 | 82.5 | 73.0 |
| TICL | RanPAC + MoCo v3+Li ViT-B | 87.8 | — | — | — |
The TICL approach (RanPAC + domain-specific pre-trained backbone) achieves the best overall accuracy of 87.8%, a gain of 21.8 percentage points over naive NCM.
### Ablation Study
| TICL Algorithm | DINOv2 ViT-S | MoCo v3+Li ViT-S | CLIP+Li ViT-B | MoCo v3+Li ViT-B |
|---|---|---|---|---|
| NCM | 20.9 | 64.9 | 65.6 | 66.0 |
| NCM-TI | 26.3 | 71.4 | 70.1 | 72.1 |
| FeCAM | 61.0 | 85.6 | 79.9 | 81.5 |
| RanDumb | 62.1 | 83.1 | 77.2 | 84.2 |
| RanPAC | 66.4 | 86.6 | 80.3 | 87.8 |
- LoRA-based annual adaptation in TIP requires only 0.3 GPU hours/year, compared to 18 GPU hours/year for the Reservoir variant
- Domain-specific MoCo v3 pre-training with LoRA adaptation achieves performance comparable to CLIP trained on 2 billion images
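The core of RanPAC, which drives the best TICL numbers above, is a frozen random projection with a nonlinearity, followed by class-prototype accumulation in the expanded feature space. A simplified sketch (the full method additionally solves a Gram-matrix ridge regression, omitted here):

```python
import numpy as np

def ranpac_features(x: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Frozen random projection + ReLU: RanPAC's expanded feature space."""
    return np.maximum(x @ proj, 0.0)

rng = np.random.default_rng(0)
d_in, d_out = 8, 256                    # project to a much higher dimension
proj = rng.normal(size=(d_in, d_out))   # drawn once, never trained

# Accumulate class prototypes incrementally in the projected space;
# like NCM, this needs no access to past years' data
feats = ranpac_features(rng.normal(size=(10, d_in)), proj)
labels = np.array([0, 1] * 5)
protos = {c: feats[labels == c].mean(axis=0) for c in (0, 1)}
```

Because the projection is fixed and prototypes are running means, updates per year cost little more than a forward pass, consistent with the efficiency figures above.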
## Key Findings
- In fine-grained classification, domain-specific pre-training (MoCo v3 on CaMiT) with LoRA adaptation matches large-scale general-purpose models (CLIP trained on 2B images), a finding that diverges from conclusions in coarse-grained classification
- Time-Incremental Classifier Learning (TICL) is the most effective strategy for mitigating temporal drift, with RanPAC achieving the best performance on the domain-specific backbone
- Time-Aware Image Generation (TAIG), by embedding year metadata, yields generated image distributions that more closely match the real data distribution
## Highlights & Insights
- ⭐ The first long-horizon temporal visual dataset targeting fine-grained technical artifacts, filling an important gap in the field
- ⭐ Systematic evidence that specialized small models can match general large models on fine-grained tasks, providing strong support for resource-efficient approaches
- The semi-automatic annotation pipeline (VLM + discriminative model ensemble) achieves 99.6% accuracy at substantially lower cost than fully manual annotation
- The TICL approach (frozen backbone + incremental classifier) strikes an excellent balance between accuracy and computational efficiency
## Limitations & Future Work
- The dataset focuses exclusively on automobiles; generalizability to other technical artifact categories (electronics, furniture, etc.) remains unverified
- Flickr as the sole data source introduces geographic and demographic biases (European brands are most heavily represented)
- Time-aware generation is only explored with SD1.5; more advanced generative architectures are not investigated
- The 190 class categories are relatively limited and do not cover all mainstream global car models
## Related Work & Insights
CaMiT complements temporal visual datasets such as CLEAR, VCT-107, and TIC-DataComp: those target coarse-grained categories or large-scale VLM pre-training, whereas CaMiT focuses on fine-grained recognition over a long horizon. The strong performance of RanPAC in the TICL experiments suggests that random projection combined with prototype accumulation may be an efficient paradigm for temporal continual learning. The dataset also has direct practical relevance to deployed systems such as vehicle model recognition in autonomous driving.
## Rating
⭐⭐⭐⭐ (4/5)
| Dimension | Rating |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
The dataset contribution is solid and comprehensive, and the four experimental settings are well designed. However, the core technical novelty is relatively limited — the work is primarily a systematic combination of existing methods — and its scope is confined to a single artifact category.