Fast Data Attribution for Text-to-Image Models¶
Conference: NeurIPS 2025 arXiv: 2511.10721 Code: https://peterwang512.github.io/FastGDA Area: Image Generation Keywords: Data Attribution, Text-to-Image Models, Learning-to-Rank, Feature Distillation, Efficient Retrieval
TL;DR¶
This work distills the accurate but computationally expensive Attribution by Unlearning (AbU) method into a lightweight feature embedding space. By training via learning-to-rank, simple cosine similarity retrieval approximates the costly attribution ranking, enabling millisecond-level data attribution at the scale of Stable Diffusion + LAION-400M for the first time.
Background & Motivation¶
Data attribution aims to identify, given an image generated by a text-to-image model, the training images that most influenced that output. This problem is of critical practical importance for applications such as creator compensation and copyright tracing.
Existing methods face a fundamental trade-off between efficiency and accuracy:
- Influence function methods (TRAK, D-TRAK): Require pre-storing gradient information for all training samples, incurring enormous storage overhead (30–290 GB); a single query takes tens of seconds to minutes, and dimensionality reduction projection degrades accuracy.
- Unlearning methods (AbU): Compute attribution by performing an "unlearning" operation on the generated image and detecting which training images are affected. High accuracy but extremely slow — each query requires over 2 hours.
- Off-the-shelf feature retrieval (DINO, CLIP): Millisecond-level retrieval, but based solely on visual/semantic similarity and unable to truly reflect causal data influence.
- Economic infeasibility: Text-to-image platforms charge 5–10 cents per image, whereas the computational cost of existing attribution methods may be orders of magnitude higher.
The core question driving this work: can the precise attribution capability of slow methods be "taught" to a fast feature-retrieval system?
Method¶
Overall Architecture¶
The method consists of an offline training phase and an online deployment phase. In the offline phase, AbU+ serves as the teacher to generate large-scale attribution ranking data, which is used to train an attribution-specific feature embedding network via learning-to-rank. In the online phase, only the query image's feature embedding needs to be computed, followed by fast similarity retrieval using indices such as FAISS.
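To make the online phase concrete, here is a minimal sketch (not the authors' code): exact cosine search with NumPy standing in for a FAISS index, with toy dimensions and random data as placeholders. L2-normalizing the stored embeddings once turns retrieval into a single matrix–vector product.

```python
import numpy as np

def build_index(train_embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize training embeddings so dot product == cosine similarity."""
    norms = np.linalg.norm(train_embeddings, axis=1, keepdims=True)
    return train_embeddings / np.clip(norms, 1e-12, None)

def attribute(query_embedding: np.ndarray, index: np.ndarray, top_k: int = 10):
    """Rank training images by cosine similarity to the query's embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q                    # cosine similarity to every training image
    order = np.argsort(-scores)[:top_k]   # indices of the top-k attributed images
    return order, scores[order]

# toy usage: 1,000 "training images" with 64-dim attribution features
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 64))
index = build_index(train_feats)
ids, sims = attribute(train_feats[42], index, top_k=5)
# the query is training image 42 itself, so it should rank first with similarity ~1
```

In production, `index @ q` would be replaced by an approximate-nearest-neighbor search (e.g. FAISS) so that retrieval stays at millisecond latency even over hundreds of millions of samples.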
Key Designs¶
- Attribution by Unlearning+ (AbU+): Performs certified unlearning on a pretrained model: a one-step Newton update maximizes the loss on the generated image (effectively "forgetting" it), then detects which training images exhibit increased reconstruction loss. The attribution score is defined as \(\tau(\hat{\mathbf{z}}, \mathbf{z}) = \mathcal{L}(\mathbf{z}, \theta_{-\hat{\mathbf{z}}}) - \mathcal{L}(\mathbf{z}, \theta_0)\). Compared to the original AbU, EK-FAC (Eigenvalue-corrected Kronecker-Factored Approximate Curvature) replaces the diagonal Fisher approximation when inverting the Fisher information matrix, significantly improving attribution quality. However, AbU+ still requires a forward pass over each training sample, taking about 2 hours per query for 100K training images.
- Two-stage data collection strategy: Directly computing attribution scores over the entire training set is prohibitively expensive. Observing that most training samples have negligible contribution to a given generated image, the method first retrieves \(K=10000\) nearest neighbors using off-the-shelf DINO features, then applies AbU+ only to these candidates. This reduces the per-query data collection cost from \(O(N)\) to \(O(K)\), provided the neighbor set contains the truly influential images.
- Attribution-specific feature learning: The feature embedding is \(f_\psi = g_\psi \circ \phi\), where \(\phi\) is a pretrained encoder (DINO image encoder concatenated with CLIP text encoder) and \(g_\psi\) is a three-layer MLP. Cosine similarity predicts the attribution ranking as \(r_\psi(\hat{\mathbf{z}}, \mathbf{z}_i) = \cos(f_\psi(\hat{\mathbf{z}}), f_\psi(\mathbf{z}_i))\), trained with a BCE loss: attribution rankings are normalized to labels in \([1/K, 1]\), the predicted similarity is passed through a sigmoid with learnable affine scaling, and binary cross-entropy is computed between the two. BCE is preferred over MSE regression (which fails to converge) and ordinal loss (which does not support fast cosine retrieval).
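The embedding and scoring function can be sketched as a forward pass. This is a hypothetical NumPy stand-in: the paper uses a trained three-layer MLP head \(g_\psi\) on frozen DINO+CLIP-Text features, whereas here the weights are random and the dimensions (`D_BACKBONE`, `D_EMBED`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D_BACKBONE, D_EMBED = 768, 256   # hypothetical dimensions

# g_psi: a small 3-layer MLP head on top of frozen backbone features phi(z)
W = [rng.normal(0, 0.02, size=(D_BACKBONE, 512)),
     rng.normal(0, 0.02, size=(512, 512)),
     rng.normal(0, 0.02, size=(512, D_EMBED))]
b = [np.zeros(512), np.zeros(512), np.zeros(D_EMBED)]

def f_psi(phi: np.ndarray) -> np.ndarray:
    """Attribution embedding f_psi = g_psi(phi(z)): 3-layer MLP with ReLU."""
    h = np.maximum(phi @ W[0] + b[0], 0.0)
    h = np.maximum(h @ W[1] + b[1], 0.0)
    return h @ W[2] + b[2]

def r_psi(phi_query: np.ndarray, phi_train: np.ndarray) -> float:
    """Predicted attribution score: cosine similarity in the learned space."""
    a, c = f_psi(phi_query), f_psi(phi_train)
    return float(a @ c / (np.linalg.norm(a) * np.linalg.norm(c)))

# toy backbone features for a query image and a training image
phi_q = rng.normal(size=D_BACKBONE)
phi_t = rng.normal(size=D_BACKBONE)
score = r_psi(phi_q, phi_t)
```

Because scoring is plain cosine similarity, the learned embeddings of all training images can be precomputed and indexed once; only the query's embedding is computed at deployment time.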
Loss & Training¶
- BCE ranking loss: \(\mathcal{L}(\psi, \alpha, \beta) = \mathbb{E} [\ell_{\text{BCE}}(\pi_{\hat{\mathbf{z}}}^i, \sigma_{\alpha,\beta}(r_\psi(\hat{\mathbf{z}}, \mathbf{z}_i)))]\), where \(\sigma_{\alpha,\beta}(x) = 1/(1+e^{-(\alpha x + \beta)})\) with two learnable affine parameters.
- Negative sample injection: With probability 0.1, training samples are randomly drawn from outside the neighbor set and assigned the worst rank (rank=1), helping the model distinguish irrelevant images. At a ratio of 0.1, mAP improves from 0.709 to 0.724; higher ratios degrade fine-grained ranking learning.
- Neighbor subsampling: Only \(M \approx 0.1K\) candidate training samples are used per iteration, substantially reducing data collection cost with negligible loss in ranking accuracy.
- Data scaling study: Increasing the number of queries is more effective than increasing the number of candidates per query — under a fixed budget of 2.45M attribution pairs, configurations with more queries and fewer candidates per query perform better.
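The BCE ranking loss above can be illustrated with a small NumPy sketch. The label convention (label = rank\(/K\), with rank \(K\) the most influential and rank 1 the worst, matching the negative-injection rule) is an assumption consistent with the description; in training, \(\alpha, \beta\) would be optimized jointly with \(\psi\), but here they are fixed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_ranking_loss(cos_sims, rank_labels, alpha, beta):
    """BCE between normalized rank labels pi in [1/K, 1] and
    sigmoid(alpha * cosine + beta), with (alpha, beta) a learnable affine map."""
    p = np.clip(sigmoid(alpha * cos_sims + beta), 1e-7, 1 - 1e-7)
    return float(np.mean(-(rank_labels * np.log(p)
                           + (1 - rank_labels) * np.log(1 - p))))

# toy batch: K = 5 candidates for one query, rank 5 = most influential
K = 5
ranks = np.array([5, 4, 3, 2, 1])
labels = ranks / K                                 # normalized to [1/K, 1]
cos_sims = np.array([0.9, 0.6, 0.3, 0.0, -0.3])   # well-aligned predictions
loss = bce_ranking_loss(cos_sims, labels, alpha=4.0, beta=0.0)
# reversing the similarities misorders the ranking and should raise the loss
loss_reversed = bce_ranking_loss(cos_sims[::-1], labels, alpha=4.0, beta=0.0)
```

Unlike a pairwise ordinal loss, this pointwise formulation keeps the score a pure cosine similarity, so the trained embedding plugs directly into standard nearest-neighbor indices.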
Key Experimental Results¶
Main Results: MSCOCO Counterfactual Evaluation¶
On 100K MSCOCO training images, leave-\(K\)-out counterfactual testing is performed on 110 generated image queries (the model is retrained after removing the top-\(K\) attributed training images):
| Method | Latency | Storage | ΔL(k=500)↑ | ΔL(k=4000)↑ | MSE(k=500)↑ | CLIP(k=500)↓ |
|---|---|---|---|---|---|---|
| Random | — | — | 3.51 | 3.47 | 4.09 | 7.86 |
| D-TRAK | 46.7s | 30GB | 5.44 | 9.59 | 5.86 | 7.31 |
| AbU+ | 2.28hr | 1.9GB | 5.83 | 10.70 | 5.64 | 7.15 |
| DINO | 11.6ms | 354MB | 4.76 | 8.06 | 4.51 | 7.41 |
| Ours | 18.7ms | 354MB | 5.28 | 9.35 | 4.78 | 7.37 |
Among fast methods (latency < generation time of 21.5s), the proposed method achieves the best attribution performance; it is 2,500× faster than D-TRAK and 400,000× faster than AbU+.
Ablation Study¶
Feature space selection:
| Feature | mAP before tuning | mAP after tuning |
|---|---|---|
| CLIP-Text | Higher | Moderate |
| DINO | Moderate | Higher |
| DINO + CLIP-Text | — | Highest |
Before tuning, text features perform better; after tuning, image features surpass them — indicating that visual information is more fundamental for attribution but requires attribution-specific training to be activated. The final model uses the concatenation of DINO and CLIP-Text.
Loss function comparison: MSE regression fails to converge; ordinal loss achieves comparable ranking accuracy to BCE but does not support fast cosine retrieval; BCE offers the best trade-off between accuracy and efficiency.
Data scaling: Performance improves rapidly with the number of queries before saturating; a few thousand queries suffice to capture most of the ranking signal.
Stable Diffusion Scale Validation¶
Validated on Stable Diffusion v1.4 + LAION-400M:
- 100K neighbor candidates are retrieved per query; a total of 101M attribution pairs are collected for training.
- The tuned DINO+CLIP-Text features yield significant improvements across all mAP thresholds.
- Unlike MSCOCO, text features are more critical for attribution on the SD model — possibly because AbU+ attribution scores correlate more strongly with text similarity.
- The data has not yet saturated; additional compute budget can further improve performance.
Key Findings¶
- Attribution-specific features can be effectively distilled from slow methods — 18.7ms retrieval retains most of the attribution accuracy of hour-level methods.
- The choice and combination of feature spaces have a large impact on results — neither visual nor text features alone are sufficient; their fusion achieves the best performance.
- The combination of two-stage collection and subsampling improves data collection efficiency by orders of magnitude, making large-scale data collection practically feasible.
Highlights & Insights¶
- Elegance of the distillation approach: The retrieval mechanism itself is unchanged (still cosine similarity); only the feature space is optimized to align with attribution rankings — zero deployment overhead.
- First large-scale validation: Successful application of data attribution to Stable Diffusion trained on LAION-400M, demonstrating the scalability of the method.
- Independent contribution of AbU+: Replacing the diagonal Fisher approximation with EK-FAC is a valuable improvement in its own right.
- Systematic design study: Comprehensive ablations over feature selection, loss functions, data scale, and sampling strategies provide clear design guidance for future work.
Limitations & Future Work¶
- Distillation preserves only ranking information, discarding the absolute magnitude and concentration/distribution of influence — it cannot distinguish between "strongly influenced by a single training image" and "moderately influenced by multiple training images."
- Systematic biases of the teacher method AbU+ are inherited by the student model.
- Offline data collection still requires substantial GPU time — approximately 1,470 GPU hours for MSCOCO and 17,250 GPU hours for Stable Diffusion.
- Validation is limited to diffusion models; applicability to new architectures such as flow matching and one-step models remains unexplored.
- The student MLP is lightweight (3 layers), and performance may be bottlenecked by the pretrained feature representations.
Related Work & Insights¶
- TRAK/D-TRAK: Gradient projection-based influence function methods with moderate speed but large storage requirements and accuracy limited by projection dimensionality.
- AbU: The predecessor of the teacher method in this work; accurate but impractical at 2 hours per query.
- AbC (Wang et al. 2023): Adapts features for attribution in the customization setting; this work generalizes that idea to general-purpose attribution.
- FAISS: An efficient approximate nearest neighbor retrieval library that enables retrieval over training sets of hundreds of millions of samples.
- Insight: The distillation + retrieval paradigm can be generalized to other scenarios requiring a "slow-accurate vs. fast-coarse" trade-off, such as semantic search and recommendation systems.
Rating¶
⭐⭐⭐⭐ (4/5)
This work is the first to scale data attribution to Stable Diffusion + LAION-400M and achieve millisecond-level deployment. The distillation + retrieval technical framework is elegant and practical, and the systematic design ablations provide clear guidance for future research. Points are deducted primarily because the data collection phase still requires substantial GPU investment, and distillation discards the quantitative information of attribution.