iFusion: Integrating Dynamic Interest Streams via Diffusion Model for Click-Through Rate Prediction

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=iYQgXETC1D
Code: To be confirmed
Area: Recommender Systems / CTR Prediction / Generative User Interest Modeling
Keywords: CTR Prediction, Long-Short Term Interest Fusion, Diffusion Models, Classifier-Free Guidance, Autoregressive Denoising

TL;DR

iFusion reformulates "long-short term user interest fusion" as a conditional generation problem—utilizing short-term interests as guidance to perform diffusion denoising on long-term interest representations. This approach bypasses the assumptions of traditional linear fusion (concatenation/attention/gating), achieving CTR improvements across public datasets, industrial datasets, and online A/B tests.

Background & Motivation

Background: CTR prediction is core to recommendation and advertising, predominantly built upon user behavior modeling. Conventional practices partition behaviors into long-term sequences (stable preferences, modeled via historical logs) and short-term sequences (volatile interests, modeled via recent sessions), which are then fused after sophisticated individual modeling. Substantial progress has been made in long-term modeling (SIM, ETA, TWIN series) and short-term modeling independently.

Limitations of Prior Work: By contrast, the fusion step itself has long been overlooked. Existing fusion methods (concatenation, attention, gating) inherently rely on linear assumptions. The authors identify three critical flaws:

  • Feature Space Misalignment: Long-term and short-term behaviors use different features and encoders (e.g., clicks vs. purchases). These two representations are naturally heterogeneous; linear operators presuppose spatial alignment that does not exist.
  • Fragmented Late-Fusion: Late-fusion splits "behavior modeling" and "interest fusion" into independent pipelines, creating an inductive bias that hinders cross-sequence integration.
  • Perturbation-Interest Entanglement: Linear fusion lacks a mechanism to distinguish meaningful interest signals from random fluctuations in the short term, allowing noise to propagate uncontrollably and contaminate stable long-term representations.

Furthermore, generative ranking schemes like HSTU concatenate all behaviors into a single sequence, which can lead to insufficient modeling when behaviors are sparse—discriminative ranking fails to "infer and fuse" interests from limited evidence.

Key Challenge: Long-short term interests follow heterogeneous, non-stationary, and sometimes contradictory evolution patterns. Linear fusion operators cannot capture these non-linear interactions nor disentangle noise from signals.

Goal: To design a fusion mechanism that is robust to perturbations, allows for joint modeling and fusion, and satisfies low-latency requirements for online serving.

Core Idea: Instead of treating fusion as a deterministic operator, it is reframed as conditional generation. The long-term interest \(h_L\) is treated as the initial diffusion state \(x_0\), which is diffused to Gaussian noise in the forward process. In the reverse process, guided by short-term session interests \(\{h_i^S\}\), a fused representation \(\hat{x}_0\) is generated that preserves long-term preferences while absorbing short-term dynamics.

Method

Overall Architecture

iFusion employs long-term interest \(h_L\) as the starting point \(x_0\) of the diffusion process. The forward process incrementally adds noise following a variance schedule \(\{\beta_t\}\), where \(q(x_t|x_0)=\mathcal{N}(\sqrt{\bar\alpha_t}x_0,(1-\bar\alpha_t)I)\), until it converges to standard Gaussian. The reverse process performs step-wise denoising guided by short-term session interests to yield the "Generative Fused Interest (GFI)" \(\hat{x}_0\), which is then concatenated with other features for the final pCTR prediction. Two core components support the reverse process: DCFG provides robust guidance under perturbations, and MARN models multi-session mixed guidance and interest evolution along the denoising chain.

flowchart LR
    A[Long-term Interest h_L] -->|x_0 Forward Addition of Noise| B[Noisy State x_T]
    C[Short-term Session Interests h_i^S] --> D[DCFG Decoupled Guidance]
    D -->|Core Preference g_cp / Transient Fluctuation g_tf| E[MARN Autoregressive Denoising]
    B --> E
    E --> F[Generative Fused Interest GFI x̂_0]
    G[Other Features] --> H[Final Layer]
    F --> H --> I[pCTR]
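
The forward process described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the linear variance schedule, the number of steps `T`, and the toy vector `h_L` are assumptions chosen only to make the closed form \(q(x_t|x_0)\) concrete, and the helper names are hypothetical.

```python
import math
import random


def alpha_bar_schedule(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def forward_diffuse(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


# Assumed linear schedule; the summary does not give numeric values.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = alpha_bar_schedule(betas)

rng = random.Random(0)
h_L = [0.5, -1.2, 0.3, 0.8]  # toy long-term interest vector, used as x_0
x_mid = forward_diffuse(h_L, alpha_bars[T // 2], rng)  # partially noised
x_T = forward_diffuse(h_L, alpha_bars[-1], rng)        # close to pure noise
```

Because \(\bar\alpha_T\) is close to zero under this schedule, `x_T` retains almost nothing of `h_L`, which is why the reverse process needs the short-term guidance to recover a useful representation.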

Key Designs

1. DCFG: Decoupling guidance into "Core Preference" and "Transient Fluctuation". Standard Classifier-Free Guidance (CFG) uses a single scale factor \(\gamma\) to mix conditional and unconditional predictions \(\hat f_\theta=f_\theta(x_t,t)+\gamma(f_\theta(x_t,t,g)-f_\theta(x_t,t))\), assuming uniform signal quality. However, the signal-to-noise ratio in interest representations is significantly lower than in image generation; stable preferences and transient fluctuations are mixed, and a uniform scale is susceptible to noise. Drawing from stochastic thermodynamics, the authors view user interest dynamics as particles moving in a composite potential field \(V(x_t|g)=V(x_t|g_{cp})+V(x_t|g_{tf})\), where core preferences \(g_{cp}\) create deep potential wells (stable attractors) and transient fluctuations \(g_{tf}\) create shallow perturbations. Functional decoupling is achieved via specialized structures: a "low-pass" path for core preferences \(h_{cp}=\text{AvgPool}(\text{Encoder}(g))\) (strong regularization + global pooling for stability) and a "high-pass" path for transient fluctuations \(h_{tf}=\text{Attention}(\text{Encoder}(g))\) (capturing variations). Theorem 1 proves that if the Hessian principal eigenspaces of the two energy functions are approximately orthogonal, the conditional score can be precisely decomposed as \(\nabla_{x_t}\log p=\gamma_{cp}(-\nabla E_{cp})+\gamma_{tf}(-\nabla E_{tf})\), grounding decoupling in architectural constraints rather than strict independence assumptions. The final guidance is \(\hat f_\theta=f_\theta(x_t,t)+\sum_{j\in\{cp,tf\}}\gamma_j(f_\theta(x_t,t,g_j)-f_\theta(x_t,t))\).
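
The final DCFG guidance formula is a small extension of standard CFG, which the sketch below makes explicit. The function names and the toy prediction vectors are hypothetical; in the paper the three predictions would come from the learned network \(f_\theta\), which is not modeled here.

```python
def cfg_combine(f_uncond, f_cond, gamma):
    """Standard CFG: f_uncond + gamma * (f_cond - f_uncond)."""
    return [u + gamma * (c - u) for u, c in zip(f_uncond, f_cond)]


def dcfg_combine(f_uncond, f_cp, f_tf, gamma_cp, gamma_tf):
    """Decoupled guidance: one scale per branch (core preference, transient fluctuation)."""
    return [
        u + gamma_cp * (cp - u) + gamma_tf * (tf - u)
        for u, cp, tf in zip(f_uncond, f_cp, f_tf)
    ]


# Toy predictions standing in for f_theta(x_t, t), f_theta(x_t, t, g_cp), f_theta(x_t, t, g_tf).
f_uncond = [0.0, 1.0]
f_cp = [1.0, 1.0]
f_tf = [0.0, 3.0]

guided = dcfg_combine(f_uncond, f_cp, f_tf, gamma_cp=2.0, gamma_tf=0.5)
print(guided)  # [2.0, 2.0]
```

Setting `gamma_tf` to zero reduces `dcfg_combine` to `cfg_combine` on the core-preference branch, so the second scale is what lets the transient branch be weighted down independently.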

2. MARN: Utilizing autoregressive denoising to chain multi-session guidance. Current diffusion-based recommenders often rely on single-vector guidance plus non-autoregressive (NAR) structures (e.g., parallel injection via MLP/Transformer). Parallel generation fails to capture fine-grained session dependencies and struggles with strong temporal coupling or non-linear guidance relationships. MARN processes the \(K\) short-term session interests sequentially according to the chain rule: the denoised output from one session acts as the "noisy representation" for the next session, decomposing the complex joint distribution into a conditional chain. Theorem 2 provides three guarantees for the superiority of AR over NAR in multi-session diffusion: a tighter KL upper bound when session dependencies exist (\(I(s_i;s_j)>0\)), an \(O(K)\) reduction in gradient variance, and gradient-based adaptive session weighting \(\alpha_k\propto\exp(-\|\nabla_{s_k}L\|/\sigma_t)\). NAR matches AR only when sessions are independent or latency constraints are extreme, and the AR advantage scales super-linearly with the number of sessions \(K\).
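
The chaining idea can be illustrated with a toy loop. Everything here is an assumption made for illustration: `toy_denoise` is a hypothetical stand-in for the learned session-conditioned denoiser, and the oldest-to-newest processing order is a guess, since the summary only says sessions are processed sequentially.

```python
def toy_denoise(x, guidance, rate=0.5):
    """Hypothetical stand-in for the learned denoiser: pull x part-way toward the guidance."""
    return [xi + rate * (gi - xi) for xi, gi in zip(x, guidance)]


def autoregressive_denoise(x_T, session_guidances, denoise=toy_denoise):
    """Chain the sessions: each session's output is the next session's noisy input."""
    x = list(x_T)
    for g in session_guidances:
        x = denoise(x, g)
    return x


x_T = [0.0]
sessions = [[1.0], [3.0]]  # K = 2 toy session interests

forward_order = autoregressive_denoise(x_T, sessions)
reverse_order = autoregressive_denoise(x_T, sessions[::-1])
print(forward_order, reverse_order)  # [1.75] [1.25]
```

The two orderings give different results, which is the property a parallel (NAR) injection of the same two guidance vectors would not capture.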

3. Consistency Constraint: Learning noise-invariant representations to enable one-step inference under online low-latency requirements. Iterative denoising is too slow for industrial CTR services. The authors introduce a consistency loss \(L_{cons}=\mathbb{E}_{t_1,t_2}\|f_\theta(x_{t_1},t_1)-f_\theta(x_{t_2},t_2)\|^2\), which forces the generated interest representations to agree across different noise levels. The model thereby learns noise-invariant representations and achieves high-quality generation with very few sampling steps (in the experiments, the best results come from a cosine schedule with single-step inference), making diffusion models viable for real-time systems.
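
A Monte Carlo estimate of \(L_{cons}\) might look like the following sketch. It assumes that \(x_{t_1}\) and \(x_{t_2}\) are noised independently from the same \(x_0\) and that timesteps are drawn uniformly; the summary does not specify either choice, and the two lambda "models" are toy stand-ins for \(f_\theta\).

```python
import math
import random


def forward_diffuse(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def consistency_loss(f_theta, x0, alpha_bars, rng, num_pairs=64):
    """Average squared distance between outputs at two random noise levels."""
    total = 0.0
    last = len(alpha_bars) - 1
    for _ in range(num_pairs):
        t1, t2 = rng.randint(0, last), rng.randint(0, last)
        out1 = f_theta(forward_diffuse(x0, alpha_bars[t1], rng), t1)
        out2 = f_theta(forward_diffuse(x0, alpha_bars[t2], rng), t2)
        total += sum((a - b) ** 2 for a, b in zip(out1, out2))
    return total / num_pairs


alpha_bars = [0.9, 0.5, 0.1]  # toy noise levels
x0 = [0.5, -1.2, 0.3]

# A model that just echoes its noisy input is penalized ...
loss_identity = consistency_loss(lambda x_t, t: x_t, x0, alpha_bars, random.Random(0))
# ... while a perfectly noise-invariant output incurs zero loss.
loss_invariant = consistency_loss(lambda x_t, t: [0.0] * len(x_t), x0, alpha_bars, random.Random(0))
```

A constant output minimizes this term trivially, which is why it only makes sense alongside the task losses rather than on its own.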

4. Multi-objective Training and Zero-data Fallback. Total loss \(L=L_{CE}+\lambda_1 L_{Evol}+\lambda_2 L_{Dist}+\lambda_3 L_{cons}+\beta\|\Theta\|^2\). \(L_{CE}\) handles the primary CTR task, \(L_{Evol}\) (cosine distance of the next session interest) handles interest evolution, \(L_{Dist}=\|g_{cp}^\top g_{tf}\|_2^2\) enforces decoupling, and \(L_{cons}\) manages efficiency. Theoretically, Theorems 3 and 4 prove that when behavior data is entirely missing, the denoising process retreats along the interest manifold to a population-level statistical prior \(\epsilon_\theta(z_t,t)=\mathbb{E}_{z_0\sim p_{data}}[\epsilon|z_t]\), providing a reasonable fallback for cold-start or sparse scenarios.
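
The composition of the total loss can be written out as a short sketch. The helper names, the toy vectors, and the weight values are hypothetical; for vector-valued \(g_{cp}\) and \(g_{tf}\), \(L_{Dist}\) is taken here to be the squared inner product, and \(L_{Evol}\) is assumed to be one minus cosine similarity.

```python
import math


def dist_loss(g_cp, g_tf):
    """L_Dist for vectors: squared inner product, zero when the two are orthogonal."""
    dot = sum(a * b for a, b in zip(g_cp, g_tf))
    return dot * dot


def cosine_distance(pred, target):
    """Assumed form of L_Evol: 1 - cosine similarity to the next-session interest."""
    dot = sum(a * b for a, b in zip(pred, target))
    norm = math.sqrt(sum(a * a for a in pred)) * math.sqrt(sum(b * b for b in target))
    return 1.0 - dot / norm


def total_loss(l_ce, l_evol, l_dist, l_cons, params, lam1, lam2, lam3, beta):
    """L = L_CE + lam1*L_Evol + lam2*L_Dist + lam3*L_cons + beta*||Theta||^2."""
    l2 = sum(p * p for p in params)
    return l_ce + lam1 * l_evol + lam2 * l_dist + lam3 * l_cons + beta * l2


l_dist = dist_loss([1.0, 0.0], [0.0, 1.0])        # orthogonal guidance vectors -> 0.0
l_evol = cosine_distance([1.0, 0.0], [1.0, 0.0])  # perfect evolution prediction -> 0.0
loss = total_loss(0.5, l_evol, l_dist, 0.1, [1.0, 2.0],
                  lam1=1.0, lam2=1.0, lam3=1.0, beta=0.01)
```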

Key Experimental Results

Main Results (AUC / RelaImpr across four datasets)

| Method | Amazon AUC | Taobao AUC | Ali Ads AUC | Industrial AUC |
| --- | --- | --- | --- | --- |
| AvgPooling DNN | 0.7689 | 0.8539 | 0.6352 | 0.7512 |
| DIN | 0.8162 | 0.8995 | 0.6422 | 0.7564 |
| DIEN | 0.8377 | 0.9222 | 0.6431 | 0.7611 |
| SIM / ETA / SDIM | 0.842x | 0.927x | 0.659x | 0.7625~0.7628 |
| TWIN / TWIN-V2 | 0.8431 / 0.8433 | 0.9288 / 0.9289 | 0.6601 / 0.6607 | 0.7630 / 0.7634 |
| MTGR | 0.8440 | 0.9296 | 0.6615 | 0.7648 |
| DiffuRec / DreamRec / DiffuMIN | 0.8395~0.8427 | 0.9258~0.9288 | 0.6584~0.6595 | 0.7607~0.7623 |
| iFusion (Ours) | 0.8512 | 0.9347 | 0.6652 | 0.7685 |

iFusion leads across all four datasets. On the industrial set, the RelaImpr compared to AvgPooling reaches +6.89% (vs. +5.41% for the strongest baseline, MTGR). In CTR contexts, an AUC gain of 0.001 is considered practically significant. Notably, existing diffusion methods (DiffuRec/DreamRec/DiffuMIN) underperform compared to MTGR, which the authors attribute to their inability to decouple core preferences from transient fluctuations.
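
The RelaImpr figures quoted above can be reproduced from the industrial AUC column, assuming the definition commonly used in the CTR literature, \(\text{RelaImpr}=\left(\frac{\text{AUC}_{model}-0.5}{\text{AUC}_{base}-0.5}-1\right)\times 100\%\). The summary itself does not spell out the formula, so treat this as a consistency check under that assumption.

```python
def rela_impr(auc_model, auc_base):
    """Relative improvement over a base model, in percent (0.5 = random-guess AUC)."""
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0


AVG_POOLING = 0.7512  # industrial AUC of the AvgPooling DNN baseline

print(round(rela_impr(0.7685, AVG_POOLING), 2))  # iFusion -> 6.89
print(round(rela_impr(0.7648, AVG_POOLING), 2))  # MTGR    -> 5.41
```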

Ablation Study (Industrial Dataset AUC)

| Dimension | Configuration | AUC |
| --- | --- | --- |
| DCFG | w/o guidance / w/ CFG / Ours | 0.7607 / 0.7663 / 0.7685 |
| MARN | NAR-MLP / NAR-Att / AR-Att / Ours | 0.7644 / 0.7650 / 0.7689 / 0.7685 |
| Consistency | w/o cons (12.9 b/s) / Ours (16.2 b/s) | 0.7686 / 0.7685 |

  • DCFG improves AUC by ~0.0022 over standard CFG, validating the necessity of decoupled guidance; naively feeding all guidance leads to performance drops.
  • Gains from MARN stem from hierarchical processing across sessions rather than network capacity—deepening the internal network yields no further gain, indicating the "interest space fusion" paradigm is effective.
  • Consistency loss has negligible impact on AUC (0.7686→0.7685) while raising inference throughput by 25.6% (12.9→16.2 batches/sec).

Key Findings

  • Noise Scheduling + Sampling Steps: Cosine scheduling combined with consistency loss makes single-step inference optimal; increasing steps actually leads to performance drops due to error accumulation (Fig 4a, \(r \approx -0.95 \sim -0.97\)).
  • Session Number Scaling: As the number of sessions increases, the advantage of AR (MARN) over NAR becomes more pronounced, confirming the super-linear conclusion of Theorem 2.
  • Efficiency: Offline inference time increases by only +0.3%, and online TP99 latency increases by only +0.302%.
  • Online A/B (7 days, hundreds of millions of users): CTR +2.44%, eCPM +2.61% (both \(p < 0.001\)).
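
For reference, a cosine noise schedule is often defined through \(\bar\alpha_t\) directly. The sketch below uses the widely cited formulation with offset `s = 0.008` and a clipping value of `0.999`; the summary only says "cosine schedule", so the exact variant and constants used in the paper are assumptions here.

```python
import math


def cosine_alpha_bar(t, T, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    f_t = math.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    f_0 = math.cos((s / (1 + s)) * math.pi / 2) ** 2
    return f_t / f_0


def cosine_betas(T, max_beta=0.999):
    """Per-step variances beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped for stability."""
    betas = []
    for t in range(1, T + 1):
        ratio = cosine_alpha_bar(t, T) / cosine_alpha_bar(t - 1, T)
        betas.append(min(1.0 - ratio, max_beta))
    return betas


T = 10  # small toy horizon
alpha_bars = [cosine_alpha_bar(t, T) for t in range(T + 1)]
betas = cosine_betas(T)
```

Compared with a linear schedule, this form destroys information more gradually at small `t`, which is one common motivation for preferring it when few sampling steps are used.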

Highlights & Insights

  • Generative Reframing of Fusion: Moving beyond the linear mindset of "concat/attention/gate" by viewing long-short term fusion as a conditional generation problem—denoising long-term interest conditioned on short-term factors.
  • Decoupling via Architectural Constraints: DCFG utilizes low-pass/high-pass structures and Hessian orthogonal approximations for functional decoupling, bypassing strict statistical independence assumptions.
  • Provable Superiority of AR vs. NAR: Theorem 2 quantifies the benefits of autoregressive denoising under multi-session dependency via KL bounds, gradient variance, and adaptive weights, which is supported by session scaling experiments.
  • Practical Deployment: The consistency loss allows for one-step inference, keeping offline and online latency overhead at roughly 0.3% (+0.3% offline inference time, +0.302% online TP99), making it highly viable for industrial use.

Limitations & Future Work

  • Strong Theoretical Assumptions: Theorem 1 relies on the "approximate orthogonality" of Hessian principal eigenspaces, and Theorem 2's superiority depends on session dependency structures; empirical measurements of these properties were not provided.
  • Hyperparameter Complexity: Four loss weights \(\lambda_1, \lambda_2, \lambda_3, \beta\) plus two guidance scales \(\gamma_{cp}, \gamma_{tf}\) suggest high tuning costs and potential generalization robustness concerns.
  • Scoped to CTR Ranking: The method focuses on CTR; its transferability to multi-task ranking or sequential recommendation generation remains to be verified.
  • Empirical Evidence for Zero-data Theory: While Theorems 3 and 4 regarding population priors are in the appendix, systematic experiments for cold-start scenarios are lacking in the main text.

Related Work & Insights

  • Discriminative User Behavior Modeling: Evolution from MLP → RNN (GRU4Rec) → Attention (DIN/DIEN) and long-term dependency modeling (SIM/ETA/SDIM/TWIN). iFusion connects this lineage of "separate modeling" to generative fusion.
  • Generative / Diffusion Recommendation: DiffuRec, DreamRec, and DiffuMIN apply diffusion to sequential recommendation, but few target CTR; HSTU/MTGR follow discriminative generation on unified sequences. iFusion addresses the lack of decoupled guidance in these methods.
  • Insight: Reframing feature fusion, a step often linearized by default, as conditional generation, and using classifier-free guidance for signal/noise separation, is a transferable strategy for other multi-source representation tasks. The consistency constraint that enables one-step inference is a key piece for making generative recommendation industrial-ready.

Rating

  • Novelty: ⭐⭐⭐⭐ Reframing fusion as conditional diffusion with DCFG/MARN is conceptually innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage with four datasets, ablation, hyperparameters, efficiency, and large-scale online A/B; main text lacks zero-data/cold-start empirical validation.
  • Writing Quality: ⭐⭐⭐⭐ Clearly organized with 3 motivation points, 2 components, and 4 RQs; theorems are well-mapped to designs.
  • Value: ⭐⭐⭐⭐ High industrial value with proven offline/online gains and minimal latency overhead.