
🎨 Image Generation

🧪 ICML2025 · 92 paper notes

📌 Same area in other venues: 📷 CVPR2026 (492) · 🔬 ICLR2026 (353) · 💬 ACL2026 (5) · 🧪 ICML2026 (141) · 🤖 AAAI2026 (79) · 🧠 NeurIPS2025 (221)

🔥 Top topics: Diffusion Models ×38 · Alignment/RLHF ×4 · Adversarial Robustness ×3 · Model Compression ×3 · Image Restoration ×3

Action-Minimization Meets Generative Modeling: Efficient Transition Path Sampling with the Onsager-Machlup Functional: This paper proposes interpreting the score functions of pretrained generative models (diffusion models and flow matching) as drift terms in stochastic dynamics. By minimizing the Onsager-Machlup (OM) action functional, the pretrained models are repurposed in a zero-shot manner for transition path sampling (TPS) in molecular systems. This achieves physically realistic transition paths at a fraction of the computational cost of traditional methods on systems like alanine dipeptide and fast-folding proteins.
All-atom Diffusion Transformers: Unified Generative Modelling of Molecules and Materials: This work proposes the All-atom Diffusion Transformer (ADiT), a two-stage framework that maps molecules and crystals into a unified latent space via a VAE, and then utilizes a Diffusion Transformer to generate new samples within this latent space. It is the first to achieve simultaneous generation of periodic materials (crystals) and non-periodic molecular systems using a single model. ADiT achieves SOTA performance on MP20, QM9, and GEOM-DRUGS, while being an order of magnitude faster than equivariant diffusion models.
Angle Domain Guidance: Latent Diffusion Requires Rotation Rather Than Extrapolation: This paper identifies the amplification of sample norms in the latent space as the root cause of color distortion under Classifier-Free Guidance (CFG). To address this, it proposes the Angle Domain Guidance (ADG) algorithm, which strengthens guidance in the angle domain rather than the amplitude domain. By constraining norm variation while optimizing angular alignment, ADG eliminates abnormal color saturation under high guidance weights while maintaining or even improving text-image alignment.
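The norm-amplification effect described in this entry can be seen in a toy two-dimensional example. The sketch below is not the ADG algorithm. It only shows that standard CFG extrapolation inflates the vector norm at a high guidance weight, and that a plain rescaling (an illustrative assumption, with hypothetical helper names) keeps the guided direction while restoring the original norm.

```python
import math


def cfg_extrapolate(uncond, cond, weight):
    """Standard classifier-free guidance: uncond + weight * (cond - uncond)."""
    return [u + weight * (c - u) for u, c in zip(uncond, cond)]


def vector_norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def rescale_to_norm(v, target_norm):
    """Keep the direction of v but force its length to target_norm.

    Hypothetical helper for illustration; the paper's update rule may differ.
    """
    n = vector_norm(v)
    return [x * target_norm / n for x in v]


# Two unit-norm predictions that differ only by a rotation.
uncond = [1.0, 0.0]
cond = [0.8, 0.6]

guided = cfg_extrapolate(uncond, cond, weight=7.5)
rotated_only = rescale_to_norm(guided, vector_norm(cond))

print("norm after CFG extrapolation:", vector_norm(guided))
print("norm after rescaling:", vector_norm(rotated_only))
```

The rescaled vector points in the same direction as the extrapolated one, so the guidance signal survives as a change of angle while the magnitude stays fixed.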
Annealing Flow Generative Models Towards Sampling High-Dimensional and Multi-Modal Distributions: Proposes Annealing Flow (AF), a Continuous Normalizing Flow (CNF) based method for sampling high-dimensional multi-modal distributions. It trains with a dynamic Optimal Transport (OT) objective combined with Wasserstein regularization to guide mode exploration through an annealing process, significantly outperforming existing NF and MCMC methods in high-dimensional multi-modal settings.
Autoencoder-Based Hybrid Replay for Class-Incremental Learning: Proposes the Autoencoder-Based Hybrid Replay (AHR) strategy, which uses a Hybrid Autoencoder (HAE) to compress and store samples in the latent space rather than the original input space. By combining Charged Particle System Energy Minimization (CPSEM) and the Repulsion Force Algorithm (RFA) to incrementally embed new class centroids, it reduces the memory complexity from \(\mathcal{O}(t)\) to \(\mathcal{O}(0.1t)\) in the worst-case scenario while maintaining SOTA performance.
Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment: Proposes Preference Embedding—embedding responses into a multi-dimensional latent space to capture complex preference structures (including intransitive preferences), achieving \(O(K)\) query complexity (identical to Bradley-Terry models but with significantly higher expressiveness). Combined with General Preference Optimization (GPO), it outperforms Bradley-Terry reward models on RewardBench and AlpacaEval 2.0.
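To see why a multi-dimensional embedding can express intransitive preferences while a scalar reward cannot, consider the minimal sketch below. The skew-symmetric scoring rule, the 2D embeddings, and the rock-paper-scissors items are illustrative assumptions, not the paper's exact parameterization.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bt_prob(reward_i, reward_j):
    """Bradley-Terry: P(i beats j) depends only on a scalar reward gap."""
    return sigmoid(reward_i - reward_j)


def embedding_prob(v_i, v_j):
    """P(i beats j) from 2D embeddings via a skew-symmetric score.

    score = v_i^T R v_j with R = [[0, -1], [1, 0]], so score(i, j) = -score(j, i).
    This scoring rule is an assumption chosen for illustration.
    """
    score = v_i[1] * v_j[0] - v_i[0] * v_j[1]
    return sigmoid(score)


def unit(angle_deg):
    """Unit vector at the given angle in degrees."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))


# Three items placed 120 degrees apart form a preference cycle.
ROCK, PAPER, SCISSORS = unit(0), unit(120), unit(240)

print("P(paper beats rock):", embedding_prob(PAPER, ROCK))
print("P(scissors beats paper):", embedding_prob(SCISSORS, PAPER))
print("P(rock beats scissors):", embedding_prob(ROCK, SCISSORS))
```

Under Bradley-Terry, any three scalar rewards can be sorted, so a cycle in which each item beats the next with probability above 0.5 is impossible. The embedding version represents such a cycle with one vector per response.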
Beyond One-Hot Labels: Semantic Mixing for Model Calibration: Proposes CSM (Calibration-aware Semantic Mixing), which leverages pre-trained diffusion models to generate high-fidelity semantically mixed samples (e.g., cat-dog hybrids) and accurately re-annotates soft label confidence using CLIP. Training with an \(L_2\) loss achieves superior model confidence calibration compared to existing calibration methods.
BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling: This paper proposes the Bridge framework, which generates high-quality text-to-time-series paired data using an LLM multi-agent system and utilizes a hybrid prompt of semantic prototypes and textual descriptions to drive a diffusion model. It achieves cross-domain, instance-level text-controlled time-series generation (TC-TSG), ranking SOTA in 11 out of 12 datasets.
Broadband Ground Motion Synthesis by Diffusion Model with Minimal Condition: Proposes HEGGS (High-fidelity Earthquake Groundmotion Generation System), which leverages the naturally paired characteristics of waveforms in seismic datasets, combined with a conditional latent diffusion model and an Amplitude Correction Module (ACM), to generate high-fidelity three-component seismic waveforms end-to-end with minimal conditional information (latitude, longitude, focal depth, magnitude).
Compositional Scene Understanding through Inverse Generative Modeling: This paper proposes the Inverse Generative Modeling (IGM) framework, which reformulates scene understanding tasks as an inversion problem of searching for optimal conditional parameters within compositional generative models. By composing multiple small diffusion models to represent complex scenes, the method achieves robust out-of-distribution generalization capabilities and directly leverages pre-trained text-to-image models for zero-shot multi-object perception.
ContinualFlow: Learning and Unlearning with Neural Flow Matching: Proposes ContinualFlow, a targeted unlearning framework for generative models based on Flow Matching. By reweighting via an energy function to softly subtract unwanted regions of the data distribution, it achieves efficient unlearning without requiring retraining or direct access to the samples to be forgotten.
Continuous Semi-Implicit Models: Proposes CoSIM, which extends hierarchical semi-implicit models to a continuous-time framework. It achieves simulation-free, highly efficient training via continuous transition kernels, and designs consistency-preserving transition kernels to enable distribution-level multi-step diffusion model distillation, matching or exceeding existing diffusion acceleration methods on ImageNet 512×512.
Continuous Visual Autoregressive Generation via Score Maximization: Proposes a continuous visual autoregressive framework grounded in the theory of strictly proper scoring rules. It uses the energy score as a likelihood-free training objective, so that autoregressive image generation operates on continuous tokens without vector quantization. EAR-H achieves an FID of 1.97 and is approximately 10 times faster at inference than the diffusion-loss method MAR.
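The energy score mentioned in this entry is a standard strictly proper scoring rule that needs only samples from the model, not likelihoods. A minimal Monte Carlo sketch follows, assuming the common "higher is better" sign convention and an exponent `alpha` in (0, 2); the function names and the toy Gaussian data are illustrative and do not come from the paper.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def energy_score(samples, target, alpha=1.0):
    """Monte Carlo energy score of model samples against one observed target.

    score = 0.5 * E||X - X'||^alpha - E||X - y||^alpha  (higher is better).
    The first term rewards spread among samples; the second rewards closeness
    to the target.
    """
    n = len(samples)
    fit = sum(euclidean(x, target) ** alpha for x in samples) / n
    spread = sum(
        euclidean(samples[i], samples[j]) ** alpha
        for i in range(n)
        for j in range(n)
        if i != j
    ) / (n * (n - 1))
    return 0.5 * spread - fit


def gaussian_samples(mean, count, rng):
    """Draw `count` points from an isotropic unit-variance Gaussian."""
    return [[rng.gauss(m, 1.0) for m in mean] for _ in range(count)]


rng = random.Random(0)
target = [0.0, 0.0]
near = gaussian_samples([0.0, 0.0], 200, rng)  # model centered on the target
far = gaussian_samples([5.0, 5.0], 200, rng)   # model centered far away

print("score of well-placed model:", energy_score(near, target))
print("score of misplaced model:", energy_score(far, target))
```

Because the score is computed purely from drawn samples, it can serve as a training objective for a generator that outputs continuous tokens, which is the role the entry describes.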
DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space: Proposes DCTdiff, which performs end-to-end diffusion image generation directly in the Discrete Cosine Transform (DCT) frequency domain for the first time, seamlessly scaling to \(512 \times 512\) resolution without a VAE and outperforming pixel-space diffusion models in both generation quality and training efficiency.
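For readers unfamiliar with the transform itself, the sketch below implements a one-dimensional orthonormal DCT-II and its inverse in plain Python. It illustrates the frequency representation only, not the DCTdiff pipeline; images would use the separable two-dimensional version (rows, then columns), and the function names are illustrative.

```python
import math


def _scale(k, n):
    """Orthonormal scaling for DCT-II coefficient k of an n-point signal."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_1d(signal):
    """Orthonormal DCT-II of a list of floats."""
    n = len(signal)
    return [
        _scale(k, n)
        * sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def idct_1d(coeffs):
    """Inverse of dct_1d (the orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)
        )
        for i in range(n)
    ]


row = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]
coeffs = dct_1d(row)
restored = idct_1d(coeffs)

print("coefficients:", [round(c, 3) for c in coeffs])
print("max round-trip error:", max(abs(a - b) for a, b in zip(row, restored)))
```

A smooth signal concentrates most of its energy in the first few coefficients, which is the property that makes the DCT domain attractive as a compact generation space.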
Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is also a GAN Discriminator: DDO proposes parameterizing the likelihood model itself as a GAN discriminator (via the likelihood ratio). This enables fine-tuning pre-trained diffusion/autoregressive models using GAN targets without an additional discriminator network, significantly improving the FID records on CIFAR-10 and ImageNet (EDM: 1.97 \(\to\) 1.38, EDM2-S: 1.58 \(\to\) 0.97).
Directed Graph Grammars for Sequence-based Learning: This paper proposes DIGGED, which losslessly maps DAGs to unique sequences of production rules via unambiguous context-free graph grammars. Combined with a Transformer decoder, it achieves graph generation, property prediction, and Bayesian optimization, thoroughly outperforming existing methods on neural architecture search, Bayesian networks, and circuit design tasks.
Discriminative Policy Optimization for Token-Level Reward Models: Proposes the Q-function Reward Model (Q-RM), which decouples reward modeling from language generation by defining a discriminative policy to learn token-level \(Q\)-functions. This approach extracts precise token-level reward signals from preference data without needing fine-grained annotations, significantly improving the reasoning performance and training efficiency of PPO/REINFORCE.
Distillation of Discrete Diffusion through Dimensional Correlations (Di4C): This paper proposes the Di4C method, which captures correlations between dimensions through a "mixture" model. Combined with a consistency loss function, it distills multi-step discrete diffusion models into few-step models, demonstrating effectiveness across both image and language tasks.
DRAG: Data Reconstruction Attack using Guided Diffusion: This paper proposes DRAG, which leverages the image prior knowledge of pre-trained Latent Diffusion Models (LDMs) to reconstruct the original input images with high fidelity from deep Intermediate Representations (IRs) in Split Inference (SI) via a guided diffusion process, revealing severe privacy vulnerabilities of vision foundation models (such as CLIP and DINOv2) under SI scenarios.
Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation: Proposes Editable Noise Map Inversion (ENM Inversion), which simultaneously optimizes reconstruction error and edit alignment error during the inversion process. This "engraves" both source and target image information into the noise map, achieving an optimal balance between content preservation and editing fidelity.
Efficient Diffusion Models for Symmetric Manifolds: This paper proposes an efficient diffusion model framework on symmetric manifolds (tori, spheres, SO(n), U(n)), which bypasses heat kernel computation through the projection of Euclidean Brownian motion and Itô's lemma, reducing training complexity from exponential to nearly linear, and providing polynomial-class sampling accuracy guarantees.
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens: ResGen decouples the number of generation iterations from both sequence length and quantization depth by directly predicting cumulative RVQ embeddings instead of individual tokens, achieving an efficient generative model with high fidelity and rapid sampling.
Elucidating Flow Matching ODE Dynamics via Data Geometry and Denoisers: This paper analyzes the sampling trajectory dynamics of the Flow Matching (FM) ODE from the perspective of denoisers, revealing three stages of trajectory evolution (initial → intermediate → terminal) and establishing a convergence theory for the FM ODE when data is supported on a low-dimensional submanifold.
Exploring Position Encoding in Diffusion U-Net for Training-free High-resolution Image Generation: Through an in-depth analysis of the insufficient propagation of positional information caused by zero-padding in the convolutional layers of the diffusion U-Net at high resolutions, this paper proposes the Progressive Boundary Complement (PBC) method. PBC constructs progressive virtual boundaries inside the feature maps to enhance positional information propagation, achieving high-quality training-free high-resolution image generation.
Expressive Score-Based Priors for Distribution Matching with Geometry-Preserving Regularization: This paper proposes an expressive, score-based prior distribution (SAUB) that sidesteps prior density estimation via the Score Function Substitution (SFS) technique, combined with Gromov-Wasserstein geometry-preserving constraints to achieve stable and efficient distribution matching, yielding superior performance in fair classification, domain adaptation, and domain translation tasks.
Flat-LoRA: Low-Rank Adaptation over a Flat Loss Landscape: This paper proposes Flat-LoRA, which introduces random weight perturbation based on the Bayesian expected loss in the full parameter space, forcing LoRA to converge to flatter minima within the full parameter space. This improves both in-domain and out-of-distribution generalization with almost no increase in training time and GPU memory overhead.
FlexiClip: Locality-Preserving Free-Form Character Animation: FlexiClip proposes a clipart animation framework based on temporal Jacobian correction, probability flow ODE continuous-time modeling, and GFlowNet flow matching loss. It significantly improves temporal smoothness and geometric integrity of animations while maintaining visual consistency.
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length: Proposes FlexTok, a tokenizer that resamples 2D images into variable-length, ordered 1D discrete token sequences. It learns hierarchical encoding via nested dropout and uses a rectified flow decoder to generate high-quality reconstructions at any token count, achieving autoregressive image generation with FID < 2 using only 8 to 128 tokens on ImageNet.
Gaussian Mixture Flow Matching Models: This paper proposes Gaussian Mixture Flow Matching Models (GMFlow), which replace the traditional single-Gaussian denoising distribution with a dynamic Gaussian mixture distribution to model multimodal flow velocity fields. Trained via KL divergence loss, the derived GM-SDE/ODE solvers enable accurate few-step sampling. Additionally, a probabilistic guidance scheme is introduced to solve the CFG oversaturation issue, achieving a Precision of 0.942 on ImageNet 256×256 with only 6 sampling steps.
GaussMarker: Robust Dual-Domain Watermark for Diffusion Models: Proposes GaussMarker, the first dual-domain (spatial + frequency) watermarking method for diffusion models. It consistently embeds watermarks in both the spatial and frequency domains of the initial Gaussian noise through a pipelined injector. Coupled with a model-independent learnable Gaussian Noise Restorer (GNR) to enhance robustness against rotation/cropping attacks, it achieves SOTA performance with an average TPR@1%FPR of 0.997 under eight image distortions across three Stable Diffusion versions.
Generative Audio Language Modeling with Continuous-Valued Tokens and Masked Next-Token Prediction: This paper studies causal language models for audio generation without using discrete tokens, leveraging token-wise diffusion to model the distribution of continuous-valued next-tokens, and proposes a masked next-token prediction task. With 193M parameters, it achieves performance comparable to SOTA diffusion models on AudioCaps.
GRAM: A Generative Foundation Reward Model for Reward Generalization: GRAM proposes training reward models using a generative (rather than discriminative) approach. It pre-trains a generative reward model through large-scale unsupervised learning, fine-tunes it with supervised data, and proves that label smoothing is mathematically equivalent to a regularized pairwise ranking loss, thereby achieving reward generalization across tasks.
Hessian Geometry of Latent Space in Generative Models: A method is proposed to analyze the latent space geometry of generative models by reconstructing the Fisher information metric, revealing that fractal-structured phase transition boundaries exist within the latent space of diffusion models, where the Lipschitz constant diverges.
Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots: This work proposes Hi-MAR, which introduces low-resolution tokens as intermediate pivots in masked autoregressive image generation to establish a coarse-to-fine hierarchical generation process. It also enhances inter-token dependency modeling with a Diffusion Transformer Head, significantly outperforming MAR on ImageNet with less computational cost (FID improved by 0.38).
Hierarchical Reinforcement Learning with Uncertainty-Guided Diffusional Subgoals: Proposes a hierarchical reinforcement learning framework combining conditional diffusion models with Gaussian process priors. Through an uncertainty-aware subgoal generation mechanism, it addresses the core challenge of high-level policies struggling to generate effective subgoals amid dynamic changes in low-level policies.
Importance Sampling for Nonlinear Models: By introducing the adjoint operator of nonlinear mappings, this work systematically extends classical norm sampling and leverage score sampling from linear models to nonlinear models, providing the first theoretical approximation guarantees for importance sampling in nonlinear models such as neural networks.
Improving the Diffusability of Autoencoders: Through 2D DCT spectral analysis, this study reveals excessively strong high-frequency components in the latent space of autoencoders that do not match the RGB space. A Scale Equivariance regularization is proposed to align the frequency distributions of both. Finetuning for only 10-20K steps reduces ImageNet FID by 19% and Kinetics FVD by over 44%.
InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference: Proposes InfoSEM, an unsupervised generative framework that leverages textual gene embeddings as informative priors for gene regulatory network (GRN) inference. Without ground-truth labels, it outperforms supervised methods by 38.5%, and improves by an additional 11.1% when using labels as an auxiliary prior, while revealing that existing supervised methods learn gene-specific biases rather than genuine regulatory mechanisms.
Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models: The authors propose two methods, DMILO and DMILO-PGD, which leverage intermediate layer optimization (ILO) to partition the diffusion model sampling process, thereby significantly reducing GPU memory consumption. By integrating projected gradient descent (PGD) to prevent sub-optimal convergence, these methods comprehensively outperform state-of-the-art (SOTA) methods such as DMPlug on both linear and non-linear inverse problems.
IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models: IntLoRA is proposed to fine-tune quantized diffusion models using integer low-rank parameters. After merging weights, quantized inference weights are directly obtained without additional PTQ, balancing both training and inference efficiency.
Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features: The GeoDiffNet-F framework is proposed, which leverages frozen pre-trained diffusion models to extract low-level spatial features and adaptively injects hyperspectral spectral signatures into these spatial features through the FiLM (Feature-wise Linear Modulation) mechanism, realizing highly efficient hyperspectral image land-cover classification with very limited annotations.
Learning Single Index Models with Diffusion Priors: An efficient method utilizing diffusion model priors to recover signals from nonlinear observations of Semi-parametric Single Index Models (SIM) is proposed. It requires only one round of unconditional sampling and partial inversion without knowing the link function, significantly outperforming existing methods on 1-bit and cubic measurements with minimal NFE.
LIVS: A Pluralistic Alignment Dataset for Inclusive Public Spaces: Through a two-year community-based participatory research effort, this work constructs the LIVS dataset containing 37,710 pairs of multi-criteria preference annotations for the pluralistic alignment of text-to-image models in inclusive urban public space design, and validates its effectiveness by fine-tuning SDXL with DPO.
Local Manifold Approximation and Projection for Manifold-Aware Diffusion Planning: Proposes LoMAP, a training-free correction method for diffusion planning. It projects guided samples onto a local low-rank subspace constructed from nearest neighbors in offline data at each reverse diffusion step to prevent the generation of infeasible trajectories, theoretically proving that the guidance error grows with dimensionality as \(O(\sqrt{d})\).
Localizing and Mitigating Memorization in Image Autoregressive Models: This work utilizes an improved UnitMem metric to localize memorized neurons in image autoregressive models (VAR/RAR). It reveals that memorization distribution patterns differ significantly across architectural designs, and presents a privacy mitigation solution. By scaling down the weights of highly memorized neurons, the method achieves a substantial reduction in the volume of extractable training data (from 672 to 110 images in VAR-d30) with a controllable impact on generation quality.
LSCD: Lomb-Scargle Conditioned Diffusion for Time Series Imputation: This paper proposes LSCD, which integrates a differentiable Lomb-Scargle periodogram layer into a score-based diffusion model for time series imputation. Through frequency-domain conditioning information and a spectral consistency loss, the approach simultaneously improves time-domain imputation accuracy and frequency-domain recovery consistency under high missing rates.
Model Immunization from a Condition Number Perspective: This paper defines and analyzes the model immunization problem from the perspective of the Hessian matrix condition number. It proposes regularizers that maximize the condition number for harmful tasks and minimize it for benign tasks, rendering pre-trained models difficult to fine-tune for harmful tasks without affecting their performance on benign tasks.
Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization: MoDiff proposes a framework combining modulated quantization and error compensation to accelerate diffusion models. It reduces activation quantization from 8-bit to 3-bit without performance loss, while inheriting the dual advantages of both caching and quantization methods.
Morse: Dual-Sampling for Lossless Acceleration of Diffusion Models: The Morse dual-sampling framework is proposed, which learns residual feedback via a fast Dot model to compensate for the information loss in jump sampling of Dash (the original diffusion model), achieving 1.78×–3.31× lossless acceleration.
Multidimensional Adaptive Coefficient for Inference Trajectory Optimization in Flow and Diffusion: This paper proposes the Multidimensional Adaptive Coefficient (MAC), a plug-and-play module for flow/diffusion models. MAC extends traditional one-dimensional time scheduling coefficients to multidimensional, sample-adaptive coefficients. By optimizing the inference trajectory through adversarial training, MAC achieves a SOTA FID of 1.37 with 5 NFEs on conditional CIFAR-10 generation.
Nonparametric Identification of Latent Concepts: This work proposes the first theoretical framework for nonparametric concept identifiability. It proves that hidden concepts can be identified (up to component-wise transformation and permutation uncertainty) purely through the diversity of multi-class observations, without assuming concept types, functional relationships, or parametric generative models.
Normalizing Flows are Capable Generative Models: Proposes TarFlow (Transformer AutoRegressive Flow), which implements block autoregressive Normalizing Flows by stacking causal ViTs, breaking the 3 BPD barrier on ImageNet 64×64 for the first time. Through three key techniques—Gaussian noise augmentation, score-based denoising, and guidance—it enables the generation quality of NF models to rival diffusion models for the first time.
One Image is Worth a Thousand Words: A Usability Preservable Text-Image Collaborative Erasing Framework: This paper proposes Co-Erasing, which introduces image supervision into the concept erasing pipeline for the first time. By leveraging text-image collaborative negative guidance and a text-guided image concept refinement module, Co-Erasing significantly improves the erasing efficacy of undesirable concepts while preserving the generation quality (usability) of benign concepts.
Origin Identification for Text-Guided Image-to-Image Diffusion Models: This paper proposes the ID2 task (Origin Identification of text-guided image-to-image Diffusion models), constructs the first dataset OriPID, and demonstrates that applying a linear transformation to VAE embeddings can generalize to identify the original source of generated images, outperforming similarity-based methods by 31.6% in mAP.
Out-of-Distribution Detection Methods Answer the Wrong Questions: This paper systematically demonstrates that current mainstream OOD detection methods (feature-based and logit-based) fundamentally answer the wrong questions—they detect "whether features are anomalous" or "whether the model is uncertain" rather than "whether the input comes from a different distribution." It also proves that various common improvement strategies cannot resolve this fundamental misalignment.
PAK-UCB Contextual Bandit: An Online Learning Approach to Prompt-Aware Selection of Generative Models and LLMs: This work proposes the PAK-UCB contextual bandit algorithm, which learns independent kernel functions for each generative model to predict the optimal model for a given prompt online, achieving prompt-level generative model/LLM selection and utilizing Random Fourier Features (RFF) to reduce computational overhead.
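Random Fourier Features, which this entry cites as the tool for cutting kernel cost, replace a kernel evaluation with an inner product of explicit random features. The sketch below approximates an RBF kernel in plain Python; the kernel choice, `gamma`, and the feature count are illustrative assumptions rather than settings from the paper.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def make_rff(dim, num_features, gamma, seed=0):
    """Build a feature map z(.) such that z(x) . z(y) approximates rbf_kernel(x, y).

    Frequencies are drawn from N(0, 2 * gamma * I), the spectral density of the
    RBF kernel above; phases are uniform on [0, 2 * pi).
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [
            scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
            for w, b in zip(weights, phases)
        ]

    return features


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


x = [0.1, 0.2, 0.3]
y = [0.4, 0.0, -0.1]
features = make_rff(dim=3, num_features=4000, gamma=0.5)

print("exact kernel:", rbf_kernel(x, y, 0.5))
print("RFF estimate:", dot(features(x), features(y)))
```

With explicit features, a kernelized bandit can maintain a fixed-size linear model per generative model instead of a kernel matrix that grows with the number of observed prompts.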
Performance Plateaus in Inference-Time Scaling for Text-to-Image Diffusion Without External Models: This paper systematically investigates the effect of applying Best-of-N inference-time scaling to initial noise optimization algorithms for text-to-image diffusion models without relying on external models (VLMs/CLIP). The study reveals that performance rapidly hits a plateau, where a small number of optimization steps can closely approach the maximum achievable performance under this setting. Furthermore, the optimal algorithm varies across different backbone diffusion models.
Position: All Current Generative Fidelity and Diversity Metrics are Flawed: This position paper systematically demonstrates that all existing generative model fidelity and diversity metrics (including six pairs of metrics such as Improved Precision/Recall, Density/Coverage, and α-precision/β-recall) suffer from extensive failures in carefully designed sanity checks, and urges the community to invest more effort in developing more reliable evaluation metrics.
PPO-MI: Efficient Black-Box Model Inversion via Proximal Policy Optimization: Formulating black-box model inversion attack as an MDP, PPO-MI uses PPO reinforcement learning to navigate and search the latent space of a generative model. Relying solely on the target model's prediction probabilities, it reconstructs training samples efficiently, achieving state-of-the-art attack success rates with fewer queries and less class data.
Preference Adaptive and Sequential Text-to-Image Generation: PASTA models personalized T2I generation as a multi-turn sequential decision-making problem. By generating candidate prompts via VLM, training a user preference model via EM, and learning a value function using offline RL (IQL), it significantly outperforms baseline LMMs in human evaluations.
Privacy Amplification Through Synthetic Data: Insights from Linear Regression: Under the linear regression framework, it is proved that synthetic data cannot provide privacy amplification when the adversary controls the seed. However, releasing a limited amount of synthetic data under random inputs can achieve a privacy amplification effect that exceeds the DP guarantee of the model itself, with an amplification rate of \(O(1/d)\).
Progressive Tempering Sampler with Diffusion: This paper proposes the Progressive Tempering Sampler with Diffusion (PTSD). By combining the temperature swapping mechanism of Parallel Tempering (PT) with a diffusion-based neural sampler, PTSD utilizes "temperature guidance" to extrapolate and generate low-temperature approximate samples from high-temperature diffusion models, achieving orders-of-magnitude faster target density evaluation.
Quantum Algorithms for Finite-horizon Markov Decision Processes: Four quantum value iteration algorithms (QVI-1/2/3/4) are proposed to achieve multi-dimensional quantum speedups in terms of the state space \(S\), action space \(A\), error \(\epsilon\), and horizon \(H\) under both the exact dynamics and generative model settings for finite-horizon time-varying MDPs, alongside proofs of asymptotically optimal quantum lower bounds.
ReFrame: Layer Caching for Accelerated Inference in Real-Time Rendering: Extends the intermediate layer caching technique (DeepCache) from diffusion models to U-Net/U-Net++ networks in real-time rendering pipelines, achieving an average of 1.4× inference speedup with negligible image quality degradation through a frame-difference adaptive caching strategy.
Reimagining Parameter Space Exploration with Diffusion Models: This work explores using diffusion models to learn the distribution of task-specific parameters (LoRA adapters) and directly generate new parameters. In wildlife classification scenarios, it validates that the generated parameters can match fine-tuning performance on known tasks, though cross-task generalization remains a challenge.
Representative Language Generation: Proposes a theoretical framework of "representative generation", which requires the outputs of generative models to proportionally represent various groups of interest in the training data, and introduces "group closure dimension" as a key combinatorial quantity to characterize generatability.
RestoreGrad: Signal Restoration Using Conditional Denoising Diffusion Models with Jointly Learned Prior: The RestoreGrad framework is proposed to jointly learn the prior distribution of conditional DDPMs (as opposed to a fixed standard Gaussian) using a Prior Net and a Posterior Net. By leveraging the correlation between degraded and clean signals to construct a more informative prior, it achieves 5-10× faster convergence and 2-2.5× fewer inference steps in speech enhancement and image restoration tasks.
Review, Remask, Refine (R3): Process-Guided Block Diffusion for Text Generation: Proposes the R3 (Review, Remask, Refine) framework, which uses a Process Reward Model (PRM) at inference time to evaluate intermediate generated blocks of a masked-diffusion model, then proportionally remasks and regenerates low-quality blocks. This achieves training-free targeted error correction and yields significant improvements on mathematical reasoning tasks with extremely low PRM call budgets.
Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation: This paper proposes a novel perspective that views diffusion training as "generative pre-training", revealing the fundamental limitation in distillation where the teacher and student models converge to different local optima. The authors demonstrate that pre-trained diffusion models can be efficiently converted into one-step generators (D2O) using only GAN objectives (without distillation loss). Furthermore, a fine-tuned variant with 85% of its parameters frozen (D2O-F) achieves highly competitive results using only 0.2M images.
SADA: Stability-guided Adaptive Diffusion Acceleration: Proposes a Stability Criterion based on the second-order difference of ODE trajectories to uniformly control step-wise and token-wise sparsity decisions, achieving \(\ge 1.8\times\) acceleration with \(\text{LPIPS} \le 0.10\) and \(\text{FID} \le 4.5\) on SD-2/SDXL/Flux, which significantly outperforms DeepCache and AdaptiveDiffusion.
Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction: Proposes the supremal visitation ratio \(C_{vr}\) to measure the exploration difficulty of online robust MDPs, designs ORBIT, the first efficient online algorithm supporting general \(f\)-divergences (TV/KL/\(\chi^2\)), and provides matching upper and lower bounds, proving that \(C_{vr}\) is a tight measure characterizing the online learnability of off-dynamics RL.
Shielded Diffusion: Generating Novel and Diverse Images using Sparse Repellency: This paper proposes SPELL (Sparse Repellency), a training-free method that injects a sparse repellency term during the generation process of diffusion models. This term pushes the sampling trajectories away from a reference set of images (either protected or already generated), thereby enhancing output diversity and preventing the duplication of the training set.
Simple and Critical Iterative Denoising: A Recasting of Discrete Diffusion in Graph Generation: This paper proposes the Simple Iterative Denoising (SID) and Critical Iterative Denoising (CID) frameworks, which eliminate the compounding denoising error in discrete diffusion by assuming conditional independence of intermediate noise states. It introduces a Critic network to adaptively adjust element-wise re-noising probabilities, significantly outperforming standard discrete diffusion baselines on graph and molecule generation tasks.
Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences: SmPO-Diffusion is proposed, which replaces binary preference labels with smoothed preference modeling and replaces forward noising estimation with ReNoise Inversion. It achieves SOTA performance in T2I diffusion model preference alignment while significantly reducing training costs (6.5 times faster than DPO and 26 times faster than KTO).
Stealix: Model Stealing via Prompt Evolution: Stealix proposes the first model stealing approach that does not require human-designed prompts. It iteratively evolves prompts using a genetic algorithm, synthesizes target-class images with Stable Diffusion to query the victim model, and requires only 1 real image per class. Under tight query budgets, it outperforms existing methods that rely on class names or handcrafted prompts, improving accuracy by up to 22.2%.
Synthetic Face Datasets Generation via Latent Space Exploration from Brownian Identity Diffusion: Inspired by the Brownian motion of soft particles in physics, this paper proposes an identity sampling method driven by stochastic forces in the latent space (Langevin, Dispersion, and DisCo algorithms) to generate large-scale, diverse synthetic face datasets for training face recognition models while preventing training data leakage.
Synthetic Perception: Can Generated Images Unlock Latent Visual Prior for Text-Centric Reasoning?: This work systematically investigates "synthetic perception": using T2I models to generate synthetic images on the fly for text-only data, as a complementary modality. Through a three-stage evaluation framework (generation \(\rightarrow\) fusion \(\rightarrow\) evaluation), it demonstrates that this strategy brings significant improvements (+3.9% Acc) to strong LLMs like Llama-3/Qwen-2.5 on challenging tasks such as sarcasm detection and implicit sentiment analysis, while yielding only marginal gains on simple factual classification tasks.
Taming Diffusion for Dataset Distillation with High Representativeness (D³HR): This work proposes the D³HR framework, which uses DDIM inversion to map the complex Gaussian mixture distribution in the VAE latent space to a noise space with high normality, and then generates a highly representative distilled dataset using a group sampling strategy. It outperforms prior state-of-the-art methods across CIFAR, Tiny-ImageNet, and ImageNet-1K.
Taming Rectified Flow for Inversion and Editing: This work proposes RF-Solver and RF-Edit, two training-free methods that significantly improve inversion accuracy by accurately solving the Rectified Flow ODE via high-order Taylor expansion, and achieve high-quality image/video editing using self-attention feature sharing. They are compatible with mainstream models such as FLUX and OpenSora.
Task-Agnostic Pre-training and Task-Guided Fine-tuning for Versatile Diffusion Planner: Proposes the SODP framework, which first pre-trains a diffusion planner on a large dataset of sub-optimal multi-task trajectories without reward labels, then quickly adapts it to downstream tasks using policy-gradient-based RL fine-tuning, with behavior cloning (BC) regularization to prevent performance collapse. It achieves a 60.56% success rate (SOTA) on Meta-World 50 tasks.
Theoretical Guarantees on the Best-of-n Alignment Policy: This paper refutes the claim, widespread in the literature, that the KL divergence of the best-of-n policy is exactly \(\log(n) - (n-1)/n\). It proves that this formula is only an upper bound, and proposes tighter KL divergence estimators along with theoretical bounds for the win rate.
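The gap between the formula and the true KL divergence can be seen on a toy case. The sketch below is illustrative only and is not the paper's estimator: it assumes a hypothetical base policy over a handful of outcomes with distinct rewards, computes the best-of-n distribution exactly from the CDF, and compares its KL divergence from the base policy against \(\log(n) - (n-1)/n\). The function names are made up for this example.

```python
import math

def best_of_n_probs(base_probs, n):
    """Exact best-of-n distribution; outcomes are listed in ascending reward order.

    Assumes all rewards are distinct, so outcome i wins exactly when it is the
    highest-ranked outcome among n i.i.d. draws from the base policy.
    """
    probs, cdf_prev = [], 0.0
    for p in base_probs:
        cdf = cdf_prev + p
        probs.append(cdf ** n - cdf_prev ** n)
        cdf_prev = cdf
    return probs

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def analytic_formula(n):
    """The commonly quoted expression log(n) - (n - 1) / n."""
    return math.log(n) - (n - 1) / n

# Hypothetical base policy: two equally likely outcomes with different rewards.
base = [0.5, 0.5]
for n in (2, 4, 8):
    exact = kl_divergence(best_of_n_probs(base, n), base)
    print(f"n={n}: exact KL={exact:.4f}, formula={analytic_formula(n):.4f}")
```

On this two-outcome policy the exact KL stays below the formula for every n > 1, which is the behavior the upper-bound result describes.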
ToMA: Token Merge with Attention for Diffusion Models: This paper proposes ToMA, which reformulates token merging as a submodular optimization problem and implements merge/unmerge via attention-like linear transformations. This makes it compatible with GPU-optimized schemes like FlashAttention, achieving actual end-to-end speedups of 24% and 23% on SDXL and Flux, respectively, with negligible image quality degradation (DINO \(\Delta < 0.07\)).
Towards a Mechanistic Explanation of Diffusion Model Generalization: By comparing the approximation error between neural network denoisers and the theoretically optimal empirical denoisers, this work discovers that the generalization of diffusion models stems from a local inductive bias shared across different architectures—neural networks tend to execute localized operations during denoising. Correspondingly, a training-free Patch Set Posterior Composites (PSPC) denoiser is proposed to replicate network behavior by aggregating local empirical denoisers, confirming that patch denoising and composition constitute a key mechanism for diffusion model generalization.
Tree-Sliced Wasserstein Distance: A Geometric Perspective: Proposes Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL), which replaces the one-dimensional lines in SW with tree-shaped line systems as projection domains. This preserves topological structure while maintaining the efficient computation of closed-form solutions, outperforming SW and its variants in gradient flows, style transfer, and generative models.
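For context, the closed-form computation that TSW-SL aims to preserve comes from the plain sliced Wasserstein (SW) baseline: on a one-dimensional line, the Wasserstein distance between two equal-size samples reduces to comparing sorted values. The sketch below shows only that standard SW baseline with random line projections, not the tree-structured line systems of TSW-SL; the function names and defaults are assumptions made for this example.

```python
import math
import random

def wasserstein_1d(xs, ys, p=2):
    """p-th power of the 1D Wasserstein-p distance between equal-size samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def random_direction(dim, rng):
    """Uniform random unit vector, obtained by normalizing a Gaussian draw."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def sliced_wasserstein(X, Y, n_projections=64, p=2, seed=0):
    """Monte Carlo sliced Wasserstein-p distance between two point clouds."""
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(dim, rng)
        # Project every point onto the line spanned by theta.
        proj_x = [sum(c * t for c, t in zip(x, theta)) for x in X]
        proj_y = [sum(c * t for c, t in zip(y, theta)) for y in Y]
        total += wasserstein_1d(proj_x, proj_y, p)
    return (total / n_projections) ** (1.0 / p)

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
Y = [[x0 + 3.0, x1 + 3.0] for x0, x1 in X]  # same cloud, shifted
print(sliced_wasserstein(X, X), sliced_wasserstein(X, Y))
```

Each projection costs one sort per sample, which is what keeps SW cheap compared with solving a full optimal transport problem.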
Tree-Sliced Wasserstein Distance with Nonlinear Projection: This paper proposes the Tree-Sliced Wasserstein (TSW) distance under a nonlinear projection framework. By replacing linear projections with Circular and Spatial nonlinear Radon transforms, the proposed method preserves the well-definedness and injectivity of the metric while significantly outperforming existing SW and TSW variants on tasks such as gradient flows, self-supervised learning, and generative models.
Understanding and Mitigating Memorization in Generative Models via Sharpness of Probability Landscapes: This paper establishes a geometric framework for analyzing memorization in diffusion models through the sharpness (Hessian curvature) of the log-probability density. It introduces a new metric that detects memorization at the early stages of generation, and designs SAIL, a retraining-free initial noise optimization strategy, to mitigate it.
Unsupervised Learning for Class Distribution Mismatch (UCDM): UCDM is proposed to train classifiers by synthesizing positive and negative sample pairs from unlabeled data using diffusion models. It addresses the class distribution mismatch (CDM) between training sets and target tasks without relying on labeled data, significantly outperforming existing semi-supervised methods on both closed-set and open-set tasks.
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations: This paper utilizes the "predictive visual representation" within Video Diffusion Models (VDMs), which simultaneously encodes current and future frame information, to implicitly learn an inverse dynamics model. This allows for action generation in a high-frequency, closed-loop manner, substantially outperforming existing methods on both simulation and real-world manipulation tasks.
Visual Generation Without Guidance: This work proposes Guidance-Free Training (GFT), which parameterizes the conditional model as a linear interpolation between a sampling network and an unconditional network. This enables direct training of guidance-free visual generative models from data, halving the sampling computation while matching the performance of CFG on five models (DiT, VAR, LlamaGen, MAR, and LDM).
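The linear interpolation can be motivated by rearranging the standard CFG update. As a sketch in assumed notation that is not taken from the paper, let \(\epsilon_c\) and \(\epsilon_u\) be the conditional and unconditional predictions, \(\epsilon_s\) the guided prediction actually used for sampling, \(s\) the guidance scale, and \(\beta = 1/s\):

\[
\epsilon_s = \epsilon_u + s\,(\epsilon_c - \epsilon_u)
\quad\Longleftrightarrow\quad
\epsilon_c = \beta\,\epsilon_s + (1-\beta)\,\epsilon_u,
\qquad \beta = \tfrac{1}{s}.
\]

Read this way, fitting the interpolated \(\epsilon_c\) to data leaves \(\epsilon_s\) as a network that can be sampled with one forward pass per step, which is consistent with the halved sampling cost reported above. How GFT sets or conditions on the interpolation weight is not covered by this sketch.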
When Diffusion Models Memorize: Inductive Biases in Probability Flow of Minimum-Norm Shallow Neural Nets: This work theoretically analyzes the convergence behavior of probability flows in diffusion models driven by minimum \(\ell^2\)-norm shallow ReLU denoisers. It proves that the probability flow can converge to training samples (memorization), sums of training samples ("virtual points"), or manifold points on the boundary of a hyperbox (generalization), with the "early stopping" effect of the diffusion time scheduler determining the convergence target.
DDIS: When Model Knowledge Meets Diffusion Model: This work proposes DDIS, the first data-free image synthesis method that leverages text-to-image (T2I) diffusion models as image priors. By aligning Batch Normalization (BN) layer statistics during diffusion sampling via Domain Alignment Guidance (DAG), and encoding class-specific attributes through a Class Alignment Token (CAT), DDIS significantly outperforms existing DFIS methods on ImageNet-1k and multi-domain PACS.
Zero-Shot Adaptation of Parameter-Efficient Fine-Tuning in Diffusion Models: This work proposes ProLoRA, a training-free, closed-form method for migrating LoRA adapters across models. It decomposes and projects the source LoRA onto the subspace and null space of the source model weights, then re-projects the components onto the corresponding spaces of the target model, achieving lossless transfer of style, concept, and acceleration LoRAs across different diffusion models.