🎨 Image Generation
🧠 NeurIPS 2025 · 246 paper notes
- 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
-
DFloat11 exploits the low-entropy property of exponent bits in BFloat16 weights to losslessly compress LLMs and diffusion models to approximately 70% of their original size (equivalent to ~11 bits) via Huffman coding. It further introduces hierarchical lookup tables and a two-phase GPU kernel for efficient online decompression, enabling lossless inference of Llama 3.1 405B on a single node with 8×80GB GPUs.
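The low-entropy property the paper exploits is easy to verify directly: measure the empirical entropy of the 8-bit exponent field of Gaussian-like weights. A minimal sketch (the weight scale is an assumed stand-in for real LLM weights, not data from the paper):

```python
import numpy as np

def bf16_exponents(weights: np.ndarray) -> np.ndarray:
    # BFloat16 keeps float32's 8 exponent bits and truncates the mantissa,
    # so the float32 exponent field is identical to the BF16 exponent field.
    bits = weights.astype(np.float32).view(np.uint32)
    return ((bits >> 23) & 0xFF).astype(np.uint8)

def entropy_bits(symbols: np.ndarray) -> float:
    # Empirical Shannon entropy in bits per symbol: the lower bound
    # on average Huffman code length for this symbol stream.
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)  # LLM-like weight scale
H = entropy_bits(bf16_exponents(w))
# H is far below the 8 bits the exponent field occupies, which is what
# lets sign(1) + Huffman-coded exponent(~H) + mantissa(7) land near ~11 bits.
print(f"exponent entropy: {H:.2f} bits")
```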
- A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective
-
This paper identifies a generalization-to-memorization transition in diffusion models under self-consuming loops (where each generation of models is trained on synthetic data from the previous one), reveals a strong linear correlation between training set entropy and model generalization (Pearson \(r=0.91\)), and proposes entropy-based data selection strategies (Greedy Selection / Threshold Decay Filter) that effectively slow this transition—reducing FID from 75.7 to 44.7 at iteration 8 under the CIFAR-10 accumulate paradigm.
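The entropy-based selection idea can be sketched on a 1-D toy pool; the histogram entropy and greedy loop below are illustrative stand-ins for the paper's Greedy Selection (`bins` and `range_` are assumed knobs, and the paper's entropy estimator operates on real training sets, not scalars):

```python
import numpy as np

def hist_entropy(samples, bins=16, range_=(-4, 4)):
    # Shannon entropy (bits) of a 1-D sample set under a fixed histogram.
    counts, _ = np.histogram(samples, bins=bins, range=range_)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def greedy_select(pool, k):
    # Greedily pick k samples, each time taking the candidate that
    # maximizes the entropy of the selected set so far.
    selected, remaining = [], list(pool)
    for _ in range(k):
        best = max(remaining, key=lambda x: hist_entropy(selected + [x]))
        selected.append(best)
        remaining.remove(best)
    return np.array(selected)

rng = np.random.default_rng(1)
# A partially collapsed synthetic pool: most mass near 0, a few diverse samples.
pool = np.concatenate([rng.normal(0, 0.1, 200), rng.normal(0, 2.0, 20)])
subset = greedy_select(pool, k=30)  # higher-entropy subset than a naive slice
```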
- A Connection Between Score Matching and Local Intrinsic Dimension
-
This paper proves that the lower bound of the denoising score matching (DSM) loss is precisely the local intrinsic dimension (LID) of the data manifold, thereby establishing the DSM loss itself as an efficient LID estimator—requiring neither gradient computation nor multiple forward passes. On Stable Diffusion 3.5, this approach reduces peak memory usage to approximately 60% of FLIPD while yielding more stable estimates under quantization.
- A Data-Driven Prism: Multi-View Source Separation with Diffusion Model Priors
-
This paper proposes DDPRISM, a method that exploits structural differences among linear transformations across multi-view observations. Within an EM framework, it learns an independent diffusion model prior for each unknown source without requiring any isolated source samples, enabling source separation and posterior sampling. DDPRISM outperforms existing methods on both synthetic benchmarks and real galaxy observations.
- A Diffusion Model for Regular Time Series Generation from Irregular Data with Completion and Masking
-
This paper proposes a two-stage framework for generating regular time series from irregularly sampled data: (1) a TST autoencoder completes missing values to construct a "natural neighborhood," and (2) a masking strategy applied during visual diffusion model training computes loss only on observed pixels, avoiding over-reliance on completed values. The approach achieves an average 70% improvement in discriminative score and a 6.5× training speedup.
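The masking idea in stage (2) amounts to a training loss that sums only over measured pixels, so autoencoder-completed values never contribute gradient signal. A minimal sketch (function name and shapes are mine, not the paper's):

```python
import numpy as np

def observed_only_mse(pred, target, observed):
    # MSE restricted to observed pixels: imputed (completed) positions
    # are zeroed out of both the numerator and the normalizer.
    observed = observed.astype(np.float64)
    return float((((pred - target) ** 2) * observed).sum() / observed.sum())

rng = np.random.default_rng(0)
target = rng.normal(size=(32, 32))
observed = rng.random((32, 32)) < 0.6       # ~60% of pixels were measured
pred = target + (~observed) * 5.0           # errors only on completed pixels
loss = observed_only_mse(pred, target, observed)  # imputed errors are ignored
```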
- A Gradient Flow Approach to Solving Inverse Problems with Latent Diffusion Models
-
This paper proposes DWGF (Diffusion-regularized Wasserstein Gradient Flow), which rigorously formalizes posterior sampling with latent diffusion models as a regularized gradient flow of KL divergence in the Wasserstein-2 space. An ODE system in the latent space is derived to solve image inverse problems, achieving substantially higher PSNR than baselines on inpainting and super-resolution tasks on FFHQ-512.
- Accelerating Parallel Diffusion Model Serving with Residual Compression
-
This paper proposes CompactFusion, a framework that eliminates communication redundancy in parallel diffusion inference via residual compression—transmitting only the activation differences between adjacent denoising steps rather than full activations. It achieves a 3.0× speedup on 4×L20 GPUs with generation quality significantly superior to DistriFusion, a 6.7× speedup under simulated Ethernet bandwidth, and maintains better quality than DistriFusion even at 100× compression.
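The core observation is that adjacent denoising steps produce nearly identical activations, so only a compressed residual needs to cross the interconnect. A sketch with top-k sparsification standing in for the codec (the paper's actual compression scheme may differ; `keep_ratio` is an assumed knob):

```python
import numpy as np

def topk_residual(prev_act, cur_act, keep_ratio=0.05):
    # Keep only the largest-magnitude entries of the step-to-step residual;
    # the receiver reconstructs the rest from its cached previous activation.
    residual = (cur_act - prev_act).ravel()
    k = max(1, int(residual.size * keep_ratio))
    idx = np.argpartition(np.abs(residual), -k)[-k:]
    return idx, residual[idx]          # this is all that crosses the wire

def reconstruct(prev_act, idx, values):
    rec = prev_act.copy().ravel()
    rec[idx] += values
    return rec.reshape(prev_act.shape)

rng = np.random.default_rng(0)
a_t = rng.normal(size=(64, 64))
a_t1 = a_t + 0.01 * rng.normal(size=(64, 64))  # adjacent steps differ little
idx, vals = topk_residual(a_t, a_t1)
approx = reconstruct(a_t, idx, vals)   # closer to a_t1 than the stale cache
```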
- AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models
-
This paper reveals an error accumulation phenomenon in diffusion model quantization—where quantization errors at each step propagate and amplify into subsequent steps—and proposes explicitly simulating consecutive multi-step denoising during PTQ calibration to jointly optimize quantization parameters, while reducing memory from O(n) to O(1) through a carefully designed objective function.
- Adapting Speech Language Model to Singing Voice Synthesis
-
This paper adapts a 1.7B-parameter TTS-pretrained Speech Language Model to the Singing Voice Synthesis (SVS) task via score tokenization, multi-stream LM prediction, conditional flow matching refinement, and a vocoder. Using only 135 hours of synthesized singing data, the system achieves performance comparable to dedicated SVS systems.
- ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
-
This paper introduces ALE-Bench, the first AI benchmark targeting scored algorithm engineering contests (AtCoder Heuristic Contest). It curates 40 NP-hard optimization problems and provides an interactive agent evaluation framework. The strongest model, o3-high, achieves only human-average performance in a one-shot setting, with significant gaps between AI and human experts in cross-problem consistency and long-horizon iterative improvement.
- Aligning Compound AI Systems via System-level DPO
-
This paper models compound AI systems as DAGs and proposes the SysDPO framework, which extends DPO to joint multi-component alignment. By leveraging DAG decomposition, system-level preferences are transformed into an end-to-end optimizable loss function. The authors provide theoretical guarantees of \(\beta\)-perfect alignment and demonstrate substantial improvements in collaborative quality on both LLM+diffusion model and LLM+LLM systems.
- Aligning Text to Image in Diffusion Models is Easier Than You Think
-
This paper proposes SoftREPA — a lightweight contrastive fine-tuning strategy that introduces learnable soft text tokens (fewer than 1M parameters) to perform contrastive learning on frozen pretrained T2I diffusion models, explicitly maximizing mutual information between text and image representations. SoftREPA significantly improves text-image alignment on SD1.5/SDXL/SD3 and generalizes to both image generation and image editing tasks.
- Amortized Sampling with Transferable Normalizing Flows
-
This work proposes Prose — a 285M-parameter all-atom transferable normalizing flow based on the TarFlow architecture, trained on 21,700 short-peptide MD trajectories (totaling 4.3 ms of simulation time). Prose enables zero-shot uncorrelated proposal sampling for arbitrary short-peptide systems, outperforms MD baselines under equal energy evaluation budgets, and generates samples 4,000× faster than the prior transferable Boltzmann generator (TBG).
- AugGen: Synthetic Augmentation using Diffusion Models Can Improve Recognition
-
This paper proposes AugGen, a self-contained synthetic data augmentation method that trains a class-conditional diffusion model on the target dataset, generates new "mixed-class" samples by interpolating class conditioning vectors across different identities, and uses the resulting augmented data to improve discriminative model training. AugGen achieves 1–12% performance gains on face recognition benchmarks without relying on any external data or auxiliary models.
- BADiff: Bandwidth Adaptive Diffusion Model
-
This paper proposes BADiff—the first bandwidth-adaptive diffusion model—which embeds target entropy constraints as explicit conditions into the diffusion reverse process, coupled with a differentiable entropy regularization loss and an adaptive stopping policy. The model dynamically adjusts generation quality according to real-time bandwidth and terminates sampling adaptively, reducing computational overhead while maintaining perceptual quality. This approach fundamentally avoids the compression artifacts and computational waste inherent in the conventional "high-quality generation → post-compression" pipeline.
- Balanced Conic Rectified Flow
-
To address the distribution drift induced by the reflow step in k-rectified flow, this paper proposes conic reflow: constructing conic supervisory trajectories from the inverted noise of real images and their Slerp-perturbed neighbors, substantially reducing the number of required fake pairs while achieving superior generation quality and straighter ODE trajectories.
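The Slerp-perturbed neighbors are standard spherical interpolation in noise space; a minimal implementation (the blend weight `t=0.1` is illustrative, not the paper's setting):

```python
import numpy as np

def slerp(x, y, t):
    # Spherical linear interpolation between two noise vectors; for
    # vectors of similar norm this stays near the Gaussian shell.
    x_n, y_n = x / np.linalg.norm(x), y / np.linalg.norm(y)
    omega = np.arccos(np.clip(np.dot(x_n, y_n), -1.0, 1.0))
    if omega < 1e-8:               # nearly parallel: fall back to lerp
        return (1 - t) * x + t * y
    return (np.sin((1 - t) * omega) * x + np.sin(t * omega) * y) / np.sin(omega)

rng = np.random.default_rng(0)
z = rng.normal(size=512)            # inverted noise of a real image
eps = rng.normal(size=512)          # fresh Gaussian direction
z_perturbed = slerp(z, eps, t=0.1)  # a nearby neighbor of the inverted noise
```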
- Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking
-
Prime (Partial masking scheme) represents each token as a base-\(b\) sub-token sequence and independently masks at the sub-token level, introducing intermediate states into masked diffusion models to enable fine-grained denoising. On OpenWebText, it achieves a perplexity of 15.36, becoming the first MDM to surpass ARM (17.54) without relying on an autoregressive formulation.
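The base-\(b\) decomposition is fixed-length digit expansion of each token id; masking individual digits then yields the partially revealed intermediate states. A sketch (base and length are illustrative choices):

```python
def to_subtokens(token_id: int, base: int, length: int) -> list:
    # Decompose a token id into a fixed-length base-`base` digit
    # sequence, most significant digit first.
    digits = []
    for _ in range(length):
        digits.append(token_id % base)
        token_id //= base
    return digits[::-1]

def from_subtokens(digits: list, base: int) -> int:
    value = 0
    for d in digits:
        value = value * base + d
    return value

# A ~50k-entry vocabulary fits in 4 base-16 digits (16**4 = 65536);
# masking now operates on single digits, not whole tokens.
sub = to_subtokens(49_999, base=16, length=4)  # -> [12, 3, 4, 15]
```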
- BitMark: Watermarking Bitwise Autoregressive Image Generative Models
-
This paper proposes BitMark—the first watermarking scheme for bitwise autoregressive image generative models (Infinity, Instella). During generation, it steers bit sequences toward a "green list" by adding logit biases, enabling reliable detection (z-test), high image fidelity (negligible FID change), robustness against diverse attacks, and radioactivity (downstream models trained on watermarked images also carry the watermark), providing a critical tool for preventing model collapse.
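Detection follows the usual green-list watermark recipe: count how many generated bits land in the green list and run a one-sided z-test against the no-watermark null. A sketch (the `gamma=0.5` green-list fraction and the counts are assumptions for illustration, not the paper's settings):

```python
import math

def greenlist_z_score(green_hits: int, total_bits: int, gamma: float = 0.5) -> float:
    # Under the null (no watermark) each bit is green with probability
    # gamma; a logit-biased generator pushes the hit count upward.
    expected = gamma * total_bits
    std = math.sqrt(total_bits * gamma * (1 - gamma))
    return (green_hits - expected) / std

# 10,000 generated bits with 5,600 green hits -> strong evidence of a watermark.
z = greenlist_z_score(5_600, 10_000)
print(f"z = {z:.1f}")
```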
- Blameless Users in a Clean Room: Defining Copyright Protection for Generative Models
-
This paper reconstructs the theoretical foundations of provable copyright protection for generative models. It demonstrates that the existing Near Access-Freeness (NAF) definition fails to prevent verbatim reproduction ("tainted" models), proposes a "blameless user" framework and a clean-room copyright protection definition (\((\kappa,\beta)\)-clean), under which users who would not reproduce content in a counterfactual "clean-room setting" are also unlikely to reproduce it in the real world. The paper further proves that differentially private training implies clean-room copyright protection under a "golden dataset" assumption.
- Blind Strong Gravitational Lensing Inversion: Joint Inference of Source and Lens Mass with Score-Based Models
-
This work presents the first application of score-based generative model priors to blind strong gravitational lensing inversion — jointly inferring the morphology of background source galaxies and lens mass distribution parameters. By extending GibbsDDRM to the continuous-time domain, the method achieves reconstruction residuals consistent with observational noise and unbiased marginal posteriors over lens parameters.
- BlurDM: A Blur Diffusion Model for Image Deblurring
-
BlurDM integrates the physical formation process of motion blur (progressive blur accumulation due to continuous exposure) into a diffusion model via a dual forward process (simultaneous noise addition and blurring) and a dual denoising-deblurring reverse process. It serves as a latent-space prior generator that consistently enhances four deblurring methods across four datasets, achieving an average gain of +0.31 dB on GoPro and +0.78 dB on RealBlur-J, while adding only ~4 GFLOPs and ~9 ms.
- BlurGuard: A Simple Approach for Robustifying Image Protection Against AI-Powered Edit
-
This paper proposes BlurGuard—a method that applies mild blurring to an image prior to adversarial perturbation generation, causing the perturbation to couple with low-frequency structures and thereby resist post-processing operations such as JPEG compression and Gaussian noise. This approach more effectively prevents AI editing tools such as Stable Diffusion from tampering with protected images, achieving over 20% improvement in protection success rate compared to the non-blurred baseline.
- BoltzNCE: Learning Likelihoods for Boltzmann Generation with Stochastic Interpolants
-
BoltzNCE trains an Energy-Based Model (EBM) via a hybrid Score Matching + InfoNCE objective to approximate the likelihood of a Boltzmann Generator, eliminating expensive Jacobian trace computations. On alanine dipeptide conformation generation, it achieves a 100× inference speedup with a free energy error of only 0.02 \(k_BT\).
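The InfoNCE half of the hybrid objective fits in a few lines; this shows only the contrastive term, not the score-matching term or the EBM itself (the temperature and the diagonal-positive batch layout are assumptions):

```python
import numpy as np

def info_nce(scores, temperature=0.1):
    # InfoNCE for a batch where scores[i, i] pairs sample i with its
    # positive and off-diagonal entries are in-batch negatives.
    logits = scores / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

scores = np.eye(4) * 10.0                    # positives strongly preferred
loss = info_nce(scores, temperature=1.0)     # near zero: model ranks positives first
```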
- Boosting Generative Image Modeling via Joint Image-Feature Synthesis
-
This paper proposes ReDi (Representation Diffusion), a framework that jointly models VAE image latents and DINOv2 semantic features within a diffusion model — both are simultaneously denoised from pure noise within a single diffusion process. With minimal modifications to the DiT architecture, ReDi achieves a 23× training convergence speedup and state-of-the-art FID, while unlocking a novel Representation Guidance inference strategy.
- Breaking AR's Sampling Bottleneck: Provable Acceleration via Diffusion Language Models
-
This paper establishes a complete convergence theory for masked diffusion language models from an information-theoretic perspective: it proves that the sampling error in KL divergence decays at an \(O(1/T)\) rate and scales linearly with inter-token mutual information, provides a matching lower bound to establish tightness, and theoretically demonstrates that diffusion models can generate high-quality samples in \(T < L\) steps (where \(L\) is the sequence length).
- CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop
-
This paper proposes CADMorph, an iterative plan–generate–verify framework that leverages a pretrained Parameter-to-Shape (P2S) diffusion model and a Masked-Parameter-Prediction (MPP) large language model to achieve geometry-driven parametric CAD editing without requiring triplet training data.
- CAMILA: Context-Aware Masking for Image Editing with Language Alignment
-
This paper proposes CAMILA, a context-aware image editing method that leverages a multimodal large language model (MLLM) to automatically determine whether a given instruction is executable on the input image. It introduces dedicated [MASK] and [NEG] tokens to distinguish editable regions from regions that should remain unchanged, enabling precise multi-instruction editing while effectively filtering out non-executable instructions.
- CaMiT: A Time-Aware Car Model Dataset for Classification and Generation
-
This paper introduces the CaMiT dataset (787K labeled + 5.1M unlabeled car images, 2005–2023) to systematically study temporal drift in fine-grained visual categories, providing benchmarks across four settings: static pre-training, time-incremental pre-training, time-incremental classifier learning, and time-aware image generation.
- Can Knowledge-Graph-based Retrieval Augmented Generation Really Retrieve What You Need?
-
This paper proposes GraphFlow, a framework that models retrieval over knowledge graphs as a flow-matching problem under GFlowNet, jointly training a retrieval policy and flow estimator via a detailed balance objective and local exploration strategy. On the STaRK benchmark, GraphFlow surpasses GPT-4o by approximately 10% in both retrieval accuracy and diversity.
- CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices
-
CDFlow is proposed to construct invertible linear layers via alternating products of circulant and diagonal matrices, reducing parameter complexity from \(\mathcal{O}(n^2)\) to \(\mathcal{O}(mn)\), matrix inversion complexity from \(\mathcal{O}(n^3)\) to \(\mathcal{O}(mn\log n)\), and log-determinant computation from \(\mathcal{O}(n^3)\) to \(\mathcal{O}(mn)\), outperforming comparable methods on density estimation and periodic data modeling.
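The complexity claims rest on the FFT diagonalization of circulant matrices: matvecs cost \(\mathcal{O}(n\log n)\) and the log-determinant is a sum over DFT coefficients. A sketch of both identities checked against a dense reference (this is the standard circulant result, not CDFlow's full layer):

```python
import numpy as np

def circulant_matvec(c, x):
    # C @ x for the circulant matrix C with first column c, via
    # C = F^{-1} diag(F c) F: O(n log n) instead of O(n^2).
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_logabsdet(c):
    # Eigenvalues of a circulant matrix are the DFT of its first
    # column, so log|det C| costs one FFT plus a sum.
    return float(np.log(np.abs(np.fft.fft(c))).sum())

rng = np.random.default_rng(0)
n = 8
c, x = rng.normal(size=n), rng.normal(size=n)

# Dense reference for checking: column j of C is c rolled down by j.
C = np.column_stack([np.roll(c, j) for j in range(n)])
fast_mv = circulant_matvec(c, x)
fast_ld = circulant_logabsdet(c)
```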
- Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data
-
This paper proposes CompFlow, a composite flow matching architecture that builds an online flow on top of the offline flow's output distribution to estimate the dynamics shift (Wasserstein distance) between offline and online environments. Combined with an active exploration strategy targeting high-shift regions, CompFlow achieves an average return 14.2% above the strongest baseline across 27 shifted-dynamics RL tasks.
- Composition and Alignment of Diffusion Models using Constrained Learning
-
This paper proposes a unified constrained optimization framework that formalizes reward alignment and multi-model composition of diffusion models as constrained optimization problems. By applying Lagrangian duality, the framework automatically determines optimal weights, eliminating the need for manual hyperparameter search.
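How Lagrangian duality replaces manual weight search can be seen on a scalar toy problem: the constraint weight (the multiplier) is driven to its optimal value by dual ascent instead of being hand-tuned. Everything below — the objective, step sizes, and iteration count — is illustrative, not the paper's setup:

```python
# minimize f(x) = (x - 3)^2   subject to   g(x) = x - 1 <= 0
x, lam = 0.0, 0.0
eta_x, eta_lam = 0.05, 0.1
for _ in range(2000):
    x -= eta_x * (2 * (x - 3) + lam)          # primal step on f + lam * g
    lam = max(0.0, lam + eta_lam * (x - 1))   # dual ascent, projected to lam >= 0
# Converges to the KKT point x* = 1 (constraint active) with lam* = 4:
# the "constraint weight" 4 is found automatically, never hand-searched.
```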
- Conditional Panoramic Image Generation via Masked Autoregressive Modeling
-
This paper proposes PAR (Panoramic AutoRegressive model), the first framework to unify text-to-panorama (T2P) and panorama outpainting (PO) under masked autoregressive modeling. PAR addresses the boundary discontinuity inherent in ERP panoramas through a circular translation consistency loss and dual-space circular padding, achieving an FID of 37.37 on Matterport3D while demonstrating strong scalability and zero-shot generalization.
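The image-space half of dual-space circular padding is width-wise wrap-around, so convolutions treat the ERP seam at ±180° longitude as continuous (shapes and pad width below are illustrative):

```python
import numpy as np

def circular_pad_width(x, pad):
    # Pad an (H, W, C) ERP panorama along the width axis by wrapping:
    # the left edge is continued with pixels from the right edge and
    # vice versa, matching the panorama's longitudinal periodicity.
    return np.concatenate([x[:, -pad:], x, x[:, :pad]], axis=1)

pano = np.arange(2 * 8 * 1).reshape(2, 8, 1)
padded = circular_pad_width(pano, pad=2)   # width 8 -> 12, seam made continuous
```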
- Constrained Discrete Diffusion
-
This paper proposes CDD (Constrained Discrete Diffusion), which embeds a differentiable constrained optimization projection operator into the denoising process of discrete diffusion models. Without retraining, CDD enforces sequence-level constraints at sampling time, achieving zero constraint violations across three task categories: toxic text generation, molecular design, and instruction following.
- Contextual Thompson Sampling via Generation of Missing Data
-
This paper proposes Generative Thompson Sampling (TS-Gen), which reframes uncertainty in contextual bandits as missing data rather than unknown parameters. A generative model autoregressively imputes missing outcomes to implement Thompson sampling, and a regret bound directly tied to offline prediction loss is established.
- Continuous Diffusion Model for Language Modeling
-
This paper proposes RDLM (Riemannian Diffusion Language Model), which constructs a continuous diffusion process on a statistical manifold (hypersphere) to model discrete distributions. It establishes a theoretical connection between discrete diffusion and continuous flows, and leverages radial symmetry to enable simulation-free training and a dimension-splitting technique for handling large vocabularies. RDLM achieves 1.32 BPC on Text8, surpassing all discrete and continuous diffusion models.
- Continuous Uniqueness and Novelty Metrics for Generative Modeling of Inorganic Crystals
-
This paper identifies four critical flaws in the widely adopted discrete distance function (StructureMatcher) used to evaluate inorganic crystal generative models, and proposes continuous distance functions based on Magpie fingerprints (composition) and AMD vectors (structure) to achieve more reliable uniqueness and novelty metrics.
- CORAL: Disentangling Latent Representations in Long-Tailed Diffusion
-
This paper diagnoses the root cause of tail-class generation degradation in diffusion models trained on long-tailed data as representation entanglement in the U-Net bottleneck layer, where tail and head class features severely overlap, and proposes CORAL, which attaches a projection head at the bottleneck and applies a supervised contrastive loss to disentangle class representations. CORAL consistently outperforms baselines including DDPM, CBDM, and T2H on CIFAR10/100-LT, CelebA-5, and ImageNet-LT.
- Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
-
This paper proposes CoRL (Co-Reinforcement Learning), a two-stage framework — Unified RL followed by Refined RL — that simultaneously optimizes both understanding and generation capabilities of Unified Multimodal Language Models (ULMs) via reinforcement learning, achieving synergistic co-evolution of dual capabilities: +7% on generation and +23% on understanding at 1.5B parameters.
- Counterfactual Identifiability via Dynamic Optimal Transport
-
This paper leverages dynamic optimal transport (dynamic OT) theory to resolve—for the first time—the counterfactual identifiability problem in high-dimensional multivariate Markovian SCMs. It proves that the OT flow mechanism yields a unique monotone order-preserving counterfactual transport map, and extends the results to non-Markovian settings (IV/BC/FC criteria).
- Coupling Generative Modeling and an Autoencoder with the Causal Bridge
-
In the presence of unobserved confounders, this paper proposes coupling a generative model with an autoencoder to improve estimation of the causal bridge function—sharing statistical strength across treatment, control, and outcome variables via a shared encoder—and extends the framework to survival analysis.
- Cross-fluctuation Phase Transitions Reveal Sampling Dynamics in Diffusion Models
-
Drawing on fluctuation theory from statistical physics, this work proposes a framework for detecting discrete phase transitions in the sampling process of diffusion models via cross-fluctuations, enabling accelerated sampling, improved conditional generation, zero-shot classification, and style transfer—all without retraining.
- Decomate: Leveraging Generative Models for Co-Creative SVG Animation
-
This paper proposes Decomate, an interactive system that leverages multimodal large language models (MLLMs) to automatically decompose unstructured SVG graphics into semantic components. Designers specify animation behaviors for each component via natural language, and the system generates production-ready HTML/CSS/JS animation code, supporting iterative co-creative workflows.
- DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
-
This paper proposes DEFT (Decompositional Efficient Fine-Tuning), which efficiently fine-tunes T2I models by decomposing weight updates into two components — subspace projection and low-rank adjustment — outperforming LoRA and PaRa on both personalized and general image generation tasks.
- Denoising Weak Lensing Mass Maps with Diffusion Model and Generative Adversarial Network
-
This work applies diffusion models (DM) to the task of weak gravitational lensing mass map denoising and conducts a systematic comparison with GAN (pix2pix) under identical experimental settings, demonstrating that DM comprehensively outperforms GAN in terms of training stability, robustness under multi-sample averaging, and reconstruction accuracy across multiple statistical estimators.
- Detecting Generated Images by Fitting Natural Image Distributions
-
This paper proposes ConV, a consistency verification framework that exploits the geometric discrepancy between the natural image manifold and generated images. By constructing two gradient-orthogonal functions, ConV achieves training-free generated image detection. An enhanced variant, F-ConV, further amplifies manifold deviation via Normalizing Flows.
- Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model
-
This paper proposes a unified three-stage workflow based on a fine-tuned geospatial foundation model (Granite-GFM): first establishing an empirical baseline via green space cooling effects to verify physical plausibility; then extrapolating urban temperatures under future climate scenarios; and finally simulating the cooling impact of greening interventions via inpainting. This elevates the foundation model from an evaluation tool to an interactive simulation platform for urban planning.
- DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models
-
This paper proposes DEXTER, a data-free framework that optimizes textual prompts to drive a diffusion model to generate images maximizing target classifier activations, then employs an LLM to reason over the synthesized samples and produce globally coherent, human-readable textual explanations, enabling bias discovery and global interpretation of model behavior.
- DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
-
This paper finds that the global self-attention in pretrained DiTs primarily captures local patterns and thus exhibits substantial redundancy in generative tasks. It proposes DiCo, a purely convolutional diffusion model built from standard convolution modules and a Compact Channel Attention (CCA) mechanism. DiCo achieves an FID of 2.05 on ImageNet-256, surpassing DiT-XL/2, with 2.7× faster inference at 256 resolution and 3.1× faster at 512 resolution.
- Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior
-
This paper proposes Diff-ICMH, a diffusion-based generative image compression framework that preserves semantic integrity via a Semantic Consistency (SC) loss and activates generative priors via a Tag Guidance Module (TGM). Using a single encoder-decoder and a single bitstream, the framework simultaneously serves 10+ machine intelligence tasks and human visual perception without any task-specific adaptation.
- DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images
-
This paper proposes DiffEye, the first diffusion-based framework that directly utilizes raw eye-tracking data to generate continuous and diverse eye movement trajectories conditioned on natural images, while introducing Corresponding Position Embedding (CPE) to align the gaze space with the image semantic space.
- Diffusion-Based Electromagnetic Inverse Design of Scattering Structured Media
-
This paper proposes a conditional diffusion model-based framework for electromagnetic inverse design that directly generates dielectric-sphere metasurface geometries from target differential scattering cross sections (DSCS), bypassing costly iterative optimization. The approach naturally handles the non-uniqueness of the inverse problem and outperforms CMA-ES evolutionary optimization while being orders of magnitude faster.
- Diffusion-Classifier Synergy: Reward-Aligned Learning via Mutual Boosting Loop for FSCIL
-
This paper proposes the Diffusion-Classifier Synergy (DCS) framework, which establishes a closed-loop mutual boosting cycle between a diffusion model and a classifier. A multi-level reward function (feature-level + logits-level) guides the diffusion model to generate images most beneficial to the classifier, achieving state-of-the-art performance on few-shot class-incremental learning (FSCIL) benchmarks.
- Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation
-
This paper proposes the DPTM framework, which leverages a latent diffusion model to perform semantic transformation on unreliable target samples, generating a pseudo-target domain and iteratively narrowing the gap with the real target domain via a progressive reconstruction mechanism. DPTM achieves up to 18.6% improvement over existing SFDA state-of-the-art methods under large domain shift scenarios.
- Diffusion Adaptive Text Embedding for Text-to-Image Diffusion Models
-
This paper proposes DATE (Diffusion Adaptive Text Embedding), which dynamically updates text embeddings during diffusion model sampling based on the current denoising intermediate results, improving text-image semantic alignment without any additional training.
- Diffusion Classifiers Understand Compositionality, but Conditions Apply
-
A comprehensive study of zero-shot diffusion classifiers on compositional understanding tasks: covering 3 diffusion models (SD 1.5/2.0/3-m) × 10 datasets × 30+ tasks. The paper introduces Self-Bench, a diagnostic benchmark that eliminates domain gap by using images generated by the diffusion models themselves, and finds that diffusion classifiers do understand compositionality—but performance is conditioned on domain alignment and timestep weighting, hence "conditions apply."
- Diffusion Generative Modeling on Lie Group Representations
-
This paper proposes a novel theoretical framework for constructing diffusion processes on the representation space of Lie groups (rather than on the Lie groups themselves). By mapping the curved dynamics of non-Abelian Lie groups into Euclidean space via generalized score matching, the framework enables simulation-free training of Lie group diffusion models, and demonstrates that standard score matching is a special case corresponding to the translation group.
- Diffusion Models Meet Contextual Bandits
-
This paper proposes diffusion Thompson Sampling (dTS), which employs a pretrained diffusion model as an expressive prior over action parameters in contextual bandit problems. Through an efficient hierarchical posterior approximation, dTS enables fast updates and sampling, significantly outperforming conventional methods in large action spaces.
- Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation
-
This paper proposes Distilled Decoding 2 (DD2), which reinterprets auto-regressive image models as conditional score models and designs a Conditional Score Distillation (CSD) loss to compress multi-step AR sampling into one-step generation. On ImageNet-256, DD2 achieves only a marginal FID degradation from 3.40 to 5.43 while obtaining 8.0× speedup (VAR) and 238× speedup (LlamaGen), closing 67% of the performance gap relative to DD1 and training 12.3× faster.
- DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution
-
This paper presents DOVE, a video super-resolution model built upon the CogVideoX pretrained video generation model. Through a two-stage latent-pixel space training strategy and a curated high-quality HQ-VSR dataset, DOVE achieves single-step inference for video super-resolution, delivering 28× speedup over multi-step diffusion methods while achieving comparable or superior performance.
- Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable
-
This paper proposes Dual Data Alignment (DDA), which generates synthetic training images via pixel-domain and frequency-domain dual alignment to eliminate spurious correlations caused by dataset bias. By forcing the detector to learn only forgery-relevant features, DDA achieves an average accuracy of 90.7% across 11 benchmarks, substantially outperforming existing methods.
- EditInfinity: Image Editing with Binary-Quantized Generative Models
-
This paper proposes EditInfinity, the first work to apply the classical "image inversion–image editing" paradigm to binary-quantized autoregressive generative models (Infinity). By leveraging the inherent property of quantized representations that enables exact intermediate supervision, EditInfinity achieves high-fidelity image inversion. Combined with a piecewise linear smoothing kernel for seamless editing, it comprehensively surpasses diffusion model baselines on PIE-Bench.
- EEGReXferNet: A Lightweight Gen-AI Framework for EEG Subspace Reconstruction via Cross-Subject Transfer Learning and Channel-Aware Embedding
-
This paper proposes EEGReXferNet, a lightweight generative AI framework that achieves EEG subspace reconstruction under a cross-subject transfer learning setting via neighborhood channel-aware input selection, band-specific sub-window convolutional encoding/decoding, a dynamic sliding-window latent space, and reference statistics scaling. The framework reduces parameter count by approximately 45% and achieves inference latency <1ms, while maintaining PSD correlation \(\geq 0.95\) and spectrogram RV coefficient \(\geq 0.85\).
- Efficient Rectified Flow for Image Fusion
-
This paper proposes RFfusion, which introduces Rectified Flow into image fusion for the first time, enabling training-free one-step sampling. A two-stage fusion-oriented VAE training strategy is further designed, achieving comprehensive superiority over existing diffusion-based fusion methods in both speed and quality.
- Elucidated Rolling Diffusion Models for Probabilistic Forecasting of Complex Dynamics
-
This paper proposes ERDM, the first framework to successfully unify the Rolling Diffusion paradigm with the principled design choices of EDM (noise schedule, preconditioning, Heun sampler). By employing a progressive noise schedule that explicitly models growing uncertainty, ERDM significantly outperforms autoregressive EDM baselines on Navier-Stokes and ERA5 weather forecasting benchmarks.
- Emergence and Evolution of Interpretable Concepts in Diffusion Models
-
This work is the first to systematically apply Sparse Autoencoders (SAEs) to multi-step diffusion models (Stable Diffusion v1.4), revealing that image composition emerges as early as the first reverse diffusion step while stylistic concepts form during intermediate stages. Based on these findings, the paper proposes temporally adaptive causal intervention techniques.
- Encoder-Decoder Diffusion Language Models for Efficient Training and Inference
-
This paper proposes E2D2, an encoder-decoder architecture for discrete diffusion language models that performs iterative denoising via a lightweight decoder while periodically updating representations through a large encoder, achieving faster inference (~3× vs. MDLM) and more efficient block diffusion training (halving FLOPs).
- Energy Loss Functions for Physical Systems
-
This paper proposes a physics-based energy loss function framework. By deriving an energy-difference loss grounded in pairwise distances via reverse KL divergence and the Boltzmann distribution, the framework naturally satisfies SE(d) invariance and substantially outperforms MSE and cross-entropy losses on molecular generation and spin ground-state prediction tasks.
- Enhancing Diffusion Model Guidance through Calibration and Regularization
-
To address the vanishing gradient problem caused by overconfident classifiers in classifier-guided diffusion models, this paper proposes two complementary approaches: (1) a Smooth ECE calibration loss for fine-tuning classifiers, yielding ~3% FID improvement; and (2) regularized sampling guidance based on f-divergences (RKL/FKL/JS) that requires no retraining, achieving FID 2.13 on ImageNet 128×128.
- Entropy Rectifying Guidance for Diffusion and Flow Models
-
This paper proposes Entropy Rectifying Guidance (ERG), which manipulates the Hopfield energy landscape of attention layers (via temperature scaling and step-size adjustment) to obtain a weak prediction signal as a substitute for the unconditional prediction in conventional CFG, simultaneously improving quality, diversity, and consistency in text-to-image, class-conditional, and unconditional generation.
- Epistemic Uncertainty for Generated Image Detection
-
This paper proposes WePe (Weight Perturbation), which estimates epistemic uncertainty by applying weight perturbations to a pretrained vision foundation model (DINOv2). The method exploits the divergence between natural and AI-generated images in uncertainty space for detection, requiring no training.
- Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems
-
This paper proposes an equivariant flow matching framework combined with a symmetric coupling strategy to model multimodal probability distributions arising in symmetry-breaking bifurcation problems via generative AI, significantly outperforming deterministic models and VAEs on physical systems (buckling beam, Allen-Cahn equation).
- Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation
-
This paper systematically evaluates 12 text-image compositional alignment metrics against human judgments, finding that no single metric consistently outperforms all others across compositional categories, that VQA metrics are not always superior, and that embedding-based metrics (ImageReward, HPS) are stronger on certain categories.
- EVODiff: Entropy-aware Variance Optimized Diffusion Inference
-
This paper analyzes the inference process of diffusion models from an information-theoretic perspective and proposes EVODiff, a method that reduces conditional entropy by optimizing conditional variance, achieving significant sampling acceleration and quality improvement without modifying the underlying model.
- Evolve to Inspire: Novelty Search for Diverse Image Generation
-
This paper proposes Wander, a framework that leverages novelty search and LLM-driven prompt evolution to generate highly diverse image collections from a single text prompt, surpassing existing evolutionary prompt optimization baselines on the Vendi Score metric.
- Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction
-
This paper proposes InSUR, a multi-dimensional instruction uncertainty reduction framework that stabilizes adversarial optimization via a ResAdv-DDIM sampler, constrains attack scenarios through context-aware encoding, and evaluates semantic fidelity using WordNet-based semantic abstraction. InSUR is the first method to generate 2D/3D semantic-constrained adversarial examples (SemanticAE) from natural language instructions.
- Exploring Variational Graph Autoencoders for Distribution Grid Data Generation
-
This paper systematically evaluates four variational graph autoencoder (VGAE) decoder architectures for synthesizing distribution grid topologies. The Iterative-GCN decoder is found to adequately reproduce structural and spectral characteristics of real grids on small, homogeneous datasets; however, on large, heterogeneous datasets, all methods exhibit critical failure modes including disconnected components and repetitive patterns.
- Failure Prediction at Runtime for Generative Robot Policies
-
This paper proposes FIPER, a framework for runtime failure prediction in generative robot policies (diffusion/flow matching). It jointly evaluates an observation-side metric RND-OE (OOD detection) and an action-side metric ACE (Action Chunk Entropy) to enable early and accurate failure prediction without any failure data, with statistical guarantees provided via conformal prediction.
- FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models
-
This paper proposes FairImagen, a post-processing debiasing framework that applies FairPCA projection in the CLIP prompt embedding space to remove demographic information, combined with empirical noise injection and joint cross-demographic debiasing, achieving significant fairness improvements in text-to-image generation without retraining the model.
- FALCON: Few-step Accurate Likelihoods for Continuous Flows
-
This paper proposes FALCON, which employs a hybrid training objective (flow matching + mean velocity loss + invertibility regularization) to enable continuous normalizing flows to provide sufficiently accurate likelihood estimates under few-step sampling, achieving Boltzmann sampling two orders of magnitude faster than conventional CNFs.
- Fast Data Attribution for Text-to-Image Models
-
This work distills the accurate but computationally expensive Attribution by Unlearning (AbU) method into a lightweight feature embedding space. By training via learning-to-rank, simple cosine similarity retrieval approximates the costly attribution ranking, enabling millisecond-level data attribution at the scale of Stable Diffusion + LAION-400M for the first time.
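Once distilled, the retrieval step reduces to cosine similarity in the learned embedding space. A minimal sketch of that step (the feature extractor and learning-to-rank training are the expensive parts the paper distills away; the embeddings here are placeholders):

```python
import numpy as np

def top_k_attribution(query_embed, train_embeds, k=5):
    """Rank training images by cosine similarity to a generated image's embedding."""
    q = query_embed / np.linalg.norm(query_embed)
    t = train_embeds / np.linalg.norm(train_embeds, axis=1, keepdims=True)
    scores = t @ q  # cosine similarity against every training embedding
    return np.argsort(-scores)[:k]
```

Because scoring is a single matrix-vector product, attribution over millions of training images amortizes to milliseconds on modern hardware.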
- Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms
-
This work introduces high-order numerical methods into discrete diffusion model inference for the first time, proposing two second-order solvers — θ-RK-2 and θ-Trapezoidal — and theoretically proving that θ-Trapezoidal improves the discretization error from first-order \(\mathcal{O}(\kappa T)\) to second-order \(\mathcal{O}(\kappa^2 T)\). Experiments spanning 200M–8B models consistently demonstrate improvements across text, image, and mathematical reasoning tasks.
- FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies
-
Grounded in Markov Random Field (MRF) theory, this paper proposes a Local Pixel Dependency (LPD) feature representation that exposes textural inconsistencies in generated images via median-filter reconstruction. Combined with FerretNet, a lightweight convolutional network with only 1.1M parameters, the approach achieves an average detection accuracy of 97.1% across 22 generative models while being trained exclusively on 4 categories of ProGAN data.
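A minimal sketch of the median-filter reconstruction idea behind LPD-style features (illustrative only; the paper's exact feature design differs): the residual between an image and its median-filtered reconstruction exposes local pixel dependencies that generators tend to violate.

```python
import numpy as np

def median_reconstruction_residual(image, size=3):
    """Residual between an image and its median-filtered reconstruction.

    A hand-rolled median filter via sliding windows; edges are padded by
    replication so the output keeps the input shape.
    """
    pad = size // 2
    padded = np.pad(image.astype(np.float32), pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    reconstruction = np.median(windows, axis=(-2, -1))
    return image.astype(np.float32) - reconstruction
```

On natural images this residual is dominated by fine texture; on generated images its statistics shift, which is what the lightweight classifier picks up.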
- Flatten Graphs as Sequences: Transformers are Scalable Graph Generators
-
This paper proposes AutoGraph, which losslessly flattens graphs into token sequences via Segmented Eulerian Neighborhood Trails (SENT), enabling direct modeling with a decoder-only Transformer. AutoGraph achieves graph generation speeds ~100× faster than diffusion models while reaching state-of-the-art performance on both synthetic and molecular benchmarks.
- Flattening Hierarchies with Policy Bootstrapping
-
This paper proposes SAW (Subgoal Advantage-Weighted Policy Bootstrapping), which distills the long-horizon reasoning advantages of hierarchical RL into a single flat policy by sampling subgoals from in-dataset trajectories and performing policy bootstrapping via advantage-weighted importance sampling. The approach requires no learned subgoal generative model, and matches or surpasses state-of-the-art performance across 20 offline GCRL datasets.
- Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators
-
This paper proposes Flex-Judge, which fine-tunes a multimodal large language model on only 1K text-only reasoning samples to achieve zero-shot generalization across image, video, audio, and molecular evaluation tasks, matching or surpassing commercial APIs such as GPT-4o and specialized evaluators trained on large-scale annotated data.
- Flow Matching Neural Processes
-
This paper proposes FlowNP, which integrates flow matching into the neural process framework. By employing a transformer to predict velocity fields at target points, FlowNP enables parallel sampling from conditional distributions, achieving state-of-the-art performance on three benchmarks spanning 1D Gaussian processes, image data, and meteorological data.
- FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
-
This paper proposes FocalCodec—a low-bitrate speech codec based on Focal Modulation Networks—that compresses speech to 0.16–0.65 kbps using a single binary codebook, achieving performance comparable to or better than multi-codebook state-of-the-art methods on speech resynthesis, voice conversion, and multiple downstream tasks.
- FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency
-
This work is the first to introduce frequency-domain consistency constraints into flow-based visuomotor policies. By projecting action chunk velocity fields into the frequency domain via DCT and imposing an adaptive frequency component loss, it achieves high-quality one-step action generation at 93.5 Hz, outperforming existing one-step generation methods on both simulation and real-robot tasks.
- From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging
-
This paper proposes Cradle2Cane, a two-pass face aging framework: the first pass achieves precise age control via Adaptive Noise Injection (AdaNI), and the second pass reinforces identity consistency through dual identity embeddings (IDEmb) comprising SVR-ArcFace and Rotate-CLIP. The framework achieves an optimal balance between age accuracy and identity preservation across the full lifespan (0–80 years).
- GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data
-
GeneMAN proposes a generalizable single-image 3D human reconstruction framework that requires no parametric body model (e.g., SMPL). By training human-specific 2D/3D diffusion prior models on large-scale multi-source human data, and combining a geometry initialization-sculpting pipeline with multi-space texture refinement, GeneMAN achieves high-fidelity 3D human reconstruction from in-the-wild images, handling diverse body proportions, complex poses, and personal belongings.
- Generative Model Inversion Through the Lens of the Manifold Hypothesis
-
This paper reveals, from a manifold-geometric perspective, that the essence of generative model inversion attacks (MIA) is implicit denoising achieved by projecting loss gradients onto the generator's tangent space. It proposes the gradient-manifold alignment hypothesis (higher alignment → greater vulnerability), and designs a training-free method, AlignMI, that consistently and significantly improves upon multiple state-of-the-art attacks.
- GenIR: Generative Visual Feedback for Mental Image Retrieval
-
This paper proposes GenIR, a multi-round interactive image retrieval framework that leverages text-to-image diffusion models to generate "synthetic visual feedback," explicitly visualizing the system's interpretation of the user's query. This enables users to intuitively identify discrepancies and iteratively refine their queries, achieving substantial improvements over text-only feedback methods on the Mental Image Retrieval (MIR) task.
- GeoRemover: Removing Objects and Their Causal Visual Artifacts
-
GeoRemover is a geometry-aware two-stage framework that decouples object removal into geometric removal (depth domain) and appearance rendering (RGB domain). By modifying the scene's geometric representation, it implicitly eliminates causal visual artifacts—such as shadows and reflections—left by the removed object.
- Gradient Variance Reveals Failure Modes in Flow-Based Generative Models
-
By analyzing the gradient variance of the CFM loss, this paper demonstrates that Rectified Flow inevitably memorizes training pairs under deterministic interpolation rather than learning an optimal transport map, and proves that introducing stochasticity (stochastic interpolants) breaks this memorization channel and restores generalization.
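In interpolant notation, the deterministic coupling the paper analyzes and the stochastic variant that restores generalization can be written as (a standard formulation from the stochastic-interpolants literature, with \(\gamma(0)=\gamma(1)=0\); not necessarily the paper's exact parameterization):

\[
x_t = (1-t)\,x_0 + t\,x_1
\qquad \text{vs.} \qquad
x_t = (1-t)\,x_0 + t\,x_1 + \gamma(t)\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I).
\]

The added noise term prevents the model from resolving each \(x_t\) back to a unique training pair, which is the memorization channel the gradient-variance analysis identifies.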
- GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
-
This paper proposes GraLoRA, which partitions the LoRA weight update matrix into \(k^2\) independent sub-blocks, each equipped with its own low-rank adapter pair. Without increasing parameter count or computational cost, GraLoRA elevates the effective rank from \(r\) to \(kr\), addressing the performance degradation caused by gradient entanglement in high-rank LoRA. On code generation, Pass@1 improves by up to +8.5%.
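A small numerical illustration of why the block partition raises the effective rank, under equal parameter count (toy dimensions and ranks, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 8
r = 4          # vanilla LoRA rank
k = 2          # GraLoRA grid size
rb = r // k    # per-block rank; k^2 blocks of rank r/k match LoRA's parameter count

# Vanilla LoRA update: a single low-rank pair, rank at most r.
lora = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))

# GraLoRA-style update: a k x k grid of independent low-rank pairs.
blocks = [[rng.normal(size=(m // k, rb)) @ rng.normal(size=(rb, n // k))
           for _ in range(k)] for _ in range(k)]
gralora = np.block(blocks)

# For generic random factors the block update reaches rank k*r, vs. r for LoRA.
print(np.linalg.matrix_rank(lora), np.linalg.matrix_rank(gralora))
```

Both updates use \(r(m+n)\) parameters, but the blockwise one spans a strictly larger subspace because each block-row and block-column contributes independent directions.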
- Graph-based Neural Space Weather Forecasting
-
This paper proposes a graph neural network-based neural emulator for space weather, trained on Vlasiator hybrid-Vlasov simulation data, enabling both deterministic and probabilistic autoregressive forecasting of near-Earth space conditions. The emulator achieves over 100× speedup relative to the original simulator and quantifies forecast uncertainty through latent-variable ensemble generation.
- Graph Diffusion that can Insert and Delete
-
This paper proposes GrIDDD, the first model to extend discrete denoising diffusion probabilistic models (DDPM) to support dynamic insertion and deletion of graph nodes during generation, allowing molecular graph size to adapt throughout the diffusion process. GrIDDD matches or surpasses existing methods on property targeting and molecular optimization tasks.
- Graph Distance as Surprise: Free Energy Minimization in Knowledge Graph Reasoning
-
This paper establishes a formal connection between the Free Energy Principle (FEP) from neuroscience and knowledge graph reasoning. It proposes using shortest-path graph distance as a measure of surprise, generalizing the tree-structured surprise theory of Murphy et al. to arbitrary directed graphs, and provides a principled theoretical framework for entity grounding in KG-based agents.
- GSPN-2: Efficient Parallel Sequence Modeling
-
GSPN-2 achieves up to 40× speedup over GSPN-1's 2D spatial propagation through algorithm-system co-design — specifically, single-kernel fusion, compact channel propagation, and shared memory optimization — while matching Transformer-level accuracy on ImageNet classification and text-to-image generation at significantly lower computational cost.
- Guided Diffusion Sampling on Function Spaces with Applications to PDEs
-
This paper proposes FunDPS (Function-space Diffusion Posterior Sampling), which trains an unconditional diffusion model in function space and performs plug-and-play posterior sampling for PDE inverse problems via gradient guidance at inference time. Theoretically, it extends the Tweedie formula to infinite-dimensional Banach spaces. Empirically, across 5 PDE tasks with only 3% observations, FunDPS achieves 32% higher accuracy on average than DiffusionPDE while using 4× fewer sampling steps.

- GuideFlow3D: Optimization-Guided Rectified Flow For Appearance Transfer
-
This paper proposes GuideFlow3D, a training-free 3D appearance transfer framework that alternately injects differentiable guidance losses (part-aware appearance loss + self-similarity loss) into the sampling process of a pretrained rectified flow model, enabling robust texture and geometric detail transfer between objects with significant geometric discrepancies.
- Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation
-
This paper proposes a definition of hallucination in text-to-image (T2I) models as bias-driven deviation, establishes a taxonomy of three hallucination categories—attribute, relation, and object—and argues that hallucination evaluation serves as an "upper bound" for prompt alignment evaluation, thereby revealing hidden model biases.
- Head Pursuit: Probing Attention Specialization in Multimodal Transformers
-
This paper reinterprets the classical sparse signal recovery algorithm (SOMP) as a multi-sample interpretability tool, revealing fine-grained semantic specialization of attention heads in LLMs and VLMs. By flipping approximately 1% of heads, specific concepts (e.g., country names, toxic content, colors) can be reliably suppressed or amplified during generation.
- Hephaestus: Mixture Generative Modeling with Energy Guidance for Large-scale QoS Degradation
-
This paper proposes Hephaestus, a three-stage generative framework (Forge-Morph-Refine) that combines a Predicted Path Pressurization (PPS) algorithm, an energy-guided mixture CVAE, and latent-space reinforcement learning optimization to address large-scale network QoS degradation problems.
- Hierarchical Koopman Diffusion: Fast Generation with Interpretable Diffusion Trajectory
-
Grounded in Koopman operator theory, this work lifts the nonlinear denoising dynamics of diffusion models into a linear Koopman space, enabling one-step sampling through hierarchical decomposition while preserving the interpretability and controllability of intermediate generation states.
- High-order Equivariant Flow Matching for Density Functional Theory Hamiltonian Prediction
-
This paper proposes QHFlow, the first method to apply conditional flow matching to density functional theory (DFT) Hamiltonian matrix prediction. By designing high-order SE(3)-equivariant vector fields and symmetry-aware prior distributions, QHFlow reduces Hamiltonian prediction error by 73% on MD17 and accelerates DFT computation by 54% when used as an SCF initializer.
- Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval
-
This paper proposes Promptable Embeddings, a method that highlights target visual attributes at retrieval time to improve attribute-focused text-to-image retrieval, and introduces the COCO-Facet benchmark dataset.
- HollowFlow: Efficient Sample Likelihood Evaluation using Hollow Message Passing
-
This paper proposes HollowFlow, a framework that enforces a block-diagonal structure on the Jacobian of the velocity field via Non-Backtracking Graph Neural Networks (NoBGNN) and Hollow Message Passing, reducing the number of backward passes required for likelihood computation in Continuous Normalizing Flows from \(\mathcal{O}(n)\) to a constant \(\mathcal{O}(d)\), achieving up to \(10^2\times\) sampling speedup.
- How to Build a Consistency Model: Learning Flow Maps via Self-Distillation
-
This paper proposes a unified self-distillation framework for directly learning flow maps (the generalized form of consistency models). By exploiting the tangent condition, any distillation scheme is converted into a direct training algorithm that requires no pretrained teacher. Three algorithm families are derived (Eulerian / Lagrangian / Progressive), among which the Lagrangian method avoids spatial gradients and bootstrapping, achieving the most stable training and best performance.
- Image Super-Resolution with Guarantees via Conformalized Generative Models
-
This work applies Conformal Prediction to construct binary "confidence masks" for generative image super-resolution models, reliably identifying trustworthy regions in generated images with rigorous statistical guarantees.
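The core mechanism is a split-conformal quantile over per-pixel nonconformity scores. A generic split-conformal sketch (the paper's score function and mask construction are more refined; `alpha` is the target miscoverage rate):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Finite-sample-valid quantile of calibration nonconformity scores."""
    s = np.sort(cal_scores.ravel())
    n = s.size
    rank = int(np.ceil((n + 1) * (1 - alpha)))  # conformal quantile index (1-based)
    return s[min(rank, n) - 1]

def confidence_mask(predicted_scores, threshold):
    """Binary mask of pixels deemed trustworthy (score below the calibrated cutoff)."""
    return predicted_scores <= threshold
```

Calibrating the threshold on held-out pairs of super-resolved and ground-truth images is what gives the resulting masks their distribution-free coverage guarantee.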
- ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation
-
This paper proposes the ImageSentinel framework, which synthesizes sentinel images that are visually consistent with a private dataset and binds them to randomly generated character retrieval keys, enabling reliable detection of unauthorized use of private datasets by retrieval-augmented image generation (RAIG) systems—achieving near-100% AUC with only 3–10 queries.
- Improved Training Technique for Shortcut Models (iSM)
-
Targeting five key performance bottlenecks of Shortcut Models (compounding guidance, fixed guidance, frequency bias, self-consistency deviation, and curved trajectories), this paper proposes iSM, a unified training framework that incorporates intrinsic guidance, multi-level wavelet loss, scaling optimal transport, and twin EMA strategy, achieving substantial improvements on ImageNet 256×256 with one-step FID 5.27 and four-step FID 2.05.
- Improving Posterior Inference of Galaxy Properties with Image-Based Conditional Flow Matching
-
This paper proposes a Conditional Flow Matching (CFM) framework that jointly models morphological information from galaxy images alongside photometric data, substantially improving posterior inference of physical galaxy properties including stellar mass, star formation rate, metallicity, and dust extinction.
- ICEdit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
-
ICEdit proposes an in-context editing paradigm built upon large-scale Diffusion Transformers (DiT), achieving state-of-the-art editing performance with only 0.1% of the training data through an in-context prompt design, lightweight LoRA-MoE fine-tuning, and VLM-guided early-filter inference-time scaling.
- Increasing the Utility of Synthetic Images through Chamfer Guidance
-
This paper proposes Chamfer Guidance — a training-free inference-time guidance method that uses a small number of real samples as references. By leveraging Chamfer distance, it simultaneously optimizes fidelity and diversity of synthetic images. On ImageNet-1k, using only 32 real images, it achieves 97.5% Precision and 92.7% Coverage, and delivers up to 16% accuracy improvement in downstream classifier training.
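The guidance signal is built from a Chamfer distance between generated-batch features and a handful of real reference features. A minimal sketch of that distance (feature extraction and its use as a differentiable guidance term are omitted):

```python
import numpy as np

def chamfer_distance(gen_feats, ref_feats):
    """Symmetric Chamfer distance between two feature sets (rows = samples)."""
    d2 = ((gen_feats[:, None, :] - ref_feats[None, :, :]) ** 2).sum(axis=-1)
    # nearest-reference term (fidelity) + nearest-generated term (coverage/diversity)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

The two terms pull in opposite directions: the first keeps each generated sample close to some real image, while the second forces the batch to cover every reference, which is how fidelity and diversity are optimized jointly.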
- Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing
-
This paper proposes inference-time scaling methods for flow models: stochasticity is introduced via ODE→SDE conversion to enable particle sampling; the search space is expanded through linear→VP interpolant conversion; and a Rollover Budget Forcing (RBF) strategy is designed to adaptively allocate the computational budget. The approach substantially outperforms all existing methods on compositional text-to-image generation and quantity-aware generation tasks.
- InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation
-
This paper proposes InfinityStar, the first purely discrete autoregressive model capable of generating industrial-grade 720p video. Through spacetime pyramid modeling, it unifies T2I/T2V/I2V/interactive long video generation, achieving a VBench score of 83.74 that surpasses HunyuanVideo, with inference speeds 10–32× faster than diffusion models.
- Information-Theoretic Discrete Diffusion
-
This work generalizes the classical I-MMSE identity from continuous diffusion to the discrete domain, establishing the I-MDSE and I-MDCE relations. It proves that DSE/DCE losses are not merely variational upper bounds but exact decompositions of the log-likelihood, and derives time-free formulas, conditional likelihood estimators, and coupled likelihood-ratio estimators. The proposed methods are validated on large-scale models such as LLaDA, demonstrating low variance and out-of-distribution detection capability.
- Information Theoretic Learning for Diffusion Models with Warm Start
-
This paper proposes a likelihood estimation framework that generalizes the classical KL divergence–Fisher information relationship to arbitrary isotropic noise perturbations, combined with warm-start noise injection and importance sampling to eliminate the train-test gap and achieve tighter likelihood upper bounds, attaining state-of-the-art NLL on ImageNet at multiple resolutions.
- Instance-Level Composed Image Retrieval
-
This paper proposes the instance-level composed image retrieval (i-CIR) benchmark and a training-free method, BASIC, which independently estimates image and text query similarities and fuses them via multiplicative combination, achieving state-of-the-art performance on both i-CIR and existing CIR datasets without any training.
- Is Artificial Intelligence Generated Image Detection a Solved Problem?
-
This paper proposes AIGIBench, a comprehensive benchmark that systematically evaluates 11 state-of-the-art detectors across four tasks: multi-source generalization, multi-degradation robustness, data augmentation sensitivity, and test-time preprocessing impact. The results show that existing AIGI detection methods degrade severely in real-world scenarios, indicating the problem is far from solved.

- ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model
-
This paper proposes ItDPDM (Information-Theoretic Discrete Poisson Diffusion Model), which achieves exact likelihood estimation for non-negative discrete data via a Poisson noise channel and a Poisson Reconstruction Loss (PRL), eliminating ELBO approximation and dequantization. The model outperforms existing discrete diffusion models in likelihood estimation on synthetic data, CIFAR-10, and MIDI music.
- Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning
-
This paper proposes Janus-Pro-R1, which achieves synergistic advancement in visual understanding and generation through a two-stage training pipeline (SFT + RL). The approach enables MLLMs to form genuine Chain-of-Thought reasoning and trigger Aha Moments during text-to-image generation, surpassing GPT-4o on GenEval while extending naturally to image editing tasks.
- KLASS: KL-Guided Fast Inference in Masked Diffusion Models
-
This paper proposes KLASS (KL-Adaptive Stability Sampling), a training-free sampling method that leverages token-level KL divergence and confidence scores to identify stable tokens for parallel decoding, achieving up to 2.78× speedup on masked diffusion models without sacrificing—and in many cases improving—generation quality.
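The stability criterion can be sketched as a per-position KL divergence between predictive distributions at consecutive denoising steps, combined with a confidence gate (toy thresholds; the paper's exact decoding schedule differs):

```python
import numpy as np

def stable_positions(p_prev, p_curr, kl_thresh=1e-3, conf_thresh=0.9):
    """Positions whose token distribution has stabilized and is confident.

    p_prev, p_curr: (L, V) arrays of per-position token distributions from two
    consecutive denoising steps; stable positions can be decoded in parallel.
    """
    eps = 1e-12
    kl = np.sum(p_curr * (np.log(p_curr + eps) - np.log(p_prev + eps)), axis=-1)
    conf = p_curr.max(axis=-1)
    return (kl < kl_thresh) & (conf > conf_thresh)
```

Decoding every stable position in one pass, instead of one token per step, is where the reported speedup comes from; no retraining is involved.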
- Knowledge Distillation Detection for Open-weights Models
-
This paper introduces the task of knowledge distillation detection, proposing a data-free input synthesis and statistical scoring framework to determine whether an open-weights student model has been distilled from a specific teacher model.
- Kuramoto Orientation Diffusion Models
-
This work introduces Kuramoto synchronization dynamics from biological systems into score-based generative models, constructing a forward synchronization / reverse desynchronization diffusion framework over the periodic domain. The proposed approach achieves substantially superior generation quality over standard diffusion models on orientation-dense data such as fingerprints and textures, while remaining competitive on CIFAR-10.
- Large-Scale Training Data Attribution for Music Generative Models via Unlearning
-
This paper applies machine unlearning-based training data attribution (TDA) to a large-scale text-to-music diffusion model (115K tracks), identifying optimal hyperparameter configurations via grid search and comparing against non-counterfactual methods, thereby demonstrating the feasibility of unlearning-based TDA in the music generation domain.
- Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
-
This paper proposes the Latent Zoning Network (LZN)—a framework that unifies generative modeling, representation learning, and classification within a shared Gaussian latent space. Each data type is equipped with an encoder-decoder pair that maps samples to disjoint latent zones. Only two atomic operations—latent computation and latent alignment—are required to support diverse ML tasks. LZN reduces unconditional generation FID on CIFAR-10 from 2.76 to 2.59 and surpasses SimCLR on ImageNet linear classification.
- LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching
-
This paper proposes LeapFactual, a counterfactual explanation algorithm based on Conditional Flow Matching (CFM), which bridges flattened and structured latent spaces via a "Lift-Land" (Leap) mechanism to generate reliable, in-distribution counterfactual samples that remain effective even when the learned decision boundary deviates from the true boundary.
- Learnable Sampler Distillation for Discrete Diffusion Models
-
This paper proposes LSD and LSD+, which distill the intermediate score trajectory knowledge of a high-fidelity teacher sampler into a few-step student sampler via learnable sampling coefficients and non-uniform time scheduling, enabling efficient and high-quality sampling for discrete diffusion models.
- Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
-
This paper proposes a framework for extracting interpretable features from the latent spaces of audio generative models via sparse autoencoders (SAEs). Linear probes are used to map SAE features to human-understandable acoustic concepts (pitch, amplitude, timbre), enabling controllable manipulation and visualization of the audio generation process.
- Learning to Integrate Diffusion ODEs by Averaging the Derivatives
-
This paper proposes the Secant Losses family, which learns to integrate diffusion ODEs via Monte Carlo integration and Picard iteration, progressively extending the tangent of a diffusion model into a secant. The approach achieves an excellent balance between training stability and few-step inference.
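The tangent-to-secant relation can be written as (a standard averaging of the ODE velocity; the notation is illustrative, not the paper's):

\[
v_\theta(x_t, t) \approx \frac{dx_t}{dt}
\quad\longrightarrow\quad
s_\theta(x_t, s, t) \approx \frac{1}{t-s}\int_s^t v(x_\tau, \tau)\,d\tau = \frac{x_t - x_s}{t - s}.
\]

Learning the averaged (secant) slope directly is what allows a single network evaluation to cover a long integration interval, enabling few-step inference.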
- Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
-
This paper proposes Visual-Contrast Attention (VCA), which generates compact positive/negative visual-contrast tokens via spatial pooling and performs differential interaction, reducing self-attention complexity from \(O(N^2C)\) to \(O(NnC)\) (\(n \ll N\)), while achieving consistent improvements on both image classification and generation tasks.
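The complexity reduction comes from attending to a small pooled token set instead of all \(N\) tokens. A generic attend-to-pooled-tokens sketch (the paper's positive/negative contrast-token construction and differential interaction are more involved):

```python
import numpy as np

def pooled_attention(x, n=4):
    """Attend N tokens to n pooled summary tokens: O(N*n*C) instead of O(N^2*C).

    x: (N, C) token matrix with N divisible by n; tokens are average-pooled
    into n groups to form the compact key/value set.
    """
    N, C = x.shape
    pooled = x.reshape(n, N // n, C).mean(axis=1)       # (n, C) summary tokens
    logits = x @ pooled.T / np.sqrt(C)                  # (N, n) attention logits
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return weights @ pooled                             # (N, C) output
```

With \(n \ll N\), the attention matrix shrinks from \(N \times N\) to \(N \times n\), which is the source of the linear-in-\(N\) cost.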
- LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss
-
This paper proposes LinEAS (Linear End-to-end Activation Steering), which jointly optimizes cross-layer affine transformations in an end-to-end manner using a 1D Wasserstein distributional loss for global activation alignment. With only 32 unpaired samples, LinEAS efficiently suppresses toxicity in LLMs and controls concept generation in T2I models.
- LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation
-
This paper proposes CrysLLMGen, a hybrid framework that combines the complementary strengths of LLMs in discrete atom type prediction and diffusion models in continuous coordinate/lattice parameter modeling, achieving high structural validity and compositional validity simultaneously in crystal material generation.
- MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction
-
This paper proposes MGE-LDM, the first model to simultaneously achieve music mixture generation, partial generation (source completion), and text-driven arbitrary source extraction within a unified latent diffusion framework. It jointly models mixture–submixture–source triplets and leverages diffusion inpainting to handle each task.
- Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation
-
This paper proposes a framework for decoupling semantic and visual features from a pretrained diffusion model backbone to enable visual correspondence matching. Building on this, it introduces the Visual Semantic Matching (VSM) metric, which for the first time simultaneously supports quantification and spatial localization of visual inconsistencies in subject-driven image generation.
- Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models
-
This paper proposes Modality-Decoupled Experts (MoDE), which decouples text and image adapters into independent T-MoE and V-Adapter subspaces, combined with knowledge distillation, to simultaneously mitigate intra-modal and inter-modal forgetting in continual instruction tuning of unified multimodal generation models.
- Mitigating Sexual Content Generation via Embedding Distortion in Text-conditioned Diffusion Models
-
This paper proposes Distorting Embedding Space (DES), a text-encoder-based defense framework that achieves state-of-the-art sexual content mitigation on FLUX.1 and SD v1.5 (reducing ASR to 9.47% and 0.52%, respectively) by transforming unsafe embeddings into safe regions, preserving safe embeddings, and neutralizing "nudity" semantics, while maintaining high-quality benign image generation.
- MMaDA: Multimodal Large Diffusion Language Models
-
This paper presents MMaDA, the first multimodal foundation model that simultaneously achieves text reasoning, multimodal understanding, and text-to-image generation within a unified discrete diffusion architecture. MMaDA bridges the gap between diffusion model pre-training and post-training through mixed long chain-of-thought (CoT) fine-tuning and the UniGRPO reinforcement learning algorithm.
- MMG: Mutual Information Estimation via the MMSE Gap in Diffusion
-
Leveraging the information-theoretic formulation of diffusion models, this paper proves that mutual information equals one-half of the integral over all signal-to-noise ratios of the gap between conditional and unconditional denoising MMSE. The proposed MMG estimator, combined with adaptive importance sampling and the orthogonality principle, significantly improves estimation accuracy and stability.
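The stated identity mirrors the classical I-MMSE relation; in notation assumed here, with \(\operatorname{mmse}(X\mid\gamma)\) the minimum mean-squared error of denoising \(X\) from a Gaussian observation at signal-to-noise ratio \(\gamma\):

\[
I(X;Y) \;=\; \frac{1}{2}\int_0^{\infty}\Big(\operatorname{mmse}(X\mid\gamma)\;-\;\operatorname{mmse}(X\mid Y,\gamma)\Big)\,d\gamma,
\]

so MI estimation reduces to integrating the gap between an unconditional and a conditional denoiser's errors across noise levels, which is exactly what diffusion training already exposes.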
- MGAudio: Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation
-
This paper proposes MGAudio, the first video-to-audio generation framework that replaces classifier-free guidance (CFG) with model-guided (MG) training, combined with a dual-role audio-video encoder (DRAVE) for simultaneous condition injection and feature alignment. With only 131M parameters, MGAudio achieves state-of-the-art performance on VGGSound (FAD=0.40) and surpasses most competing methods using only 10% of the training data.
- Modeling Microenvironment Trajectories on Spatial Transcriptomics with NicheFlow
-
NicheFlow is a Flow Matching-based generative model that represents cellular microenvironments as point clouds and jointly models the temporal evolution of cell states and spatial coordinates via Variational Flow Matching and optimal transport, substantially outperforming single-cell-level trajectory inference methods on embryonic development, brain development, and aging datasets.
- Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models
-
This paper proposes a unified Gaussianity regularization framework that combines moment matching in the spatial domain with power spectrum matching in the frequency domain. It subsumes existing regularizers (KL divergence, kurtosis, norm) as special cases, and achieves the equivalent effect of PRNO's \(\mathcal{O}(D^2)\) approach at \(\mathcal{O}(D\log D)\) complexity, significantly outperforming all baselines on reward alignment tasks for text-to-image models.
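A sketch of the \(\mathcal{O}(D\log D)\) frequency-domain ingredient, with an assumed squared-error loss form: the power spectrum of a latent is compared via FFT against the flat spectrum of i.i.d. unit Gaussian noise, avoiding \(\mathcal{O}(D^2)\) covariance-style statistics.

```python
import numpy as np

# Illustrative loss form (assumed): white N(0,1) noise has a flat expected
# power spectrum, so deviations from flatness penalize non-Gaussian structure.
rng = np.random.default_rng(0)
z = rng.normal(size=4096)                       # latent to regularize

def power_spectrum_gap(z):
    ps = np.abs(np.fft.rfft(z)) ** 2 / z.size   # empirical power spectrum, O(D log D)
    target = np.ones_like(ps)                   # flat spectrum of white noise
    return np.mean((ps - target) ** 2)

gap = power_spectrum_gap(z)
```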
- Multimodal Generative Flows for LHC Jets
-
This paper proposes a Transformer-based multimodal flow matching framework (MMF) that jointly models continuous flow matching and continuous-time Markov jump bridges, enabling unified generation of particle kinematics (continuous) and flavor quantum numbers (discrete) in LHC jets.
- Neural Entropy
-
This paper explores the connection between deep learning and information theory through the lens of diffusion models, introducing a "neural entropy" measure to quantify the amount of information stored in neural networks during the diffusion process, revealing that image diffusion models achieve remarkably high compression efficiency on structured data.
- Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
-
This paper proposes HDLM (Hierarchical Diffusion Language Model), which introduces cluster tokens with coarse-grained semantics as an intermediate hierarchy between clean tokens and mask tokens, enabling "next semantic scale prediction" in discrete diffusion language modeling. The method derives a closed-form ELBO, achieves consistently lower perplexity than MDLM/GIDD on OpenWebText, and reduces generation perplexity by 62% after stochastic perturbation.
- Non-Asymptotic Analysis of Data Augmentation for Precision Matrix Estimation
-
This paper provides a non-asymptotic analysis of data augmentation (DA) for high-dimensional precision matrix (inverse covariance matrix) estimation. It establishes quadratic error concentration bounds for both linear shrinkage estimators and DA estimators, and introduces a novel deterministic equivalent framework for generalized resolvent matrices with dependent structure.
- Non-Markovian Discrete Diffusion with Causal Language Models
-
This paper proposes CaDDi, a framework that enables each denoising step to access the full generation trajectory via a non-Markovian discrete diffusion process, and unifies this process within a causal language model architecture, allowing pretrained LLMs to be directly reused as discrete diffusion models.
- NPN: Non-Linear Projections of the Null-Space for Imaging Inverse Problems
-
This paper proposes Non-linear Projections of the Null-space (NPN)—a novel regularization strategy that trains a neural network to predict, directly from measurements, the projection coefficients of the ground-truth signal onto a low-dimensional subspace of the sensing matrix's null space. These coefficients serve as prior constraints on "invisible features" and can be flexibly integrated into diverse reconstruction frameworks including PnP, unrolled networks, DIP, and diffusion models. Convergence acceleration within the PnP framework is established theoretically.
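The quantity NPN's network is trained to predict can be illustrated with a toy sensing matrix (basis construction via SVD is a standard choice assumed here): the null-space component of the signal is invisible to the measurements, and its coefficients in a null-space basis are exactly the "prior" the method supplies.

```python
import numpy as np

# Toy underdetermined system: A has a 5-dimensional null space.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))          # sensing matrix
x = rng.normal(size=8)               # ground-truth signal

_, _, Vt = np.linalg.svd(A)
N = Vt[3:].T                         # (8, 5): orthonormal basis of null(A)

coeffs = N.T @ x                     # coefficients NPN's network would predict
x_null = N @ coeffs                  # null-space ("invisible") component of x
```

Since \(A x_{\text{null}} = 0\), no reconstruction framework can recover these coefficients from measurements alone, which is why a learned predictor of them acts as a genuine prior.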
- ObCLIP: Oblivious Cloud-Device Hybrid Image Generation with Privacy Preservation
-
ObCLIP is proposed as an oblivious cloud-device hybrid image generation scheme. It expands a user prompt into a set of candidate prompts that differ only in sensitive attributes (e.g., gender, race), performs early denoising steps on all candidates in the cloud without revealing the true prompt, and allows the client to select the correct intermediate latent and complete the remaining denoising locally. Temporal and batch redundancy acceleration techniques keep the additional overhead within 4.4–7.6×.
- OmniCast: A Masked Latent Diffusion Model for Weather Forecasting Across Time Scales
-
OmniCast is proposed as a weather forecasting method that combines a masked generative framework with a latent diffusion model. By jointly generating future weather sequences rather than iterating autoregressively, it mitigates error accumulation, achieves state-of-the-art performance at the subseasonal-to-seasonal (S2S) scale, remains competitive for medium-range forecasting, and runs 10–20× faster at inference.
- OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers
-
OmniSync proposes a universal lip synchronization framework based on Diffusion Transformers, introducing three key innovations—a mask-free training paradigm, Flow Matching-based progressive noise initialization, and dynamic spatiotemporal CFG—to substantially outperform prior methods on both real and AI-generated videos, achieving an 87.78% success rate on stylized character lip sync (vs. 67.78% for the previous best).
- OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
-
OmniVCus proposes a feedforward DiT framework that achieves multi-subject, multimodal-controlled video customization through a data construction pipeline called VideoCus-Factory and two novel embedding mechanisms (Lottery Embedding and Temporally Aligned Embedding), significantly surpassing prior SOTA in identity preservation and controllability.
- On Optimal Steering to Achieve Exact Fairness
-
This paper defines the concept of an ideal distribution—a data distribution under which the Bayes-optimal classifier for any cost-sensitive risk satisfies exact fairness—and proposes an optimization framework that identifies the nearest ideal distribution via KL divergence minimization, providing provable fairness guarantees for both fair preprocessing and LLM representation steering.
- On the Emergence of Linear Analogies in Word Embeddings
-
A generative model based on binary semantic attributes is proposed to analytically prove the emergence mechanism of linear analogy structures in word embeddings (e.g., \(W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}\)), providing a unified explanation for four key empirical observations.
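The mechanism can be made concrete with a toy version of the paper's premise (the attribute set and dimensions below are illustrative): if each embedding is a fixed linear map of binary semantic attributes, analogies close exactly because attribute differences cancel.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(16, 3))                 # linear map: attributes -> embedding

# binary attributes: [royal, male, female]
attrs = {
    "king":  np.array([1, 1, 0]),
    "man":   np.array([0, 1, 0]),
    "woman": np.array([0, 0, 1]),
    "queen": np.array([1, 0, 1]),
}
W = {word: M @ a for word, a in attrs.items()}
analogy = W["king"] - W["man"] + W["woman"]  # attribute arithmetic: [1,1,0]-[0,1,0]+[0,0,1]=[1,0,1]
```

Here the analogy vector equals \(W_{\text{queen}}\) exactly; the paper's analysis explains why approximate versions of this structure emerge in embeddings trained on real corpora.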
- On the Relation between Rectified Flows and Optimal Transport
-
This paper presents a rigorous theoretical investigation of the relationship between rectified flows (flow matching) and optimal transport (OT). Through the construction of multiple counterexamples, it demonstrates that previously published claims asserting the asymptotic equivalence between gradient-constrained rectified flows and OT do not hold in general, and that stronger assumptions than those previously identified are required to guarantee such equivalence.
- One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting
-
NTN-Diff is a frequency-aware diffusion model that decomposes the global semantic consistency problem into per-band consistency tasks over mid-frequency and low-frequency components. By adopting a "null-text–text–null-text" three-stage denoising strategy, the method simultaneously addresses two longstanding challenges in text-guided image inpainting: preserving unmasked regions and maintaining semantic consistency between masked and unmasked areas.
- Orient Anything V2: Unifying Orientation and Rotation Understanding
-
Orient Anything V2 unifies 3D object orientation and rotation understanding via a scalable synthetic data engine, a symmetry-aware periodic distribution objective, and a multi-frame architecture, achieving zero-shot state-of-the-art performance across three tasks: orientation estimation, 6DoF pose estimation, and symmetry recognition.
- OSMGen: Highly Controllable Satellite Image Synthesis using OpenStreetMap Data
-
OSMGen synthesizes high-fidelity satellite images directly from OSM JSON data (vector geometry, semantic tags, location, and temporal information), and generates temporally consistent before-after image pairs via DDIM inversion, enabling urban change simulation and data augmentation.
- OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models
-
This paper presents OVERT, the first large-scale benchmark for evaluating over-refusal in text-to-image (T2I) models, comprising 4,600 benign prompts and 1,785 harmful prompts across 9 safety categories. It systematically evaluates over-refusal behavior in 5 mainstream T2I models, revealing a pronounced trade-off between safety and utility.
- Pairwise Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model
-
This paper proposes A2A-FM, a method that simultaneously learns optimal transport mappings across all pairs of conditional distributions within the Flow Matching framework via a novel cost function. It is theoretically shown to converge to pairwise optimal transport in the infinite-sample limit, and is particularly suited for non-grouped data with continuous conditional variables.
- Perturb a Model, Not an Image: Towards Robust Privacy Protection via Anti-Personalized Diffusion Models
-
This paper proposes the Anti-Personalized Diffusion Model (APDM), which for the first time shifts privacy protection from the data level (image perturbation) to the model level (parameter update). Through a Direct Protective Optimization (DPO) loss and a Learning to Protect (L2P) dual-path optimization strategy, APDM robustly prevents diffusion models from personalizing to specific subjects while preserving the model's generation and personalization capabilities for other subjects.
- Physics-Constrained Flow Matching: Sampling Generative Models with Hard Constraints
-
This paper proposes Physics-Constrained Flow Matching (PCFM), a zero-shot inference framework that enforces arbitrary nonlinear equality constraints to machine precision during sampling from pretrained flow matching models. The framework alternates among forward shooting with projection, OT-interpolation backward updates, and relaxed penalty correction at each sub-step, achieving up to 99.5% improvement over baselines on PDE problems involving shocks and discontinuities.
- Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection
-
This paper proposes a paradigm for AI-generated video detection based on physical conservation laws. A normalized spatiotemporal gradient (NSG) statistic is defined as the ratio of spatial probability gradients to temporal density changes; pre-trained diffusion models estimate the NSG, and detection is performed via maximum mean discrepancy (MMD). The method surpasses the state of the art by 16% in Recall and 10.75% in F1.
- PID-controlled Langevin Dynamics for Faster Sampling of Generative Models
-
This work introduces PID control theory into Langevin dynamics sampling, leveraging gradient history (integral term) to build momentum for traversing energy barriers and gradient trends (derivative term) to suppress oscillations, achieving fast and stable convergence. The approach requires no additional training and delivers over 10× sampling acceleration on both SGMs and EBMs.
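A minimal sketch of a PID-augmented Langevin step on a standard normal target (gains, leak factor, and step size are illustrative, not the paper's tuned controller): the integral term accumulates gradient history to build momentum, while the derivative term uses the gradient trend to damp oscillations.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_logp(x):
    return -x                               # score of a standard normal target

x, step = 5.0, 0.1
kp, ki, kd = 1.0, 0.2, 0.1                  # illustrative PID gains
integral, prev_g = 0.0, grad_logp(5.0)
traj = []
for _ in range(200):
    g = grad_logp(x)
    integral = 0.9 * integral + g           # leaky gradient history (I term)
    deriv = g - prev_g                      # gradient trend (D term)
    prev_g = g
    drift = kp * g + ki * integral + kd * deriv
    x += step * drift + np.sqrt(2 * step) * rng.normal()
    traj.append(x)
```

Setting `ki = kd = 0` recovers plain unadjusted Langevin dynamics, which makes the controller's effect easy to ablate.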
- PixPerfect: Seamless Latent Diffusion Local Editing with Discriminative Pixel-Space Refinement
-
This paper proposes PixPerfect, a general-purpose pixel-level refinement framework that eliminates color discrepancies, texture mismatches, and visible seams in local editing with latent diffusion models (LDMs) through a discriminative pixel-space loss and a comprehensive artifact simulation pipeline, achieving substantial improvements in visual fidelity across inpainting, object removal, and insertion tasks.
- Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism
-
This paper proposes GenComm — a generative communication mechanism for heterogeneous multi-agent collaborative perception. By extracting spatial messages and employing a conditional diffusion model, the ego agent locally generates aligned collaborator features without modifying any existing network, enabling new heterogeneous agents to be onboarded at minimal cost.
- Preconditioned Langevin Dynamics with Score-Based Generative Models for Infinite-Dimensional Linear Bayesian Inverse Problems
-
This paper rigorously analyzes score-based generative model (SGM)-driven Langevin posterior samplers in infinite-dimensional Hilbert spaces, derives for the first time convergence bounds that explicitly depend on score approximation errors, and identifies an optimal preconditioner that jointly depends on the forward operator and score errors, guaranteeing uniform convergence rates across all posterior modes.
- Predictive Feature Caching for Training-free Acceleration of Molecular Geometry Generation
-
This paper transfers predictive feature caching strategies from the image generation domain to molecular geometry generation, exploiting the temporal smoothness of hidden states along sampling trajectories to achieve training-free 2–3× inference acceleration, with up to 7× speedup when combined with other optimization techniques.
- Preventing Shortcuts in Adapter Training via Providing the Shortcuts
-
This paper proposes Shortcut-Rerouted Adapter Training, which actively provides dedicated pathways for confounding factors during adapter training (e.g., a LoRA absorbing distribution shifts, a ControlNet absorbing pose/expression), thereby constraining the adapter to learn only the target attribute (e.g., identity). The auxiliary modules are discarded at inference time, yielding a disentangled adapter.
- Progressive Inference-Time Annealing of Diffusion Models for Sampling from Boltzmann Densities
-
This paper proposes PITA (Progressive Inference-Time Annealing), a framework that combines temperature annealing and diffusion smoothing as two complementary interpolation strategies. PITA trains an initial diffusion model at high temperature, then applies a novel Feynman-Kac PDE with SMC resampling to progressively anneal toward lower temperatures at inference time, training a sequence of diffusion models up to the target temperature. This approach achieves equilibrium sampling of alanine dipeptide and tripeptide in Cartesian coordinates for the first time.
- Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models
-
This paper identifies that combining training-based concept unlearning with training-free safety guidance (negative prompt guidance) yields degraded performance, and proposes replacing explicit negative prompts with implicit concept embeddings obtained via Concept Inversion, effectively restoring the defensive capability of training-free methods on unlearned models.
- Ψ-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models
-
This paper proposes the Ψ-Sampler framework, which introduces initial particle sampling based on the preconditioned Crank-Nicolson Langevin (pCNL) algorithm into SMC-based inference-time reward alignment. By initializing particles from a reward-aware posterior distribution, the framework substantially improves alignment performance on layout-guided generation, quantity-aware generation, and aesthetic preference generation.
- Rare Text Semantics Were Always There in Your Diffusion Transformer
-
This paper discovers that scaling up the variance of text token embeddings before the joint attention blocks in MM-DiT enables diffusion models to render rare text semantics, without any additional training or external modules.
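The intervention as described reduces to a one-line transformation (the scale factor and tensor shapes below are illustrative): enlarge the variance of the text token embeddings around their mean before they enter the joint attention blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(77, 64))            # (tokens, channels), shapes assumed

def scale_variance(emb, s):
    mean = emb.mean(axis=0, keepdims=True)
    return mean + s * (emb - mean)              # per-channel variance scales by s**2

boosted = scale_variance(text_emb, s=1.5)       # s > 1 amplifies token-wise spread
```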
- Real-Time Execution of Action Chunking Flow Policies
-
This paper proposes Real-Time Chunking (RTC), which frames asynchronous action chunk execution as an inpainting problem. By freezing already-executed actions and "inpainting" the remainder to be consistent with the prefix, RTC enables smooth real-time execution of diffusion/flow policies without any retraining.
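A toy sketch of the constraint RTC enforces (the actual method applies it inside the flow policy's denoising; the chunk sizes and freezing step here are only illustrative): while a new chunk is being computed, the first k actions of the old chunk are already executing, so they are frozen and only the remainder is regenerated consistently with that prefix.

```python
import numpy as np

rng = np.random.default_rng(0)
old_chunk = rng.normal(size=(8, 2))      # 8 actions, 2 DoF (illustrative)
k = 3                                    # actions executed during inference delay

def inpaint_chunk(old, new_draft, k):
    out = new_draft.copy()
    out[:k] = old[:k]                    # freeze the already-executed prefix
    return out

new_chunk = inpaint_chunk(old_chunk, rng.normal(size=(8, 2)), k)
```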
- Rectified-CFG++ for Flow Based Models
-
To address the off-manifold drift caused by standard CFG in Rectified Flow models, this paper proposes Rectified-CFG++—an adaptive predictor-corrector guidance strategy that replaces extrapolative guidance with conditional flow prediction combined with time-scheduled interpolative correction. The method comprehensively outperforms standard CFG on large-scale models including Flux, SD3, SD3.5, and Lumina.
- Recurrent Memory for Online Interdomain Gaussian Processes
-
This paper proposes OHSVGP (Online HiPPO Sparse Variational Gaussian Process), which introduces the HiPPO (High-order Polynomial Projection Operator) framework from deep learning into sparse variational Gaussian processes as interdomain inducing variables. By leveraging time-varying orthogonal polynomial basis functions, the method achieves long-term memory retention in online learning, with kernel matrices updated efficiently via ODE recursion.
- Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
-
This paper proposes the Diffusion Chain of Lateral Thought (DCoLT), which treats each intermediate step in the reverse process of a diffusion language model as a latent "thinking" action and optimizes the entire reasoning trajectory via outcome-based reinforcement learning. DCoLT achieves state-of-the-art performance on mathematics and code generation benchmarks with both SEDD and LLaDA diffusion language models.
- Remasking Discrete Diffusion Models with Inference-Time Scaling
-
This paper proposes the ReMDM sampler, which enables iterative error correction in discrete mask diffusion models by allowing already-decoded tokens to be remasked during generation. This mechanism supports inference-time compute scaling and yields substantial quality improvements on text, image, and molecular design tasks.
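The remasking idea can be illustrated in a few lines (the confidence scores and threshold below are illustrative, not the sampler's actual schedule): a masked-diffusion sampler normally only unmasks tokens, whereas ReMDM also lets low-confidence decoded tokens return to the mask state so later steps can revise them.

```python
import numpy as np

MASK = -1
tokens = np.array([5, 2, 9, MASK, 7])            # partially decoded sequence
conf = np.array([0.9, 0.3, 0.8, 0.0, 0.95])      # per-token confidence (assumed)

def remask(tokens, conf, tau):
    out = tokens.copy()
    out[(tokens != MASK) & (conf < tau)] = MASK  # send uncertain tokens back
    return out

remasked = remask(tokens, conf, tau=0.5)         # the low-confidence token is remasked
```

Raising the remasking budget trades extra sampling steps for quality, which is the lever behind the reported inference-time scaling.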
- RepLDM: Reprogramming Pretrained Latent Diffusion Models for High-Quality, High-Efficiency, High-Resolution Image Generation
-
This paper proposes RepLDM, a reprogramming framework that enables pretrained latent diffusion models to generate high-quality, high-resolution images without retraining, via two stages: an attention guidance stage and a progressive upsampling stage, while substantially improving efficiency.
- RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I Generation
-
This paper proposes RespoDiff, a framework that introduces two learnable transformation modules at the bottleneck layer of a diffusion model UNet — a Responsibility Alignment Module (RAM) and a Semantic Alignment Module (SAM) — trained via score matching objectives to achieve fair and safe text-to-image generation while preserving image quality and semantic fidelity.
- Riemannian Consistency Model
-
This work is the first to extend Consistency Models (CM) to Riemannian manifolds. By leveraging exponential map parameterization and covariant derivatives, it derives both discrete- and continuous-time RCM objectives, enabling high-quality few-step generation on non-Euclidean geometries such as spheres, flat tori, and SO(3).
- Riemannian Flow Matching for Brain Connectivity Matrices via Pullback Geometry
-
This paper proposes DiffeoCFM, which leverages pullback metrics induced by global diffeomorphisms to equivalently reformulate conditional flow matching on Riemannian manifolds as standard CFM in Euclidean space. The method enables efficient generation of brain connectivity matrices (SPD/correlation) while strictly preserving manifold constraints, achieving state-of-the-art performance on 3 fMRI and 2 EEG datasets.
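For SPD matrices, a canonical diffeomorphism of the kind the method relies on is the matrix logarithm (the choice is illustrative here): it maps SPD matrices to the flat space of symmetric matrices, where standard CFM applies, and the matrix exponential maps samples back so positive-definiteness holds by construction.

```python
import numpy as np

def sym_logm(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T          # V diag(log w) V^T

def sym_expm(X):
    """Matrix exponential of a symmetric matrix."""
    w, V = np.linalg.eigh(X)
    return (V * np.exp(w)) @ V.T

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
spd = A @ A.T + 4 * np.eye(4)             # a strictly positive-definite matrix

roundtrip = sym_expm(sym_logm(spd))       # back on the SPD manifold, unchanged
```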
- RLVR-World: Training World Models with Reinforcement Learning
-
This paper proposes the RLVR-World framework, extending the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to world model training. By directly optimizing target metrics (e.g., prediction accuracy, perceptual quality) as verifiable rewards, the framework achieves significant improvements on both language and video world models.
- RLZero: Direct Policy Inference from Language Without In-Domain Supervision
-
This paper proposes RLZero, a framework that converts natural language instructions into behavioral policies in target environments via an "Imagine → Project → Imitate" pipeline. A video generation model is used to "imagine" observation sequences from language; these are then projected into the target domain; finally, an unsupervised pretrained RL agent imitates the projected sequences via a closed-form solution — all without any in-domain supervision or annotated trajectories.
- Robustness in Both Domains: CLIP Needs a Robust Text Encoder
-
This paper proposes LEAF (Levenshtein Efficient Adversarial Finetuning), the first adversarial fine-tuning method targeting the CLIP text encoder. LEAF substantially improves robustness under character-level text perturbations across zero-shot classification, text-image retrieval, and image generation, while preserving performance in the image domain.
- Safe and Stable Control via Lyapunov-Guided Diffusion Models
-
This paper proposes S²Diff, a model-based diffusion planning framework that leverages Control Lyapunov Barrier Functions (CLBF) to guide diffusion sampling for generating trajectory-level control policies. Without requiring control-affine assumptions or quadratic programming, S²Diff simultaneously guarantees safety and stability on a variety of nonlinear dynamical systems, achieving an average safety rate of 98.75%.
- SAO-Instruct: Free-form Audio Editing using Natural Language Instructions
-
This paper proposes SAO-Instruct, the first audio editing model supporting fully free-form natural language instructions. Training data consisting of editing triplets is constructed via three pipelines — Prompt-to-Prompt, DDPM inversion, and manual editing — and Stable Audio Open is fine-tuned to achieve context-preserving, targeted audio modification.
- Scalable, Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching
-
This paper proposes TCCM (Time-Conditioned Contraction Matching), a flow matching-inspired semi-supervised anomaly detection method for tabular data. By learning a time-conditioned velocity field that contracts normal data toward the origin, TCCM computes anomaly scores in a single forward pass, achieving top AUROC and AUPRC rankings across 47 ADBench datasets while running 1573× faster than DTE.
- ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion
-
ScaleDiff is a framework that eliminates redundant overlap computation in conventional patch-based methods via Neighborhood Patch Attention (NPA). Combined with Latent Frequency Mixing (LFM) and Structure Guidance (SG), it extends pretrained diffusion models to high resolutions (e.g., 4096²) without any additional training, achieving state-of-the-art quality among training-free methods and significant inference acceleration (8.9× faster than DemoFusion) on both U-Net and DiT architectures.
- Scaling Can Lead to Compositional Generalization
-
Through theoretical proofs and large-scale experiments, this paper demonstrates that standard MLPs can achieve compositional generalization solely by scaling data and model size, without explicit modular architectural design. Moreover, when compositional generalization succeeds, task components can be linearly decoded from hidden-layer activations — a metric that correlates positively with compositional success rates in diffusion-based image generation.
- Scaling Diffusion Transformers Efficiently via μP
-
This paper extends Maximal Update Parametrization (μP) from standard Transformers to diffusion Transformers (DiT, PixArt-α, MMDiT, etc.), demonstrating that optimal hyperparameters found on small proxy models transfer stably to large models, significantly reducing the hyperparameter tuning cost for large-scale diffusion models.
- Scaling Offline RL via Efficient and Expressive Shortcut Models
-
This paper proposes SORL, which leverages the self-consistency property of shortcut models to enable efficient single-stage training with variable inference steps for policy optimization in offline RL, while supporting both sequential and parallel test-time scaling.
- SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency
-
SceneDecorator presents a training-free framework that, for the first time, systematically addresses scene planning and scene consistency in story generation via VLM-guided global-to-local scene planning and a long-term scene-sharing attention mechanism, achieving significant improvements over existing methods on scene alignment and consistency metrics.
- SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation
-
SceneDesigner introduces a CNOCS map representation combined with a two-stage reinforcement learning training strategy, achieving for the first time precise 9D pose control (position, size, and orientation) over multiple objects, significantly outperforming existing methods in both controllability and generation quality.
- Schrödinger Bridge Matching for Tree-Structured Costs and Entropic Wasserstein Barycentres
-
This paper extends the Iterative Markovian Fitting (IMF) procedure to the tree-structured Schrödinger Bridge problem, proposing the TreeDSBM algorithm. For Wasserstein barycentre computation, it elegantly merges IMF iterations with fixed-point iterations, requiring only inexpensive bridge-matching steps for efficient solution.
- Score-informed Neural Operator for Enhancing Ordering-based Causal Discovery
-
This paper proposes SciNO (Score-informed Neural Operator), a probabilistic generative model designed in a smooth function space that stably approximates the log-density Hessian diagonal to improve ordering-based causal discovery, achieving a 42.7% reduction in order divergence on synthetic graphs and 31.5% on real-world data.
- Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models
-
This paper proposes Semantic Surgery, a training-free zero-shot inference-time concept erasure framework that calibrates text embeddings via vector subtraction prior to the diffusion process, incorporates Co-Occurrence Encoding for multi-concept erasure, and employs a visual feedback loop to address latent concept persistence (LCP). The method comprehensively outperforms state-of-the-art approaches across object, NSFW, style, and celebrity erasure tasks.
- Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models
-
This paper proposes Shallow Diffuse, a method that exploits the local linearity and low-rank Jacobian of the posterior mean predictor (PMP) in diffusion models to embed watermarks at intermediate diffusion timesteps. This design decouples the watermark from the generation process, achieving, for the first time, both high fidelity and high robustness simultaneously under both server-side and user-side deployment scenarios.
- Shortcutting Pre-trained Flow Matching Diffusion Models is Almost Free Lunch
-
This paper proposes SCFM (ShortCutting Flow Matching), a highly efficient post-training distillation method that compresses pre-trained flow matching models (e.g., the 12B-parameter Flux) into 3-step samplers via velocity field self-distillation, requiring less than one A100-day and neither step-size embeddings nor adversarial distillation.
- Show-o2: Improved Native Unified Multimodal Models
-
This paper presents Show-o2, a natively unified multimodal model built upon autoregressive modeling and Flow Matching. By constructing unified visual representations in a 3D causal VAE space via dual-path spatial(-temporal) fusion, Show-o2 supports multimodal understanding and generation across text, images, and video, with a two-stage training strategy that effectively preserves language knowledge.
- SparseDiT: Token Sparsification for Efficient Diffusion Transformer
-
This paper proposes SparseDiT, which achieves 55% FLOPs reduction and 175% inference throughput improvement on DiT-XL 512×512 with only 0.09 FID degradation. The method employs a three-stage spatial architecture (bottom Poolingformer + middle Sparse-Dense Token Module + top full-density processing) combined with a dynamic pruning-rate schedule along the temporal dimension, and successfully extends to video generation and text-to-image generation tasks.
- Split Gibbs Discrete Diffusion Posterior Sampling
-
This paper proposes SGDD (Split Gibbs Discrete Diffusion), a plug-and-play posterior sampling algorithm for discrete diffusion models based on the split Gibbs sampling principle. By introducing auxiliary variables and a Hamming-distance-based regularization potential, SGDD decomposes posterior sampling into alternating likelihood and prior sampling steps, achieving substantial improvements over baselines on DNA sequence design, discrete image inverse problems, and music infilling tasks.
- SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing
-
SplitFlow decomposes a target prompt into multiple sub-prompts, computes an independent editing flow for each, and combines them into a unified editing trajectory via latent trajectory projection and adaptive velocity field aggregation. This resolves gradient entanglement and achieves higher fidelity and editability in text-guided image editing without requiring inversion.
- StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models
-
StableGuard embeds global binary watermarks into the LDM generation pipeline (via MPW-VAE) and leverages changes in watermark perturbation patterns for tamper localization (via MoE-GFN), achieving the first end-to-end unified framework for copyright protection and tamper detection.
- State-Covering Trajectory Stitching for Diffusion Planners
-
This paper proposes SCoTS (State-Covering Trajectory Stitching), a reward-free trajectory augmentation framework that iteratively stitches short trajectory segments in a temporal distance-preserving latent space to systematically expand state-space coverage, significantly improving the generalization of diffusion planners on long-horizon and out-of-distribution tasks.
- StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold
-
This paper proposes StelLA, which decomposes the LoRA adaptation matrix into a three-factor form \(USV^\top\) and constrains \(U\) and \(V\) to the Stiefel manifold for Riemannian optimization, enabling explicit subspace learning during training. StelLA consistently outperforms existing LoRA variants across multiple downstream tasks.
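The Stiefel-manifold constraint can be sketched as Riemannian gradient descent with a QR retraction (a minimal illustration; the paper's metric, retraction, and optimizer details may differ):

```python
import numpy as np

def qr_retract(U):
    """Retract a matrix onto the Stiefel manifold (orthonormal
    columns) via QR decomposition, fixing column signs for uniqueness."""
    Q, R = np.linalg.qr(U)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
U = qr_retract(rng.normal(size=(16, 4)))   # a point on the Stiefel manifold
G = rng.normal(size=(16, 4))               # Euclidean gradient (toy)

# Project G onto the tangent space at U, step, then retract.
sym = (U.T @ G + G.T @ U) / 2
riem_grad = G - U @ sym
U_next = qr_retract(U - 0.1 * riem_grad)

print(np.allclose(U_next.T @ U_next, np.eye(4), atol=1e-8))  # -> True
```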
- System-Embedded Diffusion Bridge Models
-
This paper proposes System-embedded Diffusion Bridge Models (SDB), which directly embed a known linear measurement system into the coefficients of a matrix-valued SDE, enabling decoupled control over denoising in the range space and information synthesis in the null space. SDB achieves consistent performance improvements across multiple inverse problems and demonstrates strong robustness to system mismatch.
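The range/null-space split that SDB controls separately can be written with the Moore-Penrose pseudoinverse; a toy numpy check (the measurement matrix here is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))     # known linear measurement system (toy)
x = rng.normal(size=8)

# Orthogonal decomposition: the range-space part is pinned by the
# measurement y = A x; the null-space part carries the information
# the system never observes and must be synthesized.
P_range = np.linalg.pinv(A) @ A
x_range = P_range @ x
x_null = x - x_range

print(np.allclose(A @ x_null, 0, atol=1e-10))  # -> True
print(np.allclose(x_range + x_null, x))        # -> True
```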
- T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models
-
This paper proposes T2SMark, a two-stage watermarking scheme for diffusion models based on Tail-Truncated Sampling (TTS). By embedding watermark bits in the tail regions of the Gaussian distribution while sampling randomly from the central region, T2SMark is the first method to achieve an optimal balance between watermark robustness and generation diversity.
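The tail-truncated sampling idea can be sketched in a few lines: each watermark bit selects one Gaussian tail for the initial noise, and the (approximately preserved) sign recovers it (the tail mass and the sign-based detection rule are illustrative assumptions, not the paper's settings):

```python
import numpy as np
from statistics import NormalDist

def embed_bits(bits, p_tail=0.2, rng=None):
    """Sample initial noise so each bit lands in one Gaussian tail:
    bit 0 -> left tail, bit 1 -> right tail. p_tail is a made-up
    tail mass chosen for illustration."""
    rng = rng or np.random.default_rng()
    nd = NormalDist()
    u = rng.uniform(size=len(bits))
    # Map uniforms into the chosen tail's quantile range, then invert the CDF.
    q = np.where(np.asarray(bits) == 1, 1 - p_tail + u * p_tail, u * p_tail)
    return np.array([nd.inv_cdf(qi) for qi in q])

def extract_bits(z):
    """Recover bits from the sign of the recovered noise."""
    return (z > 0).astype(int)

bits = [1, 0, 1, 1, 0]
z = embed_bits(bits, rng=np.random.default_rng(0))
print(list(extract_bits(z)))   # -> [1, 0, 1, 1, 0]
```

Sampling randomly *within* each tail (rather than at a fixed quantile) is what preserves generation diversity in this sketch.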
- Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security
-
This paper demonstrates that text-to-image (T2I) models leave identifiable "signatures" in their generated images due to differences in training data, architecture, and scale. Even without controlling the input prompt, an adversary can de-anonymize models on leaderboards via simple centroid classification in CLIP embedding space, achieving 87% Top-1 accuracy, thereby enabling ranking manipulation attacks.
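The centroid attack is essentially a nearest-centroid classifier over embeddings; a toy stand-in for CLIP features (the data and model names are made up):

```python
import numpy as np

def fit_centroids(embs, labels):
    """One centroid per model: the normalized mean of its image embeddings."""
    classes = sorted(set(labels))
    C = np.stack([embs[np.array(labels) == c].mean(axis=0) for c in classes])
    return classes, C / np.linalg.norm(C, axis=1, keepdims=True)

def predict(embs, classes, C):
    """Assign each embedding to the nearest centroid by cosine similarity."""
    E = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return [classes[i] for i in (E @ C.T).argmax(axis=1)]

# Toy stand-in for CLIP embeddings: two "models" with shifted means.
rng = np.random.default_rng(0)
a = rng.normal(size=(50, 32)) + 2.0
b = rng.normal(size=(50, 32)) - 2.0
embs = np.vstack([a, b])
labels = ["model_a"] * 50 + ["model_b"] * 50

classes, C = fit_centroids(embs, labels)
preds = predict(embs, classes, C)
print(sum(p == l for p, l in zip(preds, labels)) / len(labels))  # -> 1.0
```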
- Text to Sketch Generation with Multi-Styles
-
This paper proposes M3S (Multi-Style Sketch Synthesis), a training-free framework that achieves single- and multi-style sketch generation conditioned on text prompts and reference style sketches, via linearly smoothed K/V feature injection, joint AdaIN style tendency control, and style-content disentangled guidance.
- ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation
-
This paper proposes ThermalGen, an adaptive flow-based generative model that achieves, for the first time, high-fidelity RGB-to-Thermal image translation across diverse viewpoints, sensors, and environmental conditions via an RGB-conditioned architecture and a style disentanglement mechanism. Three new large-scale satellite–aerial RGB-T paired datasets are also released.
- TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising
-
This work introduces TIDMAD — the first ultra-long time series denoising benchmark dataset for dark matter searches — comprising training/validation/science data from the ABRACADABRA experiment, a denoising score metric, and a complete analysis pipeline, enabling AI algorithms to directly produce physics-community-standard dark matter search results.
- Token Perturbation Guidance for Diffusion Models
-
This paper proposes Token Perturbation Guidance (TPG), which constructs a negative score signal by applying norm-preserving shuffling perturbations to intermediate token representations in diffusion models, enabling training-free, condition-agnostic guidance. TPG nearly halves the FID of SDXL in unconditional generation and approaches CFG-level performance in conditional generation.
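A minimal sketch of the perturbation and guidance rule, assuming a CFG-style extrapolation weight and a dummy score function (both hypothetical stand-ins for the diffusion model's internals):

```python
import numpy as np

def token_perturbation_guidance(tokens, score_fn, w=3.0, rng=None):
    """Sketch of TPG: shuffle the token axis (a permutation, hence
    norm-preserving) to build a degraded negative branch, then
    extrapolate away from it, CFG-style."""
    rng = rng or np.random.default_rng()
    s_pos = score_fn(tokens)
    s_neg = score_fn(tokens[rng.permutation(tokens.shape[0])])
    return s_pos + w * (s_pos - s_neg)

rng = np.random.default_rng(0)
x = rng.normal(size=(77, 16))                     # 77 tokens, dim 16
guided = token_perturbation_guidance(x, lambda t: t.mean(axis=0), rng=rng)
print(guided.shape)                               # -> (16,)
# Permuting tokens preserves the Frobenius norm exactly:
print(np.isclose(np.linalg.norm(x[::-1]), np.linalg.norm(x)))  # -> True
```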
- Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration
-
This paper proposes Tortoise and Hare Guidance (THG), a training-free acceleration strategy for diffusion sampling that reformulates the classifier-free guidance (CFG) ODE as a multirate ODE system. The noise estimation term is integrated with fine-grained steps (tortoise equation), while the additional guidance term is integrated with coarse-grained steps (hare equation), reducing the number of function evaluations (NFE) by up to 30% with negligible degradation in generation quality.
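The multirate idea can be illustrated with a toy Euler integrator that refreshes the "hare" guidance term only on a coarse grid while stepping the "tortoise" base term at every fine step (step counts and the two right-hand sides are stand-ins, not the paper's ODE system):

```python
import numpy as np

def multirate_euler(f, g, x0, t0, t1, n_fine=50, n_coarse=10):
    """Toy multirate Euler: evaluate the expensive guidance term g only
    on a coarse grid and hold it constant in between, while the base
    term f is evaluated at every fine step. Returns the final state and
    the number of g-evaluations actually spent."""
    x, t = x0, t0
    h = (t1 - t0) / n_fine
    stride = n_fine // n_coarse
    g_val, g_evals = None, 0
    for i in range(n_fine):
        if i % stride == 0:            # refresh the "hare" term
            g_val = g(x, t)
            g_evals += 1
        x = x + h * (f(x, t) + g_val)  # "tortoise" term every step
        t += h
    return x, g_evals

f = lambda x, t: -x                    # stand-in for the noise estimate
g = lambda x, t: 0.5 * np.sin(t)       # stand-in for the guidance term
x, g_evals = multirate_euler(f, g, x0=1.0, t0=0.0, t1=1.0)
print(g_evals)                         # -> 10 guidance evaluations instead of 50
```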
- Toward a Unified Geometry Understanding: Riemannian Diffusion Framework for Graph Generation and Prediction
-
This paper proposes GeoMancer, a framework that replaces numerically unstable exponential maps with a Riemannian GyroKernel autoencoder to disentangle multi-level graph features onto task-specific product manifolds. By further introducing manifold-constrained diffusion and a self-guidance generation strategy, GeoMancer achieves unified modeling and state-of-the-art performance across molecular generation, node classification, and graph regression tasks.
- Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations
-
This paper unifies conditional guidance under a fixed-point iteration framework, showing that CFG and its variants are all special cases of single-step iterations over short intervals. It theoretically proves their suboptimality and proposes Foresight Guidance (FSG)—performing multi-step iterations over longer intervals in early diffusion stages to achieve better alignment quality with less computation.
- Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
-
This paper proposes LDDBM (Latent Denoising Diffusion Bridge Model), which extends denoising diffusion bridge models into a shared latent space and incorporates contrastive alignment loss and predictive loss to achieve a general-purpose framework for arbitrary modality translation.
- Towards Resilient Safety-Driven Unlearning for Diffusion Models Against Downstream Fine-tuning
-
This paper proposes ResAlign, a framework that leverages Moreau envelope approximation and meta-learning to make safety-driven unlearning in diffusion models resilient against harmful capability recovery induced by downstream fine-tuning, even when fine-tuning is performed exclusively on benign data.
- Towards Robust Zero-Shot Reinforcement Learning
-
This paper proposes BREEZE, a framework that systematically addresses out-of-distribution (OOD) extrapolation errors and insufficient expressivity in FB-based zero-shot RL through behavior-regularized representation guidance, task-conditioned diffusion policy extraction, and attention-enhanced representation modeling. BREEZE achieves state-of-the-art or near-state-of-the-art robust zero-shot generalization on ExORL and D4RL Kitchen benchmarks.
- Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling
-
This paper proposes TIRE (Track, Inpaint, REsplat), a three-stage pipeline that locates unobserved regions via video tracking, progressively infills textures using a subject-driven inpainting model, and back-projects multi-view consistent results into 3D, enabling identity-preserving 3D/4D generation.
- Training-Free Constrained Generation with Stable Diffusion Models
-
This paper proposes a training-free constrained generation method that embeds Proximal Langevin Dynamics into the reverse denoising process of Stable Diffusion. Image-space constraints are backpropagated to the latent space via the decoder, enabling strict constraint satisfaction on generated outputs without retraining.
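A toy version of the idea, with an analytic score and a box constraint standing in for Stable Diffusion's learned score and the decoder-backpropagated image-space constraint:

```python
import numpy as np

def constrained_langevin(grad_logp, prox, x0, steps=500, eta=1e-2, rng=None):
    """Toy proximal Langevin: a Langevin step on the log-density
    followed by a proximal/projection step enforcing the constraint."""
    rng = rng or np.random.default_rng(0)
    x = x0
    for _ in range(steps):
        x = x + eta * grad_logp(x) + np.sqrt(2 * eta) * rng.normal(size=x.shape)
        x = prox(x)                      # enforce the constraint
    return x

grad_logp = lambda x: -x                 # standard Gaussian "prior" score
prox = lambda x: np.clip(x, 0.5, 2.0)    # box constraint as the prox operator
x = constrained_langevin(grad_logp, prox, x0=np.zeros(4))
print(bool(np.all((x >= 0.5) & (x <= 2.0))))   # -> True
```

Ending every iteration with the proximal step is what makes constraint satisfaction strict rather than approximate in this sketch.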
- Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models
-
This paper proposes Safe Text embedding Guidance (STG), a training-free approach for safe text-to-image generation that dynamically adjusts text embeddings during diffusion sampling based on a safety function evaluated on the expected denoised image. STG effectively removes unsafe content while maximally preserving the original semantic intent.
- Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models
-
This paper proposes a black-box watermark forging method based on image preference models. Given only a single watermarked image, the method extracts the watermark via backpropagation and transfers it to arbitrary new images, effectively forging multiple post-hoc watermarking schemes without access to the underlying watermarking algorithm.
- Tree-Guided Diffusion Planner
-
This paper proposes the Tree-guided Diffusion Planner (TDP), which formalizes test-time diffusion planning as a tree search problem. Through bi-level sampling—particle-guided generation of diverse parent trajectories for exploration, combined with fast conditional denoising to generate child trajectories for exploitation—TDP achieves a strong exploration–exploitation balance and substantially outperforms existing methods under non-convex objectives and non-differentiable constraints.
- Two-Steps Diffusion Policy for Robotic Manipulation via Genetic Denoising
-
This paper reveals the distributional mismatch caused by clipping operations in diffusion policies and proposes GDP, a method combining denoising schedule optimization with genetic algorithm-based population selection, enabling off-the-shelf DDPM diffusion policies to match or surpass 100-step baselines with only 2-step inference and no retraining.
- UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset
-
This work constructs UltraHR-100K, a large-scale dataset comprising 100K ultra-high-resolution images with rich annotations, and proposes a Frequency-Aware Post-Training (FAPT) method combining Detail-Oriented Timestep Sampling (DOTS) and Soft-Weighted Frequency Regularization (SWFR) based on DFT, enabling pretrained T2I models to generate fine-grained details at ultra-high resolutions.
- Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
-
By systematically analyzing three key properties that hinder visual semantic learning in autoregressive image generation — local conditional dependency, inter-step semantic inconsistency, and the absence of spatial invariance — this paper proposes ST-AR, a training framework that incorporates masked image modeling and contrastive learning into the next-token prediction objective. Without relying on any pretrained representation model, ST-AR improves the FID of LlamaGen-XL by approximately 49% (from 19.42 to 9.81), achieving performance comparable to a 3B-parameter model trained for 300 epochs within only 50 epochs.
- Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Models
-
Under a Mixture of Low-Rank Gaussians (MoLRG) data model, this paper theoretically proves that the unimodal dynamics of representation quality across noise levels arise from a trade-off between denoising strength and class discriminability, and empirically demonstrates that the emergence of unimodal dynamics serves as a reliable indicator of model generalization.
- UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
-
This paper proposes UniLumos, a unified image and video relighting framework that enhances physical plausibility by incorporating RGB-space depth and normal geometry feedback into a flow matching backbone, while achieving 20× inference speedup through path consistency learning.
- Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations
-
This paper identifies the massive activations phenomenon in Diffusion Transformers (DiTs) that renders features non-discriminative, reveals its intrinsic connection to AdaLN, and proposes DiTF, a training-free framework for extracting semantically discriminative features that surpasses DINO and SD models on visual correspondence tasks.
- UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation
-
This paper proposes UtilGen, a utility-centric generative data augmentation framework that evaluates the downstream task utility of synthetic data via a meta-learning weight network, and employs a dual-level optimization strategy—model-level DPO and instance-level (prompt + noise) optimization—to adaptively generate high-utility synthetic training data, achieving an average improvement of 3.87% across 8 benchmarks.
- V-CECE: Visual Counterfactual Explanations via Conceptual Edits
-
V-CECE proposes the first black-box visual counterfactual explanation framework that systematically reveals the explanatory gap between human and neural network semantic understanding. It guarantees edit-set optimality via WordNet knowledge graphs and the Hungarian algorithm, and executes concept-level edits using Stable Diffusion. The key finding is that CNN classifiers are severely misaligned with human semantic reasoning (requiring 5+ edit steps), whereas LVLMs (Claude 3.5 Sonnet) are highly aligned with humans (requiring only 2–3 steps).
- Value Gradient Guidance for Flow Matching Alignment
-
This paper proposes VGG-Flow, which leverages the Hamilton-Jacobi-Bellman (HJB) equation from optimal control theory to reformulate flow matching alignment as a gradient matching task—matching the residual velocity field to the gradient of the value function—enabling efficient reward alignment while preserving the prior distribution.
- Vicinity-Guided Discriminative Latent Diffusion for Privacy-Preserving Domain Adaptation
-
This paper proposes Discriminative Vicinity Diffusion (DVD), which for the first time employs latent diffusion models for discriminative knowledge transfer. By training a diffusion model within the vicinity latent space of source-domain features to generate source-style cues, DVD enables domain adaptation without access to source data, surpassing state-of-the-art methods on standard SFDA benchmarks.
- Watermarking Autoregressive Image Generation
-
This paper is the first to adapt LLM watermarking (KGW green/red scheme) to the token level of autoregressive image generation models. It identifies and addresses the key challenge of insufficient Reverse Cycle Consistency (RCC) through tokenizer–detokenizer fine-tuning and a watermark synchronization layer, achieving robust image watermark detection with theoretical guarantees.
- What We Don't C: Manifold Disentanglement for Structured Discovery
-
This paper proposes WWDC (What We Don't C), a method that employs conditionally guided latent flow matching to remove known information from existing VAE representations, enabling unknown features to be more readily discovered and accessed in the residual manifold, thus facilitating iterative scientific discovery.
- When Are Concepts Erased From Diffusion Models?
-
This paper proposes two mechanistic models of concept erasure (guidance-based avoidance vs. destruction-based removal) and designs a suite of five independent probing methods—spanning optimization search, in-context probing, noise trajectory probing, classifier-guided probing, and dynamic concept tracing—to systematically demonstrate that most existing erasure methods merely "circumvent" concepts rather than genuinely eliminating the underlying knowledge.
- Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models
-
This paper proposes the HeadHunter framework and SoftPAG method, refining the granularity of attention perturbation in diffusion models from the layer level down to individual attention heads. It is the first work to reveal that different attention heads govern distinct visual concepts (structure, style, texture, etc.), enabling more precise and composable generation guidance.
- Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training
-
Through numerical experiments and theoretical analysis, this paper identifies two critical timescales in diffusion model training — a generalization time \(\tau_{\text{gen}}\) and a memorization time \(\tau_{\text{mem}}\) — where the latter scales linearly with training set size \(n\) while the former remains constant. The resulting implicit dynamical regularization enables early stopping to prevent memorization even in heavily overparameterized regimes.
- Why Diffusion Models Don't Memorize: The Role of Implicit Regularization
-
This paper reveals, through both numerical experiments and theoretical analysis, an implicit dynamical regularization mechanism in diffusion model training: the gap between the timescale for generating high-quality samples \(\tau_\text{gen}\) and the timescale for memorization \(\tau_\text{mem}\) grows linearly with training set size \(n\), providing theoretical justification for early stopping.
- Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation
-
Through theoretical analysis on Gaussian mixture models and large-scale experiments on the SmolLM2 family via multi-level distillation, this paper reveals the core mechanism of knowledge distillation in generative models: distillation induces a tradeoff in the student model between precision (generation quality) and recall (distribution coverage), governed by the entropy of the teacher distribution.
- WMCopier: Forging Invisible Image Watermarks on Arbitrary Images
-
This paper proposes WMCopier, the first diffusion-model-based no-box watermark forging attack that requires no prior knowledge of the target watermarking algorithm. By training an unconditional diffusion model to learn the watermark distribution, injecting watermark signals via shallow DDIM inversion, and refining results through iterative optimization, WMCopier achieves high forging success rates against both open-source and commercial watermarking systems, including Amazon's.