🎨 Image Generation

🔬 ICLR2026 · 357 paper notes

🔥 Top topics: Diffusion Models ×135 · Text-to-Image ×26 · Alignment/RLHF ×17 · Image Editing ×15 · Layout & Composition ×13

A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

The first systematic analysis of conditional embeddings in Diffusion Transformers reveals extreme angular similarity (inter-class cosine similarity >99%) and dimensional sparsity (only 1-2% of dimensions carry semantic information). Generation quality remains largely unchanged after pruning 2/3 of low-magnitude dimensions, uncovering a hidden semantic bottleneck in conditional embeddings.
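As a rough illustration of the two measurements in this summary, the sketch below computes cosine similarity between two embeddings and zeroes out all but the largest-magnitude dimensions. The function names and the toy vectors are hypothetical; the paper's actual pruning procedure and embedding sizes are not given here.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prune_low_magnitude(vec, num_keep):
    # Keep the num_keep entries with the largest absolute value, zero the rest.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:num_keep])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

# Toy class embeddings: a shared large component plus small class-specific ones.
class_a = [10.0, 0.1, -0.2, 0.05]
class_b = [10.0, -0.1, 0.3, 0.02]
similarity = cosine(class_a, class_b)      # close to 1 despite different classes
pruned = prune_low_magnitude(class_a, 2)   # only the two largest dimensions survive
```

In this toy setup the shared component dominates the angle, which is one simple way high inter-class similarity and a few informative dimensions can coexist.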

A Noise is Worth Diffusion Guidance

This paper proposes NoiseRefine: instead of modifying the diffusion model itself, it trains a lightweight network to "refine" random Gaussian noise into a structured noise. This enables generating images with quality close to CFG guidance using only a single forward pass without any sampling guidance, thereby eliminating the overhead of dual forward passes per step.

A Physics-Inspired Optimizer: Velocity Regularized Adam

This paper proposes VRAdam (Velocity-Regularized Adam), which translates a physical stability mechanism, the "quartic kinetic energy term," into a global dynamic learning rate that automatically contracts with velocity: \(\eta_t=\alpha_0/(1+\min(\beta_3\|v_t\|^2,\alpha_1))\). Embedded into AdamW, it automatically decelerates when weight updates are too large, suppressing oscillations near the Edge of Stability. Complemented by rigorous Lyapunov stability and \(O(\ln N/\sqrt N)\) convergence proofs, it consistently outperforms AdamW across image classification, language modeling, GFlowNets, GPT-2 pre-training, and LLM fine-tuning.
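The learning-rate formula in this summary can be evaluated directly. The sketch below is a minimal pure-Python version; the default values of `alpha0`, `alpha1`, and `beta3` are illustrative assumptions rather than the paper's settings, and the velocity is treated as a flat list of floats.

```python
def vradam_lr(velocity, alpha0=1e-3, alpha1=19.0, beta3=1.0):
    # eta_t = alpha0 / (1 + min(beta3 * ||v_t||^2, alpha1))
    sq_norm = sum(v * v for v in velocity)
    return alpha0 / (1.0 + min(beta3 * sq_norm, alpha1))

# Zero velocity gives the base rate; a large velocity hits the alpha1 cap.
base_rate = vradam_lr([0.0, 0.0])
capped_rate = vradam_lr([100.0, 100.0])
```

The cap means the rate never falls below \(\alpha_0/(1+\alpha_1)\), so the optimizer slows down under large updates without stalling.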

A Probabilistic Hard Concept Bottleneck for Steerable Generative Models

This paper reformulates the concept bottleneck in generative models into a probabilistic hard binary concept layer, VHCB. This allows users to directly sample images from specified concepts or perform interventions on existing generations. Systematic validation on StyleGAN2 and DDPM demonstrates superior steerability and reduced concept leakage compared to soft concept bottlenecks.

AC-Sampler: Accelerate and Correct Diffusion Sampling with Metropolis-Hastings Algorithm

AC-Sampler truncates the diffusion generation process at an intermediate timestep, generates candidates using a score-based Langevin proposal, and applies a Metropolis-Hastings (MH) accept/reject step to correct them toward the true marginal distribution. This simultaneously reduces NFE and improves FID without fine-tuning the base model.
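The combination of a Langevin proposal with an MH correction is the classical Metropolis-adjusted Langevin algorithm. The sketch below shows one such step on a toy 1D standard normal target whose density and score are known in closed form. This is an assumption made for illustration: a diffusion model only provides a learned score, and how AC-Sampler evaluates the acceptance ratio in that setting is not covered by this summary.

```python
import math
import random

def log_p(x):
    # Toy target: standard normal log density, up to an additive constant.
    return -0.5 * x * x

def score(x):
    # Gradient of log_p with respect to x.
    return -x

def log_q(x_to, x_from, step):
    # Log density (up to a constant) of the Langevin proposal x_from -> x_to.
    mean = x_from + step * score(x_from)
    return -((x_to - mean) ** 2) / (4.0 * step)

def mala_step(x, step, rng):
    # Propose with one Langevin step, then accept or reject with the MH ratio.
    proposal = x + step * score(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    log_alpha = (log_p(proposal) + log_q(x, proposal, step)
                 - log_p(x) - log_q(proposal, x, step))
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal, True
    return x, False

rng = random.Random(0)
state = 3.0
chain = []
for _ in range(20000):
    state, _accepted = mala_step(state, 0.5, rng)
    chain.append(state)
```

Because rejected proposals leave the state unchanged, the chain targets the exact distribution even though a single Langevin step is biased.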

ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization

ACCORD formalizes "concept coupling" (entanglement between subjects and contexts) in text-to-image personalization as a statistical dependence problem for the first time. It decomposes the total dependence discrepancy into two computable sources: "denoising dependence discrepancy" and "prior dependence discrepancy," eliminating them via two plug-and-play regularization losses (DDLoss + PDLoss). This improves both text controllability and fidelity across subject, style, and face personalization.

Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

RepTok fine-tunes the [cls] token of a pre-trained self-supervised ViT into a "single continuous token" latent space. Combined with a flow matching decoder for high-fidelity reconstruction and a non-attention MLP-Mixer for generation in this 1D space, it achieves competitive FIDs on ImageNet/MS-COCO using less than 10% of the training compute compared to competitors.

AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models

AEGIS transforms the "erasure target" from manually selected fixed safety words to an Adversarial Erasure Target (AET) that iteratively approaches the semantic center of the concept. It further employs Gradient Rectification with Projection (GRP) to maintain generation quality without requiring retention data, effectively minimizing adversarial prompt attack success rates with negligible performance loss.

AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport

AlignFlow utilizes Semi-Discrete Optimal Transport (SDOT) to pre-calculate a deterministic "noise distribution \(\rightarrow\) full dataset" alignment mapping before training. This serves as a plug-and-play coupling for various flow generative models, achieving straighter trajectories, faster convergence, and comprehensive FID reductions with less than 1% additional overhead.

Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

This paper proposes AlignTok. Instead of training a VAE from scratch or forcing it to learn semantics via "semantic regularization," it transforms a semantically rich pre-trained visual foundation encoder (DINOv2) into a continuous tokenizer through a three-stage progressive alignment. This yields a latent space that is both semantically well-structured and capable of precise reconstruction; on ImageNet 256×256, it allows the diffusion model to reach a gFID of 1.90 in just 64 epochs, converging approximately 5× faster than VA-VAE.

\(\alpha\)-DPO: Robust Preference Alignment for Diffusion Models via \(\alpha\) Divergence

This paper demonstrates from a distribution matching perspective that Diffusion-DPO is equivalent to minimizing the forward KL divergence and is therefore naturally sensitive to noisy preference pairs. It proposes replacing FKL with \(\alpha\)-divergence combined with a dynamic \(\alpha\) schedule, making diffusion model preference alignment significantly more robust under label-flipping noise.

AlphaFlow: Understanding and Improving MeanFlow Models

This paper decomposes the training objective of MeanFlow into two terms: "Trajectory Flow Matching + Trajectory Consistency." It identifies that the strong negative correlation between their gradients leads to optimization conflicts. Consequently, the authors propose the \(\alpha\)-Flow objective family, which unifies Flow Matching, Shortcut, and MeanFlow. Using a curriculum strategy that anneals \(\alpha\) from 1 to 0, they achieve a 1-NFE FID of 2.58 and a 2-NFE FID of 2.15 on ImageNet-256 using pure DiT trained from scratch.

Amortising Inference and Meta-Learning Priors in Neural Networks (BNNP)

Proposes BNNP (Bayesian Neural Network Process), a neural process that treats BNN weights as latent variables and the BNN itself as the decoder. By performing layer-wise amortised variational inference to jointly learn BNN priors and inference networks across multiple datasets, it addresses for the first time whether approximate inference still matters under a good prior—the answer is "yes," there is no free lunch.

Antithetic Noise in Diffusion Models

To be supplemented after in-depth reading.

Any-Order Flexible Length Masked Diffusion

This paper proposes FlexMDM, a masked diffusion model capable of inserting new tokens to model variable-length sequences during generation. It theoretically preserves the "any-order parallel decoding" capability of masked diffusion. While maintaining perplexity parity with fixed-length masked diffusion, FlexMDM achieves significantly better length distribution fitting. Furthermore, it requires only 16 H100 GPUs for three days to transform a pre-trained LLaDA-8B into a variable-length model, showing clear improvements on GSM8K (58%→67%) and code completion (52%→65%).

Any-step Generation via N-th Order Recursive Consistent Velocity Field Estimation

This paper proposes RCGM, which unifies few-step generation methods such as consistency models, MeanFlow, and shortcuts as 1st-order special cases of "N-th Order Recursive Velocity Field Estimation." By extending to 2nd-order and higher, the high-order targets avoid expensive JVPs and remain compatible with aggressive EMA smoothing. This enables the stable expansion of few-step generation training to 20B large models, achieving 1.48 FID in 2 steps on ImageNet 256×256.

Arbitrary-Shaped Image Generation via Spherical Neural Field Diffusion

ASIG generates an entire scene at once on a subdivided icosahedral sphere using "Mesh-based Spherical Latent Diffusion," and then employs a "Spherical Neural Field" to perform arbitrary sampling from this sphere based on coordinate conditions. This achieves explicit control over view, FOV, and resolution within a unified framework for the first time, outputting distortion-free images in perspective, panoramic, fisheye, or irregular shapes, with quality significantly exceeding various specialized methods.

Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

AsynDM significantly improves semantic alignment in text-to-image generation without fine-tuning by assigning different timestep schedules to different pixels (denoising prompt-related regions more slowly), allowing them to utilize clearer contextual references.

AttriCtrl: A Generalizable Framework for Controlling Semantic Attribute Intensity in Diffusion Models

AttriCtrl quantifies aesthetic attributes such as "brightness/detail/realism/safety" into a unified scalar range of \([0,1]\). By leveraging a lightweight "Value Encoder" to translate these numerical values into token sequences for injection into diffusion models, users can perform continuous, decoupled, and plug-and-play intensity control over single or multiple semantic attributes like adjusting a knob.

Autoregressive-based Progressive Coding for Ultra-Low Bitrate Image Compression

ARPC utilizes the "next-scale prediction" of Visual Autoregressive (VAR) models for ultra-low bitrate image compression. The encoder uses a multi-scale residual quantizer to decompose images into \(K\) sets of coarse-to-fine discrete tokens. By transmitting only the first \(k\) sets and letting VAR autoregressively generate the remaining \(K-k\) sets, a single model achieves continuous bitrate adjustment. Furthermore, VAR is reused as a probability estimator for lossless arithmetic coding, and a grouped mask quantizer is employed to further minimize bits. At bitrates below 0.05 bpp, ARPC outperforms 13 diffusion and token-based baselines in perceptual quality while decoding 2–6× faster.

Autoregressive Image Generation with Randomized Parallel Decoding

This paper proposes ARPG, a visual autoregressive model based on the "guided decoding" framework. By decoupling position guidance (query) from content representation (key-value), it achieves fully randomized order training and generation while supporting efficient parallel decoding. On ImageNet-1K 256×256, it achieves a 1.94 FID in 64 steps, with over 20× throughput improvement and over 75% reduction in memory consumption.

Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models

This paper discovers that per-sample gradients of diffusion models at low SNR timesteps are approximately collinear, causing the empirical Fisher Information Matrix (FIM) to be essentially rank-1. Consequently, a rank-1 EWC penalty is proposed—which is as computationally cheap as diagonal approximation yet captures the principal curvature direction—combined with generative distillation to nearly eliminate forgetting in class-incremental image generation.
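If the Fisher matrix is approximated as \(F \approx g g^\top\) for a single direction \(g\), the usual EWC quadratic form collapses to a squared dot product. The sketch below shows that reduction in pure Python. The function name, the weight `lam`, and the assumption that `g` is already available as a flat vector are illustrative; how the paper estimates `g` is not specified in this summary.

```python
def rank1_ewc_penalty(theta, theta_star, g, lam=1.0):
    # With F ~= g g^T, (theta - theta*)^T F (theta - theta*) = (g . delta)^2,
    # so the penalty costs one dot product, the same order as a diagonal EWC.
    proj = sum(gi * (t - ts) for gi, t, ts in zip(g, theta, theta_star))
    return 0.5 * lam * proj * proj

# Movement along g is penalized; movement orthogonal to g is free.
along = rank1_ewc_penalty([1.0, 2.0], [0.0, 2.0], [1.0, 0.0])
orthogonal = rank1_ewc_penalty([0.0, 5.0], [0.0, 2.0], [1.0, 0.0])
```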

BAR: Refactor the Basis of Autoregressive Visual Generation

BAR abstracts the concept of "token sequences" in autoregressive image generation as "projections of image vectors onto a set of basis vectors." By utilizing an end-to-end learnable linear transformation matrix \(A\), it unifies various manually designed prediction units and orders (such as VAR/xAR/RAR/PAR/FAR). The model learns the optimal basis automatically, achieving an FID of 1.15 on ImageNet-256.

Beyond Text-to-Image: Liberating Generation with a Unified Discrete Diffusion Model

Muddit integrates text and images into a single absorbing-state (masked) discrete diffusion framework. Utilizing an MM-DiT shared generator initialized from the text-to-image model Meissonic, it performs text-to-image generation, image-to-text generation, and VQA by only switching the condition signal \(c\). With 1B parameters, it matches or exceeds the quality and efficiency of significantly larger autoregressive unified models.

BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment

When text prompts and structural conditions (depth/edges, etc.) conflict, existing controllable generation models often satisfy only one. This paper proposes BideDPO, a bidirectional decoupled DPO framework that splits "text alignment" and "condition alignment" into two independent preference pairs. It utilizes adaptive loss balancing for dynamic weighting and includes a pipeline to automatically construct "conflict-aware preference data" through an iterative self-enhancement loop. On the self-built DualAlign benchmark, it improves text alignment success rates by up to 35%+ while simultaneously enhancing condition fidelity.

Branched Schrödinger Bridge Matching

The authors propose the BranchSBM framework, which extends Schrödinger Bridge Matching to branching scenarios by parameterizing multiple time-dependent velocity fields and growth processes. This approach models bifurcating dynamic trajectories from a single initial distribution to multiple target distributions, significantly outperforming single-branch methods in tasks such as LiDAR surface navigation and single-cell perturbation modeling.

Bridging Degradation Discrimination and Generation for Universal Image Restoration

BDG achieves fine-grained degradation discrimination through Multi-Angle Multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) and designs a three-stage diffusion training (generation → bridging → restoration) to seamlessly fuse degradation discrimination capabilities with generative priors, yielding significant fidelity gains in all-in-one restoration and real-world super-resolution tasks.

Bridging Generalization Gap of Heterogeneous Federated Clients Using Generative Models

FedVTC proposes that in model-heterogeneous federated learning, each client uses a Variational Transposed Convolutional (VTC) network to generate synthetic data from aggregated feature distribution statistics to fine-tune local models. This significantly improves generalization without requiring public datasets while reducing communication and memory overhead.

Bridging the Distribution Gap to Harness Pretrained Diffusion Priors for Super-Resolution

DM-SR maintains the pretrained diffusion model entirely intact, training only an image encoder to "translate" the low-resolution (LR) image directly into the "noisy image" distribution familiar to the diffusion model. By using a fixed denoiser for single-step generation, it achieves current SOTA perceptual quality in one-step diffusion.

Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models

This paper provides the first systematic decomposition of the training variance in Masked Diffusion Models (MDM) into three terms: "mask pattern noise + mask rate noise + data noise." Based on this, six variance reduction methods are designed, centered on P-POTS (Pareto-Optimal \(t\)-Sampler) and MIRROR (Complementary Mask Inverse Sampling). These methods improve MDM accuracy on complex reasoning by 7–8% and reduce run-to-run fluctuations to levels comparable to Autoregressive Models (ARM).

BézierFlow: Learning Bézier Stochastic Interpolant Schedulers for Few-Step Generation

BézierFlow shifts the focus of "what to optimize for few-step generation" from discrete ODE timesteps to continuous Stochastic Interpolant (SI) schedulers. By parameterizing the scheduler with Bézier control points, it achieves a 2–3× FID improvement for pre-trained diffusion/flow models in \(\le 10\) sampling steps with only 15 minutes of lightweight training.

Carré du champ Flow Matching: Improving the Quality-Generalisation Trade-off in Generative Models via Geometry-Aware Noise

This paper proposes CDC-FM (Carré du champ Flow Matching), which replaces the isotropic homogeneous Gaussian noise in standard Flow Matching with anisotropic, spatially varying noise determined by the local geometry of the data manifold. This significantly suppresses memorisation and enhances generalisation without sacrificing sample quality, making it particularly suitable for scientific scenarios with sparse data or strong geometric structures.

CASteer: Cross-Attention Steering for Controllable Concept Erasure

CASteer is a training-free framework for concept erasure in diffusion models. It pre-calculates "steering vectors" in the cross-attention layers for each concept using paired positive/negative prompts. During inference, it dynamically subtracts this direction based on the projection size of the current activation onto the vector. This allows for precise erasure (of nudity, violence, specific characters/styles) only on patches where the concept truly appears, while leaving other content nearly untouched, outperforming all training-based SOTA methods on multiple benchmarks.
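The projection-based subtraction described here can be sketched on plain vectors. In the code below, the steering vector is taken as the difference of mean activations for positive and negative prompts, and the subtraction is applied only when the projection is positive. Both choices, and all function names, are assumptions for illustration rather than the paper's exact formulation.

```python
def mean_vector(vectors):
    # Element-wise mean of a non-empty list of equal-length vectors.
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def steering_vector(pos_activations, neg_activations):
    # Direction pointing from "concept absent" to "concept present".
    pos_mean = mean_vector(pos_activations)
    neg_mean = mean_vector(neg_activations)
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def steer(activation, direction):
    # Subtract the component along `direction`, scaled by the projection,
    # and leave the activation untouched when the projection is not positive.
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        return list(activation)
    coeff = max(sum(a * d for a, d in zip(activation, direction)) / norm_sq, 0.0)
    return [a - coeff * d for a, d in zip(activation, direction)]

direction = steering_vector([[2.0, 1.0], [4.0, 1.0]], [[1.0, 1.0], [3.0, 1.0]])
erased = steer([2.0, 1.0], direction)      # concept component removed
untouched = steer([-2.0, 1.0], direction)  # no positive projection, unchanged
```

Gating on the sign of the projection is what keeps patches without the concept nearly unchanged in this sketch.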

Charts Are Not Images: On the Challenges of Scientific Chart Editing

This paper argues that "charts are not images"—charts are structured data renderings constrained by graphical grammars, and editing them is essentially a structural transformation rather than a pixel manipulation. Accordingly, it proposes FigEdit, a benchmark with over 30K samples covering 10 chart types and five progressive tasks, revealing that mainstream image editing models exhibit inflated pixel-based scores while frequently failing at actual semantic editing.

ChronoEdit: Towards Temporal Reasoning for In-Context Image Editing and World Simulation

This work reformulates image editing as a "two-frame video generation" problem, leveraging the temporal priors of pre-trained large video models to ensure the physical consistency of edits. By inserting discardable "temporal reasoning tokens" during inference to imagine a plausible editing trajectory, the proposed method achieves SOTA performance on world simulation editing tasks.

CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration

CIAR introduces speculative decoding for autoregressive image generation into an edge-cloud collaborative framework. It employs an on-device "Inter-Head" to output continuous probability intervals for each visual token to quantify uncertainty. This allows low-uncertainty regions to be generated locally on the device, while only high-uncertainty boundary detail tokens and their interval features are uploaded to the cloud for verification. Combined with Inter-DRO alignment training, it achieves a 2.18× speedup and reduces cloud request volume by 70% with negligible impact on image quality.

CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

CineTrans observes that the attention maps of video diffusion models naturally exhibit a block-diagonal structure, characterized by "strong intra-shot and weak inter-shot correlation." By manipulating attention with a block-diagonal mask constructed directly from shot timestamps and fine-tuning on the self-constructed Cine250K multi-shot dataset, the model can generate cinematic multi-shot transitions at any specified position. This mechanism is also effective in a training-free manner by simply applying the mask.
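A block-diagonal mask of this kind can be built from the frame indices at which new shots begin. The sketch below works at frame granularity and produces a hard boolean mask; the real model attends over tokens and may weaken rather than fully block inter-shot attention, so both the granularity and the hard mask are assumptions, as is the function name.

```python
def shot_mask(num_frames, cut_points):
    # cut_points: frame indices at which a new shot starts (frame 0 excluded).
    cuts = set(cut_points)
    shot_ids = []
    current = 0
    for frame in range(num_frames):
        if frame in cuts:
            current += 1
        shot_ids.append(current)
    # mask[i][j] is True when frames i and j belong to the same shot.
    return [[shot_ids[i] == shot_ids[j] for j in range(num_frames)]
            for i in range(num_frames)]

# Four frames with a cut before frame 2: two 2x2 blocks on the diagonal.
mask = shot_mask(4, [2])
```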

CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models

This paper proposes Consistency Mid-Training (CMT), which inserts a lightweight intermediate training stage between pre-trained diffusion models and flow map post-training. By having the model learn to map arbitrary points on ODE trajectories back to clean samples, it achieves a trajectory-aligned initialization. This significantly reduces training costs (by up to 98%) while reaching SOTA two-step generation quality.

Co-occurring Associated REtained concepts in Diffusion Unlearning

When diffusion models erase harmful concepts (e.g., nudity), they often inadvertently suppress benign concepts that co-occur with them (e.g., "person"). This paper defines such concepts as CARE (Co-occurring Associated REtained concepts) and proposes the CARE score for quantification. The proposed ReCARE framework automatically constructs a benign co-occurring vocabulary (CARE-set) from target images to simultaneously guide retention and erasure, achieving the best overall performance in robustness, utility, and CARE retention across nudity, Van Gogh style, and Tench tasks.

CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer

CoCoDiff is a training-free style transfer framework that directly extracts pixel-level semantic correspondence between content and style images from the intermediate features of pre-trained Stable Diffusion. It then utilizes a cyclic-consistent attention injection mechanism to "paste" styles onto structurally aligned regions, outperforming methods requiring additional training or annotations across FID, LPIPS, ArtFID, and CFSD metrics.

CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation

To be supplemented after further reading.

CoDi: Subject-Consistent and Pose-Diverse Text-to-Image Generation

To be supplemented after further reading.

CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation

CoEmoGen transforms abstract emotions into sentence-level, contextually coherent visual semantic descriptions. By utilizing hierarchical LoRA within Stable Diffusion, it simultaneously models polarity-shared low-level visual styles and specific emotion-unique high-level semantics, leading to images that align better with target emotions and exhibit more natural semantics and scalability than methods like EmoGen.

Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition

This paper proposes General Policy Composition (GPC), which performs a convex combination of distribution scores from multiple pre-trained diffusion or flow policies at test time. This approach yields a stronger policy that surpasses any single parent without extra training. The paper proves that convex combinations can reduce single-step score errors, and that this improvement propagates to the entire trajectory via Grönwall bounds.
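The core operation, a convex combination of score estimates, is simple to state in code. The sketch below combines per-policy score vectors with non-negative weights that sum to one. The function name and the flat-list representation are hypothetical, and how GPC chooses the weights is not described in this summary.

```python
def compose_scores(scores, weights):
    # scores: one score vector per policy, all evaluated at the same state.
    # weights: non-negative and summing to 1, so the result is a convex combination.
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per score vector")
    if any(w < 0.0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(scores[0])
    return [sum(w * s[i] for w, s in zip(weights, scores)) for i in range(dim)]

# Two toy policies with different score estimates at the same point.
combined = compose_scores([[1.0, 0.0], [0.0, 1.0]], [0.25, 0.75])
```

The combined score would then replace a single policy's score inside the usual denoising or flow integration loop.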

Composition of Pretrained Diffusion Models: A Logic-Based Calculus

This paper elevates the empirical PoE/MoE composition of pretrained diffusion models to a fuzzy logic-based Dombi score calculus. It demonstrates more stable mode coverage and sampling correction in multi-prompt Stable Diffusion, complex SAT-style compositions, and multi-objective molecular generation.

Compositional amortized inference for large-scale hierarchical Bayesian models

Extends Compositional Score Matching (CSM) to hierarchical Bayesian models by introducing a new error-damping estimator and mini-batch strategy to solve numerical instability under massive data groups. This work achieves the first amortized inference for large-scale hierarchical models exceeding 750,000 parameters (250,000+ data groups) and validates its effectiveness in real-world fluorescence lifetime imaging (FLIM).

Compositional Visual Planning via Inference-Time Diffusion Scaling

The authors freeze a pre-trained short-horizon video diffusion model and, at inference time, reformulate long-horizon planning as a chain-like factor graph of overlapping video segments. By performing synchronous and asynchronous message passing on Tweedie clean estimates (rather than noisy intermediate states) to enforce boundary consistency, they stitch short segments into globally coherent robotic manipulation plans without additional training, generalizing to start-goal combinations never seen during training.

Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss

The paper theoretically analyzes the advantages of autoregressive diffusion loss models over conditional diffusion models in condition error correction (exponential decay of gradient norm). It proposes a condition refinement method based on Optimal Transport (Wasserstein Gradient Flow) to solve the "condition inconsistency" problem in the autoregressive process, achieving an FID of 1.31 on ImageNet (based on MAR).

Condition Matters in Full-head 3D GANs

This work identifies that view conditioning in full-head 3D GANs causes severe directional bias (where generation quality at conditioned views far exceeds others). It proposes replacing camera views with view-invariant semantic features (frontal CLIP features) as conditions, paired with the BalanceHead360 dataset containing 11.2 million 360° balanced images synthesized via Flux.1 Kontext. This achieves high-fidelity, diverse full-head generation with consistency across all views for the first time.

Conditionally Whitened Generative Models for Probabilistic Time Series Forecasting

The authors propose CW-Gen (Conditionally Whitened Generative Models), which replaces the standard Gaussian terminal distribution in diffusion models/flow matching by jointly estimating the conditional mean and sliding window covariance matrix. They theoretically prove that sampling quality inevitably improves when the estimator satisfies sufficient conditions, consistently enhancing multivariate time series probabilistic forecasting performance across 5 datasets and 6 generative models.

Consis-GCPO: Consistency-Preserving Group Causal Preference Optimization for Vision Customization

Consis-GCPO reformulates GRPO reinforcement learning in subject-driven generation (reference-to-image/video) as a "discrete-time causal optimization" problem. By performing counterfactual interventions—specifically "masking text" and "masking reference images"—at each denoising step, it quantifies the instantaneous causal contribution of linguistic and visual conditions. These are converted into timestep-weighted advantages for targeted optimization, achieving higher subject consistency and stronger text following in complex multi-subject scenarios.

Consistent Text-to-Image Generation via Scene De-Contextualization

Reveals that the root cause of ID shift in T2I models is "scene contextualization" (scene tokens injecting contextual information into ID tokens) and proposes a training-free Scene De-Contextualization (SDeC) method. By analyzing the directional stability of SVD eigenvalues, SDeC identifies and suppresses potential scene-ID associations in prompt embeddings to achieve per-scene identity-consistent generation.

Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models

This paper proposes MaskGRPO, the first GRPO reinforcement learning framework capable of stably scaling to multimodal Discrete Diffusion Models (DDM). By providing a computable importance estimation and KL approximation for the intractable likelihood of DDM, and customizing re-masking and sampling strategies for "language" and "vision" modalities—fading-out AR re-masking for text and high-truncation random re-masking with the emerge sampler for images—it nearly doubles RL gains in mathematical reasoning, code, and text-to-image generation while accelerating training by up to 30%.

Constantly Improving Image Models Need Constantly Improving Benchmarks

This paper proposes the ECHO framework, which automatically distills real user discussions from social media (creative prompts + qualitative feedback) into structured benchmarks. By extracting over 31,000 in-the-wild prompts regarding GPT-4o Image Gen, it uncovers new tasks not covered by existing benchmarks, increases the performance gap between SOTA and other models by 3.2×, and converts community complaints into quantifiable fine-grained metrics.

ContextBench: Modifying Contexts for Targeted Latent Activation and Behaviour Elicitation

This paper formalises the task of "automatically generating fluent, natural inputs that precisely trigger specific internal features or behaviours of a model" as context modification. It proposes ContextBench, a benchmark containing 715 tasks across three categories (SAE activation, story inpainting, and backdoor trigger recovery). Based on the white-box method EPO, it introduces two improvements—LLM-assisted mutation and LLaDA diffusion inpainting—achieving Pareto improvements across the conflicting objectives of "activation strength" and "linguistic fluency."

ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation

ContextGen builds upon the FLUX.1-Kontext Diffusion Transformer by "inserting composite layout maps and reference images together into a single context token sequence." Combined with hierarchical attention masking (Contextual Layout Anchoring (CLA) in the initial/final layers for global structure and Identity Consistency Attention (ICA) in middle layers for instance-level injection) and non-overlapping position indices, it achieves SOTA performance in both layout accuracy and identity fidelity for multi-subject controllable generation, even surpassing GPT-4o in identity preservation.

Continual Unlearning for Text-to-Image Diffusion Models: A Regularization Perspective

This paper presents the first systematic study of continual unlearning in T2I diffusion models. It identifies that existing unlearning methods suffer from "utility collapse" due to cumulative parameter drift under sequential requests. To mitigate this, the authors propose a suite of additional regularization strategies (L1/L2 norms, selective fine-tuning, model merging) and a semantic-aware gradient projection method.

Continuously Augmented Discrete Diffusion model for Categorical Generative Modeling

CADD assigns an additional "continuous latent variable" track to each [MASK] position in discrete masked diffusion: masked tokens no longer collapse into an uninformative absorbing state but instead carry a continuous vector that is gradually noised yet retains semantic information. During denoising, this vector acts as a "soft prompt" to guide discrete prediction, achieving consistent improvements over pure masked diffusion baselines across text, image, and code generation tasks.

Contrastive Diffusion Guidance for Spatial Inverse Problems

Addressing "spatial inverse problems" where the forward operator is non-differentiable, non-smooth, and only partially known (a typical scenario being the reconstruction of floor plans from human walking trajectories), CoGuide shifts likelihood-based diffusion guidance from the original pixel space to a smooth embedding space trained via contrastive learning. By using the inner product of embedding vectors as a likelihood proxy to steer denoising, the method stably directs noise towards floor plans consistent with observed trajectories, outperforming six baselines in sparse and medium trajectory scenarios.

COSMO-INR: Complex Sinusoidal Modulation for Implicit Neural Representations

Through harmonic distortion analysis and Chebyshev polynomial approximation, this paper rigorously proves that odd/even symmetric activation functions suffer from systematic attenuation in the post-activation spectrum. It proposes modulating activation functions with a complex sinusoidal term \(e^{j\zeta x}\) to retain full spectral support. The authors design the COSMO-RC activation function and a regularized prior embedder architecture, achieving an average PSNR lead of +5.67 dB over the strongest baseline on Kodak image reconstruction and +3.45 dB on NeRF.

CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design

By adding only 4.1% parameters to FLUX.1-dev, CreatiDesign unifies three types of heterogeneous conditions ("subject images + semantic layouts (descriptions + boxes) + global prompts") into a single token sequence. The conditions interact through multi-modal attention, while a set of attention masks ensures each condition precisely controls its specific canvas area without semantic leakage. Complemented by an automated pipeline that generates a 400,000-sample dataset, the model surpasses both single-condition expert models and existing multi-condition models in subject fidelity, layout alignment, and overall harmony.

CREPE: Controlling Diffusion with Replica Exchange

The paper proposes CREPE, an inference-time control method for diffusion models based on Replica Exchange (Parallel Tempering). As a computational dual to SMC, it operates in parallel across the denoising step dimension and serially across the sample dimension. It offers high sample diversity, supports online refinement, and handles various tasks including temperature annealing, reward tilting, model composition, and CFG debiasing.

CroCoDiLight: Repurposing Cross-View Completion Encoders for Relighting

This paper reveals that the latent space of the geometric vision pre-trained model CroCo implicitly encodes illumination information. Using minimal data (two orders of magnitude less than was used for CroCo), the authors decouple its patch latent representations into "a global illumination vector + per-patch intrinsic vectors," enabling a series of photometric tasks such as relighting, shadow removal, and albedo estimation without intensive retraining.

Cross-ControlNet: Training-Free Fusion of Multiple Conditions for Text-to-Image Generation

This paper proposes Cross-ControlNet, a completely training-free multi-condition T2I framework. Grounded in two observations—natural spatial alignment of intermediate features across ControlNet branches and quantifiable condition strength via variance—the framework utilizes PixFusion (pixel-level variance-guided fusion), ChannelFusion (channel-level consistency ratio gated fusion), and KV-Injection (FG/BG decoupled KV injection) to fuse multi-path control signals during inference. It achieves approximately a 5.4% mIoU improvement over the strongest training-free baseline under conflicting conditions and generalizes to FLUX (DiT architecture) at zero cost.

D-AR: Diffusion via Autoregressive Models

D-AR designs a "sequential diffusion tokenizer" to re-encode the image diffusion process into a sequence of coarse-to-fine discrete tokens. This allows an unmodified Llama decoder to generate images using standard next-token prediction while decoding corresponding diffusion denoising steps in real-time. It achieves FIDs of 2.09 and 2.00 on ImageNet 256×256 with 775M and 1.4B parameters, respectively.

Deconstructing Guidance: A Semantic Hierarchy for Precise Diffusion Model Editing

This paper discovers that the magnitude of the "guidance difference vector" \(\Delta\epsilon\) in CFG encodes the semantic scale of editing (objects = large magnitude, background = small magnitude). Using the Tweedie formula, this is proven to be an inevitable consequence of Fisher information density. Based on this, the training-free, plug-and-play Prism-Edit is proposed. By hierarchically deconstructing guidance signals and directionally amplifying suppressed background signals, it makes the long-standing challenge of "background modification" stable and controllable for the first time.
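
To make the quantity in this summary concrete, here is a minimal sketch of the guidance difference vector, assuming the standard classifier-free guidance combination \(\epsilon = \epsilon_{uncond} + w\,\Delta\epsilon\) with \(\Delta\epsilon = \epsilon_{cond} - \epsilon_{uncond}\). The function names and the toy vectors are hypothetical, and the sketch does not implement Prism-Edit itself.

```python
import math

def guidance_difference(eps_cond, eps_uncond):
    """Delta-epsilon: conditional minus unconditional noise prediction."""
    return [c - u for c, u in zip(eps_cond, eps_uncond)]

def cfg_combine(eps_cond, eps_uncond, scale):
    """Standard CFG: move from the unconditional prediction along delta."""
    delta = guidance_difference(eps_cond, eps_uncond)
    return [u + scale * d for u, d in zip(eps_uncond, delta)]

def l2_norm(vec):
    """Euclidean magnitude of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))

# Toy predictions (assumed values, not model outputs).
eps_uncond = [0.0, 0.0, 0.0, 0.0]
eps_cond = [3.0, 4.0, 0.0, 0.0]

delta = guidance_difference(eps_cond, eps_uncond)
guided = cfg_combine(eps_cond, eps_uncond, scale=2.0)
delta_magnitude = l2_norm(delta)
```

Here `delta_magnitude` is the per-sample magnitude that the summary associates with the semantic scale of an edit.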

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

The authors perform a rigorous gradient decomposition of the widely used DMD distillation objective and discover that the actual "engine" compressing multi-step diffusion models into few-step generators is not distribution matching, but a long-overlooked CFG Augmentation term. Distribution matching acts merely as a "regularizer" for stable training. Based on this "Spear/Shield" division of labor, they propose decoupled re-noising schedules (d-DMD), achieving consistent performance gains across SDXL, Lumina, and 6B large models.

Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling

A pre-trained flow model (SiT/DiT) is reinterpreted as an encoder-decoder: the encoder processes only the current timestep \(t\), while the decoder processes only the next timestep \(r\). Without modifying the architecture, it is converted into a "flow map" that predicts average velocity. After fine-tuning for a few dozen epochs, it generates high-quality images on ImageNet 256×256 with FID=2.16 (1-step) / 1.51 (4-steps), achieving over 100x acceleration compared to the original flow model.

Delay Flow Matching

The authors replace the Ordinary Differential Equation (ODE) underlying Flow Matching (FM) with a Delay Differential Equation (DDE). By making the vector field dependent on historical states, the framework naturally supports trajectory crossings, precise transfer between heterogeneous distributions, and modeling of delayed dynamical systems. It outperforms the ODE-based FM on synthetic data, single-cell trajectory inference, and image generation.

DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models

DeLeaker dynamically reweights attention maps during the denoising process of DiT text-to-image models, suppressing cross-entity attention while reinforcing self-identity alignment. This training-free and input-free method mitigates "semantic leakage." The paper also introduces SLIM, the first dataset dedicated to semantic leakage, along with a VLM-based evaluation framework.

DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

Addressing the sparse reward problem in Flow Matching + GRPO alignment: this work proposes using step-wise reward gains from ODE denoising predictions of intermediate latents as dense rewards. It adaptively adjusts the time-step-specific noise injection of the SDE sampler based on these dense rewards to calibrate the exploration space, outperforming Flow-GRPO in human preference alignment, compositional generation, and text rendering tasks.

DeRaDiff: Denoising Time Realignment of Diffusion Models

DeRaDiff transfers "decoding-time realignment" from language models to diffusion models: by aligning for only a single run, the model can simulate an aligned version trained with any KL regularization strength online using a scalar \(\lambda\) during sampling, thereby eliminating expensive hyperparameter sweeps for regularization strength.

Designing Rules to Pick a Rule: Aggregation by Consistency

Addressing the challenge of choosing among various rank aggregation rules (Borda, plurality, veto, etc.) with diverse pros and cons, this paper proposes the "Rule Picking Rule" (RPR) framework and a specific instantiation called AbC. By randomly splitting voters into two groups and selecting the rule that yields the most consistent rankings across both halves, AbC automatically identifies the most suitable aggregation rule for any given dataset without prior commitment to specific axioms or generative models.
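
The split-and-compare idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' AbC implementation: consistency is measured with Kendall tau distance, ties are broken by candidate label, and the result is summed over several random splits. All function names are hypothetical.

```python
import random
from itertools import combinations

def scores_to_ranking(scores):
    """Sort candidates by descending score; break ties by label."""
    return sorted(scores, key=lambda c: (-scores[c], c))

def borda(ballots):
    m = len(ballots[0])
    scores = {c: 0 for c in ballots[0]}
    for ballot in ballots:
        for pos, cand in enumerate(ballot):
            scores[cand] += m - 1 - pos
    return scores_to_ranking(scores)

def plurality(ballots):
    scores = {c: 0 for c in ballots[0]}
    for ballot in ballots:
        scores[ballot[0]] += 1  # one point for each first place
    return scores_to_ranking(scores)

def veto(ballots):
    scores = {c: 0 for c in ballots[0]}
    for ballot in ballots:
        scores[ballot[-1]] -= 1  # one penalty for each last place
    return scores_to_ranking(scores)

def kendall_tau_distance(r1, r2):
    """Number of candidate pairs ordered differently by r1 and r2."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    return sum(1 for a, b in combinations(r1, 2)
               if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0)

def pick_rule_by_consistency(ballots, rules, n_splits=20, seed=0):
    """Pick the rule whose output disagrees least across random halves."""
    rng = random.Random(seed)
    totals = {name: 0 for name in rules}
    for _ in range(n_splits):
        shuffled = ballots[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        left, right = shuffled[:half], shuffled[half:]
        for name, rule in rules.items():
            totals[name] += kendall_tau_distance(rule(left), rule(right))
    best = min(totals, key=lambda name: (totals[name], name))
    return best, totals

RULES = {"borda": borda, "plurality": plurality, "veto": veto}
example_ballots = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"],
                   ["c", "a", "b"], ["a", "b", "c"], ["b", "c", "a"]]
best_rule, disagreement = pick_rule_by_consistency(example_ballots, RULES)
```

Each ballot is a full ranking, best candidate first; `disagreement` reports the summed distance per rule so the choice can be inspected.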

Detecting and Mitigating Memorization in Diffusion Models through Anisotropy of the Log-Probability

This paper demonstrates that norm-based memorization detection metrics are only effective under isotropic log-probability distributions and fail in low-noise anisotropic regions. It proposes a denoising-free detection metric combining high-noise norm and low-noise angular alignment (cosine similarity), which outperforms existing denoising-free methods on SD v1.4/v2.0 and is over \(5\times\) faster.

Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value

This paper points out that the optimal loss value of diffusion models is not 0 but an unknown positive constant. Consequently, a "high loss" fails to distinguish between "intrinsically hard-to-fit data" and "insufficient model capacity." The authors derive a closed-form solution for this optimal loss, design a scalable estimator (cDOL) for large datasets, and utilize it to diagnose diffusion training, design superior training schedules (improving FID by 2%–25% on CIFAR-10/ImageNet), and make diffusion scaling laws more consistent with power laws.

DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

This paper proposes DiffInk, the first Latent Diffusion Transformer framework designed for full-line handwriting generation. It comprises InkVAE, which learns a structured latent space through dual regularization (OCR + style classification), and InkDiT, which performs conditional denoising within this latent space. DiffInk significantly outperforms the previous SOTA on Chinese handwriting generation (AR 94.38% vs. 91.48%) with an 800× speedup.

DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities

DiffSDA uses a diffusion-based probabilistic framework to decompose video, audio, and time-series data into "static factors" and "dynamic factors" without supervision. It achieves disentanglement using only a single score matching loss (instead of the usual array of regularization terms in VAEs/GANs). It is the first to achieve high-quality swapping, zero-shot transfer, and multi-factor exploration on real high-resolution videos.

DiffSparse: Accelerating Diffusion Transformers with Learned Token Sparsity

DiffSparse reformulates token cache acceleration for Diffusion Transformers as a differentiable optimization problem of "allocating sparsity rates per layer and time step under a fixed compression rate." It uses a learnable sparsity cost predictor to output a cost matrix, solves for the global optimal allocation via dynamic programming, and employs a two-stage training process to eliminate "full-step computation" required by traditional methods. On PixArt-α, it saves 54% GFLOPs while surpassing the original model's FID.

Diffusion & Adversarial Schrödinger Bridges via Iterative Proportional Markovian Fitting

This paper reveals that the "alternating forward-backward" heuristic used in practice to stabilize IMF training implicitly involves IPF iterations. By unifying IMF and IPF into IPMF (Iterative Proportional Markovian Fitting), the authors provide the first convergence proof for bidirectional IMF and transform the "initial coupling" into a tunable knob to navigate the trade-off between generation quality and input-output similarity.

Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models

Diffusion Blend is proposed to achieve multi-preference alignment by blending the backward diffusion processes of multiple reward-finetuned models at inference time. DB-MPA supports arbitrary linear combinations of rewards; DB-KLA enables dynamic KL regularization control; and DB-MPA-LS eliminates inference overhead through stochastic LoRA sampling. The method theoretically proves error bounds for the blending approximation and approaches the MORL oracle upper bound in experiments.

Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

The authors propose SQDF (Soft Q-based Diffusion Finetuning), which fine-tunes diffusion models within a KL-regularized RL framework using a training-free differentiable soft Q-function estimation and reparameterized policy gradients. Combined with three innovative components—a discount factor, consistency models, and an off-policy replay buffer—it effectively mitigates reward over-optimization while optimizing target rewards, maintaining sample naturalness and diversity.

Diffusion Negative Preference Optimization Made Simple

Addressing the cumbersome practice of "training two models + weight merging" for explicit negative preference modeling in diffusion alignment, this paper proposes Diff-SNPO. It utilizes the inherent conditional/unconditional branches of CFG as outlets for positive/negative preferences within a single network. By adapting a bounded objective from Bounded DPO, it resolves the "progressive blurring" issue of naive approaches, outperforming dual-model Diff-NPO on Pick-a-Pic v2 with half the computational cost.

Diffusion Transformers with Representation Autoencoders

The long-used VAE in latent diffusion is replaced with a "frozen pretrained representation encoder (DINOv2 / SigLIP2 / MAE) + a trained lightweight ViT decoder." By implementing three specific modifications for high-dimensional latents, the Diffusion Transformer is successfully adapted, achieving an unconditional FID of 1.51 and a guided FID of 1.13 on ImageNet 256×256. The convergence speed is 47× faster than SiT and 16× faster than REPA.

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

DiffusionNFT proposes a novel online RL paradigm for diffusion models: instead of performing policy optimization on the reverse sampling process (e.g., GRPO), it utilizes a contrastive training approach on the forward process via a flow matching objective for positive and negative samples. This defines an implicit policy improvement direction, achieving 3-25× speedups over FlowGRPO while eliminating the need for CFG.

Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild

The paper proposes DrPose, which enhances 3D human reconstruction quality in challenging/acrobatic poses by maximizing PoseScore (joint consistency between multi-view latent images and GT 3D poses) via direct reward fine-tuning and KL regularization. It utilizes the DrPose15K dataset (synthesized from Motion-X poses and the MIMO video generator) to bridge the gap in 3D human data diversity.

Directional Textual Inversion for Personalized Text-to-Image Generation

This paper discovers that token embeddings learned by Textual Inversion (TI) suffer from "norm inflation," leading to decreased text alignment in complex prompts. It proposes Directional Textual Inversion (DTI), which fixes the embedding norm to an in-distribution scale and optimizes only the direction on the unit hypersphere using Riemannian SGD. Combined with a von Mises-Fisher prior, this method significantly improves prompt faithfulness.
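
The direction-only update can be illustrated with a small sketch of Riemannian SGD on the unit sphere: project the Euclidean gradient onto the tangent space at the current direction, take a step, then renormalize. The toy quadratic loss, the learning rate, and all function names are assumptions made for illustration; the von Mises-Fisher prior mentioned above is omitted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def riemannian_sgd_step(u, euclid_grad, lr):
    """One step on the unit sphere: tangent projection, then retraction."""
    radial = dot(euclid_grad, u)
    tangent = [g - radial * x for g, x in zip(euclid_grad, u)]
    return normalize([x - lr * t for x, t in zip(u, tangent)])

def optimize_direction(target, fixed_norm, steps=200, lr=0.1):
    """Minimize 0.5 * ||fixed_norm * u - target||^2 over unit vectors u.

    The embedding norm is held at fixed_norm; only the direction u moves.
    """
    u = normalize([1.0] * len(target))
    for _ in range(steps):
        embedding = [fixed_norm * x for x in u]
        grad_embedding = [e - t for e, t in zip(embedding, target)]
        # Chain rule: gradient with respect to u is fixed_norm * dL/d(embedding).
        grad_u = [fixed_norm * g for g in grad_embedding]
        u = riemannian_sgd_step(u, grad_u, lr)
    return u

direction = optimize_direction(target=[0.0, 3.0, 4.0], fixed_norm=1.0)
```

In this toy setup the learned direction aligns with the target direction while the embedding norm never changes, which is the property the summary attributes to DTI.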

Discrete Adjoint Matching

This paper proposes Discrete Adjoint Matching (DAM), which derives adjoint variables on discrete state spaces from a purely statistical perspective (rather than control theory). By generalizing continuous Adjoint Matching to discrete generative models based on Continuous-Time Markov Chains (CTMC), it achieves effective fine-tuning of diffusion-based LLMs (LLaDA-8B), increasing accuracy on Sudoku from 11.5% to 89.2%.

Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching

Given a pre-trained discrete flow matching/diffusion model and the density ratio between target and source distributions, this paper derives exact transition rate guidance formulas. This reduces the sampling overhead from multiple forward passes per step to a single forward pass and unifies energy guidance, classifier guidance, and RLHF preference alignment into one framework.

Discrete Variational Autoencoding via Policy Search

The training of discrete VAE encoders is reformulated as a KL-regularized policy search problem. By using natural gradients of a non-parametric target distribution to update the parametric encoder through weighted maximum likelihood, this approach completely bypasses Gumbel-Softmax, straight-through estimators, and backpropagation through sampling paths. This allows autoregressive discrete encoders to train stably on high-dimensional data like ImageNet, outperforming quantization-based methods.

DistillKac: Few-Step Image Generation via Damped Wave Equations

The telegrapher equation (damped wave equation) and its stochastic Kac representation are proposed as the foundation for generative probability flows to replace the Fokker-Planck equation. This framework achieves finite-speed propagation, and an endpoint distillation method is introduced for few-step generation, achieving FID=4.14 in 4 steps and FID=5.66 in 1 step on CIFAR-10.

Diverse Text-to-Image Generation via Contrastive Noise Optimization

This paper proposes Contrastive Noise Optimization (CNO), a pre-processing approach that enhances the generation diversity of diffusion models by imposing an InfoNCE contrastive loss on the initial noise within the Tweedie denoising prediction space. It maintains fidelity without modifying the sampling process or the model itself.

Do We Need All the Synthetic Data? Targeted Image Augmentation via Diffusion Models

TADA shifts away from expanding the entire training set by 10–30x using diffusion models. Instead, it identifies the 30–40% "slow-learning" samples that are difficult to learn early in training and selectively augments them using real-image-guided diffusion to generate synthetic images that "preserve semantic features while replacing noise." Theoretical and experimental results demonstrate that augmenting only this subset is more effective than full-set augmentation, enabling SGD to outperform SAM on CIFAR-100/TinyImageNet.

Does FLUX Already Know How to Perform Physically Plausible Image Composition?

The paper proposes SHINE, a training-free image composition framework. By utilizing three components—Manifold-Steered Anchor Loss, Degradation-Suppression Guidance, and Adaptive Background Blending—it leverages the inherent physical priors of pre-trained T2I models (such as FLUX) to achieve high-quality object insertion under complex lighting conditions (shadows, water reflections, etc.).

DoFlow: Flow-based Generative Models for Interventional and Counterfactual Forecasting

This paper proposes DoFlow, a causal generative model based on Continuous Normalizing Flows (CNF) that unifies observational, interventional, and counterfactual time series forecasting on a causal DAG. It also enables anomaly detection through explicit likelihood and demonstrates effectiveness on synthetic and real-world medical data.

DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

DragFlow is the first framework to introduce the strong generative priors of FLUX (DiT) into drag editing. It replaces traditional point-level supervision with region-level affine supervision, combined with gradient mask hard constraints and adapter-enhanced inversion, significantly improving the quality of drag editing.

Dragging with Geometry: From Pixels to Geometry-Guided Image Editing

GeoDrag incorporates the 3D perspective rule "near pixels move more, far pixels move less" into drag-based image editing. By using a unified displacement field that encodes both 3D geometry (depth) and 2D planar priors, it achieves structure-consistent dragging in a single latent-space forward pass. It utilizes Voronoi partitioning to resolve cancellation issues in multi-point dragging, improving drag accuracy (DAI) by 1.4x and Mean Distance (MD) by 1.1x on DragBench compared to the second-best methods, all without requiring LoRA warm-up.

Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

This paper identifies a responsibility imbalance in current unified multimodal models where the understanding module acts merely as a translator, forcing the generative module to simultaneously serve as both "designer" and "painter." By constructing the DIM dataset (14M long-context image-text pairs + 233K CoT editing blueprints), the design responsibility is shifted to the understanding module, allowing a 4.6B parameter model to outperform models five times its size.

Dual-Path Condition Alignment for Diffusion Transformers

DUPA replaces the representation alignment in REPA (which uses external vision encoders to label noisy images) with an unsupervised self-alignment mechanism. By independently noising the same image twice and aligning the two sets of conditional features extracted by the model itself, it requires no external images, parameters, or additional compute. On ImageNet 256×256, it achieves FID=1.46 in only 400 epochs, outperforming all methods that do not rely on external supervision.

Dual-Solver: A Generalized ODE Solver for Diffusion Models with Dual Prediction

The paper proposes Dual-Solver, which generalizes diffusion model multi-step samplers through three sets of learnable parameters (prediction type interpolation \(\gamma\), integration domain selection \(\tau\), and residual adjustment \(\kappa\)). By using the classification loss of a frozen pretrained classifier (MobileNet/CLIP) to learn parameters without requiring teacher trajectories, it outperforms methods like DPM-Solver++ in the low NFE range of 3-9.

Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer

To be supplemented after a thorough reading of the paper.

Dynamic Classifier-Free Diffusion Guidance via Online Feedback

This paper replaces the static classifier-free guidance scale in diffusion models with a dynamic schedule selected online at each step. By using lightweight latent space evaluators to score candidate CFG scales during each reverse diffusion step and greedily selecting the optimal value, the method simultaneously improves text alignment, visual quality, text rendering, and counting capabilities with negligible additional sampling cost.

Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

This paper proposes T2I-CoReBench, the first comprehensive benchmark to systematically evaluate both the Composition and Reasoning capabilities of T2I models. It covers 12 evaluation dimensions, 1080 high-difficulty prompts, and approximately 13,500 checklist questions. A large-scale evaluation of 38 models reveals that reasoning capabilities lag far behind compositional ones, identifying reasoning as the core bottleneck in current T2I generation.

EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model

EchoGen introduces "subject-driven generation" to the Visual Autoregressive (VAR/Infinity) framework for the first time. By using a dual-path injection mechanism—a semantic path and a content path—the model decouples subject "identity" from "details." It achieves fidelity comparable to or better than diffusion-based methods on DreamBench, while reducing sampling latency from 10s+ to 0.5–5.2s.

Edit-Based Flow Matching for Temporal Point Processes

The paper proposes EDITPP, which models the generation of Temporal Point Processes (TPP) as an edit flow on a Continuous-Time Markov Chain (CTMC). By using three types of atomic edits—insertion, deletion, and substitution—it transports noise sequences to target event sequences. It achieves or approaches SOTA performance in both unconditional generation and conditional forecasting tasks while reducing edit steps and significantly accelerating sampling.

EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

This work constructs EditReward-Data, a high-quality dataset containing 200K expert-annotated preference pairs, and trains the EditReward model. This model achieves SOTA human alignment across multiple image editing benchmarks and significantly improves downstream editing model performance when used as a data filter.

EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

The authors propose the first systematic "benchmark evaluation → reward model → reinforcement learning training" pipeline for image editing: constructing the EditReward-Bench benchmark, training the EditScore series of reward models (7B-72B, outperforming GPT-5), and successfully applying it to Online RL training to significantly enhance the performance of image editing models.

EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing

EdiVal-Agent decomposes multi-turn image editing evaluation into object decomposition, object state tracking, instruction generation, and tool-assisted scoring. It utilizes three metrics—EdiVal-IF, EdiVal-CC, and EdiVal-VQ—to provide a fine-grained assessment of whether the model correctly edits targets, preserves unedited content, and maintains visual quality.

Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo

This work proposes a provable version of Annealed Langevin Monte Carlo (ALMC): starting with a warm start on a strongly convex objective that considers only "measurement consistency," it then anneals along the "posterior path of the noisy prior." This achieves both "KL proximity to the noisy posterior" and "Fisher proximity to the true posterior" within polynomial time.

Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization

This paper transforms the "projection direction selection" in Sliced Wasserstein distance from fixed low-discrepancy sampling into a learnable Bayesian Optimization process. It proposes four plug-and-play strategies (BOSW/RBOSW/ABOSW/ARBOSW), achieving or approaching SOTA in multiple SW-in-the-loop tasks without modifying downstream losses or gradient formulations.
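
For context, the quantity being optimized can be sketched with the plain Monte Carlo estimator that draws projection directions uniformly at random. This is the baseline that direction-selection strategies aim to improve on, not the BOSW family itself. The sketch assumes equal-sized point sets and squared Euclidean ground cost; the function names are hypothetical.

```python
import math
import random

def random_direction(dim, rng):
    """Uniform direction on the unit sphere via a normalized Gaussian."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def wasserstein2_sq_1d(a, b):
    """Squared 2-Wasserstein distance between equal-sized 1D samples.

    In one dimension the optimal coupling matches sorted values.
    """
    a, b = sorted(a), sorted(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo sliced 2-Wasserstein distance with random directions."""
    assert len(xs) == len(ys), "sketch assumes equal sample sizes"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(dim, rng)
        px = [sum(t * v for t, v in zip(theta, x)) for x in xs]
        py = [sum(t * v for t, v in zip(theta, y)) for y in ys]
        total += wasserstein2_sq_1d(px, py)
    return math.sqrt(total / n_projections)

points_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
points_b = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
sw_distance = sliced_wasserstein(points_a, points_b)
```

A learned or optimized direction sampler would replace `random_direction` while leaving the one-dimensional transport step unchanged.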

Efficient Zero-shot Inpainting with Decoupled Diffusion Guidance

This paper proposes DING (Decoupled INpainting Guidance), which decouples denoiser inputs from state variables in likelihood guidance to construct precisely samplable Gaussian posterior transitions, achieving faster, memory-efficient, and higher-quality zero-shot image inpainting without any task-specific fine-tuning.

Eliminating VAE for Fast and High-Resolution Generative Detail Restoration

By replacing the VAE encoder and decoder with ×8 pixel-(un)shuffle, this work converts Latent Diffusion Super-Resolution (GenDR) into Pixel-space Super-Resolution (GenDR-Pix). Combined with multi-stage adversarial distillation and a PadCFG inference strategy, it achieves a 2.8× speedup and 60% VRAM savings while maintaining negligible visual degradation, enabling 4K image restoration in under 1 second with only 6GB of VRAM.
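
The ×8 pixel-(un)shuffle operation is a lossless rearrangement between spatial resolution and channels. Below is a minimal pure-Python sketch for a single-channel image stored as nested lists. The channel ordering is one reasonable convention chosen for illustration, and the function names are hypothetical rather than taken from GenDR-Pix.

```python
def pixel_unshuffle(image, r):
    """Fold each r x r spatial block into r*r channels.

    image: H x W nested list. Returns (r*r) x (H/r) x (W/r).
    Channel k = i * r + j holds the pixel at offset (i, j) in each block.
    """
    h, w = len(image), len(image[0])
    assert h % r == 0 and w % r == 0, "height and width must be divisible by r"
    return [[[image[y * r + i][x * r + j] for x in range(w // r)]
             for y in range(h // r)]
            for i in range(r) for j in range(r)]

def pixel_shuffle(channels, r):
    """Inverse of pixel_unshuffle: spread r*r channels back into space."""
    hh, ww = len(channels[0]), len(channels[0][0])
    image = [[0] * (ww * r) for _ in range(hh * r)]
    for k, plane in enumerate(channels):
        i, j = divmod(k, r)
        for y in range(hh):
            for x in range(ww):
                image[y * r + i][x * r + j] = plane[y][x]
    return image

# A 16 x 16 toy image becomes 64 channels of 2 x 2 at factor 8.
toy_image = [[row * 16 + col for col in range(16)] for row in range(16)]
folded = pixel_unshuffle(toy_image, 8)
restored = pixel_shuffle(folded, 8)
```

Because the rearrangement is exactly invertible, it has no learned parameters and discards no information, unlike a VAE encoder and decoder.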

Embracing Discrete Search: A Reasonable Approach to Causal Structure Learning

This paper proposes FLOP (Fast Learning of Order and Parents), a score-based causal discovery algorithm for linear models. By employing fast parent selection and iterative Cholesky score updates, it significantly reduces runtime, making Iterative Local Search (ILS) feasible. It achieves near-perfect graph recovery on standard causal discovery benchmarks, re-establishing the validity of discrete search in causal discovery.

Enhanced Generative Model Evaluation with Clipped Density and Coverage

This paper proposes Clipped Density and Clipped Coverage metrics for evaluating generative models. By clipping single-sample contributions, limiting the radius of outlier nearest-neighbor spheres, and performing linear calibration, fidelity and coverage scores are made robust against outliers and interpretable as the "equivalent proportion of good samples."

Entering the Era of Discrete Diffusion Models: A Benchmark for Schrödinger Bridges and Entropic Optimal Transport

This paper presents the first evaluation benchmark for Schrödinger Bridge (SB) / Entropic Optimal Transport (EOT) in discrete space. It uses CP decomposition to construct distribution pairs with analytically known optimal solutions and also proposes three new algorithms: DLightSB, DLightSB-M, and α-CSBM.

Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance

ERK-Guid uses the step-difference error of embedded Runge-Kutta solvers as a guidance signal to adaptively correct local truncation error (LTE) in stiff regions, enhancing diffusion model sampling quality without requiring additional network evaluations.
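
The error signal mentioned here can be illustrated with the simplest embedded pair, Heun-Euler, where the difference between the second-order and first-order updates estimates the local error from function evaluations the higher-order step already needs. This is a generic numerical sketch on a scalar ODE. How ERK-Guid turns the error into guidance is not specified in this summary and is not modeled; the function name is hypothetical.

```python
def heun_euler_step(f, t, x, h):
    """One embedded Heun-Euler step for dx/dt = f(t, x).

    Returns the second-order (Heun) update and the embedded error
    estimate, i.e. the difference between the Heun and Euler updates.
    Both share the same two evaluations of f.
    """
    k1 = f(t, x)
    euler = x + h * k1                # first-order solution
    k2 = f(t + h, euler)
    heun = x + 0.5 * h * (k1 + k2)    # second-order solution
    error_estimate = heun - euler     # equals 0.5 * h * (k2 - k1)
    return heun, error_estimate

def decay(t, x):
    """Toy dynamics dx/dt = -x (assumed for illustration)."""
    return -x

x_next, err = heun_euler_step(decay, t=0.0, x=1.0, h=0.1)
```

A large `err` flags a step where the trajectory bends sharply, which is the kind of region the summary describes as stiff.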

Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

SpatialGenEval is proposed as a benchmark covering 10 spatial sub-domains through 1,230 long, information-dense prompts. It systematically evaluates 23 SOTA T2I models, revealing that spatial reasoning is the primary bottleneck. Additionally, the SpatialT2I dataset is constructed to achieve data-centric improvements in spatial intelligence.

Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model

This paper proposes ECAD (Evolutionary Caching to Accelerate Diffusion models), which uses genetic algorithms to automatically search for optimal cache scheduling strategies on the speed-quality Pareto frontier. Without modifying model parameters and using only 100 calibration prompts, it achieves 2-3x inference acceleration for diffusion models while maintaining or even improving generation quality.

Exploring the Design Space of Transition Matching

This paper conducts a large-scale systematic ablation (56 models of 1.7B params, 549 evaluations) on the "head" module in Transition Matching (TM), which has long been treated as a fixed attachment. It proposes a zero-overhead stochastic sampler and derives the optimal recipe, DTM++ (MLP head + log-normal time weighting + high-frequency stochastic sampling), achieving SOTA across aggregated metric rankings.

FACM: Flow-Anchored Consistency Models

FACM jointly trains Flow Matching (as an "anchor") and Consistency Models (as a "shortcut" goal) within a single model. By employing an "extended time interval" technique to decouple the two tasks into different time domains, it fundamentally resolves the training collapse issue in continuous-time consistency models. It achieves FID scores of 1.70/1.32 with NFE=1/2 on ImageNet \(256 \times 256\), respectively, and scales effectively to 14B text-to-image models.

Factuality Matters: When Image Generation and Editing Meet Structured Visuals

This is the first systematic study of structured visual content (charts, math formulas, diagrams, etc.) generation and editing. The work constructs a code-aligned training dataset of 1.3 million samples (including CoT reasoning annotations), a unified VLM+Diffusion architecture, and the StructBench benchmark containing 1700+ samples, revealing that reasoning capability is the key bottleneck for current models handling structured visuals.

FALCON: Few-step Accurate Likelihoods for Continuous Flows

FALCON introduces a "cyclic reversibility" regularization to few-step flow maps, enabling both fast sampling and low-cost accurate likelihood estimation within 4–16 steps. This reduces the inference cost of continuous flow Boltzmann Generators by two orders of magnitude and outperforms current state-of-the-art discrete normalizing flows.

FARI: Robust One-Step Inversion for Watermarking in Diffusion Models

FARI identifies a geometric asymmetry where the "curvature of the inversion trajectory is significantly lower than that of the generation trajectory." Based on this, it distills multi-step DDIM inversion into a single step and employs lightweight adversarial LoRA fine-tuning to specifically enhance watermark extraction robustness. With just 20 minutes of fine-tuning on a single A6000, one-step inversion surpasses 50-step DDIM in watermark verification robustness.

FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference

FastFlow is a training-free, plug-and-play inference acceleration framework for flow matching (FM). It approximates redundant denoising steps that are "nearly linear" at zero cost using finite difference extrapolation, and employs a Multi-Armed Bandit (MAB) to decide the safe jump length online at each step. It achieves over 2.6× acceleration on image/video generation and editing tasks with minimal quality degradation.

SSCP: Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

The authors propose Single-Step Completion Policy (SSCP), which compresses multi-step generative policies into single-step inference by predicting a "completion vector" (the normalized direction from any intermediate state to the target action) within a flow-matching framework. On D4RL, it performs on par with multi-step diffusion/flow policies while being 64× faster in training and 4.7× faster in inference, further extending to flatten hierarchical policies in GCRL.

Flow Along the \(K\)-Amplitude for Generative Modeling

This paper proposes K-Flow, which reinterprets the "time" in flow matching as a scaling parameter \(k\) that organizes frequencies/scales. By allowing generation to unfold along the K-amplitude (frequency bands/coefficients) space from low to high frequencies, the model achieves natural scale-controllable generation (omitting class conditions, frequency editing, training-free restoration) and obtains competitive FIDs in image generation.

Flow Map Learning via Non-Gradient Vector Flow

SGFlow utilizes a partial differential equation (PDE) identity containing only Jacobian-vector products (JVP)—without model inversion—to transform flow map learning into a non-conservative dynamics objective with stop-gradients. Training from scratch ensures the true flow map is the unique stationary point, achieving few-step sampling on CIFAR-10 with lower memory usage and superior FID.

Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning

The paper significantly improves the sample efficiency of offline-to-online RL under limited interaction budgets by injecting controllable noise during flow matching training to expand policy coverage and employing an entropy-guided sampling mechanism to dynamically balance exploration and exploitation during online fine-tuning.

Flow Matching with Semidiscrete Couplings

This paper replaces the "batch-wise \(n \times n\) optimal transport" in OT-guided Flow Matching with a one-time fit of an \(N\)-dimensional dual potential vector and an online Maximum Inner Product Search (MIPS) that assigns noise to data points. By removing the quadratic dependence on batch size \(n\) found in OT-FM, the method consistently outperforms standard FM and OT-FM across multiple datasets, conditional/unconditional tasks, and even single-step mean-flow generation.
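
The assignment step can be illustrated with a minimal sketch. It assumes a squared-Euclidean cost \(\tfrac{1}{2}\|x-y_j\|^2\) and a sign convention in which each noise sample goes to the data point minimizing cost minus potential; the function name `assign` and these conventions are illustrative assumptions, not taken from the paper. Under them, the rule becomes a single inner-product search over augmented vectors.

```python
import numpy as np

def assign(noise, data, potential):
    """Assign each noise sample to one data point via an inner-product search.

    Assumed rule: argmin_j 0.5*||x - y_j||^2 - g_j, which equals
    argmax_j <x, y_j> - 0.5*||y_j||^2 + g_j (the ||x||^2 term is constant in j).
    """
    # Fold the per-point offset into one extra key coordinate.
    offset = potential - 0.5 * (data ** 2).sum(axis=1)
    keys = np.concatenate([data, offset[:, None]], axis=1)
    # Queries get a constant 1 so the offset is added to every score.
    queries = np.concatenate([noise, np.ones((len(noise), 1))], axis=1)
    return np.argmax(queries @ keys.T, axis=1)

rng = np.random.default_rng(0)
noise = rng.normal(size=(32, 3))
data = rng.normal(size=(5, 3))
indices = assign(noise, data, np.zeros(5))
```

With an all-zero potential the rule reduces to nearest-neighbor assignment; a fitted potential shifts the cell boundaries so that each data point receives its intended share of noise.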

Flow Straight and Fast in Hilbert Space: Functional Rectified Flow

This paper rigorously extends rectified flow to infinite-dimensional separable Hilbert spaces, proving that its "marginal-preserving" property remains valid in functional spaces. It unifies functional flow matching and functional probability flow ODEs as nonlinear special cases within this framework while removing unverifiable measure-theoretic assumptions present in existing theories.

FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing

To address the non-smooth trajectories and poor source consistency of inversion-free flow editing (FlowEdit), FlowAlign employs an optimal control framework with a source similarity regularization at the terminal point. This decouples the editing velocity field into "semantic guidance" and "source consistency" terms, significantly improving source structure preservation with only 1 additional NFE and naturally supporting reverse ODE reconstruction.

FlowCast: Advancing Precipitation Nowcasting with Conditional Flow Matching

FlowCast is the first to apply Conditional Flow Matching (CFM) as an end-to-end probabilistic generative model for precipitation nowcasting. It learns a direct mapping from noise to data in a compressed latent space, surpassing diffusion models in predictive accuracy and probabilistic performance with significantly fewer sampling steps.

FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching

The FlowCast framework is proposed to introduce speculative decoding into Flow Matching models. By leveraging the local smoothness of the velocity field, current velocity predictions are used as zero-cost drafts to extrapolate future states. Selective skipping of redundant steps via MSE verification achieves >2.5× acceleration without quality loss.
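
The draft-then-verify idea can be sketched on a toy Euler sampler. Everything here is an assumption made for illustration (the function name `speculative_euler`, the `lookahead` and `tol` parameters, and the exact acceptance rule); it is not the paper's algorithm. The current velocity is reused to draft several future states, and a draft is accepted only while the velocity re-evaluated at the preceding draft stays within an MSE tolerance of the velocity that produced it.

```python
import numpy as np

def speculative_euler(velocity, x0, num_steps=8, lookahead=3, tol=1e-3):
    """Toy Euler sampler that drafts future states by reusing one velocity."""
    dt = 1.0 / num_steps
    x = np.asarray(x0, dtype=float)
    v = velocity(x, 0.0)
    step, rounds = 0, 1  # rounds counts sequential model calls
    while step < num_steps:
        k_max = min(lookahead, num_steps - step)
        # Draft: extrapolate k steps ahead with the same velocity (no model call).
        drafts = [x + k * dt * v for k in range(1, k_max + 1)]
        # Verify: in practice this would be one batched model call.
        checks = [velocity(d, (step + k) * dt)
                  for k, d in enumerate(drafts, start=1)]
        rounds += 1
        accepted = 1  # the first draft is an exact Euler step
        while accepted < k_max and np.mean((checks[accepted - 1] - v) ** 2) < tol:
            accepted += 1
        x = drafts[accepted - 1]
        v = checks[accepted - 1]  # velocity at the new state is already known
        step += accepted
    return x, rounds

x_final, n_rounds = speculative_euler(
    lambda x, t: np.array([1.0, -2.0]), np.zeros(2))
```

On a perfectly straight trajectory every draft is accepted, so the sampler needs fewer sequential rounds than Euler steps; with `tol=0.0` it falls back to plain Euler integration.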

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

This work constructs FLUX-Reason-6M, a reasoning-oriented text-to-image dataset containing 6 million FLUX-generated images and 20 million bilingual descriptions (core feature: "Generating Chain-of-Thought, GCoT" annotations). It also introduces PRISM-Bench, a fine-grained evaluation benchmark with seven tracks using advanced VLMs as judges, revealing the actual performance gap between open-source and closed-source text-to-image models in dimensions like text rendering and long-text instruction following.

Follow-Your-Preference: Towards Preference-Aligned Image Inpainting

Instead of proposing a new method, this paper returns to basics to systematically investigate fundamental questions regarding "preference alignment for image inpainting using DPO and public reward models"—whether reward models are reliable, how preference data scales, and the origins of reward hacking. It finds that simply ensembling and ranking 9 reward models eliminates individual biases and significantly surpasses SOTA, establishing a simple yet solid baseline for this emerging direction.
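
One simple way to ensemble several reward models by ranking is to average per-model ranks, which removes differences in score scale. The sketch below assumes this mean-rank rule and a hypothetical `mean_rank` helper; the paper's exact aggregation may differ.

```python
import numpy as np

def mean_rank(scores):
    """Average per-model ranks; scores has shape (n_models, n_candidates).

    Higher score is better, rank 0 is best, so a lower mean rank is better.
    """
    order = np.argsort(-scores, axis=1)  # candidate indices, best to worst
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(scores.shape[1])[None, :]
    return ranks.mean(axis=0)

# Three hypothetical reward models with very different score scales.
scores = np.array([[0.9, 0.1, 0.5],
                   [10.0, 30.0, 20.0],
                   [3.0, 1.0, 2.0]])
ranking = mean_rank(scores)
chosen, rejected = int(np.argmin(ranking)), int(np.argmax(ranking))
```

The `chosen` and `rejected` indices could then form a preference pair for DPO training.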

Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control

Follow-Your-Shape is proposed as a training-free and mask-free shape-aware image editing framework. It constructs a Trajectory Divergence Map (TDM) by calculating token-level velocity differences between inversion and editing trajectories to precisely locate the editing region. Combined with stage-based KV injection, it achieves significant shape transformations while strictly preserving the background.

Foresight Diffusion: Improving Sampling Consistency in Predictive Diffusion Models

Addressing the issue that diffusion models applied to predictive learning suffer from high variance between samples and poor alignment with ground truth trajectories, ForeDiff decouples "condition understanding" from "target denoising" into two independent streams. It utilizes a pre-trained deterministic predictor to extract representations for guiding generation, simultaneously improving prediction accuracy and sampling consistency in robotic video prediction and scientific spatiotemporal forecasting.

Forget Many, Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models

ScaPre utilizes a training-free and data-free closed-form solution to simultaneously address "update conflicts" and "collateral damage to similar concepts" in large-scale concept unlearning. It stably forgets 50 concepts within 120 seconds, unlearning 5 times more concepts than the strongest baseline without degrading generation quality.

Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster

Instead of struggling to make a factorized reverse process approximate a complex target, FLDD makes the forward noising process learnable. This ensures the induced reverse target is factorized and easily matched by existing samplers, reducing discrete diffusion sampling from hundreds of steps to 10 without changing the sampler or increasing inference overhead.

Free Lunch for Stabilizing Rectified Flow Inversion

Two training-free methods, PMI (Proximal-Mean Inversion) and mimic-CFG, are proposed to stabilize Rectified Flow inversion by performing proximal gradient correction of the velocity field toward its historical mean, achieving SOTA reconstruction and editing quality with fewer NFE on PIE-Bench.

From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

This paper quantifies the division of labor between CoT and RL in autoregressive T2I using "entropy"—CoT expands the exploration space while RL contracts it toward high-reward regions. Observing that reward is strongly negatively correlated with the mean and variance of image token entropy, the authors propose EG-GRPO: reallocating optimization budgets based on token entropy (low-entropy tokens follow KL for stability, while high-entropy tokens receive entropy rewards for structured exploration). It achieves SOTA on T2I-CompBench and WISE.

From Parameters to Behaviors: Unsupervised Compression of the Policy Space

This paper proposes unsupervised compression of the policy space based on the manifold hypothesis: an autoencoder is trained with a behavior reconstruction loss (rather than a parameter reconstruction loss) to compress the high-dimensional policy parameter space \(\Theta \subseteq \mathbb{R}^P\) into a low-dimensional latent behavior space \(\mathcal{Z} \subseteq \mathbb{R}^k\) (up to a 121801:1 compression ratio). Validated on environments such as Mountain Car, Reacher, Hopper, and HalfCheetah, it demonstrates that the intrinsic dimension of the behavior manifold depends on environmental complexity rather than network size. Furthermore, PGPE optimization in the latent space achieves faster convergence than SOTA methods like PPO and SAC in 7 out of 8 tasks.

From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation

TensorAR is proposed to upgrade standard AR image generation from next-token prediction to next-tensor prediction. By predicting overlapping tensors (sets of continuous tokens) at each step, subsequent tensors achieve iterative refinement through overlap with previous ones. A discrete diffusion noise mechanism is introduced to solve the training information leakage problem. As a plug-and-play module, it is compatible with AR models such as LlamaGen, Open-MAGVIT2, and Janus-Pro, consistently improving generation quality across class-to-image and text-to-image tasks.

GarmentGPT: Compositional Garment Pattern Generation via Discrete Latent Tokenization

GarmentGPT quantizes the continuous boundary curves of sewing patterns into discrete codebook tokens using RVQ-VAE. It then enables a fine-tuned VLM to autoregressively "select words" to generate these tokens, transforming pattern generation from low-level coordinate regression into high-level symbolic compositional reasoning. The work is supported by a million-scale dataset of paired real human portraits and patterns.

GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver

This paper proposes the Generalized Adversarial Solver (GAS): it adopts a "Generalized Solver" parameterization—learning additive corrections on theoretical solver coefficients and incorporating all historical points into a linear multi-step signature—and couples it with adversarial loss. GAS systematically reduces the FID of diffusion models below that of existing solver distillation methods in 4–10 step few-step sampling scenarios.

Gauge Flow Matching: Efficient Constrained Generative Modeling over General Convex Set and Beyond

This paper proposes Gauge Flow Matching (GFM), which uses an explicit bijective gauge mapping to equivalently transform constrained generation problems on arbitrary compact convex sets to a unit ball. This allows for low-complexity reflection/projection within the ball to strictly guarantee feasibility before mapping back to the original space. GFM achieves "100% constraint satisfaction + high quality + high speed" with overhead close to standard flow matching and extends to non-convex sets like star-convex and geodesically convex sets.

GenCompositor: Generative Video Compositing with Diffusion Transformer

GenCompositor introduces the task of "Generative Video Compositing," using a specially designed DiT pipeline to inject external foreground videos into background videos based on user-specified trajectories and scales. It maintains background consistency while inheriting foreground identity and dynamics, significantly outperforming alternative solutions in video harmonization, trajectory control, and ablation studies.

GenCP: Towards Generative Modeling Paradigm of Coupled Physics

GenCP is proposed to model coupled multi-physics simulation as a probability density evolution problem. It utilizes flow matching to learn conditional velocity fields from decoupled data and synthesizes coupled solutions via Lie-Trotter operator splitting during inference. This achieves "decoupled training, coupled inference" with theoretically controllable error guarantees.

GenDR: Lighten Generative Detail Restoration

GenDR is proposed as a lightweight one-step diffusion super-resolution (SR) model for generative detail restoration. It identifies a fundamental divergence between T2I and SR objectives (T2I requires multi-step + 4-channel latents, whereas SR needs fewer steps + 16-channel latents). The authors build a custom SD2.1-VAE16 base model (0.9B, using REPA for representation alignment to expand latent space without increasing model size) and propose CiD/CiDA consistency score identity distillation (integrating SR-specific priors into score distillation + adversarial learning + representation alignment). The minimalist pipeline, consisting only of UNet and VAE, achieves 77ms inference, outperforming existing state-of-the-art methods in both quality and efficiency.

Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds

The "Flow Map" framework from Euclidean space is generalized to arbitrary Riemannian manifolds by proposing Generalised Flow Maps (GFM). Using three self-distillation losses, geometric generative models capable of "one-step/few-step" sampling on manifolds are trained from scratch, unifying and enhancing consistency models, shortcut models, and MeanFlow for manifold settings.

Generalization of Diffusion Models Arises with a Balanced Representation Space

By analyzing the optimal solutions of a two-layer nonlinear ReLU denoising autoencoder (DAE), the authors give a unified characterization of memorization and generalization in diffusion models from the perspective of the representation space. The theoretical conclusions are consistently validated across EDM, DiT, and Stable Diffusion v1.4, and they lead to two practical applications: memorization detection and controllable editing.

Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training

The Generate Any Scene data engine is proposed, which systematically enumerates scene graphs based on a visual element taxonomy of 28K objects × 1.5K attributes × 10K relations and translates them into caption-VQA pairs. It supports four applications: self-improvement (SD1.5 +4%), targeted distillation (TIFA +10% with <800 data points), scene graph reward models (DPG-Bench +5% vs CLIP), and content moderation enhancement.

Generating Directed Graphs with Dual Attention and Asymmetric Encoding

Directo is proposed as the first directed graph generation model based on Discrete Flow Matching. It captures the directional dependencies of directed edges through a direction-aware dual attention mechanism and asymmetric positional encoding, while establishing a standardized evaluation framework for directed graph generation.

Generating Metamers of Human Scene Understanding

MetamerGen utilizes a dual-stream (foveal + peripheral) conditioned latent diffusion model to synthesize "scenes as understood by the human brain" from a small number of fixation points during free viewing. Through "same/different" behavioral experiments, the authors identify scene metamers—images judged as "identical" by human observers—to decompose which levels of visual features determine human scene understanding.

Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling

Image generation in Masked Autoregressive (MAR) models is decomposed into a two-stage sampling process: "slow checkerboard skeleton generation" followed by "fast single-step detail reconstruction." Combined with extra diffusion steps allocated to high-frequency detail tokens, this achieves a 3.72× speedup for MAR-H without training and with almost no loss in FID/IS.

Generative Blocks World: Moving Things Around in Pictures

The proposed method decomposes an image scene into a set of draggable 3D convex polytopes (blocks world). Users can directly move, scale, or rotate these primitives in 3D or move the camera. Real-time rendering is performed via a FLUX flow model conditioned on depth and texture hints, achieving geometrically consistent and identity-preserving 3D-aware image editing.

Generative Modeling from Black-Box Corruptions via Self-Consistent Stochastic Interpolants

This paper introduces Self-Consistent Stochastic Interpolants (SCSI), which recovers clean data distributions and enables training of generative models by iteratively learning a self-consistent transport of "observation distribution \(\to\) latent clean distribution \(\to\) re-corruption back to observation distribution," requiring only corrupted samples and a black-box simulator without clean samples or explicit likelihoods.

GeoDiv: Framework for Measuring Geographical Diversity in Text-to-Image Models

The GeoDiv framework is proposed to utilize the world knowledge of LLMs and VLMs to systematically evaluate the geographical diversity of T2I models across two dimensions: the Socio-Economic Visual Index (SEVI) and the Visual Diversity Index (VDI). It reveals systematic impoverishment biases in models against countries such as India and Nigeria.

Geometric Image Editing via Effects-Sensitive In-Context Inpainting with Diffusion Transformers

GeoEdit utilizes 3D reconstruction-driven geometric transformations + DiT-based in-context inpainting, coupled with a soft-biased Effects-Sensitive Attention specifically for lighting and shadows. This enables object translation, rotation, and scaling that are both geometrically precise and physically realistic.

GGBall: Graph Generative Model on Poincaré Ball

The authors propose GGBall, the first graph generative framework entirely based on the Poincaré ball model. By utilizing a Hyperbolic Vector Quantized Autoencoder (HVQVAE) and a Riemannian Flow Matching prior, it achieves SOTA performance in hierarchical and molecular graph generation, reducing the average generation error on hierarchical datasets by 18%.

GLASS Flows: Efficient Inference for Reward Alignment of Flow and Diffusion Models

This paper proposes GLASS (Gaussian Latent Sufficient Statistic) Flows, a new "flow-within-a-flow" sampling paradigm. By reparameterizing stochastic Markov transitions \(p_{t'|t}(x_{t'} | x_t)\) as inner ODE solving problems via Gaussian sufficient statistics (reusing pretrained denoisers without retraining), it achieves Feynman-Kac Steering without the trade-off between ODE efficiency and SDE randomness. This consistently outperforms Best-of-N ODE baselines on FLUX models, setting a new SOTA for inference-time reward alignment.

GoT-R1: Unleashing Reasoning Capability of Autoregressive Visual Generation with Reinforcement Learning

GoT-R1 transfers the success of "exploring reasoning strategies via Reinforcement Learning" (like GRPO in LLMs) to autoregressive image generation. By utilizing a dual-stage multi-dimensional reward scored by an MLLM to simultaneously supervise the "reasoning chain" and the "final image," the model significantly improves generation fidelity for compositional prompts involving multiple objects, precise spatial relations, and attribute binding.

Group Critical-token Policy Optimization for Autoregressive Image Generation

This paper proposes GCPO, which identifies truly "critical" tokens in autoregressive image generation from three perspectives: causal dependency, spatial structure of entropy gradients, and intra-group token diversity. By performing RLVR optimization on only 30% of these tokens with dynamic advantage weights, the method outperforms GRPO applied to the entire token sequence.

Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

This paper exposes a neglected evaluation pitfall: human preference metrics such as HPSv2 and ImageReward exhibit a strong preference for large guidance scales, allowing scores to be inflated by simply increasing CFG. The authors propose the GA-Eval framework with "effective guidance scale calibration" for fair comparisons, revealing that the "improvements" claimed by eight recent diffusion guidance methods are largely dividends of increased effective guidance scales.

Guidance Watermarking for Diffusion Models

This paper proposes a "Guidance Watermarking" method: using any off-the-shelf post-hoc watermark decoder to backpropagate gradients and guide the diffusion sampling trajectory. This converts any post-hoc watermarking scheme into an in-generation watermark at zero cost, without retraining the diffusion model or the decoder, while inheriting or even enhancing the decoder's robustness.

Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

This paper introduces the Recursive Likelihood Ratio (RLR) optimizer, which unifies gradient estimation for each step of the diffusion chain into a design space of "First-Order (FO) + Half-Order (HO) + Zero-Order (ZO)". By leveraging the inherent stochastic noise of diffusion models for likelihood ratio estimation, RLR achieves an unbiased, low-variance, and memory-controllable gradient estimator for fine-tuning, simultaneously addressing the structural bias of truncated BP and the high variance of Reinforcement Learning (RL).

HiCache: A Plug-in Scaled-Hermite Upgrade for Taylor-Style Cache-then-Forecast Diffusion Acceleration

HiCache discovers that the finite difference approximations of DiT features follow a multivariate Gaussian distribution. Based on this, it replaces the monomial Taylor basis in TaylorSeer with "Scaled Hermite Polynomials" and uses dual scaling to ensure numerical stability. It achieves a 5.55× speedup on FLUX.1-dev with image quality surpassing the original model, while offering a plug-and-play upgrade to existing caching methods with zero extra FLOPs.

Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion

This paper proposes HECRL, a hierarchical entity-centric offline goal-conditioned RL framework. By combining a value-based GCRL agent with a factored subgoal diffusion model, it achieves a 150%+ success rate improvement in multi-entity long-horizon tasks.

HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation

The paper proposes HierLoc, which remodels geolocation as an image-entity alignment problem in hyperbolic space. By replacing 5M+ image embeddings with 240k hierarchical geographic entity embeddings, it reduces the mean geodesic error by 19.5% and improves sub-region accuracy by 43% on the OSV5M dataset.

HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models

HiGS is a training-free, additional-forward-free sampling plugin for diffusion models. It corrects the sampling direction using the difference between current model predictions and the EMA of historical predictions, significantly improving image clarity, structure, and detail under low NFE or low CFG scales.
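
The history-based correction can be sketched in a few lines. The update form, the `weight` and `decay` parameters, and the function name `history_guided` are assumptions chosen for illustration, not the paper's exact formulation: each prediction is pushed away from an exponential moving average (EMA) of earlier predictions.

```python
import numpy as np

def history_guided(preds, weight=0.5, decay=0.8):
    """Correct each prediction using its difference from the EMA of past ones."""
    ema = None
    corrected = []
    for pred in preds:
        if ema is None:
            corrected.append(pred)  # no history yet, keep the raw prediction
            ema = pred.copy()
        else:
            corrected.append(pred + weight * (pred - ema))
            ema = decay * ema + (1.0 - decay) * pred
    return corrected

# Hypothetical model predictions at three consecutive sampling steps.
preds = [np.zeros(2), np.ones(2), np.ones(2)]
out = history_guided(preds)
```

The correction needs only stored past predictions, so it adds no extra forward passes, and it vanishes as the predictions stabilize.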

HOG-Diff: Higher-Order Guided Diffusion for Graph Generation

This paper proposes HOG-Diff, a graph diffusion framework that utilizes higher-order topological structures (e.g., rings, triangles, motifs) as generation guidance. By extracting higher-order skeletons via Cell Complex Filtering (CCF) combined with generalized OU diffusion bridges, it achieves "coarse-to-fine" progressive graph generation, reaching SOTA performance on 8 benchmarks for molecule and general graph generation.

Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

All inputs and outputs (including predictions after CFG) of continuous-token autoregressive (AR) image generation are constrained to a hypersphere of fixed radius. By replacing the diagonal Gaussian VAE with a Hyperspherical VAE, the scale degree of freedom that causes variance collapse is eliminated. This allows pure next-token raster-order AR to outperform diffusion and masked generative models for the first time at equivalent parameter scales (SphereAR-H 943M achieves FID 1.34 on ImageNet 256×256).
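
A minimal sketch of the fixed-radius constraint, assuming a plain rescaling onto the sphere and the standard classifier-free guidance combination; the helper names `to_sphere` and `cfg_on_sphere` and the radius value are illustrative, not from the paper.

```python
import numpy as np

def to_sphere(x, radius):
    """Rescale each vector along the last axis to a fixed norm."""
    return radius * x / np.linalg.norm(x, axis=-1, keepdims=True)

def cfg_on_sphere(cond, uncond, scale, radius):
    """Apply classifier-free guidance, then project back onto the sphere."""
    guided = uncond + scale * (cond - uncond)
    return to_sphere(guided, radius)

rng = np.random.default_rng(0)
cond = rng.normal(size=(4, 16))
uncond = rng.normal(size=(4, 16))
tokens = cfg_on_sphere(cond, uncond, scale=3.0, radius=4.0)
```

Guidance can change the norm of a prediction, so re-projecting after it keeps every token fed back to the model on the same sphere.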

I-DRUID: Layout to Image Generation via Instance-Disentangled Representation and Unpaired Data

Addressing two major issues in Layout-to-Image (L2I) generation—"attribute leakage" caused by entangled instance features in attention layers and "poor cross-scenario generalization" due to insufficient paired data—I-DRUID introduces an Instance-Disentangled Module + Disentangled Constraint to extract clean semantic features. It then employs prompt-only Reinforcement Learning without paired images to adapt the model to new scenarios via AI feedback. This synergistic approach achieves SOTA performance on both UNet and MM-DiT architectures.

Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning

Recall is presented as the first multi-modal guided attack framework against unlearned image generation models. By optimizing adversarial image prompts in the latent space (requiring only one reference image) and combining them with original text prompts to exploit the image-conditioning channel of diffusion models, it achieves an average attack success rate (ASR) of 65% to 97% across 10 SOTA unlearning methods. This significantly outperforms text-only attack methods and reveals the vulnerability of current unlearning mechanisms to image-modality attacks.

ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

ImageDoctor upgrades text-to-image (T2I) quality evaluation from "providing a single score" to "clinical diagnosis." Built on a Multimodal Large Language Model (MLLM), it follows a "look-think-predict" workflow to locate defect regions, perform reasoning, and output four-dimensional scores (alignment, aesthetics, plausibility, overall) along with pixel-level defect heatmaps. This dense feedback is integrated into DenseFlow-GRPO as a reward, improving T2I model preference alignment by approximately 10% compared to scalar rewards.

ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

ImagenWorld constructs an explainable image generation benchmark capable of "locating which object or region the model failed on" using 3.6K condition sets × 6 tasks × 6 domains and 20,000 fine-grained human annotations. It systematically reveals common failure patterns in local editing and text-dense content across 14 generative and editing models.

ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation

ImageRAG translates the RAG concept from LLMs to image generation: it first produces a draft using a T2I model, then uses a VLM's guided Chain-of-Thought to identify "incorrectly drawn or missing" concepts, retrieves reference images as needed, and feeds them back into the model. This process significantly improves the generation of rare and fine-grained concepts without requiring any additional training.

Implicit Inversion turns CLIP into a Decoder

Without training any generative decoder or fine-tuning CLIP, this work achieves text-to-image generation, style transfer, and image reconstruction by "inverting" a frozen CLIP image encoder. By utilizing frequency-aware Implicit Neural Representations (INR) to back-project images from text embeddings, the authors reveal significant untapped generative capabilities within discriminative models.

Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment (CODA)

The CODA framework is proposed to address slot entanglement and weak alignment in diffusion-based object-centric learning by introducing register slots to absorb residual attention, fine-tuning cross-attention projections, and incorporating a contrastive alignment loss. It significantly improves object discovery and compositional generation quality on both synthetic and real-world datasets.

Improving Classifier-Free Guidance in Masked Diffusion: Low-Dim Theoretical Insights with High-Dim Impact

This paper derives the exact effects of CFG on 1D/2D masked diffusion toy models that allow for analytical solutions. It discovers that the partition function \(Z_w\) in existing discrete CFG is erroneously coupled into the jump rates, leading to premature unmasking and quality degradation. The authors propose a "column normalization" fix (one line of code) and theoretically demonstrate that an increasing guidance schedule ("weak early, strong late") is the optimal approach for discrete diffusion.

Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation

This paper identifies that the root cause of minority class collapse in diffusion models trained on long-tail data is "model capacity being monopolized by majority classes." It proposes Capacity Manipulation (CM): using LoRA-like low-rank decomposition to explicitly split parameters into "general/majority" and "minority expert" components, then employing consistency and diversity losses to force minority class knowledge into the reserved capacity. It incurs no additional inference overhead and is orthogonal to existing methods.

Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies (UPO)

This paper proposes Unmasking Policy Optimization (UPO), which models the denoising process of Masked Diffusion Models as a KL-regularized MDP. By training a lightweight unmasking policy model using reinforcement learning to replace heuristic schedulers like max-confidence, the study demonstrates theoretically and experimentally that the learned policy generates samples closer to the true data distribution.

Inference-Time Scaling of Diffusion Models Through Classical Search

This work systematically transfers classical AI search (BFS/DFS global tree search + annealed Langevin MCMC local search) to the diffusion model inference stage. It jointly scales "local search" and "global search" for the first time, advancing the efficiency-performance Pareto frontier across image generation, long-horizon planning, and offline RL.

Inference-Time Scaling of Discrete Diffusion Models via Importance Weighting and Optimal Proposal Design

This paper introduces Sequential Monte Carlo (SMC) into the inference stage of discrete diffusion models. Through computable importance weights and near-optimal proposal designs, it enhances reward alignment, CFG sampling, and controllable generation for cross-modal, biological, and image tasks without retraining the base model.

Intention-Conditioned Flow Occupancy Models

InFOM is proposed to construct an intention-conditioned occupancy model using flow matching. By inferring latent intentions from data via variational inference, it enables RL pre-training on unlabeled data. It achieves a \(1.8\times\) median return gain and a \(36\%\) success rate improvement across 36 state-based and 4 visual tasks.

Interaction Field Matching: Overcoming Limitations of Electrostatic Models

The authors generalize Electrostatic Field Matching (EFM) into a framework of "arbitrary pairwise interaction fields" (IFM). By designing a specific field inspired by strong interaction between quarks, they ensure field lines are straight, non-leaking, and non-reversing, fundamentally solving EFM's issues of reverse field lines, out-of-bounds termination, and uncontrollable training volume.

Interleaving Reasoning for Better Text-to-Image Generation

This paper proposes Interleaving Reasoning Generation, which enables unified multimodal models to generate images following a trajectory of "text thinking \(\rightarrow\) initial image \(\rightarrow\) text reflection \(\rightarrow\) improved image." By training this process with six decomposed learning tasks in the IRGL-300K dataset, the model outperforms BAGEL self-CoT and other unified models on multiple T2I benchmarks, particularly improving instruction following, world knowledge, and detail quality.

Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

By coupling the "sampling trajectory" and "log-likelihood (cumulative divergence)" into the same flow map for joint distillation, F2D2 reduces the NFE for both sampling and likelihood evaluation in flow matching models from thousands to just a few steps. This achieves few-step exact likelihood evaluation for CNF/diffusion-type models for the first time.

JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation

This paper proposes JointDiff, a joint continuous-discrete diffusion framework that unifies Gaussian diffusion (for trajectories) and multinomial diffusion (for possession events) for the first time. It introduces the CrossGuid module to support Weak Possession Guidance (WPG) and text-guided semantic controllable generation, achieving SOTA performance in multi-agent trajectory generation for sports.

LapFlow: Laplacian Multi-scale Flow Matching for Generative Modeling

LapFlow decomposes images into Laplacian pyramid residuals and utilizes a unified Mixture-of-Transformers (MoT) with causal attention to generate all scales in parallel. It eliminates the explicit renoising bridges required by cascaded methods, achieving superior FID on CelebA-HQ and ImageNet with lower GFLOPs and faster inference.
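
For readers unfamiliar with Laplacian pyramid residuals, the sketch below shows the decomposition and its exact inverse on a single-channel image. It uses 2×2 average pooling and nearest-neighbor upsampling as simplifying assumptions; the paper's filters may differ.

```python
import numpy as np

def downsample(img):
    """2x2 average pooling; assumes even height and width."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Nearest-neighbor 2x upsampling."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Return residuals (fine to coarse) and the coarsest low-pass image."""
    residuals, current = [], img
    for _ in range(levels):
        low = downsample(current)
        residuals.append(current - upsample(low))  # detail lost by pooling
        current = low
    return residuals, current

def reconstruct(residuals, base):
    """Invert the pyramid by adding residuals back from coarse to fine."""
    current = base
    for res in reversed(residuals):
        current = upsample(current) + res
    return current

image = np.random.default_rng(0).normal(size=(16, 16))
residuals, base = laplacian_pyramid(image, levels=3)
restored = reconstruct(residuals, base)
```

Each residual holds only the detail missing at its scale, which is what makes generating the scales separately a natural fit.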

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

The authors propose rCM (score-regularized continuous-time consistency model), which extends continuous-time consistency distillation to 14B parameter text-to-image/video models for the first time. By combining forward divergence (consistency) and backward divergence (score distillation), the method matches the quality of DMD2 while preserving diversity, achieving 15-50× acceleration.

Latent Denoising Makes Good Tokenizers

This paper points out that modern generative models are essentially performing "reconstruction from destruction" (denoising). It proposes l-DeTok: during tokenizer training, interpolative noise and random masks are injected into the latent space, and the decoder is tasked with reconstructing the original image from the heavily corrupted latent. This ensures the produced latents are naturally aligned with downstream denoising objectives, consistently improving generation quality across six different generative models without requiring any semantic distillation.

Latent Diffusion Model without Variational Autoencoder

SVG is proposed to replace the VAE latent space with frozen DINOv3 self-supervised features for building diffusion models. By supplementing fine-grained details via a lightweight residual encoder, it achieves faster training, more efficient inference, and cross-task universal visual representations.

Latent Stochastic Interpolants

This paper proposes Latent Stochastic Interpolants (LSI), which utilizes a single ELBO objective derived from continuous time to bring the Stochastic Interpolants framework into an end-to-end jointly trained latent space. By optimizing the encoder, decoder, and the latent SI generative model together, LSI achieves FIDs comparable to pixel-space SI on ImageNet with significantly lower sampling FLOPs.

Latent Wavelet Diffusion for Ultra-High-Resolution Image Synthesis

LWD extracts spatial saliency from latent signals via wavelet energy maps and concentrates training loss on high-frequency regions using time-dependent binary masks. Combined with scale-consistent VAE fine-tuning, it enhances 2K–4K ultra-high-definition generation quality without architectural changes or additional inference overhead.

LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing

LaTo quantizes facial landmark coordinates directly into discrete tokens using VQ-VAE for injection into a DiT (rather than rendering them as images for a VAE). Combined with position-mapped embeddings and landmark-aware CFG, it achieves instruction-driven, fine-grained controllable face editing with strong identity preservation.

LayerSync: Self-aligning Intermediate Layers

LayerSync discovers that deep intermediate representations of diffusion Transformers can inherently serve as semantic teachers. Through parameter-free inter-layer cosine alignment, it encourages shallower layers to align with stronger representative layers, enhancing generation quality and accelerating training without relying on external models or extra data, and generalizes to image, audio, video, and human motion generation.
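
A parameter-free cosine alignment term between two layers could look like the sketch below. The function name, the `(tokens, dim)` layout, and the choice of a negative mean cosine similarity are assumptions for illustration. In an actual training framework the deeper features would presumably be treated as a fixed target, which this NumPy sketch does not model.

```python
import numpy as np

def cosine_alignment_loss(shallow, deep, eps=1e-8):
    """Negative mean per-token cosine similarity between two (tokens, dim) feature maps."""
    s = shallow / (np.linalg.norm(shallow, axis=-1, keepdims=True) + eps)
    d = deep / (np.linalg.norm(deep, axis=-1, keepdims=True) + eps)
    # Lower is better: -1 means every shallow token points the same way as its deep counterpart.
    return float(-(s * d).sum(axis=-1).mean())

features = np.array([[1.0, 0.0], [0.0, 2.0]])
orthogonal = np.array([[0.0, 1.0], [3.0, 0.0]])

loss_aligned = cosine_alignment_loss(features, features)
loss_opposite = cosine_alignment_loss(features, -features)
loss_orthogonal = cosine_alignment_loss(features, orthogonal)
```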

LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

LazyDrag replaces the fragile implicit point-matching mechanism in previous drag-based editing methods—which relied on attention similarity—with an "explicit correspondence map" constructed directly from drag instructions. This allows MM-DiT to perform stable editing under full-strength inversion for the first time, completely eliminating the need for test-time optimization (TTO) while unlocking high-fidelity completion and text-guided generation.

Learn to Guide Your Diffusion Model

This paper learns the manually set fixed guidance scale in Classifier-Free Guidance (CFG) as a function of the condition and the denoising time interval. The function is trained using self-consistency distribution matching. It achieves a better trade-off between sample quality, distribution matching, and prompt alignment in ImageNet, CelebA, and text-to-image generation compared to fixed CFG or limited interval guidance.
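
As a rough illustration of what is being learned, the sketch below shows the standard CFG combination with the guidance scale supplied by a function of the condition and timestep. The function names and the hand-written `toy_scale` schedule are hypothetical placeholders; the paper trains this function rather than fixing it by hand.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard CFG: move from the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def toy_scale(condition, t):
    """Hypothetical stand-in for a learned scale function; t in [0, 1], 1 = most noise."""
    base = 2.0 if condition == "hard_prompt" else 1.5
    return 1.0 + (base - 1.0) * t

eps_u = [0.0, 0.5]   # unconditional noise prediction (toy values)
eps_c = [1.0, 1.0]   # conditional noise prediction (toy values)
guided = cfg_combine(eps_u, eps_c, toy_scale("hard_prompt", t=1.0))
```

With a scale of 1 the combination reduces to the conditional prediction, and with a scale of 0 to the unconditional one, so a learned scale function interpolates and extrapolates between these two regimes per condition and per timestep.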

Learning a Distance Measure from the Information-Estimation Geometry of Data

The paper proposes the Information-Estimation Metric (IEM), a novel distance function induced by the geometry of data probability density. By comparing score vector fields across various noise levels to measure signal distance, the unsupervised IEM achieves performance comparable to supervised methods in predicting human perceptual judgments.

Learning an Image Editing Model without Image Editing Pairs

This paper proposes NP-Edit (No-Pair Edit), a training paradigm that requires no "before-after" image pairs. It unrolls a few-step diffusion generator during training, utilizes differentiable gradient feedback from a Vision-Language Model (VLM) to judge instruction following and content preservation, and employs Distribution Matching Distillation (DMD) to pull outputs back to the real image manifold. Under a 4-step sampling setting, it matches models trained on large-scale paired data and outperforms RL-based methods like Flow-GRPO that also use VLM rewards.

Learning AND-OR Templates for Compositional Representation in Art and Design

This paper extends the AND-OR Template from object recognition to scene composition in art and design. By employing a maximum entropy log-linear model to provide decomposable consistency scores and utilizing EM-style block-pursuit with semi-supervised structural expansion, the authors learn interpretable templates. These templates demonstrate lightweight, interpretable, and data-efficient structural priors in aesthetic classification, human preference alignment, photography guidance, and AIGC compositional constraints.

Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers

HyCa conceptualizes the evolution of latent features in Diffusion Transformers as a hybrid system where "different dimensions follow different ODEs." By offline selecting the most appropriate numerical solver for each cluster of dimensions to predict or reuse features, it achieves 5.5× to 6.2× near-lossless training-free acceleration on FLUX, HunyuanVideo, and Qwen-Image.

LLM2Fx-Tools: Tool Calling for Music Post-Production

LLM2Fx-Tools is the first framework to apply LLM tool calling to audio effect modules. It understands audio input through a multimodal LLM, utilizes CoT reasoning to select effect types, determines their order, and estimates parameters, achieving interpretable and controllable music post-production.

Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

The authors propose Locality-aware Parallel Decoding (LPD), which reduces the generation steps for \(256 \times 256\) images from 256 to 20 by flexibly parallelizing the autoregressive modeling architecture and employing locality-aware generation order scheduling, achieving at least a \(3.4\times\) reduction in latency.

Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection

HiRM proposes a concept erasure strategy that "decouples update location from erasure target"—it updates only the weights of the first layer of the CLIP text encoder while applying erasure supervision on the high-level semantic representations of the last layer. By misdirecting target concept representations toward random (HiRM-R) or semantic (HiRM-S) directions, it achieves efficient erasure of styles, objects, and nudity on UnlearnCanvas and NSFW benchmarks, with zero-shot transferability to the Flux architecture.

LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization

This work proposes the concept of "visual logic" and introduces LogiStory, a framework combining multi-agent planning with causal verification. It transforms multi-image story visualization from generating "beautiful isolated pictures" into a reasoning problem that "explicitly models causal coherence between characters, actions, and scenes," accompanied by the LogicTale benchmark with causal annotations.

Long-Text-to-Image Generation via Compositional Prompt Decomposition

PRISM "refracts" a lengthy descriptive prompt into several semantic components within the text representation space. It allows frozen pre-trained T2I models to independently denoise each component and adopts the concept conjunction of energy-based models to sum the noise predictions into a single compositional denoising step. This enables T2I models to render long paragraphs exceeding 500 tokens without fine-tuning the backbone or losing details.
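
The conjunction step can be sketched as a weighted sum of per-component noise predictions around an unconditional baseline, in the style of energy-based concept composition. Whether PRISM uses this exact weighting or an unconditional baseline is an assumption; the function and argument names are hypothetical.

```python
def compose_noise(eps_uncond, eps_components, weights):
    """Conjunction-style composition: baseline plus weighted per-component offsets."""
    out = list(eps_uncond)
    for eps_c, w in zip(eps_components, weights):
        for i, (e, u) in enumerate(zip(eps_c, eps_uncond)):
            out[i] += w * (e - u)
    return out

eps_uncond = [0.0, 1.0]                      # unconditional prediction (toy values)
eps_components = [[1.0, 1.0], [0.0, 3.0]]    # one prediction per prompt component
composed = compose_noise(eps_uncond, eps_components, weights=[1.0, 0.5])
```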

Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall

This paper identifies the "sampling wall" problem in discrete diffusion models (where categorical distribution information collapses into one-hot vectors after sampling) and proposes the Loopholing mechanism. By introducing a deterministic latent path to propagate rich distribution information, it reduces generation perplexity by up to 61%, significantly narrowing the gap with autoregressive models.

LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration

This paper proposes LVTINO, the first zero-shot video inverse problem solver based on Video Consistency Model (VCM) priors. By injecting auto-differentiation-free measurement consistency constraints into the VCM sampling process, it achieves superior perceptual quality and temporal consistency over frame-wise image methods across various video inverse problems (e.g., super-resolution, deblurring, inpainting) with minimal Neural Function Evaluations (NFE).

Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

To be added after further reading.

Market Games for Generative Models: Equilibria, Welfare, and Strategic Entry

This paper formalizes a three-layer model-platform-user market game to analyze the existence conditions of pure strategy Nash equilibrium, market structure, and social welfare impacts under generative model competition, while designing optimal entry strategies for model providers.

Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

This paper systematically reveals that "Massive Activations (MA)" in Diffusion Transformers (DiTs) are specifically responsible for local detail synthesis while having almost no effect on global semantics. Based on this finding, it proposes a training-free self-guidance strategy called Detail Guidance (DG), which utilizes a "degraded model after MA disruption" to reversely guide the original model toward generating more refined details.

MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

MeanCache shifts the feature caching of diffusion/Flow Matching from an "instantaneous velocity" perspective to an "average interval velocity" perspective. It reconstructs smoother average velocities from instantaneous ones using cached Jacobian-Vector Products (JVP) and determines when to cache and how long to reuse via a budget-constrained "peak-suppression shortest path" scheduler. It achieves speedups of 4.12× on FLUX.1, 4.56× on Qwen-Image, and 3.59× on HunyuanVideo, with image quality superior to existing caching methods.

Measurement Score-based Diffusion Model (MSM)

Instead of attempting to learn the "score of clean images," this method directly learns a "local measurement score" for subsampled, noisy data in the measurement domain. By aggregating these via random masks, the full measurement is reconstructed, allowing diffusion models to be trained entirely on degraded observations for both unconditional generation and linear inverse problems.

MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning

MILR migrates "reasoning-enhanced image generation" into a unified latent vector space shared by text and images. At test-time, it utilizes policy gradient (REINFORCE) in conjunction with image quality critics to jointly optimize intermediate representations of text/image tokens. Without modifying any model parameters, the method achieves SOTA performance across GenEval/T2I-CompBench/WISE, notably improving the base model by 80% on the knowledge-intensive WISE benchmark.

Mirror Flow Matching with Heavy-Tailed Priors for Generative Modeling on Convex Domains

Addressing constrained generative modeling on convex domains, this paper identifies two major issues: "log-barrier mirror maps induce heavy-tailed dual distributions" and "mismatch between Gaussian priors and heavy-tailed targets." It proposes Regularized Mirror Flow Matching with a Student-t prior, which ensures finite moments for the dual distribution and provides the first theoretical guarantee of polynomial tail bounds for space-time Lipschitz velocity fields and Wasserstein convergence rates.

Mitigating Noise Shift in Denoising Generative Models with Noise Awareness Guidance

The authors observe that the noise levels encoded in the intermediate states of diffusion/flow models systematically bias toward "larger" values during sampling (termed noise shift). They propose Noise Awareness Guidance (NAG)—a classifier-free guidance applied along the "noise condition" axis rather than the "class condition" axis. This pulls deviated trajectories back to the intended noise schedule, significantly enhancing generation quality.

Mitigating Semantic Collapse in Generative Personalization with Test-Time Embedding Adjustment

This paper identifies and characterizes the "Semantic Collapsing Problem" (SCP) in generative personalization—where the learned personalized token \(V^*\) expands in magnitude and shifts in direction within the embedding space, eventually overpowering all context in complex prompts. The authors propose a training-free Test-time Embedding Adjustment (TEA) to pull the magnitude and direction of \(V^*\) back toward the original semantic concept \(c\), significantly improving text-image alignment.
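
One way to picture a test-time adjustment of this kind is the sketch below, which rescales the personalized embedding to the norm of the reference concept and blends its direction toward that concept. The linear blend of unit vectors, the `alpha` parameter, and the function names are assumptions for illustration and may differ from the paper's actual TEA procedure.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def adjust_embedding(v_star, c, alpha=0.5):
    """Blend the direction of v_star toward c and rescale the result to the norm of c.

    Assumes v_star and c are non-zero and not exactly opposite in direction.
    """
    nv, nc = norm(v_star), norm(c)
    unit_v = [x / nv for x in v_star]
    unit_c = [x / nc for x in c]
    mixed = [(1 - alpha) * a + alpha * b for a, b in zip(unit_v, unit_c)]
    nm = norm(mixed)
    return [nc * x / nm for x in mixed]

v_star = [10.0, 0.0]   # inflated, drifted personalized embedding (toy values)
concept = [0.0, 1.0]   # original concept embedding (toy values)
adjusted = adjust_embedding(v_star, concept, alpha=0.5)
```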

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Addressing the issue where serial thinking-aware paradigms ("reasoning before drawing") degrade image quality due to reasoning error propagation, this paper proposes MMaDA-Parallel, a pure discrete diffusion parallel multimodal framework. It allows text and images to interact bidirectionally and generate synchronously across the entire denoising trajectory. By employing Parallel RL (ParaRL) to provide semantic rewards along the trajectory, cross-modal consistency is reinforced, improving Output Alignment by 6.9% over the SOTA open-source model Bagel on the self-constructed ParaBench.

Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

This paper proposes Mod-Adapter, a tuning-free multi-concept personalization method that predicts concept-specific modulation directions within the DiT modulation space. It achieves decoupled customized generation of objects and abstract concepts (pose, lighting, material, etc.), significantly outperforming existing methods in multi-concept personalization.

MOLM: Mixture of LoRA Markers

This paper proposes the MOLM watermarking framework, which reinterprets LoRA adapters as watermark markers. It embeds verifiable and robust watermarks into frozen generative models through a binary key-driven routing mechanism, eliminating the need for per-key retraining.

Monocular Normal Estimation via Shading Sequence Estimation

This paper proposes RoSE, a method that reformulates monocular normal estimation as a shading sequence estimation problem. It leverages image-to-video (I2V) generation models to predict shading sequences under multiple illuminations and then converts these sequences into normal maps via a simple least squares method, achieving SOTA performance on real-world benchmarks.
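
The final least-squares step can be sketched under a Lambertian assumption, where each shading value is the dot product of a known light direction and the unknown normal. The Lambertian model, the known light directions, and the function name are assumptions made for this example; the paper's exact formulation may differ.

```python
import numpy as np

def normals_from_shading(lights, shading):
    """lights: (K, 3) light directions; shading: (K, P) values per light and pixel.

    Solves lights @ g = shading in the least-squares sense, then normalizes each
    solution to unit length. Returns an array of shape (P, 3).
    """
    g, *_ = np.linalg.lstsq(lights, shading, rcond=None)  # shape (3, P)
    g = g.T
    return g / np.linalg.norm(g, axis=1, keepdims=True)

# Toy setup: four lights, two pixels with known ground-truth normals.
lights = np.array([[0.0, 0.0, 1.0],
                   [1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0],
                   [-1.0, -1.0, 1.0]])
true_normals = np.array([[0.0, 0.0, 1.0],
                         [0.6, 0.0, 0.8]])
shading = lights @ true_normals.T          # (K, P) synthetic shading sequence
estimated = normals_from_shading(lights, shading)
```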

MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement

MOSAIC reformulates multi-subject personalized generation as a "representation optimization" problem. By utilizing the SemAlign-MS dataset with dense semantic correspondence annotations, it employs an "Alignment Loss" to force point-to-point alignment between reference and target attention, and a "Disentanglement Loss" to push different subjects into orthogonal attention subspaces. This maintains high fidelity even with more than 4 reference subjects, avoiding the identity blending and attribute leakage collapse observed in existing methods beyond 3 subjects.

Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening

This paper proposes Motion Prior Distillation (MPD), an inference-time distillation method that distills motion residuals from the forward path into the backward path. This fundamentally resolves the motion prior conflict in time reversal sampling, enabling more coherent generative frame interpolation without additional training.

Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts

This paper proposes "Mixture of Subspaces with Mixture of Low-rank Gaussians" (MoLR-MoG) modeling, characterizing real image data as a union of multiple low-dimensional linear subspaces, with a Gaussian mixture residing within each subspace. This induces a nonlinear score function with an inherent MoE structure, theoretically reducing the estimation error to \(\sqrt{\sum_k n_k}\sqrt{\sum_k n_k d_k}/\sqrt{n}\) (escaping the curse of dimensionality) and proving local strong convexity for convergence guarantees. Empirically, it generates clear images using a network with 10× fewer parameters than a U-Net.

Multiplicative Diffusion Models: Beyond Gaussian Latents

This paper proposes Multiplicative Score-based Generative Models (MSGM), which replace the classical additive Gaussian noise in diffusion models with skew-symmetric multiplicative noise. This ensures the forward process converges to a non-Gaussian latent distribution that naturally aligns with the data while keeping the data norm distribution invariant, enabling more accurate generation of rare extreme events in heavy-tailed and anisotropic data.

MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

This paper proposes MVAR (Markovian Visual AutoRegressive), which reduces the attention calculation complexity of VAR models from \(\mathcal{O}(N^2)\) to \(\mathcal{O}(Nk)\) by introducing the scale Markov assumption (depending only on adjacent scales rather than all preceding scales) and spatial Markov attention (restricting the neighborhood size \(k\)). It achieves comparable or superior performance on ImageNet 256×256, reduces inference VRAM by 3.0-4.2×, and can be trained using only 8 RTX 4090 GPUs.

MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion

This paper proposes a new task called multi-view customization and designs the MVCustom framework. It uses a video diffusion backbone with dense spatio-temporal attention to achieve overall frame consistency, and introduces two inference-stage techniques (depth-aware feature rendering and consistency-aware latent completion). It is the first to simultaneously achieve camera pose control, subject identity preservation, and cross-view geometric consistency.

Neon: Negative Extrapolation From Self-Training Improves Image Generation

Neon is proposed as a post-processing method requiring <1% additional training computation. It involves fine-tuning the model on its own synthetic data to induce degradation, followed by negative extrapolation away from the degraded weights. The study proves that mode-seeking samplers cause anti-alignment between synthetic and real data gradients, making negative extrapolation equivalent to optimizing toward the real data distribution. It improves xAR-L to a SOTA FID of 1.02 on ImageNet 256×256.
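
The weight-space step can be sketched as follows, assuming parameters are stored as plain name-to-list mappings and that the extrapolation takes the form base minus `w` times (degraded minus base). The extrapolation strength `w` and the function name are illustrative assumptions.

```python
def negative_extrapolate(base, degraded, w):
    """Move away from the self-trained (degraded) weights, starting from the base weights."""
    return {
        name: [b - w * (d - b) for b, d in zip(base[name], degraded[name])]
        for name in base
    }

base = {"layer": [1.0, 2.0]}        # original model weights (toy values)
degraded = {"layer": [1.5, 1.0]}    # weights after fine-tuning on the model's own samples
improved = negative_extrapolate(base, degraded, w=0.5)
```

Setting `w` to 0 returns the base model unchanged, so the strength parameter controls how far the result moves in the direction opposite to the self-training update.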

NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

This paper proposes NeuralOS, a dual-component architecture utilizing an RNN for state tracking and a diffusion renderer to predict operating system GUI frame sequences directly from user input events (mouse movement/clicks/keyboard), achieving the first end-to-end simulation of an operating system via neural generative models.

Next Visual Granularity Generation

This paper proposes the Next Visual Granularity (NVG) generation framework, which decomposes images into structured sequences at different granularity levels. By generating from global layout to fine details step-by-step, it achieves consistent FID improvements over the VAR series.

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

A 14B causal Transformer is utilized to perform next-token prediction directly on continuous image tokens, paired with a lightweight 157M flow matching head as a sampler. Without relying on heavy diffusion backbones or vector quantization, this approach achieves pure autoregressive text-to-image quality comparable to top-tier diffusion models.

OmniPortrait: Fine-Grained Personalized Portrait Synthesis via Pivotal Optimization

OmniPortrait decomposes "identity customization" into two coarse-to-fine steps: first, a frozen denoiser and an encoder-only Pivot ID Encoder provide a coarse-grained identity "pivot"; then, during inference, a training-free RB-Guidance performs reference image matching and gradient optimization on intermediate diffusion features. This captures fine-grained details of the reference face without compromising text editability, achieving new SOTA in both identity similarity (SIM) and text alignment (CLIP-T).

OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation

OmniText is a training-free generalist framework that requires no parameter updates. By manipulating the self-attention and cross-attention of the off-the-shelf text diffusion model TextDiff-2, it unifies "text removal + content control + style control." It covers six types of text-image manipulation (TIM) tasks: removal, editing, insertion, rescaling, repositioning, and style transfer. OmniText outperforms similar text synthesis methods and approaches the performance of task-specific models across multiple metrics.

On the Design of One-Step Diffusion via Shortcutting Flow Paths

This paper unifies various "train-from-scratch one-step diffusion (shortcut models)" into a design framework of "approximating a two-step flow map target with a one-step prediction." This allows for the decoupling of entangled components (flow paths, time samplers, network parameterization, loss metrics) for comparative experiments. Based on this, improvements such as plug-in velocity and progressive time samplers are proposed, achieving a new SOTA FID50k of 2.85 for 1-NFE generation on ImageNet-256×256 (2.53 with 2× training steps) without requiring pre-training, distillation, or curriculum learning.

One Step Further with Monte-Carlo Sampler to Guide Diffusion Better

To address the systematic gradient bias in training-free guidance (DPS family) caused by approximating the conditional expectation \(\mathbb{E}_{x_0|x_t}[f(x_0)]\) with a single point \(\hat{x}_0(x_t)\), this paper proposes ABMS. By taking an additional backward denoising step and performing Monte Carlo sampling on the intermediate state before averaging, ABMS obtains more accurate guidance gradients. It is plug-and-play, combined with hypersphere-constrained step size control and "dual-focus" evaluation, achieving consistent improvements in generation quality across tasks such as handwriting trajectory, image inverse problems, molecular inverse design, and text style transfer.

Overshoot and Shrinkage in Classifier-Free Guidance: From Theory to Practice

This paper reanalyzes Classifier-Free Guidance (CFG) using the "dynamical phase transition" framework from statistical physics. It proves that in sufficiently high dimensions, CFG can precisely recover the target distribution (the "blessing of dimensionality"), while accurately characterizing mean overshoot and variance shrinkage observed in lower dimensions. Consequently, the authors propose Power-Law CFG, which nonlinearly amplifies score differences. This approach theoretically alleviates both artifacts and consistently improves image quality and diversity across SOTA models like DiT, EDM2, and Text-to-Image models.

PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models

PairFlow utilizes a closed-form discrete flow velocity field (determined by Hamming distance) to invert source samples from data. With preprocessing costs less than 1.7% of training time, it enables discrete flow models to achieve few-step generation performance that matches or exceeds distillation methods requiring pretrained teachers and fine-tuning.

Pareto-Conditioned Diffusion Models for Offline Multi-Objective Optimization

This paper proposes Pareto-Conditioned Diffusion (PCD), which reformulates offline multi-objective optimization as a conditional sampling problem. It directly generates high-quality solutions conditioned on objective tradeoffs without explicit surrogate models, achieving the best consistency across various benchmarks.

Pareto Variational Autoencoder

To address the issues of Gaussian VAEs underestimating tail probabilities and over-regularizing the latent space, this paper proposes a multivariate heavy-tailed distribution based on the \(\ell_1\)-norm—the symmetric Pareto (symPareto). By substituting the KL divergence with the γ-power divergence from information geometry, the authors construct ParetoVAE with a closed-form loss. It significantly outperforms VAEs based on Gaussian, Laplace, or Student's t distributions in heavy-tailed tasks such as graph degree reconstruction, word frequency analysis, and image denoising.

Partition Generative Modeling: Masked Modeling Without Masks

This paper proposes "Partition Generative Modeling" (PGM), which replaces the [MASK] mechanism of Masked Generative Models (MGM) by "partitioning a sequence into two mutually invisible groups that predict each other." This allows the model to process only "clean tokens" during sampling (saving computation like an autoregressive model) while retaining parallel, any-order generation (flexible like MGM). PGM is 5–5.5× faster than MDLM on OpenWebText with lower perplexity and approaches MaskGIT's FID on ImageNet with 7.5× the throughput.

PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models

This paper proposes PCPO, which fixes the disproportionate credit assignment inherent in the policy gradients of diffusion/flow models through stable objective reconstruction and principled timestep re-weighting, significantly accelerating convergence and mitigating model collapse.

PCPO: Proportionate Credit Policy Optimization for Preference Alignment of Image Generation Models

This paper identifies that applying policy gradients (PPO/GRPO) to diffusion/flow model alignment results in the sampler's mathematical structure assigning severely disproportionate credit weights \(w(t)\) across denoising timesteps, which is the root cause of training instability and model collapse. PCPO rectifies this through a "numerically stable log-hinge objective reconstruction + principled reweighting to uniformize timestep weights," significantly accelerating convergence, mitigating collapse, and outperforming SOTA baselines like DanceGRPO.

Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method

To be supplemented after further reading.

PI-Light: Physics-Inspired Diffusion for Full-Image Relighting

This paper proposes π-Light (PI-Light), a two-stage full-image relighting framework: the first stage performs intrinsic property decomposition (albedo, normal, roughness, etc.) via a physics-guided diffusion model, and the second stage achieves re-rendering under target lighting conditions through a physics-guided neural rendering module. It introduces batch-aware attention mechanisms and physics-inspired losses to achieve superior generalization to real-world scenes.

PICABench: How Far are We from Physically Realistic Image Editing?

This paper points out that current instruction-based image editing models prioritize "semantic correctness" while neglecting "physical realism" (e.g., removing an object without removing its shadows and reflections). The authors construct PICABench, a benchmark covering three major dimensions—Optics, Mechanics, and State Transition—across eight sub-dimensions. They introduce PICAEval, a region-level QA evaluation protocol, and automatically generate the PICA-100K training set by using "Text-to-Image for scene rendering + Image-to-Video for simulating physical changes," significantly improving the physical consistency of existing editing models through fine-tuning.

PixNerd: Pixel Neural Field Diffusion

PixNerd replaces the final linear projection of the Diffusion Transformer with a "per-patch implicit neural field head" that dynamically generates weights from Transformer features. This head decodes fine-grained pixels within large patches, enabling single-stage, end-to-end diffusion in the original pixel space without relying on VAEs or cascaded multi-scale architectures. It achieves a 1.93 FID on ImageNet \(256 \times 256\) with nearly 8× lower latency than previous pixel-space diffusion models.

PolyGraph Discrepancy: a classifier-based metric for graph generation

This paper proposes PolyGraph Discrepancy (PGD), which approximates the variational lower bound of the Jensen-Shannon distance by training a classifier to distinguish between real and generated graphs. This approach solves three core problems: the lack of an absolute scale in MMD metrics, incomparability between different descriptors, and high bias/variance in small sample sizes.

PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework

PosterCraft abandons the modular "VLM layout planning + separate background generation" paradigm. Instead, it employs a standard diffusion backbone (Flux-dev) through a four-stage cascaded training pipeline (text rendering optimization → high-quality poster fine-tuning → aesthetic-text reinforcement learning → visual-language feedback refinement). With specialized, automatically constructed datasets for each stage, it achieves end-to-end generation of posters with accurate text, coordinated layouts, and overall aesthetic appeal, approaching commercial closed-source systems in text metrics.

PQGAN: Product-Quantised Image Representation for High-Quality Image Synthesis

PQGAN integrates classic Product Quantisation (PQ) into the quantization module of VQGAN, partitioning each latent vector into \(S\) subspaces for individual quantization. This constructs an exponentially large "virtual codebook" via combinations of small sub-codebooks. It improves ImageNet reconstruction PSNR from 27 dB to 37.4 dB and reduces FID to 0.036, outperforming even continuous VAEs. Furthermore, it can be directly integrated into pre-trained diffusion models to double resolution or achieve several-fold speedups.
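
The core product-quantisation step can be sketched as below: a latent vector is split into \(S\) chunks, and each chunk is snapped to the nearest entry of its own small sub-codebook. The toy codebooks and the function name are illustrative assumptions, not PQGAN's actual module.

```python
import numpy as np

def product_quantise(z, codebooks):
    """z: (D,) latent; codebooks: list of S arrays, each of shape (K, D // S).

    Returns the per-subspace code indices and the quantised vector.
    """
    chunks = np.split(z, len(codebooks))
    indices, parts = [], []
    for chunk, codebook in zip(chunks, codebooks):
        # Nearest sub-codeword by squared Euclidean distance.
        idx = int(np.argmin(((codebook - chunk) ** 2).sum(axis=1)))
        indices.append(idx)
        parts.append(codebook[idx])
    return indices, np.concatenate(parts)

codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),   # sub-codebook for the first two dimensions
    np.array([[0.0, 1.0], [1.0, 0.0]]),   # sub-codebook for the last two dimensions
]
z = np.array([0.9, 1.2, 0.1, 0.8])
indices, z_q = product_quantise(z, codebooks)

# With K entries per sub-codebook and S subspaces, the combinations span K**S codes.
virtual_codebook_size = len(codebooks[0]) ** len(codebooks)
```

Because the number of representable codes grows as \(K^S\), a handful of small sub-codebooks can cover far more combinations than a single flat codebook of comparable memory.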

Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift

To address the overfitting problem in personalized text-to-image fine-tuning—where the model replicates reference images and ignores prompts—this paper proves that existing objective functions inherently fail to preserve the pre-trained distribution. It proposes a regularization term based on Lipschitz continuity, essentially an L2 constraint on parameter offsets, which preserves the original generative capacity while reducing training time by more than half.

Product of Experts for Visual Generation

This paper unifies controllable image/video generation as a "sampling problem from a product distribution of multiple heterogeneous expert models"—treating generative models as priors, discriminative models (VLMs) as soft constraints, and physics simulators as hard constraints. By utilizing "Annealed MCMC + SMC Resampling" during inference without retraining, the approach achieves superior controllability and fidelity compared to single large-scale models.

Projected Coupled Diffusion for Test-Time Constrained Joint Generation

Background: Diffusion models have become universal modeling tools for generation tasks involving images, videos, language, graphs, and robot trajectories. Many practical systems require more than unconditional sample generation: they incorporate additional objectives during inference (such as classifier guidance, inpainting, reward guidance, or projected diffusion) to guide existing models toward specific conditions or constraints without retraining.

ProReGen: Progressive Residual Generation under Attribute Correlations

ProReGen reformulates correlated attribute conditions \(x_1, x_2\) into orthogonal components \(x_1, \gamma\). It first trains a backbone generator using abundant majority samples, then learns residual generation layers using sparse minority samples, thereby improving the generation accuracy of conditional VAEs, GANs, and Diffusion Models on rare attribute combinations.

Purrception: Variational Flow Matching for Vector-Quantized Image Generation

This paper proposes Purrception, an image generation method that adapts Variational Flow Matching (VFM) to the Vector-Quantized (VQ) latent space. By learning a categorical posterior distribution over codebook indices while computing the velocity field in the continuous embedding space, it bridges continuous transport dynamics and discrete supervision, achieving faster training convergence and FID scores comparable to SOTA on ImageNet-1k 256×256.

Pyramidal Patchification Flow for Visual Generation

This method lets the Diffusion Transformer use larger patches (fewer tokens) at high-noise timesteps and smaller patches (more tokens) at low-noise timesteps. By sharing a single DiT backbone and learning separate linear projections for each patch size, it accelerates denoising inference by approximately 1.6× to 2.0× with almost no loss in image quality.
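The saving comes from the token count, which shrinks quadratically with patch size. The sketch below shows that arithmetic together with a hypothetical two-level schedule; the threshold, the patch sizes, and the 32×32 latent are assumptions for illustration, not the paper's settings.

```python
def num_tokens(height, width, patch):
    """Number of tokens when a height x width latent is cut into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    return (height // patch) * (width // patch)

def patch_size_for(t, threshold=0.5, large=4, small=2):
    """Hypothetical schedule: t=1 is pure noise, t=0 is clean; use large patches while noisy."""
    return large if t >= threshold else small

# Token counts for a hypothetical 32x32 latent at each patch size.
tokens_small = num_tokens(32, 32, 2)
tokens_large = num_tokens(32, 32, 4)
```

Doubling the patch size from 2 to 4 cuts the sequence length by a factor of four, so the high-noise steps run on far fewer tokens.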

Quantization-Aware Diffusion Models for Maximum Likelihood Training

Addressing the fundamental contradiction where digital images are discrete quantized values but diffusion models treat them as continuous signals, this paper introduces a "soft rounding + super-exponentially decaying residual" parameterization for the signal predictor. This ensures the reverse SDE converges to quantized points at \(t\to0\), pushing density estimation to the limit—reducing CIFAR-10 NLL from the previous SOTA of 2.42 bpd to 0.27 bpd.
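For readers unfamiliar with the unit, bits per dimension (bpd) is the negative log-likelihood converted to base 2 and divided by the number of data dimensions. The helper below sketches that standard conversion, assuming the NLL is given in nats per image; the function names are illustrative.

```python
import math

CIFAR10_DIMS = 32 * 32 * 3  # a CIFAR-10 image has 3072 scalar dimensions

def nats_to_bpd(nll_nats, num_dims):
    """Convert a per-image NLL in nats into bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

def bpd_to_nats(bpd, num_dims):
    """Inverse conversion: bits per dimension back to nats per image."""
    return bpd * num_dims * math.log(2)
```

Because the value is averaged over all 3072 dimensions, a small change in bpd corresponds to a large change in total per-image log-likelihood.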

Quasi-Monte Carlo Methods Enable Extremely Low-Dimensional Deep Generative Models

This paper proposes QLVM (Quasi-Monte Carlo Latent Variable Model): by discarding the VAE encoder and the variational lower bound, it directly approximates the marginal likelihood using randomized Quasi-Monte Carlo (QMC) lattice integration to train a decoder. This enables training deep generative models in extremely low-dimensional latent spaces (1/2/3D) that outperform VAEs/IWAEs of the same dimensionality and are inherently visualizable.

QVGen: Pushing the Limit of Quantized Video Generative Models

This paper proposes QVGen, a Quantization-Aware Training (QAT) framework for video diffusion models. It introduces an auxiliary module that reduces gradient norms to improve convergence, and a rank decay strategy that gradually eliminates the auxiliary module's inference overhead during training. QVGen achieves near full-precision video generation quality under 4-bit quantization for the first time.

reAR: Rethinking Visual Autoregressive Models via Token-wise Consistency Regularization

reAR argues that the core bottleneck of visual autoregressive generation is not the single-token prediction accuracy itself, but the inconsistency between the discrete token sequence produced by the generator and the tokenizer decoder. By using noisy context regularization and codebook embedding regularization to constrain the hidden representations of each token during training, reAR significantly improves ImageNet image generation quality without altering the tokenizer, generation order, or inference process.

Reconciling Visual Perception and Generation in Diffusion Models

GenRep performs discriminative perception and generative modeling simultaneously within a single diffusion model. It uses Monte Carlo methods to distill distribution knowledge from the diffusion model to perception tasks and conversely utilizes high-level semantics learned by perception to guide the generative denoising process. By employing gradient alignment to coordinate the two objectives, GenRep achieves leading performance across both perception and generation benchmarks.

ReDDiT: Rehashing Noise for Discrete Visual Generation

ReDDiT extends the single [mask] absorbing state in discrete diffusion to a set of random multi-index absorbing states (rehashing noise). It employs a rehash sampler utilizing torch.multinomial for low-discrepancy sampling, replacing the Gumbel-max-based remask heuristics in MVTM. This approach reduces the gFID on ImageNet-256 from a baseline of 6.18 to 1.61, marking the first time discrete diffusion matches continuous diffusion in generation quality.

RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation

RefAny3D is proposed as a 3D asset-referenced image generation framework. By employing a dual-branch generation strategy that jointly models RGB images and point maps, it achieves precise geometric and textural consistency between the generated images and the 3D reference assets.

Referring Layer Decomposition

The authors propose the Referring Layer Decomposition (RLD) task to predict complete RGBA layers from a single RGB image based on flexible user prompts (spatial, textual, or hybrid). They also construct the RefLade dataset with 1.11 million samples and an automated evaluation protocol.

ReFocusEraser: Refocusing for Small Object Removal with Robust Context-Shadow Repair

Addressing the issue of detail loss when diffusion models remove small objects, ReFocusEraser utilizes "Camera-adaptive magnification + LoRA fine-tuning" to enlarge and repair small targets first, followed by "Mask-based stitching + Seam-Shadow Aware Decoder" to seamlessly re-insert them into the original image while automatically removing residual shadows. This elevates the PSNR from 25.0 to 31.3 on the RORD dataset.

RegionE: Adaptive Region-Aware Generation for Efficient Image Editing

RegionE observes that in instruction-based image editing, the generation trajectories of unedited regions are approximately linear, while those of edited regions are more curved but exhibit similar velocities between adjacent steps. By employing adaptive region partitioning, region-level KV injection, and velocity decay caching, it accelerates Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit by approximately 2.06-2.57x without training new models, while largely preserving the output quality of the original models.

Reinforcing Diffusion Models by Direct Group Preference Optimization

This paper proposes DGPO (Direct Group Preference Optimization), which decouples the "intra-group relative preference" concept of GRPO from the policy-gradient framework. This allows diffusion models to perform online RL post-training using efficient deterministic ODE samplers, improving SD3.5-M from 0.63 to 0.97 on GenEval while training approximately 20× faster than Flow-GRPO (nearly 30× on GenEval).

RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

RePrompt trains a small language model (Qwen2.5-3B) using reinforcement learning to perform explicit chain-of-thought reasoning before generating structured enhanced prompts. By directly optimizing downstream generation results with a "level-image" integrated reward, it achieves new SOTA performance in compositional abilities (spatial positioning, counting) on GenEval and T2I-Compbench, with inference latency significantly lower than iterative optimization methods.

Rethinking Global Text Conditioning in Diffusion Transformers

This paper systematically analyzes the "global conditioning path via modulation of pooled text embeddings" in Diffusion Transformers. It finds that while nearly ineffective in conventional usage, repurposing this path from a "condition" to a "guidance direction" significantly improves image/video quality and controllability for text-to-image/video and image editing in a training-free, near-zero-overhead manner.

RIDER: 3D RNA Inverse Design with Reinforcement Learning-Guided Diffusion

The RIDER framework is proposed, marking the first introduction of reinforcement learning into RNA 3D inverse design. It first pre-trains a conditional diffusion model, RIDE, to learn sequence-structure relationships, and then fine-tunes it using RL to directly optimize 3D structural similarity rather than Native Sequence Recovery (NSR), achieving over 100% improvement across all 3D self-consistency metrics.

RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation

RMFlow is proposed to compensate for 1-NFE transport errors in MeanFlow by incorporating a noise-injection refinement step. By adding a maximum likelihood objective during training to minimize the KL divergence between the learned and target distributions, it achieves near-SOTA 1-NFE results across T2I, molecule generation, and time-series generation.

RNE: plug-and-play diffusion inference-time control and energy-based training

The Radon-Nikodym Estimator (RNE) is proposed. Based on the density ratio between path distributions, it reveals the fundamental connection between marginal densities and transition kernels, providing a unified plug-and-play framework for diffusion density estimation, inference-time control, and energy-based training.

Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

ProMoE is proposed as a Mixture-of-Experts framework for Diffusion Transformers. By employing a two-step router (conditional routing + prototype routing) and a routing contrastive loss, it provides explicit semantic guidance to promote expert specialization. It significantly outperforms existing MoE and dense models on ImageNet.

SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions

SafeFlowMatcher is proposed as a safe planning framework that combines flow matching with Control Barrier Functions (CBF). It decouples path generation from safety certification via a Predictor-Corrector (PC) integrator, maintaining the efficiency of flow matching while providing formal safety guarantees.

Safety-Guided Flow (SGF): A Unified Framework for Negative Guidance in Safe Generation

This paper unifies two "negative guidance" safe generation methods (Shielded Diffusion and Safe Denoiser) under an energy framework based on the Maximum Mean Discrepancy (MMD) potential function. Leveraging Control Barrier Function (CBF) theory, it mathematically proves that applying negative guidance only within an early "critical time window" of denoising and decaying it to zero thereafter effectively ensures safety while maintaining image quality.

SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback

SAIL enables a diffusion model to act as its own "teacher": starting from a minimal seed of human-annotated preference data, the model generates its own samples, ranks them using an implicit reward derived from the diffusion loss, and fine-tunes itself in a closed loop. Using only approximately 6% of the preference data, it outperforms DiffusionDPO on HPSv2, Pick-a-Pic, and PartiPrompts.

Sample-Efficient Evidence Estimation of Score-Based Priors for Model Selection

This paper proposes DiME, a model evidence estimator integrated along the time marginals of the diffusion posterior. It requires no prior scores or density evaluations and accurately estimates model evidence under diffusion priors using only a small number of posterior samples (e.g., 20) for prior selection and model validation.

Sample Reward Soups: Query-efficient Multi-Reward Guidance for Text-to-Image Diffusion Models

Without training the diffusion model, this paper replaces "querying black-box rewards for every weight combination" with "interpolated search gradients." This allows text-to-image models to align with multiple black-box rewards simultaneously during inference, significantly reducing reward queries in the early denoising stages (up to 2.7×) while avoiding the reward over-optimization common in fine-tuning methods.

Scalable Energy-Based Models via Adversarial Training: Unifying Discrimination and Generation

This paper proposes Dual Adversarial Training (DAT), which replaces the unstable SGLD sampling in JEM with adversarial training (PGD for contrastive samples + BCE loss) to learn the energy function. Combined with adversarial training for the discriminative branch and a two-stage training strategy, it scales energy-based discriminative-generative hybrid models to ImageNet 256×256 for the first time, achieving SOTA-level robust classification and generation quality (FID 3.29, comparable to the autoregressive VAR-d16 and surpassing ADM-G/LDM-4-G).

Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization

This paper addresses the long-standing issues of training instability and low codebook utilization in Vector Quantized (VQ) tokenizers. It proposes VQBridge (a compress–process–recover projector), which is used only during training and discarded at inference. Combined with learning rate annealing, this approach achieves 100% codebook utilization across various configurations from 16k to 262k entries, reaching an rFID of 0.88. When integrated with LlamaGen for image generation, it outperforms VAR and DiT in terms of FID.

Scale-wise Distillation of Diffusion Models

SwD proposes a "distillation by scale" framework that transforms any pretrained diffusion model into a few-step generator. It progressively increases the resolution at each sampling step—running initial steps at low resolution and only reaching full resolution at the end. This reduces single-step computation by half without increasing total steps. Additionally, a patch-level distillation loss based on MMD is introduced, which independently approaches SOTA performance, accelerating text-to-image by ~2× and text-to-video by ~3× without degrading image quality.

Scaling Group Inference for Diverse and High-Quality Generation

Addressing the pain point where "users view a set of images (4-8) but i.i.d. sampling produces highly redundant results," this paper reformulates "generating a set of images for a prompt" as a Quadratic Integer Programming (QIP) selection problem. It selects a subset from a large candidate pool to simultaneously maximize individual quality (unary term) and intra-group diversity (binary term). By observing that "intermediate predictions serve as reliable previews of final images," the authors introduce Progressive Pruning, reducing complexity from \(O(MT)\) to \(O(M+KT)\). This approach consistently outperforms baselines like CFG, Interval Guidance, and Particle Guidance on the quality-diversity Pareto frontier.
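The objective above combines a per-image quality score with pairwise diversity. The sketch below selects a group with a simple greedy heuristic, which is a stand-in and not the paper's QIP solver or its Progressive Pruning; the scores, the distance matrix, and `diversity_weight` are made-up illustrative values.

```python
def greedy_group_select(quality, distance, k, diversity_weight=1.0):
    """Greedily pick k items maximising quality plus weighted distance to items already picked."""
    k = min(k, len(quality))
    selected = [max(range(len(quality)), key=lambda i: quality[i])]
    while len(selected) < k:
        def gain(i):
            # Unary term (quality) plus binary term (distance to the current group).
            return quality[i] + diversity_weight * sum(distance[i][j] for j in selected)
        candidates = [i for i in range(len(quality)) if i not in selected]
        selected.append(max(candidates, key=gain))
    return selected

# Toy pool: candidates 0 and 1 are near-duplicates with high quality.
quality = [0.9, 0.89, 0.5, 0.6]
distance = [
    [0.0, 0.01, 0.8, 0.9],
    [0.01, 0.0, 0.8, 0.7],
    [0.8, 0.8, 0.0, 0.5],
    [0.9, 0.7, 0.5, 0.0],
]
diverse_pair = greedy_group_select(quality, distance, k=2)
quality_only_pair = greedy_group_select(quality, distance, k=2, diversity_weight=0.0)
```

With the diversity term switched off, the two near-duplicates are chosen; with it on, the second slot goes to a lower-quality but clearly different candidate, which is the trade-off the selection problem formalizes.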

Scaling Laws for Diffusion Transformers

This paper systematically trains Diffusion Transformers (DiT) within a compute budget range of 1e17 to 6e18 FLOPs, fitting the first explicit scaling laws for DiT—where pre-training loss follows a power law relationship with compute. This enables precise prediction of optimal model size, data volume, and final generation quality (FID) for a given compute budget, and demonstrates that these power laws can extrapolate to 1.5e21 FLOPs and transfer across datasets.
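A power law `loss = a * C**b` becomes a straight line in log-log space, so it can be fitted by ordinary least squares on the logarithms. The sketch below illustrates this on synthetic data; the coefficients `2.0` and `-0.05` are invented for the example and are not results from the paper.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**b by least squares on (log compute, log loss)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return math.exp(intercept), slope

# Synthetic, noise-free data generated from made-up coefficients (NOT paper results).
budgets = [1e17, 3e17, 1e18, 3e18, 6e18]
synthetic_loss = [2.0 * c ** -0.05 for c in budgets]

a, b = fit_power_law(budgets, synthetic_loss)
extrapolated = a * (1.5e21) ** b  # prediction far outside the fitted range
```

On real measurements the points scatter around the line, and the quality of the extrapolation depends on how well the power-law assumption holds beyond the fitted budgets.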

Score Distillation Beyond Acceleration: Generative Modeling from Corrupted Data

This paper proposes Restoration Score Distillation (RSD), which distills a diffusion teacher trained only on corrupted observations into a one-step generator. It discovers that in corrupted data scenarios, distillation not only accelerates sampling but also significantly shifts the generated distribution closer to the clean image distribution.

SDErasure: Concept-Specific Trajectory Shifting for Concept Erasure via Adaptive Diffusion Classifier

SDErasure identifies that "the generation of each concept depends only on a small segment of critical denoising timesteps." It utilizes a Diffusion Classifier to adaptively select these critical steps for each concept to be erased. By performing trajectory shifting fine-tuning only on these steps and incorporating dual-path quality regulation losses, the method achieves thorough concept erasure while reducing MSCOCO FID from 9.51 to 6.74.

Secure Inference for Diffusion Models via Unconditional Scores

To address the slow inference of diffusion models under Secure Multi-Party Computation (MPC), this paper employs aggressive low-degree polynomial approximations to accelerate non-linear operators. It then utilizes "unconditional scores," computed without error in plaintext, to correct the conditional scores polluted by approximation errors, significantly recovering image quality with almost no additional overhead.

Seek-CAD: A Self-Refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek

This paper proposes Seek-CAD, the first training-free CAD parametric model generation framework based on locally deployed reasoning LLMs (DeepSeek-R1). It achieves self-refinement through step-by-step visual feedback and Chain-of-Thought (CoT) synergy, and designs a new SSR triplet design paradigm to support complex CAD model generation.

Self-Improving Loops for Visual Robotic Planning

The SILVR framework is proposed, which achieves continuous self-improvement of visual robotic planners on unseen tasks by iteratively fine-tuning an in-domain video generation model on self-collected online trajectories, reaching up to 285% performance gains in MetaWorld and on real robots.

SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation

This paper proposes SenseFlow, which scales Distribution Matching Distillation (DMD) to large-scale flow-based text-to-image models (SD 3.5 Large 8B / FLUX.1 dev 12B) via Implicit Distribution Alignment (IDA) and Intra-Segment Guidance (ISG), achieving high-quality 4-step image generation.

SERUM: Simple, Efficient, Robust, and Unifying Marking for Diffusion-based Image Generation

The SERUM watermarking method is proposed, adding unique watermark noise to the initial noise of diffusion models and training a lightweight detector to identify watermarks directly from generated images (bypassing expensive DDIM inversion). It achieves the highest detection rates under various attacks with extremely fast injection and detection, supporting multi-user scenarios.

SESaMo: Symmetry-Enforcing Stochastic Modulation for Normalizing Flows

SESaMo proposes a "stochastic modulation" mechanism, allowing Normalizing Flows to first map the prior distribution into a single mode of the target distribution, then use a stochastic variable-controlled symmetry transformation to spread probability mass across all equivalent modes based on learned weights. This enables precise enforcement of symmetries in data-free variational inference and, for the first time, learns "broken symmetry," achieving an Effective Sample Size (ESS) close to 1 on the 8-Gaussian mixture, complex \(\phi^4\) theory, and Hubbard model.
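An ESS "close to 1" refers to the effective sample size normalized by the number of samples. The sketch below uses the common importance-weight definition, \((\sum_i w_i)^2 / (N \sum_i w_i^2)\), and assumes log-weights as input; whether the paper uses exactly this estimator is an assumption.

```python
import math

def normalised_ess(log_weights):
    """Normalised effective sample size in (0, 1] from unnormalised log importance weights."""
    m = max(log_weights)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / (len(w) * sum(x * x for x in w))
```

Equal weights give exactly 1 (the model matches the target up to normalization), while a single dominant weight drives the value toward `1 / N`, the signature of a flow that has collapsed onto a subset of modes.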

Shortcut Diffusion Training with Cumulative Consistency Loss: An Optimal Control View

This paper interprets the few-step generation training of shortcut diffusion as a controlled flow-matching process. It points out that the original self-consistency loss only penalizes the current step error and proposes the Cumulative Self-Consistency Loss, which accumulates future misalignments along the trajectory. This significantly improves image generation quality for one to four steps with almost the same training budget.

SIGMA-GEN: Structure and Identity Guided Multi-Subject Assembly for Image Generation

SIGMA-GEN unifies "what each subject looks like (identity)" and "where each subject is placed, its orientation, and occlusion (structure)" into two control maps. This enables a Diffusion Transformer to incorporate up to 10 identity-preserving subjects in a single forward pass. The authors curate a synthetic dataset, SIGMA-SET27K, with identity/mask/depth/2D/3D box annotations. SIGMA-GEN outperforms iterative baselines that insert subjects one-by-one in terms of identity fidelity, image quality, and inference speed.

SketchEvo: Enhancing Sketch-Guided Image Generation with Dynamic Drawing Processes

SketchEvo treats the dynamic sequence of drawing—from the first stroke to completion—as a source of diversity for preference optimization. During training, sketches with different completion levels are used as conditions to construct significantly distinct positive and negative pairs for aligning with human aesthetics. During inference, an initial sketch stroke-guided rollback mechanism is employed to strengthen semantic gain, thereby significantly improving the aesthetic quality of generated images while maintaining sketch fidelity.

SketchingReality: From Hand-Drawn Scene Sketches to Photo-Realistic Images

This paper proposes SketchingReality, a "semantic modulation + attention supervision" scheme that transforms abstract and distorted hand-drawn scene sketches (rather than neat edge maps) into images that are both faithful to sketch semantics and photo-realistic. It also introduces a training loss that does not require pixel-aligned ground truth images.

SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling

This work provides the first systematic study of privacy leakage in SMOTE, proposing two attacks, DistinSMOTE and ReconSMOTE. It demonstrates that SMOTE is inherently non-privacy-preserving and excessively exposes minority class records.
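The leakage is easier to see from how standard SMOTE builds a sample: it interpolates between a real minority record and one of its minority-class neighbours. The sketch below shows only that interpolation step, with hypothetical function and variable names; the DistinSMOTE and ReconSMOTE attacks themselves are not reproduced.

```python
import random

def smote_sample(x, neighbour, rng):
    """One SMOTE-style synthetic point: x + lam * (neighbour - x), with lam drawn from [0, 1)."""
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x, neighbour)]

rng = random.Random(0)
real_record = [0.0, 0.0]        # toy minority record
real_neighbour = [2.0, 4.0]     # toy minority-class neighbour
synthetic = smote_sample(real_record, real_neighbour, rng)
```

Every synthetic point lies on the line segment between two real minority records, so the released data carries strong geometric information about the originals.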

Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings

Soft-Di[M]O relaxes the token distribution of one-step discrete image generators into differentiable expected embeddings. This allows the Di[M]O-distilled Masked Diffusion Model to integrate with GAN training, differentiable reward fine-tuning, and test-time embedding optimization, pushing the one-step FID to 1.56 on ImageNet-256 and outperforming teacher models in GenEval and HPS metrics for text-to-image tasks.

SoftCFG: Uncertainty-guided Stable Guidance for Visual Autoregressive Model

To address the "guidance diminishing" and "over-guidance" issues in Visual Autoregressive (AR) models using CFG, SoftCFG applies weighted perturbations to the value cache of the unconditional branch based on the confidence of each generated token and constrains cumulative perturbations with "Step-Normalization." This training-free and architecture-agnostic approach improves the FID of AR models on ImageNet 256×256 from 1.37 to 1.27, setting a new SOTA for AR models.

SONA: Learning Conditional, Unconditional, and Matching-Aware Discriminator

SONA decomposes the conditional GAN discriminator into two mutually orthogonal projection terms: "naturalness" and "alignment." These are trained respectively using SAN loss and two types of Bradley–Terry losses, balanced by a constrained adaptive weighting mechanism. This achieves higher sample quality and better conditional alignment in both class-conditional and text-to-image tasks.

SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation

This paper proposes the SongEcho framework, which achieves cover song generation through Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), generating new vocals and accompaniments while preserving the original song's melody contour.

Source-Guided Flow Matching

This paper proposes the SGFM framework, which reformulates the "guided generation" problem in flow matching as "sampling from a modified source distribution." By modifying only the source distribution while leaving the pre-trained vector field untouched, the method accurately recovers the target distribution, preserves the straight-line trajectories of optimal transport vector fields (enabling fast inference), and allows users to choose samplers (Importance Sampling / HMC / Optimization) as needed.

SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models

SPEED proposes a closed-form model editing method based on null space constraints. By utilizing three complementary techniques—Influence Prior Filtering (IPF), Directional Prior Augmentation (DPA), and Invariant Equality Constraint (IEC)—to refine the preservation set, it achieves scalable (erasing 100 concepts in 5 seconds), precise (zero semantic loss for non-target concepts), and efficient concept erasure.

SpikeGen: Decoupling "Rod-Cone" Visual Representations with a Latent Generative Framework

SpikeGen encodes visual information from spike cameras (rods, high temporal resolution) and RGB cameras (cones, high color/spatial resolution) into a shared VAE latent space. It then utilizes a modified MAR + per-token diffusion framework for generative fusion within this latent space. A single pre-trained model simultaneously achieves or exceeds SOTA performance across three tasks: conditional deblurring, dense frame reconstruction from spike streams, and new-view synthesis for high-speed scenes.

SPRINT: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers

SPRINT merges the shallow dense local features and deep sparse global features of the Diffusion Transformer using a residual approach, enabling DiT to be efficiently pre-trained at a 75% token dropping ratio and further reducing sampling costs through Path-Drop Guidance.

SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation

This paper proposes Scaled Spatial Guidance (SSG), a training-free inference-time guidance method that enhances the coarse-to-fine hierarchical generation quality of visual autoregressive models through frequency-domain prior construction and semantic residual amplification.

Stage-wise Dynamics of Classifier-Free Guidance in Diffusion Models

Under the assumption of multi-modal (Gaussian mixture) conditional distributions, this paper decomposes the Classifier-Free Guidance (CFG) sampling process into three stages: "directional shift → modal separation → intra-modal contraction." By characterizing the effect of CFG on trajectories in each stage using three theorems, the authors provide a unified explanation for the long-standing empirical phenomenon where "stronger guidance improves alignment but degrades diversity." Consequently, a low-high-low time-varying guidance schedule is proposed to simultaneously enhance quality and diversity.
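A low-high-low schedule can be sketched as a guidance scale that depends on sampling progress. The triangular shape, the endpoint values `w_low` and `w_high`, and the function names below are assumptions for illustration; the paper's actual schedule is not specified in this summary. The combination rule uses the common CFG convention in which a scale of 1 recovers the conditional prediction.

```python
def low_high_low_scale(t, w_low=1.0, w_high=7.5):
    """Hypothetical triangular schedule over progress t in [0, 1], peaking at t = 0.5."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return w_low + (w_high - w_low) * (1.0 - abs(2.0 * t - 1.0))

def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

Guidance is weak at the start and end of sampling and strongest in the middle, matching the three-stage picture in which strong guidance late in sampling mainly contracts samples within a mode.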

Steer Away From Mode Collisions: Improving Composition In Diffusion Models

To address concept omission or collision in multi-concept prompts for diffusion models, this paper proposes the "Mode Collision" hypothesis (mode overlap between joint and single-concept distributions). It introduces CO3 (Concept Contrasting Corrector), which steers generation away from degenerate modes by composing a corrected distribution \(\tilde{p}(x|C) \propto p(x|C) / \prod_i p(x|c_i)\) in the Tweedie mean space, achieving plug-and-play, gradient-free, and model-agnostic improvements in compositional generation.
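Taking logarithms of the corrected distribution shows why it needs no extra gradients beyond the model's own predictions. The identity below follows directly from the formula above; how CO3 realizes it in the Tweedie mean space is not reproduced here.

\[
\log \tilde{p}(x \mid C) = \log p(x \mid C) - \sum_i \log p(x \mid c_i) + \text{const},
\qquad
\nabla_x \log \tilde{p}(x \mid C) = \nabla_x \log p(x \mid C) - \sum_i \nabla_x \log p(x \mid c_i).
\]

The score of the corrected distribution is the joint-prompt score minus the single-concept scores, so regions that are likely under any one concept alone are pushed away from.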

Step-Aware Residual-Guided Diffusion for EEG Spatial Super-Resolution

This paper proposes SRGDiff, a step-aware residual-guided diffusion model that reformulates EEG spatial super-resolution as a dynamic conditional generation task, achieving high-fidelity reconstruction through step-wise residual direction correction and step-dependent affine modulation.

Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models

This paper proposes S²-Guidance, which utilizes randomly dropped transformer block sub-networks as weak models for self-guidance during the denoising process. This corrects suboptimal CFG predictions without additional training, consistently outperforming CFG and other advanced guidance strategies in text-to-image and text-to-video tasks.

STORK: Accelerating Diffusion and Flow Matching Sampling by Simultaneously Solving Stiffness and Structural Dependency

STORK introduces Stabilized Runge-Kutta (SRK) methods from numerical analysis—designed specifically for "stiff ODEs"—into diffusion and flow matching sampling. By utilizing Taylor expansion to compress the high Number of Function Evaluations (NFE) typical of SRK into "virtual NFEs," it yields a training-free solver that handles stiffness without structural dependence. At ultra-low budgets of 7–20 NFE, it consistently outperforms DPM-Solver++ and UniPC in FID.

Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization

Story-Iter transforms long story visualization from a "one-time dependency on fixed reference images" into a training-free external iterative process: it first generates the entire story via text, then repeatedly uses the full-length frames from the previous round as a global reference through the GRCA attention module to maintain character consistency and fine-grained text interaction, significantly outperforming existing paradigms on 100-frame long stories.

Strictly Constrained Generative Modeling via Split Augmented Langevin Sampling

Addressing the issue where generative models fail to strictly satisfy physical constraints in scientific sampling, this paper draws from the variational perspective of Langevin dynamics and Lagrangian duality to propose CASAL (Constrained Alternated Split Augmented Langevin). By using variable splitting to decouple "exploration" and "constraint satisfaction" into two separate variables and employing a dual variable for correction, the method maintains Langevin's exploration capability while strictly satisfying non-convex constraints. It can be applied zero-shot to pre-trained diffusion models and significantly outperforms projection and penalty methods in constrained field generation, data assimilation, and optimal control feasibility tasks.

Structured Flow Autoencoders: Learning Structured Probabilistic Representations with Flow Matching

This paper proposes Structured Flow Autoencoders, which integrate structured latent variables from probabilistic graphical models into conditional continuous normalizing flows. By employing Structured Conditional Flow Matching, it simultaneously learns high-fidelity generative distributions and interpretable posterior representations, achieving a superior balance between generative quality, sample diversity, and latent space structure compared to VAE / SVAE on image, RNA-seq, and sequential video data.

Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models

This paper introduces the HistVis historical visual benchmark, generating 30,000 images of cross-era activities using three open-source text-to-image (TTI) diffusion models. It systematically reveals how models render the "past" as a synthetic history characterized by stereotypical associations, anachronisms, and distorted demographic distributions across three dimensions: implicit style associations, historical consistency, and demographic representation.

TAVAE: A VAE with Adaptable Priors Explains Contextual Modulation in the Visual Cortex

This paper proposes a Task-Amortized VAE (TAVAE), which extends the VAE formalism to explain contextual modulation in mouse V1 by flexibly learning task-specific priors over learned representations. This explains the bimodal population responses observed during orientation discrimination tasks when training and test stimuli are mismatched.

TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

TempFlow-GRPO identifies that "treating all denoising steps equally" in existing Flow-GRPO training is a core bottleneck. By using a triad of "process rewards via trajectory bifurcation + noise-level reweighting + seed grouping," it matches optimization intensity to the real exploration potential of each step, achieving SOTA on GenEval and PickScore with significantly fewer steps (GenEval 0.63→0.97, ~10× training efficiency).

Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions

This paper proposes the Prompt-Conditioned Intervention (PCI) framework, which quantifies when concepts become fixed in diffusion models by switching the text prompt at different timesteps of the denoising trajectory. The findings are then applied to time-aware image editing.

Terminal Velocity Matching

This paper proposes Terminal Velocity Matching (TVM), which shifts Flow Matching from "matching velocity at the trajectory start" to "matching velocity at the trajectory end." This allows a single-stage training process to directly learn the displacement mapping between any two time steps with a provable upper bound on the 2-Wasserstein distance. Combined with a semi-Lipschitz architectural modification and a Flash Attention JVP kernel supporting backpropagation, it achieves SOTA results for from-scratch few-step generation on ImageNet-256 (1-step 3.29 FID, 4-step 1.99 FID).

Test-Time Iterative Error Correction for Efficient Diffusion Models

This paper proposes IEC (Iterative Error Correction), a plug-and-play test-time method that corrects the inference errors of efficient diffusion models through iterative refinement, reducing error accumulation from exponential to linear growth.

MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation

MADFormer mixes Autoregression (AR) and Diffusion along both the "token axis" and the "layer axis." It utilizes AR for one-time global conditioning between blocks and Diffusion for iterative refinement within blocks. By treating early Transformer layers as AR conditioners and later layers as diffusion denoisers, it serves as a controllable testbed to systematically answer "how to allocate compute between AR and Diffusion," improving FID by up to 60–75% under constrained inference compute.

The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models

This paper systematically investigates the impact of text prompt complexity on three critical dimensions of T2I model synthetic data: quality, diversity, and consistency. It proposes a new evaluation framework and discovers that prompt expansion, as an inference-time intervention, optimally balances diversity and aesthetic quality.

The Spacetime of Diffusion Models: An Information Geometry Perspective

This work proposes the concept of "spacetime" for diffusion models from an information geometry perspective. It demonstrates that standard pullback geometry degenerates into straight lines in diffusion models, introduces a spacetime geometry based on the Fisher-Rao metric, and derives a practically computable Diffusion Edit Distance (DiffED) along with transition path sampling methods.

There and Back Again: On the Relation between Noise and Image Inversions in Diffusion Models

This work provides an in-depth analysis of the error mechanisms in DDIM inversion, discovering that latent encodings exhibit low diversity and high correlation in smooth image regions (e.g., sky). Tracing this to inaccurate noise predictions in the initial steps of inversion, the authors propose a simple fix by replacing the first few inversion steps with forward diffusion.

There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-Training

This paper proposes EPG (End-to-end Pixel-space Generative model), a two-stage framework consisting of a "self-supervised pre-trained encoder + end-to-end fine-tuned decoder." By completely discarding the VAE and training diffusion and consistency models directly in pixel space, it achieves 1.58 FID (75 NFE) on ImageNet-256. Using approximately 30% of the training compute of DiT, it outperforms DiT/SiT and, for the first time, trains a consistency model directly to 8.82 FID (1-step) without relying on a VAE or pre-trained diffusion models.

TIPO: Text to Image with Text Pre-sampling for Prompt Optimization

TIPO uses a 200M lightweight autoregressive language model to expand (instead of rewrite) simple user prompts into detailed prompts aligned with the text distribution of T2I model training. By leveraging a 30M image-text pair corpus and multi-task "text pre-sampling," it significantly enhances image quality, text alignment, and human preference while remaining faster and more efficient than RL or large-model solutions.

ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization

This paper proposes ToProVAR, a framework that uses attention entropy to analyze the sparsity of VAR models in a unified way across the token, layer, and scale dimensions. It achieves up to 3.4× acceleration with near-zero loss in image quality, significantly outperforming FastVAR and SkipVAR.

Towards Better Optimization for Listwise Preference in Diffusion Models

This paper proposes Diffusion-LPO, extending DPO preference alignment for diffusion models from "pairwise comparisons" to "full ranked lists." By deriving a listwise objective using the Plackett-Luce model, it ensures every image is superior to all lower-ranked images in a list. It consistently outperforms pairwise Diffusion-DPO across text-to-image, image editing, and personalized alignment tasks (achieving a >12% PickScore win rate improvement on SD1.5).
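The Plackett-Luce likelihood behind a listwise objective of this kind can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes each image in a ranked list already has a scalar score (in a DPO-style setup this would be an implicit reward derived from the model, which is not reproduced here), and the function name `plackett_luce_nll` is hypothetical.

```python
import math


def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores` holds one scalar per item, ordered from best-ranked to
    worst-ranked. At each position, the item must "win" a softmax against
    itself and every item ranked below it.
    """
    nll = 0.0
    for i in range(len(scores) - 1):
        tail = scores[i:]
        # Numerically stable log-sum-exp over the remaining items.
        m = max(tail)
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        nll -= scores[i] - log_z
    return nll


# With two items the loss reduces to the pairwise -log(sigmoid(s0 - s1)).
pairwise = plackett_luce_nll([1.0, 0.0])
# A list whose scores agree with the ranking is cheaper than one that disagrees.
agree = plackett_luce_nll([2.0, 1.0, 0.0])
disagree = plackett_luce_nll([0.0, 1.0, 2.0])
```

The two-item case shows why a listwise objective can be read as a generalization of the pairwise one: every position contributes a comparison against all lower-ranked items rather than a single loser.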

Towards Sequence Modeling Alignment Between Tokenizer and Autoregressive Model

This paper points out that tokens encoded by conventional image tokenizers exhibit bidirectional dependency, which fundamentally conflicts with the strictly unidirectional prediction paradigm of autoregressive (AR) models. The authors propose AliTok, which uses a causal decoder to constrain a bidirectional encoder, forcing the production of token sequences that are both semantically rich and highly predictable. This allows a standard decoder-only AR model with only 662M parameters to achieve a gFID of 1.28 on ImageNet-256, surpassing SOTA diffusion models for the first time while being 10× faster in sampling.

Training-Free Reward-Guided Image Editing via Trajectory Optimal Control

Reward-guided image editing is reformulated as a trajectory optimal control problem. The reverse process of diffusion/flow models is treated as a controllable trajectory, optimized via adjoint state iteration based on Pontryagin's Maximum Principle (PMP) across the entire trajectory. This achieves effective training-free reward-guided editing without reward hacking.

ColorCtrl: Training-Free Text-Guided Color Editing Based on Multi-Modal Diffusion Transformer

ColorCtrl is a training-free text-guided color editing method that decouples "structure" and "color" by directly manipulating MM-DiT attention maps and value tokens. It achieves precise color editing with virtually zero damage to geometry, material, and lighting consistency across various models (SD3, FLUX.1-dev, CogVideoX), while supporting word-level intensity adjustment.

Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations

This paper proposes a general framework that uses Rectified Flow to generate distributional rewards for training explanation-generating LLMs. Continuous Normalizing Flows (CNF) capture the pluralistic and probabilistic nature of human judgment, and the authors prove theoretically that CNF can effectively recover the true human reward distribution. The method significantly outperforms RLHF/RLAIF baselines on tasks such as SMAC, MMLU, and MathQA.

TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

This paper reinterprets the denoising process of diffusion/flow models as a search tree: sampling starts from shared noise, branches only within scheduled SDE windows, and reuses shared prefixes for ODE steps. By backpropagating leaf rewards along the tree to derive per-edge advantages for GRPO updates, the method achieves 2.4× faster training under the same sampling budget and consistently outperforms DanceGRPO/MixGRPO on the efficiency-reward Pareto frontier.

TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

This paper proposes TwinFlow. By extending the flow matching time interval from \([0,1]\) to \([-1,1]\), it constructs "twin trajectories" that provide self-adversarial signals, enabling one-step generation without discriminators or frozen teachers. It is the first to extend 1-NFE generation to the 20B-parameter Qwen-Image model; its 1-NFE GenEval score of 0.86 approaches the original 100-NFE score of 0.87 while reducing inference cost by \(100\times\).

Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models

Uni-X proposes an X-shaped architecture that is separated at both ends and shared in the middle to mitigate gradient conflicts between visual and textual modalities in Unified Multimodal Models (UMM). By setting shallow and deep layers as modality-specific while keeping middle layers parameter-shared, it matches or exceeds the performance of 7B AR-UMMs in image generation and multimodal understanding with only 3B parameters.

UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy

UniCalli unifies column-level generation and recognition of Chinese calligraphy into a multimodal Diffusion Transformer. Through asymmetric denoising, box map spatial priors, and joint training, the model generates entire columns of calligraphy with natural ligatures and layout rhythm while maintaining robust recognition capabilities across long-tail calligraphers and styles.

UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

To address inversion collapse and the failure of delayed injection caused by the "straight, non-intersecting trajectories" in flow matching models (SD3, FLUX), this paper proposes a training-free, model-agnostic predictor-corrector framework. Uni-Inv achieves high-fidelity inversion by constructing an implicit Euler closed-form solution via reusing previous-step velocities. Uni-Edit incorporates a correction step during the editing stage, combined with region-adaptive guidance and velocity fusion. This allows for strong editing performance while maintaining background consistency within 15 steps, achieving SOTA results in both reconstruction and PIE-Bench editing tasks.

RealUID: Supervising the Distillation of All Matching Models with Real Data (Without GAN)

RealUID unifies one-step distillation methods specifically designed for single frameworks (such as SiD, FGM, and IBMD) into a single min-max loss through a "linearization + inverse optimization" perspective. It designs a loss that injects real data directly into the distillation objective without relying on GANs or extra discriminators. On CIFAR-10, it reduces the FID of flow distillation from 2.58 to 1.98 (unconditional) and from 2.21 to 1.87 (conditional), with a convergence speed approximately 3 times faster.

Value Matching: Scalable and Gradient-Free Reward-Guided Flow Adaptation

The authors reformulate "reward adaptation for large-scale flow/diffusion models" as a stochastic optimal control (SOC) problem, learning only a small value network online while freezing the base model. This approach supports non-differentiable (black-box) rewards and allows for on-demand GPU memory adjustment, achieving comparable performance on image and molecular generation using less than 5% of the memory required by fine-tuning methods.

Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling

To address the issue of sample corruption in Masked Diffusion Models (MDM) during few-step sampling caused by "independent prediction across dimensions," this paper proposes VADD. By introducing a Gaussian latent variable \(z\) into the denoising distribution and jointly training the denoising and recognition models via a Variational Autoencoder (VAE) framework, it implicitly models inter-dimensional correlations. This significantly improves the sample quality of few-step generation while maintaining the same sampling overhead as standard MDM.

Verification of the Implicit World Model in a Generative Model via Adversarial Sequences

This paper proposes an adversarial sequence generation method to verify the soundness of implicit world models in generative sequence models. Systematic evaluations across various adversarial strategies (IMO/BSO/AD) in the chess domain reveal that all tested models are unsound. Results indicate that training methods and dataset selection significantly impact soundness, and linear board state probes lack causal influence in most models.

VFScale: Intrinsic Reasoning through Verifier-Free Test-time Scalable Diffusion Model

VFScale proposes a verifier-free test-time scalable diffusion model. By employing MRNCL loss and KL regularization to improve the energy landscape, the intrinsic energy function serves as a verifier. Combined with hybrid MCTS denoising for efficient searching, a model trained on \(6 \times 6\) mazes can solve 88% of \(15 \times 15\) mazes, whereas standard diffusion models fail completely.

ViPO: Visual Preference Optimization at Scale

Addressing the "scaling ceiling" of preference optimization in visual generation, this work advances both algorithms and data. It proposes Poly-DPO, which requires only two lines of code and one hyperparameter \(\alpha\) to achieve "confidence-aware" training that is robust to noisy preferences, and it constructs ViPO, a million-scale, category-balanced, 1024px preference dataset. The two components validate each other: when data quality is high, Poly-DPO automatically reduces to standard DPO (\(\alpha \to 0\)), while on noisy data it achieves a 6.87-point improvement over Diffusion-DPO on GenEval.

Visual Autoregressive Modeling for Instruction-Guided Image Editing

This paper proposes VAREdit, which reframes instruction-guided image editing as a multi-scale prediction problem. Its Scale-Aligned Reference module addresses the scale mismatch in fine-scale conditioning, and the method significantly outperforms diffusion-based approaches in edit faithfulness and efficiency.

VisualPrompter: Semantic-Aware Prompt Optimization with Visual Feedback for Text-to-Image Synthesis

VisualPrompter is a training-free prompt engineering framework for text-to-image synthesis. It utilizes an LLM to decompose user prompts into atomic concepts, employs a VLM to verify these concepts against generated images to identify "missing" elements, and performs atomic-level expansion and reorganization specifically for these missing concepts. By rewriting prompts into sentences preferred by the model without compromising user intent, it achieves new SOTA on both DSG and TIFA text-to-image alignment benchmarks.

VLM-Guided Adaptive Negative Prompting for Creative Generation

This paper proposes a training-free VLM-guided adaptive negative prompting method that continuously identifies conventional concepts emerging in the image during the diffusion denoising process and accumulates them into negative prompts to push the generation trajectory away, thereby generating novel images that still belong to the target category.

VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis

Addressing the task of "fusing two object images into a brand-new hybrid object," this paper proposes VMDiff. It constructs semantic noise carrying dual-object information via guided denoising and inversion at the noise level (concatenation rather than interpolation), fuses two embeddings into a single coherent representation using spherical interpolation at the latent level, and automatically tunes parameters through a similarity-score-driven zero-order search. This simultaneously resolves the chronic issues of "objects appearing side-by-side without true fusion" and "one object overpowering the other."
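Spherical interpolation, the operation named above for fusing two embeddings, can be sketched in a few lines. This is the textbook slerp formula on plain Python lists; how VMDiff chooses the interpolation weight or which embeddings it applies this to is not specified here, and the function name `slerp` is only illustrative.

```python
import math


def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at fraction t.

    Falls back to linear interpolation when the vectors are almost parallel.
    Exactly antiparallel inputs are not handled (the path is ambiguous).
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Halfway between two orthogonal unit vectors stays on the unit circle,
# whereas plain averaging would shrink the norm to about 0.707.
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
mid_norm = math.sqrt(sum(x * x for x in mid))
```

Keeping the norm of the blended vector is the usual reason to prefer slerp over linear interpolation when the inputs live roughly on a sphere.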

VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip

To address the inability of distilled few-step (1-8 steps) diffusion/flow-matching models to utilize CFG for negative prompts, this paper proposes Value Sign Flip (VSF). By flipping the sign of negative prompt values within the attention mechanism, VSF achieves token-level, adaptive cancellation of undesirable content across layers, steps, and regions. With nearly zero extra overhead, it improves negative compliance from 0.32–0.38 to 0.42–0.55, even outperforming CFG in non-few-step models.
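One way to read "flipping the sign of negative prompt values within the attention mechanism" is sketched below for a single query vector. This is an assumption-laden toy, not the VSF implementation: it simply concatenates positive-prompt and negative-prompt keys, negates the negative-prompt values, and runs ordinary softmax attention. Any additional scaling, masking, or per-layer handling in the actual method is not represented, and all names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_value_sign_flip(query, pos_keys, pos_values, neg_keys, neg_values):
    """Single-query attention where negative-prompt values enter with flipped sign."""
    keys = pos_keys + neg_keys
    # Assumed core step: negative-prompt value vectors are negated.
    values = pos_values + [[-v for v in vec] for vec in neg_values]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * vec[d] for w, vec in zip(weights, values)) for d in range(dim)]


query = [1.0, 0.0]
# Equal attention to both tokens: the negative token's value is subtracted.
balanced = attention_with_value_sign_flip(
    query, [[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 0.0]], [[0.0, 1.0]]
)
# A query that matches the negative key more strongly is pushed away harder.
matched = attention_with_value_sign_flip(
    query, [[0.0, 0.0]], [[1.0, 0.0]], [[4.0, 0.0]], [[0.0, 1.0]]
)
```

In this toy, the amount of cancellation depends on the attention weight the query gives to the negative token, which is consistent with the token-level, adaptive behaviour described above.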

W-Edit: A Wavelet-based Frequency-aware Framework for Text-driven Image Editing

W-Edit decomposes diffusion features into multi-scale frequency bands using wavelet transforms, injecting the "low-frequency for structure, high-frequency for detail" prior into the attention K/V of pre-trained DiTs. This achieves a training-free balance between structure preservation and local modification, reducing FID to 65.44 and increasing CLIP score to 31.84 on PIE-Bench, outperforming previous training-free editing methods.
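The "low-frequency for structure, high-frequency for detail" split can be illustrated with a single-level 1-D Haar transform. This is only a sketch under assumptions: W-Edit operates on multi-scale diffusion features, and the wavelet family, number of levels, and injection rule it uses are not stated here. The helper names `haar_split` and `haar_merge` are hypothetical.

```python
import math


def haar_split(signal):
    """Single-level orthonormal Haar transform of an even-length signal.

    Returns (low, high): scaled pairwise sums (coarse structure) and
    scaled pairwise differences (fine detail).
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    low = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    high = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return low, high


def haar_merge(low, high):
    """Inverse of haar_split: rebuild the signal from its two bands."""
    s = 1.0 / math.sqrt(2.0)
    signal = []
    for l, h in zip(low, high):
        signal.append(s * (l + h))
        signal.append(s * (l - h))
    return signal


source = [4.0, 2.0, 6.0, 6.0]
low, high = haar_split(source)
reconstructed = haar_merge(low, high)
# Zeroing the detail band keeps only the coarse structure of the signal.
smoothed = haar_merge(low, [0.0] * len(high))
```

Because the transform is invertible, the two bands can be treated separately (for example, keeping one band from a source feature while modifying the other) and then merged back without loss.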

Weak-to-Strong Diffusion with Reflection

W2SD proposes alternating "strong model denoising + weak model inversion" (reflection) during diffusion sampling. It uses the estimable "weak-to-strong gap" between a pair of off-the-shelf models to approximate the unobservable "strong-to-ideal gap," pulling the sampling trajectory toward the real data distribution without training. It significantly improves human preference and aesthetic quality across various settings (Image/Video, UNet/DiT/MoE), achieving up to a 90% HPSv2 win rate on Juggernaut-XL.

WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

WeTok is a discrete visual tokenizer that employs "Grouped Lookup-free Quantization (GQ)" to bypass the memory explosion of entropy loss by partitioning large codebooks into smaller groups. It further utilizes a "Generative Decoder (GD)" to transform the decoder from a deterministic regression model into a noise-conditioned GAN generator, enabling the reconstruction of fine details even at high compression ratios. On ImageNet 50k with 400% compression, it achieves a zero-shot rFID of 0.12, surpassing continuous tokenizers such as FLUX-VAE (0.18) and SD-VAE 3.5 (0.19).
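The idea of partitioning a lookup-free codebook into groups can be sketched as below. The sketch assumes the common lookup-free quantization rule (each latent channel is quantized to its sign, so the code index is just the resulting bit pattern); WeTok's exact grouping and loss are not given here, and `grouped_lfq` is a hypothetical name.

```python
def grouped_lfq(latent, num_groups):
    """Sign-quantize a latent vector and compute one code index per group.

    With d channels in a single group the implicit codebook has 2**d entries;
    splitting into g groups gives g codebooks of 2**(d // g) entries each.
    """
    if len(latent) % num_groups != 0:
        raise ValueError("latent length must be divisible by num_groups")
    group_size = len(latent) // num_groups
    quantized = [1.0 if x > 0 else -1.0 for x in latent]
    indices = []
    for g in range(num_groups):
        bits = quantized[g * group_size:(g + 1) * group_size]
        index = 0
        for b in bits:
            # Read the signs as a binary number, most significant bit first.
            index = (index << 1) | (1 if b > 0 else 0)
        indices.append(index)
    return quantized, indices


latent = [0.3, -1.2, 0.8, 0.1, -0.5, -0.2, 2.0, -0.1]
quantized, indices = grouped_lfq(latent, num_groups=2)
# One group would need statistics over 2**8 = 256 codes;
# two groups need statistics over two sets of 2**4 = 16 codes.
```

Any entropy-style regularizer that has to track a distribution over all codes then scales with the per-group codebook size rather than the full product, which is the memory saving the summary refers to.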

What Exactly Does Guidance Do in Masked Discrete Diffusion Models

This paper provides the first rigorous characterization of classifier-free guidance (CFG) in masked discrete diffusion models under low-dimensional (\(1D/2D\)) analytical settings. It demonstrates that CFG moves probability mass from "inter-class overlapping regions" to "class-exclusive regions," and the convergence speed of reverse sampling dynamics toward the target distribution accelerates doubly exponentially with respect to the guidance strength \(w\).
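The mass-shifting effect can be reproduced on a toy categorical distribution. The sketch assumes one common CFG convention, \(p_w(x) \propto p(x \mid c)^{1+w} \, p(x)^{-w}\); the paper's exact parameterization of \(w\) may differ, and the two-token setup below is a hypothetical example rather than anything taken from the paper.

```python
import math


def guided_distribution(cond_probs, uncond_probs, w):
    """Apply classifier-free guidance to a categorical distribution.

    Uses log p_w = (1 + w) * log p(x | c) - w * log p(x), then renormalizes.
    All input probabilities must be strictly positive.
    """
    logits = [
        (1.0 + w) * math.log(pc) - w * math.log(pu)
        for pc, pu in zip(cond_probs, uncond_probs)
    ]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Token 0 appears only under the target class; token 1 is shared with
# another, equally likely class that always emits token 1.
cond = [0.5, 0.5]        # p(x | target class)
uncond = [0.25, 0.75]    # 0.5 * [0.5, 0.5] + 0.5 * [0.0, 1.0]

no_guidance = guided_distribution(cond, uncond, w=0.0)
guided = guided_distribution(cond, uncond, w=1.0)
stronger = guided_distribution(cond, uncond, w=4.0)
```

As \(w\) grows, probability moves from the shared token to the class-exclusive one, which is the qualitative behaviour the summary describes.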

What Matters for Representation Alignment: Global Information or Spatial Structure?

This paper systematically shows that Representation Alignment (REPA) accelerates diffusion model training not through the global semantic information of the target representation (as measured by ImageNet linear probe accuracy), but through the spatial self-similarity structure among its patch tokens. Based on this, the authors propose iREPA, a change of only 4 lines of code (convolutional projection + spatial normalization) that consistently accelerates REPA convergence across 27 encoders, various model scales, and training recipes.

When Scores Learn Geometry: Rate Separations under the Manifold Hypothesis

This work reveals a scale separation between geometric and distributional information in score learning under the manifold hypothesis: the geometric signal about the manifold scales as \(\Theta(\sigma^{-2})\), a factor of \(O(\sigma^{-2})\) stronger than the distributional signal. The authors argue from this that the success of diffusion models stems primarily from learning the data manifold rather than the full distribution, and they propose a one-line code modification that generates uniform distributions on the manifold.

Why Adversarially Train Diffusion Models?

This paper reformulates adversarial training from classifiers into an "equivariant smoothing" regularizer for diffusion models, enabling the denoising network to generate samples along cleaner and more stable score fields even when training data is highly contaminated or sampling trajectories are attacked.

WILD-Diffusion: A WDRO-inspired Training Method for Limited Data Diffusion Models

This paper introduces Wasserstein Distributionally Robust Optimization (WDRO) into diffusion model training. By iteratively generating "worst-case" samples within a Wasserstein uncertainty set centered on the limited data distribution, the training support set is dynamically expanded. This approach reduces FID by more than 10% when using only 20% of the data and provides a plug-and-play training framework with convergence guarantees.

WithAnyone: Toward Controllable and ID Consistent Image Generation

To address the "copy-paste" artifact where models directly overlay reference faces onto outputs in identity-customized generation, this paper constructs MultiID-2M, a paired multi-person dataset of 500,000 images, and proposes MultiID-Bench, a benchmark capable of quantifying copy-paste. By utilizing paired training and an ID contrastive loss with extended negative samples, the authors develop WithAnyone (based on FLUX). It achieves the lowest copy-paste score in its class while maintaining the highest SimGT, effectively breaking the "more accurate similarity leads to more severe copying" trade-off.

WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark

Addressing implicit editing instructions that provide causes without results (e.g., "throw a ball at a cactus"), this paper constructs WorldEdit, an 11k high-quality dataset emphasizing real-world causal transformations, along with the WorldEdit-Test benchmark. By employing a two-stage fine-tuning of Bagel via "CoT Supervised Fine-Tuning + Flow-GRPO Reinforcement Learning (including inverse causal verification reward)," the authors elevate open-source causal editing performance to levels near GPT-4o and Nano-Banana.