🎨 Image Generation

🔬 ICLR2026 · 154 paper notes

A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers: This paper presents the first systematic analysis of conditional embeddings in diffusion Transformers, revealing extreme angular similarity (inter-class cosine similarity >99%) and dimensional sparsity (only 1–2% of dimensions carry semantic information). Pruning 2/3 of low-magnitude dimensions leaves generation quality virtually unchanged, exposing a hidden semantic bottleneck in conditional embeddings.
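The two operations named in this summary, cosine similarity and magnitude-based pruning, can be sketched in a few lines. This is a minimal illustration only: the helper names and the toy 6-dimensional vector are assumptions, not the paper's code or data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prune_low_magnitude(embedding):
    """Keep the top third of dimensions by |value|; zero out the other 2/3."""
    n_keep = max(1, len(embedding) // 3)
    # Indices sorted by descending magnitude; the first n_keep survive.
    order = sorted(range(len(embedding)), key=lambda i: -abs(embedding[i]))
    kept = set(order[:n_keep])
    return [v if i in kept else 0.0 for i, v in enumerate(embedding)]

# Hypothetical toy embedding: two large dimensions, four near-zero ones.
embedding = [4.0, -3.0, 0.1, -0.2, 0.05, 0.0]
pruned = prune_low_magnitude(embedding)
print(pruned)
print(cosine_similarity(embedding, pruned))
```

In this toy case the pruned vector stays almost parallel to the original, which is the kind of check one could run on real conditional embeddings before and after pruning.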
AlignTok: Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models: This paper proposes AlignTok, which aligns pretrained visual foundation encoders (e.g., DINOv2) into continuous tokenizers for diffusion models. Through a three-stage alignment strategy—semantic latent space establishment → perceptual detail supplementation → decoder refinement—AlignTok constructs a semantically rich latent space, achieving gFID 1.90 on ImageNet 256×256 in only 64 epochs, converging faster and yielding better generation quality than VAEs trained from scratch.
Amortising Inference and Meta-Learning Priors in Neural Networks (BNNP): This paper proposes BNNP (Bayesian Neural Network Process), a neural process that treats BNN weights as latent variables and the BNN itself as the decoder. Through layer-wise amortised variational inference, BNNP jointly learns BNN priors and inference networks across multiple datasets. It is the first work to empirically answer "Does a good prior eliminate the need for a good approximate inference method?"—the answer is no; there is no free lunch.
Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation: AsynDM assigns different timestep schedules to different pixels—denoising prompt-relevant regions more slowly—so that these regions can leverage cleaner contextual references, thereby significantly improving semantic alignment in text-to-image generation without requiring any fine-tuning.
Autoregressive Image Generation with Randomized Parallel Decoding: This paper proposes ARPG, a visual autoregressive model built upon a "guided decoding" framework that decouples positional guidance (query) from content representation (key-value), enabling fully randomized-order training and generation with efficient parallel decoding. On ImageNet-1K 256×256, ARPG achieves 1.94 FID in 64 steps with over 20× throughput improvement and over 75% memory reduction.
Beyond Confidence: The Rhythms of Reasoning in Generative Models: This paper proposes the Token Constraint Bound (\(\delta_{\text{TCB}}\)) metric, which quantifies the largest perturbation to an LLM's hidden state that preserves the next-token prediction, measuring local prediction robustness and revealing instabilities that traditional perplexity fails to capture.
Blueprint-Bench: Comparing Spatial Intelligence of LLMs, Agents and Image Models: Blueprint-Bench evaluates AI spatial reasoning through the task of "generating 2D floorplans from apartment interior photographs": the inputs (photos) are fully within the training distribution, while the task (spatial reconstruction) is out-of-distribution. The benchmark evaluates LLMs including GPT-5, Claude 4 Opus, Gemini 2.5 Pro, and Grok-4; image generation models including GPT-Image and NanoBanana; and agent systems including Codex CLI and Claude Code. Results show that the vast majority of models perform at or below a random baseline, revealing a systematic blind spot in current AI spatial intelligence.
Branched Schrödinger Bridge Matching: This paper proposes BranchSBM, a framework that extends Schrödinger Bridge Matching to branching scenarios by parameterizing multiple time-dependent velocity fields and growth processes. It models bifurcating dynamics from a single initial distribution to multiple target distributions, significantly outperforming single-branch methods on tasks such as LiDAR surface navigation and single-cell perturbation modeling.
Bridging Degradation Discrimination and Generation for Universal Image Restoration: BDG performs fine-grained degradation discrimination via multi-angle multi-scale gray-level co-occurrence matrices (MAS-GLCM), and designs a three-stage diffusion training pipeline (generation → bridging → restoration) to seamlessly integrate degradation discrimination with generative priors, achieving significant fidelity improvements on all-in-one restoration and real-world super-resolution tasks.
Bridging Generalization Gap of Heterogeneous Federated Clients Using Generative Models: FedVTC proposes that, in model-heterogeneous federated learning, each client generates synthetic data via a Variational Transposed Convolution network (VTC) from aggregated feature distribution statistics to fine-tune the local model. Without requiring a public dataset, the method significantly improves generalization while reducing communication and memory overhead.
CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models: This paper proposes Consistency Mid-Training (CMT), which inserts a lightweight intermediate training stage between a pretrained diffusion model and flow map post-training. By training the model to map arbitrary points on ODE trajectories back to clean samples, CMT yields trajectory-aligned initialization, reducing training cost by up to 98% while achieving state-of-the-art two-step generation quality.
Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition: This paper proposes General Policy Composition (GPC), which at test time convexly combines the distribution scores of multiple pretrained diffusion/flow policies without additional training, yielding a composite policy that surpasses any individual parent policy. Theoretical analysis proves that convex combination improves single-step score error, and this improvement propagates to the full sampling trajectory via a Grönwall bound.
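As a rough sketch of the distribution-level composition described here, the snippet below forms a convex combination of two score vectors. The function names, and the use of 1D Gaussian scores as stand-ins for pretrained policy scores, are assumptions made for illustration.

```python
def gaussian_score(x, mean, var=1.0):
    """Score (gradient of log-density) of N(mean, var) at x; a stand-in policy score."""
    return (mean - x) / var

def compose_scores(score_a, score_b, weight):
    """Convex combination weight * score_a + (1 - weight) * score_b, elementwise."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1] for a convex combination")
    return [weight * a + (1.0 - weight) * b for a, b in zip(score_a, score_b)]

# Two hypothetical parent policies centred at 0 and 2, evaluated at x = 1.
x = 1.0
combined = compose_scores([gaussian_score(x, 0.0)], [gaussian_score(x, 2.0)], 0.5)
print(combined)
```

With equal weights the two scores cancel at the midpoint, so the composite score there is zero. In a real sampler the combined score would replace the single-policy score at every denoising step.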
Compositional amortized inference for large-scale hierarchical Bayesian models: This paper extends compositional score matching (CSM) to hierarchical Bayesian models, addressing numerical instability under large numbers of data groups via a novel error-damping estimator and mini-batch strategy. It achieves, for the first time, amortized inference over hierarchical models exceeding 750,000 parameters (250,000+ data groups), validated on a real-world fluorescence lifetime imaging application.
Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution: This paper proposes Concept-TRAK, which extends influence functions from image-level to concept-level attribution by designing concept-specific training losses (DPS reward) and utility losses (CFG guidance). The method substantially outperforms TRAK, D-TRAK, and DAS on synthetic, CelebA-HQ, and AbC benchmarks, with particularly significant advantages in OOD settings where novel concept combinations are evaluated.
Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss: This paper theoretically analyzes the advantage of autoregressive diffusion loss models over conditional diffusion models in correcting condition errors (exponential decay of gradient norms), and proposes a condition refinement method based on optimal transport (Wasserstein Gradient Flow) to address the "condition inconsistency" problem in autoregressive generation, achieving FID 1.31 on ImageNet (based on MAR).
Conditionally Whitened Generative Models for Probabilistic Time Series Forecasting: This paper proposes CW-Gen (Conditionally Whitened Generative Models), which replaces the standard Gaussian terminal distribution in diffusion models and flow matching by jointly estimating the conditional mean and a sliding-window covariance matrix. The authors provide theoretical guarantees showing that sampling quality is necessarily improved when the estimator satisfies sufficient conditions, and demonstrate consistent improvements in multivariate time series probabilistic forecasting across 5 datasets × 6 generative models.
Conjuring Semantic Similarity: This paper proposes a vision-imagination-based measure of textual semantic similarity by computing the Jeffreys divergence between the reverse SDEs induced by a text-conditioned diffusion model under two text prompts. The metric is directly computable via Monte-Carlo sampling and, for the first time, quantifies the alignment between the semantic space learned by diffusion models and human annotations.
Consistent Text-to-Image Generation via Scene De-Contextualization: This paper identifies the root cause of identity (ID) shift in T2I models as scene contextualization — the injection of contextual information from scene tokens into ID tokens — and proposes a training-free method, Scene De-Contextualization (SDeC), that leverages SVD singular value directional stability analysis to identify and suppress latent scene–ID associations in prompt embeddings, enabling per-scene identity-consistent generation.
Contact-Guided 3D Genome Structure Generation of E. coli via Diffusion Transformers: This paper proposes DiffBacChrom — a conditional diffusion Transformer (CrossDiT) that generates ensembles of 3D genome conformations for E. coli from Hi-C contact maps. The method employs a ResNet VAE to maintain bin-aligned latent encodings, a Transformer encoder with cross-attention for Hi-C conditioning, and flow-matching training. The generated ensembles exhibit high agreement with input Hi-C data in terms of distance-decay curves \(P(s)\) and SCC metrics, while preserving conformational diversity.
Contact Wasserstein Geodesics for Non-Conservative Schrödinger Bridges: This paper proposes the Non-Conservative Generalized Schrödinger Bridge (NCGSB)—built on contact Hamiltonian mechanics to allow time-varying energy—and introduces the Contact Wasserstein Geodesic (CWG), which reformulates the bridge problem as geodesic computation on a finite-dimensional Jacobi metric. A ResNet parameterization achieves near-linear complexity and supports guided generation. CWG substantially outperforms iterative SB solvers on manifold navigation, molecular dynamics, and image generation tasks.
ContextBench: Modifying Contexts for Targeted Latent Activation: This paper introduces ContextBench, a benchmark comprising 715 tasks for evaluating methods that automatically generate fluent inputs capable of activating specific latent features, and proposes two EPO-enhanced variants—LLM-assisted and diffusion-model inpainting—that Pareto-dominate standard EPO in the trade-off between activation strength and linguistic fluency.
Continual Unlearning for Text-to-Image Diffusion Models: A Regularization Perspective: This paper presents the first systematic study of continual unlearning for text-to-image (T2I) diffusion models. It identifies that existing unlearning methods suffer from "utility collapse" under sequential unlearning requests due to cumulative parameter drift, and proposes a suite of plug-in regularization strategies (L1/L2 norm, selective fine-tuning, model merging) along with a semantics-aware gradient projection method to mitigate this issue.
Contractive Diffusion Policies: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations: This paper proposes Contractive Diffusion Policies (CDPs), which introduce contraction regularization into the diffusion sampling ODE to suppress the accumulation of score matching errors and solver errors. With minimal modification and a single hyperparameter \(\gamma\), CDPs improve the robustness of diffusion-based policies in offline learning settings.
COSMO-INR: Complex Sinusoidal Modulation for Implicit Neural Representations: Through harmonic distortion analysis and Chebyshev polynomial approximation, this paper rigorously proves that odd/even symmetric activation functions exhibit systematic attenuation in post-activation spectra. It proposes modulating activation functions with complex sinusoidal terms \(e^{j\zeta x}\) to preserve full spectral support, and introduces the COSMO-RC activation function alongside a regularized prior embedder architecture. The method achieves an average PSNR gain of +5.67 dB over the strongest baseline on Kodak image reconstruction and +3.45 dB on NeRF.
CREPE: Controlling Diffusion with Replica Exchange: This paper proposes CREPE, an inference-time control method for diffusion models based on Replica Exchange (Parallel Tempering), serving as the computational dual of SMC — it operates in parallel across denoising steps while generating samples serially. CREPE offers high sample diversity, supports online refinement, and handles a variety of tasks including temperature annealing, reward tilting, model composition, and CFG debiasing.
DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment: This paper addresses the sparse reward problem in Flow Matching + GRPO alignment by estimating step-wise reward gains as dense rewards via ODE denoising rollouts of intermediate latents, and adaptively calibrating per-timestep noise injection in the SDE sampler based on dense rewards to regulate exploration. The method outperforms Flow-GRPO on three tasks: human preference alignment, compositional generation, and text rendering.
Detecting and Mitigating Memorization in Diffusion Models through Anisotropy of the Log-Probability: This paper demonstrates that norm-based memorization detection metrics are valid only under isotropic log-probability distributions and fail in low-noise anisotropic regimes. A denoising-free detection metric is proposed that combines high-noise norms with low-noise angular alignment (cosine similarity), surpassing existing denoising-free methods on SD v1.4/v2.0 while being over 5× faster.
DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation: This paper proposes DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation, comprising InkVAE (which learns a structured latent space via dual regularization from OCR and style classification) and InkDiT (which performs conditional denoising generation in the latent space). DiffInk substantially outperforms the state of the art on Chinese handwriting generation (AR 94.38% vs. 91.48%) while achieving an 800× speedup.
Diffusion Alignment as Variational Expectation-Maximization: This paper formalizes diffusion model alignment as a variational EM algorithm: the E-step employs test-time search (soft-Q-guided sampling with importance sampling) to explore multimodal, high-reward trajectories, while the M-step distills the search results into model parameters via forward-KL minimization. The approach simultaneously achieves high reward and high diversity on both image generation and DNA sequence design tasks.
Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models: This paper proposes Diffusion Blend, which achieves multi-preference alignment at inference time by blending the backward diffusion processes of multiple reward-finetuned models. DB-MPA supports arbitrary linear combinations of rewards; DB-KLA enables dynamic KL regularization control; DB-MPA-LS eliminates inference overhead via stochastic LoRA sampling. The paper theoretically derives error bounds for the blending approximation and empirically approaches the MORL oracle upper bound.
Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function: This paper proposes SQDF (Soft Q-based Diffusion Finetuning), which fine-tunes diffusion models under a KL-regularized RL framework via a training-free differentiable soft Q-function approximation and reparameterized policy gradients. Three complementary components—a discount factor, a consistency model, and an off-policy replay buffer—collectively optimize the target reward while effectively mitigating reward over-optimization, preserving sample naturalness and diversity.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process: DiffusionNFT proposes a fundamentally new online RL paradigm for diffusion models: rather than performing policy optimization over the reverse sampling process (as in GRPO), it performs contrastive training on positive and negative samples via a flow matching objective over the forward process, defining an implicit policy improvement direction. The method is 3–25× faster than FlowGRPO and requires no CFG.
Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild: This paper proposes DrPose, which applies direct reward fine-tuning to maximize PoseScore—a metric measuring skeletal consistency between multi-view latent images and ground-truth 3D poses—combined with KL regularization to prevent reward hacking. Together with the DrPose15K dataset (15K diverse poses sampled from the Motion-X dataset and animated via the MIMO video generator), DrPose significantly improves 3D human reconstruction quality under challenging poses such as dynamic movements and acrobatics.
Directional Textual Inversion for Personalized Text-to-Image Generation: This paper identifies a norm inflation problem in token embeddings learned by Textual Inversion (TI), which degrades text alignment under complex prompts. The proposed Directional Textual Inversion (DTI) fixes the embedding norm at an in-distribution scale and optimizes only the direction on the unit hypersphere via Riemannian SGD, regularized by a von Mises-Fisher prior, substantially improving prompt faithfulness.
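The idea of fixing the norm and optimizing only the direction can be illustrated with a generic Riemannian SGD step on the unit sphere: project the gradient onto the tangent space, step, then renormalize. The toy loss, the `FIXED_NORM` value, and all function names below are assumptions, and the von Mises-Fisher prior is omitted.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def riemannian_sgd_step(direction, euclid_grad, lr):
    """One gradient step constrained to the unit hypersphere."""
    # Remove the radial component so the update lies in the tangent space.
    radial = sum(g * d for g, d in zip(euclid_grad, direction))
    tangent_grad = [g - radial * d for g, d in zip(euclid_grad, direction)]
    # Step along the tangent direction, then retract to the sphere.
    stepped = [d - lr * t for d, t in zip(direction, tangent_grad)]
    return normalize(stepped)

FIXED_NORM = 0.4  # hypothetical in-distribution embedding norm
target = normalize([1.0, 2.0, 2.0])
direction = normalize([1.0, 0.0, 0.0])
for _ in range(200):
    # Toy loss -<direction, target>; its Euclidean gradient is -target.
    grad = [-t for t in target]
    direction = riemannian_sgd_step(direction, grad, lr=0.1)

# The embedding norm never changes; only its direction was learned.
embedding = [FIXED_NORM * d for d in direction]
print(embedding)
```

Because every update ends with a renormalization, the embedding norm stays at the chosen scale no matter how many steps are taken.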
Discrete Adjoint Matching: This paper proposes Discrete Adjoint Matching (DAM), which derives adjoint variables for discrete state spaces from a purely statistical perspective (rather than from control theory), extending the continuous-domain Adjoint Matching framework to discrete generative models based on continuous-time Markov chains (CTMCs). The approach enables effective fine-tuning of diffusion-based LLMs (LLaDA-8B), improving accuracy on Sudoku from 11.5% to 89.2%.
DistillKac: Few-Step Image Generation via Damped Wave Equations: This paper replaces the Fokker-Planck equation with the damped wave equation (telegrapher's equation) and its stochastic Kac representation as the probabilistic flow foundation for generative models, enabling finite-speed propagation. An endpoint distillation method is proposed for few-step generation, achieving FID=4.14 in 4 steps and FID=5.66 in 1 step on CIFAR-10.
Diverse Text-to-Image Generation via Contrastive Noise Optimization: This paper proposes Contrastive Noise Optimization (CNO), which applies an InfoNCE contrastive loss over the Tweedie denoised prediction space to optimize initial noise vectors as a preprocessing step, improving the generation diversity of diffusion models while maintaining fidelity, without modifying the sampling process or model parameters.
Does FLUX Already Know How to Perform Physically Plausible Image Composition?: This paper proposes SHINE, a training-free image composition framework that leverages the intrinsic physical priors of pretrained T2I models (e.g., FLUX) via three components — Manifold-Steered Anchor Loss, Degradation-Suppression Guidance, and Adaptive Background Blending — to achieve high-quality object insertion under complex lighting conditions (shadows, water reflections, etc.).
Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study: Through rigorous prompt-level paired statistical testing, this work finds that transferring semantic noise initialization (golden noise) from the image domain to video diffusion models yields a marginally positive but statistically insignificant trend on temporal metrics (p≈0.17). Noise-space diagnostics reveal that insufficient directional stability and spatiotemporal frequency structure discrepancies are the root causes.
DoFlow: Flow-based Generative Models for Interventional and Counterfactual Forecasting: This paper proposes DoFlow, a causal generative model based on continuous normalizing flows (CNF) that unifies observational, interventional, and counterfactual time series forecasting over a causal DAG. The model additionally supports anomaly detection via explicit likelihood estimation, and is validated on both synthetic and real-world medical datasets.
DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing: This paper presents DragFlow, the first framework to incorporate the strong generative priors of FLUX (DiT) into drag editing. By replacing conventional point-level supervision with region-level affine supervision, combined with gradient-mask hard constraints and adapter-enhanced inversion, the method substantially improves drag editing quality.
Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing: This paper identifies a role imbalance in existing unified multimodal models, where the understanding module merely acts as a translator while the generation module is forced to simultaneously serve as both "designer" and "painter." By constructing the DIM dataset (14M long-context text-image pairs + 233K CoT editing blueprints), design responsibilities are transferred to the understanding module. The resulting 4.6B-parameter model surpasses models five times its size.
Dual-Solver: A Generalized ODE Solver for Diffusion Models with Dual Prediction: This paper proposes Dual-Solver, which generalizes multi-step samplers for diffusion models via three sets of learnable parameters — prediction type interpolation \(\gamma\), integration domain selection \(\tau\), and residual adjustment \(\kappa\) — and learns these parameters using the classification loss of frozen pretrained classifiers (MobileNet/CLIP) without requiring teacher trajectories. The method consistently outperforms DPM-Solver++ and related approaches in the low-NFE regime (3–9 NFE).
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?: This paper proposes T2I-CoReBench, the first comprehensive benchmark that systematically evaluates both compositional ability (Composition) and reasoning ability (Reasoning) of T2I models. It covers 12 evaluation dimensions, 1,080 high-difficulty prompts, and approximately 13,500 checklist questions. Large-scale evaluation of 38 models reveals that reasoning capability lags far behind compositional capability, constituting the primary bottleneck in current T2I generation.
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing: This paper constructs EditReward-Data, a high-quality dataset of 200K expert-annotated preference pairs, and trains the EditReward reward model, which achieves state-of-the-art human alignment across multiple image editing evaluation benchmarks. The model is further validated as a data filter that substantially improves downstream editing model performance.
EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling: This paper proposes the first systematic "benchmark evaluation → reward modeling → reinforcement learning training" pipeline for image editing: constructing the EditReward-Bench benchmark, training the EditScore reward model series (7B–72B, surpassing GPT-5), and successfully applying it to Online RL training to significantly improve editing model performance.
Efficient Adversarial Attacks on High-dimensional Offline Bandits: This paper exposes a security vulnerability in offline multi-armed bandit (MAB) evaluation frameworks: an attacker can completely hijack a bandit's decision-making behavior by applying imperceptibly small perturbations to publicly available reward model weights. The required perturbation norm decreases as input dimensionality increases (\(\widetilde{\mathcal{O}}(d^{-1/2})\)), rendering image-based generative model evaluation pipelines particularly vulnerable.
Eliminating VAE for Fast and High-Resolution Generative Detail Restoration: By replacing the VAE encoder and decoder with ×8 pixel-(un)shuffle operations, this work converts latent-space diffusion super-resolution (GenDR) into pixel-space super-resolution (GenDR-Pix). Combined with multi-stage adversarial distillation and a PadCFG inference strategy, the method achieves 2.8× speedup and 60% memory reduction with negligible visual degradation, enabling 4K image restoration within 1 second using only 6 GB of VRAM for the first time.
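Pixel-(un)shuffle is a lossless rearrangement between spatial resolution and channels. A minimal single-channel sketch on nested lists follows; the function names are illustrative, and the demo uses a factor of 2 on a 4×4 image rather than the ×8 factor mentioned above.

```python
def pixel_unshuffle(image, r):
    """Rearrange an (H, W) image into r*r channels of shape (H/r, W/r)."""
    h, w = len(image), len(image[0])
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    return [
        [[image[y * r + dy][x * r + dx] for x in range(w // r)]
         for y in range(h // r)]
        for dy in range(r) for dx in range(r)
    ]

def pixel_shuffle(channels, r):
    """Inverse of pixel_unshuffle: (r*r, H/r, W/r) back to (H, W)."""
    hs, ws = len(channels[0]), len(channels[0][0])
    image = [[0] * (ws * r) for _ in range(hs * r)]
    for c, channel in enumerate(channels):
        dy, dx = divmod(c, r)  # channel index encodes the sub-pixel offset
        for y in range(hs):
            for x in range(ws):
                image[y * r + dy][x * r + dx] = channel[y][x]
    return image

image = [[4 * row + col for col in range(4)] for row in range(4)]
channels = pixel_unshuffle(image, 2)
restored = pixel_shuffle(channels, 2)
print(channels[0])
print(restored == image)
```

Unlike a learned VAE encoder and decoder, the round trip is exact and has no parameters, which is why it is cheap in both time and memory.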
Embracing Discrete Search: A Reasonable Approach to Causal Structure Learning: This paper proposes FLOP (Fast Learning of Order and Parents), a score-based causal discovery algorithm for linear models. By introducing fast parent selection and iterative Cholesky-based score updates, FLOP substantially reduces runtime, rendering Iterated Local Search (ILS) practical. It achieves near-perfect graph recovery on standard causal discovery benchmarks, reestablishing discrete search as a principled and competitive approach in causal discovery.
Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance: This paper proposes ERK-Guid, which leverages the order-difference error of embedded Runge-Kutta solvers as a guidance signal to adaptively correct local truncation error (LTE) in stiff regions, improving diffusion model sampling quality without additional network evaluations.
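An embedded Runge-Kutta pair produces two solutions of different order from the same stage evaluations, and their difference estimates the local error. The sketch below uses the classic Heun-Euler pair on a scalar ODE. It shows only how the error signal arises, not how ERK-Guid turns it into guidance, and the test functions are assumptions.

```python
def heun_euler_step(f, t, x, h):
    """One embedded Heun-Euler step: returns (2nd-order estimate, error estimate)."""
    k1 = f(t, x)
    k2 = f(t + h, x + h * k1)
    x_euler = x + h * k1                # 1st-order solution
    x_heun = x + 0.5 * h * (k1 + k2)    # 2nd-order solution, reusing k1 and k2
    return x_heun, abs(x_heun - x_euler)

def smooth(t, x):
    return -1.0 * x    # slowly decaying dynamics

def stiff(t, x):
    return -50.0 * x   # rapidly decaying (stiff) dynamics

_, err_smooth = heun_euler_step(smooth, 0.0, 1.0, 0.01)
_, err_stiff = heun_euler_step(stiff, 0.0, 1.0, 0.01)
print(err_smooth, err_stiff)
```

For the same step size, the order-difference error is several orders of magnitude larger on the stiff problem, which is what makes it usable as a stiffness indicator.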
Event-T2M: Event-level Conditioning for Complex Text-to-Motion Synthesis: This paper proposes the Event-T2M framework, which decomposes text prompts into event-level atomic actions and injects them into a Conformer-based diffusion model via a TMR encoder and an Event-level Cross-Attention (ECA) module, significantly improving generation quality and semantic alignment for complex multi-event motion synthesis.
Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models: This paper proposes SpatialGenEval, a benchmark comprising 1,230 long, information-dense prompts spanning 10 spatial sub-domains, for systematically evaluating the spatial intelligence of 23 state-of-the-art T2I models. The benchmark reveals that spatial reasoning is the primary bottleneck. The authors additionally construct the SpatialT2I dataset to enable data-centric improvement of spatial intelligence.
Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model: This paper proposes ECAD (Evolutionary Caching to Accelerate Diffusion models), which employs a genetic algorithm to automatically search for optimal caching schedules along the speed–quality Pareto frontier. Without modifying model parameters and using only 100 calibration prompts, ECAD achieves 2–3× inference speedup while maintaining or even improving generation quality.
Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search: This paper proposes Bias-Guided Prompt Search (BGPS), which combines LLM decoding guidance with a diffusion model intermediate-layer attribute classifier to automatically discover interpretable text prompts that maximally expose hidden social biases in T2I models—revealing residual biases even in debiased models.
Factuality Matters: When Image Generation and Editing Meet Structured Visuals: This paper presents the first systematic study on the generation and editing of structured images (charts, mathematical figures, diagrams, tables, etc.), contributing a 1.3M-pair code-aligned training dataset with CoT reasoning annotations, a unified VLM+diffusion model architecture, and the StructBench benchmark with 1,700+ samples. The work reveals that reasoning capability is the key bottleneck for current models in handling structured visual content.
SSCP: Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning: This paper proposes the Single-Step Completion Policy (SSCP), which compresses multi-step generative policies into single-step inference by predicting a "completion vector" (the normalized direction from any intermediate state to the target action) within a flow matching framework. On D4RL, SSCP matches multi-step diffusion/flow policies while achieving 64× faster training and 4.7× faster inference, and extends to GCRL to flatten hierarchical policies.
Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation: This paper proposes Flow2GAN, a two-stage training framework that first employs an improved Flow Matching objective to learn generative capabilities, then fine-tunes with a GAN to achieve few-step (1/2/4 steps) high-fidelity audio generation, combined with a multi-resolution network architecture that processes Fourier coefficients at different time-frequency resolutions.
Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning: By injecting controllable noise into flow matching training to broaden policy coverage, and combining an entropy-guided sampling mechanism to dynamically balance exploration and exploitation during online fine-tuning, FINO significantly improves sample efficiency in offline-to-online RL under limited interaction budgets.
FlowCast: Advancing Precipitation Nowcasting with Conditional Flow Matching: This work is the first to apply Conditional Flow Matching (CFM) as an end-to-end probabilistic generative model for precipitation nowcasting. By learning a direct noise-to-data mapping in a compressed latent space, the proposed method surpasses diffusion-based models in both predictive accuracy and probabilistic performance with significantly fewer sampling steps.
FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching: FlowCast is a framework that introduces speculative decoding into Flow Matching models. It exploits the local smoothness of the velocity field to extrapolate future states using the current velocity prediction as a zero-cost draft, then selectively skips redundant steps via MSE-based verification, achieving >2.5× speedup without quality degradation.
Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control: This paper proposes Follow-Your-Shape, a training-free and mask-free shape-aware editing framework. It constructs a Trajectory Divergence Map (TDM) by computing token-level velocity discrepancies between inversion and editing trajectories to precisely localize editing regions, and employs a staged KV injection strategy to achieve large-scale shape transformations while strictly preserving the background.
Free Lunch for Stabilizing Rectified Flow Inversion: This paper proposes PMI (Proximal-Mean Inversion) and mimic-CFG, two training-free methods that stabilize Rectified Flow inversion by applying proximal gradient correction toward the historical mean of the velocity field. On PIE-Bench, both methods achieve state-of-the-art reconstruction and editing quality with fewer NFEs.
From Parameters to Behaviors: Unsupervised Compression of the Policy Space: Based on the manifold hypothesis, this paper proposes unsupervised compression of the policy space: an autoencoder is trained with a behavioral reconstruction loss (rather than a parameter reconstruction loss) to compress the high-dimensional policy parameter space \(\Theta \subseteq \mathbb{R}^P\) into a low-dimensional latent behavioral space \(\mathcal{Z} \subseteq \mathbb{R}^k\) (up to a 121,801:1 compression ratio). Experiments on Mountain Car, Reacher, Hopper, and HalfCheetah demonstrate that the intrinsic dimensionality of the behavioral manifold is determined by environment complexity rather than network size, and that PGPE optimization in the latent space converges faster than PPO, SAC, and other SOTA baselines on 7 out of 8 tasks.
From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation: This paper proposes TensorAR, which upgrades standard AR image generation from next-token prediction to next-tensor prediction: each step predicts an overlapping tensor (a group of consecutive tokens), and subsequent tensors overlap with preceding ones to enable iterative refinement. A discrete diffusion noise mechanism is introduced to address training information leakage. TensorAR serves as a plug-and-play module compatible with AR models such as LlamaGen, Open-MAGVIT2, and Janus-Pro, consistently improving generation quality on both class-to-image and text-to-image tasks.
GenCP: Towards Generative Modeling Paradigm of Coupled Physics: This paper proposes GenCP, which reformulates coupled multiphysics simulation as a probability density evolution problem. It leverages flow matching to learn conditional velocity fields from decoupled data, and synthesizes coupled solutions at inference time via Lie-Trotter operator splitting—realizing "decoupled training, coupled inference" with theoretically bounded error guarantees.
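As a rough illustration of the Lie-Trotter operator splitting mentioned in the GenCP entry, the sketch below composes the flows of two non-commuting linear generators and compares the result with the flow of their sum. The matrices `A` and `B` and all function names are illustrative assumptions chosen because their flows have closed forms; this is not GenCP's learned velocity fields or its actual inference procedure.

```python
import numpy as np

# Two non-commuting generators with closed-form flows (illustrative choice).
A = np.array([[0.0, 1.0], [0.0, 0.0]])  # exp(hA) = I + hA because A @ A = 0
B = np.array([[0.0, 0.0], [1.0, 0.0]])  # exp(hB) = I + hB because B @ B = 0
I2 = np.eye(2)


def lie_trotter(t, n_steps):
    """Approximate exp(t(A+B)) by alternating the two sub-flows n_steps times."""
    h = t / n_steps
    step = (I2 + h * B) @ (I2 + h * A)  # flow of A for time h, then flow of B
    return np.linalg.matrix_power(step, n_steps)


def exact(t):
    """Closed form: (A+B) @ (A+B) = I, so exp(t(A+B)) = cosh(t) I + sinh(t) (A+B)."""
    return np.cosh(t) * I2 + np.sinh(t) * (A + B)


def splitting_error(t, n_steps):
    """Frobenius-norm gap between the split flow and the coupled flow."""
    return float(np.linalg.norm(lie_trotter(t, n_steps) - exact(t)))
```

The gap shrinks as the number of sub-steps grows, which is the first-order behavior that makes "train the pieces separately, compose them at inference" plausible in principle.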
GenDR: Lighten Generative Detail Restoration: GenDR is a lightweight single-step diffusion super-resolution model for generative detail restoration. The paper identifies a fundamental mismatch between T2I and SR objectives: T2I favors multi-step sampling with a 4-channel latent space, whereas SR favors fewer steps with a 16-channel latent space. It therefore builds a customized SD2.1-VAE16 foundation model (0.9B parameters, extending the latent space via REPA representation alignment without increasing model size) and introduces CiD/CiDA consistency score identity distillation, which integrates SR-specific priors into score distillation together with adversarial learning and representation alignment. The resulting minimal pipeline contains only a UNet and a VAE, achieving 77 ms inference while surpassing existing SOTA on all quality and efficiency metrics.
Generalization of Diffusion Models Arises with a Balanced Representation Space: This paper advances the theory of diffusion model generalization. By analyzing the optimal solutions of two-layer nonlinear ReLU denoising autoencoders (DAEs), it provides a unified characterization of both memorization and generalization, and introduces a novel representation-centric perspective on generalization. The theoretical findings are consistently validated on EDM, DiT, and Stable Diffusion v1.4, and give rise to two practical applications: memorization detection and controllable editing.
Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training: This paper proposes the Generate Any Scene data engine, which systematically enumerates scene graphs from a visual element taxonomy comprising 28K objects × 1.5K attributes × 10K relations, and converts them into caption–VQA pairs. The engine supports four applications: self-improvement (SD1.5 +4%), targeted distillation (<800 samples, TIFA +10%), a scene-graph reward model (DPG-Bench +5% vs. CLIP), and content moderation enhancement.
Generating Directed Graphs with Dual Attention and Asymmetric Encoding: This paper proposes Directo, the first directed graph generation model based on Discrete Flow Matching (DFM), which captures directional dependencies of directed edges via a direction-aware dual attention mechanism and asymmetric positional encoding, while establishing a standardized evaluation benchmark for directed graph generation.
GeoDiv: Framework for Measuring Geographical Diversity in Text-to-Image Models: This paper proposes GeoDiv, a framework that leverages the world knowledge embedded in LLMs and VLMs to systematically evaluate the geographical diversity of T2I models along two dimensions — the Socioeconomic Visual Index (SEVI) and the Visual Diversity Index (VDI) — revealing systematic impoverishment biases in model outputs for countries such as India and Nigeria.
GGBall: Graph Generative Model on Poincaré Ball: This paper proposes GGBall, the first graph generation framework operating entirely on the Poincaré ball model. By combining a hyperbolic vector-quantized variational autoencoder (HVQVAE) with a Riemannian flow matching prior, GGBall achieves state-of-the-art performance on both hierarchical and molecular graph generation, reducing the average generation error by 18% on hierarchical graph benchmarks.
GLASS Flows: Efficient Inference for Reward Alignment of Flow and Diffusion Models: This paper proposes GLASS (Gaussian Latent Sufficient Statistic) Flows — a novel "flow within a flow" sampling paradigm that recasts the stochastic Markov transition \(p_{t'|t}(x_{t'} | x_t)\) as an internal ODE problem via Gaussian sufficient statistic reparameterization, reusing the pretrained denoiser without retraining. This enables Feynman-Kac Steering without sacrificing ODE efficiency or SDE stochasticity, consistently surpassing the Best-of-N ODE baseline on the FLUX text-to-image model and achieving a new state of the art in inference-time reward alignment.
Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion: This paper proposes HECRL, a hierarchical entity-centric offline goal-conditioned RL framework that combines a value-based GCRL agent with a factored subgoal diffusion model, improving success rates by more than 150% on multi-entity long-horizon tasks.
HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation: HierLoc reformulates visual geolocation as an image-to-entity alignment problem in hyperbolic space, replacing 5M+ image embeddings with ~240K geographic entity embeddings. It achieves a 19.5% reduction in mean geodesic error and a 43% improvement in sub-region accuracy on OSV5M.
HOG-Diff: Higher-Order Guided Diffusion for Graph Generation: This paper proposes HOG-Diff, a graph diffusion framework that leverages higher-order topological structures (e.g., rings, triangles, motifs) as generative guidance. By extracting higher-order skeletons via Cell Complex Filtration (CCF) and combining them with a generalized OU diffusion bridge, the framework realizes coarse-to-fine progressive graph generation, achieving state-of-the-art performance on 8 benchmarks for both molecular and general graph generation.
Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning: Recall proposes the first multi-modal guided attack framework that optimizes adversarial image prompts in the latent space using a single reference image. Combined with the original text prompt, it exploits the image-conditioning channel of diffusion models and achieves an average ASR of 65%–97% across 10 SOTA unlearning methods, substantially outperforming text-only attack methods and exposing the vulnerability of current unlearning mechanisms to image-modality attacks.
Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment (CODA): This paper proposes CODA, a framework that addresses slot entanglement and weak alignment in diffusion-based object-centric learning by introducing register slots to absorb residual attention, fine-tuning cross-attention projections, and applying a contrastive alignment loss. CODA achieves substantial improvements in object discovery and compositional generation quality on both synthetic and real-world datasets.
Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies (UPO): This paper proposes Unmasking Policy Optimization (UPO), which formalizes the denoising process of Masked Diffusion Models (MDMs) as a KL-regularized Markov Decision Process and trains a lightweight unmasking policy model via reinforcement learning to replace heuristic schedulers such as max-confidence. Both theoretical analysis and empirical results demonstrate that the learned policy generates samples closer to the true data distribution.
Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models: This paper presents the first systematic comparison of Visual Autoregressive (VAR) models and diffusion models on compositional text-image alignment. Evaluating 6 T2I models across T2I-CompBench++ and GenEval benchmarks, it finds that Infinity-8B achieves state-of-the-art performance on nearly all compositional dimensions, demonstrating a clear architectural advantage of VAR models in compositional generation.
Intention-Conditioned Flow Occupancy Models: This paper proposes InFOM, which leverages flow matching to construct an intention-conditioned occupancy model. By applying variational inference to infer latent intentions from unannotated data, InFOM enables RL pre-training without labeled datasets, achieving a 1.8× improvement in median return and a 36% gain in success rate across 36 state-based tasks and 4 visual tasks.
JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation: This paper proposes JointDiff, a joint continuous-discrete diffusion framework that, for the first time, unifies Gaussian diffusion (for trajectories) and multinomial diffusion (for ball-possession events) in a single model. It further introduces a CrossGuid module to support weak possession guidance and text-guided semantic controllable generation, achieving state-of-the-art performance on multi-agent trajectory generation in sports scenarios.
Laplacian Multi-scale Flow Matching for Generative Modeling: This paper proposes LapFlow, which decomposes images into Laplacian pyramid residuals and models different scales in parallel via a Mixture-of-Transformers (MoT) architecture with causal attention, reducing computational cost while improving generation quality.
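To make the Laplacian pyramid decomposition in the LapFlow entry concrete, here is a minimal NumPy sketch. The average-pooling downsampler and nearest-neighbor upsampler are simplifying assumptions (the paper's filters may differ), and the function names are hypothetical.

```python
import numpy as np


def downsample(x):
    """2x2 average pooling; assumes even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample(x):
    """Nearest-neighbor upsampling by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def build_pyramid(img, levels):
    """Return (residuals from fine to coarse, coarsest low-pass image)."""
    residuals = []
    cur = img
    for _ in range(levels):
        low = downsample(cur)
        residuals.append(cur - upsample(low))  # detail lost by downsampling
        cur = low
    return residuals, cur


def reconstruct(residuals, base):
    """Invert build_pyramid by adding residuals back from coarse to fine."""
    cur = base
    for r in reversed(residuals):
        cur = upsample(cur) + r
    return cur
```

Because each residual stores exactly what the downsample/upsample round trip discards, the scales can be modeled separately and still recombine into the original image.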
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency: This paper proposes rCM (score-regularized continuous-time consistency model), which for the first time scales continuous-time consistency distillation to 14B-parameter text-to-image/video models. By combining forward KL divergence (consistency) with reverse KL divergence (score distillation), rCM matches DMD2 in quality while preserving diversity, achieving 15–50× inference speedup.
Latent Diffusion Model without Variational Autoencoder: This paper proposes SVG, which replaces the VAE latent space with frozen DINOv3 self-supervised features for diffusion model training. A lightweight residual encoder supplements fine-grained details, enabling faster training, more efficient inference, and a unified visual representation applicable across tasks.
Learning a Distance Measure from the Information-Estimation Geometry of Data: This paper proposes the Information-Estimation Metric (IEM), a novel distance function induced by the geometry of the data probability density. IEM measures the distance between signals by comparing their score vector fields at multiple noise levels. Without any supervised training, IEM achieves perceptual judgment prediction performance competitive with fully supervised methods.
LLM2Fx-Tools: Tool Calling for Music Post-Production: This paper proposes LLM2Fx-Tools, the first framework that applies LLM tool calling to audio effect modules. It leverages a multimodal LLM to understand audio inputs, employs CoT reasoning to select effect types, determine processing order, and estimate parameters, enabling interpretable and controllable music post-production.
Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation: This paper proposes Locality-aware Parallel Decoding (LPD), which reduces the number of generation steps for 256×256 images from 256 to 20 by flexibly parallelizing autoregressive modeling architectures and employing a locality-aware generation order schedule, achieving at least 3.4× latency reduction.
Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection: HiRM introduces a concept erasure strategy that decouples the update location from the erasure target — updating only the first-layer weights of the CLIP text encoder while imposing erasure supervision on the high-level semantic representations at the final layer. By steering target concept representations toward random directions (HiRM-R) or semantically meaningful directions (HiRM-S), the method achieves effective erasure of styles, objects, and NSFW content on the UnlearnCanvas and NSFW benchmarks, with zero-shot transferability to the Flux architecture.
Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall: This paper identifies the "sampling wall" in discrete diffusion models—whereby rich categorical distribution information collapses into one-hot vectors after sampling—and proposes a Loopholing mechanism that introduces a deterministic latent pathway to propagate distribution information across steps. The approach reduces generation perplexity by up to 61%, substantially closing the gap with autoregressive models.
LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration: This paper proposes LVTINO, the first zero-shot video inverse problem solver built upon a Video Consistency Model (VCM) prior. By injecting measurement consistency constraints—without requiring automatic differentiation—into the VCM sampling process, LVTINO achieves perceptual quality and temporal consistency surpassing frame-wise image methods across multiple video inverse problems (super-resolution, deblurring, inpainting) with a minimal number of neural function evaluations (NFEs).
MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design: This paper proposes MAC-AMP, the first closed-loop multi-agent collaboration system that reformulates antimicrobial peptide (AMP) design as a coordinated multi-agent optimization problem, achieving multi-objective optimization through AI-simulated peer review and adaptive reward design.
Market Games for Generative Models: Equilibria, Welfare, and Strategic Entry: This paper formalizes a three-tier model–platform–user market game, analyzes the existence conditions of pure-strategy Nash equilibria under generative model competition, characterizes market structure and social welfare implications, and designs optimal entry strategies for model providers.
Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter: This paper proposes Mod-Adapter, a tuning-free multi-concept personalization method that predicts concept-specific modulation directions in the modulation space of DiT, enabling decoupled customization of both object and abstract concepts (pose, lighting, material, etc.), substantially outperforming existing methods on multi-concept personalization benchmarks.
Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs: This paper repositions "model collapse"—commonly regarded as a negative phenomenon—as a tool for machine unlearning, proposing the PMC method. By iteratively fine-tuning on retained data and the model's own generated outputs, PMC achieves targeted information removal without directly optimizing over the forget targets, and validates its effectiveness through both theoretical analysis and empirical experiments.
MOLM: Mixture of LoRA Markers: This paper proposes MOLM, a watermarking framework that reinterprets LoRA adapters as watermark carriers. A binary key-driven routing mechanism embeds verifiable and robust watermarks into a frozen generative model without per-key retraining.
Monocular Normal Estimation via Shading Sequence Estimation: This paper proposes RoSE, which reformulates monocular normal estimation as a shading sequence estimation problem. An image-to-video (I2V) generative model is used to predict shading sequences under multiple illuminations, and a simple ordinary least squares solver then converts the shading sequence into a normal map. RoSE achieves state-of-the-art performance on real-world benchmark datasets.
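The least-squares step in the RoSE entry can be sketched as follows. This assumes a Lambertian model without shadow clamping, where each shading value is the dot product of a known light direction and the surface normal; both that model and the names `normals_from_shading` and `light_dirs` are assumptions for illustration, not the paper's exact solver.

```python
import numpy as np


def normals_from_shading(shading, light_dirs):
    """Recover per-pixel unit normals by ordinary least squares.

    shading:    (K, H, W) array, one shading image per illumination.
    light_dirs: (K, 3) array of light directions (assumed known).
    """
    k, h, w = shading.shape
    s = shading.reshape(k, h * w)
    # Solve light_dirs @ g = s for all pixels at once; g has shape (3, H*W).
    g, *_ = np.linalg.lstsq(light_dirs, s, rcond=None)
    norms = np.linalg.norm(g, axis=0, keepdims=True)
    n = g / np.maximum(norms, 1e-12)  # normalize, guarding against zero vectors
    return n.T.reshape(h, w, 3)
```

With at least three non-coplanar light directions the system is well determined, which is why predicting a sequence of shadings turns normal estimation into a linear problem.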
Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening: This paper proposes Motion Prior Distillation (MPD), an inference-time distillation method that distills motion residuals from the forward path into the backward path, fundamentally resolving the bidirectional motion prior conflict in time reversal sampling. MPD enables more coherent generative inbetweening without any additional training.
Multi-agent Coordination via Flow Matching: This paper proposes MAC-Flow, which first learns a centralized joint behavior distribution via Flow Matching, then distills it into decentralized single-step policies through IGM (Individual-Global-Max) decomposition combined with Q-value maximization for behavior-regularized training. Evaluated across 4 benchmarks, 12 environments, and 34 datasets, MAC-Flow achieves approximately 14.5× inference speedup over diffusion-based methods while maintaining coordination performance comparable to diffusion policies.
MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion: This paper introduces a new task termed multi-view customization and proposes the MVCustom framework, which leverages a video diffusion backbone with dense spatio-temporal attention for holistic frame consistency. At inference time, it introduces two novel techniques, depth-aware feature rendering and consistency-aware latent completion, which together achieve camera pose control, subject identity preservation, and cross-view geometric consistency simultaneously for the first time.
Neon: Negative Extrapolation From Self-Training Improves Image Generation: Neon is proposed as a post-processing method requiring <1% additional training compute: the model is first fine-tuned on its own synthetic data (causing degradation), then negatively extrapolated away from the degraded weights. The paper proves that mode-seeking samplers cause anti-alignment between synthetic and real data gradients, so negative extrapolation is equivalent to optimizing toward the real data distribution. On ImageNet 256×256, xAR-L achieves SOTA FID of 1.02.
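A minimal sketch of the negative extrapolation idea in the Neon entry, assuming the merged weights take the form base minus a scaled copy of the self-training update. That formula and the extrapolation weight `w` are assumptions inferred from the summary, not the paper's verified recipe.

```python
import numpy as np


def neon_extrapolate(base, finetuned, w):
    """Move each parameter away from the self-trained (degraded) weights.

    base, finetuned: dicts mapping parameter names to arrays of equal shape.
    w: hypothetical extrapolation strength; w = 0 returns the base weights.
    """
    return {k: base[k] - w * (finetuned[k] - base[k]) for k in base}
```

For example, with a base value of 1.0, a fine-tuned value of 2.0, and `w = 0.5`, the merged value is 0.5: the update direction learned from synthetic data is reversed rather than followed.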
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models: This paper proposes NeuralOS, a dual-component architecture combining an RNN-based state tracker and a diffusion renderer, which directly predicts graphical interface frame sequences from user input events (mouse movement/click/keyboard), simulating an operating system with neural generative models for the first time.
Next Visual Granularity Generation: This paper proposes the Next Visual Granularity (NVG) generation framework, which decomposes images into structured sequences at different granularity levels and generates from global layout to fine-grained details progressively, achieving consistent FID improvements over the VAR family.
No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings: This paper proposes MoFit, the first membership inference attack (MIA) framework for diffusion models under a caption-free setting. By constructing surrogate images and conditional embeddings that overfit to the target model, MoFit exploits the asymmetric sensitivity of member samples to conditioning mismatch to enable effective inference.
Offline Reinforcement Learning with Generative Trajectory Policies: This paper proposes Generative Trajectory Policies (GTP), which adopts a unified perspective treating diffusion models, flow matching, and consistency models as special cases of ODE solution mappings. GTP learns a complete continuous-time trajectory solution mapping and introduces two adaptation techniques—score approximation and advantage weighting—achieving state-of-the-art performance on the D4RL benchmark.
Pareto-Conditioned Diffusion Models for Offline Multi-Objective Optimization: This paper proposes Pareto-Conditioned Diffusion (PCD), which reformulates offline multi-objective optimization as a conditional sampling problem. PCD directly generates high-quality solutions conditioned on objective trade-offs without requiring explicit surrogate models, achieving the best overall consistency across diverse benchmarks.
PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models: This paper proposes PCPO, which corrects the disproportionate credit assignment inherent in policy gradient methods for diffusion/flow models via a stabilized objective reformulation and principled timestep reweighting, significantly accelerating convergence and mitigating model collapse.
PI-Light: Physics-Inspired Diffusion for Full-Image Relighting: This paper proposes π-Light (PI-Light), a two-stage full-image relighting framework. Stage 1 performs intrinsic property decomposition (albedo, normals, roughness, etc.) via a physics-guided diffusion model; Stage 2 synthesizes the relit image under target illumination via a physics-guided neural rendering module. Batch-aware attention and physics-inspired losses are introduced to achieve strong generalization to real-world scenes.
PolyGraph Discrepancy: a classifier-based metric for graph generation: This paper proposes PolyGraph Discrepancy (PGD), which approximates a variational lower bound on the Jensen-Shannon distance by training a classifier to distinguish real graphs from generated ones. PGD addresses three fundamental shortcomings of MMD-based metrics: the lack of an absolute scale, incomparability across descriptors, and high bias and variance under small sample sizes.
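The classifier-based bound behind the PGD entry can be illustrated with the standard variational lower bound on the Jensen-Shannon divergence (in nats): for any classifier output \(D(x)\), \(\mathrm{JSD} \ge \log 2 + \tfrac12 \mathbb{E}_{\text{real}}[\log D] + \tfrac12 \mathbb{E}_{\text{gen}}[\log(1-D)]\). The sketch below evaluates that bound from held-out classifier probabilities; the paper's exact normalization, choice of descriptors, and conversion to a distance (the square root of the divergence) are not reproduced here, and the function name is hypothetical.

```python
import numpy as np


def js_lower_bound(p_real_on_real, p_real_on_gen, eps=1e-12):
    """Variational lower bound on the JS divergence, in nats.

    p_real_on_real: classifier probabilities of "real" on real samples.
    p_real_on_gen:  classifier probabilities of "real" on generated samples.
    """
    a = np.clip(np.asarray(p_real_on_real, dtype=float), eps, 1 - eps)
    b = np.clip(np.asarray(p_real_on_gen, dtype=float), eps, 1 - eps)
    return float(np.log(2.0) + 0.5 * np.mean(np.log(a)) + 0.5 * np.mean(np.log(1 - b)))
```

A chance-level classifier (all probabilities 0.5) gives a bound of 0, while a perfect classifier approaches the maximum \(\log 2\), which is what gives the metric an absolute scale.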
Pseudo-Nonlinear Data Augmentation: A Constrained Energy Minimization Viewpoint: Leveraging the dually flat structure of energy-based models and information geometry, this work proposes a training-free, efficient, and controllable data augmentation method that performs cross-modal augmentation on statistical manifolds via forward projection (encoding) and backward projection (decoding).
Purrception: Variational Flow Matching for Vector-Quantized Image Generation: This paper proposes Purrception, an image generation method that adapts Variational Flow Matching (VFM) to vector-quantized (VQ) latent spaces. By simultaneously computing a velocity field in the continuous embedding space and learning a categorical posterior distribution over codebook indices, Purrception bridges continuous transport dynamics with discrete supervision, achieving faster training convergence and FID scores competitive with state-of-the-art methods on ImageNet-1k 256×256.
Pyramidal Patchification Flow for Visual Generation: This paper proposes Pyramidal Patchification Flow (PPFlow), which employs larger patches at high-noise timesteps and smaller patches at low-noise timesteps, achieving 1.6–2.0× denoising speedup while preserving generation quality, without requiring any re-noising tricks.
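The source of the speedup in the PPFlow entry is easy to see from token counts: doubling the patch size cuts the number of tokens by 4×. The sketch below uses a hypothetical two-level schedule; the boundary, the patch sizes, and the convention that `t = 1` is pure noise are all illustrative assumptions, not the paper's configuration.

```python
def num_tokens(latent_size, patch_size):
    """Number of tokens for a square latent split into square patches."""
    assert latent_size % patch_size == 0
    return (latent_size // patch_size) ** 2


def patch_size_for(t, boundary=0.5, large=4, small=2):
    """Pick a patch size from the timestep t in [0, 1] (t = 1 assumed to be pure noise).

    boundary, large, and small are hypothetical values for illustration.
    """
    return large if t >= boundary else small
```

For a 32×32 latent, `num_tokens(32, 2)` is 256 and `num_tokens(32, 4)` is 64, so the high-noise steps process a quarter of the tokens.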
QVGen: Pushing the Limit of Quantized Video Generative Models: This paper proposes QVGen, a quantization-aware training (QAT) framework for video diffusion models. It introduces auxiliary modules to reduce gradient norms and improve convergence, and designs a rank decay strategy to progressively eliminate the inference overhead of auxiliary modules during training. QVGen is the first method to achieve near full-precision video generation quality under 4-bit quantization.
RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation: This paper proposes RefAny3D, a 3D asset-referenced image generation framework that achieves precise geometric and texture consistency between generated images and 3D reference assets through a dual-branch generation strategy that jointly models RGB images and point maps.
Referring Layer Decomposition: This paper introduces the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image given flexible user-provided prompts (spatial, textual, or hybrid). It also constructs the RefLade dataset comprising 1.1 million samples and proposes an automated evaluation protocol.
RIDER: 3D RNA Inverse Design with Reinforcement Learning-Guided Diffusion: This paper proposes RIDER, the first framework to incorporate reinforcement learning into 3D RNA inverse design. It first pretrains a conditional diffusion model (RIDE) to learn sequence–structure relationships, then applies RL fine-tuning to directly optimize 3D structural similarity rather than native sequence recovery rate, achieving over 100% improvement across all 3D self-consistency metrics.
RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation: This paper proposes RMFlow, which appends a noise-injection refinement step after 1-NFE MeanFlow transport to compensate for single-step transport errors, while incorporating a maximum likelihood objective during training to minimize the KL divergence between the learned and target distributions. RMFlow achieves near-SOTA 1-NFE results on text-to-image generation, molecular generation, and time-series generation.
RNE: plug-and-play diffusion inference-time control and energy-based training: This paper proposes the Radon-Nikodym Estimator (RNE), which exploits density ratios between path distributions to reveal the fundamental relationship between marginal densities and transition kernels, providing a unified plug-and-play framework that simultaneously enables diffusion density estimation, inference-time control, and energy-based diffusion training.
Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance: This paper proposes ProMoE, an MoE framework for Diffusion Transformers that introduces a two-stage router (conditional routing + prototype routing) and a routing contrastive loss to provide explicit semantic guidance, promoting expert specialization and significantly outperforming existing MoE and dense models on ImageNet.
SafeFlowMatcher: Safe and Fast Planning using Flow Matching with Control Barrier Functions: This paper proposes SafeFlowMatcher, a safe planning framework that integrates flow matching with Control Barrier Functions (CBF). Through a predictor-corrector (PC) integrator, it decouples trajectory generation from safety certification, providing formal safety guarantees while preserving the efficiency of flow matching.
Sample-Efficient Evidence Estimation of Score-Based Priors for Model Selection: This paper proposes DiME, a model evidence estimator that integrates along the temporal marginals of the diffusion posterior. DiME requires neither prior scores nor density evaluations, and accurately estimates model evidence under diffusion model priors using as few as 20 posterior samples, enabling prior selection and model validation.
scDFM: Distributional Flow Matching for Robust Single-Cell Perturbation Prediction: This paper proposes scDFM, a generative framework based on conditional flow matching (CFM) that enforces distribution-level fidelity via MMD regularization and introduces the PAD-Transformer backbone to handle noisy and sparse single-cell data. On combinatorial perturbation prediction, scDFM reduces MSE by 19.6% over the strongest baseline CellFlow.
Seek-CAD: A Self-Refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek: This paper proposes Seek-CAD, the first training-free CAD parametric model generation framework based on a locally deployed reasoning LLM (DeepSeek-R1). It achieves self-refinement through the synergy of step-wise visual feedback and Chain-of-Thought (CoT), and introduces a novel SSR triplet design paradigm to support complex CAD model generation.
Self-Improving Loops for Visual Robotic Planning: This paper proposes SILVR, a framework that achieves continual self-improvement on unseen tasks by iteratively fine-tuning an in-domain video generation model on self-collected online trajectories. SILVR achieves up to 285% performance improvement on MetaWorld and real-robot benchmarks.
SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP: SeMoBridge is proposed as a lightweight semantic modality bridge that maps image embeddings into the text modality, converting unreliable intra-modal (image-to-image) comparisons into reliable inter-modal (text-to-image) comparisons, achieving state-of-the-art few-shot classification performance with minimal training overhead.
SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation: This paper proposes SenseFlow, which scales distribution matching distillation (DMD) to large-scale flow-based text-to-image models (SD 3.5 Large 8B / FLUX.1 dev 12B) via Implicit Distribution Alignment (IDA) and Intra-Segment Guidance (ISG), enabling high-quality 4-step image generation.
SERUM: Simple, Efficient, Robust, and Unifying Marking for Diffusion-based Image Generation: SERUM is a watermarking method that injects unique watermark noise into the initial noise of diffusion models and trains a lightweight detector to identify watermarks directly from generated images — without costly DDIM inversion — achieving state-of-the-art detection rates under diverse attacks with extremely fast injection and detection, while supporting multi-user scenarios.
SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling: This paper presents the first systematic study of privacy leakage in SMOTE, proposing two attacks—DistinSMOTE and ReconSMOTE—that demonstrate SMOTE is fundamentally non-privacy-preserving and disproportionately exposes minority-class records.
SoFlow: Solution Flow Models for One-Step Generative Modeling: This paper proposes Solution Flow Models (SoFlow), which directly learn the solution function \(f(x_t, t, s)\) of the velocity ODE (mapping \(x_t\) at time \(t\) to the solution at time \(s\)). Trained from scratch via a Flow Matching loss combined with a JVP-free solution consistency loss, SoFlow achieves a 1-NFE FID of 2.96 on ImageNet 256 (XL/2), outperforming MeanFlow (3.43).
SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation: This paper proposes the SongEcho framework, which achieves cover song generation via Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), generating new vocals and accompaniment while preserving the melodic contour of the original song.
SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models: SPEED proposes a closed-form model editing method based on null space constraints, refining the preservation set through three complementary techniques—Influence-based Prior Filtering (IPF), Directional Prior Augmentation (DPA), and Invariance Equality Constraint (IEC)—to achieve scalable (erasing 100 concepts within 5 seconds), precise (zero semantic loss on non-target concepts), and efficient concept erasure.
SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation: This paper proposes Scaled Spatial Guidance (SSG), a training-free inference-time guidance method that enhances the coarse-to-fine hierarchical generation quality of visual autoregressive models through frequency-domain prior construction and semantic residual amplification.
Steer Away From Mode Collisions: Improving Composition In Diffusion Models: To address concept missing and collision in multi-concept prompts for diffusion models, this paper proposes the "mode collision" hypothesis — that the modes of the joint distribution overlap with those of individual concept distributions — and introduces CO3 (Concept Contrasting Corrector). CO3 corrects sampling via a contrasting distribution \(\tilde{p}(x|C) \propto p(x|C) / \prod_i p(x|c_i)\) in Tweedie mean space to steer away from degenerate modes, achieving plug-and-play, gradient-free, and model-agnostic compositional generation improvement.
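Taking the contrasting distribution in the CO3 entry at face value, its score follows directly by taking logarithms and differentiating. This short derivation uses only the formula quoted above; how CO3 weights the terms or maps them into Tweedie mean space is not covered here.

\[
\log \tilde{p}(x|C) = \log p(x|C) - \sum_i \log p(x|c_i) + \text{const}
\quad\Longrightarrow\quad
\nabla_x \log \tilde{p}(x|C) = \nabla_x \log p(x|C) - \sum_i \nabla_x \log p(x|c_i).
\]

Each term on the right is a conditional score that a pretrained diffusion model already estimates, which is consistent with the entry's description of the method as plug-and-play and gradient-free.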
Step-Aware Residual-Guided Diffusion for EEG Spatial Super-Resolution: This paper proposes SRGDiff, a step-aware residual-guided diffusion model that reformulates EEG spatial super-resolution as a dynamic conditional generation task, achieving high-fidelity reconstruction via per-step residual direction correction and timestep-dependent affine modulation.
Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models: This paper proposes S²-Guidance, which constructs a weak model by randomly dropping transformer block activations during denoising to perform self-guidance, correcting the suboptimal predictions of CFG without additional training. The method consistently outperforms CFG and other advanced guidance strategies on text-to-image and text-to-video tasks.
TAVAE: A VAE with Adaptable Priors Explains Contextual Modulation in the Visual Cortex: This paper extends the VAE formalism to propose the Task-Amortized VAE (TAVAE), which explains contextual modulation in the primary visual cortex (V1) by learning task-specific priors over an already-learned representation. The framework accounts for bimodal population responses observed when test stimuli deviate from training stimuli in an orientation discrimination task.
Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions: This paper proposes the PCI (Prompt-Conditioned Intervention) framework, which quantifies when concepts become committed during diffusion model denoising by switching text prompts at different timesteps along the denoising trajectory, and applies these findings to temporally-aware image editing.
Test-Time Iterative Error Correction for Efficient Diffusion Models: This paper proposes IEC (Iterative Error Correction), a plug-and-play test-time method that iteratively corrects inference errors in efficient diffusion models, reducing error accumulation from exponential to linear growth.
The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models: This paper systematically investigates how prompt complexity affects three key dimensions of synthetic data generated by T2I models—quality, diversity, and consistency—proposes a new evaluation framework, and finds that prompt expansion, as an inference-time intervention, optimally balances diversity and aesthetic quality.
The Spacetime of Diffusion Models: An Information Geometry Perspective: This paper proposes a "spacetime" framework for diffusion models from an information-geometric perspective. It proves that the standard pullback geometry degenerates to straight lines in diffusion models, and introduces instead a spacetime geometry based on the Fisher-Rao metric, from which practically computable diffusion edit distances (DiffED) and transition path sampling methods are derived.
There and Back Again: On the Relation between Noise and Image Inversions in Diffusion Models: This paper analyzes the error mechanisms in DDIM inversion and shows that latent encodings exhibit low diversity and high correlation in smooth image regions (e.g., sky). It traces this phenomenon to inaccurate noise predictions in the early inversion steps and proposes a simple fix: replacing the first few inversion steps with forward diffusion.
Towards Interpretable Visual Decoding with Attention to Brain Representations: This paper proposes NeuroAdapter, which segments fMRI signals into independent tokens by brain region and conditions Stable Diffusion directly via cross-attention, bypassing conventional CLIP/DINO intermediate embedding spaces. On NSD and other datasets, NeuroAdapter matches or surpasses existing methods on high-level semantic metrics. It further introduces the IBBI bidirectional interpretability framework, which for the first time dynamically reveals how different cortical regions drive image generation along the denoising trajectory.
Training-Free Reward-Guided Image Editing via Trajectory Optimal Control: This paper reformulates reward-guided image editing as a trajectory optimal control problem, treating the reverse process of diffusion/flow models as a controllable trajectory. By iteratively optimizing the entire trajectory via Pontryagin's Maximum Principle (PMP) with adjoint state methods, it achieves effective reward-guided editing without training and without reward hacking.
Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations: This paper proposes a general framework that leverages Rectified Flow to generate distributional rewards for training explanation-generating LLMs. By employing continuous normalizing flows (CNF) to capture the pluralistic and probabilistic nature of human judgments on explanations, the framework provides theoretical guarantees that CNF can effectively recover the true human reward distribution. It significantly outperforms RLHF/RLAIF baselines on SMAC, MMLU, MathQA, and other tasks.
TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows: TwinFlow extends the flow matching time interval from \([0,1]\) to \([-1,1]\) and constructs twin trajectories that form self-adversarial signals, enabling one-step generation without any discriminator or frozen teacher. This is the first work to scale 1-NFE generation to a 20B-parameter model (Qwen-Image), achieving a 1-NFE GenEval score of 0.86, close to the original 100-NFE score of 0.87, while reducing inference cost by 100×.
Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models: Uni-X proposes an X-shaped architecture with separated ends and a shared middle to mitigate gradient conflicts between visual and textual modalities in Unified Multimodal Models (UMMs). By designating shallow and deep layers as modality-specific and sharing intermediate layers, a 3B-parameter model matches or surpasses 7B AR-UMMs on both image generation and multimodal understanding.
Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow: DualFlow is presented as the first unified framework for dyadic interactive/reactive 3D motion generation under text+music multi-modal conditions, built on Rectified Flow and Retrieval-Augmented Generation (RAG). It introduces contrastive flow matching and a synchronization loss, achieving a 2.5% FID improvement and a 76% R-precision improvement on the MDD dataset, with 2.5× faster inference.
Unsupervised Conformal Inference: Bootstrapping and Alignment to Control LLM Uncertainty: This paper proposes an unsupervised conformal inference framework (BB-UCP) that achieves distribution-free finite-sample coverage guarantees for LLM generation under label-free, API-compatible conditions, via Gram matrix interaction energy scoring, batch bootstrap calibration, and conformal alignment, effectively detecting and filtering hallucinated outputs.
Verification of the Implicit World Model in a Generative Model via Adversarial Sequences: This paper proposes an adversarial sequence generation method to verify the soundness of implicit world models in generative sequence models. Through systematic evaluation in the chess domain using multiple adversarial strategies (IMO/BSO/AD), it finds that all tested models are unsound, while training objectives and dataset choice significantly affect soundness. Furthermore, linear board-state probes exhibit no causal role in most models.
Verifier-Constrained Flow Expansion for Discovery Beyond the Data: This paper proposes Flow Expander (FE), which expands the coverage of pretrained flow models in probability space via verifier-constrained entropy maximization, enabling the generation of design samples beyond the training data distribution while maintaining validity. FE increases diversity in molecular conformation design while preserving chemical validity.
VFScale: Intrinsic Reasoning through Verifier-Free Test-time Scalable Diffusion Model: VFScale proposes a test-time scalable diffusion model that requires no external verifier. By introducing an MRNCL loss and KL regularization to improve the energy landscape, the model's intrinsic energy function serves as a verifier. Combined with hybrid MCTS denoising for efficient search, a model trained on 6×6 mazes achieves 88% success on 15×15 mazes, where standard diffusion models fail entirely.
Visual Autoregressive Modeling for Instruction-Guided Image Editing: This paper proposes VAREdit, which reformulates instruction-guided image editing as a multi-scale prediction problem. By introducing a Scale-Aligned Reference module to address the scale mismatch in finest-scale conditioning, VAREdit significantly outperforms diffusion-based methods in both editing fidelity and inference efficiency.
When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models: This paper is the first to reveal and systematically study "backdoor modality collapse" in multimodal diffusion models — a phenomenon where the backdoor effect degenerates to rely on a single modality (typically text) during multimodal backdoor attacks. Two novel Shapley-value-based metrics, TMA and CTI, are proposed to quantify modality contribution and cross-modal interaction, uncovering a "winner-takes-all" dynamic and negative interaction.
When Scores Learn Geometry: Rate Separations under the Manifold Hypothesis: Under the manifold hypothesis, this paper reveals a scale separation between geometric and distributional information in score learning: manifold geometry contributes at order \(\Theta(\sigma^{-2})\), which dominates distributional information by a factor of \(O(\sigma^{-2})\). This establishes that the success of diffusion models stems primarily from learning the data manifold rather than the full distribution, and that a one-line code modification suffices to sample from the uniform distribution on the manifold.
Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials: Zatom-1 is the first end-to-end fully open-source foundation model that unifies generative modeling and property prediction for 3D molecules and materials via multimodal flow matching. Using a standard Transformer architecture, it directly models discrete atom types and continuous 3D geometry in Euclidean space, achieving positive transfer learning across chemical domains.
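
As a rough illustration of the contrasting distribution in the CO3 entry above: taking the gradient of \(\log \tilde{p}(x|C)\) turns the ratio into a difference of scores, \(\nabla_x \log p(x|C) - \sum_i \nabla_x \log p(x|c_i)\). The sketch below evaluates that difference for one-dimensional Gaussians. The Gaussian stand-ins, the function names, and the parameter values are all assumptions made for illustration; the paper applies its correction in Tweedie mean space, which this toy does not reproduce.

```python
def gaussian_score(x, mean, var):
    """Score (gradient of the log-density) of N(mean, var) at x."""
    return -(x - mean) / var


def contrast_score(x, joint, concepts):
    """Score of p(x|C) / prod_i p(x|c_i), up to the normalizing constant.

    `joint` is a hypothetical (mean, var) pair standing in for p(x|C);
    `concepts` is a list of (mean, var) pairs standing in for each p(x|c_i).
    """
    score = gaussian_score(x, *joint)
    for mean, var in concepts:
        # Dividing by a density subtracts its score.
        score -= gaussian_score(x, mean, var)
    return score


if __name__ == "__main__":
    joint = (1.0, 1.0)                    # toy joint-prompt distribution
    concepts = [(0.0, 4.0), (2.0, 4.0)]   # toy single-concept distributions
    for x in (0.0, 1.0):
        print(x, contrast_score(x, joint, concepts))
```

At `x = 0.0` the joint score is 1.0 and the concept scores are 0.0 and 0.5, so the contrasted score is 0.5. At `x = 1.0` the two concept scores cancel and the result is 0.0.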