🎨 Image Generation
🧪 ICML2026 · 22 paper notes
🔥 Top topics: Diffusion Models ×4 · Image Editing ×2 · Alignment/RLHF ×2
- Adversarial Flow Models
-
The authors add an optimal transport regularizer \(\|G(z)-z\|^2\) to the GAN training objective, which constrains the otherwise arbitrary transport map learned by a GAN to the Wasserstein-2 optimal transport map. This enables stable adversarial training and, for the first time, end-to-end one-step generation on pure transformers. On ImageNet-256, the 1-NFE FID reaches 2.38 (XL/2) and 1.94 (112 layers).
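A minimal sketch of how the penalty \(\|G(z)-z\|^2\) might be combined with an adversarial term. The non-saturating GAN loss, the function name `generator_loss`, and the weight `lam` are assumptions made for illustration and are not taken from the paper. The penalty only makes sense when the generator output has the same shape as the noise.

```python
import numpy as np

def generator_loss(d_logits_fake, g_out, z, lam=0.1):
    """Adversarial loss plus the transport penalty ||G(z) - z||^2.

    d_logits_fake: discriminator logits on G(z), shape (batch,)
    g_out, z:      generator outputs and their input noise, same shape
    lam:           hypothetical regularization weight (an assumption)
    """
    # Non-saturating generator loss: softplus(-logit) = -log(sigmoid(logit)),
    # computed stably with logaddexp.
    adv = np.mean(np.logaddexp(0.0, -d_logits_fake))
    # Squared displacement between each noise sample and its generated image,
    # averaged over the batch.
    diff = (g_out - z).reshape(len(z), -1)
    ot = np.mean(np.sum(diff ** 2, axis=1))
    return adv + lam * ot, adv, ot
```

The penalty is zero only for the identity map, so it favors generators that move each noise sample as little as possible while still fooling the discriminator.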
- Anomaly-Preference Image Generation (APO)
-
The authors reformulate "few-shot anomaly image generation" as a "preference optimization problem without manual annotation": real anomalies serve as positive samples, while the denoising bias of the reference model at the same timestep acts as an implicit negative sample. A DPO-style loss aligns the diffusion model to the anomaly distribution. Time-aware LoRA rank adjustment (TACA) preserves structural diversity, and hierarchical CFG controls text-anomaly alignment strength. On benchmarks such as MVTec, the method improves both fidelity and diversity.
- Caracal: Causal Architecture via Spectral Mixing
-
Caracal replaces the \(\mathcal{O}(L^2)\) attention in Transformers with an \(\mathcal{O}(L \log L)\) Multi-Head Fourier (MHF) module that enforces strict causal masking in the frequency domain through a "pad-FFT-multiply-iFFT-truncate" pipeline. It removes positional encoding entirely and relies solely on standard FFT operators, with no custom CUDA kernels of the kind Mamba requires. It matches the performance of Llama / Mamba / Mamba-2 / Jamba across all model scales from Tiny to Large.
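The "pad-FFT-multiply-iFFT-truncate" pipeline can be illustrated with a plain causal convolution. This is a sketch of the general technique only: the single filter `k` stands in for whatever per-head filters the MHF module learns, and the function name is hypothetical.

```python
import numpy as np

def causal_fft_mix(x, k):
    """Causal convolution y[t] = sum_{s<=t} k[s] * x[t-s] in O(L log L).

    x, k: real 1D arrays of the same length L.
    """
    L = len(x)
    n = 2 * L                       # pad: circular convolution cannot wrap around
    X = np.fft.rfft(x, n=n)         # FFT of the zero-padded sequence
    K = np.fft.rfft(k, n=n)         # FFT of the zero-padded filter
    y = np.fft.irfft(X * K, n=n)    # multiply in frequency, transform back
    return y[:L]                    # truncate: keep the first L outputs
```

Because of the zero padding, output position `t` depends only on `x[0..t]`, so changing later inputs leaves earlier outputs untouched.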
- CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation
-
CARD uses "radix \(r\) decomposition" to bijectively map molecular 3D coordinates into a coarse-to-fine discrete-continuous mixed token sequence, enabling a cross-system transferable autoregressive Transformer to serve as a "zero free energy proposal" and directly estimate the absolute free energy of any molecular system via BAR. On solvation tasks for 70 novel systems, it matches the accuracy of classical MFES while being about 40 times faster in inference.
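To make the idea of a bijective coarse-to-fine decomposition concrete, here is a sketch for a single scalar. It assumes the coordinate has already been normalized to \([0, 1)\); the function names, the radix `r=4`, and the depth are illustrative choices, and the paper's actual tokenization may differ.

```python
def radix_encode(x, r=4, depth=3):
    """Split x in [0, 1) into `depth` base-r digits (coarse to fine)
    plus a continuous residual in [0, 1)."""
    digits = []
    for _ in range(depth):
        x *= r
        d = int(x)          # next digit: which of the r sub-intervals x falls in
        digits.append(d)
        x -= d              # keep the position inside that sub-interval
    return digits, x

def radix_decode(digits, residual, r=4):
    """Inverse of radix_encode: rebuild x from digits and residual."""
    x = residual
    for d in reversed(digits):
        x = (x + d) / r
    return x
```

The discrete digits locate the value at progressively finer resolution, and the continuous residual keeps the mapping invertible, so no information is lost by tokenizing.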
- CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning
-
This work addresses the issue of "editing models making unintended changes in non-editing regions" by constructing the CoCoEdit-40K local editing dataset, introducing a pixel-level similarity reward to complement the MLLM reward, and designing a region-regularized RL objective (constraining non-editing regions for high-reward samples, forcing changes in editing regions for low-reward samples). This approach improves both editing scores and PSNR/SSIM for FLUX.1 Kontext and Qwen-Image-Edit, breaking the existing trade-off between editing capability and content consistency.
- Conditional Diffusion Sampling
-
This paper proposes Conditional Diffusion Sampling (CDS). By deriving a class of conditional stochastic interpolants, it obtains an exact closed-form SDE for the unnormalized target distribution (without neural network fitting), and then efficiently samples the initial distribution of this SDE using Parallel Tempering (PT), combining PT's global exploration with the local refinement of the diffusion process. On 8 target distributions and 4 task types, CDS outperforms traditional MCMC, training-free MCMC, and neural samplers with fewer density evaluations.
- Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers
-
This paper uses linear probes to show that in MM-DiT models (FLUX / SD3.5), certain attention heads in intermediate layers encode a binary signal in the key vectors of text tokens that indicates whether a target concept will appear. Based on this, the authors propose Omission Signal Intervention (OSI): during inference, they inject the mean-difference direction between the "omission" and "existence" classes into the key vectors of the top-K heads, scaled as \(\alpha\sigma\boldsymbol{\theta}\), which activates the model's "self-awareness" of missing concepts and prompts it to complete them. On FLUX, GenEval 6-object accuracy improves from 0.18 to 0.40 without any fine-tuning.
- End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
-
EOSTok jointly trains a 1D ViT tokenizer and an autoregressive model in a single-stage end-to-end pipeline. The newly proposed APR (Autoregressive Prediction Reconstruction) loss enables gradients from "next-token prediction" to flow back to the pixel space, preventing codebook collapse. "Implicit alignment" injects DINOv2 semantics into the 1D latent space without disrupting the 1D autoregressive structure. Ultimately, EOSTok achieves an FID of 1.48 on ImageNet 256 without guidance (SOTA).
- Exploring and Exploiting Stability in Latent Flow Matching
-
This work systematically characterizes the "trajectory stability" of Latent Flow Matching (LFM): under the same noise seed, pruning 75% of the data, changing the model size, or altering the training seed still produces nearly identical images. This property is then turned into two practical algorithms: (1) with balanced-clustering pruning, 50% of CelebA-HQ data can be pruned with a slight FID improvement, and 75% can be pruned on ImageNet; (2) a coarse-to-fine two-stage generation scheme that combines DiT-XL/2 (675M) and DiT-S/2 (33M) achieves 2.15× faster inference.
- Factored Classifier-Free Guidance
-
This paper identifies an "attribute amplification" failure mode of CFG in counterfactual generation with diffusion models: a single global \(\omega\) amplifies not only the target attribute but also unintended ones. The authors propose FCFG, which groups attributes according to a causal graph and assigns an independent guidance weight to each group. This significantly reduces non-target attribute drift and improves counterfactual reversibility on CelebA-HQ / EMBED / MIMIC-CXR.
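One way to picture per-group guidance is as a sum of separately weighted guidance directions. The sketch below assumes this additive form and assumes one noise prediction per attribute group, each conditioned on that group alone; both are illustrative assumptions rather than the paper's stated formula, and `factored_cfg` is a hypothetical name.

```python
import numpy as np

def factored_cfg(eps_uncond, eps_groups, weights):
    """Combine noise predictions with one guidance weight per attribute group.

    eps_uncond: unconditional noise prediction
    eps_groups: dict mapping group name -> prediction conditioned on that group
    weights:    dict mapping group name -> guidance weight for that group
    """
    out = eps_uncond.astype(float)  # astype returns a copy, input is not modified
    for name, eps_g in eps_groups.items():
        # Each group pushes along its own direction with its own strength.
        out += weights[name] * (eps_g - eps_uncond)
    return out
```

With a single group this reduces to standard CFG, \(\epsilon_\varnothing + \omega(\epsilon_c - \epsilon_\varnothing)\); with several groups, the target group can be amplified while the others are held at a lower weight.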
- GenExam: A Multidisciplinary Text-to-Image Exam
-
GenExam treats "drawing exams" as the gold standard for evaluating the comprehensive reasoning-understanding-generation abilities of T2I models. It provides 1,000 questions across 10 disciplines, each with a ground-truth image and fine-grained scoring points. Even the strongest closed-source model, Nano Banana Pro, achieves a strict score of only 70.2%, while most open-source T2I/unified MLLMs score below 3%.
- Implicit Preference Alignment for Human Image Animation
-
The authors propose Implicit Preference Alignment (IPA): a post-training method that requires only "good samples" and does not need to construct good/bad pairs. By maximizing the KL gap with a pretrained reference model, IPA equivalently maximizes an implicit reward. Combined with a HALO module that weights hand masks in the loss, this enables large-scale video DiT models to significantly improve hand fidelity in human animation using only 93 selected samples.
- Krause Synchronization Transformers
-
The authors introduce the Krause bounded confidence consensus model into Transformers, replacing global softmax similarity with "distance-RBF + local window + top-k sparsity." They theoretically prove this encourages multi-cluster synchronization rather than global collapse, and demonstrate superior performance and over 30% compute savings on ViT, autoregressive image generation, and LLMs.
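The three ingredients named above (distance-based RBF weights, a local window, and top-k sparsity) can be sketched for a single head as follows. The bandwidth `sigma`, the window size, and the way the three steps are ordered are assumptions made for illustration; the paper's exact formulation is not given in this summary.

```python
import numpy as np

def krause_attention(q, k, v, sigma=1.0, window=4, top_k=2):
    """Single-head attention with RBF distance weights, a local window,
    and top-k sparsification. q, k, v: arrays of shape (L, d); top_k <= L."""
    L = q.shape[0]
    # Squared Euclidean distance between every query and every key.
    d2 = ((q[:, None, :] - k[None, :, :]) ** 2).sum(-1)
    scores = -d2 / (2.0 * sigma ** 2)            # log of an RBF kernel
    # Local window: positions further apart than `window` cannot interact.
    idx = np.arange(L)
    scores[np.abs(idx[:, None] - idx[None, :]) > window] = -np.inf
    # Top-k sparsity: keep only the k largest scores in each row.
    kth = np.sort(scores, axis=1)[:, -top_k][:, None]
    scores[scores < kth] = -np.inf
    # Normalize the surviving kernel values so each row sums to 1.
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v, w
```

Each token then averages only over a few nearby tokens that are close in feature space, which mirrors the bounded-confidence rule of the Krause model, where agents only interact with neighbors whose state is similar.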
- Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
-
This paper targets rectified flow (RF) text-to-image models and proposes PNAPO, an offline preference optimization framework that saves both the "prior noise used for generation" and "winner/loser images" as sextuplets. Leveraging the RF straight-line trajectory assumption for trajectory estimation and dynamic regularization scheduling, PNAPO outperforms Diffusion-DPO on SD3-M/FLUX while reducing training compute to 1/12.
- Riemannian Generative Decoder
-
This work addresses the challenge that Riemannian VAEs require hand-crafted, complex probability densities for each manifold. It proposes the Riemannian Generative Decoder (RGD), which entirely discards the encoder and treats each sample's latent as a free parameter, trained directly with a Riemannian optimizer (RiemannianAdam). It introduces "input noise inversely scaled by local metric" as a geometric regularizer. On three datasets (a synthetic branching diffusion tree, human mitochondrial DNA, and cell-cycle scRNA-seq), RGD recovers more faithful geometry and achieves superior numerical stability over VAE baselines in high dimensions.
- SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning
-
The authors identify an "attention collapse" issue in MLLM-based editing reward models: instead of comparing the original and edited images, the model collapses its attention onto sink tokens and makes blind judgments. They propose SpatialReward, in which an 8B model first predicts bounding boxes of the edited regions and then uses these box tokens as anchors for interleaved cross-image reasoning. With a 260K-sample spatially-aware dataset and two-stage GRPO training, the method achieves SOTA on three reward benchmarks and raises OmniGen2's GEdit-Bench score by 0.90 (twice the improvement obtained with GPT-4.1).
- Speculative Coupled Decoding for Training-Free Lossless Acceleration of Autoregressive Visual Generation
-
This work identifies that the root cause of limited acceleration in Speculative Jacobi Decoding (SJD) for autoregressive visual generation is the near-zero probability of collision between draft tokens across consecutive iterations due to independent sampling. By simply replacing independent sampling with Maximal/Gumbel Coupling (a one-line modification, no extra training), image generation can be accelerated up to \(4.2\times\) and video generation up to \(13.6\times\), while strictly preserving the output distribution identical to original AR decoding.
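The effect of coupling can be seen in a toy experiment with the Gumbel-max trick: reusing the same Gumbel noise for two nearby distributions makes their samples agree far more often than independent draws, while each sample still follows its own distribution exactly. The two distributions below are made-up values for illustration, and this sketch shows only the coupling idea, not the full SJD decoding loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Hypothetical token distributions from two consecutive iterations.
p_prev = np.array([0.50, 0.30, 0.20])
p_curr = np.array([0.45, 0.35, 0.20])

# Gumbel-max trick: argmax(log p + g) with g ~ Gumbel(0, 1) is a sample from p.
g_shared = rng.gumbel(size=(n, 3))
g_fresh = rng.gumbel(size=(n, 3))

a = np.argmax(np.log(p_prev) + g_shared, axis=1)   # draft from previous iteration
b = np.argmax(np.log(p_curr) + g_shared, axis=1)   # coupled: same noise reused
c = np.argmax(np.log(p_curr) + g_fresh, axis=1)    # independent: fresh noise

agree_coupled = np.mean(a == b)
agree_indep = np.mean(a == c)
freq_b = np.bincount(b, minlength=3) / n           # marginal of the coupled draw
```

For independent draws the agreement probability is \(\sum_i p_i q_i = 0.37\) for these values, whereas the coupled draws agree most of the time, and `freq_b` stays close to `p_curr`. This is why coupling can raise the acceptance rate of draft tokens without changing the output distribution.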
- Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges
-
SDB reframes modality translation as "selecting a coupling from all joint distributions \(\mathcal{P}\) satisfying marginal constraints," stacking marginal matching (WTA + capacity constraint) and both endpoint-level and trajectory-level cycle consistency on top of LDDBM. Paired supervision becomes merely an optional heuristic, enabling training under zero-paired, semi-paired, and fully-paired regimes. Even with full pairing, SDB outperforms paired-only baselines (e.g., FFHQ→CelebA-HQ PSNR improves from 25.6 to 25.9).
- The Coupling Within: Flow Matching via Distilled Normalizing Flows
-
This paper proposes NFM (Normalized Flow Matching), which uses the "accurate data→noise bijection" produced by a pretrained autoregressive normalizing flow (NF) such as TarFlow as the noise-data pairing for Flow Matching. This approach improves both FM's convergence speed and its low-step FID and, in turn, achieves inference speeds several orders of magnitude faster than the NF teacher.
- Threshold-Guided Optimization for Visual Generative Models
-
The authors remove the paired preference assumption of DPO, proving that the optimal policy under KL regularization essentially compares each sample's reward to an intractable instance-dependent baseline \(\tau^*(x)=\beta\log Z(x)\). They propose replacing it with a global threshold \(\tau\) estimated from a score percentile, and introduce a confidence weight proportional to \(|s-\tau|\). This enables stable alignment of diffusion models and MaskGIT using only scalar scores (without paired preferences), consistently outperforming Diffusion-DPO / KTO / DSPO across five reward models and three test sets.
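The thresholding step can be sketched as follows: estimate \(\tau\) from a percentile of the scalar scores, label each sample by which side of \(\tau\) it falls on, and weight it by \(|s-\tau|\). The function name, the default percentile, and the normalization of the weights are assumptions for illustration; how these quantities enter the training loss is not specified in this summary.

```python
import numpy as np

def threshold_targets(scores, percentile=50.0):
    """Turn scalar reward scores into signed labels and confidence weights.

    Returns (tau, labels, weights), where labels are +1 above the threshold,
    -1 below it, and 0 exactly on it, and weights are proportional to |s - tau|.
    """
    scores = np.asarray(scores, dtype=float)
    tau = np.percentile(scores, percentile)   # global threshold from the batch
    margin = scores - tau
    labels = np.sign(margin)
    weights = np.abs(margin)                  # far from tau means more confident
    total = weights.sum()
    if total > 0:
        weights = weights / total             # normalize (an illustrative choice)
    return tau, labels, weights
```

Samples whose score sits near the threshold receive little weight, so ambiguous cases contribute less to the update than clearly good or clearly bad ones.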
- Visual Implicit Autoregressive Modeling
-
This work embeds Deep Equilibrium (DEQ) implicit fixed-point layers into the next-scale autoregressive framework of VAR and uses Jacobian-Free Backpropagation to train with constant memory, compressing the 2B parameters of VAR-d30 down to 770M. During inference, the number of iterations per scale becomes a "tunable knob": on ImageNet-256, FID 2.16 / sFID 8.07 are maintained, while peak memory on a single 4090 drops from 19.24GB to 8.53GB and throughput increases from 15.16 to 32.08 img/s.
- Watch Your Step: Information Injection in Diffusion Models via Shadow Timestep Embedding
-
This paper reveals that the often-overlooked "timestep embedding" in diffusion models is in fact an unused information side channel. By extending the training timestep range to a "shadow interval" and binding a different data distribution to this interval, the same diffusion model can, without any change to the scheduler interface, generate normal images in the explicit interval and "hidden" images in the shadow interval. This enables both covert backdoor attacks and model watermark verification. The paper also provides a mutual coherence theoretical analysis based on sinusoidal positional encoding, explaining why two disjoint intervals can carry independent information.